At Eclipse Code Recommenders, most of our recommendation engines use Bayesian networks, which are a compact representation of probability distributions. They thus serve to express relationships between variables in a partially observable world. Our recommenders use these networks to predict what developers want to use next, based on what they have done previously.
When the Code Recommenders project first started, there was a need for a new open-source, pure-Java Bayesian network library. As part of my bachelor thesis, I created such a library, called Jayes. Jayes has since become the backend of most of Code Recommenders' recommendation engines. Its development continues, and a new version of Jayes will be included in the upcoming Code Recommenders 2.0.
This post describes how to use Jayes for your own inference tasks.
Guest Blogger: Michael
Michael Kutschke is currently completing his Master of Science at the Computer Science department of Technische Universität Darmstadt. His areas of interest include large scale data analysis, statistics and optimization. Michael contributes to the open source Code Recommenders project through his student work terms with Codetrails.
What Jayes is, and what it isn’t
Jayes is a library for Bayesian networks and the inference in such networks. At the moment, there is no learning component included. (We can, however, recommend the Apache Mahout library.)
Where can I get it?
There are two sources for getting your hands on Jayes' source code: the Eclipse Code Recommenders repository and my GitHub repository.
The version in the Code Recommenders repository that we're using for this post has not, at the time of writing, been merged into the Eclipse repository, but will be soon. It is also still under development and does not yet carry the version number 2.0 (as we are just about to start the 2.0 branch of Code Recommenders). This means that for the most current version of Jayes, my GitHub repository is the place to go. The GitHub repository also contains the classes used for evaluating and benchmarking Jayes, which will not move to Eclipse in the foreseeable future; these classes allow you to assess the runtime performance of Jayes.
How do I use it?
Before we start, you'll need to have a rough idea of what a Bayesian network is and what it looks like. Wikipedia has a pretty good introduction. So let's get started. For the use case of inference, there are only three classes you need to know:
- BayesNet
- BayesNode
- JunctionTreeAlgorithm
The first two are used for setting up the model itself, while the third is the algorithm used for inference.
The BayesNet is a container for BayesNodes, which represent the random variables of the probability distribution you are modeling. The preferred way in Jayes 2.0 for creating BayesNodes is through BayesNet.createNode(String name). This is most likely the only method of BayesNet you need to use.
A BayesNode has outcomes, parents, and a conditional probability table. These must be set in this order: first add the node's outcomes, then set its parents, and set the probabilities last, after the outcomes of the parent nodes have also been set.
The methods from BayesNode that perform these steps are:
- BayesNode.addOutcomes(String...),
- BayesNode.setParents(List<BayesNode>), and
- BayesNode.setProbabilities(double...).
The probabilities in the conditional probability table need to be specified in a particular order: the outcomes of the first parent vary slowest, and the node's own outcomes vary fastest. This is shown in the following code snippet:
BayesNet net = new BayesNet();

BayesNode a = net.createNode("a");
a.addOutcomes("true", "false");
a.setProbabilities(0.2, 0.8); // prior of a (illustrative values)

BayesNode b = net.createNode("b");
b.addOutcomes("one", "two", "three");
b.setParents(Arrays.asList(a));
b.setProbabilities(
    0.1, 0.4, 0.5, // a == true
    0.3, 0.4, 0.3  // a == false
);

BayesNode c = net.createNode("c");
c.addOutcomes("true", "false");
c.setParents(Arrays.asList(a, b));
c.setProbabilities(
    // a == true
    0.1, 0.9, // b == one
    0.0, 1.0, // b == two
    0.5, 0.5, // b == three
    // a == false
    0.2, 0.8, // b == one
    0.0, 1.0, // b == two
    0.7, 0.3  // b == three
);
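To make that ordering concrete: the flat probability array is laid out row-major, with the first parent's outcomes varying slowest and the node's own outcomes varying fastest. The following plain-Java sketch (no Jayes dependency; the class and method names are mine) computes the flat index into such a table:

```java
// Plain-Java illustration of the CPT layout used above (not Jayes code).
// cards holds the outcome counts per dimension, ordered [parent1, ..., parentN, node];
// idx holds the chosen outcome index per dimension in the same order.
public class CptIndex {

    public static int flatIndex(int[] cards, int[] idx) {
        int flat = 0;
        for (int i = 0; i < cards.length; i++) {
            flat = flat * cards[i] + idx[i]; // row-major: later dimensions vary fastest
        }
        return flat;
    }

    public static void main(String[] args) {
        // For node c above, the dimensions are [a(2), b(3), c(2)].
        // P(c == "true" | a == "false", b == "three") sits at (1*3 + 2)*2 + 0 == 10,
        // i.e., the 11th value passed to c.setProbabilities(...), which is 0.7.
        System.out.println(flatIndex(new int[] { 2, 3, 2 }, new int[] { 1, 2, 0 }));
    }
}
```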
We now have a network and want to perform inference. The class used for this task is JunctionTreeAlgorithm:
JunctionTreeAlgorithm inferer = new JunctionTreeAlgorithm();
inferer.setNetwork(net);

Map<BayesNode, String> evidence = new HashMap<BayesNode, String>();
evidence.put(a, "false");
evidence.put(b, "three");
inferer.setEvidence(evidence);
double[] beliefsC = inferer.getBeliefs(c);
This gives us the probability distribution P(c | a = "false", b = "three").
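For a network this small, we can double-check the result by brute force, without Jayes: keep the assignments consistent with the evidence and normalize over c. The prior of a below is an assumption made for illustration; it cancels out here because both of c's parents are observed, so the posterior is simply the matching CPT row (0.7, 0.3).

```java
// Brute-force sanity check (plain Java, not Jayes) for the tiny network above.
public class EnumerationCheck {

    static double[] priorA = { 0.2, 0.8 };            // assumed prior of a (illustrative)
    static double[][] cptB = {
        { 0.1, 0.4, 0.5 },                            // a == true
        { 0.3, 0.4, 0.3 }                             // a == false
    };
    static double[][][] cptC = {
        { { 0.1, 0.9 }, { 0.0, 1.0 }, { 0.5, 0.5 } }, // a == true
        { { 0.2, 0.8 }, { 0.0, 1.0 }, { 0.7, 0.3 } }  // a == false
    };

    /** Posterior over c given observed outcome indices of a and b. */
    public static double[] posteriorC(int aObs, int bObs) {
        double[] belief = new double[2];
        for (int c = 0; c < 2; c++) {
            // joint probability of the single assignment consistent with the evidence
            belief[c] = priorA[aObs] * cptB[aObs][bObs] * cptC[aObs][bObs][c];
        }
        double sum = belief[0] + belief[1];
        belief[0] /= sum;
        belief[1] /= sum;
        return belief;
    }

    public static void main(String[] args) {
        double[] beliefsC = posteriorC(1, 2); // a == "false", b == "three"
        System.out.println(beliefsC[0] + " " + beliefsC[1]); // approximately 0.7 and 0.3
    }
}
```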
Inference algorithms use an internal representation of the network that will not be updated when you update the BayesNet. Should your use case require changes to the BayesNet, you need to call IBayesInferer.setNetwork() again to update the internal representation.
We discourage using a BayesNode from a different network as a parent. The inference algorithms will access all BayesNodes through the BayesNet and this mixing of BayesNodes from different networks is very likely to lead to errors.
In Jayes 2.0, we added several advanced features that allow trading off between three major performance indicators of the inference engine: memory consumption, runtime performance, and numerical stability. For example, in terms of numerical stability, one limitation of Jayes is network size: as the network grows, any observed event becomes so unlikely that it is numerically indistinguishable from an impossible event. Everything then suddenly has zero probability – which is, of course, not true – and Jayes throws an exception to inform the user about this situation. Some of the advanced features described below influence when this problem appears, and therefore how large the networks can become.
Out-of-the-box, Jayes allows for fine-tuning in several areas:
- Floating point representation
- Logarithmic values
- Graph elimination algorithm
- Factor decomposition
Floating point representation
Jayes can compute with double precision as well as single precision. Using single precision consumes less memory, but networks with more than ~200 variables are likely to suffer from numerical instability. Double precision can, on the other hand, easily support several thousand variables.
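The instability is ordinary floating-point underflow: a joint probability is a product of many factors smaller than one, and such a product hits zero much sooner in single precision than in double precision. A small, self-contained illustration (plain Java, not Jayes code):

```java
// How quickly does a product of probabilities underflow to zero?
public class Underflow {

    public static int stepsUntilZero(boolean singlePrecision, double p) {
        int steps = 0;
        if (singlePrecision) {
            float prod = 1f;
            while (prod > 0f) { prod *= (float) p; steps++; }
        } else {
            double prod = 1.0;
            while (prod > 0.0) { prod *= p; steps++; }
        }
        return steps;
    }

    public static void main(String[] args) {
        // Multiplying by 0.5 repeatedly: float underflows after far fewer
        // steps than double, so larger networks need double precision.
        System.out.println("float:  " + stepsUntilZero(true, 0.5));
        System.out.println("double: " + stepsUntilZero(false, 0.5));
    }
}
```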
To set the floating point representation, use org.eclipse.recommenders.jayes.factor.FactorFactory.setFloatingPointType(Class). (Valid arguments are float.class and double.class). The default is double precision, although using single precision reduces the memory consumption by approximately 50% and has no measurable impact on runtime performance.
Important: You need to set floating point precision before the network is set in the inference algorithm.
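Putting this together, the configuration might look like the sketch below. Note one assumption: I take getFactory() to be the accessor through which JunctionTreeAlgorithm exposes its FactorFactory; check the Jayes 2.0 API for the exact accessor name.

```java
// Sketch: switch the engine to single precision (assumed accessor: getFactory()).
JunctionTreeAlgorithm inferer = new JunctionTreeAlgorithm();
inferer.getFactory().setFloatingPointType(float.class); // before setNetwork!
inferer.setNetwork(net);
```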
Logarithmic values
Jayes can also use logarithmic values internally. This drastically improves numerical stability, but approximately doubles the time needed for inference. Again, FactorFactory is the class that provides this option.
Graph elimination algorithm
Jayes 2.0 also adds the capability to choose the graph elimination algorithm used to generate the junction tree that JunctionTreeAlgorithm works on internally. This choice influences the time needed to set up the algorithm and, potentially, the performance of the resulting inference engine, both in terms of memory consumption and runtime. There are two heuristic algorithms available, which can be set on the JunctionTreeAlgorithm. Both reside in the org.eclipse.recommenders.jayes.util.triangulation package.
- MinFillIn: the best available quality, but not suited for big networks with several hundred variables, as loading will take too long. It is therefore best suited for small, complex networks. This is the default.
- MinDegree: suitable for any size of network, but with complex networks the quality may suffer a bit. This could lead to a higher memory footprint, increased inference times and eventually the danger of numerical instability.
JunctionTreeBuilder builder = JunctionTreeBuilder.forHeuristic(new MinFillIn());
inferer.setJunctionTreeBuilder(builder); // like the other settings, this must happen before setNetwork()
Factor decomposition
For probability distributions learned from real data, many parameters are zero because of a lack of data. However, in order to be able to predict in previously unseen cases, the distributions are smoothed, meaning some of the probability mass is redistributed to the cases we did not see in our data.
Jayes is able to take advantage of sparse distributions, but the smoothing intentionally leads to non-sparse distributions. Using linear algebra magic, Jayes provides algorithms that make the smoothing an explicit part of the model, in the form of new variables. This allows the distributions to stay sparse, which saves memory – for our models, roughly 20-30%. The extra variables make the model more complex, leading to increased inference times – for our models, about twice the time. So this, again, is a memory/time trade-off.
The decomposition algorithms can be found in the org.eclipse.recommenders.jayes.transformation bundle. The algorithm to use for smoothed distributions is the SmoothedFactorDecomposition. This class has one public method, decompose(BayesNet,BayesNode), which will decompose the given BayesNode and augment the network with the results.
Here are the most important things to think about when using this feature:
- Evaluate the use of this feature for every model you use. For some models there will be no memory benefit.
- It is not the best strategy to decompose all nodes – therefore you should choose the nodes to decompose. The more a distribution needs to be smoothed, the better the decomposition will perform.
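Based on the signature given above, applying the decomposition to one heavily smoothed node might look like the following sketch (net, node, and inferer refer to the objects from the earlier snippets; treating decompose as an instance method is my assumption):

```java
// Sketch: make the smoothing of a single node explicit so its
// distribution can be stored sparsely, then rebuild the engine's
// internal representation so it sees the augmented network.
SmoothedFactorDecomposition decomposition = new SmoothedFactorDecomposition();
decomposition.decompose(net, node);
inferer.setNetwork(net);
```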
Jayes and You
I hope this article has given you an overview of what Jayes does and how you can use it. If you have any questions regarding Jayes, please contact us on the Eclipse Code Recommenders mailing list or at firstname.lastname@example.org.
If you like what the Code Recommenders and Codetrails team is doing and want to keep up to date: Follow us on Google+.