SPR is currently maintained by Andy Buckley <andy.buckley@cern.ch>. Contact me if you want to contribute or are interested in co-maintaining the package.

Online resources

SPR information and files can be obtained from the following Web pages:

From the https://sourceforge.net/projects/statpatrec/ page, you can subscribe to the statpatrec-base mailing list, for package announcements and updates.

Getting the package:

The latest release of SPR can be obtained from the SourceForge SVN repository:

The head of the development trunk may not be stable. It is recommended to use tagged package releases for "real" classification tasks.

If you wish to contribute patches to SPR, please contact Andy Buckley (address above). For any substantial task, it is recommended that you either ask in advance for SVN commit privileges and describe your planned modifications, or use the Bazaar version control system (http://bazaar-vcs.org) with the Bazaar-Subversion plugin (http://bazaar-vcs.org/BzrForeignBranches/Subversion), which lets you check out SPR and make versioned changes and patches without needing commit privileges to the main repository.

List of all classifiers currently implemented in the package:

Contents

1. Documentation
2. Classifiers
2.1 Boosting
2.2 Bagging
2.3 Decision trees
2.4 Top-down trees
2.5 Bump hunting
2.6 Fisher and logistic regression
2.7 Neural networks
2.8 Combining classifiers
2.9 Recommendations: What classifiers to use and when
2.10 How to choose parameters for boosted and bagged decision trees
2.11 Cross-validation
2.12 Gene Expression Programming
2a. Multi class learning
2a.1 Algorithm by Allwein, Schapire and Singer
2a.2 Binary encoder
2a.3 Treatment of missing values
2a.4 Indicator matrix generation for Allwein-Schapire-Singer
3. Data and filters
4. Imposing simple cuts on input variables
5. Variable transformations and feature extraction
6. Treatment of input classes
7. Estimation of variable importance
7.1 Correlation with the class label
7.2 Counting decision splits
7.3 Variable importance for neural networks
7.4 Estimation of variable importance by random permutation of class labels
7.5 "Add N Remove R" method
7.6 Variable interactions
7.7 Which variable selection method is best?
8. Handling input variables
9. Data input and output
9.1 Supported I/O types: Ascii and Root
9.2 Splitting data into training and test sets
9.3 SprIOTestApp: displaying data content
10. Computing output for trained classifiers
11. Installation instructions
12. How to use trained classifiers in your C++ code
13. Working in Root-free environment
14. Working in Root environment
15. Missing values
16. Example: Cleveland heart data
17. Memory management
18. Summary
19. Release notes

1. Documentation

The mathematical formalism behind the implemented classifiers, as well as idealized and practical examples, can be found in two physics archive notes:

The first 4 references in physics/0507143 give a list of introductory books on pattern classification. Although familiarity with the material discussed in the books is not necessary to be able to use this package, I recommend them to every physicist.

One might also want to check out the SLUO lectures on Machine Learning:

The full set of links to papers and talks on StatPatternRecognition can be found on the web page, http://www.hep.caltech.edu/~narsky/spr.html .

Ilya Narsky's talk at the workshop at the BaBar Collaboration Meeting in December 2005 has useful tips for running the package:

(Warning: some of this info may be outdated as the package has gone through development cycles since then.)

2. Classifiers and classes

There are two types of classifiers in the package: trained and trainable.

All trainable classifiers inherit from SprAbsClassifier. The main method of this class is train(). The purpose of every trainable classifier is to optimize certain parameters for efficient signal-background separation. After training has been performed, the makeTrained() method of the trainable classifier produces a corresponding trained classifier. A trained classifier inherits from SprAbsTrainedClassifier. Its purpose is to produce a numeric output for a data point, through the response() method, which is used to classify the point.
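
To make the trainable/trained split concrete, here is a small self-contained mock-up. The classes below are not SPR classes; their names and bodies are invented purely to illustrate the train()/makeTrained()/response() pattern described above.

#include <iostream>
#include <vector>

// Toy mock-up of the trainable/trained split (illustration only, not SPR code).
struct ToyTrained {                               // plays the role of SprAbsTrainedClassifier
  double threshold;
  double response(const std::vector<double>& x) const {
    return ( x[0] > threshold ) ? 1.0 : 0.0;      // numeric output used to classify the point
  }
};

struct ToyTrainable {                             // plays the role of SprAbsClassifier
  double threshold;
  ToyTrainable() : threshold(0) {}
  bool train(const std::vector<std::vector<double> >& signal,
             const std::vector<std::vector<double> >& background) {
    // "Optimize certain parameters": put the cut halfway between the class means.
    double s = 0, b = 0;
    for( unsigned i=0; i<signal.size(); ++i )     s += signal[i][0];
    for( unsigned i=0; i<background.size(); ++i ) b += background[i][0];
    threshold = 0.5*( s/signal.size() + b/background.size() );
    return true;
  }
  ToyTrained makeTrained() const {
    ToyTrained t; t.threshold = threshold; return t;
  }
};

int main() {
  std::vector<std::vector<double> > signal(2, std::vector<double>(1));
  std::vector<std::vector<double> > background(2, std::vector<double>(1));
  signal[0][0] = 2.;  signal[1][0] = 3.;
  background[0][0] = -2.;  background[1][0] = -3.;

  ToyTrainable trainable;
  trainable.train(signal, background);                 // optimize the parameters
  ToyTrained trained = trainable.makeTrained();        // produce the trained classifier
  std::vector<double> point(1, 1.5);
  std::cout << trained.response(point) << std::endl;   // prints 1 (signal-like)
  return 0;
}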

At the moment the following trainable classifiers are implemented with corresponding trained counterparts:

At present most implemented classifiers deal only with two categories. The exception is the multi-class learner (see section 2a). The default class assignment is 0 for background and 1 for signal. This choice, 0 and 1, is assumed throughout the package. The user can choose any other pair of classes or group classes into any two sets using the syntax described in Section 6 and SprAbsFilter.hh.

The output of all implemented classifiers also by default ranges from 0 to 1. If you want to impose a naive cut that crudely chooses signal-like events over background-like events, you can simply cut at 0.5 on the classifier output. To impose a cut on the classifier output, use the SprAbsTrainedClassifier::setCut() method. The accept() method of every trained classifier returns true if the output of the classifier satisfies the imposed cut and false otherwise. For all trained classifiers the cut on the output is by default set at 0.5. There are 3 classifiers in the package that can optionally change their output range from [0,1] to (-infty,+infty); these are AdaBoost, Fisher and LogitR. The useStandard() method sets the range to (-infty,+infty) and useNormalized() sets it to [0,1]. These methods also change the cut on the output of the classifier: useStandard() sets it to 0 and useNormalized() sets it to 0.5.

The package also offers a set of transformations included in SprTransformation.hh for converting the classifier output to a desirable range. For example, to transform the conventional output of AdaBoost ranging from -infty to +infty to [0,1], logitDouble() transformation is applied.

SprTrainedBinarySplit produces a binary response, that is, 0 for background-like and 1 for signal-like events. The same holds for the bump hunter. Decision trees can return either discrete or continuous values depending on the chosen configuration. The response of SprTrainedAdaBoost is given by a sum of either binary (0 or 1) or continuous (from -infty to +infty) responses of individual classifiers multiplied by corresponding beta coefficients (see below section on AdaBoost). The response of SprTrainedBagger is given by a sum of either binary (0 or 1) or continuous (from 0 to 1) responses of individual classifiers divided by the total number of individual classifiers. For a large number of individual classifiers, the output of SprTrainedAdaBoost and SprTrainedBagger looks practically continuous, even if each classifier returns discrete output.

Classifiers implemented in the package are described below in more detail.

2.1 AdaBoost

AdaBoost combines individual classifiers by applying them consecutively. First AdaBoost reweights events to enhance weights of those that were misclassified by the last applied subclassifier and to reduce weights of events that were correctly classified by the last subclassifier. Then AdaBoost applies a new subclassifier to the reweighted events.

SprAdaBoost currently implements 3 different algorithms: Discrete AdaBoost, Real AdaBoost and epsilon-Boost. The specifics of event reweighting and of computing the relative importance (beta coefficients) of the individual weak classifiers vary among these algorithms.

The implementation of AdaBoost in StatPatternRecognition can combine any classifiers that inherit from SprAbs(Trained)Classifier. The user can supply any combination of trained and trainable classifiers. AdaBoost goes through the supplied trained classifiers first and calculates beta weights for those. These trained classifiers are not included in the number of training cycles; AdaBoost will always go through all trained classifiers supplied by the user, no matter how many training cycles have been requested (the special case is n=0 training cycles, which is discussed below). After AdaBoost has gone through all trained classifiers, it goes through the trainable classifiers. AdaBoost first trains each trainable classifier on the reweighted sample, makes a corresponding trained classifier and stores it in the final list of trained classifiers. AdaBoost goes through the trainable classifiers consecutively, in the order in which they were added by the user, and stops either when it cannot find a weak classifier that improves the overall classification power (this stopping rule varies for different boosting flavors) or when the number of iterations through the trainable classifiers reaches the number of training cycles requested by the user. For example, if the user supplies 2 trained and 3 trainable classifiers to AdaBoost and requests 10 training cycles, AdaBoost will first go through the two trained classifiers, then go through all 3 trainable classifiers 3 times consecutively (9 training cycles), then do one more training cycle for the 1st trainable classifier and exit. In this example, AdaBoost will produce a list of 12 trained classifiers with corresponding beta weights.

Currently 4 AdaBoost executables are implemented:

SprBoosterApp was introduced in tag V05-00-00. In principle, SprBoosterApp alone should satisfy the needs of a regular user. It is capable of boosting any sequence of classifiers implemented in the package. It takes an input file with weak classifier configurations specified by the user; use booster.config as an example. SprBoosterApp reproduces, to a large extent, the functionality of the 3 other executables. It offers even more flexibility, as the user can combine any weak classifiers in any sequence. For example, if you put a decision tree on the first line of the input config file and a neural net on the second line, AdaBoost will use the decision tree for odd boosting cycles and the neural net for even boosting cycles.

The other 3 executables will be kept in the package. They offer some features that SprBoosterApp does not, for example cross-validation or saving intermediate training weights. They are also kept for historical reasons: people who have been using the package for a while may be used to their syntax.

Use "-M" option of the executable to choose among available boosting algorithms (default=Discrete AdaBoost). At the moment, 3 flavors of boosting are implemented: Discrete, Real and Epsilon. For description, please refer to the statistics literature. Discrete AdaBoost and epsilon-Boost can be used with any weak classifier. Real AdaBoost requires that the output of the supplied weak learner be continuous and range from 0 to 1. Since the output of all implemented classifiers by default ranges from 0 to 1, any weak classifier can be used with Real AdaBoost; however, the user should be careful not to convert the [0,1] output range of the weak classifier to (-infty,+infty).

The setEpsilon() method of SprAdaBoost (the -E command-line option for the executable) sets the epsilon value. This value is of no relevance for Discrete AdaBoost but affects the Real and Epsilon algorithms. The meaning of epsilon is different for Real and Epsilon boosting. For Epsilon boosting, epsilon defines the rate at which weights are updated for misclassified events, specifically the factor exp(2*epsilon). For Real boosting, epsilon defines the margin of the weak classifier response. The response of each weak classifier, p, for Real boosting must range from 0 to 1 and represent the probability of an event at this point being signal. Weights of events are then multiplied by p/(1-p) for background and (1-p)/p for signal. For p=0 or p=1, the updated weight goes to infinity. epsilon defines the minimal and (1-epsilon) the maximal allowed value of p, thereby avoiding infinite weights. Increasing epsilon improves the stability of Real boosting but may require more training cycles. It is the other way around for Epsilon boosting: decreasing epsilon improves stability and results in more training cycles.
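
As a rough illustration of the Real-boost weight update just described (a sketch only, not the SPR implementation; the function name is made up):

#include <algorithm>

// Illustrative sketch of the Real AdaBoost weight update described above
// (hypothetical helper, not an SPR function).  'p' is the weak classifier
// response in [0,1], 'isSignal' is the true class of the event, and 'eps' is
// the value set by setEpsilon() / the -E option.
double realBoostWeightFactor(double p, bool isSignal, double eps)
{
  // Clip p to [eps, 1-eps] so the factor never becomes infinite.
  p = std::min(std::max(p, eps), 1.0 - eps);
  // Background weights are multiplied by p/(1-p), signal weights by (1-p)/p.
  return isSignal ? (1.0 - p)/p : p/(1.0 - p);
}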

When you train AdaBoost on multidimensional data (e.g., hundreds of dimensions) and large data sets, you will likely need to save an intermediate AdaBoost configuration to a file and then resume training later. The AdaBoost configuration file keeps only parameters of the trained classifiers but not the event weights adjusted due to AdaBoost application. Your resume job can therefore spend a significant amount of time reweighting events by applying the trained classifiers from the saved configuration. To reduce the amount of time spent on initial reweighting, one can store training data with updated weights into a new data file at the end of the training routine and then read the training data with already adjusted weights from this file before resuming. For example:

SprAdaBoostDecisionTreeApp -a 1 -n 50 -M 1 -l 2000 -f tree_n50.spr -u reweighted_gauss2_uniform_2d_train.pat gauss2_uniform_2d_train.pat

and then:

SprAdaBoostDecisionTreeApp -a 4 -M 1 -n 50 -l 2000 -r tree_n50.spr -f tree_n100.spr -e reweighted_gauss2_uniform_2d_train.pat

The "-e" option will force AdaBoost to skip the event reweighting; this option is necessary because events in reweighted_gauss2_uniform_2d_train.pat have been reweighted already. The final AdaBoost configuration stored in tree_n100.spr will have 50 classifiers copied from tree_n50.spr and then 50 new classifiers added. Note that now the "-a 4" option is necessary because the format of the input data has changed due to addition of weights.

AdaBoost output has a nice probabilistic interpretation:

P(Y=1|X)/P(Y=-1|X) = exp(2*F(X))

where P(Y=1|X) and P(Y=-1|X) are probabilities for an event with coordinates X to come from class 1 (signal) or class -1 (background), respectively, and F(X) is the conventional AdaBoost output ranging from -infty to +infty. By default trained AdaBoost returns the probability of an event at point X being signal: 1/(1+exp(-2*F(X))). If you want to use raw AdaBoost output unmapped to the [0,1] range, use "-s" option of the AdaBoost executables.
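
In code, the default mapping from the raw output F(X) to the signal probability is just the expression above (a one-line sketch with an invented function name, not the actual SPR source):

#include <cmath>

// Probability of an event at X being signal, given the raw AdaBoost output F(X).
double adaBoostSignalProbability(double F)
{
  return 1.0/(1.0 + std::exp(-2.0*F));
}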

2.2 Bagging

Bagging (Bootstrap AGGregatING) means that many classifiers are trained on bootstrap replicas of a training data set. To classify a new event, responses of individual classifiers are summed up and divided by the total number of individual classifiers.
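
As a minimal sketch of this combination rule (an invented helper, not the SPR implementation):

#include <numeric>
#include <vector>

// Average the responses of the individual classifiers (each 0/1 for discrete
// trees or in [0,1] for continuous ones) to form the bagged response.
double baggedResponse(const std::vector<double>& responses)
{
  if( responses.empty() ) return 0.0;
  double sum = std::accumulate(responses.begin(), responses.end(), 0.0);
  return sum/responses.size();
}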

The user can bag an arbitrary sequence of classifiers using SprBaggerApp. To specify classifier configurations, use the same syntax as booster.config mentioned in the previous section.

For historical reasons, I will keep another executable in the package for bagging decision trees and random forest: SprBaggerDecisionTreeApp.

One can bootstrap not only training points but also input dimensions (random forest technology). This method can be invoked using the "-s" option of SprBaggerDecisionTreeApp or specifying an appropriate parameter in the input configuration file for SprBaggerApp.

If individual classifiers optimize a certain criterion, e.g., signal significance, bagging these classifiers is equivalent to optimizing the same criterion on bootstrap replicas of the training sample. This is a desirable feature if the user is interested only in optimization of this specific criterion.

A typical application of bagged decision trees would look like this:

SprBaggerDecisionTreeApp -a 1 -c 2 -n 50 -l 100 -o training.root -f tree.spr gauss2_uniform_2d_train.pat

(builds 50 decision trees, with at least 100 entries per terminal node, optimizing the signal significance, saves the classifier configuration into tree.spr and saves classified training data into training.root)

SprBaggerDecisionTreeApp -a 1 -n 0 -o validation.root -r tree.spr gauss2_uniform_2d_valid.pat

(reads the stored bagger from tree.spr, computes classifier response for the validation data and saves ntuple to validation.root)

The two most important training parameters are "-s" and "-l". The optimal number of sampled input features and the minimal size of leaf nodes should be determined by examining the bagger performance on (cross-)validation data. Breiman (http://www.stat.berkeley.edu/users/breiman/RandomForests/) recommends choosing the number of sampled input features small compared to the dimensionality of the problem. I would recommend trying various values of "-s" increasing exponentially. For example, if you have 100-dimensional data, you might try 1, 2, 4, 8, 16, 32, 64, and 100. The minimal leaf size needs to be optimized for each value independently. Typically, the minimal leaf size is fairly small - you can start with "-l 5" events per leaf and assume that this could be fairly close to optimal.

Note that this example, as well as the example in the AdaBoost section, is mostly qualitative. The power of boosted and bagged decision trees is most obvious in high-dimensional problems; an example with 2D data is not intended to demonstrate the power of these classifiers.

Another classifier implemented in SprBaggerDecisionTreeApp is a boosting algorithm, arc-x4. Due to some technicalities, the implementation of this algorithm is closer to that of the bagging algorithm than that of the boosting algorithm. This is why SprArcE4 inherits from SprBagger and the trained counterpart for this algorithm is just SprTrainedBagger. To use this algorithm, choose "-B" option of SprBaggerDecisionTreeApp or SprBaggerApp. Recommendations for the choice of "-s" and "-l" parameters often hold for this algorithm as well.

Another boosting algorithm included in SprBaggerApp is the range booster. In practical applications, the user is often interested in the region of high signal purity (and therefore low signal efficiency). The range booster lets you discard events that lie outside of the interesting range as you accumulate more training cycles. For example, instead of bagging 200 decision trees, you might opt to run 4 baggers each using 50 decision trees and discard events that lie outside of the interesting range after training each bagger. To do this, you would specify

Bagger 50 0
TopdownTree 5 0 2 0

in the configuration file for SprBaggerApp (50 bagged decision trees optimizing the Gini index with 2 events per leaf) and use the -R option of SprBaggerApp to select range boosting. Then specify -n 4 on the command line to train 4 baggers. The supplied training set is randomly split in two halves for each bagger, with the first half used for training and the second half used for event reweighting. -S specifies the signal efficiency range that you are interested in; the weights of all events outside this range will be reduced after each bagger is finished. The -E option specifies the factor by which the weight of every event outside this signal efficiency range is reduced. -T specifies the weight threshold below which an event is permanently discarded. For example, to apply an aggressive event reduction algorithm, specify an -E value less than -T; this means that events outside of the interesting efficiency range will be discarded immediately.

Imagine, for example, that you only care about the classifier performance at the signal efficiency 0.2 or lower. In that case, you can specify -S 0.5 (it is recommended to keep some margin for safety), -E 0.01 and -T 0.1. This will let the classifier focus on the interesting range as it trains more baggers. This effectively implements the so-called "cascade" method of classification advertised in some recent HEP papers.

In general, I would not expect any improvement in performance out of this method; this is just a way to save CPU time if you need to train on huge datasets and don't want to waste CPU cycles on events well outside of the interesting range. Note, however, that one can play many games here. The syntax for the configuration file lets you specify as many classifiers as you like. You can use bagged decision trees for the first cycle and boosted decision trees (or a neural net, or anything else) at the second step of the "cascade" cycle etc. You can use as many cascade cycles as you like.

2.3 Decision tree

A decision tree recursively divides the input space into rectangular regions (nodes). For each node the tree attempts to find the optimal split by choosing among all possible splits in all input dimensions. The split yielding the largest figure of merit, according to the user-supplied optimization criterion, is used to divide this node in two. If the largest figure of merit is less than the original FOM computed for the unsplit node, this node becomes a leaf (terminal node), classified either as signal (1) or background (0), depending on the FOM value. The output of the tree training procedure is a list of signal leaf nodes. After the leaves have been found, the tree sorts the leaves by signal purity in descending order and, if requested by the user, merges the leaves to maximize the overall FOM. The leaves included in the overall maximal-FOM set are considered signal regions, while the remaining signal leaves are reclassified as background nodes. This selection procedure does not enforce continuity, and in the end the signal leaves may be far apart from each other.

Decision trees were introduced in the statistics literature two decades ago and have become a popular data analysis tool. The predictive power of an individual decision tree is typically worse than that of flexible classifiers such as neural nets or boosted and bagged decision trees. The main advantage of the decision tree is its interpretability: rectangular regions can be easily visualized in many dimensions.

The decision tree implemented here is different from popular commercial implementations such as CART and C4.5. It is more simple-minded because (at least for now) it can deal only with two categories, signal and background. A conventional decision tree implementation also involves a pruning algorithm for merging nodes that belong to the same tree branch. A pruning algorithm typically introduces a penalty on the tree complexity and attempts to reduce the number of nodes, accepting a slightly worse FOM in exchange for a smaller complexity penalty. No pruning is attempted in this implementation. On the other hand, the SPR implementation offers advanced features missing from conventional trees. A conventional tree treats both categories symmetrically and carries on training as long as the FOM with respect to any input category can be improved. The tree implemented here can work in two modes, symmetric and asymmetric. In the asymmetric mode, it only finds optimal signal nodes, and no optimization is attempted for background. This makes it possible to use inherently asymmetric FOMs such as the signal significance S/sqrt(S+B) or a 90% upper limit. In the symmetric mode, the tree works in the same way as a conventional tree would: it finds each decision split by minimizing the fraction of misclassified events, the Gini index or the cross-entropy.
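
For orientation, the optimization criteria mentioned above can be written for a node with signal weight S and background weight B as in the sketch below. These are the textbook definitions; the exact conventions and normalizations behind the "-c" criteria in SPR may differ, so consult the criterion headers in the package.

#include <cmath>

// Asymmetric criterion: signal significance of a node.
double signalSignificance(double S, double B)
{
  return S/std::sqrt(S + B);
}

// Symmetric criteria, written in terms of the signal purity p = S/(S+B).
double giniIndex(double S, double B)
{
  double p = S/(S + B);
  return 2.0*p*(1.0 - p);
}

double crossEntropy(double S, double B)
{
  double p = S/(S + B);
  if( p <= 0.0 || p >= 1.0 ) return 0.0;   // pure node: zero impurity
  return -p*std::log(p) - (1.0 - p)*std::log(1.0 - p);
}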

The stopping rule for the decision tree implemented here is set by the minimal number of events per tree node. This parameter has to be supplied by the user at tree construction. Depending on the chosen optimization criterion, the tree performance can crucially depend on this parameter. Try, for example

SprDecisionTreeApp -a 1 -c 3 -n 1 -m -v 1 gauss2_uniform_2d_train.pat

The tree will produce an insane number of nodes and in the end display a message:

"Included 5318 nodes in category 1 with overall FOM=1 with 10000
signal and 0 background weights."

This means that the tree separated 10,000 signal events from background events and grouped these signal events into 5318 nodes with 1.9 events per node on average. This may seem impressive, but if you run this tree on validation data, you will find that the validation FOM is quite poor because the tree is over-trained.

If you do now

SprDecisionTreeApp -a 1 -c 3 -n 5000 -m -v 1 gauss2_uniform_2d_train.pat

you will get

"Included 1 nodes in category 1 with overall FOM=0.915234 with 4632 signal and 429 background weights."

Since the FOM is now significantly less than 1, it likely means that the tree is under-trained. For this exercise, the optimal number of events per node happens to be around 200, which you can easily establish by looking at the FOM computed on validation data:

SprDecisionTreeApp -a 1 -c 3 -n 200 -m -t gauss2_uniform_2d_valid.pat -v 1 gauss2_uniform_2d_train.pat

Note that without merging ("-m" option on the command line) these results would be different.

For some optimization criteria, specifying 1 as the minimal number of events per node can force the tree to create a separate node for each event and lead to memory exhaustion.

2.4 Top-down tree

The topdown tree (SprTopdownTree) implements the same training algorithm as SprDecisionTree described in 2.3 but stores the classifier configuration in a different way. While SprDecisionTree stores the trained classifier as a list of rectangular nodes, the topdown tree stores the full path from the tree root. This makes the topdown tree less interpretable but faster for classification of new events. Classifying a new event with SprDecisionTree takes O(D*M) operations, where D is the dimensionality of the problem and M is the number of terminal nodes. The height of a balanced binary tree with M terminal nodes is log(M)+1. Hence, the topdown tree reduces the classification time by a factor of O(D*M/log(M)). In reality, this factor can be a lot less because a) the topdown tree is not necessarily balanced, and b) each terminal node may be defined by only a small subset of input variables (which would reduce the classification time for SprDecisionTree).

The topdown tree also offers optional continuous output instead of the default discrete output (0/1) for the regular decision tree. The continuous output is computed as w1/(w0+w1), where w1 and w0 are signal and background weights in a terminal node.
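
In code, this continuous output is simply the signal weight fraction of the terminal node (an invented helper, not an SPR function):

// Continuous topdown-tree output of a terminal node with signal weight w1 and
// background weight w0.
double continuousLeafOutput(double w1, double w0)
{
  return w1/(w0 + w1);
}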

Topdown trees should be the preferred choice if the user is interested mostly in the classifier performance and not so much in the interpretability of results. Topdown trees are currently the default choice for the two main executables, SprAdaBoostDecisionTreeApp and SprBaggerDecisionTreeApp. If you wish to use the slower regular decision trees, use the "-j" option of the two executables. If you want discrete tree output for the bagging algorithm, use the "-k" option. The Discrete AdaBoost and epsilon-Boost algorithms generally require that each individual boosted decision tree produce discrete output; minimization of the exponential loss has been derived under the assumption of discreteness of the base classifier. For Real AdaBoost, each tree must produce continuous output. The tree options are switched automatically by the executable.

The single decision tree executable, SprDecisionTreeApp, continues using regular decision trees by default. In this case, the primary concern is the ease of interpretation, not speed.

Saved configuration files with topdown trees for the Bagger do not retain information about whether decision trees were chosen to be discrete or continuous. The default choice for SprTrainedBagger is "continuous". If you wish to change this, you will need to do this manually after you read the saved configuration. For example:

// Read the saved bagger configuration from file.
SprTrainedBagger* bagger
  = SprClassifierReader::readTrained<SprTrainedBagger>("saved_bagger.txt", "Bagger");
assert( bagger != 0 );
// Switch from the default continuous output to discrete (0/1) tree output.
bagger->setDiscrete();

2.5 Bump hunting

The idea behind bump hunting is very similar to that of a decision tree - we search for rectangular boxes by optimizing a chosen figure of merit. To find such a box, we look at all possible binary splits in all dimensions. However, instead of recursively splitting input space into multiple boxes of shrinking size (nodes), we attempt to find one box with an overall optimal FOM. This is done in two steps:

a) Shrinking. At this stage, we attempt to shrink the signal box in each dimension to optimize the FOM. This shrinking crucially depends on the "peel" parameter, the fraction of events that can be peeled off the signal box in one iteration. The user should be very careful about choosing this "peel" parameter. A low "peel" parameter makes the algorithm slower and more conservative, while a high "peel" parameter makes it faster but less robust. Note that the method may fail to find statistically significant bumps both due to excessive conservatism (low peel) and the lack of it (high peel).

b) Expansion. After the signal box has shrunk to its minimal size, we attempt to relax the bounds to increase the FOM.

After the shrinking and expansion stages have been completed, we remove the events in the found box from the data set and repeat the algorithm on the remaining subset to find the next box.

The outcome of this algorithm is a set of rectangular regions (bumps) with the associated FOMs. The user can request any number of bumps, and the algorithm finds them consecutively. In the case of multiple bumps, the output of the bump hunter can be a bit confusing because the bumps can overlap and even be enclosed one inside another. Keep in mind that each found bump is removed from the data before training is resumed to find the next bump; that is, in reality the bumps don't overlap because each bump is merely a piece of empty space from the point of view of bumps with higher indices.

This method is intended for exploratory analysis, e.g., a blind search for new physics phenomena and any analysis where the user wants to find a rectangular signal box.

The ready-to-go executable is SprBumpHunterApp.

Both SprBumpHunterApp and SprDecisionTreeApp store individual box numbers into ntuples.

2.6 Fisher and logistic regression

The executable is SprFisherLogitApp. It can store both training and validation data into ntuples simultaneously. For example, you can do

SprFisherLogitApp -A -a 2 -m 3 -l -o training.out -t lambda-test.pat -p validation.out lambda-train.pat

(trains both the linear and quadratic Fisher discriminants (-m 3), as well as logistic regression (-l), on lambda-train.pat and stores the training tuple into training.out; validates Fisher on lambda-test.pat and stores the ntuple into validation.out)

Command-line options -e, -u and -i are valid only for logistic regression. By default the starting point for logistic regression is set to coefficients obtained by LDA (aka Fisher). If LDA fails, the initial estimates for the regression coefficients are automatically reset to zero. You can avoid using LDA for the initial estimate by selecting "-i" option of the executable. If logistic regression fails to converge, try reducing the update factor by choosing "-u" less than 1 (for example, 0.5 is a good starting choice). This reduces the rate of convergence and makes the algorithm more stable.

For consistency with AdaBoost and in the general spirit of preferring classifier output normalized to the [0,1] interval, L(Q)DA and LogitR by default return output normalized to [0,1]: first the conventional output ranging from -infty to +infty is computed and then it is mapped onto [0,1] by applying 1/(1+exp(-X)) transformation. To see the conventional output, use "-s" option of the executable or useStandard() method.

2.7 Neural networks

The package implements a trainable backprop neural net and interfaces to two SNNS classifiers, the standard backprop net and the radial basis function net. The user does not have access to the training mechanism supplied by SNNS. But if you train these classifiers using the SNNS package, you can save the trained net into a *.net file. Then you can run either SprStdBackpropApp or SprRBFNetApp, depending on the type of the net you chose, to compute NN output for user-supplied data and store the data and the NN output into an ntuple.

SprTrainedRBF implements the radial basis function net. It can be used only for reading a saved SNNS configuration and applying it to data. Spr(Trained)StdBackprop provides SPR's own backprop neural net implementation and also has the ability to read saved SNNS configurations.

SprStdBackpropApp implements both single backprop neural net and boosted backprop neural nets. You can access the single neural net by using "-n 1" option of the executable (one AdaBoost training cycle). If you use n>1, you will train n boosted neural nets. If you use n=0, the executable expects that you provide an input file with a stored AdaBoost configuration.

The most important parameters of a neural net are its structure (-N), learning rate (-L), and learning rate for initialization (-I). See SprStdBackprop.hh for details on structure specification.

At initialization the neural net is trained on randomly selected events using randomly generated learning rates. This is done to break the initial neural net symmetry.

Training events are by default randomly permuted before training.

Here are a few examples of how the executable can be used:

SprStdBackpropApp -a 1 -n 1 -N '2:4:2:1' -l 100 -L 0.1 -d 5 -t gauss2_uniform_2d_valid.pat -f net.spr gauss2_uniform_2d_train.pat

Trains a single NN with 100 cycles from scratch. The NN configuration is: 2 input nodes (must be always equal to the dimensionality of the data!!!), 4 hidden nodes in the 1st hidden layer, 2 hidden nodes in the 2nd hidden layer, and 1 output node (must be always 1 for now!!!). The learning rate is 0.1. Display quadratic loss on validation data every 5 cycles. Save NN configuration into net.spr.

Suppose you trained SNNS on these data earlier and now want to convert SNNS configuration into SPR configuration. Do:

SprStdBackpropApp -a 1 -n 1 -S net.snns -l 0 -f net.spr gauss2_uniform_2d_train.pat

This simply converts SNNS format net.snns into SPR format net.spr.

You can resume training from SNNS format directly:

SprStdBackpropApp -a 1 -n 1 -S net.snns -l 100 -L 0.01 -f net200.spr -d 5 -t gauss2_uniform_2d_valid.pat gauss2_uniform_2d_train.pat

or from the converted SPR net:

SprStdBackpropApp -a 1 -n 1 -R net.spr -l 100 -L 0.01 -f net200.spr -d 5 -t gauss2_uniform_2d_valid.pat gauss2_uniform_2d_train.pat

You can use the obtained NN configuration as a starting point for boosting:

SprStdBackpropApp -a 1 -n 10 -M 2 -E 0.1 -g 2 -R net200.spr -l 50 -L 0.01 -f boosted_net.spr -d 1 -t gauss2_uniform_2d_valid.pat gauss2_uniform_2d_train.pat

This takes the saved network net200.spr and uses it as a first classifier in the boosting sequence. 10 more neural nets are trained using the same network architecture and the 11 neural nets are saved to boosted_net.spr. We now monitor exponential loss (-g 2) because it is more appropriate for boosting.

Read the saved AdaBoost from a configuration file and run it on validation data (results are saved into save.out):

SprStdBackpropApp -a 1 -n 0 -r boosted_net.spr -o save.out gauss2_uniform_2d_valid.pat

Do the same thing for a single NN you obtained earlier:

SprStdBackpropApp -a 1 -n 1 -l 0 -R net200.spr -o save.out gauss2_uniform_2d_valid.pat

If you would like to resume training from a saved AdaBoost configuration, the executable forces you to choose a network structure:

SprStdBackpropApp -a 1 -n 10 -M 2 -g 2 -E 0.1 -r boosted_net.spr -f boosted_net_2.spr -L 0.01 -l 50 -N '2:6:3:1' -d 1 -t gauss2_uniform_2d_valid.pat gauss2_uniform_2d_train.pat

Note that the new structure 2:6:3:1 is different from 2:4:2:1 used earlier. 10 more neural nets will be trained using 2:6:3:1 and combined with the 11 nets trained earlier using 2:4:2:1. Also note that -M option is redundant - the AdaBoost mode is read from boosted_net.spr and set to Real, no matter what you specify on the command line.

2.8 Combining classifiers

After several classifiers have been trained independently, it is possible to combine them to obtain a better overall performance. Spr(Trained)Combiner provides an interface for this combining. The idea is to train a new powerful classifier on the outputs of the combined classifiers. In the statistics literature, this is sometimes referred to as "mixture of experts".

Starting with tag V06-00-00, the reworked combiner can combine an arbitrary number of classifiers trained on arbitrary subsets of input variables. For example, your input data is 6D with input variables named x0, x1, …, x5. You know that variables x0 and x2 need to be grouped together (for example, they come from one detector sub-system), variables x1, x3, and x5 form another group (another detector sub-system), and variable x4 should be ignored. The input data file train.pat has all variables filled out for each event with missing values modeled as values outside the allowed range. For instance, the meaningful range for variable x3 extends from -1 to 1 and from 5 to 10; for events where you cannot measure variable x3 (missing value), you model it as a number somewhere outside the range, e.g., -13.

The recommended sequence of actions looks like this:

1) Train your favorite classifiers on the subsets:

SprBaggerDecisionTreeApp -a 2 -n 10 -l 10 -V 'x0,x2' -f save_1.spr lambda-train.pat

For the 2nd classifier, you have to modify the executable to exclude unreasonable values of x3. At the top of the executable source file, add your own functions (see the example in src/exampleUserCuts.cc):

// Accept all classes for input data
vector<int> myClasses() {
  return vector<int>();
}
// Define a list of variables used for user selection requirements.
vector<string> myVars() {
  vector<string> vars(1);
  vars[0] = "x3";
  return vars;
}
// Define cuts for specified variables using the specified order of variables.
// Return true if the point passes the cuts, false otherwise.
bool myCuts(const vector<double>& v) {
  if( (v[0]>-1. && v[0]<1.) || (v[0]>5. && v[0]<10.) ) return true;
  return false;
}

Then, following the example in src/exampleUserCuts.cc, put the following in main():

// Prepare pre-filter
SprPreFilter pre;
if( !pre.setSelection(myVars,myCuts,myClasses) ) {
  cerr << "Unable to set pre-filter." << endl;
  return 1;
}

and modify the reader constructor to apply the constraint:

auto_ptr<SprAbsReader>
  reader(SprRWFactory::makeReader(inputType,readMode,&pre));

Rebuild the executable by running make install and run it:

SprBaggerDecisionTreeApp -a 2 -n 10 -l 10 -V 'x1,x3,x5' -f save_2.spr lambda-train.pat

Check the messages returned by the executable to make sure that you see as many input events after the pre-selection cuts are applied as you expect.

If you want to avoid the hassle of editing C++ code, you can simply prepare two separate input files, one with variables x0 and x2 for the 1st sub-classifier and another one with variables x1, x3 and x5 and with the proper cuts on x3 imposed for the 2nd classifier. The two training sets do not need to have anything in common - they can use different input variables in a different order. The two essential requirements are: a) an input variable must have the same name in all input files for proper variable matching, and b) all variables used to train sub-classifiers must be present in the training and test datasets for the combiner.

2) Define input configuration files for the combiner. You need to define one config file for each trained classifier plus a config file for the global classifier to be trained on outputs of these trained classifiers:

unix> cat classifier_1.config
Path:        save_1.spr
Name:        C1
Default:     0.5
Constraints: 0

Path needs to specify the path to the saved classifier config file obtained in step 1. Name is an arbitrary name for the classifier. Default is the default response value returned by the classifier if input variables do not satisfy input constraints. The output of the bagger ranges from 0 to 1. Therefore, if the failure to satisfy the input constraints does not qualify this event as background or signal, choose 0.5 as the default value. In this case, there are no constraints, as you can see on the constraints line, and so the chosen default value is irrelevant.

For the 2nd classifier, you need to specify the constraint on variable x3:

unix> cat classifier_2.config
Path:        save_2.spr
Name:        C2
Default:     0
Constraints: 1
x3  2  -1. 1.   5. 10.

First, you specify the number of constraints (1 in this case because you impose constraints only on variable x3). On the following lines you specify the constraints themselves: put the name of the input variable first, then the number of allowed intervals, and then all allowed intervals on the same line. If you need to impose constraints on another input variable, change 1 to 2 and insert another line at the bottom.

In this case, if the value of x3 is outside of the allowed range, this is considered as a symptom of the event being background. The default response value is set to 0.
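
For illustration, a hypothetical config file constraining two variables could look like this (the file names and the interval for x1 below are made up):

unix> cat classifier_3.config
Path:        save_3.spr
Name:        C3
Default:     0
Constraints: 2
x3  2  -1. 1.   5. 10.
x1  1   0. 100.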

Make a global config file for the combiner:

unix> cat global.config
StdBackprop 2:4:2:1 50 0.1 0 0.1

This uses the same syntax as SprBoosterApp or SprBaggerApp; see booster.config for details. In this case, you would like to train a neural net with 2 input units because you have trained 2 classifiers on data subsets.

3) Train the combiner:

SprCombinerApp -a 2 -f global.spr 'classifier_1.config,classifier_2.config' global.config lambda-train.pat

4) Study the performance of the combiner on test data:

SprOutputWriterApp -a 2 global.spr lambda-test.pat save.out

Note that you do not need to specify explicitly input variables in steps 3 and 4. SprCombinerApp and SprOutputWriterApp know that variable x4 has been excluded because neither of the trained sub-classifiers uses it.

Also note that in the example above the same training data, lambda-train.pat, has been used both for training the classifiers on subsets and for training the combiner. If the classifiers on subsets severely overtrain (which in fact may be the case for the bagger), training the combiner on the overtrained responses may be far from optimal. In general, for this exercise you would need to prepare 2 training sets, one to train on subsets and another one to train the combiner.

2.9 Recommendations: What classifiers to use and when

These should be viewed as highly approximate. It is certainly not my intention to impose my views of what is a "better" or "worse" classifier on other physicists. Moreover, these recommendations are correct on average but not necessarily for the specific problem at hand.

Every classifier has a set of features: predictive power, speed and robustness of the training procedure, interpretability of results, performance in high-dimensional problems etc. There is no such thing as the "best" classifier; every tool is suited for its purpose.

If you are mostly concerned with speed, I suggest boosted binary splits. This classifier trains very quickly and produces a decent, although hardly optimal, separation of signal from background.

If, for whatever reason, you want to find a rectangular signal region, you should use the bump hunter. This method should also be good in situations where you perform a blind search for an excess of data above the simulated process.

If you are mostly concerned with the quality of signal/background separation, you should use a neural net, boosted decision trees or random forest (Bagger).

If your input data live in a low-dimensional space (dimensionality is less than 10), if all input variables are continuous and there are no strong correlations among them, and if the structure of the input data is more or less simple (i.e., there are no multiple peaks), a standard feedforward backpropagation neural net should do a great job for you. This is not to imply that the neural net should perform better than the boosted decision trees or random forest. I would make no predictions for what classifier of the mentioned three would perform better in a specific low-dimensional problem.

The neural net suffers from a variety of problems in high-dimensional settings. When you add more dimensions, you generally need to expand the hidden layer as well to capture the increasing complexity of the data. If the size of the hidden layer is proportional to that of the input layer, the training and response time of a neural net increase quadratically. With more than one hidden layer, this dependence is even worse. Also, the neural net can get confused if it is presented with strongly correlated variables or variables of the mixed type (continuous and discrete).

This is why in high-dimensional settings (with dimensionality 20 or greater), boosted decision trees and random forest are faster and more reliable classification tools. The training and response time of these classifiers generally scales linearly with dimensionality.

If you are mostly interested in obtaining a good average separation between signal and background, e.g., you would like to get two distinct one-dimensional likelihood shapes and use them in a fit to data, or if you would like to access the whole "signal purity vs signal magnitude" curve, I suggest that you use either boosted decision trees or random forest with symmetric optimization criteria 5 or 6, that is, Gini index and cross-entropy.

If you are only interested in one specific criterion such as the maximal signal significance or the minimal 90% upper limit, I suggest that you try SprBaggerDecisionTreeApp and optimize this criterion directly by specifying an appropriate "-c" option.

Do not overdo it!!! For many problems, you can get fairly good results with quick and robust classifiers such as boosted binary splits or a single decision tree. For example, running boosted decision trees or random forest to separate two bivariate Gaussians from uniform background (gauss2_uniform_2d_train.pat) does not make much sense (even though this dataset is used for illustration here).

SprInteractiveAnalysisApp has been provided to help people quickly compare performance of various classifiers on relatively small datasets in not too many dimensions. Syntax should be self-explanatory.

2.10 How to choose parameters for boosted and bagged decision trees

The most important input parameter for boosted decision trees is the minimal number of events per terminal node of a decision tree. SprAdaBoostDecisionTreeApp gets this parameter from the "-l" option on the command line; SprBoosterApp reads this parameter from the configuration file. The easiest way to choose an optimal value for this parameter is to monitor the classifier performance on validation data.

AdaBoost is designed to minimize the exponential loss. To choose the optimal leaf size, monitor the exponential loss for the validation data:

unix> SprAdaBoostDecisionTreeApp -a 1 -n 20 -l 100 -t gauss2_uniform_2d_valid.pat -d 1 gauss2_uniform_2d_train.pat
Validation Loss=0.52714 at cycle 1
Validation Loss=0.46702 at cycle 2
Validation Loss=0.44809 at cycle 3
Validation Loss=0.449999 at cycle 4
Validation Loss=0.455101 at cycle 5
Validation Loss=0.462101 at cycle 6
Validation Loss=0.474901 at cycle 7
Validation Loss=0.488699 at cycle 8
Validation Loss=0.502885 at cycle 9
Validation Loss=0.50625 at cycle 10
Validation Loss=0.509496 at cycle 11
Validation Loss=0.535736 at cycle 12
Validation Loss=0.549812 at cycle 13
Validation Loss=0.565217 at cycle 14
Validation Loss=0.595657 at cycle 15
Validation Loss=0.611038 at cycle 16
Validation Loss=0.636845 at cycle 17
Validation Loss=0.659387 at cycle 18
Validation Loss=0.683962 at cycle 19
Validation Loss=0.700659 at cycle 20

The exponential loss increases. Probably L=100 is not a good choice. Let us try a larger L:

unix> SprAdaBoostDecisionTreeApp -a 1 -n 20 -l 8000 -t gauss2_uniform_2d_valid.pat -d 1 gauss2_uniform_2d_train.pat
Validation Loss=0.815399 at cycle 1
Validation Loss=0.803473 at cycle 2
Validation Loss=0.771938 at cycle 3
Validation Loss=0.756562 at cycle 4
Validation Loss=0.747749 at cycle 5
Validation Loss=0.741897 at cycle 6
Validation Loss=0.737569 at cycle 7
Validation Loss=0.73411 at cycle 8
Validation Loss=0.730116 at cycle 9
Validation Loss=0.727139 at cycle 10
Validation Loss=0.724745 at cycle 11
Validation Loss=0.723074 at cycle 12
Validation Loss=0.721264 at cycle 13
Validation Loss=0.720109 at cycle 14
Validation Loss=0.718357 at cycle 15
Validation Loss=0.716691 at cycle 16
Validation Loss=0.715792 at cycle 17
Validation Loss=0.714532 at cycle 18
Validation Loss=0.713479 at cycle 19
Validation Loss=0.712872 at cycle 20

Exponential loss decreases but we can do better. One more try:

unix> SprAdaBoostDecisionTreeApp -a 1 -n 20 -l 1500 -t gauss2_uniform_2d_valid.pat -d 1 gauss2_uniform_2d_train.pat
Validation Loss=0.650332 at cycle 1
Validation Loss=0.516325 at cycle 2
Validation Loss=0.482314 at cycle 3
Validation Loss=0.460987 at cycle 4
Validation Loss=0.454166 at cycle 5
Validation Loss=0.449057 at cycle 6
Validation Loss=0.435003 at cycle 7
Validation Loss=0.426111 at cycle 8
Validation Loss=0.422804 at cycle 9
Validation Loss=0.416456 at cycle 10
Validation Loss=0.40981 at cycle 11
Validation Loss=0.407647 at cycle 12
Validation Loss=0.406386 at cycle 13
Validation Loss=0.406077 at cycle 14
Validation Loss=0.405301 at cycle 15
Validation Loss=0.40665 at cycle 16
Validation Loss=0.40524 at cycle 17
Validation Loss=0.403948 at cycle 18
Validation Loss=0.403166 at cycle 19
Validation Loss=0.399208 at cycle 20

The loss values are better. By trying various values of "-l" parameter, one can find that L=1500 is about optimal.

For this problem, two bivariate Gaussians on top of uniform background, separating signal from background is easy. You certainly don't need many boosted trees for that - in fact, you can separate signal from background fairly well with a single decision tree! This example is merely for illustration.

Let us now find the optimal leaf size for the bagger. Suppose we are interested in the quadratic loss (-g 1 option of the executable). Naively, one could assume that the optimal value should be the same as for the booster:

unix> SprBaggerDecisionTreeApp -a 1 -g 1 -n 20 -l 1500 -t gauss2_uniform_2d_valid.pat -d 1 gauss2_uniform_2d_train.pat
Validation FOM=0.101408 at cycle 1
Validation FOM=0.0958078 at cycle 2
Validation FOM=0.0941453 at cycle 3
Validation FOM=0.0911666 at cycle 4
Validation FOM=0.0873188 at cycle 5
Validation FOM=0.0847348 at cycle 6
Validation FOM=0.0843446 at cycle 7
Validation FOM=0.0890755 at cycle 8
Validation FOM=0.0895145 at cycle 9
Validation FOM=0.0910927 at cycle 10
Validation FOM=0.089789 at cycle 11
Validation FOM=0.0879379 at cycle 12
Validation FOM=0.0879443 at cycle 13
Validation FOM=0.0864982 at cycle 14
Validation FOM=0.085486 at cycle 15
Validation FOM=0.0849834 at cycle 16
Validation FOM=0.0847154 at cycle 17
Validation FOM=0.0843778 at cycle 18
Validation FOM=0.0840525 at cycle 19
Validation FOM=0.0838303 at cycle 20

Let us try another choice for the leaf size: 50.

unix> SprBaggerDecisionTreeApp -a 1 -g 1 -n 20 -l 50 -t gauss2_uniform_2d_valid.pat -d 1 gauss2_uniform_2d_train.pat
Validation FOM=0.084895 at cycle 1
Validation FOM=0.0770321 at cycle 2
Validation FOM=0.0680597 at cycle 3
Validation FOM=0.0643376 at cycle 4
Validation FOM=0.0611588 at cycle 5
Validation FOM=0.0618565 at cycle 6
Validation FOM=0.0613833 at cycle 7
Validation FOM=0.0652366 at cycle 8
Validation FOM=0.0638901 at cycle 9
Validation FOM=0.0647727 at cycle 10
Validation FOM=0.0630297 at cycle 11
Validation FOM=0.0622102 at cycle 12
Validation FOM=0.0626738 at cycle 13
Validation FOM=0.0616809 at cycle 14
Validation FOM=0.0608163 at cycle 15
Validation FOM=0.0616452 at cycle 16
Validation FOM=0.0625021 at cycle 17
Validation FOM=0.0616811 at cycle 18
Validation FOM=0.0610225 at cycle 19
Validation FOM=0.0603946 at cycle 20

The quadratic loss is less than in the previous trial. Most certainly, 50 is a better option than 1500. Let us go to even smaller leaves:

unix> SprBaggerDecisionTreeApp -a 1 -g 1 -n 20 -l 5 -t gauss2_uniform_2d_valid.pat -d 1 gauss2_uniform_2d_train.pat
Validation FOM=0.0875863 at cycle 1
Validation FOM=0.080353 at cycle 2
Validation FOM=0.0714745 at cycle 3
Validation FOM=0.0670627 at cycle 4
Validation FOM=0.0638389 at cycle 5
Validation FOM=0.0640173 at cycle 6
Validation FOM=0.0635259 at cycle 7
Validation FOM=0.0666818 at cycle 8
Validation FOM=0.0652588 at cycle 9
Validation FOM=0.0661904 at cycle 10
Validation FOM=0.0645742 at cycle 11
Validation FOM=0.0637672 at cycle 12
Validation FOM=0.0640231 at cycle 13
Validation FOM=0.0630114 at cycle 14
Validation FOM=0.0623262 at cycle 15
Validation FOM=0.0630325 at cycle 16
Validation FOM=0.0637906 at cycle 17
Validation FOM=0.0629989 at cycle 18
Validation FOM=0.0623234 at cycle 19
Validation FOM=0.06177 at cycle 20

These FOM values are a little worse than in the previous example. So probably 50 is about optimal.

Note the dramatic difference between boosted and bagged decision trees. While boosted trees typically want to have large terminal nodes, bagged trees usually benefit from making the terminal node size fairly small.

2.11 Cross-validation

Cross-validation has been implemented for 4 executables: SprCrossValidatorApp, SprAdaBoostDecisionTreeApp, SprBaggerDecisionTreeApp, and SprDecisionTreeApp. The purpose of cross-validation is to choose an optimal set of the classifier parameters if the input dataset is small. In this case, the supplied data are split into several equal pieces (as specified by "-x" option). These pieces are then excluded one by one from the training sample. The classifier is trained on the training data with one piece excluded and then its performance is evaluated on the excluded piece. At the end, the FOM obtained by averaging the FOMs from all pieces is returned to the user. A typical application of the cross-validation algorithm would look like this:

SprBaggerDecisionTreeApp -a 1 -g 1 -n 20 -x 5 -q '5,10,20,50,100,1000,2000,5000' gauss2_uniform_2d_train.pat

This trains 20 bagged decision trees for each specified minimal node size. That is, the executable trains 8 baggers, each using 20 bagged decision trees. In the end you get (quadratic loss displayed):

Cross-validated FOMs:
Node size=       5      FOM= 0.0591947
Node size=      10      FOM= 0.0604726
Node size=      20      FOM= 0.0595813
Node size=      50      FOM= 0.0586453
Node size=     100      FOM= 0.0621193
Node size=    1000      FOM= 0.0830538
Node size=    2000      FOM=  0.104705
Node size=    5000      FOM=  0.130985

This cross-validation exercise tells you that L=50 is close to the optimal value.
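
The cross-validation scheme itself can be sketched as follows (a toy illustration, not SPR code; trainAndEvaluate stands for training a classifier on the supplied events and returning its FOM on the held-out piece):

#include <cstddef>
#include <vector>

// Toy sketch of the cross-validation scheme described above (not SPR code).
// 'pieces' holds the equal subsets of the training data; 'trainAndEvaluate'
// is a user-supplied function that trains a classifier on the first argument
// and returns the FOM measured on the second (held-out) argument.
typedef std::vector<double> Event;

double crossValidate(const std::vector<std::vector<Event> >& pieces,
                     double (*trainAndEvaluate)(const std::vector<Event>&,
                                                const std::vector<Event>&))
{
  double sum = 0.;
  for( std::size_t i=0; i<pieces.size(); ++i ) {
    std::vector<Event> train;
    for( std::size_t j=0; j<pieces.size(); ++j )
      if( j != i ) train.insert(train.end(), pieces[j].begin(), pieces[j].end());
    sum += trainAndEvaluate(train, pieces[i]);   // evaluate on the excluded piece
  }
  return sum/pieces.size();                      // averaged FOM returned to the user
}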

The -g option of the two main executables, SprAdaBoostDecisionTreeApp and SprBaggerDecisionTreeApp, lets you choose an average per-event loss as the cross-validation figure-of-merit. If no -g option is specified, the FOM specified by the -C option is monitored on validation or cross-validation data.

2.12 Gene Expression Programming

Unfortunately, the authors do not have time to write documentation. SprGEPApp gives handles to all model parameters and supplies reasonable default values for those. Try it out and send feedback to Julian or Ilya (see AUTHORS).

2a Multi class learning

The executable for multi-class learning implemented in the package is SprMultiClassApp. It lets the user choose from the implemented multiclass algorithms using the -C option. The user can adjust class weights on the fly by applying the -w option followed by a list of weights. The list of weights must have as many entries as there are classes in the input data; these entries must be separated by commas. For example, -w 1,5,2 sets the total weight of the 1st class to 1, the total weight of the 2nd class to 5, and the total weight of the 3rd class to 2. The total weight settings apply to both training and validation (if supplied) data.

2a.1 Algorithm by Allwein, Schapire and Singer

Spr(Trained)MultiClassLearner implements this multi-class learning algorithm by training a series of binary classifiers for two-class problems. The reduction of the multi-class problem to a series of two-class problems is defined by an indicator matrix of size K*L, where K is the number of classes and L is the number of binary classifiers to be built. Matrix elements can take only 3 values: -1, 0 and 1. For each column of the matrix, 1 means that the corresponding class is treated as signal, -1 means that it is treated as background, and 0 means that it does not participate in this binary problem. Two popular training strategies have been implemented: one-vs-all and one-vs-one. For the one-vs-all strategy, the indicator matrix is of size K*K with diagonal elements equal to 1 and off-diagonal elements equal to -1. For the one-vs-one strategy, the indicator matrix is of size K*L, where L=K*(K-1)/2; in each column there is one 1 and one -1, while all other elements are zeros. The user can specify any other indicator matrix. For example, if one has a problem with background (class 0) and two similar signals (classes 1 and 2), one could train one classifier to separate the joint signal from the background and a second classifier to separate the two signals from each other. In this case the indicator matrix would look like this:

Row/Column |  0  1
------------------
 0         | -1  0
 1         |  1  1
 2         |  1 -1

To provide an arbitrary indicator matrix, the user must construct SprMultiClassLearner in the User mode. The two other modes, OneVsAll and OneVsOne, force regeneration of the indicator matrix according to the templates described above. Running SprMultiClassApp, the user needs to choose "-e 3" for the user-defined mode and use the "-i" option to specify an indicator matrix in an input file. See indicator.config for an example.

After training is completed, new data are classified using the user-specified per-event loss for each row of the indicator matrix. Each of the L classifiers is used to compute response f_l for a given point; l=1,...,L. Then average loss is evaluated for each row of the indicator matrix. For example, for quadratic loss we have Loss_k = \sum_{l=1}^{L} (I_{kl}-f_l)^2, where I_{kl} is an element of the indicator matrix (k-th row and l-th column). The point is classified as category k if Loss_k is the least of loss values computed for all rows. SprDataFeeder allows the user to save loss values computed for each row of the indicator matrix as well as the overall integer classification label for each data point.
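
As a minimal illustration of this rule (generic code, not the package's internal implementation), the row losses and the winning category could be computed as follows:

#include <vector>
#include <cstddef>

// Quadratic row losses for an indicator matrix I (K rows, L columns)
// and per-classifier responses f (length L). Returns the row index k
// with the smallest loss; the individual losses are returned in "loss".
std::size_t classify(const std::vector<std::vector<double> >& I,
                     const std::vector<double>& f,
                     std::vector<double>& loss)
{
  const std::size_t K = I.size();
  const std::size_t L = f.size();
  loss.assign(K, 0.);
  std::size_t best = 0;
  for( std::size_t k=0; k<K; ++k ) {
    for( std::size_t l=0; l<L; ++l )
      loss[k] += (I[k][l]-f[l])*(I[k][l]-f[l]);
    if( loss[k] < loss[best] ) best = k;
  }
  return best;
}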

SprMultiClassApp lets you use any classifier for binary class training. The restriction at the moment is that this has to be the same classifier for each pair of classes; that is, you cannot use, for example, a neural net to separate class 1 from class 2 and then a decision tree to separate class 1 from class 3. You can either use a neural net in both cases or a decision tree in both cases. The classifier is specified in a config file, similar to booster.config except it only takes one topmost entry. Due to certain specifics, this executable does not allow resuming training from a saved configuration file. Therefore, the "-r" command-line option automatically assumes zero training cycles and can be used only to read from the saved configuration file and compute the classifier response for new data. The "-y" option is mandatory - the user must specify all classes to be included in the multi-class learning algorithm.

As an example, try:

> SprMultiClassApp -a 1 -e 1 -y '0,1,2,3,4' -c multi.config -f save.spr -t gauss4_uniform_2d_valid.pat -v 1 gauss4_uniform_2d_train.pat

where multi.config consists of 2 lines:

AdaBoost 20 1 0 0.01
TopdownTree 5 0 1000 0

This will train 20 boosted decision trees for each column of the indicator matrix. AdaBoost and Bagger require 2 input lines in the config file instead of one - the 2nd line must specify the weak classifier to be boosted or bagged.

Also, try

> SprMultiClassApp -a 1 -e 2 -y '0,1,2,3,4' -c multi.config -f save.spr -t gauss4_uniform_2d_valid.pat -v 1 gauss4_uniform_2d_train.pat

and then

> SprMultiClassApp -a 1 -y '0,1,2,3,4' -r save.spr -o save.out -v 1 gauss4_uniform_2d_valid.pat

to see distributions for validation data.

Two options, -j and -W, are included to provide more flexibility for multiclass separation. The -j option tells the trained multiclass learner to ignore classes whose indicator matrix elements are 0 in a given column when computing the loss contribution associated with that classifier. If this option is chosen, loss values for each row of the indicator matrix are computed by summing weighted classifier responses:

Loss_k = \sum_{l=1}^{L} I(I_{kl} != 0) * W_l * (I_{kl}-f_l)^2 / \sum_{l=1}^{L} W_l

where I(I_{kl} != 0) equals 1 for every element I_{kl} of the indicator matrix not equal to 0 and equals 0 otherwise, and W_l is the weight of the classifier for column l. By default, W_l is determined by the total weight of all classes used for training this classifier, that is, all classes labeled as -1 or 1 in column l. If the user does not like the default setting, s/he can supply custom weights through the -W option. There must be as many weights in the list supplied by -W as there are classifiers, or equivalently columns of the indicator matrix.

The package also allows the user to train binary classifiers separately and combine them into a multiclass learner using SprAddColumnsForMCLApp. The user can train different binary classifiers for different columns of the matrix and fine-tune their parameters separately. As an example, do:

> SprDecisionTreeApp -a 1 -y '0:1' -n 100 -T -F 0vs1.spr -t gauss4_uniform_2d_valid.pat gauss4_uniform_2d_train.pat
> SprDecisionTreeApp -a 1 -y '0:2' -n 100 -T -F 0vs2.spr -t gauss4_uniform_2d_valid.pat gauss4_uniform_2d_train.pat
> SprDecisionTreeApp -a 1 -y '1:2' -n 1000 -T -F 1vs2.spr -t gauss4_uniform_2d_valid.pat gauss4_uniform_2d_train.pat

Now prepare a file add.config with the following lines:

0 1 2
0:1   1   0vs1.spr
0:2   1   0vs2.spr
1:2  10   1vs2.spr

The first line lists all classes. The following lines specify classifiers. The first entry on each line specifies what classes are used; the usual syntax with commas and colons applies - you can use, for example, 0,1:2 to separate classes 0 and 1 from class 2. The 2nd entry is the weight of this classifier. The 3rd entry is the configuration file to which this classifier was saved. The example above implements the one-vs-one strategy, but you can use any strategy of your choice, however complex.

Now run

> SprAddColumnsForMCLApp add.config mc_3classes.spr

The obtained mc_3classes.spr stores the multiclass learner configuration. You can apply it to test data:

> SprMultiClassApp -a 1 -y '0,1,2' -r mc_3classes.spr gauss4_uniform_2d_valid.pat

to display the loss table.

2a.2 Binary encoder

Another multiclass algorithm included in the package is "binary encoder". It encodes multiclass data with D features, K classes and N instances as binary data with D+1 features, 2 classes and N*K instances. For example, a vector X of class 2 for a problem with 3 classes is encoded as 3 vectors:

X,1    X,2    X,3

where X,2 is labeled as class 1 and the two others are labeled as class 0. To compensate for the fact that for every binary signal event this method supplies (K-1) binary background events, the weights of all binary signal events are multiplied by K.
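
As a minimal sketch of this encoding (generic code, not the package's internal implementation; the struct and function names are made up for illustration):

#include <vector>

struct BinaryEvent {
  std::vector<double> x;  // D+1 coordinates: original features plus assumed class index
  int label;              // 1 if the assumed class equals the true class, 0 otherwise
  double weight;
};

// Expand one multiclass event (features x, true class trueClass in 1..K,
// weight w) into K binary events, as described above.
std::vector<BinaryEvent> encode(const std::vector<double>& x,
                                int trueClass, int K, double w)
{
  std::vector<BinaryEvent> out;
  for( int c=1; c<=K; ++c ) {
    BinaryEvent e;
    e.x = x;
    e.x.push_back(c);                      // appended (D+1)-st feature
    e.label  = (c==trueClass ? 1 : 0);
    e.weight = (c==trueClass ? K*w : w);   // signal weight multiplied by K
    out.push_back(e);
  }
  return out;
}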

The disadvantage of this method is the K-fold increase in memory consumption. The advantage is that the user can monitor the performance of the multiclass learner on validation data versus the number of training cycles using the machinery of binary classifiers. Just specify a validation dataset and the frequency of print-outs of the validation FOM, in the same way as it is done for binary classifiers:

> SprMultiClassApp -a 1 -e 1 -y '0,1,2,3,4' -c multi.config -f save.spr -t gauss4_uniform_2d_valid.pat -v 1 -C 2 -d 1 gauss4_uniform_2d_train.pat

-C 2 invokes the binary encoder and -d 1 monitors validation loss at each boosting cycle.

The trained binary encoder computes the class label for a new event by assuming every class from the multiple class list in turn and choosing the one for which the underlying binary classifier gives the largest, most signal-like response value.
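
A sketch of this rule, using the same (D+1)-dimensional encoding as above (generic code; binaryResponse() is a placeholder for the trained underlying binary classifier, with larger values meaning "more signal-like"):

#include <vector>

double binaryResponse(const std::vector<double>& xPlusClass); // placeholder

// Choose the class index in 1..K with the largest binary response.
int chooseClass(const std::vector<double>& x, int K)
{
  int best = 1;
  double bestResp = 0.;
  for( int c=1; c<=K; ++c ) {
    std::vector<double> xc(x);
    xc.push_back(c);                 // assume class c
    double r = binaryResponse(xc);
    if( c==1 || r>bestResp ) { bestResp = r; best = c; }
  }
  return best;
}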

Note: due to subtleties of the implementation, this method does not work with binary splits.

2a.3 Treatment of missing values

The two multiclass learners implemented in the package are capable of processing missing values for each input class separately. This functionality is unique to the multiclass learners; it has not been implemented for binary classifiers. Running SprMultiClassApp, the user has two choices for dealing with missing values:

1) Just like with any other method, multiclass or binary, the user can compute replacements for the missing values by training a Missing transformation with SprTransformationApp. This transformation computes replacement values (medians or averages) irrespective of the class. The -Q option of SprMultiClassApp then applies this class-blind transformation to the input data.

2) The user can apply the -L option of SprMultiClassApp without the -Q option. In this case, all values below the cutoff specified by -L are treated as missing. The replacements for these missing values are computed for each input class separately. The multiclass learner (whichever one is chosen by the user) is trained on data with missing values replaced by the values computed for the specific class; that is, if a point is from class 2 and has a missing value in coordinate x5, that value is replaced by the median of variable x5 computed for training points from class 2. The trained multiclass learner, of course, does not know the true class of a new point. To compute the class label for such a point, the trained learner assumes missing values for every input class in turn (provided the point has some values missing) and records the loss computed for the assumed class with the corresponding replacements. It then chooses the class which gives the minimal loss.

In the 2nd approach, missing values in the training data are replaced by medians of the 1D marginal distributions in the multiclass learner constructor. These changes cannot be applied to the original supplied data, simply because these data can be re-used for other tasks, e.g., cross-validation. The multiclass learner therefore creates a temporary copy of the input data in memory with missing values replaced, and the amount of consumed memory doubles. For that reason, the user should be careful when applying this algorithm. If it is used in conjunction with the binary encoder, the total memory consumption increases by a factor of (K+1), where K is the number of input classes: first the data are doubled and then the replica of the data is multiplied by a factor of K.

2a.4 Indicator matrix generation for Allwein-Schapire-Singer

Although the one-vs-all and one-vs-one multi-class separation strategies are popular, they are not necessarily the most powerful. Work rooted in error-correcting output code (ECOC) algorithms suggests that matrices with maximized Hamming distance between rows (classes) or maximized diversity between columns (classifiers) can be more efficient. For a brief review, we recommend Ludmila Kuncheva, "Using diversity measures for generating error-correcting output codes in classifier ensembles", Pattern Recognition Letters 26 (2005), pp. 83-90.

SprIndicatorMatrixApp implements several approaches for generating such matrices. All input parameters for the executable are set in an input configuration file; see samplematrixindicator.cfg for an example. The generated matrix can be plugged into SprMultiClassApp using the -i option together with -e 3.

The "ovoova" algorithm simply generates one-vs-all and one-vs-one matrices; this functionality is already provided by SprMultiClassApp. The "random" algorithm generates a user-specified number of random matrices of a given size and chooses the one with the best measure of row and/or column separation. The "random hill" algorithm implements simulated annealing, aka random hill climbing: changing the sign of one randomly chosen matrix element defines one optimization step, and the probability of accepting a change that leads to a worse measure of separation is given by probKeepBadChange. The "exhaustive" algorithm generates all possible class assignments for K classes and produces a K*L matrix, where L=2^(K-1)-1 is the number of classifiers. Obviously, it is not recommended for a large number of classes.
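
For illustration, with K=3 classes the exhaustive algorithm produces L=2^(3-1)-1=3 columns, one for each distinct way of splitting the three classes into a signal group and a background group (the column order and sign conventions shown here are arbitrary):

Row/Column |  0  1  2
----------------------
 0         |  1  1  1
 1         |  1 -1 -1
 2         | -1  1 -1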

3. Data and filters

Each data point, SprPoint, is described by a vector of its coordinates, its class and its index (a non-negative number assigned by the user to distinguish points). SprData is a collection of SprPoint's.

In many applications we need to weight data points and impose cuts. This can be accomplished by imposing a filter. Every filter that can be used with the package must inherit from SprAbsFilter. Note that pretty much all classes, e.g., classifiers, that need access to data do not talk to data directly; instead they get access to a filter applied to this data. If you don't need any filtering, you can use SprEmptyFilter.

SprAbsFilter itself is capable of filtering on 1) event class and 2) event index. The index here refers not to the user-defined index of each point but to the flat index in the list of data points, ranging from 0 to (length-1) and given by the order in which the points were added to SprData. User-defined indexes of SprPoint's are used simply to give each point a unique id number.

A filter can also apply weights to events. These weights are not normalized to unit sum. If you wish to keep the weights normalized, you need to either call the normalizeWeights() method after filter construction and after every filter() operation, or supply normalized weights to the filter constructor.

SprAbsFilter::pass() is a pure virtual method that must be implemented by specific filters. It determines if a data point satisfies the requirements imposed by the filter.

At the moment, four filters are implemented:

1) SprEmptyFilter which does not impose any cuts and simply reproduces the base SprAbsFilter functionality.

2) SprBoxFilter which imposes rectangular cuts on the input space

3) SprTransformerFilter which applies variable transformations defined by SprAbsVarTransformer to input data

4) SprVarSelectorFilter which selects a subset of input variables

To put filtering in effect, the user needs to run the filter() method. The applied requirements and weights can then be removed with clear(). Note that clear() in SprAbsFilter restores the filter to its original state, that is, all requirements are removed and the weights are reset to the weights with which the filter was constructed. The exception to this design is SprTransformerFilter: after transform() is applied, there is no way of going back to the original variables.

To reduce memory consumption, the user can apply the irreversibleFilter() method. If irreversibleFilter() is used, events that do not satisfy the imposed requirements are removed permanently and cannot be restored by clearing the filter. This method cannot be applied if the SprData owned by the filter owns its SprPoint's.

Each filter implementation should also have a copy-constructor. However, for SprAbsFilter the copy-constructor does not do the usual deep copying. Its purpose in this case is to provide a consecutive filter. SprAbsFilter's copy-constructor copies the current filter's state, that is, weights and all imposed requirements. Only events that satisfy these requirements are fed into the new filter. If you create filter 2 out of filter 1 using a copy-constructor, apply some extra filtering to filter 2 and then call clear() on filter 2, filter 2 will restore weights and requirements used at its construction. That is, you will never get access to the original weights and data used by filter 1 - unless, of course, you call clear() on filter 1 before you copy-construct filter 2.

Note that the original SprEmptyFilter produced by SprSimpleReader owns the input data. If you impose consecutive filters, they will hold pointers to the data owned by the original filter. Therefore, you are not allowed to delete the original filter - as long as you need access to the input data. If you delete the original filter, consecutive filters will point to a non-existent piece of memory.

4. Imposing simple cuts on input variables

Cuts are imposed using an SprCut typedef specified in SprDefs. SprCut is a vector of pairs. Each pair in the vector represents lower and upper bounds. The idea is that the cut on a variable is satisfied if the value of a variable lies between the lower and upper bounds for any pair in the vector. A simple one-sided cut is modeled as a vector with only one pair which has either the lower or upper bound equal to the numeric limit on double (see SprUtils).
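
As a minimal sketch (assuming SprCut is indeed a typedef for std::vector<std::pair<double,double> >; the real typedef lives in SprDefs and the numeric limits are provided by SprUtils), a two-interval cut and a one-sided cut could be built like this:

#include <vector>
#include <utility>
#include <limits>

typedef std::vector<std::pair<double,double> > SprCut; // assumed typedef

SprCut makeTwoIntervalCut()
{
  // Satisfied if the variable lies in [-2,-1] or in [1,2].
  SprCut cut;
  cut.push_back(std::make_pair(-2., -1.));
  cut.push_back(std::make_pair( 1.,  2.));
  return cut;
}

SprCut makeOneSidedCut()
{
  // Satisfied if the variable exceeds 0.5: the upper bound is set to
  // the numeric limit on double.
  return SprCut(1, std::make_pair(0.5, std::numeric_limits<double>::max()));
}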

5. Variable transformations and feature extraction

There are two ways of applying variable transformations in the package:

1) While input data is being read from disk to memory. This functionality is provided by SprPreFilter and discussed in Section 8.

2) After input data has been read into memory. This functionality is provided by SprTransformerFilter and SprAbsVarTransformer and discussed here.

SprAbsVarTransformer provides an interface for variable transformations. Conceptually, variable transformation and feature extraction are the same thing. This is why all transformers include the train() method, used to find an optimal transformation.

The user can apply an arbitrary sequence of transformations using the SprTransformationApp executable. The list of desired transformations is specified as an input string, for example, PCA,Normalize. The transformations are applied from left to right; in this case PCA will be followed by Normalize. After training, the overall PCA,Normalize transformation can be saved into a file and then read back using the -Q command-line option of most executables.
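
By analogy with the Cleveland example in Section 16, training such a sequence on an Ascii dataset and saving it might look like this (the file names and the class selection are illustrative placeholders):

> SprTransformationApp -a 1 -y '0:1' 'PCA,Normalize' train.pat pca_norm.spr

The saved file can then be passed to other executables via -Q pca_norm.spr.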

The package implements 3 specific transformations: PCA, Normalize, and Missing (replacement of missing values; see Section 15).

If you wish to use transformations interactively from a Root session, see root/spr_transform.C for examples.

To use a trained variable transformation in your C++ code, use syntax similar to that of trained classifiers:

const SprAbsVarTransformer* t
  = SprVarTransformerReader::read("saved_trans.spr");
assert( t != 0 );
if( !t->ready() ) {
  cerr << "Transformer not ready." << endl;
  abort();
}
vector<string> oldVars, newVars;
t->oldVars(oldVars); // variables before transformation
t->newVars(newVars); // variables after transformation
vector<double> v = ....;
vector<double> transformed;
t->transform(v,transformed);
delete t;// delete when you are done

6. Treatment of input classes

Classes in data input must be coded as integers. For the purpose of classification, however, the user can group classes in any way s/he likes. This functionality is implemented in SprClass. Each SprClass has a vector of integer numbers and a negation flag. For instance, input data has 5 classes numbered from 0 to 4. If the user wants to treat classes 1 and 3 as signal and all the others as background, s/he should make two SprClass objects. There are several ways of making them:

a) 1st SprClass object has integers 1 and 3 and the negation flag is set to false. 2nd SprClass object has integers 0, 2, and 4, and the negation flag is also false.

b) 1st SprClass object has integers 1 and 3 and the negation flag is set to false. 2nd SprClass object has integers 1 and 3, and the negation flag is set to true.

c) …

In case (b), the meaning of the 2nd SprClass object is "any class but 1 and 3". For 5 input classes, this is equivalent to "classes 0, 2, and 4".

Most executables have the "-y" command-line option, which lets the user make this choice interactively. The corresponding C++ method is SprAbsFilter::chooseClassesFromString(). For a description of the syntax, see SprAbsFilter.hh.
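
For instance, judging by the examples in Section 16 (where -y '0:.' selects class 0 as the background group and everything else as the signal group), the 5-class case above could be handled with -y '0,2,4:1,3', which puts classes 0, 2 and 4 in the first (background) group and classes 1 and 3 in the second (signal) group; the exact syntax and ordering conventions are documented in SprAbsFilter.hh.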

The content of each SprClass object can be printed using the "<<" operator. It shows all classes as a list of integers separated by commas and the negation flag in parentheses as either +1 (no negation) or -1 (negation is in effect).

In this approach, original event classes specified in input data are stored, unchanged, into an output ntuple. This allows the user to see what specific processes contribute to specific regions of the classifier output. The ntuple does not carry information about what class was treated as background and what class was treated as signal; in most situations it is fairly easy to figure this out just by looking at the correlation between the class number and the classifier output, but in principle the user has to keep an independent record of how the classes were split into signal and background for the purpose of this classification job.

7. Estimation of variable importance

The ultimate goal of the variable selection analysis is to select a subset of lowest dimensionality without significant loss in the predictive power.

Tools for variable selection fall into two groups: 1) methods that work with any classifier and 2) methods that work only with one specific classifier.

7.1 Correlation with the class label

The classifier-independent tool implemented in the package (SprExploratoryAnalysisApp) computes correlations between variables and the class label. SprDataMoments implements two methods: correlClassLabel() to compute corr(X,Y) and absCorrelClassLabel() to compute corr(|X-EX|,Y), where X is the variable in question, EX is its expectation and Y is the class label. In principle, the larger the magnitude of the correlation, the more powerful the variable is for classification. In reality, this method of variable selection is not very reliable. It is easy to construct an example where corr(X,Y)=0 but X is nevertheless a powerful classification variable; for instance, if the signal populates two regions placed symmetrically on either side of the background, corr(X,Y) vanishes while X separates the two classes well. It is harder (but possible) to construct an example where both corr(X,Y) and corr(|X-EX|,Y) are close to zero but the variable X is powerful nevertheless. The reverse is not true: if one of these correlations is close to 1 or -1, this is most likely a powerful classification variable.

7.2 Counting decision splits

For decision trees, you can count the number of decision splits on each input variable and estimate the change in the split criterion FOM due to this variable. The change in the FOM is a more reliable measure of the variable importance than the split counting. Both these options are invoked by the "-i" command-line option with the decision-tree-based executables.

7.3 Variable importance for neural networks

For neural networks, you can estimate the importance of each input variable by summing weights of links connected to the corresponding input node over all hidden nodes. Imagine a simple neural network with one hidden layer and one output node. Let us denote the weight of a link from an input node d to a hidden node h as w(d,h). Let us denote the weight of a link from this hidden node to the output node as w(h,o). The importance of input variable d is then estimated as

\sum_h abs(w(d,h)*w(h,o))

where the sum is taken over all hidden nodes. For a more complex case with several hidden layers, this formula would be modified accordingly, in a straightforward manner.
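
A minimal sketch of this computation for a single hidden layer (generic code, not SPR's internal implementation): w_in[d][h] is the weight of the link from input d to hidden node h, and w_out[h] is the weight of the link from hidden node h to the output node.

#include <vector>
#include <cmath>
#include <cstddef>

// Importance of input d: sum over hidden nodes h of |w_in[d][h]*w_out[h]|.
std::vector<double> importance(const std::vector<std::vector<double> >& w_in,
                               const std::vector<double>& w_out)
{
  std::vector<double> imp(w_in.size(), 0.);
  for( std::size_t d=0; d<w_in.size(); ++d )
    for( std::size_t h=0; h<w_out.size(); ++h )
      imp[d] += std::fabs(w_in[d][h]*w_out[h]);
  return imp;
}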

This method is applicable only if the input data have been normalized to the same input range. It is strongly recommended to apply the Normalize transformation to data before estimating variable importance using this recipe.

To invoke this method, use "-j" option of SprStdBackpropApp.

7.4 Estimation of variable importance by random permutation of class labels

SprVariableImportanceApp estimates the importance of each input variable for any applied classifier. It randomly permutes class labels in the provided dataset for each input variable in turn and estimates the increase in the classification loss due to this mislabeling. The more important the variable, the greater the increase in the classification loss. The user can specify the number of permutations per variable over which the increase in loss is averaged; the default is one permutation, but more than one is recommended for more accurate results. The quadratic loss is chosen as the default: classifiers whose output ranges from -infty to +infty have their output automatically mapped onto [0,1] before the loss is computed. For a multi-class learner, the default loss is the misidentified fraction of events.

It is recommended to run this method on data not used for training this classifier. Because classifiers like random forest or boosted decision trees achieve perfect separation on training data, the increase in loss computed for the training data can be misleading.

7.5 "Add N Remove R" method

A more powerful approach is the "add n remove r" method implemented in SprAddNRemoveRApp. This executable uses the user-defined classifier and loss to find an optimal combination of input variables by adding n and removing r variables at each optimization step. This method is CPU-intensive and should only be tried on small datasets. For really small datasets, the user can improve the accuracy of variable selection by cross-validating on the training set (no test data set is supplied in this case).

This method works very well with decision-tree-based classifiers. But it cannot be used with neural networks, because the network structure would have to change whenever variables are added or removed. This part of the algorithm could, of course, be automated too, but I am reluctant to provide too many magic red buttons that spit out some numbers as soon as they are pressed.

Set the verbosity of SprAddNRemoveRApp to "-v 1" to see in what order variables are added and eliminated.

For the multi-class algorithms, the "add n remove r" method is implemented as part of SprMultiClassApp. Again, use it with "-v 1".

7.6 Variable interactions

SprVariableImportanceApp can also estimate an interaction between two subsets of input variables, S1 and S2, for the selected classifier. The interaction is defined as

corr(F(S1),F(S2))

where F(S1) is the response of the classifier at a given point averaged over all variables not included in S1, similarly for F(S2). A weak correlation indicates that one can build a classifier on S1 and an independent classifier on S2, while building a common classifier on the union of S1+S2 will likely not increase the overall classification power.

The input syntax for defining variable subsets is similar to the syntax used for grouping integers into classes. The two subsets of variables are separated by a colon, and variables within one subset are separated by commas. A single dot . means "all variables but those included in the other group"; for example, x1:. means that the method will compute an interaction between x1 and all variables excluding x1. If an empty string is entered, the method computes interaction between each variable known to the trained classifier and all other variables.

The -s option of SprVariableImportanceApp invokes an algorithm that selects variables with the largest interactions sequentially. First, it selects the variable that interacts most with all other variables, then the 2nd variable that interacts most with all other variables except the variable selected at step 1, then the 3rd variable that interacts most with all other variables except variables 1 and 2, etc. The outcome is a list of variable subsets with corresponding interactions. The variable subsets in this list are ordered by size: the 1st subset consists of one variable, the 2nd subset of 2 variables etc, with each subset being the most powerful among the subsets of the same size, as measured by the strength of interaction. This method is less CPU-intensive than the Add N Remove R algorithm. The hope is that it can rank input variables reliably enough to let the user choose a subset of lowest dimensionality, and thus serve as a CPU-cheap substitute for the Add N Remove R algorithm. This method is currently being tested on real-world datasets.

7.7 Which variable selection method is best?

The methods above only provide crude estimates. The best (and most time-consuming) approach is to try all or most variable combinations for each classifier and select the combination that gives the best performance.

Because boosted decision trees and random forest perform well in high dimensions, the usual strategy is to start with all variables for these classifiers and throw them away as long as the classification power does not deteriorate past the level you can tolerate.

For neural networks, one usually deploys the opposite strategy: start with a few powerful and well-understood variables and add more as long as you see a noticeable gain in performance.

Note that, unlike some other implementations, SPR makes no attempt to normalize variable importance values to add up to one. This lets the user monitor how the importance of a certain variable changes on the absolute scale due to addition or removal of other variables.

8. Handling input variables

If input data are read from a Root file, the user specifies explicitly what variables are taken from the input ntuple. If input data are read from an Ascii file, all variables included in the Ascii file are by default read in. If you want to exclude some of them, you can do so by using "-z" option of the executables.

The -V command-line option of most executables lets the user include only specified variables. Again, it works only for Ascii input.

Generally, it is the responsibility of the user to keep track of what variables were used for training and supply the same variables in the same order to a trained classifier. That is, if you exclude certain variables from training using "-z" option, you have to remember to exclude the same variables when you compute classification values for this dataset. Note that the list of variables used for training is saved at the bottom of each trained classifier configuration file.

To make your life easier, some executables have the "-M" option. This option maps the variables used for training (listed at the bottom of the classifier configuration file) onto the variables available in the input data. If the "-M" option is provided, you don't have to remember what variables were used for training; you only have to make sure that identical variables have identical names in the training and test datasets. This option is always turned on for classifiers used via SprRootAdapter.

Sometimes you want to exclude variables from training but store them into the output root (or Ascii) file. This can be done using "-Z" option. A typical application of the "-Z" option would look like this:

unix> SprAdaBoostDecisionTreeApp -a 1 -n 100 -z 'var1,var2' -f save.spr train.pat

and then

unix> SprAdaBoostDecisionTreeApp -a 1 -n 0 -Z 'var1,var2' -r save.spr -o test.root test.pat

This will train boosted decision trees using all input variables except var1 and var2, then compute the classifier response for the test data and store these two variables into test.root as well. For comparison,

unix> SprAdaBoostDecisionTreeApp -a 1 -n 0 -z 'var1,var2' -r save.spr -o test.root test.pat

would do the same thing without storing var1 and var2 into the output ntuple.

Starting with SPR-6, the user can filter data, apply variable transformations and compute the class label for each event as a function of the event coordinates. All these operations are performed while the data is being read from a file. An example is shown in exampleUserCuts.cc. Keep in mind that selection requirements are applied before transformations - you always cut on the untransformed values. These methods are implemented in SprPreFilter and have nothing to do with the generic filtering described in Section "3. Data and filters". SprAbsFilter described in Section 3 is applied to data that has been stored in memory and converted into SPR format. Most SprAbsFilter methods hide pieces of data from the user but do not remove them permanently from memory; these operations can be reversed and the user can get access to the full dataset again. SprPreFilter selection, in contrast, is applied while data is being read from input files, and its requirements and transformations have a permanent effect - for example, events that do not pass the requirements never get stored in memory.

9. Data input and output

9.1 Supported I/O types: Ascii and Root

Input is implemented from Ascii and Root files.

The Ascii input format is close to that of SNNS. There are two essential rules:

For details and acceptable Ascii formats, see SprSimpleReader.hh and examples in data/.

Ascii input format 7 allows reading input data from output files produced by SprOutputWriterApp. This is useful for sequential training and for splitting data into training and test sets.

For details on Root input, see SprRootReader.hh and examples in data/.

If the package is built against Root, the user can choose between input and output from/to Ascii and Root files. For Ascii input, the user must specify -a flag for all executables with an appropriate integer. If -a is omitted or "-a 0" is specified, the executable assumes that input data must be fetched from Root files. For output, the default choice is Root. To output data into Ascii files, the user must specify -A flag on the command line.

9.2 Splitting data into training and test sets

SprSplitterApp lets the user split input data into training and test sets in the specified proportion using the specified random seed. If you save data in Ascii format, it will be saved in format 7. You can read the split data back by running any executable with "-a 7". If you save the split data in Root, you have to prepare a new *.pat file for input from Root (see examples in data/). This should be straightforward.

Note that most executables include -K option for splitting input data into training and test sets on the fly. This option will split input data keeping events in the same order, for example, -K 0.5 will treat the top 50% of events of each class as training data and the bottom 50% as test data. -D in combination with -K randomizes splitting. However, the user does not have any control over the random seed. Depending on subtleties of the implementation, the behavior can be non-deterministic, that is, if you apply -K and -D options tomorrow, you may get a different splitting. In contrast, results obtained by SprSplitterApp with a user-defined random seed are expected to be reproducible (to the extent of reproducibility of the random number generator output for different compilers and platforms).

Splitting into training and test datasets with different random seeds is a valuable tool for the analysis of small datasets, if you want to investigate how sensitive your conclusions about the classifier performance are to the choice of training data.

9.3 SprIOTestApp: displaying data content

SprIOTestApp displays info about all classes found in the input data and their content. It is also a simple executable to play with if you just want to test the format of your input files or try storing data in one of the available formats.

10. Computing output for trained classifiers

Data, event weights and classifier outputs can be stored into ntuples using SprDataFeeder. The feeder can work with any SprAbsWriter. At present there are 2 implemented writers, SprRootWriter, capable of storing data into Root, and SprAsciiWriter for data storage into Ascii files. The feeder will feed all data points obtained from an input filter with classifier outputs to the writer. You can add as many classifiers to the output as you like using addClassifier() method of the feeder.

Spr{Root,Ascii}Writer will save input coordinates for each point with column names extracted from the input data file and with the output of all classifiers. Besides, it will always fill the following 3 columns:

SprOutputWriterApp reads the saved configuration of any trained classifier, computes classifier responses for the supplied data and saves the output into a file. Prior to tag V05-00-00 this functionality was distributed among several executables. For example, to read the saved AdaBoost configuration from a file and apply it to data, the user would have to specify zero training cycles ("-n 0") and a file to read from using the "-r" option of the corresponding AdaBoost executable. SprOutputWriterApp is the unified recommended replacement.

11. Installation instructions

See INSTALL.

12. How to use trained classifiers in your C++ code

This syntax has changed in tag V05-00-00. Now all trained classifiers are read in using the unified interface of SprClassifierReader. For example, to apply the saved AdaBoost configuration to data, the user would have to do:

SprAbsTrainedClassifier* ada
  = SprClassifierReader::readTrained("saved_adaboost.spr");
assert( ada != 0 );
vector<double> input = ...;
double ada_output = ada->response(input);
....
delete ada;// delete after you are done

Similarly, for Bagger.

If you know what classifier is saved in the configuration file, you might want to use the type-safe syntax:

SprTrainedAdaBoost* ada
  = SprClassifierReader::readTrained<SprTrainedAdaBoost>(
                           "saved_adaboost.spr","AdaBoost");
assert( ada != 0 );

In this case the read will succeed only if the saved classifier is indeed AdaBoost. (For this to work, make sure that template instantiations in SprClassifierReader are uncommented!! Since they break package compilation for some compilers, I may keep them commented out in the source code.)

For the multi-class learner, you need to use SprMultiClassReader:

SprAbsTrainedMultiClassLearner* multiclass =
  SprMultiClassReader::readTrained("saved_multiclass.spr");
assert( multiclass != 0 );
vector<double> input = ...;
map<int,double> output;
int category = multiclass->response(input,output);
....
delete multiclass;// delete after you are done

13. Working in Root-free environment

SprInteractiveAnalysisApp and SprOutputAnalyzerApp are the two executables that let you largely avoid using any graphics packages. If you built the package without Root, you can use the following sequence to assess the performance of classifiers:

a) Train a bunch of classifiers, for example:

SprBaggerDecisionTreeApp -a 1 -n 100 -l 5 -f bagger.spr train.pat
SprAdaBoostDecisionTreeApp -a 1 -n 100 -l 1000 -f ada.spr train.pat

b) Compute output for test data:

SprOutputWriterApp -a 1 -A -C 'ada,bag' 'ada.spr,bagger.spr' test.pat test.out

c) Analyze the Ascii output:

SprOutputAnalyzerApp -C 'ada,bag' test.out

See comments in SprOutputAnalyzerApp.cc.

If you built the package with Root, you can't use step (c) but, of course, you could and should use steps (a) and (b) (without the "-A" flag).

For small data sets in not too many dimensions, you can combine all these steps in one and compare several classifiers using SprInteractiveAnalysisApp:

SprInteractiveAnalysisApp -a 1 train.pat test.pat

14. Working in Root environment

Starting with SPR release 7, the user can run SPR interactively from a Root session. See root/spr_tutorial.C and other examples in root/. The package needs to be built against Root. This functionality is provided by the SprRootAdapter class. See the interface in include/StatPatternRecognition for a list of all available options. This interface implements wrappers to most SPR methods and gives the user quick access to Root graphics.

I strongly advise against running CPU-intensive jobs within Root environment, whether interactive or batch. The SPR Root wrapper has been provided to let the user run interactive analysis on small datasets in not too many dimensions and give the user access to graphics. If you need to train on datasets of appreciable size, do it by running SPR executables. To enjoy graphics, you can then load test data and the saved classifier configuration into an interactive Root session using loadData..() and loadClassifier() methods.

15. Missing values

By convention missing values must be encoded as values below the allowed physical range. There are two ways of dealing with missing values:

Replacement of missing values works like any other input variable transformation. First, you need to train it by running SprTransformationApp. The Missing transformation requires a lower cutoff and optionally takes the transformation mode. All values below the specified cutoff are treated as missing. The mode lets the user choose between computing medians or averages. After training is completed and the transformation is stored into a file, it can be applied to any data using -Q option of almost any executable.

Note that the missing values are always replaced by medians or averages computed for all classes chosen by the user from the input data. The exception is the multiclass algorithms accessed through SprMultiClassApp. By specifying the -L option for SprMultiClassApp, the user invokes class-specific computation of the medians of the marginal distributions. The missing values are then replaced by the computed medians assuming every class hypothesis in turn, and the classification label is given by the class hypothesis that provides the smallest loss for a given row of the indicator matrix. For more detail, see the section on multiclass learning.

Different packages have different conventions for coding missing values, for example: "?", "NaN" etc. If someone gives me a convincing argument why this is better than coding them as large negative numbers, I may one day implement a similar convention as well. The advantage of my approach is that 1) it does not require any extra code for I/O, and 2) missing values can be easily visualized.

16. Example: Cleveland heart data

I'd like to give an example of the full analysis chain using Ascii I/O. This is intended as a quick overview of what can be done using the package.

See how much data we got in total:

[narsky@hepsubmit share]$ ../bin/SprIOTestApp -a 1 cleveland.data
Read data from file cleveland.data for variables "age" "sex" "cp" "trestbps" "chol" "fbs" "restecg" "thalach" "exang" "oldpeak" "slope" "ca" "thal"
Total number of points read: 303
Found 5 classes in data:
  Class:          0(1)    Points:        164
  Class:          2(1)    Points:         36
  Class:          1(1)    Points:         55
  Class:          3(1)    Points:         35
  Class:          4(1)    Points:         13
No input classes specified in string  Will use 0 and 1 by default.
Training data filtered by class.
Points in class 0(1):   164
Points in class 1(1):   55

This gives you the breakdown of classes in the input data. (The last 4 lines are irrelevant.)

Let us group the classes in two groups. Class 0 (no disease) is treated as background, and classes 1-4 (various stages of disease) are treated as signal:

[narsky@cithep97 share]$ ../bin/SprIOTestApp -a 1 -y '0:.' cleveland.data
Read data from file cleveland.data for variables "age" "sex" "cp" "trestbps" "chol" "fbs" "restecg" "thalach" "exang" "oldpeak" "slope" "ca" "thal"
Total number of points read: 303
Training data filtered by class.
Points in class 0(1):   164
Points in class 0(-1):   139

Split input data into training and test subsets in equal proportion:

[narsky@cithep97 share]$ ../bin/SprSplitterApp -a 1 -A -K 0.5 -S 72727 -y '0:.' cleveland.data clev_train.pat clev_test.pat
Read data from file cleveland.data for variables "age" "sex" "cp" "trestbps" "chol" "fbs" "restecg" "thalach" "exang" "oldpeak" "slope" "ca" "thal"
Total number of points read: 303
Input data filtered by class.
Points in class 0(1):   164
Points in class 0(-1):   139
Splitting input data with factor 0.5
Data re-filtered:
Training data:
Points in class 0(1):   82
Points in class 0(-1):   69
Test data:
Points in class 0(1):   82
Points in class 0(-1):   70
Warning: file clev_train.pat will be deleted.
Warning: file clev_test.pat will be deleted.
Warning: no classifiers specified for SprDataFeeder. Will save data only.
Feeder storing point 0 out of 151
Writer successfully closed.
Warning: no classifiers specified for SprDataFeeder. Will save data only.
Feeder storing point 0 out of 152
Writer successfully closed.

Normalize inputs for application of a neural net. There are a few missing values in the data that are encoded as -11.

[narsky@cithep97 share]$ ../bin/SprTransformationApp -a 1 -y '0:.' -L '-10' 'Missing,Normalize' cleveland.data clev_norm.spr
Read data from file cleveland.data for variables "age" "sex" "cp" "trestbps" "chol" "fbs" "restecg" "thalach" "exang" "oldpeak" "slope" "ca" "thal"
Total number of points read: 303
Training data filtered by class.
Points in class 0(1):   164
Points in class 0(-1):   139
Transformation sequence has been trained.
Stored the trained transformation

Train a neural network (I omit the part where I find an optimal configuration of the neural net):

[narsky@cithep97 share]$ ../bin/SprStdBackpropApp -a 7 -y '0:.' -Q clev_norm.spr -N '13:6:3:1' -n 1 -l 400 -L 0.1 -i 10 -g 1 -d 10 -t clev_test.pat clev_train.pat
Read data from file clev_train.pat for variables "age" "sex" "cp" "trestbps" "chol" "fbs" "restecg" "thalach" "exang" "oldpeak" "slope" "ca" "thal"
Total number of points read: 151
Training data filtered by class.
Points in class 0(1):   82
Points in class 0(-1):   69
Read validation data from file clev_test.pat for variables "age" "sex" "cp" "trestbps" "chol" "fbs" "restecg" "thalach" "exang" "oldpeak" "slope" "ca" "thal"
Total number of points read: 152
Points in class 0: 82 1: 70
Validation data filtered by class.
Points in class 0(1):   82
Points in class 0(-1):   70
Variable transformation from file clev_norm.spr has been applied to training and validation data.
Per-event loss set to Quadratic loss (y-f(x))^2
Will train Discrete AdaBoost.
Classes for StdBackprop are set to 0(1) 0(-1)
StdBackprop initialized with classes 0(1) 0(-1) nCycles=400 structure=13:6:3:1 LearningRate=0.1
Validation Loss=0.249513 at cycle 0
Validation Loss=0.248784 at cycle 10
Validation Loss=0.248775 at cycle 20
Validation Loss=0.248765 at cycle 30
Validation Loss=0.248752 at cycle 40
Validation Loss=0.248734 at cycle 50
Validation Loss=0.248704 at cycle 60
Validation Loss=0.24864 at cycle 70
Validation Loss=0.248459 at cycle 80
Validation Loss=0.247667 at cycle 90
Validation Loss=0.240805 at cycle 100
Validation Loss=0.186157 at cycle 110
Validation Loss=0.146592 at cycle 120
Validation Loss=0.14351 at cycle 130
Validation Loss=0.145346 at cycle 140
Validation Loss=0.147809 at cycle 150
Validation Loss=0.14999 at cycle 160
Validation Loss=0.152153 at cycle 170
Validation Loss=0.154612 at cycle 180
Validation Loss=0.157342 at cycle 190
Validation Loss=0.159825 at cycle 200
Validation Loss=0.161757 at cycle 210
Validation Loss=0.163274 at cycle 220
Validation Loss=0.164617 at cycle 230
Validation Loss=0.166054 at cycle 240
Validation Loss=0.167951 at cycle 250
Validation Loss=0.170806 at cycle 260
Validation Loss=0.174859 at cycle 270
Validation Loss=0.179328 at cycle 280
Validation Loss=0.183079 at cycle 290
Validation Loss=0.186059 at cycle 300
Validation Loss=0.188555 at cycle 310
Validation Loss=0.190724 at cycle 320
Validation Loss=0.192646 at cycle 330
Validation Loss=0.19437 at cycle 340
Validation Loss=0.195929 at cycle 350
Validation Loss=0.197348 at cycle 360
Validation Loss=0.198646 at cycle 370
Validation Loss=0.199838 at cycle 380
Validation Loss=0.200934 at cycle 390
Validation Loss=0.201945 at cycle 400
Training done.

The minimum of the quadratic loss for the test data is somewhere around 150 training cycles. Let us use that and save the trained neural net into a file:

[narsky@cithep97 share]$ ../bin/SprStdBackpropApp -a 7 -y '0:.' -Q clev_norm.spr -N '13:6:3:1' -n 1 -l 150 -f clev_nn.spr -L 0.1 -i 10 -g 1 -d 10 -t clev_test.pat clev_train.pat

Train boosted decision trees now. Discrete and Real AdaBoost do not show stable performance for this dataset. Epsilon-Boost with eps=0.01 performs reasonably well. Let us find the optimal leaf size by cross-validation:

[narsky@cithep97 share]$ ../bin/SprAdaBoostDecisionTreeApp -a 1 -y '0:.' -Q clev_norm.spr -g 2 -n 200 -M 3 -x 3 -q '5,10,20,30,40,50,60,70,80,90,100,110,120,130' cleveland.data
......
Cross-validated FOMs:
Node size=       5      FOM=  0.973268 +- 0.00993318
Node size=      10      FOM=  0.972704 +-  0.0278197
Node size=      20      FOM=  0.952098 +-  0.0457402
Node size=      30      FOM=   0.94095 +-  0.0463437
Node size=      40      FOM=   0.94597 +-  0.0368492
Node size=      50      FOM=  0.953193 +-  0.0464553
Node size=      60      FOM=  0.951837 +-  0.0612336
Node size=      70      FOM=  0.944358 +-  0.0629308
Node size=      80      FOM=  0.959559 +-  0.0516297
Node size=      90      FOM=   0.99134 +-  0.0474166
Node size=     100      FOM=   1.07064 +-  0.0638777
Node size=     110      FOM=   1.14793 +-  0.0244583
Node size=     120      FOM=   1.14793 +-  0.0244583
Node size=     130      FOM=   1.14793 +-  0.0244583

Use 30 as the optimal node size. We need to multiply it by 3/2 (because we split the data into 3 parts for cross-validation and used 2 out of 3 for training) and divide by 2 (because the training set is 1/2 of the whole input dataset): 30 * 3/2 / 2 = 22.5. Therefore, the optimal leaf size for training is roughly 20:

[narsky@cithep97 share]$ ../bin/SprAdaBoostDecisionTreeApp -Q clev_norm.spr -a 7 -y '0:.' -g 2 -t clev_test.pat -n 200 -d 10 -l 20 -M 3 -f clev_bdt.spr clev_train.pat
........
Validation Loss=0.963527 at cycle 10
Validation Loss=0.930953 at cycle 20
Validation Loss=0.899423 at cycle 30
Validation Loss=0.872137 at cycle 40
Validation Loss=0.849918 at cycle 50
Validation Loss=0.82902 at cycle 60
Validation Loss=0.813038 at cycle 70
Validation Loss=0.80077 at cycle 80
Validation Loss=0.788923 at cycle 90
Validation Loss=0.780752 at cycle 100
Validation Loss=0.77765 at cycle 110
Validation Loss=0.774673 at cycle 120
Validation Loss=0.766489 at cycle 130
Validation Loss=0.764717 at cycle 140
Validation Loss=0.764703 at cycle 150
Validation Loss=0.76869 at cycle 160
Validation Loss=0.77181 at cycle 170
Validation Loss=0.771408 at cycle 180
Validation Loss=0.775252 at cycle 190
Validation Loss=0.77887 at cycle 200
AdaBoost finished training with 200 classifiers.

Try bagging decision trees now. Again, cross-validate to get the optimal leaf size:

../bin/SprBaggerDecisionTreeApp -Q clev_norm.spr -a 1 -y '0:.' -n 100 -g 1 -x 3 -q '1,2,3,4,5,10,15,20,25,30' -v 1 cleveland.data
............
Cross-validated FOMs:
Node size=       1      FOM=  0.149607 +-  0.0190036
Node size=       2      FOM=  0.144665 +- 0.00993419
Node size=       3      FOM=  0.146089 +-  0.0121881
Node size=       4      FOM=  0.141243 +-  0.0101155
Node size=       5      FOM=  0.141316 +- 0.00127799
Node size=      10      FOM=  0.137164 +- 0.00451656
Node size=      15      FOM=   0.14209 +-  0.0054432
Node size=      20      FOM=  0.147212 +- 0.00248721
Node size=      25      FOM=  0.145218 +- 0.00316222
Node size=      30      FOM=  0.150421 +- 0.00406612

Let us choose leaf size 10 and use 8 for training:

../bin/SprBaggerDecisionTreeApp -Q clev_norm.spr -a 7 -y '0:.' -n 100 -l 8 -g 1 -d 5 -t clev_test.pat -f clev_rf.spr clev_train.pat
............
Validation FOM=0.170303 at cycle 5
Validation FOM=0.164707 at cycle 10
Validation FOM=0.159462 at cycle 15
Validation FOM=0.155348 at cycle 20
Validation FOM=0.154199 at cycle 25
Validation FOM=0.154731 at cycle 30
Validation FOM=0.15558 at cycle 35
Validation FOM=0.154422 at cycle 40
Validation FOM=0.153872 at cycle 45
Validation FOM=0.154465 at cycle 50
Validation FOM=0.152767 at cycle 55
Validation FOM=0.152698 at cycle 60
Validation FOM=0.152226 at cycle 65
Validation FOM=0.152953 at cycle 70
Validation FOM=0.151939 at cycle 75
Validation FOM=0.151536 at cycle 80
Validation FOM=0.150761 at cycle 85
Validation FOM=0.150968 at cycle 90
Validation FOM=0.150515 at cycle 95
Validation FOM=0.150619 at cycle 100
Bagger finished training with 100 classifiers.

Compare the performance of the 3 classifiers now:

[narsky@cithep97 share]$ ../bin/SprOutputWriterApp -a 7 -A -y '0:.' -Q clev_norm.spr -C 'nn,bdt,rf' 'clev_nn.spr,clev_bdt.spr,clev_rf.spr' clev_test.pat save.out
[narsky@cithep97 share]$ ../bin/SprOutputAnalyzerApp -y '0:.' -C 'nn,bdt,rf' -n save.out
Optimization criterion set to Gini index  -1+p^2+q^2
Output of classifier "nn" found in position 17
Output of classifier "bdt" found in position 18
Output of classifier "rf" found in position 19
Read 152 points from input file save.out   Bgrnd Nevents=82   Bgrnd Weight=82   Signal Nevents=70   Signal Weight=70
Input signal efficiency values for which background will be estimated [ 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 ]
===========================================
Table of surviving background fractions (* shows minimal value in a row; Cut on classifier output and FOM are shown in parentheses)
===========================================
Signal Eff   \  Classifiers  |                           bdt |                            nn |                            rf |
------------------------------------------------------------------------------------------------------------------------------
              0.1000         | *     0.00000 +-      0.00000 |       0.00000 +-      0.00000 |       0.00000 +-      0.00000 |
              0.2000         | *     0.00000 +-      0.00000 |       0.00000 +-      0.00000 |       0.00000 +-      0.00000 |
              0.3000         |       0.02439 +-      0.01725 |       0.01220 +-      0.01220 | *     0.00000 +-      0.00000 |
              0.4000         | *     0.03659 +-      0.02112 |       0.03659 +-      0.02112 |       0.03659 +-      0.02112 |
              0.5000         |       0.06098 +-      0.02727 | *     0.04878 +-      0.02439 |       0.04878 +-      0.02439 |
              0.6000         |       0.08537 +-      0.03227 |       0.07317 +-      0.02987 | *     0.04878 +-      0.02439 |
              0.7000         |       0.13415 +-      0.04045 | *     0.09756 +-      0.03449 |       0.09756 +-      0.03449 |
              0.8000         |       0.30488 +-      0.06098 | *     0.23171 +-      0.05316 |       0.28049 +-      0.05849 |
              0.9000         |       0.35366 +-      0.06567 | *     0.34146 +-      0.06453 |       0.50000 +-      0.07809 |
===========================================
Histogram classifier output? y/n [y] n
Continue? y/n [y] n

Looks like bagged decision trees have an edge in the region of high signal purity but the neural net performs better in the region of high signal efficiency. Might be interesting to bag the neural network and see if that can bring a noticeable improvement. But I am not doing this now.

Let us estimate input variable importance for the bagged decision trees using two methods: one specific for the decision tree and another one classifier-independent.

To count decision splits for the bagged trees, we need to re-train using "-i" option:

[narsky@cithep97 share]$ ../bin/SprBaggerDecisionTreeApp -Q clev_norm.spr -a 7 -y '0:.' -n 100 -l 8 -g 1 -d 5 -t clev_test.pat -i clev_train.pat | sort -grk 7,7
Variable                        thalach    Splits         169    Delta FOM    28.61687
Variable                             ca    Splits          85    Delta FOM    24.33387
Variable                           thal    Splits          75    Delta FOM    20.36938
Variable                            age    Splits         146    Delta FOM    18.85921
Variable                             cp    Splits          54    Delta FOM    15.83812
Variable                        oldpeak    Splits          67    Delta FOM    13.76227
Variable                           chol    Splits          96    Delta FOM    11.71196
Variable                          exang    Splits          34    Delta FOM     6.68017
Variable                       trestbps    Splits          48    Delta FOM     5.76838
Variable                            sex    Splits          13    Delta FOM     1.58061
Variable                          slope    Splits           5    Delta FOM     0.97370
Variable                        restecg    Splits           9    Delta FOM     0.89813
Variable                            fbs    Splits           2    Delta FOM     0.19316

Now estimate the decrease in quadratic loss due to each variable:

[narsky@cithep97 share]$ ../bin/SprVariableImportanceApp -y '0:.' -a 7 -Q clev_norm.spr -n 10 -g 2 clev_rf.spr clev_test.pat | sort -grk 2,2
                               thal         0.0302867259 +-    0.0093382963         0.0000000000 +-    0.0000000000
                                 ca         0.0300654846 +-    0.0059497151         0.0000000000 +-    0.0000000000
                                 cp         0.0138796084 +-    0.0029744952         0.0000000000 +-    0.0000000000
                            oldpeak         0.0065700839 +-    0.0018971035         0.0000000000 +-    0.0000000000
                            thalach         0.0064980518 +-    0.0065570920         0.0000000000 +-    0.0000000000
                                age         0.0043183276 +-    0.0031195336         0.0000000000 +-    0.0000000000
                              exang         0.0034915037 +-    0.0021649321         0.0000000000 +-    0.0000000000
                               chol         0.0018462538 +-    0.0015761285         0.0000000000 +-    0.0000000000
                                sex         0.0010740878 +-    0.0003132361         0.0000000000 +-    0.0000000000
                              slope         0.0001578955 +-    0.0002165354         0.0000000000 +-    0.0000000000
                                fbs        -0.0000150207 +-    0.0000932678         0.0000000000 +-    0.0000000000
                            restecg        -0.0000162495 +-    0.0002991600         0.0000000000 +-    0.0000000000
                           trestbps        -0.0005304460 +-    0.0006093024         0.0000000000 +-    0.0000000000

My guess is that slope, fbs, restecg and trestbps can be safely removed without losing any predictive power. To verify that, let us run an "add 2 remove 1" test. I prepared file my.config with two lines in it:

Bagger 100 0
TopdownTree 5 0 10 0

Because I am going to cross-validate on the full input data, splitting it into 3 pieces, I set the optimal leaf size to 10.

[narsky@cithep97 share]$ ../bin/SprAddNRemoveRApp -a 1 -Q clev_norm.spr -y '0:.' -g 2 -n 2 -r 1 -x 3 -v 1 my.config cleveland.data
N_of_variables      Loss
    1                  0.196326 +-  0.0028473
    2                   0.16178 +-  0.0140857
    3                    0.1455 +-  0.0151866
    4                  0.138903 +-  0.0292912
    5                  0.131255 +-  0.0191462
    6                  0.129631 +-  0.0150008
    7                  0.128226 +-  0.0126571
    8                  0.127219 +- 0.00732298
    9                  0.129644 +-  0.0122463
   10                  0.125565 +-  0.0127497
   11                  0.127843 +- 0.00206423
   12                  0.129265 +- 0.00707255
   13                   0.14691 +-  0.0129123
Variables:
    1        thal
    2        ca   thal
    3        ca   cp   thal
    4        ca   cp   sex   thal
    5        ca   cp   oldpeak   sex   thal
    6        ca   cp   exang   oldpeak   sex   thal
    7        ca   cp   exang   oldpeak   restecg   sex   thal
    8        ca   cp   exang   oldpeak   restecg   sex   thal   trestbps
    9        ca   chol   cp   exang   fbs   oldpeak   restecg   sex   thal
   10        age   ca   cp   exang   fbs   oldpeak   restecg   sex   slope   thal
   11        age   ca   chol   cp   exang   fbs   restecg   sex   slope   thal   trestbps
   12        age   ca   chol   cp   exang   fbs   oldpeak   restecg   sex   slope   thal   trestbps
   13        age   ca   chol   cp   exang   fbs   oldpeak   restecg   sex   slope   thal   thalach   trestbps

There is an obvious improvement in performance when we include the 2nd and 3rd variables; beyond that the improvement is questionable.

Sort variables by interaction now:

[narsky@cithep97 share]$ ../bin/SprVariableImportanceApp -s -a 1 -Q clev_norm.spr -y '0:.' -v 1 clev_rf.spr cleveland.data
N_of_variables      Interaction
    1                  0.514851 +-          0
    2                  0.454187 +-          0
    3                  0.403137 +-          0
    4                  0.345686 +-          0
    5                  0.359334 +-          0
    6                  0.351943 +-          0
    7                  0.283103 +-          0
    8                  0.167429 +-          0
    9                  0.108733 +-          0
   10                 0.0499255 +-          0
   11                 0.0201312 +-          0
   12                -0.0659972 +-          0
Variables:
    1        thalach
    2        thalach   cp
    3        thalach   cp   oldpeak
    4        thalach   cp   oldpeak   ca
    5        thalach   cp   oldpeak   ca   slope
    6        thalach   cp   oldpeak   ca   slope   exang
    7        thalach   cp   oldpeak   ca   slope   exang   sex
    8        thalach   cp   oldpeak   ca   slope   exang   sex   trestbps
    9        thalach   cp   oldpeak   ca   slope   exang   sex   trestbps   age
   10        thalach   cp   oldpeak   ca   slope   exang   sex   trestbps   age   restecg
   11        thalach   cp   oldpeak   ca   slope   exang   sex   trestbps   age   restecg   chol
   12        thalach   cp   oldpeak   ca   slope   exang   sex   trestbps   age   restecg   chol   thal

To summarize, the Add 2 Remove 1 algorithm predicts that only 3-4 variables are needed to get the full predictive power. Here are the 4 most powerful variables predicted by each method:

Add 2 Remove 1         ca        cp   sex       thal
Delta Gini             thalach   ca   thal      age
Class permutation      thal      ca   cp        oldpeak
Interaction            thalach   cp   oldpeak   ca

Estimate the performance of random forest on these subsets using 3-fold cross-validation:

../bin/SprBaggerDecisionTreeApp -n 100 -q 10 -a 1 -Q clev_norm.spr -y '0:.' -x 3 -v 1 -g 1 cleveland.data
Node size=      10      FOM=  0.142855 +-  0.0278957
../bin/SprBaggerDecisionTreeApp -n 100 -q 10 -a 1 -Q clev_norm.spr -y '0:.' -x 3 -v 1 -g 1 -V 'ca,cp,sex,thal' cleveland.data
Node size=      10      FOM=  0.140762 +-  0.0134891
../bin/SprBaggerDecisionTreeApp -n 100 -q 10 -a 1 -Q clev_norm.spr -y '0:.' -x 3 -v 1 -g 1 -V 'thalach,ca,thal,age' cleveland.data
Node size=      10      FOM=  0.157264 +-   0.020884
../bin/SprBaggerDecisionTreeApp -n 100 -q 10 -a 1 -Q clev_norm.spr -y '0:.' -x 3 -v 1 -g 1 -V 'thal,ca,cp,oldpeak' cleveland.data
Node size=      10      FOM=  0.134336 +-  0.0140065
../bin/SprBaggerDecisionTreeApp -n 100 -q 10 -a 1 -Q clev_norm.spr -y '0:.' -x 3 -v 1 -g 1 -V 'thalach,cp,oldpeak,ca' cleveland.data
Node size=      10      FOM=  0.159046 +-  0.0237228

The conclusion is that all of these variable subsets give pretty much the same predictive power within the statistical uncertainty: the largest difference between the FOMs above is about 0.025, which is comparable to the quoted errors.

*

The performance of the binary classifiers above was estimated by looking at the full ROC curve on a test dataset. Because the test dataset is small, statistical errors are substantial. As an alternative, we can estimate a gross FOM (the fraction of correctly classified events) by cross-validation:

../bin/SprBaggerDecisionTreeApp -Q clev_norm.spr -a 1 -y '0:.' -n 100 -x 3 -l 10 -C 1 -v 1 cleveland.data
Optimization criterion set to Gini index  -1+p^2+q^2
Monitoring criterion set to Fraction of correctly classified events
No per-event loss is chosen. Will use the default.
Decision tree initialized with minimal number of events per node 10
Using a Topdown tree.
Will cross-validate by dividing training data into 3 subsamples.
Will cross-validate for trees with minimal node sizes: 10
Decision tree initialized with minimal number of events per node 10
Using a Topdown tree.
Classes for Bagger are set to 0(1) 0(-1)
Bagger initialized with classes 0(1) 0(-1) with cycles 100
Will cross-validate using 3 subsamples:
Subsample 0  W1=45  W0=54  N1=45  N0=54
Subsample 1  W1=44  W0=55  N1=44  N0=55
Subsample 2  W1=50  W0=55  N1=50  N0=55
Cross-validator processing classifier 0
Cross-validator processing sub-sample 0 for classifier 0
Cross-validator processing sub-sample 1 for classifier 0
Cross-validator processing sub-sample 2 for classifier 0
Cross-validated FOMs:
Node size=      10      FOM=  0.799038 +-  0.0186373

Let us try 10-fold CV and double the number of decision trees:

Bagger initialized with classes 0(1) 0(-1) with cycles 200
Will cross-validate using 10 subsamples:
Subsample 0  W1=11  W0=16  N1=11  N0=16
Subsample 1  W1=12  W0=14  N1=12  N0=14
Subsample 2  W1=12  W0=16  N1=12  N0=16
Subsample 3  W1=12  W0=16  N1=12  N0=16
Subsample 4  W1=14  W0=15  N1=14  N0=15
Subsample 5  W1=13  W0=17  N1=13  N0=17
Subsample 6  W1=14  W0=17  N1=14  N0=17
Subsample 7  W1=15  W0=16  N1=15  N0=16
Subsample 8  W1=14  W0=18  N1=14  N0=18
Subsample 9  W1=22  W0=19  N1=22  N0=19
Cross-validated FOMs:
Node size=      10      FOM=  0.835614 +-   0.068813

As expected, this produces a performance estimator which is less biased but has a larger variance. 80-84% accuracy is the ballpark we are looking at.

*

So far, we have been dealing with the binary problem "no disease vs some kind of disease". For fun, let us approach this as a multiclass problem with 5 classes and see what kind of classification power we can get.

Let us start with boosted binary splits and try the two popular multiclass algorithms, one vs one and one vs all. Prepare a file multi.config with 2 lines in it:

AdaBoost 1300 1 0 0.01
BinarySplit
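
For orientation: in the Allwein-Schapire-Singer framework used by SprMultiClassApp, each strategy is just an indicator matrix with one row per class and one column per binary classifier. One-vs-all uses 5 columns with +1 on the diagonal and -1 elsewhere, as printed in the output below, while one-vs-one uses 10 columns, each with a single +1, a single -1 and zeros for the unused classes. A new event is assigned to the row that agrees best with the binary outputs. Here is a hedged C++ sketch of that decoding step - an illustration only, not the SPR implementation; the loss function and bookkeeping inside SPR may differ.

#include <cmath>
#include <vector>

// Hypothetical sketch of Allwein-Schapire-Singer decoding (not the SPR API).
// M is the indicator matrix (rows = classes, columns = binary classifiers,
// entries +1/-1/0); f holds the raw outputs of the binary classifiers for
// one event. The predicted class is the row with the smallest total loss
// of the "margins" M[k][l]*f[l]; an exponential loss is used as an example.
int decodeASS(const std::vector<std::vector<int> >& M,
              const std::vector<double>& f)
{
  int best = -1;
  double bestLoss = 0.;
  for (unsigned k = 0; k < M.size(); ++k) {
    double loss = 0.;
    for (unsigned l = 0; l < f.size(); ++l) {
      if (M[k][l] == 0) continue;                 // class not used by this classifier
      loss += std::exp(-double(M[k][l]) * f[l]);  // loss of the margin
    }
    if (best < 0 || loss < bestLoss) { best = int(k); bestLoss = loss; }
  }
  return best;  // row index of the predicted class
}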

Try 10-fold cross-validation using one-vs-all strategy:

[narsky@cithep97 share]$ ../bin/SprMultiClassApp -a 1 -y '0,1,2,3,4' -e 1 -x 10 -v 1 -c multi.config -G 1 cleveland.data
Read data from file cleveland.data for variables "age" "sex" "cp" "trestbps" "chol" "fbs" "restecg" "thalach" "exang" "oldpeak" "slope" "ca" "thal"
Total number of points read: 303
Training data filtered by class.
Points in class 0(1):   164
Points in class 1(1):   55
Points in class 2(1):   36
Points in class 3(1):   35
Points in class 4(1):   13
Per-event monitoring loss set to Misid rate int(y==f(x))
Reading configuration for classifier AdaBoost
Classes for AdaBoost are set to 0(1) 1(1)
AdaBoost initialized with classes 0(1) 1(1) with cycles 1300
Adding AdaBoost with: nCycle=1300 AdaBoostMode=1 BagInput=0 Epsilon=0.01
Finished reading 1 classifiers from file multi.config
Multi class learning mode set to OneVsAll.
Indicator matrix:
 Classes/Classifiers : 5 5
=========================================================
                   0 :  1 -1 -1 -1 -1
                   1 : -1  1 -1 -1 -1
                   2 : -1 -1  1 -1 -1
                   3 : -1 -1 -1  1 -1
                   4 : -1 -1 -1 -1  1
=========================================================
Multiclass algorithm set to Allwein-Schapire-Singer.
Will cross-validate using 10 subsamples:
=====================================
Overall validation loss = 0.471947
=====================================
Classification table: Fractions of total class weight
True Class \ Classification |     0      |     1      |     2      |     3      |     4      |   Total weight in class |
------------------------------------------------------------------------------------------------------------------------
    0                       |     0.8415 |     0.1280 |     0.0183 |     0.0122 |     0.0000 |                164.0000 |
    1                       |     0.4909 |     0.1455 |     0.1273 |     0.1818 |     0.0545 |                 55.0000 |
    2                       |     0.3611 |     0.3056 |     0.1389 |     0.1667 |     0.0278 |                 36.0000 |
    3                       |     0.1143 |     0.3143 |     0.2571 |     0.2286 |     0.0857 |                 35.0000 |
    4                       |     0.2308 |     0.2308 |     0.1538 |     0.3077 |     0.0769 |                 13.0000 |
------------------------------------------------------------------------------------------------------------------------

Most events end up classified as class 0 (not surprising, since class 0 accounts for 164 of the 303 events), and the remaining classes are poorly identified. Try the one-vs-one algorithm:

../bin/SprMultiClassApp -a 1 -y '0,1,2,3,4' -e 2 -x 10 -v 1 -c multi.config -G 1 cleveland.data
=====================================
Overall validation loss = 0.448845
=====================================
Classification table: Fractions of total class weight
True Class \ Classification |     0      |     1      |     2      |     3      |     4      |   Total weight in class |
------------------------------------------------------------------------------------------------------------------------
    0                       |     0.8293 |     0.1280 |     0.0244 |     0.0122 |     0.0061 |                164.0000 |
    1                       |     0.4909 |     0.2000 |     0.0909 |     0.1818 |     0.0364 |                 55.0000 |
    2                       |     0.2500 |     0.2778 |     0.1944 |     0.1944 |     0.0833 |                 36.0000 |
    3                       |     0.0571 |     0.2857 |     0.1714 |     0.3429 |     0.1429 |                 35.0000 |
    4                       |     0.1538 |     0.2308 |     0.2308 |     0.3077 |     0.0769 |                 13.0000 |
------------------------------------------------------------------------------------------------------------------------

Somewhat better, but still not impressive. Let us use a random forest instead of the boosted splits. Put these 2 lines in multi.config:

Bagger 100 0
TopdownTree 5 0 10 0

and run

../bin/SprMultiClassApp -a 1 -y '0,1,2,3,4' -e 2 -x 10 -v 1 -c multi.config -G 1 cleveland.data
=====================================
Overall validation loss = 0.409241
=====================================
Classification table: Fractions of total class weight
True Class \ Classification |     0      |     1      |     2      |     3      |     4      |   Total weight in class |
------------------------------------------------------------------------------------------------------------------------
    0                       |     0.9268 |     0.0488 |     0.0244 |     0.0000 |     0.0000 |                164.0000 |
    1                       |     0.6000 |     0.1455 |     0.0909 |     0.1636 |     0.0000 |                 55.0000 |
    2                       |     0.2778 |     0.2778 |     0.2222 |     0.2222 |     0.0000 |                 36.0000 |
    3                       |     0.1714 |     0.3429 |     0.1714 |     0.3143 |     0.0000 |                 35.0000 |
    4                       |     0.0769 |     0.3077 |     0.3077 |     0.3077 |     0.0000 |                 13.0000 |
------------------------------------------------------------------------------------------------------------------------

See if we can eliminate some variables. Reduce the number of decision trees in multi.config from 100 to 50 to make the code run faster. Then run

../bin/SprMultiClassApp -a 1 -y '0,1,2,3,4' -e 2 -x 10 -v 1 -c multi.config -G 1 -N 2 -R 1 cleveland.data
=========================================================
N_of_variables      Loss
    1                  0.429043 +-          0
    2                  0.389439 +-          0
    3                  0.386139 +-          0
    4                   0.39604 +-          0
    5                  0.379538 +-          0
    6                  0.372937 +-          0
    7                  0.386139 +-          0
    8                  0.389439 +-          0
    9                   0.39604 +-          0
   10                  0.376238 +-          0
   11                  0.392739 +-          0
   12                   0.40264 +-          0
   13                   0.39604 +-          0
Variables:
    1        ca
    2        ca   cp
    3        ca   cp   slope
    4        ca   cp   sex   slope
    5        ca   cp   sex   slope   thal
    6        ca   cp   restecg   sex   slope   thal
    7        ca   cp   exang   restecg   sex   thal   thalach
    8        ca   cp   exang   oldpeak   restecg   sex   thal   thalach
    9        age   ca   cp   exang   oldpeak   sex   slope   thal   thalach
   10        age   ca   chol   cp   exang   oldpeak   restecg   sex   slope   thal
   11        ca   chol   cp   exang   fbs   oldpeak   restecg   sex   slope   thal   thalach
   12        ca   chol   cp   exang   fbs   oldpeak   restecg   sex   slope   thal   thalach   trestbps
   13        age   ca   chol   cp   exang   fbs   oldpeak   restecg   sex   slope   thal   thalach   trestbps
=========================================================

It looks like 5 variables should be enough to achieve the full separation power. As a cross-check, let us run the multiclass learner again with the 5 best variables selected and 100 decision trees in each forest:

../bin/SprMultiClassApp -a 1 -y '0,1,2,3,4' -e 2 -x 10 -v 1 -c multi.config -G 1 -V 'ca,cp,sex,slope,thal' cleveland.data
=====================================
Overall validation loss = 0.389439
=====================================
Classification table: Fractions of total class weight
True Class \ Classification |     0      |     1      |     2      |     3      |     4      |   Total weight in class |
------------------------------------------------------------------------------------------------------------------------
    0                       |     0.9390 |     0.0366 |     0.0061 |     0.0183 |     0.0000 |                164.0000 |
    1                       |     0.5273 |     0.2727 |     0.0909 |     0.1091 |     0.0000 |                 55.0000 |
    2                       |     0.3611 |     0.3333 |     0.0556 |     0.2500 |     0.0000 |                 36.0000 |
    3                       |     0.1143 |     0.4286 |     0.0571 |     0.4000 |     0.0000 |                 35.0000 |
    4                       |     0.1538 |     0.3077 |     0.1538 |     0.3846 |     0.0000 |                 13.0000 |
------------------------------------------------------------------------------------------------------------------------

61% accuracy (that is, 1 minus the 0.389 validation loss) is probably fairly close to the best we can squeeze out of the multiclass learner with 5 classes. Note that only classes 0 and 3 are identified with some success, and class 4 is totally hopeless.

I most certainly do not want to give anyone the impression that 61% is the best achievable multiclass accuracy. Various things can be tried to improve the separation further: customized indicator matrices, grouping classes together, removing class 4, etc.

Note that cross-validation of the multiclass learner does not produce error estimates. This is a limitation of the package, perhaps to be fixed in the future.

17. Memory management

In SPR-8 I added a new constant to SprDefs: SprEnforceLowMemory. By default it is set to true. This tells the package to minimize memory consumption at the expense of not cloning trained classifiers after optimization is completed. If you plan to work with the package only through the provided executables or through SprRootAdapter, use this default setting. If you plan to write your own C++ code that involves training classifiers, I advise building the package with this option set to false. If you keep the option at its default value, you will need to keep track of the ownership of trained classifiers and destroy them in the correct sequence; that is, you will need to understand the SPR implementation in detail. This option has no effect on the memory consumed by input data or by classifiers in the process of training - it only affects the amount of memory consumed after training is completed.
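
If you do go the non-default route, the change amounts to flipping the constant in SprDefs and rebuilding the package. The exact declaration below is only a guess for illustration - check SprDefs.hh in your copy of the source for its actual form.

// In SprDefs.hh (hypothetical sketch of the edit described above):
const bool SprEnforceLowMemory = false;  // keep clones of trained classifiers after optimization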

18. Summary

Feedback on the package use is very much appreciated. Send email to the address shown on top of this file.

19. Release notes

04.02.2008 Tag V08-00-00 changes the format used for storing trained multiclass learner configurations. The user will not be able to use multiclass learner configuration files produced with earlier versions. Please re-train.

02.04.2007 Tag V05-00-00 introduces a new interface for reading saved classifier configurations from files. This required multiple changes to the format. The change is backwards-incompatible: the user won't be able to read configuration files produced prior to this tag. All classifiers need to be re-trained and saved into files again. See README for a detailed description of the new syntax and of the new package features made possible by the unified classifier configuration format.

10.14.2006 Due to the addition of Real and Epsilon AdaBoosts, the output format for AdaBoost configuration files has changed. If the user wants to read configuration files created by releases prior to V03-05-00 with a post-V03-05-00 package release, s/he needs to add the following line to the configuration file:

Mode: 1

This must be included as the 2nd line in the file, right after the "Trained AdaBoost:" line. To keep the code clean and maintainable in the future, I decided not to implement backwards compatibility in the code.
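
For example, the top of a converted configuration file would look roughly like this (the "..." stands for whatever your file already contains; only the "Mode: 1" line is new):

Trained AdaBoost: ...
Mode: 1
...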

Also, the default output in the range [0,1] from AdaBoost has changed. It is now computed by taking the raw AdaBoost output X and transforming it to [0,1] using 1/(1+exp(-2X)); this directly returns the probability that an event at X is signal. The algorithm used before V03-05-00 produced smoother distributions between 0 and 1; the new distributions peak more sharply near 0 and 1.
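
As a quick illustration of this mapping (the formula written out as code, not an excerpt of the SPR source):

#include <cmath>

// Post-V03-05-00 conversion of the raw AdaBoost output X into the [0,1]
// signal probability.
double adaBoostProb(double X) { return 1.0/(1.0 + std::exp(-2.0*X)); }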

09.07.2006 From now on SPR will be released as a standalone package.

06.29.2006 If you are working with a data sample that is not too big and does not have too many dimensions, try SprInteractiveAnalysisApp first. It lets you quickly compare various classifiers.

01.18.2006 Tag V03-00-00 introduces a more flexible treatment of input classes which allows arbitrary grouping. See Section 6 of this file and SprAbsFilter.hh for details.

11.12.2005 Starting with tag V01-18-00, the default options for the two main executables, SprBaggerDecisionTreeApp and SprAdaBoostDecisionTreeApp, have changed. By default the two executables now use topdown trees (SprTopdownTree) instead of regular decision trees (SprDecisionTree) because they are faster. The bagger executable also produces continuous tree output by default instead of discrete output. To obtain the old behavior, use the "-j" option for SprAdaBoostDecisionTreeApp and both "-j" and "-k" for SprBaggerDecisionTreeApp. Users won't be able to read old classifier configuration files unless the "-j" option is used.
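
For example, to reproduce the old behavior you would add the flags to your usual commands, keeping all other arguments as they are:

../bin/SprAdaBoostDecisionTreeApp -j <your usual options and input files>
../bin/SprBaggerDecisionTreeApp -j -k <your usual options and input files>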

09.11.2005 Tag V01-00-00 introduced a new format for configuration files used to store boosted and bagged decision trees. The new SprAdaBoostDecisionTreeApp and SprBaggerDecisionTreeApp will not be able to read from old configuration files.