A TMVA example in pyROOT

Posted on August 27, 2011.

This is an example showing how to use TMVA using python/pyROOT. TMVA is a toolkit for multivariate analysis in ROOT which is widely used in High Energy Physics data analysis.

The example code in this post is made available under the Apache License Version 2.0. If you want to experiment with the code examples shown, it's best to do so in a newly created directory. The code has been tested with ROOT 5.30/00, which includes TMVA 4.1.2. Copying and pasting the example code below directly into the interactive Python shell is strongly discouraged: crashes of the Python interpreter were observed when doing so, and the indentation is not preserved correctly, leading to syntax errors. Copying the code into a Python file and then running that file with python -i should work, however.

Generating the samples

For simplicity and to allow for easy visualization, we will work in two dimensions. First, we’ll generate two samples (‘signal’ and ‘background’) drawn from Gaussian distributions with different means and fill them into a TNtuple:

import ROOT

# create a TNtuple
ntuple = ROOT.TNtuple("ntuple","ntuple","x:y:signal")

# generate 'signal' and 'background' distributions
for i in range(10000):
    # throw a signal event centered at (1,1)
    ntuple.Fill(ROOT.gRandom.Gaus(1,1), # x
                ROOT.gRandom.Gaus(1,1), # y
                1)                      # signal
    
    # throw a background event centered at (-1,-1)
    ntuple.Fill(ROOT.gRandom.Gaus(-1,1), # x
                ROOT.gRandom.Gaus(-1,1), # y
                0)                       # background

In order to visualize the generated distributions, we can do the following:

# keep objects in a list so they are not removed by garbage collection
gcSaver = []

# create a new TCanvas
gcSaver.append(ROOT.TCanvas())

# draw an empty 2D histogram for the axes
histo = ROOT.TH2F("histo","",1,-5,5,1,-5,5)
histo.Draw()

# draw the signal events in red
ntuple.SetMarkerColor(ROOT.kRed)
ntuple.Draw("y:x","signal > 0.5","same")

# draw the background events in blue
ntuple.SetMarkerColor(ROOT.kBlue)
ntuple.Draw("y:x","signal <= 0.5","same")

In my case, the resulting plot looked like this:

where the red points correspond to ‘signal’ and the blue points correspond to ‘background’. We can start training a classifier which attempts to label individual points as ‘signal’ or ‘background’ based on the value of their coordinates x and y.

The following code is inspired by the standard TMVA classification example ($ROOTSYS/tmva/test/TMVAClassification.C). First, we’ll initialize TMVA and create a factory object:

ROOT.TMVA.Tools.Instance()

# note that it seems to be mandatory to have an
# output file, just passing None to TMVA::Factory(..)
# does not work. Make sure you don't overwrite an
# existing file.
fout = ROOT.TFile("test.root","RECREATE")

factory = ROOT.TMVA.Factory("TMVAClassification", fout,
                            ":".join([
                                "!V",
                                "!Silent",
                                "Color",
                                "DrawProgressBar",
                                "Transformations=I;D;P;G,D",
                                "AnalysisType=Classification"]
                                     ))

The parameters given to the constructor of ROOT.TMVA.Factory are described in section 3.1 of the current TMVA user’s guide. Verbosity is disabled with the !V (‘not verbose’) option, while !Silent (‘not silent’) still allows for some level of reporting. Color enables the use of color and DrawProgressBar displays the progress of the training. Some of these options correspond to the current default settings and need not necessarily be specified. The values given to the Transformations parameter (here I = identity, D = decorrelation, P = principal component analysis and G,D = Gaussianisation followed by decorrelation) only affect testing and visualization, not the training (according to the manual).

Now we declare which variables should be used for classification (the ‘independent’ variables) and add the trees for signal and background events. In this example, both are stored in the same tree, and a cut on the variable signal is used to distinguish signal from background events:

factory.AddVariable("x","F")
factory.AddVariable("y","F") 

factory.AddSignalTree(ntuple)
factory.AddBackgroundTree(ntuple)

# cuts defining the signal and background sample
sigCut = ROOT.TCut("signal > 0.5")
bgCut = ROOT.TCut("signal <= 0.5")

factory.PrepareTrainingAndTestTree(sigCut,   # signal events
                                   bgCut,    # background events
                                   ":".join([
                                        "nTrain_Signal=0",
                                        "nTrain_Background=0",
                                        "SplitMode=Random",
                                        "NormMode=NumEvents",
                                        "!V"
                                       ]))

The parameters given to PrepareTrainingAndTestTree are described in section 3.1.4 of the current TMVA user’s guide. Essentially, the available events are separated randomly into two equally large sets for training and testing and verbosity is disabled.
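The effect of SplitMode=Random together with nTrain_Signal=0 and nTrain_Background=0 (i.e. an even split) can be sketched in plain Python. This is a toy illustration of the splitting logic, not TMVA code:

```python
import random

def split_train_test(events, seed=0):
    """Toy sketch of SplitMode=Random with nTrain=0: shuffle the
    events and divide them into two equally large sets."""
    rng = random.Random(seed)
    shuffled = list(events)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

train, test = split_train_test(range(10))
```

Every event ends up in exactly one of the two sets, so the test sample gives an unbiased estimate of the classifier performance.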

We then configure a classifier to learn to distinguish the two samples generated above. As an example, we use a boosted decision tree:

method = factory.BookMethod(ROOT.TMVA.Types.kBDT, "BDT",
                   ":".join([
                       "!H",
                       "!V",
                       "NTrees=850",
                       "nEventsMin=150",
                       "MaxDepth=3",
                       "BoostType=AdaBoost",
                       "AdaBoostBeta=0.5",
                       "SeparationType=GiniIndex",
                       "nCuts=20",
                       "PruneMethod=NoPruning",
                       ]))

factory.TrainAllMethods()
factory.TestAllMethods()
factory.EvaluateAllMethods()

The options (described in section 8.12.2 of the current TMVA user’s guide) can be summarized as follows:

  • the help text should not be printed
  • verbosity is disabled
  • 850 trees should be used
  • leaf nodes must contain at least 150 events
  • the depth of the trees is limited to 3
  • adaptive boosting should be used
  • the Gini index should be used to select the best variable to be used in each tree node
  • 20 steps should be used when scanning cuts on a variable and
  • no pruning of trees after they have been constructed should be applied.
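The adaptive boosting step controlled by AdaBoostBeta can also be sketched in plain Python: after each tree is trained, events it misclassified are scaled up by ((1 - err) / err) ** beta, where err is the weighted misclassification rate, before the next tree is trained. A toy sketch of this reweighting (not TMVA code):

```python
def adaboost_event_weights(weights, misclassified, beta=0.5):
    """One AdaBoost reweighting step: scale up the weights of
    misclassified events, then renormalize so the total weight
    stays constant."""
    err = sum(w for w, m in zip(weights, misclassified) if m) / sum(weights)
    boost = ((1.0 - err) / err) ** beta
    new = [w * boost if m else w for w, m in zip(weights, misclassified)]
    norm = sum(weights) / sum(new)
    return [w * norm for w in new]

# one of four (equally weighted) events was misclassified
new_weights = adaboost_event_weights([1.0, 1.0, 1.0, 1.0],
                                     [True, False, False, False])
```

Subsequent trees therefore concentrate on the events the previous trees got wrong, which is what makes the ensemble more powerful than a single tree.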

Examining the trained classifier

After running this, TMVA should have created a file weights/TMVAClassification_BDT.weights.xml which contains the structure of the trained classifier. In order to evaluate the classification function at arbitrary coordinates, we create an instance of ROOT.TMVA.Reader:

reader = ROOT.TMVA.Reader()

To calculate the value of the classifier for a given input coordinate (x,y), we first create two arrays (with one element each) for the variables x and y, such that later on we can take (C++) references to the first element which we pass to the reader (see also this discussion on the ROOT bulletin board):

import array
varx = array.array('f',[0]) ; reader.AddVariable("x",varx)
vary = array.array('f',[0]) ; reader.AddVariable("y",vary)

Now that we have given the reader the variables x and y, we can read the weights file:

reader.BookMVA("BDT","weights/TMVAClassification_BDT.weights.xml")
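EvaluateMVA("BDT") returns a single number, conventionally ranging from about -1 (background-like) to +1 (signal-like). Conceptually, a boosted decision tree's response is a weighted average of the individual trees' votes; the following is a minimal pure-Python sketch of that idea (the tree votes and weights here are made up for illustration; the actual computation happens inside reader.EvaluateMVA):

```python
def bdt_response(tree_votes, tree_weights):
    """Weighted average of per-tree votes (+1 = signal, -1 = background).
    Mimics the shape of a boosted decision tree response."""
    total = sum(w * v for w, v in zip(tree_weights, tree_votes))
    return total / sum(tree_weights)

# three trees vote signal, one (heavily weighted) tree votes background
response = bdt_response([+1, +1, +1, -1], [1.0, 0.5, 0.5, 1.0])
```

The closer the response is to +1, the more confidently the ensemble classifies the point as signal.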

In order to plot the BDT output as a function of x and y, we can use the following code snippet:

# create a new 2D histogram with fine binning
histo2 = ROOT.TH2F("histo2","",200,-5,5,200,-5,5)

# loop over the bins of a 2D histogram
for i in range(1,histo2.GetNbinsX() + 1):
    for j in range(1,histo2.GetNbinsY() + 1):
        
        # find the bin center coordinates
        varx[0] = histo2.GetXaxis().GetBinCenter(i)
        vary[0] = histo2.GetYaxis().GetBinCenter(j)
        
        # calculate the value of the classifier
        # function at the given coordinate
        bdtOutput = reader.EvaluateMVA("BDT")
        
        # set the bin content equal to the classifier output
        histo2.SetBinContent(i,j,bdtOutput)

gcSaver.append(ROOT.TCanvas())
histo2.Draw("colz")

# draw sigma contours around means
for mean, color in (
    ((1,1), ROOT.kRed), # signal
    ((-1,-1), ROOT.kBlue), # background
    ):
    
    # draw contours at 1 and 2 sigmas
    for numSigmas in (1,2):
        circle = ROOT.TEllipse(mean[0], mean[1], numSigmas, numSigmas)
        circle.SetFillStyle(0)
        circle.SetLineColor(color)
        circle.SetLineWidth(2)
        circle.Draw()
        gcSaver.append(circle)

ROOT.gPad.Modified()

The output should look like this:

The overlaid ellipses are the one and two sigma contours around the signal and background means. Note that they appear as ellipses because the x and y axes are drawn with different scales. One can nicely see how the classifier approximates the Gaussian distributions with a set of trees representing rectangular regions.

Typically, one also wants to look at the distribution of the classifier output for signal and background, e.g. to get an idea how well the two samples can be separated (although the ROC curve is more suitable for comparing the performance of different classifiers). TMVA fills a tree for the events in the test sample which we can use for this purpose:

# fill histograms for signal and background from the test sample tree
ROOT.TestTree.Draw("BDT>>hSig(22,-1.1,1.1)","classID == 0","goff")  # signal
ROOT.TestTree.Draw("BDT>>hBg(22,-1.1,1.1)","classID == 1", "goff")  # background

ROOT.hSig.SetLineColor(ROOT.kRed); ROOT.hSig.SetLineWidth(2)  # signal histogram
ROOT.hBg.SetLineColor(ROOT.kBlue); ROOT.hBg.SetLineWidth(2)   # background histogram

# use a THStack to show both histograms
hs = ROOT.THStack("hs","")
hs.Add(ROOT.hSig)
hs.Add(ROOT.hBg)

# show the histograms
gcSaver.append(ROOT.TCanvas())
hs.Draw()

which for me produced the following output:
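The ROC curve mentioned above can be built from the same binned test-sample counts: for each possible cut on the classifier output, compute the signal efficiency and the background rejection. A plain-Python sketch of this computation (bin contents are listed from lowest to highest BDT output; this is not TMVA code):

```python
def roc_points(sig_counts, bg_counts):
    """For each possible cut position between bins, return
    (signal efficiency, background rejection) when accepting
    all events above the cut."""
    n_sig, n_bg = float(sum(sig_counts)), float(sum(bg_counts))
    points = []
    for cut in range(len(sig_counts) + 1):
        eff_sig = sum(sig_counts[cut:]) / n_sig
        rej_bg = 1.0 - sum(bg_counts[cut:]) / n_bg
        points.append((eff_sig, rej_bg))
    return points

# a perfectly separated toy example: background in the low bins,
# signal in the high bins
pts = roc_points([0, 0, 5, 5], [5, 5, 0, 0])
```

For a perfect classifier the curve passes through (1, 1); the closer the curve of a real classifier gets to that corner, the better the separation.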

Enjoy!
