ACAT05, May 22-27, 2005, DESY, Zeuthen, Germany
The Use of Clustering Techniques for the Classification of High Energy Physics Data
Mostafa MJAHED
Ecole Royale de l'Air, Mathematics and Systems Dept., Marrakech, Morocco

Production of jets in e+e-
Methodology
The use of clustering techniques for the classification of physics processes in e+e-
Conclusion
Mjahed, ACAT 05, DESY, Zeuthen, 25 May 2005

Production of jets in e+e-
Annihilation: e+e- → W+W-, ZZ, ZH (H: Higgs) (LEP2 and beyond)
Decay of produced bosons: γ*/Z0 → qq, W+ → q1 q2, W- → q3 q4, H0 → qq
Fragmentation of quarks and gluons and production of unstable particles
Decay of unstable particles to observed hadrons
[Figure: e+e- → W+W- event, with labels for the perturbative region, the confinement region, the fragmentation of the quarks qi qj and gluons, the decay of unstable particles, and the resulting jets of hadrons]

Production of jets in e+e-
LEP2: observation of processes with dominant jet topologies:
Production of W+W- pairs: e+e- → W+W- → qqll, qqqq
Emergence of new particles such as the Higgs boson: e+e- → ZH → qqbb, ννbb (τ+τ-qq, qqτ+τ-)
Production of new processes: e+e- → ZZ → qqll, qqqq, ...

Higgs boson production
Higgs-strahlung:

e+e- → ZH
WW fusion
Decay modes:
decay into quarks: H → bb and H → cc
leptonic decay: H → τ+τ-
gluonic decay: H → gg
decay into a virtual W boson pair: H → W+W-
[Figures: cross section; branching ratio]

Production of jets in e+e-
HZ ALEPH candidate: e+e- → HZ → qqbb

Jets analysis in e+e-

Analysis of W boson pairs and search for new particles such as the Higgs boson.
Measurement of the W mass
Measurement of the Triple Gauge Coupling (TGC): coupling between 3 bosons
Prediction of limits on the mass of the Higgs boson
These analyses depend on identifying the different processes with dominant jet topologies with very high efficiency
Need to use Pattern Recognition methods

Pattern Recognition
$$ f: X \to Y, \qquad x_i \in X \mapsto y_j \in Y $$
$$ X = (x_{ij}) = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}, \qquad Y = (y_j) = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_k \end{pmatrix} $$
Characterisation of events: search for and selection of p variables or attributes
Interpretation: definition of k classes
Learning: association (x_i → y_j), which determines f
Decision: (x_i → y_j) using f, for any x_i
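The learning and decision steps can be illustrated with a toy example. The sketch below is only an assumption-laden stand-in for f: it uses a nearest-mean rule, which is not a method named on the slides, and the function names `learn` and `decide` and all data values are hypothetical.

```python
import math

def learn(X, y):
    """Learning step: build f as one mean point per class (illustrative choice)."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for m, value in enumerate(x):
            acc[m] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def decide(f, x):
    """Decision step: return the class whose mean point is closest to x."""
    return min(f, key=lambda label: math.dist(f[label], x))

# Made-up data: n = 6 events, p = 2 variables, k = 2 classes.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [1.0, 0.9], [0.9, 1.1], [1.1, 1.0]]
y = ["C1", "C1", "C1", "C2", "C2", "C2"]

f = learn(X, y)
label = decide(f, [0.95, 1.0])
```

Here each row of X is one event x_i and y holds its class, so `learn` plays the role of the association step and `decide` the role of the decision step.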

Pattern Recognition Methods
Statistical Methods
  Principal Components Analysis (PCA)
  Decision Trees
  Discriminant Analysis
  Clustering (Hierarchical, K-means, ...)
Connectionist Methods
  Neural Networks
Genetic Algorithms
Other Methods
  Fuzzy Logic, Wavelets, ...

Hierarchical Clustering Technique
$$ X = (x_{ij}) = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix} $$
[Figure: binary tree of classes: C splits into C1 and C2, C1 into C3 and C4, C2 into C5 and C6, and so on (C7, C8, ...)]
$$ D(x_i, x_j) = \sqrt{\sum_{m=1}^{p} (x_{im} - x_{jm})^2} $$
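The distance D(x_i, x_j) and a single binary split on the two most distant events can be sketched as follows. This is a minimal illustration, assuming D is the Euclidean distance over the p variables; the function names and the toy events are hypothetical, and ties are resolved in favour of the first class.

```python
import math
from itertools import combinations

def distance(xi, xj):
    """Euclidean distance D(xi, xj) over the p variables (assumed form)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

def most_distant_pair(events):
    """Indices (i, j) of the two events with the largest D(xi, xj)."""
    return max(combinations(range(len(events)), 2),
               key=lambda ij: distance(events[ij[0]], events[ij[1]]))

def split(events):
    """One binary split: each event joins the closer of the two seed events."""
    i, j = most_distant_pair(events)
    c1, c2 = [], []
    for x in events:
        if distance(x, events[i]) <= distance(x, events[j]):
            c1.append(x)
        else:
            c2.append(x)
    return c1, c2

# Made-up events with p = 2 variables.
events = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (3.0, 4.0), (3.1, 3.9)]
c1, c2 = split(events)
```

Applying `split` again to `c1` and `c2` would produce the next level of the tree. Computing all pairs costs on the order of n² distance evaluations, which is acceptable for a sketch but not tuned for large samples.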

1. The distances between all pairs of events x_i and x_j are computed
2. Choice of the two most distant events: C → (C1, C2)
3. Assignment of every x_i to the closer class, C1 or C2
4. Repeat steps 2 and 3 for C1 → (C3, C4) and C2 → (C5, C6)
5. Repeat step 4 for Ci → (x_j, x_k)
...

K-Means Clustering Technique
Given K, the K-means algorithm is implemented in 4 steps:
1. Partition events into K non-empty subsets
2. Compute seed points as the centroids (mean points) of the clusters
3. Assign each event to the cluster with the nearest seed point
4. Go back to step 2; stop when no assignment changes
Parameters: choice of distances; supervised or unsupervised learning

Clustering by a Peano Scanning Technique
[Figure: example of an analytical Peano square-filling curve]
Decomposition of data into the p-dimensional unit hyper-cube I_p = [0, 1] × [0, 1] × ... × [0, 1]
Construction of a Space Filling Curve (SFC) F_p(t): I_1 → I_p
Compute the position t of the data X on the SFC
Find the set K of nearest neighbours of t in the transformed learning set T
Classify the test sample to the nearest class in set K

Efficiency and Purity of a Pattern Recognition Method
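Efficiency and purity for one class can be computed directly from the true and assigned labels of the test events. The sketch below assumes E_i = N_ii / N_i and P_i = N_ii / M_i, reading N_i as the number of test events that truly belong to class C_i, M_i as the number assigned to C_i, and N_ii as the number in both groups; that reading, the function name and the labels are assumptions made for illustration.

```python
def efficiency_and_purity(true_labels, assigned_labels, cls):
    """Return (E_i, P_i) for class `cls` under the assumed definitions."""
    pairs = list(zip(true_labels, assigned_labels))
    n_ii = sum(1 for t, a in pairs if t == cls and a == cls)  # correctly assigned
    n_i = sum(1 for t, _ in pairs if t == cls)                # truly in the class
    m_i = sum(1 for _, a in pairs if a == cls)                # assigned to the class
    efficiency = n_ii / n_i if n_i else 0.0
    purity = n_ii / m_i if m_i else 0.0
    return efficiency, purity

# Made-up test sample: 4 signal events and 6 background events.
true_labels = ["HZ"] * 4 + ["Back"] * 6
assigned = ["HZ", "HZ", "HZ", "Back", "HZ", "HZ", "Back", "Back", "Back", "Back"]
eff, pur = efficiency_and_purity(true_labels, assigned, "HZ")
```

In this toy sample 3 of the 4 true HZ events are kept, while 5 events in total are assigned to HZ, so efficiency and purity differ.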

Validation: test events
Efficiency of classification for events of class C_i: $E_i = N_{ii} / N_i$
Purity of classification for events of class C_i: $P_i = N_{ii} / M_i$

Application
4 jets: e+e- → HZ → bbqq, e+e- → W+W- → qqqq, e+e- → ZZ → qqqq, e+e- → γ/Z → qqqq
Characterisation of the Higgs boson in the 4-jet channel, e+e- → ZH → qqbb, by clustering techniques

Characterisation of the Higgs boson in the 4-jet channel e+e- → ZH → qqbb by the use of clustering techniques
[Figure: 4-jet event; HZ event]
Background:

γ/Z, W+W-, ZZ
Events generated by the LUND MC (JETSET 7.4 and PYTHIA 5.7) at √s = 300 GeV, in the 4-jet channel:
e+e- → HZ → qqbb (signal: Higgs boson events), M_H = 125 GeV/c^2
e+e- → W+W- → qqqq, e+e- → Z/γ → qqgg, qqqq, e+e- → ZZ → qqqq (background events)
Search for discriminating variables characterising the presence of b quarks

Variables
Thrust T: $T = \max_{\vec{n}} \dfrac{\sum_{i=1}^{N} |\vec{p}_i \cdot \vec{n}|}{\sum_{i=1}^{N} |\vec{p}_i|}$
Sphericity S: $S = \min_{\vec{n}} S(\vec{n})$, with $S(\vec{n}) = \dfrac{3}{2} \dfrac{\sum_{i=1}^{N} p_{iT/\vec{n}}^2}{\sum_{i=1}^{N} p_i^2}$
Boosted Aplanarity BAP: $BAP = \dfrac{3}{2} \min \dfrac{\sum_{i=1}^{N} p_{iT,\mathrm{out}}^2}{\sum_{i=1}^{N} p_i^2}$
Event broadening Bed: $B_{ed} = \min B_{hemi}$, with $B_{hemi} = \dfrac{\sum_{i=1}^{n_t} |p_{iT}|}{\sum_{i=1}^{n_t} |p_i|}$
Rapidity-momentum weighted moments M_nm: $M_{nm} = \max_{Jet} \sum_{i \in Jet} \eta_i^n \, p_{iT}^m$, where $\eta_i = \dfrac{1}{2} \log \dfrac{E_i + p_{i\parallel}}{E_i - p_{i\parallel}}$ is the rapidity
Mincos = Min (cos θ_ij + cos θ_kl): the minimal sum of cosines over all the permutations ijkl
Max (Mjet), Max (Ejet): the maximal value of the jet masses and jet energies in each event
Max3 (Mjet), Max3 (Ejet): the 3rd value of the jet masses and jet energies in each event
Mmin, Emin: the 4th value of the jet masses and jet energies in each event
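Some of these variables are simple to compute once the four jets are reconstructed. The sketch below illustrates Mincos and the ordered jet energies; it assumes jets are given as 3-momentum tuples, that the "3rd" and "4th" values refer to jets ordered by decreasing energy, and that the permutations ijkl reduce to the three distinct pairings of four jets. The function names and numbers are hypothetical.

```python
import math

def cos_angle(a, b):
    """Cosine of the angle between two 3-momenta."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def mincos(jets):
    """Minimum of cos(theta_ij) + cos(theta_kl) over the pairings (ij)(kl)."""
    pairings = [((0, 1), (2, 3)), ((0, 2), (1, 3)), ((0, 3), (1, 2))]
    return min(cos_angle(jets[i], jets[j]) + cos_angle(jets[k], jets[l])
               for (i, j), (k, l) in pairings)

def ordered_energy_variables(energies):
    """MaxE, Max3E and Emin, assuming jets are ranked by decreasing energy."""
    ordered = sorted(energies, reverse=True)
    return {"MaxE": ordered[0], "Max3E": ordered[2], "Emin": ordered[3]}

# Made-up 4-jet event: two back-to-back pairs along the x and y axes.
jets = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, -1.0, 0.0)]
value = mincos(jets)
energies = ordered_energy_variables([80.0, 60.0, 90.0, 70.0])
```

The same ordering function would apply to jet masses to obtain MaxM, Max3M and Mmin.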

Discriminating power of variables
Test function: $F_j = \dfrac{(n - k)}{(k - 1)} \dfrac{B_j}{W_j}$, j = 1, ..., 17.
B_j, W_j: between-classes and within-classes variance matrix for variable j; n: total number of events (signal + background); k: number of classes (2).
The discriminating power of each variable V_j is proportional to the value of F_j (j = 1, ..., 17).

Hierarchical clustering classification
The most separating distance D_HZ/Back between the classes C_HZ and C_Back is searched for, and the corresponding cut D*_HZ/Back is computed. The classification of a test event x0 is then obtained according to the algorithm:
if D_HZ/Back(x0) ≥ D*_HZ/Back then x0 ∈ C_HZ else x0 ∈ C_Back
D_HZ/Back = 0.01 Mincos + 0.32 MaxE + 0.11 Max3E + 0.52 Emin + 0.36 BAP + 0.87 Bed + 0.41 M11 + 0.38 M31
D*_HZ/Back = 2.51
[Figure: classification of test events into C_HZ and C_Back]

K-means clustering classification
For K = 2, the K-means algorithm is implemented in 4 steps:
1. Partition events into 2 non-empty subsets
2. Compute seed points as the centroids (mean points) of the clusters
3. Assign each event to the cluster with the nearest seed point
4. Go back to step 2; stop when no assignment changes
[Figure: classification of test events into C_HZ and C_Back]

Peano space filling curve clustering classification
By using the training sample X = (x_i(M11, M21, M31, M41, M51, M61, T, S, BAP, Bed, Mincos, MaxE, MaxM, Max3E, Max3M, Emin, Mmin), i = 1, ..., N = 4000) and the known class labels C_HZ, C_Back, an approximate Peano space filling curve is obtained, which transforms the 17-dimensional space into the unit interval.
[Figure: classification of test events]

Comparison
Comparison between the 3 clustering methods
Purity of classification vs cut values D*_HZ/Back in hierarchical clustering
D_HZ/Back = 0.01 Mincos + 0.32 MaxE + 0.11 Max3E + 0.52 Emin + 0.36 BAP + 0.87 Bed + 0.41 M11 + 0.38 M31
D*_HZ/Back = 2.51
D*_HZ/Back = [1.65, 1.7, 1.75, ..., 2.51, ..., 2.65, ...]
Purity (%) = [50, 51, 52, ..., 80]
[Figure: purity vs cut value, hierarchical clustering]

Conclusion
4 jets

Variables
e+e- → ZZ, e+e- → HZ, e+e- → WW, e+e- → γ/Z
Characterisation of Higgs boson events:
The most discriminating variables are: Mincos, MaxE, Max3E, Emin, BAP, Bed.
They show the importance of information that separates b quarks from udsc quarks (separation between HZ events and background: H → bb).
Other variables, such as Emin, Mmin, BAP, Bed and Mincos, may be used to identify events emerging from the background (i.e. e+e- → Z/γ → 4 jets).
Discrimination (γ/Z) / WW / ZZ: using dijet properties: charge, broadness, presence of b quarks, ...

Conclusion (continued)
Importance of Pattern Recognition methods
The improvement of any identification depends on combining the multidimensional effect offered by PR methods with the discriminating power of the proposed variables.

The hierarchical clustering method is more efficient than the other clustering techniques: its performance is on average 1 to 3 % higher than that obtained with the two other methods.
Other cut values D*_HZ/Back give other efficiencies and purities: purities can be reached that identify the HZ events more efficiently.
Clustering techniques: comparable to other statistical methods (Discriminant Analysis, Decision Trees, ...)
Clustering techniques: less effective than neural networks and non-linear discriminant analysis methods
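The dependence of purity on the cut value D*_HZ/Back can be illustrated with a small scan. The sketch below uses the coefficients and the cut 2.51 quoted for D_HZ/Back; it assumes that events with D at or above the cut are assigned to C_HZ (the comparison sign is not legible in the source) and that the variables are already scaled as in the original analysis. The toy events and their labels are made up and do not reproduce any result from the talk.

```python
# Coefficients and cut quoted on the slides for D_HZ/Back.
COEFFICIENTS = {"Mincos": 0.01, "MaxE": 0.32, "Max3E": 0.11, "Emin": 0.52,
                "BAP": 0.36, "Bed": 0.87, "M11": 0.41, "M31": 0.38}
CUT = 2.51

def discriminant(event):
    """Linear combination D_HZ/Back of the event's variables."""
    return sum(weight * event[name] for name, weight in COEFFICIENTS.items())

def classify(event, cut=CUT):
    """Assumed direction: D >= cut selects the HZ class."""
    return "HZ" if discriminant(event) >= cut else "Back"

def purity_vs_cut(events, labels, cuts):
    """Purity of the HZ selection for each cut (None if nothing is selected)."""
    purities = []
    for cut in cuts:
        selected = [lab for ev, lab in zip(events, labels)
                    if discriminant(ev) >= cut]
        purities.append(selected.count("HZ") / len(selected) if selected else None)
    return purities

def toy_event(value):
    """Hypothetical event with every variable set to the same made-up value."""
    return {name: value for name in COEFFICIENTS}

events = [toy_event(v) for v in (1.0, 0.9, 0.85, 0.7, 0.5)]
labels = ["HZ", "HZ", "Back", "Back", "Back"]
purities = purity_vs_cut(events, labels, [1.65, 2.51, 2.65])
```

In this toy sample the purity rises as the cut is tightened, at the cost of selecting fewer events, which is the trade-off between purity and efficiency that the choice of cut controls.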