Treelet kernel incorporating cyclic, stereo and inter pattern information in chemoinformatics

Benoit Gaüzère ⁎, Pierre-Anthony Grenier, Luc Brun (GREYC UMR CNRS 6072, Caen, France), Didier Villemin (LCMT UMR CNRS 6507, Caen, France)

Article history:

Received 30 August 2013

Received in revised form 26 July 2014

Accepted 29 July 2014

Available online 14 August 2014

Keywords:

Chemoinformatics

Graph kernel

Machine learning

Abstract

Chemoinformatics is a research field concerned with the study of physical or biological molecular properties through computer science research fields such as machine learning and graph theory. From this point of view, graph kernels provide a convenient framework which naturally combines machine learning and graph theory techniques. Graph kernels based on bags of patterns have proven their efficiency on several problems, both in terms of accuracy and computational time. The treelet kernel is a graph kernel based on a bag of small subtrees. In this paper we propose several extensions of this kernel devoted to chemoinformatics problems. These extensions aim to weight each pattern according to its influence, to include the comparison of non-isomorphic patterns, to include stereo information and, finally, to explicitly encode cyclic information into the kernel computation.

© 2014 Elsevier Ltd. All rights reserved.

1. Introduction

Chemoinformatics is a scientific field lying at the intersection of chemistry and computer science which aims to study, analyze, store and predict chemical data through computational techniques.

Two problems addressed by chemoinformatics are Quantitative Structure–Activity Relationship (QSAR) and Quantitative Structure–Property Relationship (QSPR) problems. QSAR and QSPR are based on the similarity principle [1], which states that two structurally similar molecules should have similar activities and properties.

A molecule is usually encoded by its molecular graph. A molecular graph is defined as a labeled graph G = (V, E, μ, ν), where the unlabeled graph (V, E) encodes the structure of the molecule, while μ maps each vertex to a label corresponding to the atom's chemical element. The labeling function ν characterizes each edge by the type of bond between two atoms (single, double, triple or aromatic). Considering this molecular representation and the similarity principle, QSAR/QSPR problems rely on defining a similarity measure between molecular graphs.
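To make the definition above concrete, the labeled graph G = (V, E, μ, ν) can be sketched with plain Python containers; the class name and field layout below are illustrative, not a reference implementation.

```python
from dataclasses import dataclass

# Minimal sketch of a labeled molecular graph G = (V, E, mu, nu),
# following the definition in the text. Hydrogens are omitted, as is
# common for molecular graphs.
@dataclass
class MolecularGraph:
    vertices: set   # V: vertex identifiers
    edges: set      # E: frozensets {u, v} encoding undirected bonds
    mu: dict        # mu: vertex -> chemical element label
    nu: dict        # nu: edge -> bond type ('single', 'double', 'triple', 'aromatic')

# Ethanol (CH3-CH2-OH) as a molecular graph over its heavy atoms.
ethanol = MolecularGraph(
    vertices={0, 1, 2},
    edges={frozenset({0, 1}), frozenset({1, 2})},
    mu={0: 'C', 1: 'C', 2: 'O'},
    nu={frozenset({0, 1}): 'single', frozenset({1, 2}): 'single'},
)
```

Using frozensets for edges reflects the fact that molecular graphs are undirected: {u, v} and {v, u} denote the same bond.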

A first family of methods introduced within the QSAR/QSPR field is based on the correlation between a set of molecular descriptors and a given molecular property (e.g. a molecule's boiling point).

Descriptor sets are encoded by vectors and may be computed from structural information [2], physical properties or biological activities. These vectors may be used within any statistical machine learning algorithm in order to predict molecular properties. Such a scheme benefits from the large set of tools available within the statistical machine learning framework. However, the definition of a fixed-size vector from a molecule, i.e. a molecular graph, induces a loss of information. Moreover, for each application, the definition of a vectorial description of each molecule is based on chemical experts' heuristics. A slightly different approach is based on graph embedding. Within this framework, a vectorial description of a graph is automatically built from its encoding using, for example, the spectral analysis of graphs [3]. In this last case, the embedding is deduced from the analysis of the eigenvectors and eigenvalues of the adjacency matrix.

A second family of methods, based on graph theory [4], may be decomposed into two subfamilies. The first subfamily [5], related to the data mining field, aims to discover subgraphs having a large difference of frequencies between sets of positive and negative examples. Note that this first subfamily is mainly restricted to classification problems. The second subfamily [6] bases molecular graph similarity measures on the graph edit distance. However, since the graph edit distance does not define a Euclidean distance [7], similarity measures based on it do not correspond to a direct explicit or implicit embedding. An embedding may nevertheless be obtained by regularizing the edit distance or by using distances to a set of prototype graphs [6]. The graph edit distance can also be directly combined with a restricted set of machine learning methods such as the k-nearest neighbors or k-medians algorithms.


Pattern Recognition. http://dx.doi.org/10.1016/j.patcog.2014.07.029

⁎ Corresponding author.

E-mail addresses: benoit.gauzere@ensicaen.fr (B. Gaüzère), pierre-anthony.grenier@ensicaen.fr (P.-A. Grenier), luc.brun@ensicaen.fr (L. Brun), didier.villemin@ensicaen.fr (D. Villemin).

Pattern Recognition 48 (2015) 356–367

Graph kernels can be understood as symmetric graph similarity measures. Using a positive definite kernel k, the value k(G, G′), where G and G′ encode two graphs, corresponds to a scalar product between two vectors ψ(G) and ψ(G′) in a Hilbert space.

Therefore, such a similarity measure may be used in conjunction with machine learning methods which access input data only through scalar products (such as SVM). The graph kernel framework thus provides a natural connection between structural pattern recognition and graph theory on the one hand and statistical pattern recognition on the other hand.
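The identity k(G, G′) = ⟨ψ(G), ψ(G′)⟩ can be illustrated with a deliberately simple feature map; the toy ψ below just counts atom labels, which is far weaker than the treelet kernel but shows how symmetry follows by construction. All names here are illustrative.

```python
from collections import Counter

def psi(atom_labels):
    """Toy feature map psi: a graph is mapped to the multiset of its
    atom labels (one coordinate per distinct label)."""
    return Counter(atom_labels)

def k(labels_g, labels_h):
    """Kernel value as the scalar product <psi(G), psi(G')> in the
    induced feature space."""
    fg, fh = psi(labels_g), psi(labels_h)
    return sum(fg[p] * fh[p] for p in fg.keys() & fh.keys())

g = ['C', 'C', 'O']   # e.g. ethanol heavy atoms
h = ['C', 'O']        # e.g. methanol heavy atoms
print(k(g, h))  # 2*1 (shared 'C') + 1*1 (shared 'O') = 3
```

Because k is an explicit scalar product, it is symmetric and positive definite, so its Gram matrix can be handed directly to any kernel machine such as an SVM.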

A large family of graph kernels used in chemoinformatics is based on the definition of a bag of patterns for each graph, deducing graph similarity from the similarity between bags. Some bags of patterns are defined by linear patterns. For example, the marginalized kernel [8] defines a graph kernel based on a comparison between sets of random walks extracted from each graph. The tree pattern kernel [9] defines a graph kernel using a set of tree patterns instead of walks. This last kernel encodes more structural information than kernels based on linear patterns such as random walks.
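The bag-of-patterns scheme can be sketched as follows: each graph is reduced to a multiset of small patterns, and the kernel compares the two bags. The sketch below uses labeled edges (the simplest linear pattern) as the bag; treelets generalize this idea to small subtrees. Function names and the pattern encoding are illustrative assumptions.

```python
from collections import Counter

def edge_patterns(edges):
    """Extract a bag of canonical labeled-edge patterns (a, bond, b),
    with labels ordered so that (a, bond, b) and (b, bond, a) coincide."""
    bag = Counter()
    for (a, bond, b) in edges:
        bag[(min(a, b), bond, max(a, b))] += 1
    return bag

def bag_kernel(edges_g, edges_h):
    """Linear kernel between the two pattern bags: the scalar product
    of their pattern-count vectors."""
    bg, bh = edge_patterns(edges_g), edge_patterns(edges_h)
    return sum(bg[p] * bh[p] for p in bg.keys() & bh.keys())

# Each graph is given as a list of labeled edges (atom, bond, atom).
ethanol = [('C', 'single', 'C'), ('C', 'single', 'O')]
methanol = [('C', 'single', 'O')]
print(bag_kernel(ethanol, methanol))  # one shared (C, single, O) pattern -> 1
```

Replacing `edge_patterns` with an enumeration of small subtrees, and the linear kernel with a sub-kernel between pattern counts, yields the general shape of the treelet kernel discussed in this paper.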