Home/Magazine Archive/November 2009 (Vol. 52, No. 11)/Steering Self-Learning Distance Algorithms/Full Text

Virtual extension
## Steering Self-Learning Distance Algorithms

The concept of distance expresses the distortion measure between any pair of entities lying in a common space. Distances are at the very heart of geometry, and are ubiquitous in science, needless to say in computational science. From the vantage point of physics, distances may be interpreted as the smallest amount of energy required to go from one location to the other, or to morph from one state to the other. Unfortunately, there is a lot of confusion in popular press about what is exactly meant by using the wording "distance." For example, people quite often interchange "distance" with "metric" without caring much about the implicitly underlying mathematical properties: In this case, to know whether the triangle inequality axiom is satisfied or not. A great deal of efforts was achieved by Deza and Deza^{6} in 2006 by publishing the first dictionary of distances presenting succinctly but unambiguously the various properties of distortion measures (such as, metrics, semimetrics, distances, quasi-distances, divergences, etc.), and listing an extensive although non-exhaustive catalog of principal distances with their domain of applications encountered in both natural sciences (biology, chemistry, physics, and cosmology), and computer sciences (coding theory, data mining, and audio/video processing).

Algorithm designers and researchers in computational sciences daily face the daunting task of choosing the most appropriate distance functions for solving their specific problems at hand. It is clearly understood nowadays that the usual flatland Euclidean distance is rarely appropriate for solving tasks on high-dimensional heterogeneous datasets that are rather lying on curved manifolds. A simple toy argument is to consider the task of averaging rotation matrices. Rotation matrices are orthonormal matrices of unit determinant. Unfortunately, the center of mass of a set of matrices, for example, the centroid defined as the arithmetic mean of rotation matrices is not a rotation matrix so that a regularization method is required to cast the average matrix to the closest rotation matrix. Consider yet another example: partial 3D shape retrieval. In partial shape retrieval, a user queries a database of 3D objects with a given part. Solving this problem requires to consider an oriented distance to break the symmetry rule. That is, one would like the distance part to object to be greater than the distance object to part, for all parts belonging to the given object. Indeed, to clarify this point, consider the distance of a 3D wagon to a 3D train model consisting of a locomotive attached to several wagon units. This distance wagon to train should be strictly greater than the distance of the same 3D train to the said wagon. This kind of asymmetric property is fulfilled by the relative entropy distance, known also as the Kullback-Leibler divergence that acts on statistical distributions. Liu et al.^{9} built an efficient and accurate 3D part search engine inspired by probabilistic text analysis technique by considering 3D objects as documents covering a small number of topics called "shape topics." They experimentally showed that the relative entropy distance behaves significantly better than the Euclidean or vector space model weighted cosine distances. It is natural to ask oneself whether this Kullback-Leibler distance is the best ultimate distance function for 3D search engines or not?

It turns out that this subtle question cannot be settled in a static way as it depends on the considered input database and on the not-yet-known online queries to be processed in the future. Otherwise, adversorial input sets could be purposely designed to prove the sub-optimality of any prescribed distance function. That is, distances need to be tuned up for every single input set by a built-in learning process. Further, these algorithmic distances need to be dynamically maintained as objects are added, edited, or deleted in the database. This dynamic paradigm of selecting distances bears much similarity with the recent concept of self-improved algorithms^{1} that devote some of their computational time to learn distribution characteristics of the input data sets to be able to speed up the overall process.

Since the space of potential distance functions is uncountably infinite, designing self-learning distance algorithms need to proceed first by choosing a small set of axiom rules (such as symmetry, or triangle inequality) specifying the type of distances, and obtaining parameterized distance families for each class. For example, back to 1991, Csiszár^{1} axiomatically derived the 1D parametric family of so-called Bregman-Csiszár distances by generalizing the principles of orthogonal projection measures in least-square-type optimization problems. This generalization let us discovered some counterintuitive facts a priori, such as the nonnecessarily commutative property of orthogonal projections. The Bregman-Csiszár parametric family includes the Kullback-Leibler and the Itakura-Saito divergences at its extremities. Thus learning the best Bregman-Csiszár distance for a given input amounts to find its best member subject to problem-specific constraints.

In information retrieval (IR) systems such as the former partial shape search engine, a set of features playing the role of signature are first extracted from every single input element, and an overall appropriate distance function is properly defined on the signature space. Then, given a query object, its signature data point is first computed (feature extractor) and its nearest-neighbor is searched for among all input signatures. In practice, better classification methods such as the k-NN rule that consists in taking the majority class of the k nearest neighbors (NN), or using kernel machines such as popular support vector machines (SVMs) are employed. Geometrically speaking, the input signatures yields a partition of the signature space into discrete elementary proximity volumes, called Voronoi cells that represent the locus of signature points closer to the cell's signature than to any other input signature. Such a discrete Voronoi diagram implicitly encodes the shape of signature data points. Interestingly, Voronoi diagrams have been recently generalized to the parameterized family of Bregman divergences^{10} as well, unifying both the classic ordinary Euclidean diagram with entropic statistical diagrams into a single unifying framework.

Furthermore, we seek for efficiency reasons to reduce the number of input signatures to keep only its most representative elements. This is achieved by using a technique called vector quantization that clusters the points into groups such as to minimize the overall intra-cluster distance, while maximizing the inter-cluster distance. The seminal centroid-based k-means cluster algorithm originally established in 1957 by Lloyd has also been recently generalized to its provably most generic class of distance measures by a breakthrough result of Banerjee et al.^{2} in 2005: Namely, the class of Bregman divergences.

One can legitimately ponder whether such self-learning distance algorithms are indeed the best suited strategy to get the optimal solution. These tweaked algorithms are indeed presumably the best whenever the objective or loss functions are unambiguously defined from the input datasets. But no one would doubt on the subjective part of defining the "closest" 3D shape to a given collection of 3D shapes. Human perception then plays a determinant role, and answers are all but subjective, reflecting the different tastes of individuals. Therefore another recent line of research is to let users steer themselves the distance learning process by loosely entering preferences. These personal user preferences are entered either by clicking on the best subjective ranked item in a list of top matches, or by providing prior information such as "I find these two images quite similar but these two others are rather far apart, etc." that are handled as like/dislike constraints. These semi-supervised learning problems become a hot topic in machine learning as attested by the increasing number of publications related to this area.

For example, Bar-Hillel and Weinshall^{3} described such a semi-supervised learning algorithm where users give positive/negative equivalence constraints denoting intra-cluster/intercluster pairs of points. The problem one faces then is to extract as precisely (for example, numerically) and reliably as possible the information provided by users. The thesis of Hertz^{7} provides an excellent review of distance learning techniques starting from the most common Mahalanobis metric learning algorithms that generalize the usual Euclidean metric to more flexible non-parametric distance learning methods. For example, the non-parametric "DistBoost" distance^{4} is derived from signed margin values of binary classifiers combined altogether in the spirit of a machine learning technique called boosting. The recent edited book of Basu et al^{4} on constrained clustering further describes many other semi-supervised clustering techniques that support user feedback.

Learning distances play also a crucial role in the field of collaborative filtering. The seminal idea of collaborative filtering was originally presented by Goldberg et al.^{7} in 1992 at Xerox Palo Alto Research Center in their experimental mail system called Tapestry. Nowadays, collaborative filtering is used in many commercial systems including Amazon book store and eBay auction site. The underlying idea of collaborative filtering is that information filtering is much more effective when humans are part of the filtering task too. That is, both user-steered content-based filtering and group-steered collaborative filtering are used to quickly retrieve insightful documents given a huge set of annotations gleaned from many users.

In the near future, we envision a whole new generation of scaleable personalized information retrieval systems driven by novel algorithms incorporating self-learning built-in distance modules, and providing light user interfaces. These brand new search engines would be able to better listen to the voice of their users, and more importantly give adequate feedbacks to the full information retrieval engine about users/groups subjective tastes, all at the clicks of a mouse.

1. Ailon, N., Chazelle, B., Comandur, S., and Liu, D. Self-improving algorithms. In *Proceedings of the 17 ^{th} ACM-SIAM Symposium on Discrete Algorithms* (SIAM press, Philadelphia, PA), 261270, 2006.

2. Banerjee, A., Merugu, S., Dhillon, I. S., and Ghosh, J. Clustering with Bregman divergences, *J. Machine Learning Research 6*, (2005), 1705 1749.

3. Bar-Hillel, A., and Weinshall, D. Learning distance function by coding similarity. In *Proceedings of 24 ^{th} International Conference on Machine Learning* (Madison, WI, 2007), Omnipress, 6572.

4. Basu, S., Davidson, I., and Wagstaff, K. *Constrained Clustering: Advances in Algorithms, Theory and Applications*, Chapman & Hall/CRC, 2008.

5. Csiszár, I. Why least squares and maximum entropy? An axiomatic approach to inference for linear inverse problems. *Annals of Statistics 19*, 4, (1991) 20322066.

6. Deza, E., Deza, M. M. *Dictionary of distances*, Elsevier, 2006.

7. Goldberg, D., Nichols, D., Oki, B. M., and Terry, D. Using collaborative filtering to weave an information tapestry. *Comm. ACM 35*, 12, (Dec. 1992), 6170.

8. Hertz, T. Learning distance functions: Algorithms and applications. PhD thesis, 2006.

9. Liu, Y., Zha, H., and Qin, H. Shape topics: A compact representation and new algorithms for 3D partial shape retrieval. In *Proceedings of IEEE Computer Vision and Pattern Recognition Conference* (CA 2006), 20252032.

10. Nielsen, F., Boissonnat, J.-D., and Nock, R. *Bregman Voronoi Diagrams: Properties, Algorithms and Applications.* INRIA Research Report 6154 (Sophia-Antipolis, France 2007).

Figure. Learning Distance In modern information retrieval (IR) systems, users are able to dynamically steer the distance learning process of search algorithms to reflect accurately and efficiently their subjective tastes: First, a user query the IR engine (1), and a list of top ranked items are returned (2), from which the user selects his/her best matches (3), thus allowing the system to personalize accordingly the distance function for future queries.

**©2009 ACM 0001-0782/09/1100 $10.00**

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

The Digital Library is published by the Association for Computing Machinery. Copyright © 2009 ACM, Inc.

No entries found