Spectral Relational Clustering, Multi-type Relational Data, Collective Factorization on Related Matrices

Most clustering approaches in the literature focus on "flat" data in which each data object is represented as a fixed-length feature vector. However, many real-world data sets are much richer in structure, involving objects of multiple types that are related to each other, such as Web pages, search queries and Web users in a Web search system, and papers, key words; authors and conferences in a scientific publication domain. In such scenarios, using traditional methods to cluster each type of objects independently may not work well due to the following reasons.    First, to make use of relation information under the traditional clustering framework, the relation information needs to be transformed into features. In general, this transformation causes information loss and/or very high dimensional and sparse data. For example, if we represent the relations between Web pages and Web users as well as search queries as the features for the Web pages, this leads to a huge number of features with sparse values for each Web page. Second, traditional clustering approaches are unable to tackle with the interactions among the hidden structures of different types of objects, since they cluster data of single type based on static features. Note that the interactions could pass along the relations, i.e., there exists influence propagation in multi-type relational data. Third, in some machine learning applications, users are not only interested in the hidden structure for each type of objects, but also the global structure involving multi-types of objects. For example, in document clustering, except for document clusters and word clusters, the relationship between document clusters and word clusters is also useful information. It is difficult to discover such global structures by clustering each type of objects individually.   Therefore, multi-type relational data has presented a great challenge for traditional clustering approaches. This invention is based on a novel general model, the collective factorization on related matrices, to discover the hidden structures of multi-types of objects based on both feature information and relation information. By clustering the multi-types of objects simultaneously, the model performs adaptive dimensionality reduction for each type of data. Through the related factorizations which share factors, the hidden structures of different types of objects could interact under the model. In addition to the cluster structures for each type of data, the model also provides information about the relation between clusters of different types of objects. Second, under this model, we derive a novel algorithm, the Spectral Relational Clustering (SRC), to cluster multi-type interrelated data objects simultaneously. By iteratively embedding each type of data objects into low dimensional spaces, the algorithm benefits from the interactions among the hidden structures of different types of data objects. The algorithm has the simplicity of spectral clustering approaches but at the same time also applicable to relational data with various structures. Theoretic analysis and experimental results demonstrate the promise and effectiveness of the algorithm. Third, we show that the existing spectral clustering algorithms can be considered as the special cases of the proposed model and algorithm. This provides an unified view to understand the connections among these algorithms.

Advantages:

  • It deals with the problem of high dimensionality more efficiently.
  • It deals with the problem of sparsity more efficiently.
  • Unlike the existing technologies which are applicable to only the relational data with special structures, it is applicable to relational data with various structures.
  • Unlike the traditional approaches that provide only local hidden structures, it provides both local and global hidden structures.

Intellectual Property:

U.S. 8,185,481; 8,700,547

 

Binghamton University RB240

Patent Information:
Technology/Start-up ID: