Knowledge Discovery from Citation Networks

Knowledge discovery from citation networks (i.e., linked textual data such as scientific articles, legal documents, webpages, and emails) offers insight across many fields now that the internet and digital databases have made huge repositories available. Digital libraries organize vast numbers of publications in a structured way so that information of interest to a user can be extracted.

Unsupervised learning from documents is a machine learning problem that aims to model and understand the topics of documents and to provide a meaningful description of them while preserving the basic statistical information about the corpus. For example, in a corpus of scientific articles (i.e., a digital library), documents are connected by citations, and each document plays two different roles in the corpus: as the document itself and as a citation of other documents.

The present technology provides a Bernoulli Process Topic (BPT) model, which models the corpus at two levels: the document level and the citation level. Each document has two different representations in the latent topic space, one for each of its roles. Moreover, the multilevel hierarchical structure of the citation network is captured by a generative process involving a Bernoulli process. Comparisons against other methods demonstrate very promising performance.
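
To make the two-level idea concrete, the following is a toy Python sketch, not the patented BPT model. It assumes that each document has its own topic distribution (the hypothetical `theta`), and that a Bernoulli draw with a hypothetical parameter `beta` decides at each step whether to stop at the current document or follow one of its citations. The document names, topic values, and function names are all illustrative assumptions.

```python
import random

# Toy corpus (all values are illustrative assumptions, not data from the listing).
# theta[d] is the document-level topic distribution of document d over 2 topics.
theta = {"A": [0.9, 0.1], "B": [0.1, 0.9], "C": [0.5, 0.5]}
# cites[d] lists the documents that d cites: A cites B, B cites C.
cites = {"A": ["B"], "B": ["C"], "C": []}


def sample_topic(doc, theta, cites, beta, rng, max_depth=10):
    """Draw one topic for `doc` by a Bernoulli-controlled walk over citations.

    At each step, with probability `beta` (or if there is nothing to cite)
    the walk stops at the current document; otherwise it follows one citation
    chosen uniformly at random. The topic is then drawn from the
    document-level distribution of the document where the walk ends
    (or where it is after `max_depth` steps).
    """
    current = doc
    for _ in range(max_depth):
        if not cites.get(current) or rng.random() < beta:
            break
        current = rng.choice(cites[current])
    topics = list(range(len(theta[current])))
    return rng.choices(topics, weights=theta[current], k=1)[0]


def citation_level_mixture(doc, theta, cites, beta, n_samples=5000, seed=0):
    """Estimate the second representation of `doc` by Monte Carlo sampling."""
    rng = random.Random(seed)
    counts = [0] * len(theta[doc])
    for _ in range(n_samples):
        counts[sample_topic(doc, theta, cites, beta, rng)] += 1
    return [c / n_samples for c in counts]


if __name__ == "__main__":
    for doc in theta:
        print(doc, theta[doc], citation_level_mixture(doc, theta, cites, beta=0.7))
```

In this sketch, document A keeps its own distribution as one representation, while its sampled mixture shifts toward topic 1 because of the documents reachable through its citations. A document with no citations, such as C, has the same distribution at both levels, up to sampling noise.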


  • Explicitly differentiates the two roles a document plays in a citation network: the document itself and a citation of other documents.
  • Can be used in several data mining tasks that alternative technologies cannot achieve, such as literature recommendation, detection of novel research topics, and discovery of trends in research areas.

Intellectual Property:

U.S. Patent Nos. 8,630,975; 8,930,304; 9,269,051


Binghamton University RB372
