Tuesday, January 5, 2021

Paper Summary: Jon. M. Kleinberg, "Authoritative Sources in a Hyperlinked Environment."

 Paper: Jon. M. Kleinberg, "Authoritative Sources in a Hyperlinked Environment."

 Proc. 9th ACM-SIAM Symposium on Discrete Algorithms, 1998. Extended version in Journal of the ACM 46(1999). Also appears as IBM Research Report RJ 10076, May 1997.

Available from: https://www.cs.cornell.edu/home/kleinber/auth.pdf

Summary

Analysis of the hyperlink structure of the WWW pages provides latent information about human judgement and can be used to formulate the notion of authority. The creator of page p by including a link to page q, confers authority on q. High in-degree of a node is therefore correlated with authoritative sources. The paper also defines the notion of hubs -- pages that link to many authoritative sources. It is observed that hubs and authorities usually have a natural equilibrium in the graph. An algorithm is presented which helps to find authoritative pages from the web -- it takes a query string and the output from a text based search engine as inputs and generates the set of results from the search engine for the query string. 

A link is called transverse  if it is between pages with different domain names and intrinsic if it is between pages with same domain name. Intrinsic links are usually not meaningful (such as navigation links) and are not considered in the graph. 

A simple approach to constructing hubs and authorities would be to use in-degree of nodes. If the links are all meaningful, such an approach is expected to work well in practice. However, there is often a tension between pages that are authoritative and those that are "universally popular". These popular pages are also expected to be of high in-degree, but may not be authoritative. The paper makes a claim that authoritative pages should not only have a high in-degree but also have considerable overlap of materials. 

An iterative algorithm for finding hubs and authorities is presented in the paper. This makes use of the "I" and "O" operations -- if a page points to many pages with high authority weights, then it should be a hub; similarly if a page points to many pages which are hubs, it is likely to be an authority. To find an "equilibrium" setting, the weights of the "I" and "O" operations are adjusted alternatively and one needs to check if a fixed point is reached. It can be shown that the sequence of iterates of the hub and authority weights converge to an optimal solution after several iterations of the algorithm are executed. 

The paper also introduces the notion of diffusion and generalization. When many different topics are being discussed in the set of web pages being studied and the search query is broad, there is a possibility that each topic is centered around a collection of hubs and authorities. When the initial query string specifies a topic that is not sufficiently broad -- there may not be enough pages from which to learn a relevant hub or authority. In such a case, "broader" topics become competing and the process is said to have "diffused". Diffusion can be useful if it can be used to abstract a query to a broader, related concept. Alternatively, a query can be too specific -- all the pages belong to a single hub or authority. In this case, generalization may be difficult. 

Implementations
Here are two implementations of the HITS algorithm in R
https://rdrr.io/cran/networkR/src/R/hits.R
https://yifanyang.wordpress.com/2016/03/02/hits-algorithm-in-r/

Feel free to try them out!

No comments:

Post a Comment