Blog post #1:
Rescaled PageRank made simple

(added on 07 November 2016)

Rescaled PageRank is a time-balanced metric built on Google's classical PageRank metric. In this first post, we briefly review the PageRank algorithm and introduce the main idea behind the new metric.

PageRank

PageRank is a network-based algorithm which aims to rank the network's nodes according to their importance, or centrality, in the network. The main assumption of the algorithm is that

“a node is important if it is pointed to by other important nodes.”

In the case of the citation network of scientific papers, this assumption can be rephrased as

“a paper is important if it is cited by other important papers.”

PageRank then uses an equation that directly reflects this idea: each paper's score is recursively propagated to its references, shared among them. Papers that receive many citations from influential papers with few references are thus considered influential as well. Originally devised by Brin and Page to rank web pages on the World Wide Web [1], the PageRank algorithm and its variants have been applied to many real-world networks [2]. Technical details on the calculation of the PageRank scores can be found in our second blog post.
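
The propagation of scores from citing papers to their references can be illustrated with a short Python sketch. This is a minimal example under stated assumptions, not the implementation used for the results in this post: the function name pagerank, the parameters alpha, tol and max_iter, the default damping value 0.85 and the toy network are all hypothetical, and spreading the score of papers without references uniformly over all papers is one common convention among several.

```python
def pagerank(citations, alpha=0.85, tol=1e-10, max_iter=1000):
    """Power-iteration PageRank on a citation network.

    citations maps each paper to the list of papers it cites.
    alpha is the damping factor (0.85 is an assumed default).
    """
    # Collect every paper that appears as a citing or a cited node.
    papers = set(citations)
    for refs in citations.values():
        papers.update(refs)
    papers = sorted(papers)
    n = len(papers)

    # Start from a uniform score distribution.
    score = {p: 1.0 / n for p in papers}

    for _ in range(max_iter):
        # Score held by papers without references is spread uniformly
        # (an assumed convention for "dangling" nodes).
        dangling = sum(score[p] for p in papers if not citations.get(p))
        new = {p: (1.0 - alpha) / n + alpha * dangling / n for p in papers}

        # Each paper passes its score to its references in equal shares.
        for p in papers:
            refs = citations.get(p, [])
            for q in refs:
                new[q] += alpha * score[p] / len(refs)

        change = sum(abs(new[p] - score[p]) for p in papers)
        score = new
        if change < tol:
            break
    return score


# Toy network: C cites A and B, B cites A, A cites nothing.
toy_citations = {"C": ["A", "B"], "B": ["A"], "A": []}
toy_scores = pagerank(toy_citations)
for paper in sorted(toy_scores, key=toy_scores.get, reverse=True):
    print(paper, round(toy_scores[paper], 4))
```

In this toy network the oldest paper A ends up with the highest score and the newest paper C with the lowest, which already hints at the age effect discussed next.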

However, the algorithm has a fundamental shortcoming when applied to citation networks of scientific papers. Papers can only cite older papers, which results in a strong bias of the algorithm towards old papers. As a consequence, it is virtually impossible for a recent paper to score well by PageRank.

Rescaled PageRank

To remove the time bias of PageRank, we introduced the rescaled PageRank metric, which explicitly suppresses this bias. The scoring algorithm consists of two steps:

  1. Compute the papers' PageRank scores.
  2. Rescale the scores to remove their temporal bias.

The rescaled PageRank score R_i(p) of a given paper i is defined as the number of standard deviations by which the PageRank score p_i of paper i exceeds the average score of papers of similar age. The corresponding formula is

$$R_i(p) = \frac{p_i - \mu_i(p)}{\sigma_i(p)}$$

where μ_i(p) is the average PageRank score of the papers published at a similar time as paper i and σ_i(p) is the standard deviation of these scores. Technical details on the calculation of the rescaled PageRank scores can be found here.
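
The rescaling step can be sketched in Python as follows. This is an illustration under assumptions, not the exact procedure behind the numbers in this post: here "papers of similar age" is taken to mean a fixed-size window of neighbouring papers in publication order (the paper itself included), and the function name rescaled_scores, the parameter window and the toy data are hypothetical. The score of each paper is computed as its PageRank score minus the window mean, divided by the window standard deviation.

```python
from statistics import mean, pstdev


def rescaled_scores(scores, pub_order, window):
    """Rescale scores by the mean and standard deviation of same-age papers.

    scores    maps each paper to its PageRank score.
    pub_order lists the papers sorted by publication date.
    window    is the assumed number of papers counted as "similar age".
    """
    n = len(pub_order)
    half = window // 2
    rescaled = {}
    for idx, paper in enumerate(pub_order):
        # Window of neighbours in publication order, shifted inwards
        # at both ends so that it keeps its size where possible.
        lo = max(0, min(idx - half, n - window))
        hi = min(n, lo + window)
        peers = [scores[q] for q in pub_order[lo:hi]]

        mu = mean(peers)
        sigma = pstdev(peers)
        # Guard against a window in which all scores are identical.
        rescaled[paper] = (scores[paper] - mu) / sigma if sigma > 0 else 0.0
    return rescaled


# Toy data: four papers in publication order with made-up scores.
toy_scores = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 10.0}
toy_order = ["a", "b", "c", "d"]
toy_rescaled = rescaled_scores(toy_scores, toy_order, window=4)
for paper in toy_order:
    print(paper, round(toy_rescaled[paper], 3))
```

Because every paper is compared only with its temporal neighbours, a recent paper with a small absolute score can still obtain a high rescaled score if it stands out among papers of its own age.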

To better understand the meaning of the Ri(p) score, consider the three following examples:

  1. The recent review Optical atomic clocks from June 2015 has a rescaled PageRank score of R = 10.4, which means that its PageRank score exceeds the average for papers of similar age by more than 10 standard deviations. This score puts the paper at rank 805 out of all 589,017 papers. At the same time, the review has received only 31 citations so far, which puts it at rank 45,331 by citation count (the two ranks differ by a factor of more than 50).
  2. The old paper on the Einstein-Podolsky-Rosen (EPR) paradox and the very recent one on the detection of gravitational waves both have exceptionally high rescaled PageRank scores above 20, which signifies that both are groundbreaking contributions to science; rescaled PageRank ranks these two papers 81 and 20, respectively. By contrast, the simple citation count ranks them 32 and 3,412, respectively, clearly favoring the old paper. The original PageRank magnifies the difference further, ranking the two papers 4 and 20,917, respectively: the recent paper does not stand a chance here.
  3. The very recent paper Solutions in bosonic string field theory and higher spin algebras in AdS from November 2015 has received only one citation, yet it has a comparatively high rescaled PageRank score of R = 11.1. This is because it is cited by a paper which also scores highly (Irregular vertex operators for irregular conformal blocks from May 2016). However, if it attracts no further citations in the future, the rescaled PageRank score of the bosonic string paper will decrease, because the paper will no longer be exceptional in comparison with papers of similar age.

We performed statistical tests [3] to show that the ranking by R_i(p) is not biased by age. We showed that rescaled PageRank allows us to discover highly influential papers much earlier than PageRank and significantly better than metrics based only on citation count, which suggests that rescaled PageRank is a better proxy for paper significance than these metrics. Some of these results can also be found in our second blog post. Besides scientific papers, rescaled PageRank can be applied to any other system that can be effectively described by a directed network. If, as with scientific papers, there is a strong time preference in this network, scores obtained with rescaled PageRank are likely to be superior to those obtained with PageRank alone.

References

[1] S. Brin, L. Page, The anatomy of a large-scale hypertextual web search engine, Computer Networks and ISDN Systems 30, 107-117 (1998)
[2] D. F. Gleich, PageRank beyond the Web, SIAM Review 57, 321-363 (2015)
[3] M. S. Mariani, M. Medo, Y.-C. Zhang, Identification of milestone papers through time-balanced network centrality, Journal of Informetrics 10, 1207-1223 (2016)