Blog post #2:
Deeper into rescaled PageRank

(added on 11 October 2016)

The classical PageRank metric acts on directed networks, and it did not take long for researchers to realize that the directed network of citations among scientific papers could be a use case for PageRank. Although this was an interesting attempt, it quickly became obvious that the PageRank score of papers is strongly influenced by their age (this time bias has recently been studied in a general setting). The reason for this bias in citation data is simple: a scientific paper can only cite past papers. The citation network thus has an implicit arrow of time built in, and since PageRank scores flow along the network's directed links, in the citation network they flow only back in time. This is how the bias of PageRank toward old papers is born (the teleportation term in the PageRank equation weakens the effect).
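To make the backward flow of score concrete, here is a minimal sketch of PageRank by power iteration on a toy citation network. The six-paper network, the damping value of 0.5, and the uniform redistribution of score from papers without references are all assumptions made for illustration; none of them come from the study discussed in this post.

```python
def pagerank(cites, n, damping=0.5, iterations=100):
    """Power-iteration PageRank.

    cites maps a paper index to the list of papers it cites.
    Papers are indexed 0..n-1 in order of publication (0 is the oldest).
    """
    scores = [1.0 / n] * n
    for _ in range(iterations):
        # Teleportation term: every paper receives the same baseline share.
        new = [(1.0 - damping) / n] * n
        for paper in range(n):
            refs = cites.get(paper, [])
            if refs:
                # Score flows from the citing paper to the (older) cited papers.
                share = damping * scores[paper] / len(refs)
                for ref in refs:
                    new[ref] += share
            else:
                # A paper with no references spreads its score uniformly (assumption).
                share = damping * scores[paper] / n
                for j in range(n):
                    new[j] += share
        scores = new
    return scores


# Hypothetical toy network: each paper cites the one or two papers just before it.
toy_cites = {1: [0], 2: [0, 1], 3: [1, 2], 4: [2, 3], 5: [3, 4]}
toy_scores = pagerank(toy_cites, n=6)

for paper, score in enumerate(toy_scores):
    print(f"paper {paper}: {score:.4f}")
```

In this toy network every paper has a comparable citation pattern, yet the score decreases steadily from the oldest paper to the newest, simply because score can only accumulate on the older side of each link.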

Once we realize that the time bias is innate to PageRank results on citation data, the natural question is how to remove it to make papers of different ages comparable with each other. The simplest possibility is to compute the PageRank scores of all papers, $p_i$ for paper i, and for each paper compare its score with the scores of other papers that were published at a similar time. Without going into details, the formula to rescale the original PageRank scores has the form

$$R_i(p) = \frac{p_i - \mu_i(p)}{\sigma_i(p)}$$

where $\mu_i(p)$ is the average PageRank score of the papers published at the same time as paper i and $\sigma_i(p)$ is the standard deviation of these scores. To put it simply, this formula implements a double correction: it corrects for the fact that the average score changes with time, and it also corrects for the fact that the dispersion of scores changes with time. To illustrate that the rescaled scores indeed lack any time bias, the following figure shows the distribution of score values for papers divided into 40 mutually exclusive groups by their age. Scores of the oldest papers (group 1) clearly follow a pattern that is very similar to that of papers published in any later time period (these results were obtained for the 450,000 papers published by the American Physical Society from 1893 until 2009).
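The rescaling described above amounts to a z-score computed within a group of papers of similar age. The sketch below assumes that "published at the same time" means sharing a period label (here, a hypothetical publication year); how the peer group is actually delimited is not specified in this post, and the function name and data are made up for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def rescale_scores(scores, periods):
    """Return (score - group mean) / group standard deviation for each paper.

    scores[i] is the raw score of paper i; periods[i] is its period label.
    Papers sharing a period label form one peer group (assumption).
    """
    groups = defaultdict(list)
    for score, period in zip(scores, periods):
        groups[period].append(score)

    # Mean and population standard deviation of each peer group.
    stats = {period: (mean(vals), pstdev(vals)) for period, vals in groups.items()}

    rescaled = []
    for score, period in zip(scores, periods):
        mu, sigma = stats[period]
        # Guard against groups where all scores are identical.
        rescaled.append((score - mu) / sigma if sigma > 0 else 0.0)
    return rescaled


# Hypothetical data: old papers have scores ten times larger than recent ones.
raw_scores = [0.30, 0.20, 0.10, 0.03, 0.02, 0.01]
pub_periods = [1900, 1900, 1900, 2000, 2000, 2000]
rescaled = rescale_scores(raw_scores, pub_periods)

for period, raw, new in zip(pub_periods, raw_scores, rescaled):
    print(f"{period}: raw={raw:.2f} rescaled={new:+.3f}")
```

Although the raw scores of the two groups differ by an order of magnitude, the best paper of each group ends up with the same rescaled score, which is exactly the kind of age-independent comparison the rescaling is meant to provide.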

Although it is nice to have a score without any time bias, the question is whether the ranking it produces is of any use. In particular, how does it compare with the rankings produced by other metrics of paper impact, such as citation count and PageRank itself? To assess a ranking of papers, we used a recently published list of milestone papers that have been selected by APS editors for "making long-lived contributions to physics, either by announcing significant discoveries, or by initiating new areas of research". While the selection enjoys the benefit of hindsight (the most recent milestone papers are from 2001), in the evaluation we can look at how they are ranked shortly after their publication. Choosing the fraction of milestone papers that rank in the top 1% of papers as the evaluation metric (we refer to this as the identification rate), the following figure shows how the identification rate achieved by various metrics changes with the time since the publication of milestone letters. The five included metrics are: citation count c, PageRank p, rescaled citation count R(c), rescaled PageRank R(p), and CiteRank T (rescaled citation count is obtained by a rescaling procedure analogous to that of rescaled PageRank). Rescaled citation count, rescaled PageRank, and CiteRank take paper publication time into account and remove the time bias to varying degrees.
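The identification rate itself is simple to compute for a single snapshot of scores. The following is a minimal sketch with made-up data; the function name, the tie handling (ties are broken by sort order), and the rounding of the top-1% cutoff are assumptions, and the evaluation in the post additionally repeats this at different times since each milestone's publication.

```python
def identification_rate(scores, milestones, top_fraction=0.01):
    """Fraction of milestone papers ranked within the top `top_fraction` of all papers.

    scores maps a paper id to its score under some metric;
    milestones is a collection of paper ids.
    """
    n_top = max(1, round(len(scores) * top_fraction))
    # Rank papers from the highest score to the lowest.
    ranked = sorted(scores, key=scores.get, reverse=True)
    top = set(ranked[:n_top])
    return sum(1 for paper in milestones if paper in top) / len(milestones)


# Hypothetical data: 200 papers whose score equals their id,
# so the top 1% consists of papers 199 and 198.
example_scores = {paper: float(paper) for paper in range(200)}
example_milestones = [199, 198, 150, 10]

rate = identification_rate(example_scores, example_milestones)
print(rate)
```

Here two of the four hypothetical milestones fall within the top 1%, so the identification rate is 0.5. Running the same function on the scores produced by different metrics gives directly comparable numbers.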

There are two points to note in the figure. First, the three time-aware metrics clearly outperform the usual time-unaware metrics (citation count and PageRank) in ranking the milestone papers shortly after their publication. That's because to excel in a time-unaware metric, a paper needs to attract enough attention to compete with papers that were published long ago, and this of course takes time. Second, rescaled PageRank performs best over the whole time range. Notably, rescaled citation count performs worse than rescaled PageRank, which suggests that analyzing the full topology of the citation network is indeed more effective than just counting the incoming citations.

Finally, the following video shows a dynamical comparison of the ranking of the PRL milestone papers by PageRank and rescaled PageRank. One can see that half a year after publication, rescaled PageRank ranks all milestone papers better than PageRank (dots lie above the diagonal). At this time, 21 milestone papers are in the top 1% by rescaled PageRank (blue and green regions together), while only 1 milestone paper is in the top 1% by PageRank (green and red regions together). As time goes on, most milestones first move up (their rescaled PageRank ranking improves faster than their PageRank ranking) and only then move right (their PageRank ranking improves while their rescaled PageRank ranking stagnates). Fifteen years after publication, most milestones lie close to the diagonal line: their rankings by PageRank and rescaled PageRank are similar.