Deeper into rescaled PageRank

The classical PageRank metric operates on directed networks, and it did not take long for researchers to realize that the directed network of citations among scientific papers could be a natural use case for PageRank. While an interesting attempt, it quickly became obvious that the PageRank scores of papers are strongly influenced by their age; this time bias has recently been studied in a general setting. The reason for the bias in citation data is simple: a scientific paper can only cite past papers. The citation network thus has an implicit arrow of time built in, and since PageRank scores flow along the network's directed links, in the citation network they can only flow back in time. This is how the bias of PageRank toward old papers is born (the teleportation term in the PageRank equation merely weakens the effect).
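The backward flow of score can be demonstrated on a toy chain network (a minimal sketch in plain Python; the five-paper chain and all parameter values are illustrative assumptions, not from the text):

```python
# Toy illustration of the time bias: in a citation network, score can
# only flow toward older papers. We build a chain where each paper
# cites its predecessor and run a plain power-iteration PageRank.

def pagerank(out_links, d=0.85, n_iter=100):
    """Basic PageRank with damping factor d; dangling nodes
    (papers citing nothing) spread their score uniformly."""
    nodes = list(out_links)
    n = len(nodes)
    score = {v: 1.0 / n for v in nodes}
    for _ in range(n_iter):
        new = {v: (1.0 - d) / n for v in nodes}
        for v in nodes:
            targets = out_links[v]
            if targets:  # distribute score along outgoing citations
                share = d * score[v] / len(targets)
                for w in targets:
                    new[w] += share
            else:  # dangling node: spread score uniformly
                for w in nodes:
                    new[w] += d * score[v] / n
        score = new
    return score

# Papers 0 (oldest) .. 4 (newest); each paper cites its predecessor.
citations = {0: [], 1: [0], 2: [1], 3: [2], 4: [3]}
scores = pagerank(citations)
print(scores)  # paper 0, the oldest, ends up with the highest score
```

The oldest paper accumulates the score of the entire chain, while the newest paper receives only the teleportation term, which is exactly the age bias described above.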

Once we realize that the time bias is innate to PageRank results on citation data, the natural question is how to remove it to make papers of different age comparable with each other. The simplest possibility is to compute the PageRank scores of all papers, *p _{i}* for paper *i*, and then rescale each score against the scores of papers of similar age:

*R*(*p _{i}*) = (*p _{i}* − *μ _{i}*) / *σ _{i}*,

where *μ _{i}* and *σ _{i}* are the mean and standard deviation of the PageRank scores of papers published around the same time as paper *i*. By construction, the rescaled score measures how paper *i* compares with its age peers, so papers of any age can score well.
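The rescaling procedure can be sketched as follows (a minimal illustration, assuming papers are already sorted from oldest to newest; the window size and the sample scores are made-up values, not from the text):

```python
from statistics import mean, stdev

def rescaled_scores(p, window=3):
    """Rescale scores p (papers sorted from oldest to newest): for each
    paper i, subtract the mean and divide by the standard deviation of
    the scores of papers published around the same time."""
    n = len(p)
    R = []
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        peers = p[lo:hi]  # papers of similar age, paper i included
        mu, sigma = mean(peers), stdev(peers)
        R.append((p[i] - mu) / sigma if sigma > 0 else 0.0)
    return R

# Raw scores decay with paper age rank (old papers score higher);
# after rescaling, each paper is compared only with its age peers.
p = [0.30, 0.25, 0.20, 0.10, 0.08, 0.05, 0.02]
print(rescaled_scores(p))
```

A paper then stands out only if it outscores papers of a similar age, rather than by simply being old.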

While it is nice to have a score without any time bias, the question is whether the ranking it produces is useful at all. In particular, how does it compare with the rankings produced by other metrics of paper impact, such as the citation count and PageRank itself? To assess a ranking of papers, we used a recently published list of milestone papers selected by APS editors for "making long-lived contributions to physics, either by announcing significant discoveries, or by initiating new areas of research". While the selection enjoys the benefit of hindsight (the most recent milestone papers are from 2001), in the evaluation we can look at how the papers are ranked shortly after their publication. Choosing the fraction of milestone papers that rank in the top 1% of all papers as the evaluation metric (we refer to this as the identification rate), the following figure shows how the identification rate achieved by various metrics changes with the time since the publication of the milestone letters. The five included metrics are: citation count *c*, PageRank *p*, rescaled citation count *R*(*c*), rescaled PageRank *R*(*p*), and CiteRank *T* (rescaled citation count is obtained by a rescaling procedure analogous to that of rescaled PageRank). Rescaled citation count, rescaled PageRank, and CiteRank take paper publication time into account and, to various degrees, remove the time bias.
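The identification rate itself is straightforward to compute (a sketch; the scores and milestone identifiers below are hypothetical, and only the top-1% threshold comes from the text):

```python
def identification_rate(scores, milestone_ids, top_fraction=0.01):
    """Fraction of milestone papers that rank in the top fraction of
    all papers when sorted by score (higher score = better rank)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    top = set(ranked[: max(1, int(len(ranked) * top_fraction))])
    hits = sum(1 for m in milestone_ids if m in top)
    return hits / len(milestone_ids)

# Hypothetical example: 200 papers, papers 0 and 1 are "milestones".
scores = {i: 1.0 / (i + 1) for i in range(200)}
print(identification_rate(scores, [0, 1]))  # both in top 1% -> 1.0
```

Evaluating this quantity at increasing times after publication, with the scores recomputed on the citation network as it existed at each time, yields curves like those in the figure.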

There are two points to note in the figure. First, the three time-aware metrics clearly outperform the usual time-unaware metrics (citation count and PageRank) in ranking the milestone papers shortly after their publication. That is because to excel in a time-unaware metric, a paper needs to attract enough attention to compete with papers published long ago, and this takes time, of course. Second, rescaled PageRank performs best over the whole time range. Notably, rescaled citation count performs worse than rescaled PageRank, which suggests that analyzing the full topology of the citation network is indeed more effective than just counting incoming citations.

Finally, the following video shows a dynamic comparison of the rankings of the PRL milestone papers by PageRank and rescaled PageRank. One can see that half a year after publication, rescaled PageRank ranks all milestone papers better than PageRank (the dots lie above the diagonal). At this time, 21 milestone papers are in the top 1% by rescaled PageRank (blue and green regions together), while only 1 milestone paper is in the top 1% by PageRank (green and red regions together). As time goes on, most milestones first move up (their rescaled PageRank ranking improves faster than their PageRank ranking) and only then move right (their PageRank ranking improves while their rescaled PageRank ranking stagnates). Fifteen years after publication, most milestones lie close to the diagonal line: their rankings by PageRank and rescaled PageRank are similar.