Blog post #3:
A glance at the data

(added on 20 December 2016)

Note that this blog post is based on the data last updated in October 2016. Some values are likely to change in the future, mainly because papers published in October 2016 and later will be added.

The APS data

The APS citation data from 1893 until December 2013 were kindly provided to us by the APS (see their datasets for research page). To make the data as up to date as possible, we crawled the APS website several times to obtain the metadata and reference lists for papers published after December 2013. The resulting dataset now spans from 1893 until early October 2016 and includes 589,017 papers by 306,101 authors. To assign authors to papers, we had to solve the name disambiguation problem (i.e., to determine that A. Einstein and Albert Einstein are the same author). As usual with real data, this gets really messy (people find many unusual ways to write their names). We used the common straightforward approach (see here, for example): we kept only the first two given names and represented them by their initials. Note that some of the “authors” are actually research collaborations (there are more than 400 of them).
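To make the initials-based approach concrete, here is a minimal sketch in Python. It is an illustration under our own simplifying assumptions rather than the exact code used for the dataset: the helper name author_key is hypothetical, names are assumed to be written as "given names, then surname", and the surname is assumed to be the last token (so multi-word surnames are not handled).

```python
def author_key(full_name: str) -> str:
    """Reduce a name to at most two given-name initials plus the surname.

    Hypothetical helper: assumes the surname is the last whitespace-separated
    token and everything before it is a given name or an initial.
    """
    parts = full_name.replace(".", " ").split()
    if not parts:
        return ""
    surname = parts[-1]
    # Keep only the first two given names and represent them by initials.
    initials = [p[0].upper() for p in parts[:-1][:2]]
    return " ".join(initials + [surname])


key_short = author_key("A. Einstein")
key_long = author_key("Albert Einstein")
key_three = author_key("J. A. B. Smith")
```

With this sketch, "A. Einstein" and "Albert Einstein" map to the same key, and a third given name is ignored. The price of such a simple rule is that different people who share a surname and initials are merged as well.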

The data allow us to take a look at the journal structure of the APS. It started with a single journal, Physical Review, which was later joined by Reviews of Modern Physics and (later still) Physical Review Letters, until multiple field-specific APS journals were established in 1970. As shown in the figure below, the cumulative number of papers published by the APS journals grew exponentially for a long time (a fit to the data from 1920 to 1970 gives an annual growth rate of 7.1%). The growth is now slower (see the gradually widening gap between the fitted line and the cumulative paper count), but one could argue that most journals also show exponential growth in the period from 1990 until now.
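For readers who want to reproduce this kind of fit, the sketch below estimates an annual growth rate by a least-squares fit of the logarithm of the cumulative count against the year. It is only one reasonable way to do it: the function name annual_growth_rate is ours, the rate is taken to be exp(slope) - 1 (an assumption about how the growth rate is defined), and the input series is synthetic, not the APS data.

```python
import math


def annual_growth_rate(years, cumulative_counts):
    """Fit log(count) = a + b * year by least squares and return exp(b) - 1."""
    logs = [math.log(c) for c in cumulative_counts]
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(logs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs))
    sxx = sum((x - mean_x) ** 2 for x in years)
    slope = sxy / sxx
    return math.exp(slope) - 1.0


# Synthetic series that grows by 7.1% per year (NOT the APS data).
years = list(range(1920, 1971))
counts = [1000 * 1.071 ** (y - 1920) for y in years]
rate = annual_growth_rate(years, counts)
```

On this synthetic series the fit recovers the 7.1% rate that was built into it. On real data the points scatter around the line, and the choice of the fitting window matters.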

Next we examine the citation patterns of the papers published by the APS. There are 85,998 papers (almost 15% of all papers) that have not been cited at all. The median citation count is 5, the mean is 11.8, and the largest citation count is 7,589 (achieved by the relatively young paper Generalized Gradient Approximation Made Simple, which has recently surpassed the classic paper by Kohn and Sham with 6,808 citations). These aggregate characteristics already suggest that the citation count distribution is very broad. This is confirmed by the figure below, which shows that the distribution tail is close to a power law. Statistical analysis shows that the power law exponent is 2.77 ± 0.01 and that the power law applies to citation counts of 60 and above. It is interesting to note that the five topical APS journals (PRA-PRE) and PRL have similar citation count distributions, with PRE being somewhat different in having a narrower distribution (the fitted power law exponent is 3.36 ± 0.09, which is substantially more than the overall exponent of 2.77).
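As an illustration of how such an exponent can be estimated, the sketch below uses the widely used maximum-likelihood estimator for a power-law tail (described by Clauset, Shalizi and Newman), with the lower cutoff x_min treated as given. This is not necessarily the procedure behind the numbers quoted above; the function name powerlaw_exponent is ours, and the sample is synthetic, not the APS data.

```python
import math


def powerlaw_exponent(values, x_min, discrete=True):
    """Maximum-likelihood estimate of alpha for a tail p(x) ~ x^(-alpha), x >= x_min.

    Returns (alpha, standard_error). With discrete=True, x_min is shifted by 0.5,
    the usual approximation for integer-valued data such as citation counts.
    """
    tail = [x for x in values if x >= x_min]
    if not tail:
        raise ValueError("no values at or above x_min")
    lower = x_min - 0.5 if discrete else x_min
    log_sum = sum(math.log(x / lower) for x in tail)
    alpha = 1.0 + len(tail) / log_sum
    std_err = (alpha - 1.0) / math.sqrt(len(tail))
    return alpha, std_err


# Synthetic continuous sample with a known exponent (NOT the APS data):
# evenly spaced quantiles pushed through the inverse CDF of a power law.
true_alpha, x_min, n = 2.77, 60.0, 10_000
sample = [x_min * (1.0 - (i + 0.5) / n) ** (-1.0 / (true_alpha - 1.0)) for i in range(n)]
alpha_hat, alpha_err = powerlaw_exponent(sample, x_min, discrete=False)
```

The estimate on the synthetic sample lands very close to the exponent used to generate it. For real citation counts one would call the function with discrete=True, and the result depends on the chosen x_min, which is usually selected by comparing the fitted and empirical tails.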

Rescaled PageRank

Besides rescaled PageRank's lack of age bias and its superior ranking of milestone papers (see the second blog entry for more details), it is useful to be aware of a couple of other features of this metric: