visualizing citations predicted by reads

05 Jun 2017


About the plot

The position of each data point along the x-axis, as well as its size, is controlled by a single paper's read count, and its position along the y-axis is controlled by how many times the paper was cited by other papers in the ScienceOpen.com database. If you hover your mouse over a point, it will show you the exact stats and the title for that data point, i.e. a coded citation for the paper. You can download the imaginary data that I used for the plot here.

Made with plotly.js (MIT License)

Show me the code

You can mess around with the code I wrote at codepen.io. In this post, I'll show you how I created the data and reflect on how I plotted it.
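The actual pen is written in plotly.js, but since the rest of this post lives in Python, here's a rough sketch of the same idea using the plotly Python package (the variable names, toy numbers, and axis titles here are mine, not the ones from the pen):

import plotly.graph_objs as go
from plotly.offline import plot

# toy stand-ins for the simulated counts described below
read_count = [12, 55, 230, 980, 40]
citation_count = [1, 6, 25, 110, 3]
titles = ["Paper %d" % i for i in range(1, 6)]  # hover labels

trace = go.Scatter(
    x=read_count,
    y=citation_count,
    mode="markers",
    text=titles,  # shown when you hover over a point
    marker=dict(size=read_count,  # marker size also tracks the read count
                sizemode="area",
                sizeref=2.0 * max(read_count) / (40.0 ** 2)),
)
layout = go.Layout(xaxis=dict(title="read count"),
                   yaxis=dict(title="cited-by count"))
plot(go.Figure(data=[trace], layout=layout), filename="reads_vs_citations.html")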

What do the numbers mean?

Here's the short of it: each data point represents a research article from a particular author, along with 1) how many times it was read and 2) how many other authors have cited the paper in their own work.

Here's a longer story: I plotted imaginary data that approximately resembles the summary statistics maintained by ScienceOpen for authors listed on their website. In the ScienceOpen database, the string corresponding to an author's published name is a key to different pages on the ScienceOpen website. In addition to the author pages (which are free to host), almost all of the pages on ScienceOpen are articles from natural sciences or academic arts and humanities journals, published by a range of publishers. Some articles on ScienceOpen are attended to more frequently, both by readers and by other authors. You can see an example of an author's summary statistics on ScienceOpen here. This is getting a bit redundant, but note that a single article has both a read count and a cited-by count. A high read count means that a lot of people have read (or at least clicked the link for) the paper. A high cited-by count means that many authors have considered the article in their own work. There's a lot to say about ScienceOpen, but I wanted to provide a background for the numbers.

How do I get the numbers?

ScienceOpen doesn't host any statistics for download (at least not en masse). Typically, if a site doesn't do that, I'd just scrape it. If you're interested, I scrape using the python modules BeautifulSoup and Requests. Unfortunately for me, ScienceOpen generates its html documents dynamically, so scraping is a bit more challenging. Therefore, instead of scraping their site, I randomly generated two distributions of data that mimic the "read count" and the "cited by count" for an author's papers. I did that in my Mac terminal with python and its numpy package:

cookdj0128$  python
Python 2.7.12 |Anaconda custom (x86_64)| (default, Jul  2 2016, 17:43:17)
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.11.00)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org
>>> import numpy as np
>>> read_count = [int(np.random.lognormal(mean=4)) for i in range(140)]
>>> citation_count = [int(np.random.lognormal(mean=4)) for i in range(140)]

From there, you can pretty much copy and paste the variables read_count and citation_count into an Excel sheet, which I did here.
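If you'd rather skip the copy-and-paste step, a few lines of Python will write the two lists straight to a CSV file that Excel can open. This is just a sketch (the filename is made up), and it assumes read_count and citation_count from the session above:

import csv

with open("read_vs_cited.csv", "w") as f:
    writer = csv.writer(f)
    writer.writerow(["read_count", "citation_count"])  # header row
    writer.writerows(zip(read_count, citation_count))  # one row per paper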

Why use a lognormal distribution?

The short version is that when I analyzed similar summary stats maintained by Google Scholar for researchers, the distributions of those stats appeared to come from a log-normal distribution.

Log-normal distributions are called "log normal" because they can be transformed into a gaussian distribution, i.e. the familiar bell curve, by applying a log function to each data point in the distribution. One property of a lognormal distribution (when it's not transformed into a gaussian) is that most of the numbers bunch up toward one tail of the distribution, which is exactly what I needed: a situation in which many of an author's papers are read just a few times, but a few are read and cited very often.
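You can sanity-check both claims with a few lines of numpy (this isn't part of the plot, just a quick demonstration):

import numpy as np

samples = np.random.lognormal(mean=4, sigma=1, size=10000)
logged = np.log(samples)  # log-transform each data point

# the logged values look like the underlying normal distribution (mean ~4, sd ~1)
print("log-scale mean, sd: %.2f, %.2f" % (np.mean(logged), np.std(logged)))

# most of the raw values sit below the sample mean, i.e. they bunch up in the low tail
print("fraction above the mean: %.2f" % np.mean(samples > samples.mean()))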

To generate this data, I used numpy. The numpy function np.random.lognormal() takes three parameters: 1) the mean of the underlying normal distribution, 2) the standard deviation of the underlying normal distribution, and 3) the shape of the output array (size). Unless you want to generate a whole array of samples in one call, you can ignore the size parameter. For now, you just need to know that I used a mean of 4. I selected that mean empirically, looking at the resulting mean of the lognormal samples until I got something that roughly resembled a known author on ScienceOpen (there's a quick sketch of that tuning step at the end of the post).

So that's the long explanation for my use of a log-normal distribution. It's not totally accurate, but it let me simulate the data that I wanted to model. Personally, I think more companies should open up their data and allow others to ask interesting questions. This is just a small effort to make that point. In the future, I will be addressing the entire school about best practices for sharing data.
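P.S. Here's a rough sketch of what that "pick the mean empirically" step amounts to; it's not the exact commands I ran, just the idea of trying a few underlying means and watching where the average read count lands:

import numpy as np

# with the default sigma of 1, the sample mean lands near exp(m + 0.5)
for m in (2, 3, 4, 5):
    reads = np.random.lognormal(mean=m, size=10000)
    print("underlying mean %d -> average read count %.1f" % (m, np.mean(reads)))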