ADS author connectivity graph

This notebook produces a graph of authors of papers that match a given query, then finds a good grouping and works out which keywords are most associated with each group. It has only been tested with Python 3.6.

We use four Python packages in this tutorial that you may need to install (assuming that, like most astronomers, you already have numpy and matplotlib). The first is the "ads" package; note that you will need an ADS token for this, installed in the correct place for the ads library to find it. The second is plotly. The third is nltk. The fourth is graph-tool. All can be installed via conda or pip. To install graph-tool via conda, it is easiest to create a new conda environment and add it as an IPython kernel via the commands

conda create -n graph-tool -c ostrokach-forge -c conda-forge -c defaults --override-channels 'python=3.6' graph-tool
python -m ipykernel install --user --name graph-tool --display-name "Python3 (graph-tool)"

and then install the other packages in the graph-tool environment. Before nltk will work, you will also need to download its data files via nltk.download(), or fetch just the pieces this notebook needs, as in the sketch below.
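A minimal sketch of the targeted download, assuming the tokenizer models and stopword lists are all you need (which is all this notebook uses):

import nltk

# Fetch just the resources used below, rather than the full nltk data set:
nltk.download('punkt')      # tokenizer models used by nltk.word_tokenize
nltk.download('stopwords')  # English stopword list used to filter abstract words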

This notebook can be cloned from https://github.com/msimet/ads_clouds.

In [1]:
import itertools
import re
import string
from collections import defaultdict, Counter
from operator import itemgetter

import numpy as np
import matplotlib.pyplot as plt
import graph_tool.all as gt
import plotly.offline as py
import plotly.graph_objs as go

import nltk

import ads
%matplotlib inline
# This next line allows us to use plotly in offline mode (i.e., without logging in)
py.init_notebook_mode(connected=True)

First, let's create a class to grab the data and do some preprocessing. Here's what's going on.

The ADS will return at most 2000 rows, so if the user requests more than that, we raise an error. We make a SearchQuery object, then save the results of the query so we can reuse them. Different journals use different name conventions (e.g. "M. Simet" vs. "Melanie Simet"), so we will force everything into the "M. Simet"-type format in a simplistic way--this will combine, say, John Smith with James Smith, but that will be rarer than splitting J. Smith and John Smith when they should be the same author. Then we find all the unique authors and drop the ones with the fewest papers until we have max_authors (or fewer) remaining, so that our plot isn't too crowded; authors tied at the cutoff count are all dropped, as illustrated in the short example below.
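To make the cutoff behavior concrete, here is a toy run of the same logic used in measure_authors below, with hypothetical names and counts:

from collections import Counter

# With max_authors=3: keep only authors with strictly more papers than the
# fourth-most-prolific author, so everyone tied at the cutoff is dropped.
author_counts = Counter({'A. Aa': 5, 'B. Bb': 4, 'C. Cc': 2, 'D. Dd': 2, 'E. Ee': 2})
most_common = author_counts.most_common(3 + 1)  # top max_authors+1 authors
max_papers = most_common[-1][1] + 1             # one more than the cutoff count: 3
kept = [a for a, c in author_counts.most_common(3) if c >= max_papers]
print(kept)  # ['A. Aa', 'B. Bb']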

In [4]:
class ADSGraphData:
    """ Query ADS, then perform some massaging of the results for easier use.
    
        Parameters
        ----------
        query: str
            The ADS query you wish to run
        rows: int
            The number of results to return (maximum 2000)
        max_authors: int
            The maximum number of authors to retain in the authors list.
    """
    def __init__(self, query, rows=2000, max_authors=100):
        if rows > 2000:
            raise ValueError("Max 2000 rows will be returned by a query to ADS. Please request "
                             "a number no greater than this value.")
        self.original_query = query
        self.q = ads.SearchQuery(q=query, fl=['abstract', 'author'], rows=rows)
        self.results = [r for r in self.q]
        self.rows = rows
        self.max_authors = max_authors
        self.raw_results = self.results.copy()
        self.fix_authors()
        self.measure_authors()
        self.clean_authors()
    def __fix_author(self, a):
        """ Convert one "Lastname, First Middle" string (or a list of them) to "Lastname F. M." form. """
        if isinstance(a, list):
            return [self.__fix_author(aa) for aa in a]
        asplit = a.split(',')
        if len(asplit) == 1:  # no comma: surname only
            asplit = [asplit[0], '']
        if len(asplit) > 2:   # extra commas (e.g. "Jr."): fold everything after the surname together
            asplit = [asplit[0], ','.join(asplit[1:])]
        # Keep the surname and reduce each given name to an initial
        return ' '.join([asplit[0]] + [aa[0].upper()+'.' for aa in asplit[1].split()])
    def fix_authors(self):
        """ Change all authors to "F. Lastname" form """
        for paper in self.results:
            paper.author = self.__fix_author(paper.author)
    def measure_authors(self):
        """ Count how many times each author appears on a paper, and keep a record of up to
            self.max_authors.
        """
        authors = list(itertools.chain.from_iterable([paper.author for paper in self.results]))
        author_counts = Counter(authors)
        # The cutoff is one more than the paper count of the (max_authors+1)-th author,
        # so authors tied at the cutoff are dropped rather than pushing us past max_authors.
        most_common = author_counts.most_common(self.max_authors+1)
        max_papers = most_common[-1][1]+1
        self.unique_authors = [auth for auth, count in author_counts.most_common(self.max_authors) 
                                        if count >= max_papers]
        self.unique_author_counts = [count for auth, count in author_counts.most_common(self.max_authors) 
                                               if count >= max_papers]
    def clean_authors(self):
        """ Remove authors with only a few papers from the dataset. """
        for paper in self.results:
            paper.author = [a for a in paper.author if a in self.unique_authors]
        

Given what I do, I'm most interested in weak lensing papers, so I'll write a query to find them.

In [5]:
query = '"weak lensing" OR "weak gravitational lensing"'
gd = ADSGraphData(query, max_authors=500)
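A quick, illustrative look at what came back (the exact numbers will depend on when you run the query):

print(len(gd.results), "papers retrieved")
print(gd.unique_authors[:5])        # most prolific authors, in "Lastname F." form
print(gd.unique_author_counts[:5])  # their paper counts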

Our "network" here is a set of authors, joined together by coauthorship on papers. To track this, let's make a giant table that measures coauthorship. Each row/column of the table represents a single author from our list of unique authors; the count in diagonal cells is the number of papers authored by that author, and the count in off-diagonal cells represents the number of papers each set of authors cowrote together. To understand how closely linked two authors are, we will divide that matrix by the expectation value for how many papers each pair of authors $i,j$ would have coauthored assuming they each wrote $n_i$ or $n_j$ of the $N$ total papers in our data set.

In [6]:
paper_authors = [paper.author for paper in gd.results]
author_dict = {a: i for i, a in enumerate(gd.unique_authors)}
lenu = len(gd.unique_authors)

counts = np.zeros((lenu, lenu))
for authors in paper_authors:
    for author1 in authors:
        for author2 in authors:
            counts[author_dict[author1], author_dict[author2]] += 1
# If every author was on a random subset of papers to make up their total paper count,
# this is how often you would expect two people to be coauthors: n_i * n_j / N.
# [:, None] turns a row vector into a column vector.
n_papers = len(gd.results)  # the actual number of papers returned, not the requested rows
expected_counts = np.diag(counts)*np.diag(counts)[:, None]/n_papers
weights = counts*1.0/expected_counts
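As an illustrative sanity check (not part of the original analysis), we can ask which pair of authors is most strongly linked relative to chance:

# Zero out the diagonal (an author's link to themselves) before taking the max
off_diag = weights - np.diag(np.diag(weights))
i, j = np.unravel_index(np.argmax(off_diag), off_diag.shape)
print(gd.unique_authors[i], "&", gd.unique_authors[j], ":", off_diag[i, j])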

We can turn this into an "edge list"--a list of the connections between authors $i$ and $j$--weighted by the matrix above, so authors with more coauthorship are more strongly linked.

In [7]:
edge_list = []

for i, key1 in enumerate(gd.unique_authors):
    for j, key2 in enumerate(gd.unique_authors):
        # Keep only the upper triangle (the matrix is symmetric) and skip unconnected pairs
        if j <= i or weights[i][j] == 0:
            continue
        edge_list.append((i, j, weights[i][j]))

Now we'll make a graph-tool graph out of this data. We add a set of vertices corresponding to all our authors, and then add the edge list, with the weights as an additional edge property.

In [8]:
g = gt.Graph(directed=False)
g.add_vertex(lenu)
# Call this something other than `weights` so we don't shadow the weight matrix above
eweight = g.new_edge_property('float')
g.add_edge_list(edge_list, eprops=[eweight])

A fun thing you can do with a graph is to partition it into connected blocks (a stochastic block model). graph-tool fits this via MCMC, so it doesn't always find the optimal solution; we'll do 7 runs and pick the best one, defined as the one with the lowest entropy (description length).

In [9]:
blocklist = []
entropylist = []
for i in range(7):
    # The weights are supposed to be one of five types. This isn't exactly 'real-exponential'
    # but it's closer than any of the other options--it doesn't seem to perform too badly, anyway.
    blocks = gt.minimize_blockmodel_dl(g, state_args=dict(recs=[eweight], rec_types=['real-exponential']))
    blocklist.append(blocks)
    entropylist.append(blocks.entropy())
blocks = blocklist[np.argmin(entropylist)]
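As a quick illustrative check on the winning run, we can ask how many blocks it found and what its description length is:

print(blocks.get_B(), "blocks; description length =", blocks.entropy())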

We also want to know which words are most distinctive in each block's abstracts. We'll measure this by comparing each block's word counts against those of the corpus as a whole (which also suppresses rare "words" like measurement numbers) and then asking which words are most overrepresented for each set of authors. First, make a list of all abstracts and tokenize them into a single corpus-wide word count, in a very simple way, since the language of abstracts is fairly structured already:

In [10]:
abstracts = [paper.abstract for paper in gd.results]
abstract_tokens = nltk.word_tokenize(' '.join([a.lower() for a in abstracts if a]))
n_abs_words = len(abstract_tokens)
abstract_counter = Counter(abstract_tokens)

Now, let's make a dataset of abstracts from each block in our graph. We'll count abstracts multiple times if multiple authors from the same block were coauthors on the same paper, since those are likely to be more typical than papers that only one author of the block contributed to.

In [11]:
# author_dict: the indices of the papers that each author appears on
# (note: this replaces the earlier author_dict that mapped authors to matrix indices)
author_dict = defaultdict(list)
for i, paper in enumerate(gd.results):
    for author in paper.author:
        author_dict[author].append(i)
block_indices = blocks.get_blocks()
# block_authors: the names of the authors in each block
block_authors = defaultdict(list)
for i, vertex in enumerate(g.vertices()):
    block_authors[block_indices[i]].append(gd.unique_authors[i])
# block_abstract_wordcounts: the word counts for the trimmed abstracts in each block
block_abstract_wordcounts = []
block_abstract_nwords = []
for block in sorted(block_authors):  # sorted so list order matches the block indices used later
    these_abstracts = [abstracts[i] for author in block_authors[block] for i in author_dict[author]]
    # lower() to avoid capitalization problems; the replace() because LaTeX backquotes
    # aren't recognized by nltk; the join merges all abstracts into one long string
    tokens = nltk.word_tokenize(' '.join([t.lower().replace('`', "") for t in these_abstracts if t]))
    block_abstract_nwords.append(len(tokens))
    block_abstract_wordcounts.append(Counter(tokens))
In [12]:
def is_punctuation_or_digit(s):
    """ True if s is entirely punctuation, digits, and/or plus-minus signs (including LaTeX \pm). """
    return all(ss in string.punctuation or ss in string.digits or ss == "±"
               for ss in ''.join(s.split(r'\pm')))
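A few illustrative checks of this filter:

print(is_punctuation_or_digit('0.781'))      # True: digits and punctuation only
print(is_punctuation_or_digit(r'\pm0.037'))  # True: LaTeX plus-minus with digits
print(is_punctuation_or_digit('sigma'))      # False: an actual word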
In [13]:
top_block_uniques = []
stopwords = nltk.corpus.stopwords.words('english')
abs_stopwords = ['.', ',', '<', '>', ')', '(', 'sub', '/sub']  # tokens were lowercased above
for b_abs, b_n in zip(block_abstract_wordcounts, block_abstract_nwords):
    # Score each word by how much its block usage outstrips its corpus-wide usage
    word_overages = {key: 1.0*b_abs[key]*b_n/n_abs_words - abstract_counter[key] for key in b_abs}
    b_words = [key for key in word_overages
               if key not in stopwords and key not in abs_stopwords and not is_punctuation_or_digit(key)]
    b_words.sort(key=lambda x: word_overages[x], reverse=True)
    top_block_uniques.append(b_words[:10])

Now, we will make a graph of the connections between all the authors. We'll get the layout from graph-tool, then use plotly to make an interactive graph (easier to visualize).

In [14]:
# Find optimal vertex positions from graph-tool
pos = gt.sfdp_layout(g, eweight=eweight)
In [15]:
# Make some layouts for plotly graph
Xn = [p[0] for p in pos]
Yn = [p[1] for p in pos]

# Each trace can have only one line width, so make a different trace for every line width.
set_weights = set([e[2] for e in edge_list])
list_Xe = []
list_Ye = []
list_W = []
for w in set_weights:
    # Using w directly made the plot more crowded; this is easier to read
    list_W.append(np.sqrt(w))
    Xe=[]
    Ye=[]
    for edge in edge_list:
        if edge[2] == w:
            Xe += [Xn[edge[0]], Xn[edge[1]], None]
            Ye += [Yn[edge[0]], Yn[edge[1]], None]
    list_Xe.append(Xe)
    list_Ye.append(Ye)
In [16]:
# Turn this into objects for plotly to display
trace_list = []
for Xe, Ye, w in zip(list_Xe, list_Ye, list_W):
    trace_list.append(
        go.Scatter(x=Xe, y=Ye, mode='lines',
                   line=dict(color='rgb(125,125,125)', width=0.1*w),
                   hoverinfo='none')
        )

trace_list.append(
    go.Scatter(x=Xn, y=Yn,
               mode='markers', name='authors',
               marker=dict(symbol='circle', size=6,
                           color=list(block_indices), colorscale='Viridis',
                           line=dict(color='rgb(50,50,50)', width=0.5)
                           ),
               text=list(gd.unique_authors), hoverinfo='text')
    )

axis=dict(autorange=True,
          showgrid=False,
          zeroline=False,
          showline=False,
          ticks='',
          showticklabels=False)


layout = go.Layout(
             title='Network of top {} authors from the ADS query "{}"'.format(gd.max_authors, gd.original_query),
             width=1000, height=1000, showlegend=False,
             xaxis=axis, yaxis=axis,
             margin=dict(t=100),
             hovermode='closest')
In [17]:
fig=go.Figure(data=trace_list, layout=layout)

py.iplot(fig, filename='ads_author_graph')

The points are colored by which block they're part of, by the way! We can also print the authors and typical abstract words for each block. These aren't perfect--below, you may notice that "Refregier A." and "Réfrégier A." are treated as separate authors (as are "Van Waerbeke L." and "van Waerbeke L."), despite really being the same people, and a fair amount of mathematical text gets into the abstract words; a possible fix for the name problem is sketched below. Still, there are some real groupings here, with words that indicate what each grouping works on (such as the first group, for KiDS and CFHTLenS, or the fifth group, who have done a lot of work on galaxy cluster weak lensing).
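One simplistic way to merge those variants--a sketch, not part of the pipeline above--would be to strip accents and normalize case during the name cleanup (e.g., inside __fix_author):

import unicodedata

def normalize_name(name):
    # Decompose accented characters, drop the combining marks, and normalize case,
    # so "Réfrégier A."/"Refregier A." and "van"/"Van" map to the same string
    stripped = ''.join(c for c in unicodedata.normalize('NFKD', name)
                       if not unicodedata.combining(c))
    return stripped.title()

print(normalize_name('Réfrégier A.'))    # Refregier A.
print(normalize_name('van Waerbeke L.')) # Van Waerbeke L.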

In [18]:
for i, unique in enumerate(top_block_uniques):
    authors = block_authors[i]
    print("For block {} including authors: {}".format(i, ', '.join(sorted(authors))))
    print("Top phrases: {}\n".format(', '.join(unique)))
For block 0 including authors: Erben T., Heymans C., Hildebrandt H., Hoekstra H., Kitching T. D., Kuijken K., Mellier Y., Miller L., Schneider P., Simon P., Van Waerbeke L., van Waerbeke L.
Top phrases: hde, cfhtlens, lde, h_, ugri, kilo-degree, stellar-to-halo, kilo, kids-i-800, gama

For block 1 including authors: Annis J., Dodelson S., Frieman J., Huterer D., Jain B., Jarvis M., Lin H., Nichol R. C., Rozo E., Sheldon E.
Top phrases: bpz, photometric-redshift, ×457, rivals, lightcurve, year-one, 10.8\sigma, 6.8\sigma, \equiv, skynet

For block 2 including authors: Clowe D., Dahle H., Johnston D., Mandelbaum R., Massey R., Rhodes J., Seljak U., Wittman D.
Top phrases: shear-measurement, blindly, unattained, pixellization, benchmark, implementations, desirable, facilitating, posed, communities

For block 3 including authors: Bacon D. J., Bernstein G., Bridle S., Frieman J. A., Joachimi B., Lanusse F., Leonard A., Refregier A., Rowe B., Réfrégier A., Schrabback T., Starck J. -.
Top phrases: downsizes, nulled, pseudo-cl, e'-convergence, b'-modes, capacity, vlass, degs, l-band, a-array

For block 4 including authors: Bahcall N., Battaglia N., Calabrese E., Futamase T., Geller M. J., Hamana T., Hughes J. P., Komiyama Y., Medezinski E., Miyatake H., Miyazaki S., More S., Nishimichi T., Nishizawa A. J., Oguri M., Okabe N., Sato M., Sehgal N., Shirasaki M., Spergel D. N., Takada M., Takahashi R., Umetsu K., Utsumi Y., Yoshida N.
Top phrases: hsc-ssp, mcxc, camira, emulated, atacama, jk, non-mcxc, hyper, polarimeter, sz-estimated

For block 5 including authors: Amara A., Bacon D., Choi A., Hilton M., Makler M., Seitz S., Simet M.
Top phrases: nulled, photometric-redshift, ×457, rivals, lightcurve, 10σ-15σ, diameters, arcmin…1°, cylinders, ̃116

For block 6 including authors: Abdalla F. B., Armstrong R., Bertin E., Chang C., Dietrich J. P., Gruen D., Kacprzak T., Kirk D., Melchior P., Menanteau F., Roodman A., Rykoff E. S., Zhang Y., Zuntz J.
Top phrases: y1, im3shape, sv, bpz, sp, metacalibration, blinded, verification, w_a, w_p

For block 7 including authors: Babul A., Baccigalupi C., Baldi M., Bartelmann M., Benítez N., Bertin G., Blandford R. D., Brainerd T. G., Broadhurst T., Challinor A., Diaferio A., Dolag K., Ettori S., Fan Z., Franx M., Giocoli C., Haiman Z., Higuchi Y., Inoue K. T., Kaiser N., Kamionkowski M., Koekemoer A., Koyama K., Kratochvil J. M., Li B., Lilje P. B., Lin C., Liu X., Lombardi M., Lu T., Luppino G., Luppino G. A., Maddox S. J., Marian L., Martinelli M., Maturi M., May M., Mazzotta P., Melchiorri A., Meneghetti M., Mo H. J., Moscardini L., Nonino M., Okura Y., Pan C., Peel A., Pen U., Perrotta F., Petri A., Postman M., Romano A., Rosati P., Scaramella R., Sereno M., Smith R. E., Squires G., Starck J., Wang S., Wang Y., White R. L., Wilson G., Yamauchi D., Yang X., Zhang J., Zhang P., Zhang T., Zitrin A., van den Bosch F. C.
Top phrases: m_☉, gg, pa, azimuthally, adopts, shear-and-magnification, lbc, =21^\circ, 7^\circ, dr7

For block 8 including authors: Allam S., Becker M. R., Bernstein G. M., Bridle S. L., Burke D. L., Cunha C. E., Desai S., Evrard A. E., Jeltema T., Kent S., Krause E., Lahav O., Lima M., MacCrann N., Martini P., Miller C. J., Miquel R., Mohr J. J., Plazas A. A., Romer A. K., Scarpine V., Schubnell M., Soares-Santos M., Tarle G., Troxel M. A., Wechsler R. H., Weller J., Wester W.
Top phrases: des, y1, verification, sv, cdm, im3shape, redmapper, year, sp, troughs

For block 9 including authors: Benabed K., Benjamin J., Bernardeau F., Coupon J., Fu L., Hudson M. J., Kilbinger M., Rowe B. T. P., Semboloni E., Vafaei S., Velander M.
Top phrases: downsizes, u*g, u*, //www.cadc-ccda.hia-iha.nrc-cnrc.gc.ca/community/cfhtlens/query.html, how-to, manuals, density-fluctuations, distance-ladder, ladder, \a

For block 10 including authors: Barber A. J., Birkinshaw M., Brown M. L., Camera S., Coles P., Gray M. E., Harrison I., Heavens A., Heavens A. F., Kay S. T., Kiessling A., Kitching T., Metcalf R. B., Munshi D., Patel P., Pires S., Schäfer B. M., Smail I., Taylor A., Taylor A. N., Teyssier R., Valageas P., White M., Whittaker L.
Top phrases: jvla, capacity, vlass, degs, l-band, a-array, unsurpassed, jvla-accessible, pseudo-cl, e'-convergence

For block 11 including authors: Blake C., Cacciato M., Cooray A., Covone G., Grado A., Harnois-Déraps J., Herbonnet R., Hojjati A., Loveday J., Merten J., Nakajima R., Napolitano N., Radovich M., Sifón C., Tewes M., Viola M., Yee H. K. C., van Uitert E.
Top phrases: nulled, ̃450, s_8≡, _8√, pre-planck, 2.3σ, four-band, high-level, //kids.strw.leidenuniv.nl, downsizes

For block 12 including authors: Abbott T. M. C., Avila S., Benoit-Lévy A., Brooks D., Buckley-Geer E., Carnero Rosell A., Carrasco Kind M., Carretero J., Castander F. J., Crocce M., D'Andrea C. B., Davis C., De Vicente J., Diehl H. T., Doel P., Eifler T. F., Flaugher B., Fosalba P., García-Bellido J., Gaztanaga E., Gerdes D. W., Gruendl R. A., Gschwend J., Gutierrez G., Hartley W. G., Honscheid K., Hoyle B., James D. J., Kuehn K., Kuropatkin N., Maia M. A. G., March M., Marshall J. L., Nord B., Sanchez E., Schindler R., Sevilla-Noarbe I., Smith M., Smith R. C., Sobreira F., Suchyta E., Swanson M. E. C., Thomas D., Vikram V., Walker A. R., da Costa L. N.
Top phrases: des, y1, cdm, bins, λ, z, year, verification, sv, science

For block 13 including authors: Amendola L., Bahcall N. A., Brinkmann J., Böhringer H., Clarkson C., Connolly A., Csabai I., Dell'Antonio I., Dell'Antonio I. P., Fischer P., Hirata C. M., Hui L., Ishak M., Joffre M., Khiabanian H., Koester B. P., Kubo J., Marra V., McKay T., McKay T. A., Nichol B., Padmanabhan N., Perlmutter S., Peterson J., Peterson J. R., Racusin J., SDSS Collaboration, Scranton R., Shapiro C., Sheldon E. S., Sholl M., Smith R., Stebbins A., Tyson J. A., Tyson T., Uzan J., Vallinotto A., Wittman D. M., Zhan H.
Top phrases: m_260, sdss/rass, gmcf, sigma_+, replace, five-color, ∆z=0.018, ergs, =γ, t0

For block 14 including authors: Adami C., Bergé J., Bradač M., Capak P., Carvalho C. S., Cohen J., Courbin F., Cypriano E. S., Doré O., Ebeling H., Eifler T., Ellis R., Ellis R. S., Er X., Finoguenov A., Fort B., García Lambas D., Gavazzi R., George M. R., Gonzalez E. J., Hartlap J., Hetterscheidt M., Hilbert S., Hirata C., Hobson M. P., Hu W., Ilbert O., Johnston D. E., Jullo E., King L., King L. J., Kneib J., Kneib J. -., Leauthaud A., Limousin M., Mahdavi A., Maoli R., Marshall P. J., Massey R. J., McCarthy I. G., McCracken H. J., Meyers J. E., Meylan G., Natarajan P., Paulin-Henriksson S., Rhodes J. D., SNAP Collaboration, Schimd C., Schirmer M., Schmidt F., Scoville N., Shan H., Smith G. P., Soucail G., Tanaka M., Taylor J. E., Tereno I., Treu T., Zentner A. R.
Top phrases: aura, cooperative, national, inc., universities, m-l, nasa/esa, observatory, 175.a-0839, kitt

For block 15 including authors: Barden M., Borch A., Dye S., Jahnke K., Jogee S., McIntosh D. H., Meisenheimer K., Merkel P. M., Vale C., Wisotzki L., Wolf C.
Top phrases: nulled, consecutively, semi-time-dependent, ω*, 901α, 901alpha, go-10395, affording, 4.7σ, 0.5×0.5

For block 16 including authors: Allen S. W., Brodwin M., Burchat P. R., Gladders M. D., Gonzalez A. H., High F. W., Jee M. J., Jones C., Knox L., Liu J., Markevitch M., Norman D. J., Reiprich T. H., Vikhlinin A., Williams L. L. R.
Top phrases: \xi, \nu\lambda, \sigma_8=0.781\pm0.037, 2.3\sigma, 2.5\sigma, w=-1.55\pm0.41, z\sim1.7, clay, colour-cut, \omega_\mathrm

For block 17 including authors: Applegate D., Carlstrom J. E., Israel H., Mantz A., Marrone D. P., Mohr J., Rapetti D., Reichardt C. L., Saro A., von der Linden A.
Top phrases: \xi, \nu\lambda, \sigma_8=0.781\pm0.037, 2.3\sigma, 2.5\sigma, w=-1.55\pm0.41, z\sim1.7, galaxy-cluster-based, spt-selected, incurred

For block 18 including authors: Allen S., Benson B. A., Blazek J., Bonnett C., Clampitt J., DES Collaboration, Dark Energy Survey Collaboration, Huff E. M., Lidman C., McClintock T., Rykoff E., Wechsler R.
Top phrases: nulled, downsizes, photometric-redshift, ×457, rivals, lightcurve, cluster-member, simet, saro, spt-detected

For block 19 including authors: Alarcon A., Banerji M., Baxter E., Bechtol K., Capozzi D., Cawthon R., DePoy D. L., DeRose J., Drlica-Wagner A., Estrada J., Fernandez E., Gatti M., Giannantonio T., Hollowood D. L., Li T. S., Neilsen E., Ogando R., Ogando R. L. C., Palmese A., Prat J., Rau M. M., Reil K., Rollins R. P., Sako M., Samuroff S., Serrano S., Sánchez C., Tucker D. L., Varga T. N.
Top phrases: y1, bpz, im3shape, ∆z, n^i_pz, z^i, metacalibration, w_a, w_p, 6df

For block 20 including authors: Bilicki M., Brough S., Dvornik A., Hopkins A. M., Joudaki S., Klaes D., McFarland J., Norberg P., Robotham A. S. G., Valentijn E. A., Verdoes Kleijn G., de Jong J. T. A.
Top phrases: nulled, sheets, knots, ̃450, s_8≡, _8√, pre-planck, 2.3σ, ̃180, advancements