Blog Post 5: What’s Life Without Some Experimentation?

Mitali Palekar · GatesNLP · Apr 26, 2019

👋🏼 you! Early this week, we spent a ton of time evaluating and re-evaluating our ideas based on new findings and resources, especially after talks with folks from AI2. Then, once we finalized our plan for the next few weeks, we got to work creating our dataset of NLP papers and building our pipeline for both the baseline approaches and beyond.

Building our Game Plan

This week, we put a good amount of work into clarifying our plans going forward, including the evaluation plan we promised last week. Below, we go through our pipeline in order.

In addition to deciding which papers to include in our dataset (discussed in the next section), we decided to start with titles and abstracts, since they are the pieces of text readily available in our corpus. If we get access to papers’ full text and/or can parse PDFs ourselves, we hope to also use sections like introductions and conclusions to capture more of each paper’s content.

We plan for experiments with different models to be the heart of our work this quarter. We want to try single-task supervised learning (as we do here), multi-task learning, and an unsupervised approach that takes the cosine similarity of SciBERT embeddings [Beltagy et al., 2019]. We will start with pairwise comparisons to produce scores that we then turn into rankings, and we may move to nearest-neighbor search if we need better performance. We plan to train on pairs of papers (predicting how similarly we should rank them), with triplet loss as a stretch goal.
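As a concrete illustration of the unsupervised variant, here is a minimal sketch of scoring two abstracts by the cosine similarity of mean-pooled SciBERT embeddings. It assumes a recent version of the HuggingFace transformers library; the pooling choice and helper names are our own illustration, not a fixed part of our pipeline.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# SciBERT checkpoint released by AI2 (Beltagy et al., 2019).
tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")
model.eval()

def embed(text: str) -> torch.Tensor:
    """Mean-pool SciBERT's last hidden states into one vector per document."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

def scibert_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between two document embeddings."""
    return torch.cosine_similarity(embed(text_a), embed(text_b), dim=0).item()
```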

Training and evaluation can be completed on different tasks, so we plan to try different pairings. To start, we are only evaluating on citations (hoping that actually cited papers appear high in our model’s output rankings), but we can eventually evaluate on venues or entities. When we train our supervised models, we can also train on any of these tasks (citations, venues, etc.) or do multiple at once. We plan to try many of these combinations.

For our evaluation, we are planning to treat this as a ranking problem. This calls for ranking metrics such as mean reciprocal rank (MRR), which only considers the rank of the first “truly” relevant paper (depending on the task). We may also try F1 over the top k rankings and normalized discounted cumulative gain (nDCG), based on recommendations from people in the field.
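For concreteness, here is a minimal sketch of how MRR can be computed over our output rankings (function and variable names are ours, not from our actual codebase): for each query paper, take the reciprocal of the rank of the first truly relevant paper, then average over queries.

```python
def mean_reciprocal_rank(rankings, relevant):
    """rankings: {query_id: list of candidate ids, best first}
    relevant: {query_id: set of truly relevant ids (e.g. papers it cites)}"""
    total = 0.0
    for query_id, ranked_ids in rankings.items():
        for rank, candidate in enumerate(ranked_ids, start=1):
            if candidate in relevant.get(query_id, set()):
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings) if rankings else 0.0
```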

Dataset Creation

In our last blog post, we ran our first baseline approach (Jaccard similarity, explained in more detail below) on a subset of the Semantic Scholar corpus (10K papers, 12 MB) [built by Ammar et al., 2018]. After running this model, we achieved a mean reciprocal rank (MRR) of 0.0015; only 2 papers in our test set had top-10 rankings containing a paper that the original paper actually cited. This is obviously an extremely low MRR, so we thought about why that might be the case. After some pondering (and code checking to make sure we didn’t have any bugs!), we realized that so many truly cited papers never appear in our rankings because they are simply not in our training set, and as such can never be ranked at all (duh!). Since the whole Semantic Scholar corpus is too large to use in our development this quarter (40+ million papers), we had to go back to the drawing board and think more critically about how to build a dataset that is comprehensive without being too expansive (we would eventually like to embed paper text using BERT, which is computationally expensive).

We initially discussed several approaches to building a comprehensive yet manageable dataset. For example, we considered time-stamping the dataset by year, including papers after a specific year but not before. However, we realized we might still run into the same issue: many papers cited by the papers we evaluate would not appear in the training set, because we would be selecting papers only by time rather than by metadata that groups related work together. Therefore, after discussions with Noah Smith and Iz Beltagy, we decided to build datasets on a per-field basis by looking at papers from top conferences in different fields of computer science. For now, we have built a dataset of approximately 7,500 NLP papers based on venue (ACL, NAACL, and EMNLP). Over the next few days, we are working on including more papers from other fields, such as security and distributed systems.
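Roughly, the per-field filtering looks like the sketch below. The “venue” field name follows the Semantic Scholar corpus JSON schema as we understand it; treat it as an assumption and adjust to your own dump.

```python
import json

NLP_VENUES = {"ACL", "NAACL", "EMNLP"}

def build_nlp_dataset(corpus_path: str, out_path: str) -> None:
    """Keep only papers whose venue string mentions one of the target conferences."""
    with open(corpus_path) as src, open(out_path, "w") as dst:
        for line in src:
            paper = json.loads(line)
            if any(v in paper.get("venue", "") for v in NLP_VENUES):
                dst.write(json.dumps(paper) + "\n")
```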

Another choice concerning our dataset was the split between training, development, and test sets. We decided to sort the dataset by year, and take earlier years in our training set, later ones in our dev set, and even later ones in the test set. This makes citations more likely to occur in the training set since papers cannot cite a future paper, and also fits our model more closely to our main use case of inputting a recent paper and recommending related older papers.
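A minimal sketch of that chronological split, assuming each paper record has a “year” field; the 80/10/10 ratio here is illustrative rather than the exact split we used.

```python
def chronological_split(papers, train_frac=0.8, dev_frac=0.1):
    """Sort papers by year and carve off the earliest for training,
    the next slice for development, and the most recent for test."""
    ordered = sorted(papers, key=lambda p: p.get("year", 0))
    n = len(ordered)
    train_end = int(n * train_frac)
    dev_end = int(n * (train_frac + dev_frac))
    return ordered[:train_end], ordered[train_end:dev_end], ordered[dev_end:]
```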

Baseline Approach 1: Jaccard Similarity

Our first baseline approach was Jaccard similarity, a method that measures document similarity as the overlap between two documents’ word sets: the size of their intersection divided by the size of their union (more shared words imply greater similarity, obviously!).
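In code, the idea is as simple as the sketch below (a minimal version assuming whitespace tokenization over the title plus abstract; our actual implementation may tokenize and lemmatize differently).

```python
def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Intersection-over-union of the two documents' word sets."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
```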

While we implemented this baseline for our last blog post, we have made several improvements since then. First and foremost, we have updated results on the new NLP-paper dataset described above. We have also switched to reporting results on the development set rather than the test set (as advised in class). Finally, we report results both with and without lemmatization, since Beltagy advised us to avoid spaCy because it is computationally expensive for larger datasets.

Our results on the development set are as follows:

MRR without lemmatization = 0.022
MRR with lemmatization = 0.039

While both of these results seem relatively poor, we are not surprised. This method neither weights words by type (technical/domain-specific words vs. generic words) nor encodes semantic meaning in any way (so even when two different words mean the same thing, we treat them as two separate entities). The resulting representation does not capture semantic or contextual meaning and, as such, does not produce strong results.

Baseline Approach 2: TF-IDF

Our second baseline approach was calculating a TF-IDF vector for each document and computing cosine similarities between them to produce the rankings. We decided on TF-IDF because it has the additional property of offsetting a word’s importance by its frequency in the corpus, and because Beltagy mentioned it as a difficult baseline to beat.
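A rough sketch of this baseline with scikit-learn, fitting the vectorizer on training abstracts and ranking training papers for each development paper by cosine similarity (names are ours; our real pipeline may differ in preprocessing):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_by_tfidf(train_texts, dev_texts):
    """Return, for each dev document, training document indices ordered
    from most to least cosine-similar in TF-IDF space."""
    vectorizer = TfidfVectorizer()
    train_vectors = vectorizer.fit_transform(train_texts)
    dev_vectors = vectorizer.transform(dev_texts)
    scores = cosine_similarity(dev_vectors, train_vectors)
    return scores.argsort(axis=1)[:, ::-1]
```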

Given the computational overhead of lemmatization going forward, we dropped it for TF-IDF. We achieved the following results:

MRR (without lemmatization) = 0.00035983150844786044

This result was much lower than we expected, nearly two orders of magnitude below our Jaccard similarity result. It is possible that, given the content of abstracts and which papers tend to be cited, shared words in the abstract are simply not a good signal for similarity, so refining this signal the way TF-IDF does will not help us here. TF-IDF also counts duplicate terms, which may have biased the cosine similarities toward overused words that the inverse-document-frequency weighting did not offset enough.

Error Analysis Overview

Before delving into the specific reasons why our current baselines perform so poorly, we spent some time analyzing the dataset we generated. Here are some general statistics:

Training dataset size = 5960
Development dataset size = 745
Number of development set papers that have citations in training = 675
Total number of times a training paper is cited in the development set = 4293
Maximum number of citations of a single development set paper in training = 27
Maximum rank of a correctly cited paper from our current baseline (Jaccard similarity, no lemmatization) = 1

These statistics are important: they give us a broad view of our dataset and surface characteristics we can capitalize on in future models.
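For reference, statistics like those above can be computed with a short script along these lines; the “id” and “outCitations” field names assume the Semantic Scholar schema and are easy to swap out.

```python
def dataset_statistics(train_papers, dev_papers):
    """Count how often development papers cite into the training set."""
    train_ids = {p["id"] for p in train_papers}
    dev_with_train_citations = 0
    total_citations_into_train = 0
    max_from_one_dev_paper = 0
    for paper in dev_papers:
        cited = [c for c in paper.get("outCitations", []) if c in train_ids]
        if cited:
            dev_with_train_citations += 1
        total_citations_into_train += len(cited)
        max_from_one_dev_paper = max(max_from_one_dev_paper, len(cited))
    return {
        "train_size": len(train_papers),
        "dev_size": len(dev_papers),
        "dev_papers_with_citations_in_train": dev_with_train_citations,
        "total_citations_into_train": total_citations_into_train,
        "max_citations_from_one_dev_paper": max_from_one_dev_paper,
    }
```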

Now, focusing on our current baselines: as shown above, we have implemented two baselines, Jaccard similarity and TF-IDF with cosine similarity, to compute the relative rankings of similar papers. Lemmatization makes a (relatively small) difference here since we are doing direct token comparisons. At a high level, however, both models produce very poor rankings (i.e., the MRR score is very low). We can see this in the following example, where the source paper’s text is similar to the correct citation (what the answer should be), but our model instead suggests papers that are not related in any way (no lemmatization).

Source Paper: Finding Patterns in Noisy Crowds: Regression-based Annotation Aggregation for Crowdsourced Data
Correct Citation: Cheap and Fast — But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks
Incorrect Citations (generated by our model):
* Automatic Selectional Preference Acquisition for Latin Verbs
* Adaptive Joint Learning of Compositional and Non-Compositional Phrase Embeddings

We attribute these results to the fact that our current baselines do a poor job of capturing the semantic or contextual signals needed to determine whether two papers are related. These models rely solely on statistics of individual words to compute the similarity between two documents. For instance, Jaccard similarity relies only on the unique words shared by a pair of documents to determine how associated they are. Similarly, TF-IDF produces a word vector for each document on which cosine similarity is computed to determine relevance. Both of these baselines show that this problem is non-trivial, and we look forward to building more informed models that can learn richer patterns in this dataset.
