New open source field of study classifier S2FOS | AI2 Blog

Excerpt

New Semantic Scholar model S2FOS makes academic field of study classification widely available.


Announcing S2FOS, an open source academic field of study classifier


By Kelsey MacMillan and Sergey Feldman

New model makes academic field of study classification widely available and adds Linguistics, Law, Education, and Agriculture and Food Sciences to Semantic Scholar


Kelsey MacMillan is a Product Manager for Semantic Scholar at the Allen Institute for AI. Sergey Feldman is a Senior Applied Research Scientist.

Semantic Scholar’s mission is to accelerate breakthroughs in science using AI, and we believe that aggregating, structuring, and, ultimately, understanding academic data is an important pillar of that mission. We continuously invest in our data processing pipeline, leveraging our strength in AI, and have worked to fill gaps left by the deprecation of Microsoft Academic Graph (MAG) at the end of 2021.

To that end, we are announcing the release of S2FOS, a field of study classifier for scientific papers. S2FOS is being released via the Semantic Scholar website, the S2AG API and Dataset (our open academic graph corpus), and as a downloadable model for deployment anywhere.

S2FOS operates on paper titles and abstracts across all domains, though it is currently limited to English-language papers. After filtering low-confidence predictions, we have measured 90% coverage on English-language papers in our corpus.

The taxonomy of fields of study for S2FOS is largely based on MAG’s taxonomy. This choice allowed us to leverage existing data from MAG’s classification. However, we did take the opportunity to make targeted additions to the set of fields based on user feedback and comparison to other popular academic data sources such as Dimensions. We added the ability to classify papers as Linguistics, Law, Education, and Agriculture and Food Sciences.

Keep reading for details on the model architecture, training procedure, inference, and evaluation.


Model Setup

We’ve approached field of study classification as a multilabel classification task, since some classes are not well differentiated in practice (e.g., Biology vs. Medicine) and interdisciplinary papers are fairly common. The model architecture choice was driven by two factors: (a) we didn’t want to add another BERT to our already-hefty product pipeline, and (b) previous experiments on SciDocs showed that simple models can be very hard to beat on short-text classification tasks. With this in mind, we chose a model that “just works”: a linear SVM running on character n-gram TF-IDF representations. To get the most juice out of this model, we use the 300,000 most common character unigrams through 5-grams. We used a combination of scikit-learn and the lightning library to build the model.
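A minimal sketch of this setup in plain scikit-learn (the production model uses the lightning library for the linear classifier, and the titles, labels, and label set below are toy illustrations, not real training data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy title+abstract strings with (possibly multiple) field labels.
texts = [
    "Deep learning for protein structure prediction",
    "A randomized trial of statin therapy in adults",
    "Convolutional networks for image classification",
    "Gene expression profiling of tumor samples",
]
labels = [["Biology", "Computer Science"], ["Medicine"],
          ["Computer Science"], ["Biology", "Medicine"]]

# Binarize the label sets for multilabel training.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

# Character 1- to 5-gram TF-IDF, capped at the most frequent features,
# feeding a one-vs-rest linear SVM (one binary SVM per field).
clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 5), max_features=300_000),
    OneVsRestClassifier(LinearSVC()),
)
clf.fit(texts, Y)

# Per-field decision scores for a new paper; thresholding these
# scores is what turns them into field-of-study predictions.
scores = clf.decision_function(["neural networks for medical imaging"])
print(dict(zip(mlb.classes_, scores[0].round(2))))
```

The one-vs-rest wrapper is what makes the task multilabel: each field gets its own binary SVM, so a paper can score highly for several fields at once.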

Dataset Development

We used “silver” training data and “gold” evaluation data. The silver data generation was based on the idea that publication venues usually publish within a relatively narrow set of fields. As such, we manually labeled a large number of publication venues with our fields of study, and then propagated those labels to papers published in their respective venues.
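The venue-to-paper propagation step can be sketched in a few lines (the venue labels and paper records here are hypothetical placeholders, not actual S2 data):

```python
# Hypothetical manually labeled venues (venue name -> fields of study).
venue_labels = {
    "Journal of Machine Learning Research": ["Computer Science"],
    "The Lancet": ["Medicine"],
    "Bioinformatics": ["Biology", "Computer Science"],
}

# Hypothetical paper records from the corpus.
papers = [
    {"id": 1, "venue": "The Lancet"},
    {"id": 2, "venue": "Bioinformatics"},
    {"id": 3, "venue": "Unknown Workshop"},
]

# Propagate each venue's labels to its papers; papers from
# unlabeled venues simply drop out of the silver training set.
silver = [
    {"id": p["id"], "fields": venue_labels[p["venue"]]}
    for p in papers
    if p["venue"] in venue_labels
]
print(silver)
```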

For the gold data, we manually labeled 500 papers with one, two, or three fields of study each.

Inference Details

Given the training data set, we were only confident in our model’s ability to predict a field of study on English-language papers. As such, we do language detection first (with an ensemble of pycld2 and fastText), and then only produce a prediction if the paper’s title and abstract (if available) are in English.

In the S2 production system, we’ve configured the model to err toward recall while maintaining a floor of 0.9 precision for papers with abstracts and a floor of 0.8 precision for papers with titles only. We developed a small decision tree that determines when we produce predictions and attempts to capture this “sweet spot” in the precision and recall curves:

    if len(abstract) > 0:
      if max(score) > -0.2:
        Take all predictions where score > -0.2
      elif max(score) > -1.0:
        Take prediction with the largest score
    else:
      if max(score) > -0.2:
        Take prediction with the largest score
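The decision tree above translates directly into a small function over per-field SVM decision scores. The thresholds come from the pseudocode; the function and parameter names are ours for illustration:

```python
import numpy as np

def select_fields(scores, labels, has_abstract,
                  take_all=-0.2, take_top=-1.0):
    """Apply the prediction-selection rules to per-field decision scores."""
    scores = np.asarray(scores)
    best = scores.argmax()
    if has_abstract:
        if scores[best] > take_all:
            # Confident: keep every field above the high threshold.
            return [l for l, s in zip(labels, scores) if s > take_all]
        elif scores[best] > take_top:
            # Moderately confident: keep only the top field.
            return [labels[best]]
    else:
        if scores[best] > take_all:
            # Title-only papers get at most one field.
            return [labels[best]]
    return []  # low confidence: no prediction at all

labels = ["Biology", "Computer Science", "Medicine"]
print(select_fields([0.3, -0.1, -1.5], labels, has_abstract=True))
# → ['Biology', 'Computer Science']
print(select_fields([-0.6, -1.8, -2.0], labels, has_abstract=True))
# → ['Biology']
print(select_fields([-0.5, -1.0, -2.0], labels, has_abstract=False))
# → []
```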

Many English-language papers with and without abstracts end up with no predictions at all, while other papers have 2 or even 3 fields of study associated with them.

The model is deployed with AWS Fargate. As new papers continuously enter our system, an AWS Lambda function pulls work off an SQS queue, consults our paper feature stores, and requests field-of-study inference from the Fargate endpoint.

Final Evaluation

After integrating the S2FOS model into S2’s data pipeline we did an end-system human evaluation on 500 papers selected from a traffic-weighted data set. This evaluation compared S2FOS results to MAG data as a baseline.

The human annotators were asked to pick from the S2FOS taxonomy and assign to each paper every field of study they would consider “reasonable.” We graded a model classification as correct if any of its labels overlapped with the human-generated set.
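This overlap metric is simple enough to state as code (the example paper labels below are made up for illustration):

```python
def overlap_correct(predicted, gold):
    """A classification counts as correct if any predicted field
    appears in the annotator's set of reasonable fields."""
    return bool(set(predicted) & set(gold))

# Illustrative gradings, then accuracy over the graded papers.
graded = [
    overlap_correct(["Medicine"], ["Medicine", "Biology"]),   # True
    overlap_correct(["Physics"], ["Computer Science"]),       # False
]
accuracy = sum(graded) / len(graded)
print(accuracy)  # 0.5
```

Note that this metric is lenient by design: a paper with one correct field out of three predictions still counts as correct.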

The result was 86% S2FOS correctness according to this metric. The MAG baseline was calculated only on the 279 papers where we had MAG-sourced data available and was 74% correct.

The new field-of-study data is available for use on Semantic Scholar in the form of search filters and as part of detailed paper information. Internally, we also continue to use field-of-study data for analytics and domain-specific model development, such as TLDRs for Computer Science and BioMed. For direct data access, check out our Semantic Scholar Academic Graph API, S2AG, or the open source GitHub repo!