NLP-Based Grant Analysis

Exploratory Analysis of NSF Standard Grants Using NLP and Data Visualization Tools



We start by retrieving the NSF grant data files. The grant data, including abstracts of the funded projects, are available as zipped XML files on the NSF web site. We download the archives and extract the relevant fields with BeautifulSoup. Since the zip files are rather large and parsing them takes a lot of memory, we process one zip file at a time and append the results to an uncompressed CSV file.
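
The parsing code below assumes the zip archives are already present in the working directory. As a rough sketch, one year's archive could be fetched with requests; note that the exact download URL pattern used here is an assumption and should be checked against the NSF award search download page.

import requests

# Hypothetical download of a single year's archive; the URL pattern is an
# assumption and may need to be adapted to the current NSF site.
url = 'https://www.nsf.gov/awardsearch/download?DownloadFileName=2020&All=true'
resp = requests.get(url, timeout=60)
resp.raise_for_status()
with open('2020.zip', 'wb') as f:
    f.write(resp.content)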

from bs4 import BeautifulSoup
from zipfile import ZipFile
import pandas as pd
from glob import glob

def parse_xml(soup):
    # Extract the award year, instrument, directorate, division,
    # funding amount and abstract from a single award XML file.
    xdata = []
    xdata.append(soup.find('AwardEffectiveDate').string[-4:])
    xdata.append(soup.find('AwardInstrument').find('Value').string)
    xdata.append(soup.find('Directorate').find('LongName').string)
    xdata.append(soup.find('Division').find('LongName').string)
    xdata.append(int(soup.find('AwardAmount').string))
    xdata.append(soup.find('AbstractNarration').string)
    return xdata

def readzips(zfiles, outfile):
    headers = ['year', 'instrument', 'directorate', 'division', 'funding', 'abstract']
    # Write an empty dataframe first so that the output CSV starts with the headers.
    df = pd.DataFrame(columns=headers)
    df.to_csv(outfile, encoding='utf-8')
    # Process one zip archive at a time to keep memory usage manageable.
    for zf in zfiles:
        xlist = []
        with ZipFile(zf) as z:
            for filename in z.namelist():
                if filename.endswith('.xml'):
                    with z.open(filename) as f:
                        soup = BeautifulSoup(f, 'xml')
                        xlist.append(parse_xml(soup))
        # Keep only standard grants and append them to the CSV file.
        df = pd.DataFrame(xlist, columns=headers)
        df = df[df.instrument == 'Standard Grant']
        df.to_csv(outfile, encoding='utf-8', header=None, mode='a')
        del df, xlist

outfile = './nsf-standard-grants.csv'
readzips(glob('*.zip'), outfile)

We then read the entire data set back into memory from the CSV file, remove duplicate rows, and save the result as a compressed CSV file.

df = pd.read_csv(outfile,  index_col=0, encoding='utf-8')
df.drop_duplicates(inplace=True)
df.to_csv(outfile + '.xz', encoding='utf-8')

Next we load some useful libraries for exploratory data analysis and visualisation.

from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer
from sklearn import preprocessing
from sklearn.manifold import TSNE, SpectralEmbedding
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.decomposition import LatentDirichletAllocation, NMF, TruncatedSVD
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import Pipeline
from umap import UMAP

import matplotlib.pyplot as plt
import seaborn as sns

We load the previously saved grant data into a Pandas dataframe and check the structure and column headings of the data.

df = pd.read_csv('./nsf-standard-grants.csv.xz', index_col=0, encoding='utf-8')
df.head()
year instrument directorate division abstract funding
0 2008 Standard Grant Directorate For Engineering Div Of Civil, Mechanical, & Manufact Inn NSF Proposal # 0800628: Management in Supply C... 281167.0
1 2008 Standard Grant Direct For Computer & Info Scie & Enginr Division Of Computer and Network Systems Proposal Summary: The Association for Computin... 20200.0
2 2008 Standard Grant Directorate For Engineering Div Of Civil, Mechanical, & Manufact Inn The research objective of this Grant Opportuni... 177948.0
3 2008 Standard Grant Directorate For Engineering Div Of Civil, Mechanical, & Manufact Inn Abstract <br/>The research objective of this a... 222600.0
4 2008 Standard Grant Directorate For Engineering Div Of Civil, Mechanical, & Manufact Inn This research will lead to advanced, functiona... 309973.0

Let’s see how many abstracts there are for each directorate.

df.groupby('directorate').count()['abstract']
directorate
Dir for Tech, Innovation, & Partnerships                         3495
Direct For Biological Sciences                                  10594
Direct For Computer & Info Scie & Enginr                        18405
Direct For Education and Human Resources                         6909
Direct For Mathematical & Physical Scien                        20218
Direct For Social, Behav & Economic Scie                        12068
Directorate For Engineering                                     26773
Directorate For Geosciences                                     13157
Directorate for Computer & Information Science & Engineering        2
Directorate for STEM Education                                   3115
Directorate for Social, Behavioral & Economic Sciences              3
Office Of Polar Programs                                            3
Office Of The Director                                           1708
Name: abstract, dtype: int64

Some of the directorates seem to have been spelled in multiple ways. We clean up the data by remapping the directorates to a more compact list.

dir_map = {
    'Dir for Tech, Innovation, & Partnerships': 'Tech, Innovation, & Partnerships', 
    'Direct For Biological Sciences': 'Biological Sciences', 
    'Direct For Computer & Info Scie & Enginr': 'Computer & Information Science & Engineering', 
    'Direct For Education and Human Resources': 'Education', 
    'Direct For Mathematical & Physical Scien': 'Mathematical & Physical Sciences', 
    'Direct For Social, Behav & Economic Scie': 'Social, Behavioral & Economic Sciences', 
    'Directorate For Engineering': 'Engineering', 
    'Directorate For Geosciences': 'Geosciences', 
    'Directorate for Computer & Information Science & Engineering': 'Computer & Information Science & Engineering', 
    'Directorate for STEM Education': 'Education', 
    'Directorate for Social, Behavioral & Economic Sciences': 'Social, Behavioral & Economic Sciences', 
    'Office Of Polar Programs': 'Geosciences', 
    'Office Of The Director': 'Office Of The Director'
}

df['directorate'] = df['directorate'].map(dir_map)

We drop rows with an empty abstract from our dataframe and recheck how many abstracts there are for each directorate.

df.dropna(subset=['abstract'], inplace=True)
df.groupby('directorate').count()['abstract']
directorate
Biological Sciences                             10594
Computer & Information Science & Engineering    18407
Education                                       10024
Engineering                                     26773
Geosciences                                     13160
Mathematical & Physical Sciences                20218
Office Of The Director                           1708
Social, Behavioral & Economic Sciences          12071
Tech, Innovation, & Partnerships                 3495
Name: abstract, dtype: int64

In total, we have over 100k abstracts in our data set.

df.shape
(116450, 6)

We can cross-tabulate the number of grants per directorate and year.

directorates = sorted(df['directorate'].unique())
pd.crosstab(df.directorate, df.year)
year 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023
directorate
Biological Sciences 0 17 495 1267 829 670 663 648 666 734 748 694 540 623 790 584 569 57
Computer & Information Science & Engineering 0 12 894 1480 1164 1098 1216 1133 1165 1328 1347 1363 1430 1366 1324 1010 1022 55
Education 0 4 470 755 689 611 632 605 579 708 735 664 714 666 735 719 667 71
Engineering 0 41 1118 2381 2162 1855 1900 1981 2080 2175 2419 2068 1760 1338 1232 1043 1115 105
Geosciences 1 19 543 1566 1155 1008 933 759 732 770 824 739 673 838 883 892 755 70
Mathematical & Physical Sciences 0 38 826 1997 1411 1176 1282 1104 1216 1293 1323 1234 1516 1541 1514 1282 1440 25
Office Of The Director 0 5 106 194 226 165 91 74 116 69 53 70 97 92 81 127 110 32
Social, Behavioral & Economic Sciences 0 12 581 1194 975 801 872 807 843 851 802 804 696 697 873 660 577 26
Tech, Innovation, & Partnerships 0 0 0 0 0 0 0 0 0 0 0 1 432 831 805 763 658 5
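
As a rough visual sketch, the same counts can also be drawn as yearly time series, one line per directorate.

# Plot the yearly number of standard grants for each directorate.
ct = pd.crosstab(df.year, df.directorate)
ax = ct.plot(figsize=[8, 5])
ax.set_xlabel('year')
ax.set_ylabel('number of standard grants')
ax.legend(bbox_to_anchor=(1.02, 0.95), loc='upper left', borderaxespad=0)
plt.show()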

Classification

Let’s train a simple linear classifier and check how well the model can predict which directorate each abstract belongs to. First we transform the abstracts into a document-term matrix using CountVectorizer. We use unigrams as features, since the vocabulary grows rather large with bigrams.

vect_clf = CountVectorizer(stop_words='english', ngram_range=(1,1), max_df=0.2, min_df=2)
X = vect_clf.fit_transform(df.abstract)
X.shape
(116450, 85757)
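
To get a rough idea of how much the vocabulary grows with bigrams, one could fit a second vectorizer and compare vocabulary sizes; this is only a side check and takes a while on the full data set.

# Fit a vectorizer that also includes bigrams and compare vocabulary sizes.
vect_bi = CountVectorizer(stop_words='english', ngram_range=(1, 2), max_df=0.2, min_df=2)
vect_bi.fit(df.abstract)
print(len(vect_clf.vocabulary_), len(vect_bi.vocabulary_))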

We set up a classifier pipeline with TfidfTransformer and LinearSVC. Note that we did not use TfidfVectorizer above: to avoid data leakage during cross-validation, the TF-IDF weighting is instead fitted within the pipeline on each training fold and then applied to the corresponding test fold.

clf = LinearSVC(class_weight='balanced')
tfidf = TfidfTransformer(sublinear_tf=True)
pipe = Pipeline([('tfidf', tfidf), ('clf', clf)])

Now we can check how well the model performs. We do this with 5-fold cross-validation and look at the average accuracy score.

score = cross_val_score(pipe, X, df.directorate, cv=5, n_jobs=-1)
print(score, score.mean())
[0.8455131  0.87084586 0.86994418 0.84753113 0.82271361] 0.8513095749248605

An accuracy of 0.85 is quite decent. We also check which directorates the model mixes up by computing the confusion matrix between the predicted and actual directorates.

pred_directorate = cross_val_predict(pipe, X, df.directorate, cv=5, n_jobs=-1)

cfm = (100*confusion_matrix(df.directorate, pred_directorate, labels=directorates, normalize='true')).round(1)

The model has learned to separate Geosciences and Social, Behavioral & Economic Sciences most consistently from the other directorates. “Office Of The Director” and “Tech, Innovation, & Partnerships” are mixed up with other directorates most often, but these two also have the fewest grants, which likely hurts their classification accuracy.

sns.heatmap(cfm, cmap='Blues', annot=True, fmt='.1f', xticklabels=directorates, yticklabels=directorates)
plt.show()

[Figure: confusion matrix heatmap of actual vs. predicted directorates]
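
The per-directorate recall can also be read off directly as the diagonal of the row-normalized confusion matrix, for example along these lines.

# The diagonal of the row-normalized confusion matrix gives the share of
# correctly classified abstracts (recall, in percent) per directorate.
pd.Series(cfm.diagonal(), index=directorates).sort_values()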

Visualisation

Let’s try to visualize the data set. This time we transform the full set of abstracts with TfidfVectorizer into a normalized document-term matrix.

vect = TfidfVectorizer(stop_words='english', ngram_range=(1,1), sublinear_tf=True, max_df=0.2, min_df=10)
X = vect.fit_transform(df.abstract)
fnames = vect.get_feature_names_out()
X.shape
(116450, 33219)

We embed the vectorized abstracts into two dimensions with UMAP, using the angle between the vectors (“cosine distance”) as the distance metric.

Xr = UMAP(n_components=2, n_neighbors=10, metric='cosine').fit_transform(X)
df['x'] = Xr[:,0]
df['y'] = Xr[:,1]

Now we can create a scatterplot of the data. The resulting plot shows that abstracts from the same directorate mostly clump together fairly well, but some clusters contain abstracts from multiple directorates. This seems to happen especially with abstracts from Mathematical & Physical Sciences and Engineering.

fig, ax = plt.subplots(figsize=[8,6])
sns.scatterplot(data=df, x='x', y='y', hue='directorate', s=1, alpha=0.5, linewidth=0, ax=ax)
plt.axis('off')
plt.legend(bbox_to_anchor=(1.02, 0.95), loc='upper left', borderaxespad=0)
plt.show()

[Figure: UMAP scatterplot of the abstracts, colored by directorate]

Next we try to identify some keywords that have been emerging in the last few years. Before that, we will do some simple preprocessing of the abstract texts with spaCy.
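
A minimal sketch of the kind of spaCy preprocessing meant here, assuming lemmatization, lowercasing and removal of stopwords and non-alphabetic tokens; the clean_abstract column name is only illustrative.

import spacy

# Load a small English pipeline; the parser and NER components are not
# needed for lemmatization, so we disable them for speed.
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

def preprocess(doc):
    # Keep lowercased lemmas of alphabetic, non-stopword tokens.
    return ' '.join(tok.lemma_.lower() for tok in doc
                    if tok.is_alpha and not tok.is_stop)

# nlp.pipe streams the abstracts in batches, which is much faster than
# calling nlp() on each text separately.
df['clean_abstract'] = [preprocess(doc) for doc in nlp.pipe(df.abstract, batch_size=100)]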
