Applying large-scale NLP linguistic analysis to web archives: 101 billion word NLP dataset

10 Jan 2020

For those interested in the kinds of at-scale research questions that large
web archives make possible, or in web-scale linguistic analysis and entity
understanding, we've just released an open machine-annotated part of speech
dataset. It was produced by running 101 billion words of worldwide online
news, drawn from 100 million English articles from 2016 to the present,
through Google's NLP API. The dataset catalogs machine-assigned part of
speech information (tag, aspect, case, form, gender, mood, number, person,
proper, reciprocity, tense and voice) and dependency labels, along with
snippets of each usage. It builds upon a parallel dataset that includes the
more than 11 billion entities found within those articles.

Both are available as open datasets, along with a third dataset that
applies the same entity extraction to a decade of television news on BBC,
CNN, MSNBC and FOX, plus the ABC, CBS and NBC evening news, to allow
topical comparisons between online and television news.

PART OF SPEECH + DEPENDENCY LABELS
https://blog.gdeltproject.org/announcing-the-web-partofspeech-dataset-101-bi...

ENTITIES
https://blog.gdeltproject.org/announcing-the-global-entity-graph-geg-and-a-n...

TV ENTITIES
https://blog.gdeltproject.org/a-deep-learning-powered-entity-graph-over-tele...

Kalev

kalev leetaru
