Applying large-scale NLP linguistic analysis to web archives: 101 billion word NLP dataset
For those interested in the kinds of at-scale research questions that large web archives make possible, or in web-scale linguistic analysis and entity understanding: we've just released an open, machine-annotated part-of-speech dataset. It was produced by running 101 billion words of worldwide online news, drawn from 100 million English articles from 2016 to the present, through Google's NLP API. The dataset catalogs the machine-assigned part-of-speech information (tag, aspect, case, form, gender, mood, number, person, proper, reciprocity, tense and voice) and dependency label, along with snippets of each usage.

It builds upon a parallel dataset that includes the more than 11 billion entities found within those articles. Both are available as open datasets, along with a third dataset that applies the same entity extraction to a decade of television news on BBC, CNN, MSNBC and FOX, plus the ABC, CBS and NBC evening news, to allow online-television topical comparisons.

PART OF SPEECH + DEPENDENCY LABELS
https://blog.gdeltproject.org/announcing-the-web-partofspeech-dataset-101-bi...

ENTITIES
https://blog.gdeltproject.org/announcing-the-global-entity-graph-geg-and-a-n...

TV ENTITIES
https://blog.gdeltproject.org/a-deep-learning-powered-entity-graph-over-tele...

Kalev
Participants (1): Kalev Leetaru