WOS |
Web of Science (WOS) is a document classification dataset that contains 46,985 documents in 134 categories, including 7 parent categories. |
RCV1-v2 |
The RCV1 dataset is a benchmark dataset for text categorization. It is a collection of newswire articles produced by Reuters in 1996-1997. It contains 804,414 manually labeled newswire documents, categorized with respect to three controlled vocabularies: industries, topics, and regions. |
New York Times Annotated Corpus |
The New York Times Annotated Corpus contains over 1.8 million articles written and published by the New York Times between 1987 and 2007. Over 1.5 million of these articles have at least one tag. Articles are tagged for persons, places, organizations, titles, and topics using a controlled vocabulary that is applied consistently across articles. |