The HathiTrust Research Center (HTRC): Mining the 17 Million Volumes of the HathiTrust Digital Library

Main contributors
Christie, Jennifer; Kloster, David; Walsh, John
The HathiTrust Digital Library (HTDL) was founded in 2008 with just over 2 million volumes in its collection. Today it holds over 17 million volumes, ranging from 6th-century psalters to 21st-century academic texts. Its diverse contents include government documents, academic journal articles, and monographs from all the disciplines one would find represented in a typical academic research library. While the majority of materials are in English, there are many volumes in German, French, Spanish, Italian, Arabic, Chinese, Russian, and Latin. Researchers can analyze the contents of the HTDL using the text analysis tools and data sets provided by the HathiTrust Research Center (HTRC).

The HTRC, based at IU Bloomington, develops infrastructure, tools, and services to support text data mining of the HTDL corpus. These include off-the-shelf web-based text analysis tools, a secure data capsule computing environment for analysis of rights-restricted content, and the HTRC Extracted Features Data Set, which provides volume-level and page-level word counts and other metadata for the entire corpus.

This presentation will discuss the current contents of the HTDL collection and its benefits as a data source, and will provide examples of existing research facilitated by HTDL collections and HTRC resources. It will also give an overview of the HTRC text analysis tools and the options for analyzing public domain and copyrighted material.
Indiana University Digital Collections Services
Wednesday Noon Digital Scholarship Series
IUScholarWorks Repository
Related Item
IU ScholarWorks Record 

Access Restrictions

This item is accessible by: the public.