HathiTrust Research Center: Challenges and Opportunities in Big Text Data

Main contributors
Chen, Miao; Plale, Beth
The HathiTrust Research Center (HTRC) is the public research arm of the HathiTrust digital library, where millions of volumes, such as books, journals, and government documents, are digitized and preserved. As of November 2013, the HathiTrust collection holds 10.8M volumes, of which 3.5M are in the public domain [1] and the rest are in copyright.
The public domain volumes of the HathiTrust collection alone occupy more than 2TB of storage. Each volume comes with a MARC metadata record for the original physical copy and a METS metadata file recording the provenance of the digital object. Text at this scale makes computational access to the collection, to subsets of the collection, and to the metadata challenging. It also poses a challenge for text mining: how can HTRC provide algorithms that extract knowledge from the collections and accommodate varied mining needs?
In this workshop, we will introduce the HTRC infrastructure, the portal and work set builder interface, and the programmatic data retrieval API (Data API); discuss the challenges and opportunities in HTRC big text data; and finish with a short demo of the HTRC tools.
More about HTRC
The HTRC is a collaborative research center launched jointly by Indiana University and the University of Illinois, along with the HathiTrust Digital Library. It helps researchers meet the technical challenges of working with massive amounts of digital text by developing cutting-edge software tools and cyberinfrastructure that enable advanced computational access to the growing digital record of human knowledge. See http://www.hathitrust.org/htrc for details.
[1] http://www.hathitrust.org/statistics_visualizations
Indiana University Digital Collections Services
Wednesday Noon Digital Scholarship Series
IUScholarWorks Repository
Access Restrictions

This item is accessible by: the public.