- Main contributors:
- Plale, Beth, Kouper, Inna
- Social-ecological research studies complex human-natural environments and the use and sharing of ecological resources. Elinor Ostrom, a Nobel Prize laureate from IU, pioneered the idea that social-ecological data can be collected and stored in a centralized database that captures complex relationships between the various components of the data and facilitates their collaborative use. While useful in their active stage, such databases present a challenge for archiving and preservation, especially if they are stored in a proprietary format and changes are often applied retrospectively to both new and existing data. In this talk we will present an approach to archiving a social-ecological research database, the International Forestry Resources and Institutions (IFRI) database, and discuss the challenges that we encountered as well as lessons learned. The talk aims to stimulate a discussion about the preservation of complex data objects and possible solutions that can be generalized beyond one case.
- Main contributors:
- Plale, Beth, Zeng, Jiaan, McDonald, Robert, Chen, Miao
- The first mode of access that the community of digital humanities and informatics researchers and educators will have to the copyrighted content of the HathiTrust digital repository will be extracted statistical and aggregated information about the copyrighted texts. But can the HathiTrust Research Center support scientific research that allows researchers to carry out their own analysis and extract their own information?
This question is the focus of a 3-year, $606,000 grant from the Alfred P. Sloan Foundation (Plale, Prakash 2011-2014), which has resulted in a novel experimental framework that permits analytical investigation of a corpus but prohibits data from leaving the capsule. The HTRC Data Capsule is both a system architecture and a set of policies that enable computational investigation over the protected content of the HT digital repository, carried out and controlled directly by a researcher. It leverages the foundational security principles of the Data Capsules of A. Prakash of the University of Michigan, which allow privileged access to sensitive data while restricting the channels through which that data can be released.
Ongoing work extends the HTRC Data Capsule to give researchers more compute power at their fingertips. The new thrust, HT-DC Cloud, extends existing security guarantees and features to allow researchers to carry out compute-heavy tasks, like LDA topic modeling, on large-scale compute resources.
HTRC Data Capsule works by giving a researcher their own virtual machine that runs within the HTRC domain. The researcher can configure the VM as they would their own desktop with their own tools. After they are done, the VM switches into a "secure" mode, where network and other data channels are restricted in exchange for access to the data being protected. Results are emailed to the user.
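The two-mode workflow described above can be pictured as a small state machine. The following is only an illustrative sketch, not the HTRC implementation: the class `Capsule`, the mode name `MAINTENANCE`, and all method names are hypothetical, and the `outbox` list merely stands in for the channel through which results are emailed to the user.

```python
from dataclasses import dataclass, field
from enum import Enum


class Mode(Enum):
    MAINTENANCE = "maintenance"  # hypothetical name for the setup phase
    SECURE = "secure"            # the "secure" mode named in the abstract


@dataclass
class Capsule:
    """Toy model of a researcher's VM; all names here are assumptions."""
    mode: Mode = Mode.MAINTENANCE
    outbox: list = field(default_factory=list)  # stands in for emailed results

    def network_allowed(self) -> bool:
        # The network is open only while the researcher configures the VM.
        return self.mode is Mode.MAINTENANCE

    def data_allowed(self) -> bool:
        # Protected data becomes readable only once channels are restricted.
        return self.mode is Mode.SECURE

    def enter_secure_mode(self) -> None:
        self.mode = Mode.SECURE

    def fetch_url(self, url: str) -> str:
        if not self.network_allowed():
            raise PermissionError("network is restricted in secure mode")
        return f"<response from {url}>"

    def read_volume(self, volume_id: str) -> str:
        if not self.data_allowed():
            raise PermissionError("protected data is readable only in secure mode")
        return f"<text of {volume_id}>"

    def release_result(self, result: str) -> None:
        # Results leave only through this single controlled channel.
        self.outbox.append(result)
```

In this sketch a capsule starts with network access and no data access; calling `enter_secure_mode()` swaps the two, which mirrors the trade described in the abstract.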
In this talk we discuss the motivations for the HTRC Data Capsule, its successes and challenges. HTRC Data Capsule runs at Indiana University.
See more at http://d2i.indiana.edu/non-consumptive-research
- Main contributors:
- Chen, Miao, Plale, Beth
- HathiTrust Research Center (HTRC) is the public research arm of the HathiTrust digital library, where millions of volumes, such as books, journals, and government documents, are digitized and preserved. As of Nov 2013, the HathiTrust collection had 10.8M total volumes, of which 3.5M were in the public domain and the rest were in-copyright content.
The public domain volumes of the HathiTrust collection alone take up more than 2TB of storage. Each volume comes with a MARC metadata record for the original physical copy and a METS metadata file for the provenance of the digital object. Text at this scale raises challenges for computational access to the collection, subsets of the collection, and the metadata. It also poses a challenge for text mining: how HTRC can provide algorithms that exploit the knowledge in the collections and accommodate various mining needs.
In this workshop, we will introduce the HTRC infrastructure, the portal and work set builder interface, and the programmatic data retrieval API (Data API); discuss the challenges and opportunities in HTRC big text data; and finish with a short demo of the HTRC tools.
More about HTRC
The HTRC is a collaborative research center launched jointly by Indiana University and the University of Illinois, along with the HathiTrust Digital Library, to help meet the technical challenges of dealing with massive amounts of digital text that researchers face by developing cutting-edge software tools and cyberinfrastructure to enable advanced computational access to the growing digital record of human knowledge. See http://www.hathitrust.org/htrc for details.