words what do you read, m'lord?

 


#access2009pei – Roy Tennant – Inspecting the Elephant


Roy works in research and is somewhat apologetic about yesterday’s sales pitch.

The Hathi Trust is a shared digital repository that grew out of the Google digitization project. U Mich (UM) leads the effort; OCLC works on the service side. Lots of partners, but 82% of contributions come from UM and 12% from UCal (ramping up quickly); Indiana and Wisconsin have small pieces as well.

When the Hathi Trust web site was set up, it also allowed download of all the metadata describing volumes in the project: 13 elements, mostly of little use outside that environment. Roy, for fun, grabbed the file and parsed it into XML (no standards, just because), then indexed it and created a search utility.

Then a colleague (Constance Malpas) at OCLC came up with a “cloud library” project. The idea: shared digital and print repositories that create new operational efficiencies for research institutions. This requires new infrastructure for managing, monitoring, and consuming shared services.

[insert Stan Rogers, "The White Collar Holler", here]

Downloaded the HT metadata, enhanced it with OCLC numbers, exploded the data into millions of tiny XML files, indexed it, extracted the unique OCLC numbers and sent them to JT (a person), who extracted the WorldCat records. Then merged the HT data into the WC records, indexed again, and extracted info to simplify reporting. Tools: Perl/XML/Swish-e, XSLT, xsltproc.
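For the curious, here is a rough sketch of what the "explode and extract unique OCLC numbers" step might look like. This is not Roy's code (he used Perl), and the post-talk details are my assumptions: I am guessing at a tab-delimited input, the four field names are made up (the real file had 13 elements), and I assume multiple OCLC numbers in one field are comma-separated.

```python
import csv
import io
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical column names, for illustration only.
FIELDS = ["volume_id", "rights", "oclc_numbers", "title"]


def explode_records(tsv_text, out_dir):
    """Write one small XML file per metadata row.

    Returns the set of unique OCLC numbers seen across all rows.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    unique_oclc = set()

    reader = csv.DictReader(
        io.StringIO(tsv_text),
        fieldnames=FIELDS,
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary text
    )
    for row in reader:
        volume_id = row["volume_id"]
        if not volume_id:
            continue  # skip blank or malformed rows

        # Build a tiny <record> element with one child per field.
        record = ET.Element("record")
        for name in FIELDS:
            ET.SubElement(record, name).text = row[name] or ""

        # Assumption: several OCLC numbers may share one field, comma-separated.
        for number in (row["oclc_numbers"] or "").split(","):
            number = number.strip()
            if number:
                unique_oclc.add(number)

        # Make the volume id safe to use as a file name.
        safe_name = volume_id.replace("/", "_").replace(":", "_")
        ET.ElementTree(record).write(
            str(out_dir / f"{safe_name}.xml"),
            encoding="utf-8",
            xml_declaration=True,
        )
    return unique_oclc
```

Calling `explode_records(text, "records")` would leave one XML file per volume in `records/` and hand back the de-duplicated OCLC numbers, which is the list that would then go off to whoever pulls the matching WorldCat records.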

OCLC doesn’t really use MARC internally. Rather, they have their own CDF (common data format?), which allows data to be extracted in more usable ways (e.g. dates). The HT metadata was also inserted there. They also plan to insert metadata from the libraries involved (holdings? I kinda missed that).

Some interesting reports. “Murky buckets” is a Lorcan Dempsey term. E.g. weird dates. Roughly 16% were public domain as of Sept 2009: 600K volumes, mostly pre-1922 and government documents. UM actually does proactive review of copyright status (unlike Google), trying to open up as much as possible. Subject distribution is dominated by arts and humanities, esp. literature and history.

Now working with NYU: considering impact of collections overlaps among NYU, Hathi Trust, and ReCAP (a shared storage facility in NY).
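To make the overlap question concrete, here is a minimal sketch of how collection overlap could be counted once every collection is reduced to a set of OCLC numbers. The function name and the idea of holding each collection as an in-memory set are my own assumptions, not anything shown in the talk.

```python
from itertools import combinations


def overlap_counts(collections):
    """Count shared OCLC numbers between collections.

    `collections` maps a collection name to a set of OCLC numbers.
    Returns (pairwise, shared_by_all): a dict of counts keyed by
    name pairs, and the count of numbers held by every collection.
    """
    pairwise = {}
    for a, b in combinations(sorted(collections), 2):
        pairwise[(a, b)] = len(collections[a] & collections[b])
    shared_by_all = len(set.intersection(*collections.values()))
    return pairwise, shared_by_all
```

Feeding it three sets (one each for NYU, Hathi Trust, and ReCAP) would give the three pairwise overlap counts plus the number of titles all three hold. It only works if the identifiers are consistent across collections, which is exactly the point of the first lesson below.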

Lessons
– identifiers are essential: Roy thinks the OCLC number is best
– standards are great until they get in your way; they have ignored both internal and external standards, but it gets effective work done
– never underestimate the power of a prototype

Posted by pzed on October 2, 2009 at 12.27pm
Categories: access 2009, conferences, libraries, twitter
