#access2009pei – Dorothea Salo – Representing and Managing the Data Deluge
2 Oct 09
Grab a bucket, it’s raining data!
Potentially golden age of digital librarianship; digital research data is an entirely new form of research publication.
Salo admits she’s been described as the Cassandra of OA. She’s not against Open, but five years of running IRs…
– unclear goals
– insufficient means
– asking faculty for something, offering nothing
– IR view of digital universe is more narrow than the content we need to contain
– fit between user needs and system needs not good
sees similar trends in the early days of data curation, but it’s early days: no reason we have to make the same mistakes again.
focus on the fit between content and container, with a human (rather than technology) lens
what do we know about research data?
– there’s a lot of it; do we have big enough buckets? CLOUD.
– data are there to be interacted with, we store it in order for people to do stuff with it; not “look but don’t touch” museum objects
– CC0 is about getting a legal barrier out of the way of data reuse; we must get rid of the tech barriers
– the data buckets must internalize and respect the affordances of different kinds of data
– data are diverse, as are their technical environments. E.g. can’t treat a book marked up in TEI in the same way you treat a book made up of page scans. Also can’t treat similar or related data as entirely separate entities in the manner of DSpace
– we often don’t control the technical environment (e.g. proprietary formats) our data live in
– if we’re lucky, we might be able to advise our researchers on how to store their data, but more often we will need to adapt to them; they’ve already created quite a lot of it, and they’re not always thinking very far ahead; nor do IT people often have as long-term a time horizon as we do
– researchers often have no idea we can help manage their data, sometimes don’t even trust us to; have to go out and rescue it
– and of course it’s also us creating a tonne of unsustainable digital silos; all that stuff is in danger
– a lot of data are analog but really want to be data: paper lab notebooks, linguistic field notes on paper, slides; can we scale up to that?
– data are project based: Exploring the Hype(r), a dissertation based on WordPress; how are we going to deal with this? Researchers are not above building an entirely new tech stack for every project
– data are sloppy; if we insist our repositories will only accept clean, pretty data, we have a problem
– data aren’t standardized, aren’t going to be
Our big bucket: the digital library. We already do big data. Another big bucket: the IR. Neither of these will magically solve the data problem. There is an impedance mismatch between DLs or IRs and data. We have developed lots of skills and tools that will help, but we need to rethink how to apply them.
U Wisconsin is rebranding “digital collections” – being digital is no longer what linguists call a “marked state”. Digital libraries carefully built and tended, careful selection of best materials, we then lavish a lot of effort on them. But how will our careful collection/development policies cope with what’s already out there?
Concern that data projects will follow the money, leaving arts and humanities behind; how do we decide what to archive? How will we rescue the sloppy data that’s out there, when our natural tendency is to keep things neat? How much and what kind of care can we give our data libraries? They can’t all look as good as our digital libraries.
We like to do things in a “Taylorist” way in production. In the DL context, tend to limit the kinds of work to what you can easily automate and train for. A DL will specialize in one or a few things. Specialize ourselves by data types. How will that serve us when we’re not in control of the data production process, when it doesn’t fit in our buckets? Will not have the luxury of specialization. How can we be efficient when the data don’t come in standardized form?
There will be technology structure mismatches. Choices? We can pull the data out of their environment and recreate it. Kind of like pinning a butterfly in a museum case. Lose things, like search functionality, when flattening a dynamic site into something like DSpace. Other choice? Can take on maintenance and “future proofing” of the site: not an efficient process. For every single new input, somebody has to figure out how it’s put together, how it works, how to move the old interpretation into the new one. That’s work.
Many digital libraries are project silos, built to solve a specific problem, but not built for the future, and not replicable. There’s a flood coming, none of us needs to reinvent the wheel. E.g. of Decameron Web: can’t build a “Dante Web” because the tech is completely hidden. DLs are often “cabinets of curiosities”; beautiful, but you can’t get in and play. Context is not the be-all and end-all of how an object must be presented; context is fluid, built and rebuilt. Digital objects need to be exposed so they can be recontextualized, that’s what researchers want to do with data.
Presentation is content specific, but it’s possible to go so far in the direction of content-specific presentation that the data become locked within an unworkable interface. We’ve already lost a lot of digital projects to project siloing. Most Mellon-funded digital humanities projects are GONE. Must develop a coordinated rescue effort for project silos. If we can rescue our own projects, we will then know a lot about rescuing other people’s.
What about IRs? We are caged in our institutions. Salo must prove a link to someone in her institution to undertake an archiving project. A lot of data falls through the cracks. Collaboration, people moving around; institutional focus must be given up. Problem starts with scholarly publishers, who allow IR deposit, but not other kinds of web archiving. Research does not respect institutional boundaries.
The IR “we’ll take anything” promise is always broken. IRs are built for journal articles. Can only take stuff that is static and final; old news to a researcher. Model does not work with interactive data. We will lose data that’s already out there but not cleaned up and ready. Sometimes you think something’s final, but it’s not. DSpace and Fedora make it pretty hard to correct things. Our response is, we’ll take anything at all, but it has to be one file at a time: not practical for the data deluge. IR installs not as easily customized as promised.
We’ll take any metadata you want, but only in key-value pairs. Key-value pairs just don’t cut it for data. Running out of time! Content models are broken. Silos are both necessary and unacceptable. Will have to do a lot of content modelling. Run standardization processes on top of that. Lots of code to write, please share it ’cause we can’t do it alone. Social processes around DLs and IRs very fragmented.
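To make the key-value complaint concrete, here is a minimal sketch (mine, not from the talk) of what can go missing when a structured dataset description is squeezed into flat pairs. The record, its field names, and the `flatten` helper are all hypothetical illustrations, not the schema or API of any real repository.

```python
# Hypothetical dataset descriptions; every field name here is illustrative.
nested = {
    "title": "Field measurements, 2008 season",
    "variables": [
        {"name": "temperature", "unit": "celsius"},
        {"name": "depth", "unit": "metres"},
    ],
}

# Same values, but with the units attached to the wrong variables.
swapped = {
    "title": "Field measurements, 2008 season",
    "variables": [
        {"name": "temperature", "unit": "metres"},
        {"name": "depth", "unit": "celsius"},
    ],
}


def flatten(record):
    """Reduce a nested record to flat (key, value) pairs,
    roughly what a key-value-only metadata form forces on us."""
    pairs = []
    for key, value in record.items():
        if isinstance(value, list):
            # Nested groups are dissolved into loose, repeated keys.
            for item in value:
                for sub_key, sub_value in item.items():
                    pairs.append((sub_key, sub_value))
        else:
            pairs.append((key, value))
    return pairs


# The two records differ, yet as unordered key-value pairs they are
# indistinguishable: the link between a variable and its unit is gone.
same_when_flat = sorted(flatten(nested)) == sorted(flatten(swapped))
print(nested != swapped, same_when_flat)
```

The grouping survives only if the store happens to preserve the order of repeated keys, which is the sort of thing a content model would state explicitly rather than leave to chance.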
Fedora looks like a big part of the future, but it needs to change. Replaceability and editability of objects weak.
Must get involved earlier in the research process. Can’t curate what you don’t have.
Loves Solr: lightweight tool that does heavyweight things. Would prefer to become the Clio of Data Curation!
Posted by pzed on October 2, 2009 at 11.55am
Categories: access 2009, conferences, libraries, twitter
