#access2009pei – Dorothea Salo – Representing and Managing the Data Deluge
2 Oct 09
Grab a bucket, it’s raining data!
Potentially golden age of digital librarianship; digital research data is an entirely new form of research publication.
Salo admits she’s been described as the Cassandra of OA. She’s not against Open, but five years of running IRs…
– unclear goals
– insufficient means
– asking faculty for something, offering nothing
– IR view of digital universe is more narrow than the content we need to contain
– fit between user needs and system needs not good
sees similar trends in the early days of data curation, but it’s early days: no reason we have to make the same mistake again.
focus on the fit between content and container, with a human (rather than technology) lens
what do we know about research data?
– there’s a lot of it; do we have big enough buckets? CLOUD.
– data are there to be interacted with, we store it in order for people to do stuff with it; not “look but don’t touch” museum objects
– CC0 is about getting a legal barrier out of the way of data reuse; we must get rid of the tech barriers
– the data buckets must internalize and respect the affordances of different kinds of data
– data are diverse, as are their technical environments. E.g. can’t treat a book marked up in TEI in the same way you treat a book made up of page scans. Also can’t treat similar or related data as entirely separate entities in the manner of DSpace
– we often don’t control the technical environment (e.g. proprietary formats) our data live in
– if we’re lucky, we might be able to advise our researchers on how to store their data, but more often we will need to adapt to them; they’ve already created quite a lot of it, and they’re not always thinking very far ahead; nor do IT people often have as long-term a time horizon as we do
– researchers often have no idea we can help manage their data, sometimes don’t even trust us to; have to go out and rescue it
– and of course it’s also us creating a tonne of unsustainable digital silos; all that stuff is in danger
– a lot of data are analog but really want to be data: paper lab notebooks, linguistic field notes on paper, slides; can we scale up to that?
– data are project based: Exploring the Hype(r) a dissertation based on WordPress; how are we going to deal with this? Researchers are not above building an entirely new tech stack for every project
– data are sloppy; if we insist our repositories will only accept clean, pretty data, we have a problem
– data aren’t standardized, aren’t going to be
Our big bucket: the digital library. We already do big data. Another big bucket: the IR. Neither of these will magically solve the data problem. There is an impedance mismatch between DLs or IRs and data. We have developed lots of skills and tools that will help, but we need to rethink how to apply them.
U Wisconsin is rebranding “digital collections” – being digital is no longer what linguists call a “marked state”. Digital libraries carefully built and tended, careful selection of best materials, we then lavish a lot of effort on them. But how will our careful collection/development policies cope with what’s already out there?
Concern that data projects will follow the money, leaving arts and humanities behind; how do we decide what to archive? How will we rescue the sloppy data that’s out there, when our natural tendency is to keep things neat? How much and what kind of care can we give our data libraries? They can’t all look as good as our digital libraries?
We like to do things in a “Taylorist” way in production. In the DL context, tend to limit the kinds of work to what you can easily automate and train for. A DL will specialize in one or a few things. Specialize ourselves by data types. How will that serve us when we’re not in control of the data production process, when it doesn’t fit in our buckets? Will not have the luxury of specialization. How can we be efficient when the data don’t come in standardized form?
There will be technology structure mismatches. Choices? We can pull the data out of their environment and recreate it. Kind of like pinning a butterfly in a museum case. Lose things, like search functionality, when flattening a dynamic site into something like DSpace. Other choice? Can take on maintenance and “future proofing” of the site: not an efficient process. For every single new input, somebody has to figure out how it’s put together, how it works, how to move the old interpretation into the new one. That’s work.
Many digital libraries are project silos, built to solve a specific problem, but not built for the future, and not replicable. There’s a flood coming, none of us needs to reinvent the wheel. E.g. of Decameron Web: can’t build a “Dante Web” because the tech is completely hidden. DLs are often “cabinets of curiosities”; beautiful, but you can’t get in and play. Context is not the be-all and end-all of how an object must be presented; context is fluid, built and rebuilt. Digital objects need to be exposed so they can be recontextualized, that’s what researchers want to do with data.
Presentation is content specific, but it’s possible to go so far in the direction of content-specific presentation that the data become locked within an unworkable interface. We’ve already lost a lot of digital projects to project siloing. Most Mellon funded digital humanities projects are GONE. Must develop a coordinated rescue effort for project silos. If we can rescue our own projects, we will then know a lot about rescuing other people’s.
What about IRs? We are caged in our institutions. Salo must prove a link to someone in her institution to undertake an archiving project. A lot of data falls through the cracks. Collaboration, people moving around; institutional focus must be given up. Problem starts with scholarly publishers, who allow IR deposit, but not other kinds of web archiving. Research does not respect institutional boundaries.
The IR “we’ll take anything” promise is always broken. IRs are built for journal articles. Can only take stuff that is static and final; old news to a researcher. Model does not work with interactive data. We will lose data that’s already out there but not cleaned up and ready. Sometimes you think something’s final, but it’s not. DSpace and Fedora make it pretty hard to correct things. Our response is, we’ll take anything at all, but it has to be one file at a time: not practical for the data deluge. IR installs not as easily customized as promised.
We’ll take any metadata you want, but only in key-value pairs. That just doesn’t cut it for data. Running out of time! Content models are broken. Silos are both necessary and unacceptable. Will have to do a lot of content modelling. Run standardization processes on top of that. Lots of code to write, please share it ’cause we can’t do it alone. Social processes around DLs and IRs very fragmented.
Fedora looks like a big part of the future, but it needs to change. Replaceability and editability of objects weak.
Must get involved earlier in the research process. Can’t curate what you don’t have.
Loves Solr: a lightweight tool that does heavyweight things. Would prefer to become the Clio of Data Curation!
Posted by pzed on October 2, 2009 at 11.55am
#access2009pei – thunder talks
2 Oct 09
Natalie Collins, CISTI Lab
– came out of a hackfest-like “innovation challenge”
– invited everyone across the organization to create something innovative
– proposal day, teams of up to 4 pitching their ideas
– 15 proposals, 8 accepted to go forward
– gave a week’s worth of free time, ended with a final presentation
– winner looked at the impact of news reporting of research on searching
Ali Sadaqain, York U
– web redesign in VuFind
– facets on left sidebar, call number issues, language tweaks
– multiple formats showing up
– wanted some “2.0” stuff, so added an “add to favorites” [sic] option; not shared
– also included click-throughs to journal articles, although I don’t see explicit SFX enabling
(sorry, missed presenter’s name) Jamaican libraries
– looking at revamping Jamaican libraries in Drupal
– most have single page websites
– most use UNISYS(?) from UNESCO
– few resources for most up-to-date software, and for digitization
– internet penetration is about 20%; 80% of businesses have access
– mobile phone use is very high
Anne Barrett, Dalhousie
– one month live with WorldCat Local as primary search interface
– adds relevance ranking, FRBR features, SFX integration, ILL support
– more modern presentation of information
– older search engine to Novanet still available
– 17,000+ searches in Sept, apparent calm is hopefully good news
– still some outstanding issues: large number of records with no OCLC match
– significant impact on cataloguing
– smaller issues, RefWorks isn’t integrating well off campus
– can FRBR algorithm be adapted to favour more academic institutions
– impact on doc delivery, because of links to items held world wide
Craig Deplace, Jamie O’Toole, School District 16 in NB
– teachers
– used Drupal to develop repository for student and staff generated video
– were publishing on Youtube, less than perfect for kids
– decided to create own Youtube, extended it to allow teachers to upload wide variety of resources: District 16 Media Centre
– close to 3000 pieces of content
– recognized potential to use Drupal elsewhere
– distance delivered media course wanted a way to publish content, originally a PDF newsletter
– The District 16 Report
– students log in, act as creator/editor for news content, can also maintain a personal blog
– there is a moderation queue
– brings together students from 5 sites
– in the process of moving the schools’ websites into Drupal
– much simpler than training teachers to use Dreamweaver
– spreads content maintenance across many teachers, Drupal makes it easy and changes what and how we publish
– one school (Gretna Green?) streams their announcements in video and then uploads to their website
Bess Sadler, U Virginia
– Scholars’ Lab – merger of geospatial/stat data centre with text centre
– had lots of GIS data with very little metadata
– generally had to talk to the “GIS guy”
– received a grant to buy ESRI(?) online mapping system, spent 6 weeks trying to install
– somebody said, why not try OS?
– OpenLayers on top of PostGIS
– simple search utility
– visualizations as well
– planning to load data sets into main catalogue
Jennifer Richard, Acadia
– digitized herbarium
– started with manageable collections, starting with rare/endangered species, then invasive
– applied CFI project for Canadensys, now up to 50K specimens
– improved quality control, search capabilities
– considering replacing proprietary image management with dJatoka
– also includes smaller collection from Cape Breton U, plans to work with St FX, UNB, UPEI
Cameron Metcalf, U of Ottawa
– have 250K air photos
– hoping to bring online, at least the indexes
– photos described by roll number and photo sequence
– very manual, somewhat tedious methods to find photo numbers
– starting to use MarkerClusterer to allow drilling down via web map index
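A rough sketch of the idea behind clustering a web map index: MarkerClusterer itself is a JavaScript library for Google Maps, and this is not its code. The function names, the cell size, and the photo coordinates below are all made up for illustration. Points are binned into grid cells, and each cell becomes one marker carrying a count; zooming in would correspond to rerunning with a smaller cell.

```python
from collections import defaultdict

def cluster_points(points, cell_size_deg):
    """Group (lat, lon) points into square grid cells; return {cell: [points]}."""
    clusters = defaultdict(list)
    for lat, lon in points:
        cell = (int(lat // cell_size_deg), int(lon // cell_size_deg))
        clusters[cell].append((lat, lon))
    return dict(clusters)

def summarize(clusters):
    """One marker per cell: the centroid plus the number of photos it stands for."""
    markers = []
    for _cell, pts in sorted(clusters.items()):
        lat = sum(p[0] for p in pts) / len(pts)
        lon = sum(p[1] for p in pts) / len(pts)
        markers.append({"lat": round(lat, 4), "lon": round(lon, 4), "count": len(pts)})
    return markers

# Hypothetical air photo centre points (not real data).
PHOTOS = [(45.42, -75.70), (45.43, -75.69), (45.41, -75.71), (46.10, -74.20)]

coarse = summarize(cluster_points(PHOTOS, 0.5))    # zoomed out: few, larger clusters
fine = summarize(cluster_points(PHOTOS, 0.005))    # zoomed in: every photo on its own
```

At the coarse cell size the three nearby photos collapse into one marker; at the fine size each photo gets its own marker, which is the "drilling down" behaviour.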
Karen Hunt, U Manitoba
– chat reference
– have been using PHP live
– solution to bridge gap between culture of students and of librarians
– no real IM culture in library, much less text messaging
– now advertising a text messaging number that connects via Google Android phone to librarian interface
Posted by pzed on October 2, 2009 at 9.33am
#access2009pei – Cary Gordon – Drupal In Libraries
2 Oct 09
Drupal is free, OS, simple, based on blogging idiom. Beneath that is a content management framework that can be used as a rapid, web-based development platform.
Hook system, very flexible for inserting your own code. CVS used for versioning, pretty much anyone can build modules and add to the official repository.
Community designed, with about 800 contributors to core (thousands of contributors to modules), 25 maintainers, 2 core committers. By comparison, Mozilla has 50 contributors, 10 staff (Drupal has no staff). Most OS projects follow that more centralized model.
Drupal principles; D has its own lingo (nodes, blocks, etc.); modules and themes to extend capabilities and customize. “Core” means the basic Drupal install, “contrib” is everything else that can be plugged in.
Requires PHP, most installations on the LAMP stack. Drupal modules connect through the hook system. Use jQuery and PHPTemplate as presentation helpers.
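To make the hook idea concrete, here is a toy sketch in Python (Drupal itself is PHP, and this is not Drupal code). It assumes the naming convention where a module joins in by defining a function called modulename_hookname; the module names, hook names, and the invoke_all helper are invented for illustration.

```python
from types import SimpleNamespace

def invoke_all(modules, hook, *args):
    """Call `<module name>_<hook>` on every module that defines it; collect the results."""
    results = []
    for name, module in modules.items():
        func = getattr(module, f"{name}_{hook}", None)
        if callable(func):
            results.append(func(*args))
    return results

# Two hypothetical "modules"; in Drupal these would be PHP files in a module folder.
catalogue = SimpleNamespace(catalogue_menu=lambda: "catalogue/search")
hours = SimpleNamespace(
    hours_menu=lambda: "hours/today",
    hours_cron=lambda: "refreshed opening hours",
)

MODULES = {"catalogue": catalogue, "hours": hours}

menu_items = invoke_all(MODULES, "menu")    # both modules answer this hook
cron_results = invoke_all(MODULES, "cron")  # only the hours module does
```

The point of the design is that the core never needs to know which modules exist: adding behaviour means adding a function with the right name, not editing core.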
Best practices:
– plan your site before you build it, not after!
– plan for the future, don’t lock yourself in
– get involved in the D community
– back up your site
– test your php snippets: Drupal allows you to put PHP in your content, but there’s no safety net!
– observe Drupal Programming Best Practices
– use a version control system
– keep your site up-to-date; should be in the latest release (usually security patches)
Warning:
– Don’t use a Windows server and IIS
– Don’t hack core: cautionary tale about outsourcing to offshore developers….
Of the 4849 modules contributed, there’s often little info on what they do, most start as solutions to specific problems; strongly encouraged to contribute if we write modules that can be generalized. Easiest way to find out what modules are good is to get out in the community and ask. Gordon’s company has a collection of modules they make available to all sites they build; slightly different packages for different types of libraries.
Rather than get new copies when versions come out and overwrite the old, they use links in the root folder.
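One way to read the links approach, sketched under assumptions: each release lives in its own folder and the site root holds a symbolic link that is repointed on upgrade, so nothing is overwritten and rolling back is just repointing again. The folder names, version numbers, and the activate_release helper are hypothetical, and the sketch assumes a POSIX filesystem.

```python
import os
import tempfile

def activate_release(site_root, release_dir, link_name="current"):
    """Point site_root/link_name at release_dir, replacing any earlier link."""
    link_path = os.path.join(site_root, link_name)
    tmp_path = link_path + ".tmp"
    if os.path.lexists(tmp_path):
        os.remove(tmp_path)
    os.symlink(release_dir, tmp_path)
    os.replace(tmp_path, link_path)  # rename swaps the link in one step on POSIX
    return link_path

def demo():
    with tempfile.TemporaryDirectory() as base:
        site_root = os.path.join(base, "site")
        old = os.path.join(base, "releases", "6.13")  # hypothetical version folders
        new = os.path.join(base, "releases", "6.14")
        for folder in (site_root, old, new):
            os.makedirs(folder)
        link = activate_release(site_root, old)
        first = os.path.basename(os.readlink(link))
        activate_release(site_root, new)
        second = os.path.basename(os.readlink(link))
        kept = sorted(os.listdir(os.path.join(base, "releases")))
        return first, second, kept

first_target, second_target, releases_kept = demo()
```

After the upgrade the link points at the new release while the old folder is still on disk.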
Examples
– Ann Arbor District Library; wanted the library site to be a social site.
– Darien Library; next version of the social opac (SOPAC), integrating catalogue data.
– Genesee Valley BOCES; mostly school libraries, again integrating catalogue data
– Idaho Commission of Libraries
– Troy Public Library; higher level of theming
– Benicia Public Library; good example of a library with few resources, had no website previously, not fully integrated but lots of features
– Camarena Memorial Library; bilingual, have a Spanish site
– North State Cooperative Library System; service site
The whole concept of using a blogging type system is that you have information that’s up-to-date, ideally stays that way.
Example Applications
– CSU San Marcos: digital repository, intranet, e-resources directory; some custom coding, but pretty much take advantage of D’s taxonomy stuff
– SFU Library Thesis submission system; students submit, staff can manage, grad office gets stats
– Cornell Mann Library room booking app
– Anchor Archive Zine Library
– William Hayes’s biometric data curation tool, import data from spreadsheets, with filtering/visualizations and curation tools
– McMaster subject guides
– Islandora (not added to Drupal contrib, yet….)
Brief update on Drupal 7: in code freeze, no fixed release date,
– D7CX pledge: trying to get contribs to commit to being ready for D7 live date
– changes include allowing users to cancel accounts, more semantic class/ID names, friendlier to CSS layout, jQuery 3, and more!
– more database support, SQLite, in theory could extend to Oracle
– Field API
– File API (files now first class nodes)
– Registry
Start with a clear idea of what you’re trying to build, keep your initial install simple
Cracking Drupal a must read
Drupal Library Group, Drupalib, Drupal4lib
DON’T HACK CORE
Posted by pzed on October 2, 2009 at 8.07am
#access2009pei – Peter Rukavina – Infinite Malleability
2 Oct 09
Going to talk about application vs capabilities instead.
When designing a system, you can build apps, or you can build capabilities. Flexible, multi-use systems for multi-talented generalists: farm example, architecture that supports multiple uses. Farmers do lots of stuff. Contrast with a factory, inflexible, purpose-built, made for specialists. Factory workers do one thing over and over. Industrialism changed the system design paradigm from capabilities to applications. The two overlap, but are different sensibilities. Apps are discrete, manageable, predictable, artificial; capabilities are interrelated, malleable, unpredictable and natural.
E.g. of the unix command line as providing capabilities. Contrast Royal Botanical Gardens–a nature application that has been installed in Hamilton–with the Charlottetown Boulder Park, a capability of the urban landscape. There are many other examples.
Library web site (Robertson UPEI) as an example of an application oriented design, vs Google (of course) which provides capability. Is the library’s mission to run a book and reading management application, or to extend society’s knowledge capabilities?
Posted by pzed on October 2, 2009 at 7.04am
#access2009pei – Stevan Harnad – Grasping what is already within immediate reach
1 Oct 09
Open Access means free, online, immediate, permanent access to reading, downloading, storing, printing, data crunching
Primary target is 2.5M articles written for academic journals, primarily author giveaways. Optionally can include books and other categories.
About 25K journals published worldwide. Most universities can only subscribe to a small fraction. Research is having only a fraction of its potential impact, achieving only a fraction of its productivity. OA provides a remedy. Free articles found to be cited > 3x as often (Lawrence 2001), with significant impact advantage. True in every field tested. Research that is freely accessible has 25-250% greater impact (Brody & Harnad 2004).
Two ways to do it: publishers convert to OA (the golden way), researchers deposit in IRs (the green way). However, only 15% of articles are being voluntarily submitted. Gold relies on publishers, whereas Green only requires the research community. USouthampton has created EPrints (which Harnad strongly recommends over DSpace).
Creating IRs is a necessary but not sufficient condition for creating 100% OA. Many repositories, but most are almost empty. Incentives are not sufficient to increase self-archiving. To guarantee 100% self-archiving, must make it an administrative requirement. USouthampton ECS repository virtually 100%. Why?
Publishing is mandated already (publish or perish), self-archiving mandate can be a natural extension. Surveys indicate 95% of researchers would comply, more than 80% willingly. Only those IRs with mandated deposit achieve anywhere near 100% self-archiving. There are currently 98 institutions worldwide with Green mandated deposits. That’s out of over 10,000 institutions. See ROARMAP. There are 57 university mandates so far. There are 41 research funder mandates. In Canada, only one departmental mandate, 8 funder mandates, one funder proposed (NSERC).
OA articles accelerate the research/access/use/citation cycle: OA articles are cited sooner. Time-course of citation/use cycle shows more citations means more downloads. Higher early downloads correlate with high citation rates later.
Mandates should be to
– deposit all articles
– in an IR
– immediately upon acceptance for publication; a compromise is the “immediate deposit – delayed access” mandate
63% of journals endorse immediate, Green OA self-archiving. For the remaining 37%, EPrints has an EPrint Request button. Any user on the web can still reach the metadata, but click “request a copy”, then send an email form that indicates the article is needed for research purposes. Email goes to the author, who can then click “OK” thereby sending a copy to the requestor.
EPrints has rich use metrics. Integrates with CiteBase. One of the rewards of self archiving mandates: authors are often interested in vanity searches. Also important in evaluating impact etc.
Posted by pzed on October 1, 2009 at 10.17pm
#access2009pei – Roy Tennant & Mike Rylander – ILS stuff
1 Oct 09
Roy Tennant
ILS in the Sky with Diamonds
Many of our systems and services are moving into the “cloud”. Moving library data/apps to the network level at web scale. “Cloud computing” opportunities. Computing tasks move from in-house servers to the net. Incorporates infrastructure, platform, and software as services. Low barriers to entry, pay as you go, no need for local server capacity, automatic software upgrades, saves staff. Drawbacks: lack of complete control, reliance on network connectivity, data held by a third party.
Amazon’s web-scale value proposition: most orgs spend 70% on infrastructure. Attempt to flip that: spend 30% on infrastructure, 70% on initiative.
One library example is OLE Project, ground-up new design for an ILS including workflows for all tasks envisioned in new system. Tennant not sure if this project has new funding, at the modelling stage. Another example is the Extensible Cataloging Project.
OCLC believes libraries have too many systems to support, too much invested in maintenance, a fragmented web presence, and lost opportunities for leveraging common data. There’s growing dissatisfaction with existing options, very few alternatives. Starting to see some real choices, finally!
OCLC is uniquely positioned to get into this stuff.
– > 1M libraries worldwide, > 5K transactions per second
– OCLC thinks this could all be done with a handful of commodity servers. Some nice graphics on simplifying the mess of siloed systems we deal with
– how much of our data could be shared, in what ways? vendor data, e.g.
A next-gen LMS
– supports all library management functions
– scalable and 100% web based
– reduced total cost of ownership
– unified platform for print/electronic
– flexible, customizable, but not unique
– network effects: application sharing, data registries
Responsive, scalable, fault tolerant, agile, suitable for public consumption (web services!), integrates with existing OCLC stuff
There’s a Web-scale management strategic steering advisory committee (or sumthin); anticipate rolling out a product in 2011.
Data: intent on making workflows more efficient, allow more intelligent CM decisions. The library that adds data owns the data. What goes in must come out. Leverage the power of WorldCat….
Mike Rylander
OS ILS
Without open software, the knowledge of how to read formats in which open data are stored will be forgotten. Even MARC will die!
OS ILS advantages:
– ROI – Evergreen an order of magnitude cheaper
– leverage with proprietary vendors
– control, over the code, the data, and the direction
– ideology a very good match with libraries
Next big things
– Cloud computing: using other people’s computers, learning not to waste computing resources
– Software as a service: hosting with a service provider, on virtual servers – BUT you don’t get to write the software
– Platform as a service: hosting with a service provider, on virtual servers – BUT you can only run apps targeted at a specific framework
Evergreen is SOA, SaaS(able), PaaS(ish)
Could we leverage the scaling features of Evergreen to build a community owned, run, and maintained PaaS cloud? Vision of each of us running one or two commodity servers talking to each other over the internet, would immediately become the largest Evergreen instance in the world. PINES has 20 servers, never go over 25% utilization.
Posted by pzed on October 1, 2009 at 10.15pm
#access2009pei – Donald Moses & Paul Pound – Islandlives
1 Oct 09
(Donald) Privately funded project to digitize PEI books, running mostly on OS software. At its heart is Islandora.
Community pieces: created the repository of bib records first, began local promotion through newspaper and others, building community support. Then needed to connect with rights holders. Contact authors for copyright-protected items, there are some challenges with handling orphan works. “Post-it note” method of tracking rights holders! Have a notice and takedown page where individuals who believe they hold rights in incorrectly categorized orphans can make contact. Allow community contributed metadata for photographs, which are cropped without metadata from books.
Developed TEI viewer, page viewer. Used Google Forms for QA process.
(Paul) Islandora is a Drupal module. Drupal manages users, roles, permissions. Connects to Fedora Drupal filter. Can use Drupal’s LDAP without having to build into Islandora. Use Drupal permissions to manage permissions for data operations. Drupal also provides themes, easily customized. Some installs use out-of-the-box themes, others minor customizations. Islandora actually six modules; also leverages Lucene, Djatoka, OpenOffice.
– Fedora Repository module
– Scholar module
– Fedora-Attach module, extends Drupal’s file attach module
– Fedora Imageapi module, building on D’s imageapi to manipulate image streams
– Islandlives module: several book object datastreams
– that’s only 5, did I miss one?
Use Fedoragsearch to index MODS, TEI, and DC datastreams. Untokenized fields for lists (place names, etc.); tokenized fields indexed for searching.
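A small sketch of the tokenized vs untokenized distinction, independent of Fedoragsearch or Lucene: the records, field names, and helper functions below are invented for illustration. An untokenized field is indexed as one whole term, which suits browse lists; a tokenized field is split into words so any word can be searched.

```python
from collections import defaultdict

def index_untokenized(records, field):
    """Index the whole field value as a single term (suits browse lists and facets)."""
    index = defaultdict(set)
    for rec_id, rec in records.items():
        index[rec[field]].add(rec_id)
    return index

def index_tokenized(records, field):
    """Split the field into lowercase words so any single word can be searched."""
    index = defaultdict(set)
    for rec_id, rec in records.items():
        for token in rec[field].lower().split():
            index[token.strip(".,;:")].add(rec_id)
    return index

# Hypothetical records, not taken from the IslandLives collection.
RECORDS = {
    "book1": {"place": "Prince Edward Island", "title": "A history of Prince County"},
    "book2": {"place": "Charlottetown", "title": "Charlottetown harbour, a history"},
}

places = index_untokenized(RECORDS, "place")  # whole values, ready for a place-name list
titles = index_tokenized(RECORDS, "title")    # individual words, ready for keyword search
```

Here "Prince Edward Island" stays intact as a list entry, while a search for "history" in the tokenized title index finds both books.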
IslandLives JP2 viewer using djatoka, openlayers, jquery to view jpeg2000 images. djatoka is an OS jp2000 image server. Some issues with djatoka and Fedora.
IslandLives TEI Editor is a separate Drupal module. GUI to mark up TEI, lots of js
Tagging application: where to store tags? usually choose Fedora: have many Drupals hitting on Fedora. Tag xml can be indexed in Lucene.
For the future:
– use Drupal’s Solr module to combine Drupal/Fedora searching in one index.
– take advantage of Drupal hooks to sync data between Fedora and Drupal
There’s an Islandora google group, and FedoraCommons hosting. Islandora team is eight people.
Posted by pzed on October 1, 2009 at 12.25pm
#access2009pei – Dan Chudnov – Chudnovian Stuff
1 Oct 09
Repository Development Group at LC: 30 people, various roles (including dedicated project managers), various backgrounds. LC21 Report guiding LC strategies, from this report the Office of Strategic Initiatives came to be.
– capturing digital artefacts
– make them available for copyright registration/deposit
– pass along for inclusion in the collection
– subsequently processed for cataloguing, indexing etc.
Scale is global: LC universal collection imperative. Capture world scale, distribute web scale. E.g. of wdl.org – global partners, content from all over the world, users as well. Launched April 2009, big press release resulting in 9K requests/second on day 1. Entirely relying on open source software. Clean URIs, static pages: global edge caching with “very well known” caching service.
Another e.g.: Chronicling America; digitization of local/regional newspapers. Approx. 140K US newspaper titles’ bib records, 1.4M pages of content. All freely available now. Scale already over 100 TB from only 16 of 50+ states/territories from about 1850 to 1922. Similar software stack and design decisions to wdl.org
Using the word “movage” more and more: preservation and storage, on a practical day to day, is actually moving bits around. Capture artefacts using BagIt: think of it as a packing slip for data. Tells you what data should be inside, can then check to make sure it’s really there. Sender tells you what is being sent, receiver checks to make sure it really was. Oddly, this hasn’t really been solved previously. Works across space, systems, organizations, time. Also easy to make: tools: md5deep, (bagit library?), bagger; free, OS releases from LC: sf.net/projects/loc-xferutils/. OS release was very new for LC, lawyers got involved, but it got done!
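The packing-slip idea can be sketched in a few lines. This is a simplified illustration, not the LC tools and not a complete or spec-compliant bag: it only writes a manifest-md5.txt listing a checksum and path for each file under data/, then re-checks it on the receiving side. The helper names and the sample file are invented.

```python
import hashlib
import os
import tempfile

def md5_of(path):
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_manifest(bag_dir):
    """Sender side: list every payload file under data/ with its checksum."""
    lines = []
    for root, _dirs, files in os.walk(os.path.join(bag_dir, "data")):
        for name in files:
            full = os.path.join(root, name)
            rel = os.path.relpath(full, bag_dir).replace(os.sep, "/")
            lines.append(f"{md5_of(full)}  {rel}")
    with open(os.path.join(bag_dir, "manifest-md5.txt"), "w", encoding="utf-8") as fh:
        fh.write("\n".join(sorted(lines)) + "\n")

def verify_bag(bag_dir):
    """Receiver side: return a list of problems (an empty list means all is well)."""
    problems = []
    with open(os.path.join(bag_dir, "manifest-md5.txt"), encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue
            expected, rel = line.split(None, 1)
            rel = rel.strip()
            full = os.path.join(bag_dir, *rel.split("/"))
            if not os.path.exists(full):
                problems.append(f"missing: {rel}")
            elif md5_of(full) != expected:
                problems.append(f"checksum mismatch: {rel}")
    return problems

def demo():
    with tempfile.TemporaryDirectory() as bag:
        os.makedirs(os.path.join(bag, "data"))
        page = os.path.join(bag, "data", "page1.txt")
        with open(page, "w", encoding="utf-8") as fh:
            fh.write("first page of a digitized newspaper\n")
        write_manifest(bag)
        intact = verify_bag(bag)
        with open(page, "a", encoding="utf-8") as fh:
            fh.write("bit rot\n")  # simulate damage in transit or storage
        damaged = verify_bag(bag)
        return intact, damaged

intact_report, damaged_report = demo()
```

The same check works whether the bag moved across a network, between organizations, or just sat on a disk for ten years, which is the "space, systems, organizations, time" point.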
Challenge of managing communication among people: for every bit that moves around, there are human communications that have to take place. Need to improve transfer, inventory, workflows.
Chudnov really cares about incorporating digital objects in the collection. Traditionally using catalogue records, exhibit sites. Cost of integrating everything in this way is high. Hard, expensive, need skilled people with time. Cost of updating everything is even higher. Good news: cost of consistent web strategies (increasingly adopted) is low. E.g. of linked data. linked data design issues. In LC’s case, LC authorities on the web is a newly available example. Machine readable view is acceptable for people and bots, and the end point includes a clean, concise definition of what it means (mainly for humans, but bots can work with it too).
Visit a URI, get something that defines a concept with a precise meaning. This is a standard way to refer to a catalog heading. Never had that before. A healthy web of data. Available now, and can download and mashup. Also new for LC to give headings away like this.
OAI-ORE aggregations to describe data. Look at the web, see a thing; OAI-ORE defines the constellation of things that make up that page. Each concept defined explicitly in the RDF. Interesting thing about all this work: the web itself is the API. Repeat that! No secret key, no custom interface.
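What "the web itself is the API" looks like from a client, as a hedged sketch: the only tools are a URI and an ordinary HTTP request, optionally with an Accept header asking for a machine-readable view. The URI below is a placeholder, not a real LC identifier, and whether any given server honours the Accept header is an assumption. The request is built but not sent, so the sketch runs offline.

```python
from urllib.request import Request

def linked_data_request(uri, media_type="application/rdf+xml"):
    """Build a plain HTTP GET for a concept URI, asking for an RDF representation."""
    return Request(uri, headers={"Accept": media_type})

# Placeholder URI for illustration only.
req = linked_data_request("http://example.org/authorities/subject-heading-123")

method = req.get_method()
accept = req.get_header("Accept")
# To actually fetch: urllib.request.urlopen(req).read()
```

No API key, no client library specific to the publisher: anything that can issue a GET can follow the link.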
LC mission says make things available and useful. Idea for how to incorporate digitally into the collection–“sustain and preserve a universal collection”–if we’re consistent about what we mean when we publish something, giving people links to follow, and everyone is consistent in the same way, end up with distributed conceptual integration. The web is a universal collection. Let’s all incorporate all our artefacts into the universal web!
Posted by pzed on October 1, 2009 at 11.44am
#access2009pei – Mark Jordan & Brian Owen – COPPUL stuff
1 Oct 09
Mark Jordan
COPPUL’s LOCKSS Private Network
LOCKSS preserves by making at least six copies of things. Does a preservation check to ensure copies do not become damaged. Private networks typically have mixed content, public network primarily ejournals.
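A much-simplified sketch of why multiple copies plus a periodic check protects content: peers compare hashes, the majority wins, and any dissenting copy is repaired from a good one. The real LOCKSS polling protocol is considerably more involved; the peer names, the content, and the audit and repair helpers here are invented for illustration.

```python
import hashlib
from collections import Counter

def content_hash(content):
    return hashlib.sha256(content).hexdigest()

def audit(copies):
    """Compare hashes across peers; return (agreed hash, sorted peers needing repair)."""
    hashes = {peer: content_hash(content) for peer, content in copies.items()}
    winner, votes = Counter(hashes.values()).most_common(1)[0]
    if votes <= len(copies) // 2:
        raise ValueError("no majority; cannot tell which copy is good")
    damaged = sorted(peer for peer, h in hashes.items() if h != winner)
    return winner, damaged

def repair(copies, damaged, winner):
    """Overwrite each damaged copy with a copy that matches the agreed hash."""
    good = next(c for c in copies.values() if content_hash(c) == winner)
    for peer in damaged:
        copies[peer] = good
    return copies

# Six hypothetical peers holding the same archival unit; one copy has gone bad.
copies = {f"peer{i}": b"volume 12, issue 3" for i in range(1, 7)}
copies["peer4"] = b"volume 12, issue 3 (corrupted)"

winner, damaged = audit(copies)
repair(copies, damaged, winner)
_, damaged_after = audit(copies)
```

With six copies, a single damaged one is outvoted five to one and restored; with only one or two copies there would be no way to tell which version is right.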
How does something get into the LOCKSS network? On the public network, there’s a nomination/voting process. On the private, content is determined by whoever manages the private network. COPPUL includes collections of local interest, of greater than usual risk of being lost if not preserved by LOCKSS. Can be done on a low end server, storage about 1-4 TB. Storage is the big hurdle. Minimal staff needs to set up and run the machine.
COPPUL: OJS content, CONTENTdm, USask ETD database, local “staged” content.
To allow content to be harvested, must set up a manifest to tell LOCKSS crawler that it has permission to access. OJS supports this, not surprisingly.
One outstanding tech task: integrating LOCKSS private network into campus proxy.
Staged content: not public facing, packaged for programmatic exposure and retrieval; e.g. of CONTENTdm content, SFU Editorial Cartoons. Intended to be dbs/repository neutral and to facilitate long-term preservation. Archival units are folders containing zip files, manifest page links to file and says “yes, you can harvest”. BagIt specification identifies content/metadata in the directories. Much simpler than an XML packaging format. Relies heavily on checksums. Metadata itself will be XML.
Brian Owen
Software Lifecycles & Sustainability: a PKP and reSearcher Update
reSearcher: CUFTS, GODOT, dbWiz, Open Knowledgebase
PKP: OJS, OCS, Harvester, OMP, Lemon8-XML, PKP WAL
Projects are both open source (GPL), LAMP architecture.
Under development: Open Monograph Press (OMP); PKP Web Application Library (WAL). PKP user interface upgrade will be tried out first on OMP.
Open source not just about good code:
– community building
– sustainability strategies
Posted by pzed on October 1, 2009 at 9.35am
#access2009pei – Richard Akerman – Will We Command Our Data?
1 Oct 09
The David Binkley Lecture.
(Akerman writes Science Library Pad.)
Issues around data use and management are not unlike those facing copyright.
How big is data? Although storage capacity is significantly improved, it takes about ten 2m tall racks to contain a petabyte. There is a physical aspect to data, and costs associated with it. At the petabyte scale, data must be close to computation because of bandwidth constraints.
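Some back-of-the-envelope arithmetic on the bandwidth point. The link speeds are assumptions chosen for illustration (they were not given in the talk), a petabyte is taken as 10^15 bytes, and the link is assumed to be fully saturated with no protocol overhead, which is optimistic.

```python
PETABYTE_BYTES = 10**15   # decimal petabyte (assumption)
SECONDS_PER_DAY = 86400

def transfer_days(num_bytes, link_bits_per_second):
    """Days needed to move num_bytes over a link running flat out."""
    seconds = num_bytes * 8 / link_bits_per_second
    return seconds / SECONDS_PER_DAY

days_1gbps = transfer_days(PETABYTE_BYTES, 1e9)     # roughly three months
days_10gbps = transfer_days(PETABYTE_BYTES, 10e9)   # still more than a week
```

Even on the faster hypothetical link, shipping the data to the computation takes over a week per petabyte, so moving the computation to the data is the practical option.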
Four sources of data: research data, government data, library data, personal data. Government data is being released a bit more freely, so there’s more of it and we might be in a position to leverage even more into the public realm.
Convergence of factors since 2000: value of sharing, ease of sharing, and level of sharing at the machine level. We see this as good, and it’s increasingly easy to do. Are increasingly able to expose raw data to machines and take advantage of the rote activities in processing that machines do really well.
“OECD Principles and Guidelines for Access to Research Data from Public Funding” (April 2007). Fairly non-controversial principle that if the public funds research, the data should be released publicly. Publishers do not have a vested interest in becoming data publishers.
“The Toronto Statement on prepublication data sharing” (September 2009). Encouraging sharing of data before the long publication process.
OECD: “Open access to research data … easy, timely, user-friendly and preferably Internet based”
Gov’t data: US Memorandum on Transparency and Open Government, US Memorandum on the FOIA; commitment to public release of gov’t information and the power of transparency. UK Power of Information Task Force: “public information held by for example the police, health bodies and local authorities is often not available. This is bad for democratic expression, the economy, and citizen customers.” US – data.gov; can librarians help governments learn to share this data more effectively? UK PM Brown meets with Tim Berners-Lee, announces UK wants to release gov’t data as linked data.
Library data: ILS Customer bill-of-rights (2005); Berkeley accord (2008).
Personal data: privacy risks, but potential power from the data in our lives. Wired cover feature “Living by numbers” (July 2009). Twitter will soon allow you to opt-in to automatically recording your geographic position.
Why libraries? Advocates, exemplars, experts. Open up data in a sensible, productive, usable way. Unlike print, data is not self-describing. E.g. of DataCite: “DOIs for data”; NRC-CISTI Gateway to (Canadian) Scientific Data Sets.
Canada hasn’t had the strong push from the PM/Pres level that other nations have, but there are significant projects. It’s actually really difficult to release government data under crown copyright. Can look at geogratis, DLI, and ODESI for examples of how it can be done.
Municipal efforts too: Vancouver, Toronto has plans, Ottawa working on a policy.
Back to Library data: how do we connect library data to patrons in a similar way? Some examples: a million free covers from LibraryThing, the Open Library has pulled data from all over, TALIS Connected Commons specifically about linked data, MESUR (resolver data) – we have data in our resolver logs that we could use to build interesting tools, LCSH (see Dan Chudnov, later).
APIs vs raw data….
Personal data: Daytum, people can record almost anything about themselves.
Back to the peta-scale: Total Recall; only valuable if you could find stuff in that huge store of information. Libraries as preserving culture and its outputs, must think about how we record and preserve people’s lifestreams.
Posted by pzed on October 1, 2009 at 8.15am
