words what do you read, m'lord?

 

access 2009 category archive

#access2009pei – Peter Rukavina – Infinite Malleability

Going to talk about application vs capabilities instead.

When designing a system, you can build apps, or you can build capabilities. Flexible, multi-use systems for multi-talented generalists: farm example, architecture that supports multiple uses. Farmers do lots of stuff. Contrast with a factory, inflexible, purpose-built, made for specialists. Factory workers do one thing over and over. Industrialism changed the system design paradigm from capabilities to applications. The two overlap, but are different sensibilities. Apps are discrete, manageable, predictable, artificial; capabilities are interrelated, malleable, unpredictable and natural.

E.g. of the unix command line as providing capabilities. Contrast Royal Botanical Gardens–a nature application that has been installed in Hamilton–with the Charlottetown Boulder Park, a capability of the urban landscape. There are many other examples.

Library web site (Robertson UPEI) as an example of an application oriented design, vs Google (of course) which provides capability. Is the library’s mission to run a book and reading management application, or to extend society’s knowledge capabilities?

Posted by pzed on October 2, 2009 at 7.04am

#access2009pei – Stevan Harnad – Grasping what is already within immediate reach

Open Access means free, online, immediate, permanent access to reading, downloading, storing, printing, data crunching

Primary target is 2.5M articles written for academic journals, primarily author giveaways. Optionally can include books and other categories.

About 25K journals published worldwide. Most universities can only subscribe to a small fraction. Research is having only a fraction of its potential impact, achieving only a fraction of its productivity. OA provides a remedy. Free articles found to be cited > 3x as often (Lawrence 2001), with significant impact advantage. True in every field tested: research that is freely accessible has 25-250% greater impact (Brody & Harnad 2004).

Two ways to do it: publishers convert to OA (the golden way), researchers deposit in IRs (the green way). However, only 15% of articles are being voluntarily submitted. Gold relies on publishers, whereas Green only requires the research community. USouthampton has created EPrints (which Harnad strongly recommends over DSpace).

Creating IRs is a necessary but not sufficient condition for creating 100% OA. Many repositories, but most are almost empty. Incentives are not sufficient to increase self-archiving. To guarantee 100% self-archiving, must make it an administrative requirement. USouthampton ECS repository virtually 100%. Why?

Publishing is mandated already (publish or perish), so a self-archiving mandate can be a natural extension. Surveys indicate 95% of researchers would comply, more than 80% willingly. Only those IRs with mandated deposit achieve anywhere near 100% self-archiving. There are currently 98 institutions worldwide with Green mandated deposits. That’s out of over 10,000 institutions. See ROARMAP. There are 57 university mandates so far. There are 41 research funder mandates. In Canada, only one departmental mandate, 8 funder mandates, one funder proposed (NSERC).

OA articles accelerate the research/access/use/citation cycle: OA articles are cited sooner. Time-course of citation/use cycle shows more citations means more downloads. Higher early downloads correlate with higher citation rates later.

Mandates should be to
– deposit all articles
– in an IR
– immediately upon acceptance for publication; a compromise is the “immediate deposit – delayed access” mandate

63% of journals endorse immediate, Green OA self-archiving. For the remaining 37%, EPrints has an EPrint Request button. Any user on the web can still reach the metadata, then click “request a copy” and send an email form indicating that the article is needed for research purposes. The email goes to the author, who can then click “OK”, thereby sending a copy to the requester.
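
The request button workflow can be sketched roughly like this. It is a minimal Python illustration under stated assumptions, not EPrints code: the names `Eprint`, `CopyRequest`, `Outbox`, `request_copy` and `approve` are hypothetical, and email delivery is simulated with an in-memory list.

```python
from dataclasses import dataclass, field


@dataclass
class Eprint:
    """A deposited article. Metadata is always visible; full text may be closed."""
    title: str
    author_email: str
    open_access: bool
    fulltext: bytes = b""


@dataclass
class CopyRequest:
    eprint: Eprint
    requester_email: str
    reason: str
    approved: bool = False


@dataclass
class Outbox:
    """Stand-in for a mail system (hypothetical): records (recipient, subject)."""
    sent: list = field(default_factory=list)

    def send(self, to, subject):
        self.sent.append((to, subject))


def request_copy(eprint, requester_email, reason, outbox):
    """Reader clicks 'request a copy': the author is emailed the request."""
    req = CopyRequest(eprint, requester_email, reason)
    outbox.send(eprint.author_email, f"Copy requested: {eprint.title}")
    return req


def approve(req, outbox):
    """Author clicks 'OK': a copy is emailed to the requester."""
    req.approved = True
    outbox.send(req.requester_email, f"Your copy of: {req.eprint.title}")
    return req.eprint.fulltext
```

The point of the design is that the deposit is immediate even when access is delayed: the author stays in the loop for each closed item, one click per request.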

EPrints has rich use metrics. Integrates with CiteBase. One of the rewards of self-archiving mandates: authors are often interested in vanity searches. Also important in evaluating impact etc.

Posted by pzed on October 1, 2009 at 10.17pm

#access2009pei – Roy Tennant & Mike Rylander – ILS stuff

Roy Tennant

ILS in the Sky with Diamonds

Many of our systems and services are moving into the “cloud”. Moving library data/apps to the network level at web scale. “Cloud computing” opportunities. Computing tasks move from in-house servers to the net. Incorporates infrastructure, platform, and software as services. Low barriers to entry, pay as you go, no need for local server capacity, automatic software upgrades, saves staff. Drawbacks: lack of complete control, reliance on network connectivity, data held by a third party.

Amazon’s web-scale value proposition: most orgs spend 70% on infrastructure. Attempt to flip that: spend 30% on infrastructure, 70% on initiative.

One library example is OLE Project, ground-up new design for an ILS including workflows for all tasks envisioned in new system. Tennant not sure if this project has new funding, at the modelling stage. Another example is the Extensible Cataloging Project.

OCLC believes libraries have too many systems to support, too much invested in maintenance, a fragmented web presence, and lost opportunities for leveraging common data. There’s growing dissatisfaction with existing options, very few alternatives. Starting to see some real choices, finally!

OCLC is uniquely positioned to get into this stuff.
– > 1M libraries worldwide, > 5K transactions per second
– OCLC thinks this could all be done with a handful of commodity servers. Some nice graphics on simplifying the mess of siloed systems we deal with
– how much of our data could be shared, in what ways? vendor data, e.g.

A next-gen LMS
– supports all library management functions
– scalable and 100% web based
– reduced total cost of ownership
– unified platform for print/electronic
– flexible, customizable, but not unique
– network effects: application sharing, data registries

Responsive, scalable, fault tolerant, agile, suitable for public consumption (web services!), integrates with existing OCLC stuff

There’s a Web-scale management strategic steering advisory committee (or sumthin); anticipate rolling out a product in 2011.

Data: intent on making workflows more efficient, allow more intelligent CM decisions. The library that adds data owns the data. What goes in must come out. Leverage the power of WorldCat….

Mike Rylander

OS ILS

Without open software, the knowledge of how to read formats in which open data are stored will be forgotten. Even MARC will die!

OS ILS advantages:
– ROI – Evergreen an order of magnitude cheaper
– leverage with proprietary vendors
– control, over the code, the data, and the direction
– ideology a very good match with libraries

Next big things
– Cloud computing: using other people’s computers, learning not to waste computing resources
– Software as a service: hosting with a service provider, on virtual servers – BUT you don’t get to write the software
– Platform as a service: hosting with a service provider, on virtual servers – BUT you can only run apps targeted at a specific framework

Evergreen is SOA, SaaS(able), PaaS(ish)

Could we leverage the scaling features of Evergreen to build a community owned, run, and maintained PaaS cloud? Vision of each of us running one or two commodity servers talking to each other over the internet, which would immediately become the largest Evergreen instance in the world. PINES has 20 servers, never go over 25% utilization.

Posted by pzed on October 1, 2009 at 10.15pm

#access2009pei – Donald Moses & Paul Pound – Islandlives

(Donald) Privately funded project to digitize PEI books, running mostly on OS software. At its heart is Islandora.

Community pieces: created the repository of bib records first, began local promotion through newspaper and others, building community support. Then needed to connect with rights holders. Contact authors for copyright-protected items; there are some challenges with handling orphan works. “Post-it note” method of tracking rights holders! Have a notice and takedown page through which individuals who believe they have rights in incorrectly categorized orphans can make contact. Allow community-contributed metadata for photographs, which are cropped without metadata from books.

Developed a TEI viewer and a page viewer. Used Google Forms for QA process.

(Paul) Islandora is a Drupal module. Drupal manages users, roles, permissions. Connects to Fedora Drupal filter. Can use Drupal’s LDAP without having to build into Islandora. Use Drupal permissions to manage permissions for data operations. Drupal also provides themes, easily customized. Some installs use out-of-the-box themes, others minor customizations. Islandora actually six modules; also leverages Lucene, Djatoka, OpenOffice.
– Fedora Repository module
– Scholar module
– Fedora-Attach module, extends Drupal’s file attach module
– Fedora Imageapi module, building on D’s imageapi to manipulate image streams
– Islandlives module: several book object datastreams
– that’s only 5, did I miss one?

Use Fedoragsearch to index MODS, TEI, and DC datastreams. Untokenized fields for lists (place names, etc.); tokenized fields indexed for searching.
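
To make the tokenized/untokenized distinction concrete, here is a toy index in Python. It is only a sketch of the idea: `TinyIndex` and its methods are hypothetical names, and this is not how Fedoragsearch or Lucene are actually invoked.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on non-alphanumerics, as a simple analyzer might."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]


class TinyIndex:
    """Toy index: tokenized terms for searching, whole values for browse lists."""

    def __init__(self):
        self.tokenized = defaultdict(set)    # term -> doc ids
        self.untokenized = defaultdict(set)  # exact field value -> doc ids

    def add(self, doc_id, text, place):
        for term in tokenize(text):
            self.tokenized[term].add(doc_id)
        # The place name is stored whole so it can be listed as-is.
        self.untokenized[place].add(doc_id)

    def search(self, word):
        """Single-term lookup against the tokenized field."""
        return sorted(self.tokenized.get(word.lower(), set()))

    def browse_places(self):
        """Alphabetical list of exact place names."""
        return sorted(self.untokenized)
```

A search for "island" matches any record containing that word, while the browse list shows "Prince Edward Island" as one entry rather than three separate terms.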

IslandLives JP2 viewer using djatoka, openlayers, jquery to view JPEG 2000 images. djatoka is an OS JPEG 2000 image server. Some issues with djatoka and Fedora.

IslandLives TEI Editor is a separate Drupal module. GUI to mark up TEI, lots of js

Tagging application: where to store tags? usually choose Fedora: have many Drupals hitting on Fedora. Tag xml can be indexed in Lucene.

For the future:
– use Drupal’s Solr module to combine Drupal/Fedora searching in one index.
– take advantage of Drupal hooks to sync data between Fedora and Drupal

There’s an Islandora google group, and FedoraCommons hosting. Islandora team is eight people.

Posted by pzed on October 1, 2009 at 12.25pm

#access2009pei – Dan Chudnov – Chudnovian Stuff

Repository Development Group at LC: 30 people, various roles (including dedicated project managers), various backgrounds. LC21 Report guiding LC strategies; from this report the Office of Strategic Initiatives came to be.
– capturing digital artefacts
– make them available for copyright registration/deposit
– pass along for inclusion in the collection
– subsequently processed for cataloguing, indexing etc.

Scale is global: LC universal collection imperative. Capture world scale, distribute web scale. E.g. of wdl.org – global partners, content from all over the world, users as well. Launched April 2009, big press release resulting in 9K requests/second on day 1. Entirely relying on open source software. Clean URIs, static pages: global edge caching with “very well known” caching service.

Another e.g.: Chronicling America; digitization of local/regional newspapers. Approx. 140K US newspaper titles’ bib records, 1.4M pages of content. All freely available now. Scale already over 100TB from only 16 of 50+ states/territories from about 1850 to 1922. Similar software stack and design decisions to wdl.org

Using the word “movage” more and more: preservation and storage, on a practical day to day, is actually moving bits around. Capture artefacts using BagIt: think of it as a packing slip for data. Tells you what data should be inside, can then check to make sure it’s really there. Sender tells you what is being sent, receiver checks to make sure it really was. Oddly, this hasn’t really been solved previously. Works across space, systems, organizations, time. Also easy to make: tools: md5deep, (bagit library?), bagger; free, OS releases from LC: sf.net/projects/loc-xferutils/. OS release was very new for LC, lawyers got involved, but it got done!
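
The packing-slip idea can be sketched in a few lines of Python. This is a simplified illustration assuming the usual BagIt layout (payload files under `data/`, checksums listed in `manifest-md5.txt`); it skips the other tag files a real bag carries, and the function names are made up for the example rather than taken from the LC tools.

```python
import hashlib
from pathlib import Path


def md5_of(path):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(bag_dir):
    """Sender: list every payload file under data/ with its checksum."""
    bag = Path(bag_dir)
    lines = []
    for path in sorted((bag / "data").rglob("*")):
        if path.is_file():
            rel = path.relative_to(bag).as_posix()
            lines.append(f"{md5_of(path)}  {rel}")
    (bag / "manifest-md5.txt").write_text("\n".join(lines) + "\n")


def verify(bag_dir):
    """Receiver: return a list of problems; an empty list means all is there."""
    bag = Path(bag_dir)
    problems = []
    for line in (bag / "manifest-md5.txt").read_text().splitlines():
        if not line.strip():
            continue
        expected, rel = line.split(maxsplit=1)
        target = bag / rel
        if not target.is_file():
            problems.append(f"missing: {rel}")
        elif md5_of(target) != expected:
            problems.append(f"checksum mismatch: {rel}")
    return problems
```

The sender runs `write_manifest` before shipping the directory; the receiver runs `verify` on arrival, and can run it again years later, which is the "works across time" part.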

Challenge of managing communication among people: for every bit that moves around, there are human communications that have to take place. Need to improve transfer, inventory, workflows.

Chudnov really cares about incorporating digital objects in the collection. Traditionally using catalogue records, exhibit sites. Cost of integrating everything in this way is high. Hard, expensive, need skilled people with time. Cost of updating everything is even higher. Good news: cost of consistent web strategies (increasingly adopted) is low. E.g. of linked data. linked data design issues. In LC’s case, LC authorities on the web is a newly available example. Machine readable view is acceptable for people and bots, and the end point includes a clean, concise definition of what it means (mainly for humans, but bots can work with it too).

Visit a URI, get something that defines a concept with a precise meaning. This is a standard way to refer to a catalog heading. Never had that before. A healthy web of data. Available now, and can download and mashup. Also new for LC to give headings away like this.

OAI-ORE aggregations to describe data. Look at the web, see a thing; OAI-ORE defines the constellation of things that make up that page. Each concept defined explicitly in the RDF. Interesting thing about all this work: the web itself is the API. Repeat that! No secret key, no custom interface.
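
A rough sketch of what "the web itself is the API" can look like from the client side, in Python. It assumes the server supports HTTP content negotiation, which these notes don't confirm; the URI is a made-up placeholder, and the request is only built, not sent.

```python
from urllib.request import Request


def linked_data_request(uri, rdf=True):
    """Build (but do not send) a GET request for a concept URI.

    The same URI serves people and bots; the Accept header states which
    representation the client would prefer.
    """
    accept = "application/rdf+xml" if rdf else "text/html"
    return Request(uri, headers={"Accept": accept}, method="GET")


# Hypothetical URI, for illustration only; not a real heading identifier.
CONCEPT_URI = "http://example.org/authorities/concept-123"

req = linked_data_request(CONCEPT_URI)
# To fetch for real, pass req to urllib.request.urlopen(req).
```

No key, no client library: a plain HTTP GET with a different Accept header is the whole interface.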

LC mission says make things available and useful. Idea for how to incorporate digital objects into the collection–“sustain and preserve a universal collection”–if we’re consistent about what we mean when we publish something, giving people links to follow, and everyone is consistent in the same way, end up with distributed conceptual integration. The web is a universal collection. Let’s all incorporate all our artefacts into the universal web!

Posted by pzed on October 1, 2009 at 11.44am

#access2009pei – Mark Jordan & Brian Owen – COPPUL stuff

Mark Jordan

COPPUL’s LOCKSS Private Network

LOCKSS preserves by making at least six copies of things. Does a preservation check to ensure copies do not become damaged. Private networks typically have mixed content, public network primarily ejournals.
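
The general idea behind checking many copies against each other can be sketched as a majority vote over content hashes. This Python example is a deliberately simplified illustration, not the actual LOCKSS polling protocol; `audit` and `repair` are hypothetical names.

```python
import hashlib
from collections import Counter


def audit(copies):
    """Compare copies by hash; return (consensus_digest, damaged_peer_names).

    copies maps a peer name to the bytes that peer holds.
    """
    digests = {peer: hashlib.sha256(data).hexdigest()
               for peer, data in copies.items()}
    consensus, votes = Counter(digests.values()).most_common(1)[0]
    if votes <= len(copies) / 2:
        raise ValueError("no majority; cannot tell which copies are intact")
    damaged = sorted(p for p, d in digests.items() if d != consensus)
    return consensus, damaged


def repair(copies):
    """Overwrite damaged copies with the content of an agreeing peer."""
    consensus, damaged = audit(copies)
    good = next(data for data in copies.values()
                if hashlib.sha256(data).hexdigest() == consensus)
    for peer in damaged:
        copies[peer] = good
    return damaged
```

With six copies, one or two can rot and the remaining majority still identifies and fixes them, which is why the number of copies matters.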

How does something get into the LOCKSS network? On the public network, there’s a nomination/voting process. On the private, content is determined by whoever manages the private network. COPPUL includes collections of local interest, of greater than usual risk of being lost if not preserved by LOCKSS. Can be done on a low end server, storage about 1-4TB. Storage is the big hurdle. Minimal staff needs to set up and run the machine.

COPPUL: OJS content, CONTENTdm, USask ETD database, local “staged” content.

To allow content to be harvested, must set up a manifest to tell LOCKSS crawler that it has permission to access. OJS supports this, not surprisingly.
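
A crawler-side check of such a manifest might look roughly like the Python sketch below. The permission wording in `PERMISSION` is an assumption and should be checked against the LOCKSS documentation; `ManifestParser` and `harvestable_links` are hypothetical names, and no real crawling happens here.

```python
from html.parser import HTMLParser

# Assumed permission wording; verify against the LOCKSS documentation.
PERMISSION = ("LOCKSS system has permission to collect, preserve, "
              "and serve this Archival Unit")


class ManifestParser(HTMLParser):
    """Collect visible text and link targets from a manifest page."""

    def __init__(self):
        super().__init__()
        self.text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        self.text.append(data)


def harvestable_links(manifest_html):
    """Return the links a crawler may follow, or [] if permission is absent."""
    parser = ManifestParser()
    parser.feed(manifest_html)
    # Normalize whitespace so line breaks in the page don't hide the statement.
    page_text = " ".join(" ".join(parser.text).split())
    return parser.links if PERMISSION in page_text else []
```

The manifest does two jobs at once: it says "yes, you can harvest" and it tells the crawler where to start.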

One outstanding tech task: integrating LOCKSS private network into campus proxy.

Staged content: not public facing, packaged for programmatic exposure and retrieval; e.g. of CONTENTdm content, SFU Editorial Cartoons. Intended to be dbs/repository neutral and to facilitate long-term preservation. Archival units are folders containing zip files, manifest page links to file and says “yes, you can harvest”. BagIt specification identifies content/metadata in the directories. Much simpler than an XML packaging format. Relies heavily on checksums. Metadata itself will be XML.

Brian Owen

Software Lifecycles & Sustainability: a PKP and reSearcher Update

reSearcher: CUFTS, GODOT, dbWiz, Open Knowledgebase
PKP: OJS, OCS, Harvester, OMP, Lemon8-XML, PKP WAL

Projects are both open source (GPL), LAMP architecture.

Under development: Open Monograph Press (OMP); PKP Web Application Library (WAL). PKP user interface upgrade will be tried out first on OMP.

Open source not just about good code:
– community building
– sustainability strategies

Posted by pzed on October 1, 2009 at 9.35am

#access2009pei – Richard Akerman – Will We Command Our Data?

The David Binkley Lecture.

(Akerman writes Science Library Pad.)

Issues around data use and management are not unlike those facing copyright.

How big is data? Although storage capacity is significantly improved, it takes about ten 2m tall racks to contain a petabyte. There is a physical aspect to data, and costs associated with it. At the petabyte scale, data must be close to computation because of bandwidth constraints.
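
A quick back-of-envelope calculation shows why. The link speed below is an assumed figure for illustration, not something from the talk, and the calculation ignores protocol overhead.

```python
def transfer_days(num_bytes, bits_per_second):
    """Days needed to move num_bytes over a link at a sustained bit rate."""
    seconds = num_bytes * 8 / bits_per_second
    return seconds / 86_400


PETABYTE = 10**15      # bytes, decimal definition
GIGABIT_LINK = 10**9   # bits per second, an assumed sustained link speed

days = transfer_days(PETABYTE, GIGABIT_LINK)
print(f"{days:.1f} days")  # prints "92.6 days"
```

At roughly three months per petabyte over a fully saturated gigabit link, it is cheaper to send the computation to the data than the data to the computation.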

Four sources of data: research data, government data, library data, personal data. Government data is being released a bit more freely, so there’s more of it and we might be in a position to leverage even more into the public realm.

Convergence of factors since 2000: value of sharing, ease of sharing, and level of sharing at the machine level. We see this as good, and it’s increasingly easy to do. Are increasingly able to expose raw data to machines and take advantage of the rote activities in processing that machines do really well.

“OECD Principles and Guidelines for Access to Research Data from Public Funding” (April 2007). Fairly non-controversial principle that if the public funds research, the data should be released publicly. Publishers do not have a vested interest in becoming data publishers.

“The Toronto Statement on prepublication data sharing” (September 2009). Encouraging sharing of data before the long publication process.

OECD: “Open access to research data … easy, timely, user-friendly and preferably Internet-based”

Gov’t data: US Memorandum on Transparency and Open Government, US Memorandum on the FOIA; commitment to public release of gov’t information and the power of transparency. UK Power of Information Task Force: “public information held by for example the police, health bodies and local authorities is often not available. This is bad for democratic expression, the economy, and citizen customers.” US – data.gov; can librarians help governments learn to share this data more effectively? UK PM Brown meets with Tim Berners-Lee, announces UK wants to release gov’t data as linked data.

Library data: ILS Customer bill-of-rights (2005); Berkeley accord (2008).

Personal data: privacy risks, but potential power from the data in our lives. Wired cover feature “Living by numbers” (July 2009). Twitter will soon allow you to opt in to automatically recording your geographic position.

Why libraries? Advocates, exemplars, experts. Open up data in a sensible, productive, usable way. Unlike print, data is not self-describing. E.g. of DataCite: “DOIs for data”; NRC-CISTI Gateway to (Canadian) Scientific Data Sets.

Canada hasn’t had the strong push from the PM/Pres level that other nations have, but there are significant projects. It’s actually really difficult to release government data under crown copyright. Can look at geogratis, DLI, and ODESI for examples of how it can be done.

Municipal efforts too: Vancouver, Toronto has plans, Ottawa working on a policy.

Back to Library data: how do we connect library data to patrons in a similar way? Some examples: a million free covers from LibraryThing, the Open Library has pulled data from all over, TALIS Connected Commons specifically about linked data, MESUR (resolver data) – we have data in our resolver logs that we could use to build interesting tools, LCSH (see Dan Chudnov, later).

APIs vs raw data….

Personal data: Daytum, people can record almost anything about themselves.

Back to the peta-scale: Total Recall; only valuable if you could find stuff in that huge store of information. Libraries as preserving culture and its outputs, must think about how we record and preserve people’s lifestreams.

Posted by pzed on October 1, 2009 at 8.15am

#access2009pei – Cory Doctorow – Copyright vs Universal Access

Tale of two networks: the one we thought we would get, delivering 500 channels of high-res TV! The network that would make us more socially normal (instead of infinitely weirder). David Isenberg calls this the “smart” network.

Instead, we got a dumb network, in which the people in the middle don’t know what the tech is for or what people would do with it. Great advantage to this is that people at the edge can be very smart.

Surprisingly, the dumb network delivered progressively lower resolution. Example of telephone, from high quality centrally controlled network, through introduction of crappy phones, to mobile, to Skype. We trade quality for price, access, and customizability. Content isn’t king, conversation is.

Every exec thinks their industry is the most important thing ever, and they are regularly proven wrong by the cycle of creative destruction that is the market economy. Except when they have a regulatory monopoly.

Countries have formerly managed copyright in local, idiosyncratic ways. However, the current regime is governed by a harmonized approach developed through WTO etc. and these rules are written primarily by industry insiders, preferring rights of producers over rights of users.

The network is fundamentally a copying machine, with increasing capacity for storage. It just gets easier to copy. But copying is reified not as an act of an individual, but as an act of a company making copies on an industrial scale. The problem is it doesn’t take a giant, industrial machine to make a copy any more, but we trigger the same set of regulations that govern industry to govern the activities of private individuals. On the internet, we make copies simply by accessing material. We communicate, make plans; read for education, political engagement; work, fall in love… all governed by copyright.

UK study: Extending the term of copyright has a net negative effect economically. DRM doesn’t work. Policies are set without any recourse to evidence. Industrial revolution was not based on buying and selling machines, but on using and having access to them. Info revolution must also be based on access and use.

The punishment for infringement in many places is disconnection from the internet. Effectively, this is equivalent to the death penalty for citizenship. Future treaties may build surveillance and control into regulations, requiring hardware to be checked at borders, ISPs to inspect packets. These negotiations are entirely in secret, the Obama admin says its position papers are state secrets. Why? Because experience has shown (Hello, Sam Bulte) that when the public becomes aware of them, we rebel.

Copyright law should go on doing what it’s always done: regulate the way corporate entities interact with one another, not how we as individuals act. The point of copyright law can’t be to ensure that one group of people get to make a living for ever. Rather, its role should be to ensure that the greatest number of people can participate in culture. Libraries have an important role, as an unimpeachable moral authority.

Posted by pzed on October 1, 2009 at 7.16am