Wednesday, September 23, 2009

Completely backwards

Emails, blog posts, and tweets are flying by regarding OCLC's recent message to OAI-PMH data providers asking them to agree to a set of Terms & Conditions allowing OCLC to include data harvested via OAI-PMH in both free and toll services that OCLC provides. We do love our drama in the library community!

I agree with the predominant theme that this has all been handled very poorly, but I think the biggest problem lies somewhere else entirely. OCLC has set this whole system up completely backwards. OAI-PMH is a mechanism to share metadata widely, without having 1:1 agreements between data providers and service providers (harvesters). The entire point is to reduce the overhead of sharing. OCLC asking each data provider to check their status and preferences against OCLC's ideal is the wrong way 'round! The way this really should be done is with data providers making clear statements about what can and can't be done (per both copyright and license) with the metadata they're sharing. And, oh, look, OAI-PMH already lets data providers do that.

To be fair, there's lots of data provider software out there that doesn't support this optional part of the profile. Still others are using software that provides for this but they don't go to the effort to use it. My own repository doesn't have this mechanism in place. (Working on it, I promise!) But this really is the way it has to be for any kind of open data initiative to work. I as a data provider put my metadata (and content if I can!) up, make it clear what copyright terms apply and what license terms I place on its use, and let the sharing begin. The burden must be on the service provider (or harvester, OCLC/OAIster in this case) to determine if the use they want to put the data to conforms with my terms. Service providers should bear the load of managing multiple data providers - it's part of the work they have to do to set up the service. If they want the free stuff, they have to do the work to figure out if their efforts are kosher. OCLC must be responsible for protecting themselves from lawsuits stemming from their use of stuff they're not supposed to, rather than transferring that responsibility to us as data providers.
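To make the division of labor concrete, here is a minimal sketch of the harvester's side of that bargain. It assumes the data provider attaches a rights statement in each record's `about` container, following the optional OAI guidelines for conveying rights expressions (a `rights` element holding a `rightsReference` with a `ref` URL). The sample record, the `example.org` identifier, the allow-list, and the function names are all made up for illustration, and the policy of skipping records with no statement is just one possible choice.

```python
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
RIGHTS_NS = "http://www.openarchives.org/OAI/2.0/rights/"

# Hypothetical harvested record; a real one would come from a ListRecords response.
SAMPLE_RECORD = """\
<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:example.org:item-1</identifier>
    <datestamp>2009-09-23</datestamp>
  </header>
  <metadata/>
  <about>
    <rights xmlns="http://www.openarchives.org/OAI/2.0/rights/">
      <rightsReference ref="http://creativecommons.org/licenses/by/3.0/"/>
    </rights>
  </about>
</record>
"""

# Assumption: the licenses this particular service has decided it can live with.
ACCEPTABLE_LICENSES = {"http://creativecommons.org/licenses/by/3.0/"}


def rights_references(record_xml):
    """Return the rights URLs a data provider attached to one record."""
    root = ET.fromstring(record_xml)
    path = (
        f"{{{OAI_NS}}}about/"
        f"{{{RIGHTS_NS}}}rights/"
        f"{{{RIGHTS_NS}}}rightsReference"
    )
    return [element.get("ref") for element in root.findall(path)]


def may_reuse(record_xml, acceptable=ACCEPTABLE_LICENSES):
    """True only if the record states its rights and every statement is acceptable."""
    refs = rights_references(record_xml)
    return bool(refs) and all(ref in acceptable for ref in refs)


if __name__ == "__main__":
    print(rights_references(SAMPLE_RECORD))
    print(may_reuse(SAMPLE_RECORD))
```

The point of the sketch is where the check lives: the data provider states its terms once, and each service provider runs its own test against its own intended use.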

But I have to temper the other side of this too. I was a member of the group that developed this set of recommendations, urging data providers not to put undue restrictions over reuse of their metadata. I really believe this is the right way to go. Of course we as data providers are sometimes under legal (copyright, contract, etc.) constraints that limit what we can do with our metadata. We have to honor those agreements. But for the vast majority of our stuff, we can share without restriction if we choose to. Giving up control is part of sharing, and we have to learn to live with that. Blessing certain uses and banning others is a dangerous business, and one that doesn't mix very well with the open sharing of information libraries are all about. As the Creative Commons recently found, even "non-commercial use" isn't a very straightforward issue, so I don't think it serves us well to fall back on that old standby. Freedom is about taking the inevitable small amounts of bad with the overwhelming good, and I really do believe those principles apply to information sharing as well. Let's spend our efforts on sharing more and better information, and less on metering out what we do have.

Monday, July 27, 2009

Thoughts on FRSAD

I don't usually publish my individual comments on things sent out for review within our community, but I've decided to make an exception for the FRSAD report. I'm actively working with a FRBR implementation (and trying to take in as much of FRAD as we can), and anything I can do to help push FRSAD (FRSAR? what's in a name? ha - there's got to be a FRAD joke in there somewhere...) to be useful to the work I'm doing I see as a good thing. So here are the comments I sent in through official channels.

-----------------

In short, I think good work has been done here but it doesn't meet my needs as someone working diligently (and actively implementing FRBR and FRAD) to re-imagine discovery systems in libraries.

While I am a great believer in the power of user studies to inform metadata models, I believe inappropriate conclusions have been drawn here. It doesn't surprise me at all that users had trouble sorting actual subjects into categories such as concept, object, event, place. But that doesn't mean our models shouldn’t make that distinction. Users wouldn't be able to distinguish between Work/Expression/Manifestation/Item, either, but those are still useful entities for us to use underlying our systems.

The draft report rightly notes that the concept/object/event/place division is only one way of looking at it, and that other divisions, such as those outlined by Ranganathan and the framework (which seems to be basically abandoned?), are possible. But that's the very essence of a *model* - to pick one of many possible representations and go with it, in order to achieve a purpose. The fact that competing interpretations are possible is not a rationale for declining to select one that can advance the purpose of the model (even taken together with user studies showing users don’t gravitate to any one specific division). By choosing concept/object/event/place (or Ranganathan's model, or , or any other option) we can delve deeper into the modeling we need to do and provide a way forward for our discovery systems. By refusing to do so, we don't advance our case the way we must.

The thema/nomen structure outlined here is very useful. However, I believe strongly the report should not stop here. Going further is often stated here as "implementation dependent" but I think there is a great deal of room for the conceptual model to grow without venturing into actual implementations. Certainly FRBR and FRAD take that approach.

In general, the thema/nomen structure could apply to any attribute or relationship under vocabulary control. There is great (and unfortunately here unexplored) potential for this model to apply beyond simply aboutness. Limiting it in this way I believe is a disservice to those of us who are attempting to use these models to reinvent discovery systems.
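As a rough illustration of that point, here is a small sketch in which a thema (the thing itself) carries any number of nomens (the labels it is known by), and a work can point at themas through more than one kind of relationship. The class names, the relationship keys ("subject", "genre"), the scheme names, and the sample values are all assumptions made for the example, not anything defined in the report.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Nomen:
    """A sign or label by which a thema is known, within some scheme."""
    value: str
    scheme: str


@dataclass
class Thema:
    """A controlled entity, independent of any one label for it."""
    identifier: str
    nomens: List[Nomen] = field(default_factory=list)

    def labels(self, scheme: Optional[str] = None) -> List[str]:
        """Return all labels, or only those from one scheme."""
        return [n.value for n in self.nomens if scheme is None or n.scheme == scheme]


@dataclass
class Work:
    """A work linked to themas through named, vocabulary-controlled relationships."""
    title: str
    controlled: Dict[str, List[Thema]] = field(default_factory=dict)

    def labels_for(self, relationship: str, scheme: Optional[str] = None) -> List[str]:
        """Collect labels for every thema attached via one relationship."""
        themas = self.controlled.get(relationship, [])
        return [label for thema in themas for label in thema.labels(scheme)]


# Hypothetical data: the same structure serves aboutness and a non-subject attribute.
ale_glass = Thema("thema:1", [Nomen("ale glasses", "scheme-a"), Nomen("beer glasses", "scheme-b")])
handbook = Thema("thema:2", [Nomen("handbooks", "scheme-a")])
work = Work("A guide to drinking vessels", {"subject": [ale_glass], "genre": [handbook]})

if __name__ == "__main__":
    print(work.labels_for("subject"))
    print(work.labels_for("genre", scheme="scheme-a"))
```

Nothing in the structure cares whether the relationship is "subject" or something else, which is the potential I would like to see the report explore.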

I'm concerned about the significant lack of cohesion between the FRBR, FRAD, and FRSAD reports. They show their nature as documents independently generated by different groups with different interests over a long span of time. This limitation definitely needs to be overcome if these reports are to be useful as a whole for the community. Each could be used on its own, but we need a more coherent group. In fact, the thema/nomen structure in the FRSAD draft isn't really all that different than the (whatever entity)/name structure presented in FRAD. Much greater cohesion among the three reports could be achieved - what's written here seems to ignore FRAD in particular. I believe this is a missed opportunity. I think the most significant mismatch between the three reports is where they draw the boundary for how far a "conceptual model" should go.

On a higher level note, the report reads more as an academic paper outlining alternative options rather than providing a straightforward definition of the conceptual model. I respect the background work done here, and believe it needs to be done. There's a lot of room for papers like that in this environment; however, this report series needs to serve practitioners better and stick closer to the model.

On a more practical note, in the report the Getty AAT is often referred to by example. Yet most of the facets in the AAT bring out the "isness" (which in the introduction is explicitly described as out of scope) rather than "ofness" or "aboutness". For example, on p. 45, #7 under "select," "ale glass" in AAT is intended to be used for works of art that ARE ale glasses, not works (presumably textual) that are ABOUT ale glasses. This internal inconsistency is a serious flaw in the report.

I'm certainly not one to promote precoordinated vocabularies, but they exist in library metadata and we must deal with them. It's unclear to me from this report how these fit into the model proposed.

Sunday, May 03, 2009

DLF Aquifer Metadata Working Group "Lessons Learned" report available

That moment when a long-term project comes to an end is always simultaneously filled with relief and sadness. Relief in that new opportunities can be embraced and a pretty package placed around what was accomplished, with appropriate rationales for what didn't make its way into the package. Sadness in that productive and creative working relationships come to a close or change, and that there is always more to be done that cannot for practical reasons be embarked upon at this time.

The Digital Library Federation's Aquifer initiative wrapped up this spring, and it's causing me to experience that moment of relief and sadness. (Well, to be honest, several moments!) I've been involved with Aquifer from the beginning, and during that time my relationship with it evolved from skepticism to "just jump in and see what you can do" to "bite off one reasonable chunk of a problem and do your best to make this chunk work with other chunks." A report the Metadata Working Group just released, "Advancing the State of the Art in Distributed Digital Libraries: Accomplishments of and Lessons Learned from the Digital Library Federation Aquifer Metadata Working Group," reflects that last approach, attempting to place our work in an ever-evolving context. There is much more that could have been done, and the limitations and benefits of a volunteer committee to do work like this are more evident to me now than ever. Nevertheless, I'm proud of the work this group did. Congratulations to all involved on successfully navigating through our many tasks.

The message I sent out about this report to various listservs included the following "thank you":
The Aquifer Metadata Working Group would like to thank all who have been involved with the initiative, including current and past Working Group members; the Aquifer American Social History Online project team; participants in ground-breaking precursor activities such as the DLF/NSDL OAI-PMH Best Practices; individuals and institutions who tested, implemented, and provided feedback on the Metadata Working Group's MODS Guidelines and other work products; and of course DLF for its ongoing support. It's been a wild, educational, and wholly enjoyable ride!
I can't state with enough gratitude the role the community has played in what the Aquifer Metadata Working Group was able to accomplish. I like to talk with those thinking of entering the digital library field about just how much of our work is figuring it out as you go - we're constantly refining models to apply to new types of material and take advantage of new technologies. My absolute favorite part about working in this area is navigating the tricky path of effectively building on previous work while pushing the envelope at the same time. I hope the Aquifer Metadata Working Group's contributions continue to be useful as building blocks for a long time to come.

Thursday, March 05, 2009

Must Watch! Michael Edson: "Web Tech Guy and Angry Staff Person"

I heard Michael Edson (Director of Web and New Media Strategy for the Smithsonian) speak at the IMLS WebWise conference last week. He delivered an astonishingly good talk centering around an animation entitled "Web Tech Guy and Angry Staff Person." It's a riot, and the animation sets a lighthearted attitude that reinforces his disclaimer that he's not poking fun or diminishing the very real tensions cultural heritage institutions face as our communication, collection, and even the dreaded B-word (business!) models change underneath us. Instead, I believe it's effective in using exaggeration to highlight some underlying issues and think intelligently about what it takes to say we CAN do something rather than taking the easy road and saying no. We can't just dismiss the challenges - understanding them will help us address them.

Sunday, March 01, 2009

Google vs. Semantic Web

On a number of fronts recently I've been thinking a bunch about RDF, the DCMI Abstract Model, and the Semantic Web, all with an eye towards understanding these things more than I have in the past. I think I've made some progress, although I can't claim to fully grok any of these yet. One thing does occur to me, although it's probably a gross oversimplification. The difference in the Semantic Web/RDF approach from the, say, Google approach is this: is the robustness in the data or is it in the system?

The Semantic Web (et al) would like the data to be self-explanatory: to say explicitly what it is describing, with explicit reference to all the properties used in the description. At the opposite end of the spectrum are systems like Google, which assume some kind of intelligence went into the creation of the data but don't expect the data itself to explicitly manifest it. The approach of these systems is to reverse engineer that data, getting at the human intelligence that created it in the first place.
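A toy sketch of the contrast, which is probably as much of an oversimplification as the rest of this post. The first half stores explicit subject/property/value statements using Dublin Core term URIs; the second half guesses the same facts from a bare string with a pattern. The `example.org` identifier, the sample strings, and the "title by creator" heuristic are all invented for illustration and say nothing about how Google actually works.

```python
import re

DCT_TITLE = "http://purl.org/dc/terms/title"
DCT_CREATOR = "http://purl.org/dc/terms/creator"

# Self-describing data: every statement names its subject and its property.
TRIPLES = [
    ("http://example.org/book/1", DCT_TITLE, "Death by Chocolate"),
]


def explicit_value(triples, subject, prop):
    """Look up a value that the data states outright; never guess."""
    for s, p, o in triples:
        if s == subject and p == prop:
            return o
    return None


# Smart system: assume a human wrote "<title> by <creator>" and reverse engineer it.
BY_PATTERN = re.compile(r"^(?P<title>.+) by (?P<creator>.+)$")


def inferred_description(text):
    """Guess title and creator from free text; may be confidently wrong."""
    match = BY_PATTERN.match(text)
    if match is None:
        return {"title": text, "creator": None}
    return {"title": match.group("title"), "creator": match.group("creator")}


if __name__ == "__main__":
    # The explicit data has a title and simply has no creator statement.
    print(explicit_value(TRIPLES, "http://example.org/book/1", DCT_TITLE))
    print(explicit_value(TRIPLES, "http://example.org/book/1", DCT_CREATOR))
    # The heuristic gets an easy case right and misreads a title containing " by ".
    print(inferred_description("Moby-Dick by Herman Melville"))
    print(inferred_description("Death by Chocolate"))
```

The explicit version costs more to produce but returns nothing rather than a wrong answer; the heuristic costs nothing up front and is right only as often as its assumptions hold.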

The difference is one of who is expected to do the work - the system encoding the data in the first place (Semantic Web approach) or the system decoding the data for use in a specific application. Both obviously present challenges, and it's not clear to me at this point which will "win." Maybe the "good enough and a person can go the last bit" approach really is appropriate - no system can be perfect! Or maybe as information systems evolve our standards for the performance of these systems will be raised to a degree where self-describing data is demanded. As a moderate, I guess I think both will probably be necessary for different uses. But which way will the library community go? Can we afford to have feet in both camps into the future?