Friday, April 28, 2006

Thesauri and controlled vocabularies

I had a very interesting conversation recently with two colleagues about the differences between thesauri and controlled vocabularies. Both of these colleagues are developers who work in my department. One is finishing up a Ph.D. in Computer Science, is currently in charge of system design for a major initiative of ours, and has a knack for seeing all the aspects of a problem before finding the right solution; the other is a database guru with whom I've collaborated on some very interesting research and who has just started pursuing an M.L.S. to add to his already considerable expertise. I like and respect both of these individuals a great deal.

The interesting conversation began when the database-guru-and-soon-to-be-librarian (DGASTBL) (geez, that's not any better, is it?) asked if the terms "controlled vocabulary" and "thesaurus" are used interchangeably in the library world. He asked because from our previous work and a solid basis in these concepts he knew they really aren't the same thing, yet he had seen them used in print in ways that didn't match his (correct) understanding. The high-level system diagram we had at the time had a box for "vocabulary" which was intended to handle thesaurus lookups for the system. We discussed how a more precise representation of that diagram would have an outer box for "vocabulary" to handle things like name authority files and subject vocabularies with lead-in terms but no other relationships, and an inner box for "thesauri" (as a subset of controlled vocabularies) with full syndetic structures that the system could make use of. We lamented that the required outer label in this scenario of "controlled vocabulary" isn't as sexy as its subset "thesauri." The latter sounds a great deal more interesting when describing a system to those not involved in developing it.
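For the programmers reading along, here is a rough sketch of that outer-box/inner-box picture in Python. It is only an illustration of the distinction, not how our system is actually built: the class names, method names, and sample terms are all made up for this post. The outer box knows only preferred terms and lead-in terms; the inner box is a subclass that adds broader/narrower and related-term links.

```python
from dataclasses import dataclass, field


@dataclass
class ControlledVocabulary:
    """Outer box: preferred terms plus lead-in (USE) references, nothing more."""
    name: str
    # Maps a lowercased preferred or lead-in term to its preferred form.
    index: dict = field(default_factory=dict)

    def add_term(self, preferred, lead_ins=()):
        self.index[preferred.lower()] = preferred
        for variant in lead_ins:
            self.index[variant.lower()] = preferred

    def resolve(self, term):
        """Return the preferred form of a term, or None if it is unknown."""
        return self.index.get(term.lower())


@dataclass
class Thesaurus(ControlledVocabulary):
    """Inner box: a controlled vocabulary that also carries BT/NT/RT links."""
    broader: dict = field(default_factory=dict)   # term -> set of broader terms
    narrower: dict = field(default_factory=dict)  # term -> set of narrower terms
    related: dict = field(default_factory=dict)   # term -> set of related terms

    def add_broader(self, term, broader_term):
        # BT and NT are reciprocal, so record both directions at once.
        self.broader.setdefault(term, set()).add(broader_term)
        self.narrower.setdefault(broader_term, set()).add(term)

    def add_related(self, term, other):
        # RT is symmetric.
        self.related.setdefault(term, set()).add(other)
        self.related.setdefault(other, set()).add(term)

    def narrower_terms(self, term):
        return set(self.narrower.get(term, set()))


# Hypothetical sample data, invented for illustration.
names = ControlledVocabulary("name authority file")
names.add_term("Twain, Mark", lead_ins=["Clemens, Samuel Langhorne"])

subjects = Thesaurus("subject thesaurus")
subjects.add_term("Mammals")
subjects.add_term("Dogs", lead_ins=["Canines"])
subjects.add_broader("Dogs", "Mammals")
```

Because the thesaurus is a subclass here, anything that only needs lead-in lookups can treat the two the same way, which is the librarian's subset view in code form.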

The system designer then presented a different perspective on the issue. While the librarian types considered thesauri a subset of controlled vocabularies (perhaps partly for historical reasons - we've been using loosely controlled vocabularies a lot longer than true thesauri), the system designer viewed the situation as the opposite - that a controlled vocabulary is a specific type of thesaurus using only one type of relationship (the synonym), or perhaps also some rudimentary broader/narrower relationships that don't qualify it as a true thesaurus (think LCSH). I found the difference in point of view interesting - that the C.S. perspective expected a completely structured approach to the vocabulary problem, and the library perspective represented an evolving view that has never quite gotten to the point where we can make robust use of this data in our systems. It struck me that the system designer's perspective in this conversation was overly optimistic as to the state of controlled vocabularies in libraries.

Yet there's light at the end of this particular tunnel. Production systems in digital libraries are starting to emerge that make good use of controlled vocabularies in search systems, rather than relying on users to consult external vocabulary resources before searching. Libraries have not taken advantage of the revolution in search systems shifting many functions from the user to the system (think spell-checking), to our supreme discredit. Making better use of these vocabularies and thesauri is one way of shifting this burden. I hope this integration of vocabularies into search systems will push the development of these vocabularies further and make them more useful as system tools rather than just cataloger tools. By providing search systems that can integrate this structured metadata, we can improve discovery in ways not currently provided by either library catalogs or mainstream search engines.
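To make the idea of shifting the burden from the user to the system a bit more concrete, here is a minimal sketch of query expansion in Python. Everything in it is an assumption for illustration: the function name `expand_query`, the two lookup tables, and the sample headings are invented and not taken from any real vocabulary or production system. The system maps whatever the user typed to the preferred heading, then adds every narrower heading so the search covers them too.

```python
from collections import deque

# Hypothetical lead-in (USE) references: what the user types -> preferred heading.
USE_REFS = {
    "movies": "Motion pictures",
    "films": "Motion pictures",
}

# Hypothetical narrower-term links: heading -> list of narrower headings.
NARROWER = {
    "Motion pictures": ["Documentary films", "Silent films"],
    "Documentary films": ["Nature films"],
}


def expand_query(term, use_refs=USE_REFS, narrower=NARROWER):
    """Return the preferred heading for a term followed by all narrower headings."""
    # Step 1: swap a lead-in term for its preferred heading, if we know one.
    preferred = use_refs.get(term.lower(), term)

    # Step 2: walk the narrower-term links breadth-first, skipping repeats.
    expanded = [preferred]
    seen = {preferred}
    queue = deque([preferred])
    while queue:
        current = queue.popleft()
        for child in narrower.get(current, []):
            if child not in seen:
                seen.add(child)
                expanded.append(child)
                queue.append(child)
    return expanded


print(expand_query("movies"))
# ['Motion pictures', 'Documentary films', 'Silent films', 'Nature films']
```

The user never has to consult the vocabulary; a term the tables don't recognize simply passes through unchanged, so the search degrades to an ordinary keyword query.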

1 comment:

Anonymous said...

Excellent entry, thanks. As a former systems designer and recent MLIS grad (and soon to be systems librarian) this helped clarify some of my own internal confusion over what's going on with this stuff, and over why the library world doesn't think about this stuff in quite the way I did before I was librarian-indoctrinated.

I agree, if that's what you're saying, that it is imperative that the library world figure out how to maintain and make better use of more fully featured controlled vocabularies. (Of course, I've got the CS background too.) But this is not nearly a unanimous opinion. Much of the library world seems to think that sophisticated controlled vocabularies are a thing of the PAST, not the FUTURE, based on Amazon and Google or whatever. (Ironically, I think Amazon and Google will only start making MORE use of increasingly sophisticated controlled vocabularies of various sorts. The first steps of this are already being seen.)

-Jonathan