MikeSimpsonThoughtsDuringConference

From DSpace Wiki


RANDOM CHRONOLOGICALLY-SORTED THOUGHTS OCCURRING DURING THE 2004 USER CONFERENCE (as refactored at 20,000 feet, courtesy of Midwest Airlines)

Metadata should probably have a separate authorization from the content. I.e. you should be able to set a switch that says, "Show the metadata for this object, but not the object itself."

Another kind of record that should be attachable to a digital object is a statistical ("usage") record -- this is different from regular metadata in the sense that it should be composed of counters that automatically update when triggered by API actions. So when the interface calls Item.display() (or whatever) the "display count" attribute gets autoincremented. These counters should be implemented at the lowest level possible (like the UNIX kernel counters for I/O, virtual memory, etc.) so that all sorts of statistical analysis can be built on top of them.

Should the statistical functionality wind up as a separate module ("the Statistics API") or be part of the APIs of all modules (everything has a getCounters() call or something like that)? Actually, I think you could do all this with extensive logging plus a log analysis toolkit (e.g. Apache+Analog). The question is where you want to put the overhead.

Here's a concept and a vocabulary term hanging on it: "viewpoints", which are retrieval interfaces customized for specific data types (Text, Image, Audio, Video, Export). This is sort of Model-View-Controller, in that we split display from data. I suppose you should be able to request the Image viewpoint on a textual bitstream, but I'm not sure what it should return. Should viewpoints attach only to certain levels in the object scheme, i.e. do communities need to have viewpoints?

All objects are container objects. Even a "Bitstream" is really an abstraction of bytes plus extra information.

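
The usage-counter idea above (counters bumped at the lowest level whenever an API action fires) might look roughly like the sketch below. Everything in it is hypothetical: the Item class, the _count() helper and the get_counters() name are stand-ins for whatever the real API would call them, not existing DSpace code.

```python
from collections import Counter


class Item:
    """Hypothetical digital object that keeps its own usage counters."""

    def __init__(self, name, content):
        self.name = name
        self._content = content
        self._counters = Counter()  # one counter per API action name

    def _count(self, action):
        # Incremented inside the object itself, so every caller is counted
        # no matter which interface triggered the action.
        self._counters[action] += 1

    def display(self):
        self._count("display")
        return self._content

    def get_counters(self):
        # Hand back a copy so callers cannot alter the live counters.
        return dict(self._counters)


item = Item("foo", b"hello")
item.display()
item.display()
print(item.get_counters())
```

Keeping the counters inside the object is the "part of every module's API" option; the logging-plus-analysis option would replace _count() with a log write and move the arithmetic into the analysis toolkit.
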
All objects should be loosely-coupled to their containers -- maybe objects exist as non-hierarchical "pools" (bitstream pool, item pool, collection pool, community pool) and then we express hierarchy as arbitrary linkages defined between pools -- this is like a filesystem abstraction sitting on top of a physical storage device, where branches (directories) and leaves (files) are structure imposed on randomly-distributed disk sectors.

Retrieval interfaces should respond to (at least) two types of identifiers: something a bit more handle-ish (the "canon path") and something that reflects a (human-readable) pathway to the object through a specific hierarchy of communities/collections/etc. (the "label path"). There should be one unique "canon path" for each object, but many possible "label paths" (compare to symlinks in a filesystem). Any object should be able to be aliased into the next-higher-level set of containers: this action would create a new "label path" but not another "canon path".

Maybe the original label path (the one created upon item submission?) should be privileged somehow -- the owner of that path can grant/create other paths, but no one else. Or, there's an owner of the "canon path", who can then grant/create one or more label paths with various authorization parameters upon request. I.e. an interface to send a request: "I see you are the owner of this item; I'd like to include it in my collection as well, and I'd like it to be publicly-accessible along the new label path." And then one for the reply: "The new label path has been created, and I've put this set of authorizations on it." Instead of "label paths", they could be called "alias paths".

Any object should be able to have metadata records ("record objects?") attached, of various types (e.g. DC record, METS record, MARC record; but also "Collection" metadata, "Community" metadata, i.e.
the information that is currently pulled off into the collection and community tables in PostgreSQL; this should be generalized and turned into a metadata record just like any other metadata). Note that being able to attach multiple records of the same schema type fixes the language issue (English DC, French DC, etc.).

Metadata schemas themselves should be loadable/unloadable based on an XMLish definition file; actually any kind of registry-type information should exist in canon form in XML, which is then parsed by the loader process and turned into the appropriate internal commands (e.g. ANSI SQL) to create the necessary data structures in the metadata store.

Authentication creates a session object with various "attributes" attached to it; attributes are the keys that provide authorization decisions during retrieval. Authorization should occur for retrieval of any object (communities down to bitstreams). An "authorization path" (what I called an "alias path", above) might define the sequence of authorizations that must be passed successfully to do retrieval. That does mean that different authorization paths could exist for a single object. Maybe the final arbiter is the canon path, i.e. authorization parameters set on the canon path are checked last, and override all other parameters set for the other authorization paths.

I'm thinking that indexing/search functionality really has no business in DSpace, which is about archiving, browsing and retrieval. Index and search should really exist as a separate application that lives above the DSpace layer. The browse/retrieve API could of course extend useful functionality up to that service layer (e.g. OAI-PMH) but indexing and searching directly inside DSpace will always be a secondary function at best.

An absolute baseline definition of "digital object": "a stream of bytes representing a discrete chunk of intellectual property."

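
The canon path / authorization path scheme above can be sketched in a few lines. This is only one reading of the notes: the Repository class, its method names, the "canon:0001" identifier and the attribute strings are all invented for illustration, and "the canon path overrides" is interpreted here as a final veto checked after every step on the alias path has passed.

```python
class Repository:
    """Hypothetical store mapping many alias paths onto one canon path."""

    def __init__(self):
        self._canon = {}   # canon path -> attribute required on the object
        self._labels = {}  # alias path -> (canon path, attribute per step)

    def add_object(self, canon_path, required_attr):
        self._canon[canon_path] = required_attr

    def add_label_path(self, label_path, canon_path, step_attrs):
        # An alias path may only point at an object that already exists.
        if canon_path not in self._canon:
            raise KeyError(canon_path)
        self._labels[label_path] = (canon_path, list(step_attrs))

    def authorize(self, label_path, session_attrs):
        canon_path, step_attrs = self._labels[label_path]
        # Every authorization along the alias path must be passed in order.
        for attr in step_attrs:
            if attr not in session_attrs:
                return False
        # Assumption: the canon path is checked last and has the final say.
        return self._canon[canon_path] in session_attrs


repo = Repository()
repo.add_object("canon:0001", "reader")
repo.add_label_path("/bat/bar/foo", "canon:0001", ["bat-member", "bar-member"])
repo.add_label_path("/public/foo", "canon:0001", [])

session_attrs = {"reader"}  # attributes attached to the session at login
print(repo.authorize("/public/foo", session_attrs))
print(repo.authorize("/bat/bar/foo", session_attrs))
```

The same object is reachable along one path and refused along the other, which is the "different authorization paths for a single object" situation described above.
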
It would be nice if each defined MetadataSchema object could automagically imply an OAICAT crosswalk, if the appropriate XML mapping descriptor is available. Which is to say, there could be a "crosswalks" directory with XML files inside describing various mappings, and DSpace on startup would populate the OAI-PMH interface with the appropriate crosswalks based on the XML files that it found.

Retrieval limits should be able to be specified in terms of number of records and/or size of delivered content (e.g. "give me ten records, or 100 MB, whichever is less.").

Persistent naming structures (handles, PURLs, ARKs) should be conceived as plugin views for the exposure and retrieval of repository content. DSpace should always maintain an internal canon identifier that can be algorithmically transformed into any of the other identifier types and exposed to harvesters et al. on demand.

Random metaphor: DSpace is a hammer; we haven't even started building the cabinets yet.

More vocabulary possibilities: the canon path could be called the "identification path", vs. the alias path(s) which are "authorization path(s)" to the object in question.

Another type of record that it should be possible to attach: a "licensing record".

Our container objects (bitstreams up to communities, or maybe even instances of DSpace) are starting to look like convenient abstractions composed of a persistent identifier that gives us a hook upon which to hang metadata. Which matches perfectly the DSpace fundamental questions: "Where is it?" and "What is it?"

Random metaphor: individual users produce archipelagoes of knowledge.

Interfaces extended to the service layer by various modules should probably express themselves (or be able to be expressed) as XML. The service layer could then search for appropriately named/identified XSLT transforms and use them for display. I.e.
when you do a "retrieve" action on item "foo", using the authorization path of collection "bar" and community "bat", the service layer grabs the appropriate XML, and then applies transforms for "/bat/bar/foo", if available.

Plugin modules (a la Apache) are just a fabulous design (Robert Tansley's presentation). It would be EVEN BETTER if the modules were stackable, i.e. for authorization, several different modules may all be registered at startup as "interested parties" (providing the requisite API(s)). Then a call to Item.authorize() is really a set of calls falling through the stack of registered modules. Each module has a chance to either handle the call itself or decline the call, which then falls through to the next module in the stack. Apache does something almost exactly like this, I believe.

Toolkits become services through the application of policies and procedures for use. Don't get confused and start burying policies and procedures inside your toolkit code.

The question is never, "What can I do with Apache?" The answer is always, "Yes, I can use Apache to do that." DSpace should strive for something similar.

It would be nice to consider writing "mod_dspace" for Apache, which would provide a nice interface layer down into the DSpace services. This ties Apache's down-growing roots into DSpace's up-welling springs, and lets us stick a pin in the content and answer all three questions: "Where is it? What is it? How do I get it?"
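
The stackable-module idea above (each registered module either handles the authorize call or declines and lets it fall through) could be sketched like this. All names here are made up for illustration: ModuleStack, the three example modules and the DECLINED sentinel are not DSpace or Apache code, and denying when every module declines is an assumption of this sketch.

```python
DECLINED = object()  # sentinel a module returns to pass the call along


class ModuleStack:
    """Hypothetical stack of authorization modules registered at startup."""

    def __init__(self):
        self._modules = []

    def register(self, module):
        self._modules.append(module)

    def authorize(self, item, session_attrs):
        # Fall through the stack until some module handles the call.
        for module in self._modules:
            result = module(item, session_attrs)
            if result is not DECLINED:
                return result
        return False  # assumption: deny if nobody claimed the call


def admin_module(item, session_attrs):
    return True if "admin" in session_attrs else DECLINED


def embargo_module(item, session_attrs):
    return False if item.get("embargoed") else DECLINED


def public_module(item, session_attrs):
    return True if item.get("public") else DECLINED


stack = ModuleStack()
for module in (admin_module, embargo_module, public_module):
    stack.register(module)

print(stack.authorize({"public": True}, set()))
print(stack.authorize({"public": True, "embargoed": True}, set()))
```

Registration order is the policy here: admins are checked before the embargo, and the embargo before the public flag, so reordering the stack changes the outcome without touching any module.
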
