About Data Formats
From DSpace Wiki
This page gives background information and describes the issues that are addressed in the BitstreamFormat Renovation project page. It is also a manifesto of sorts that presents some of the reasoning behind the new format architecture.
About Data Formats
As we discuss it here, a "data format" is the description of how intellectual content (i.e. the abstract meaning) is encoded in a digital byte-stream to form what we call a digital object. The format description is technical metadata, since it describes how software, for example, would interpret the contents of a Bitstream to deliver the intellectual content.
In Automatic Format Identification using PRONOM and DROID, Adrian Brown defines a "data format" succinctly as:
The internal structure and encoding of a digital object, which allows it to be processed, or to be rendered in human-accessible form.
Why Data Formats Matter
The purpose of a digital archive or repository like DSpace is to find and deliver digital objects for users. However, the real goal of the users is to get at the intellectual content encoded in those digital objects. Providing a dissemination format that can readily be interpreted by common desktop applications and accurately naming it is an integral part of the job. If you doubt that, try changing the BitstreamFormats of an Item's contents to Unknown and see how useful it is.
Now that we have established that some knowledge of data formats and technical metadata is necessary to fulfill DSpace's mission, we'll break down the various uses of technical metadata, and show how much there is to know about data formats.
Not all uses of formats are relevant to every DSpace installation. For example, preservation of digital objects has specific demands of formats, but not every DSpace administrator is concerned with preservation. However, the architecture has to support preservation activities for those users that need them (and others who may be surprised at how relevant preservation is to them after all).
What Goes Into a Data Format
Typically, the description of a format may include:
- A descriptive name, e.g. "OpenDocument Format"
- Identifiers, including:
- Rigorous definition of the format's encoding, and documentation showing how to interpret it.
- References to other formats upon which it is built, e.g. XML depends on the Unicode character set and an encoding such as UTF-8.
- Relationships to other formats as subtypes, families, etc.
- References to application code that generates or interprets the format.
- Classification descriptive metadata about the nature of the intellectual content or encoding.
Not all of this metadata about formats is relevant to the mission of DSpace. And, fortunately, there are public format registries such as PRONOM which are already collecting and maintaining it.
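As a rough illustration, the elements listed above could be gathered into a single record. The sketch below is hypothetical: the class name `FormatDescription`, its field names, and the sample identifier value are assumptions made for this example, not part of DSpace or of any registry's schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FormatDescription:
    """Hypothetical record holding the technical metadata of one data format."""
    name: str                                                  # descriptive name
    identifiers: Dict[str, str] = field(default_factory=dict)  # scheme -> identifier
    specification: Optional[str] = None                        # pointer to the rigorous definition
    depends_on: List[str] = field(default_factory=list)        # formats it is built upon
    parent: Optional[str] = None                               # broader family or supertype
    applications: List[str] = field(default_factory=list)      # code that generates or interprets it
    classification: List[str] = field(default_factory=list)    # nature of content or encoding


# Sample entry; the identifier is a placeholder, not a real registry ID.
odf = FormatDescription(
    name="OpenDocument Format",
    identifiers={"example-scheme": "example:odf"},
    depends_on=["ZIP", "XML"],
    classification=["office document"],
)
```

Keeping identifiers as a scheme-to-value mapping lets one record carry several names for the same format without privileging any of them.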
DSpace Use of Data Formats
What does DSpace do with formats, and what should it be doing? Each kind of use has its own requirements of the format technical metadata:
Dissemination
When disseminating a Bitstream, e.g. through a Web-based UI, DSpace needs a MIME type as the format technical metadata. The only problem is that MIME types are poorly standardized, so there may be several valid identifiers for the same format. The recipient of the dissemination can only handle certain MIME types, so if DSpace gives it an unfamiliar one, it will not render the content correctly.
To correct problems of patrons' browsers not recognizing the MIME types that DSpace sends, it may be more practical to adjust DSpace than the browser. This is already possible (and has been done) in DSpace 1.4.
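One way to picture such an adjustment is a small alias table consulted just before the Content-Type header is written. This is a sketch only: the table, the function name `dissemination_mime_type`, and the choice of which alias is "preferred" are assumptions for illustration and do not reflect how DSpace 1.4 is actually configured.

```python
# Hypothetical alias table: MIME type as stored -> MIME type the site prefers to send.
PREFERRED_MIME = {
    "application/iges": "model/iges",
}


def dissemination_mime_type(stored: str) -> str:
    """Return the MIME type to send, normalizing case and known aliases."""
    key = stored.strip().lower()
    # Unknown types pass through unchanged (apart from case normalization).
    return PREFERRED_MIME.get(key, key)
```

A site whose patrons' browsers only recognize one spelling of a type would list the other spellings as keys, leaving the stored metadata untouched.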
Search
Does anyone search on actual Bitstream formats? The DC type element is similar to a format but not quite the same. Has anyone configured DSpace to put the types present in member Bitstreams into a search index for Items (or wanted to)?
The search criteria would probably be formats at a very coarse-grained level, e.g. "image", "audio", "text".
Locating Applications
DSpace uses format metadata to locate applications relevant to a Bitstream. For example, the MediaFilter mechanism processes a Bitstream by getting its format, and looking for filters that accept that format as their input.
This depends on the format of the Bitstream being accurately and completely identified. If a Bitstream's format was set to the wrong one or is completely unknown, it will not be processed.
There must also be a way to describe the formats each application accepts. This requires describing ranges of acceptable formats in the DSpace configuration. If thousands of formats are available from a registry, the configuration ought to offer some simpler means of describing the range of formats accepted by an application, ideally without listing every individual format. For example, if formats are modeled in a hierarchy or by family groups, perhaps a range of fine-grained formats can be indicated by naming its superclass or parent.
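The idea of naming a parent to stand for a range of fine-grained formats can be sketched as follows. Everything here is assumed for the example: the `PARENT` table, the format names in it, and the functions `ancestors_and_self` and `accepts` are hypothetical and are not the MediaFilter configuration.

```python
from typing import Dict, Optional, Set

# Hypothetical format hierarchy: each format names its parent (None for a root).
PARENT: Dict[str, Optional[str]] = {
    "PDF": None,
    "PDF 1.2": "PDF",
    "PDF 1.4": "PDF",
    "PostScript": None,
}


def ancestors_and_self(fmt: str) -> Set[str]:
    """Collect a format together with every broader format above it."""
    found: Set[str] = set()
    current: Optional[str] = fmt
    # The membership check guards against accidental cycles in the table.
    while current is not None and current not in found:
        found.add(current)
        current = PARENT.get(current)
    return found


def accepts(declared: Set[str], fmt: str) -> bool:
    """True if the application declared the format itself or any of its ancestors."""
    return bool(declared & ancestors_and_self(fmt))
```

A filter that declares only "PDF" would then accept a Bitstream identified as "PDF 1.2", while a filter that declares "PDF 1.4" would not.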
Preservation
Data formats are also critical to all digital preservation operations:
Validation
To validate whether a Bitstream conforms to its identified format, it is first necessary to know its format in precise (fine-grained) detail. For example, if its format has distinct versions, the particular version (e.g. "PDF 1.2") must be identified, so the validation is meaningful.
Obsolete Format Detection
Detecting obsolete formats also requires fine-grained format identification, because some versions within a family of formats will usually become obsolete before the rest, so formats must be identified down to the version.
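The consequence of coarse identification can be shown with a small sketch. The data is invented for illustration: the version list, the set flagged as obsolete, and the function `obsolescence_status` are all hypothetical and make no claim about which real formats are at risk.

```python
from typing import Dict, Set

# Hypothetical data: versions known for each coarse family, and the subset
# that has been flagged as obsolete. Invented purely for illustration.
FAMILY_VERSIONS: Dict[str, Set[str]] = {"PDF": {"PDF 1.0", "PDF 1.2", "PDF 1.4"}}
OBSOLETE: Set[str] = {"PDF 1.0"}


def obsolescence_status(identified: str) -> str:
    """Classify an identified format as 'obsolete', 'current', or 'undetermined'."""
    if identified in OBSOLETE:
        return "obsolete"
    versions = FAMILY_VERSIONS.get(identified)
    if versions is not None and versions & OBSOLETE:
        # Only the family is known, and at least one of its versions is obsolete.
        return "undetermined"
    return "current"
```

A Bitstream labelled only "PDF" lands in the "undetermined" bucket, which is exactly the case that fine-grained identification is meant to eliminate.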
Migration
Migration is very similar to the existing MediaFilter application, since it is done by a collection of filters that translate obsolete or unpreservable formats into formats more suited to digital preservation.
A fine-grained identification of formats is necessary so we can choose only the Bitstreams in immediate danger of obsolescence, but then there has to be a way to match that specific format against a possibly-coarser-grained declaration of the formats accepted by a migration tool.
Why the DSpace Format Model is Inadequate
The data model and implementation of BitstreamFormat was originally intended as a placeholder to be supplanted by an externally-developed format registry such as the GDFR. Unfortunately, progress in the field of format technical metadata has been slow, and since no definitive format registry came on the scene, there was no obvious need to revisit the original design decision.
There are some external format registries available now, however, and a prototype implementation of the GDFR itself is under active development. Since the FACADE project is dedicated to improving DSpace's digital preservation capabilities, resources are also available to upgrade DSpace's use of data formats.
Here are some particular issues to be addressed, explained in greater detail below:
- BitstreamFormat short names are meaningless (as standard identifiers) outside of DSpace.
- Difficult to use external preservation tools without common format names.
- No fine-grained format representation.
- Lack of technical metadata useful in preservation (format documentation, links to tools, etc.)
- Cannot leverage format work done for other format registries.
Granularity of Data Format Technical Metadata
The granularity of a data format definition refers to how broad or narrow its concept of the format is. A fine-grained format description is limited to a particular indivisible version or subset of a format, e.g. "PDF 1.2". A coarse-grained format encompasses an entire family of formats, e.g. "PDF" (all versions and subsets), or "PostScript".
DSpace 1.4 includes descriptions of fewer than 40 coarse-grained formats. Also, its method of identifying formats is crude and extremely vulnerable to error and failure, since it depends on recognizing filename extensions of uploaded content files.
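To make the contrast concrete, the sketch below compares a guess based on the filename extension with one based on the leading bytes of the content. It is an illustration under stated assumptions: the function names and the tiny extension table are hypothetical, only the PDF header (`%PDF-` followed by a version number) is recognized, and a real identification tool would need many more signatures and checks.

```python
import os
import re
from typing import Optional

# Hypothetical extension table, standing in for an extension-based registry.
_EXTENSIONS = {".pdf": "PDF", ".ps": "PostScript"}

# A PDF file starts with a header such as "%PDF-1.2".
_PDF_HEADER = re.compile(rb"%PDF-(\d+\.\d+)")


def format_from_extension(filename: str) -> Optional[str]:
    """Coarse guess from the filename extension alone (easily fooled by renaming)."""
    ext = os.path.splitext(filename)[1].lower()
    return _EXTENSIONS.get(ext)


def format_from_content(data: bytes) -> Optional[str]:
    """Finer guess from the leading bytes; recognizes only the PDF header here."""
    match = _PDF_HEADER.match(data)
    if match:
        return "PDF " + match.group(1).decode("ascii")
    return None
```

A PDF uploaded under the name `scan.dat` is invisible to the first function, while the second still reports a versioned format from the content itself.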
A serious preservation effort demands fine-grained identification of file formats, and more reliable format identification.
We acknowledge that not every DSpace installation is required to implement preservation on its contents, so it will still be possible to configure simpler and coarser format identification tools.
Note that coarse-grained format descriptions can be useful as well, so long as there is a way to discover the fine-grained formats that conform to each coarse format. It is easier to cite a coarse-grained format when specifying the formats acceptable to an application, or when searching for Items with components of a given format.
Data Format Identifiers
Identifiers are names for data formats. There is obvious value in having a set of identifiers that:
- Are globally recognized and can be exchanged meaningfully between DSpace, other archives, and applications.
- Each uniquely identify a data format, i.e. correspond 1:1 with formats.
- Each index an entry in a publicly available format registry, which contains metadata about the format.
What is Wrong With MIME Types
Many of today's most popular multimedia applications, such as Web browsers and email user agents, recognize MIME types as data format identifiers. I believe they employ MIME types because they are already widely implemented as part of the mail system, not because they are a good solution. In practice, MIME types often fail to get content rendered correctly, requiring adjustments and human intervention.
Problems with MIME Types:
- There is no authoritative standard that actually gets followed, resulting in multiple "valid" identifiers for the same format (e.g. model/iges and application/iges have both been officially recommended at various times; unofficial usage is all over the map.)
- IANA attempts to provide a central registry.
- Applications receiving MIME types are wont to be configured with the types they commonly receive, regardless of the "standard".
- The same format can be legally described at different levels of granularity, e.g. these are all valid descriptions of RDF/XML:
- Content-type: application/rdf+xml (specifies RDF encoded as XML)
- Content-type: text/xml (only acknowledges the XML level of encoding)
- Content-type: application/xml
- In practice, MIME types are never used for fine-grained format identification. Each format has its own rules (if any) to distinguish versions within a MIME type.
It's worth noting that Apple Computer felt it necessary to invent their own Uniform Type Identifiers to take over the role of MIME types on Mac OS X. The UTIs are hierarchical, and cleanly represent families of formats and various granularities. They give each vendor (or organization) a private namespace to control, for unambiguous extensibility. The UTI system is not a file format registry, however, since there is no metadata, just names.
DSpace Abuse of "License" Format Name
The current (1.5) DSpace data model applies the special BitstreamFormat named "License" to all Bitstreams in the "LICENSE" bundle. This is a convention in the data model that identifies the Bitstreams as, effectively, rights metadata. It has two serious flaws:
- Because it overloads the purpose of the BitstreamFormat metadata, there is no way to describe the actual data format of the License Bitstream.
- It is unnecessary and redundant, since all License Bitstreams are stored in the "LICENSE" bundle anyway.
Here is a case study that clearly illustrates the hazard of stealing a Bitstream's technical metadata. In the course of moving from a Windows to a Unix platform, the DSpace administrator quoted below discovered a digital preservation problem. Many existing License Bitstreams need to be either converted in place or labelled correctly on dissemination so they are understandable. If they had honest BitstreamFormat metadata accurately identifying their format, it would be straightforward to pick out the problem Bitstreams. As it is, there may also be License Bitstreams containing XML and URIs for Creative Commons, so any mechanism to detect or fix them has to identify the file format from the contents of the Bitstream. Also, given the existing format model, the only way to repair this problem is to rewrite the License Bitstream in the new preferred character set, which is also undocumented!
From: Christian Voelker <C.Voelker@gmx.net>
To: "dspace-tech@lists.sourceforge.net Tech" <dspace-tech@lists.sourceforge.net>
Date: Fri, 1 Feb 2008 23:03:07 +0100
Subject: [Dspace-tech] How to change license.txt bitstream for all items

Hello,

we have moved from one server to another and on the new machine, apache is configured to deliver text files with the proper utf-8 encoding. On the former machine, this was obviously not the case. The former admin decided to save the license.txt in windows encoding which made it display properly in the users browser then. But now, it looks really nasty.

Example: http://www.stadtteilgeschichten.net/retrieve/940/license.txt

Now I found, that all these licenses get copied to the assetstore at the time they are granted with a timestamp and username in the first line making each of them unique and an associated database entry containing the precise file size.

Now, I could certainly put together some script to find all license files and change the special characters. This would probably reduce the file size as well, which would introduce errors to the db implicitly. Ok, I could produce another script to change the database entries accordingly. All this looks like quite some work for a more or less cosmetic problem. I see that the design makes sense because content should not be changed in an archive after the date of ingestion, but I think that this change is not dishonest.

I could also reconfigure apache to deliver text files the old way but this means to perpetuate the situation for all items added from now on.

Any advice appreciated.

Bye, Christian
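The side effect described in this message, that rewriting the text in another character set alters the stored bytes, can be reproduced in a few lines. This is a sketch under assumptions: the sample text is invented (it is not the actual license), `cp1252` is assumed to be the "windows encoding" in question, and the helper name `transcode` is hypothetical.

```python
import hashlib


def transcode(data: bytes, source: str = "cp1252", target: str = "utf-8") -> bytes:
    """Re-encode text bytes from one character set to another."""
    return data.decode(source).encode(target)


# Invented sample text containing two non-ASCII characters.
as_windows = "Lizenz für Beiträge".encode("cp1252")
as_utf8 = transcode(as_windows)

# The text is the same, but the byte length and any checksum over the bytes differ.
size_changed = len(as_utf8) != len(as_windows)
checksum_changed = hashlib.md5(as_utf8).hexdigest() != hashlib.md5(as_windows).hexdigest()
```

Any recorded size or checksum for the Bitstream would therefore have to be updated alongside the content, which is why labelling the character set in the format metadata is the less invasive repair.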
Value of External Format Registries
There are many advantages to using a data format registry outside of DSpace as the main source of format technical metadata, so long as it is stable and well-maintained.
- Formats have globally-unique, persistent identifiers
- Some other application talking to DSpace can refer precisely to a format with this external identifier, so it is mutually understandable.
- It provides technical metadata (and possibly tools) describing how to identify formats by internal and external characteristics.
- Includes references to standards documents and other descriptions.
- Registries typically include descriptive metadata showing relationships between formats, allowing fine-grained definitions to be grouped under coarse-grained formats.
- A central registry gets support and attention from other users, which we can leverage.
- Adding documentation of new formats
- Catching and correcting mistakes
- Monitoring formats for obsolescence
