BitstreamFormat Renovation Use Cases
This page contains out of date or incorrect information.
This list of use-case sketches demonstrates how the BitstreamFormat Renovation proposal will work, and shows some of the scenarios the designers had in mind. They are "sketches" of use cases because they do not have the exhaustive detail and exploration of alternatives required for a full use case document.
Ingest
Use case sketches about ingest operations.
Interactive Ingest
When submitting a new Item interactively, the user creates Bitstreams by uploading the contents through a Web browser. This supplies DSpace with a filename and possibly a MIME type, but no other clues to the data format except the contents of the Bitstream itself.
Upon receiving each Bitstream, the ingest service calls on the new automatic format identification service to assign it a BitstreamFormat. The identification service also returns a "quality" metric indicating the certainty of the identification.
At this point the UI should display the identified format for confirmation by the user. It can also use the quality metric to advise the user when the automatic result needs checking, e.g. when the identification falls in the weakest levels of quality. The UI should also offer the user the option of overriding the format choice; see Interactive Format Selection.
Uploading Logo Images
The UI for creating and modifying Collections and Communities allows new "logo" images to be uploaded. The procedure for these is almost exactly the same as for the contents of Items, except that the UI should also check that the identified format is an acceptable image (i.e. test that its MIME type begins with "image/").
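The extra check on logo uploads is small enough to sketch directly. This is only an illustration of the test described above; the function name is hypothetical and not part of any DSpace API.

```python
def is_acceptable_logo(mime_type):
    """Return True when the identified format's MIME type denotes an image.

    Hypothetical helper: the text only requires that the MIME type
    begin with "image/".
    """
    if not mime_type:
        return False
    return mime_type.strip().lower().startswith("image/")


print(is_acceptable_logo("image/png"))        # an image type
print(is_acceptable_logo("application/pdf"))  # not an image type
```

Comparing in lower case is a choice made here because MIME type names are case-insensitive.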
Unattended Ingest
The non-interactive (i.e. batch) ingestion methods benefit especially from a more reliable and accurate automatic identification of data formats.
Package-based Ingest
Each ingested Bitstream (both content and metadata) has its data format automatically identified. The submission information package (SIP) can potentially deliver three pieces of relevant metadata for each Bitstream:
- Filename, including the extension that may indicate type.
- MIME type (not useful for format identification).
- Data format identifier from a known external registry.
One source of the data format identifier is a PREMIS object element, which specifies the registry as well as the identifier. If that format registry is known to the ingesting DSpace archive (i.e. configured as an external registry by the BitstreamFormat implementation), then a simple lookup returns the exact BitstreamFormat referring to that format. If the source of the package is trusted, we can accept that as the correct format.
If there is no format identifier from a known format registry, then the automatic format identifier is invoked as in the interactive case.
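The decision just described (use a trusted, explicit identifier when its registry is known, otherwise fall back to automatic identification) might look roughly like this. All names here are assumptions made for illustration; the registries are modelled as plain dictionaries rather than as the proposal's registry plugins.

```python
def resolve_format(registry_name, identifier, known_registries,
                   source_trusted, identify):
    """Choose a format for one Bitstream from a package.

    known_registries: dict of registry name -> {identifier: format name}
    identify: zero-argument callable standing in for the automatic
              format identification service (hypothetical)
    Returns (format name, how it was obtained).
    """
    if source_trusted and registry_name in known_registries:
        found = known_registries[registry_name].get(identifier)
        if found is not None:
            return found, "from-package"
    # No usable identifier: same path as the interactive case.
    return identify(), "automatic"


known = {"PRONOM": {"fmt/98": "HTML 3.2"}}
print(resolve_format("PRONOM", "fmt/98", known, True, lambda: "Unknown"))
```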
Low-quality or failed format identifications should result in a warning.
NOTE: There is currently no designated mechanism to collect and deliver warnings during a non-interactive process. Messages can be sent to the Java logging facility, but that collects messages for all DSpace processes running on the server. There ought to be a way for a single application to collect its own warnings, and later deliver them.
The MIME type, if found, could be used to identify conflicts and possible mistakes in automatic format identification. In particular, if the first word of the MIME type differs between the package and the identified format (e.g. the package declared "image" but the Bitstream was identified as "audio"), then a warning should be recorded.
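A minimal sketch of that conflict check, assuming both MIME types are available as strings. The function name and the wording of the warning are hypothetical.

```python
def mime_conflict(package_mime, identified_mime):
    """Return a warning string when the top-level MIME types disagree.

    Returns None when they agree or when either value is missing.
    """
    if not package_mime or not identified_mime:
        return None

    def top_level(mime):
        # "image/png" -> "image"
        return mime.split("/", 1)[0].strip().lower()

    sent = top_level(package_mime)
    found = top_level(identified_mime)
    if sent != found:
        return f"package declared '{sent}' but identification found '{found}'"
    return None


print(mime_conflict("image/png", "audio/mpeg"))
```

Only the first word is compared, so "image/png" against "image/jpeg" raises no warning here; a stricter comparison would be a separate policy decision.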
Mirror, or Custody Transfer of Item between DSpace archives
To move or copy an Item between archives, the source disseminates it as some sort of package, which the target then ingests. Ideally they use an actual Archival Information Package (AIP), so that no data or metadata is lost in crosswalking to an intermediate format, a process which is inevitably incomplete.
The operation is successful if the new object is identical in content and behavior to the source object. This implies the Bitstreams have precisely equivalent BitstreamFormat values.
The copying of a Bitstream follows this sequence of operations:
- Encode its BitstreamFormat as technical metadata in the outgoing package, by adding each global, external identifier for the format to the PREMIS metadata in a formatRegistry element.
- The ingester finds the formatRegistry elements in the PREMIS metadata, and so long as it has access to the same external format registry, it can create the equivalent BitstreamFormat.
- Optionally, sanity-check on ingest that all recognized identifiers map to the same BitstreamFormat.
This assumes that both the source and target DSpace archives have the same format registries configured. But what if the source has GDFR and PRONOM configured, while the target only has GDFR? The target will still match the GDFR identifier in the PREMIS and assign the correct local BitstreamFormat to the Bitstream. However, when it produces an AIP or DIP of the Item, that Bitstream will only have the GDFR format identifier: some data has been lost, because the content model relies on BitstreamFormat objects to manage format identifiers. Ingested objects have their formats normalized to the local archive's format model, i.e. the collection of external registries it configures.
In practice, this should usually not matter. When our example Item is sent back to its original archive, the Bitstream will get back its original BitstreamFormat -- because the GDFR format identifier sent with it will get resolved to the same BitstreamFormat it had originally.
If two DSpace archives are exchanging a lot of Items, they should be configured with the same data format registries (or at least an overlapping set), so the format technical metadata on their Bitstreams is mutually comprehensible.
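The GDFR/PRONOM example above can be traced with a toy model. This is a sketch under simplifying assumptions: a format is just a dictionary of registry name to identifier, the class and method names are hypothetical, and "example-gdfr-id" is a placeholder rather than a real GDFR identifier.

```python
class ArchiveSketch:
    """Toy archive that knows a set of external registries."""

    def __init__(self, registries):
        self.registries = set(registries)
        self.formats = []  # each format: {registry name: identifier}

    def resolve(self, incoming):
        """Map incoming identifiers to a local format, creating one if needed."""
        # Identifiers from registries this archive does not configure are dropped.
        usable = {r: i for r, i in incoming.items() if r in self.registries}
        for fmt in self.formats:
            if any(fmt.get(r) == i for r, i in usable.items()):
                return fmt
        if not usable:
            return None
        self.formats.append(usable)
        return usable


source = ArchiveSketch({"GDFR", "PRONOM"})
original = {"GDFR": "example-gdfr-id", "PRONOM": "fmt/98"}
source.formats.append(original)

target = ArchiveSketch({"GDFR"})
at_target = target.resolve(original)   # the PRONOM identifier is lost here
back_home = source.resolve(at_target)  # the GDFR identifier restores the original

print(at_target)
print(back_home == original)
```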
ItemImporter
The traditional ItemImporter ingests Items from a local file structure on the DSpace server. The only cues it has available to identify a Bitstream's data format are its filename (i.e. external signature) and the contents of the Bitstream itself. Format identification works the same as the package-based ingester (at least, in cases where the package does not contain explicit format identifiers).
The addition of "quality of identification" results makes it easier for archive administrators to evaluate the success of an import and determine whether to review the automatic format choices.
Dissemination
Here is how the new format design is used in various dissemination tasks:
Interactive
Single Bitstream over HTTP
DSpace's current Web-based user interfaces deliver content by sending the raw Bitstream's data stream to the browser via HTTP. To conform to this protocol, they must describe the data format with a MIME type, in the Content-Type header sent as part of the HTTP response message. See also W3C statement on MIME types. HTTP clients such as Web browsers should depend entirely on the MIME type to render the content correctly. In practice, some clients cheat and look at filename extensions as well, but this is irregular and should be unusual. Therefore it is critical for DSpace to apply the correct MIME type to materials disseminated through HTTP.
If the MIME type of a Bitstream is "wrong", i.e. does not match what the browser expects for its format, it will not be rendered correctly. The user will be prompted for instructions if the MIME type is unknown to the browser. Note that the definition of "wrong" is somewhat situational; since MIME Type strings are poorly standardized, there are several valid descriptions of some formats, but commonly-used browsers may only recognize one of them.
Tweaking MIME Type
If Bitstreams are disseminated with a MIME Type that the prevailing browser does not recognize, this can lead to pressure on the DSpace administrator to change the behavior of his application, especially in an academic environment. For example, when a course uses digital objects from DSpace, they must be accessible to the browsers commonly used by the students.
Fortunately, the MIME type is a characteristic of the BitstreamFormat object, so changing it there will change the MIME type applied to all Bitstreams of that format. So, to alter the MIME type applied to a class of Bitstreams, the administrator only has to go to the DSpace admin interface and change the appropriate BitstreamFormat.
This requires the DSpace administrative GUI to provide an obvious path from a Bitstream's description to its BitstreamFormat, and the administrative interface for that format.
Although the MIME Type of a BitstreamFormat initially comes from its external format registry entry, it is subject to local override. This means the administrator can edit the local archive's BitstreamFormat to alter the MIME type, and the change is persistent, so it remains even if the BitstreamFormat is updated from its external registry.
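One way to picture a persistent local override is to keep the registry-supplied values and the local edits in separate layers, so that a refresh replaces only the first. The class below is a hypothetical sketch of that idea, not the proposed BitstreamFormat implementation.

```python
class BitstreamFormatSketch:
    """Format properties with a local override layer (illustrative only)."""

    def __init__(self, registry_values):
        self.registry_values = dict(registry_values)
        self.overrides = {}

    def get(self, prop):
        # A local override, when present, wins over the registry value.
        return self.overrides.get(prop, self.registry_values.get(prop))

    def set_local(self, prop, value):
        self.overrides[prop] = value

    def refresh_from_registry(self, new_values):
        # Replaces registry data but leaves local overrides untouched.
        self.registry_values = dict(new_values)


fmt = BitstreamFormatSketch({"mime_type": "application/xml"})
fmt.set_local("mime_type", "text/xml")
fmt.refresh_from_registry({"mime_type": "application/xml", "name": "XML"})
print(fmt.get("mime_type"), fmt.get("name"))
```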
Unattended
Package-based Dissemination
The METS package profiles for DSpace SIPs and AIPs call for Bitstream technical metadata in several places:
- MIME Type in the file element.
- MIME Type in the PREMIS object section.
- Registry-based format identifiers in PREMIS formatRegistry element.
All of them have obvious sources in the BitstreamFormat object. The MIME type or external registry identifiers are simply taken from the Bitstream's BitstreamFormat when adding technical metadata to the package.
Search
The file format can be the object of a search query. Usually it requires a very coarse-grained view of formats, e.g. "image", or "audio". Searchers are typically looking for Items that include, e.g., an image or audio component. The Dublin Core type element is supposed to describe the nature of the content as both a media type and a purpose or venue; e.g. types of text media are distinguished as "Article", "Thesis", "Monograph", etc. If the submitter did not provide a type value, perhaps it could be derived from the format types of content Bitstreams. This would only capture the media-type sense of type, but perhaps that is better than nothing?
This still raises the question of mapping the fine-grained format definitions we labor to identify so precisely onto the coarse range of values assigned to the DC type element. None of the present external format registries include such coarse-grained metadata for formats. The prefix of the MIME type might be made to serve, e.g. text, image, audio, etc.
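As a rough illustration of using the MIME prefix as a coarse type, the following sketch collects the top-level MIME types found among an Item's Bitstreams. The function name is hypothetical, and how the resulting words would map onto Dublin Core type values is left open, as in the text.

```python
def coarse_types(mime_types):
    """Return the sorted top-level MIME types seen among an Item's Bitstreams."""
    prefixes = set()
    for mime in mime_types:
        if mime and "/" in mime:
            prefixes.add(mime.split("/", 1)[0].strip().lower())
    return sorted(prefixes)


print(coarse_types(["image/png", "audio/mpeg", "image/jpeg"]))
```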
Archive Administration
Interactive Format Selection
There are many tasks and dialogs in the Web user interface where the user is offered the option to select a data format for a Bitstream (note that it is not always necessary to use it; e.g. when the format has been identified automatically already, the manual selection is only needed if that result was unsatisfactory):
- Confirmation dialog after uploading a Bitstream while submitting an Item.
- Workflow tasks to revise the metadata of a pending Item. Show "quality" of previous format identification as well as result, allow changes.
- Administrative functions to edit the metadata of an archived Item.
- Administrative pages to upload "logo" images when creating or editing Collection and Community objects.
Since the internal collection of BitstreamFormats (BSFs) is now just a "cache" for entries in the external format registries, it is not sufficient to give the user a choice of existing BSF entries. If they are looking for a format which has not been seen already in the archive, there will not be a BSF for it. To get access to the exhaustive list of data formats, we must offer the user a choice from amongst all of the formats in each of the configured external registries, or at least the most complete and preferred registry.
Listing and Navigating Formats
When presenting a choice of formats to the user, the fundamental problem is that the external format registries have many, many entries -- from 500 to thousands. This is too many to put in a simple pulldown menu. The registries each have their own metadata and tools to help users select a format. Rather than force a common interface on all registries, we propose that the DSpace UI defer to the format registry's UI to select a format, or perhaps implement a plugin-style UI dialogue to interface with the registry.
The registry's own search tools are bound to be more effective and powerful than a generic approach. PRONOM, for example, allows searching for formats by the software or vendor that produces them, which is more understandable to the naive user.
All DSpace requires from the format selection process is a namespaced external format identifier, which will either be matched to an existing BSF or ingested to create a new one. It is a simple matter for the module that accepts the results of a registry-specific UI to add the DSpace-specific namespace to the identifier, since the registry is already known.
Editing Format Metadata
Although a BitstreamFormat object in the DSpace content model is created by ingesting an external format description, most of the format's metadata may then be modified. The modifications act as persistent local "overrides" of the remote format data, so they remain even if the format is re-ingested after its remote source is updated. Any changes are local to the DSpace archive; they do not get reflected out to the external format registry. The modifiable properties include:
- Name: Affects how the format is displayed in the UI.
- Description: Detailed description available through UI.
- MIME Type: Affects how Bitstreams are disseminated through HTTP.
- Canonical Extension: Can be used to generate filenames; might be needed to accommodate broken HTTP user agents and when making up filenames in dissemination packages (DIPs).
- Support Level: Specifies policy regarding the level of commitment to preserve Bitstreams of this format.
Adding New File Formats
Sometimes it is necessary to add a new file format to the repertoire of the external format registries (note that new BSFs are added automatically for any unrecognized external identifier). When the true format of a Bitstream is not already listed in any of the configured external registries, it must be added somehow. The options are, in order of preference:
- Add a full description of the format to a user-editable external registry, such as the GDFR. It will create a globally unique identifier in its namespace.
  - Include the data to drive automatic format identification tools, e.g. internal signatures.
  - Supply other metadata called for by the registry, such as references to specification documents.
- Add a format entry to the built-in Local registry in your DSpace, and make up a locally-unique identifier.
  - References to this format are not portable to other DSpaces unless they have the same Local format entry (i.e. it was manually copied over).
The first option is much preferred, since adding the format description to a common registry benefits all of its users, and gives you a persistent format identifier.
Managing Bitstreams
An archive administrator sometimes needs to modify format technical metadata in the content model to correct mistakes or accommodate changes. Some possible scenarios:
- One Bitstream in an Item is discovered to have the wrong format, or none, and must be corrected.
- Many Bitstreams did not have their format automatically identified in a recent batch import. After fixing the automatic identification, they must be re-identified.
- A new format description is added to the PRONOM registry, deprecating a format in the Local registry that was added for expedience.
  - Change all Bitstreams referring to the old format over to the new one, and delete the Local entry.
The administrative UI needs a method to select a collection of Bitstreams by their BitstreamFormat, among other criteria. (Note that since unidentified Bitstreams are set to the Unknown format, they are selected as easily as any other format.) This collection can then be the subject of other operations, namely:
- Change format of selected Bitstreams to a different BSF.
- Re-try automatic format identification.
- Must offer confirmation option when done interactively.
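The select-then-operate pattern above can be sketched as two small functions. Bitstreams are modelled as dictionaries and the `confirm` callback stands in for an interactive confirmation step; both are assumptions made for illustration, as are the format names in the usage lines.

```python
def select_by_format(bitstreams, format_name):
    """Return the Bitstreams currently assigned the given format."""
    return [b for b in bitstreams if b["format"] == format_name]


def change_format(selected, new_format, confirm=lambda bitstream, fmt: True):
    """Reassign the format of each selected Bitstream.

    confirm is called per Bitstream; returning False skips the change.
    Returns the number of Bitstreams actually changed.
    """
    changed = 0
    for bitstream in selected:
        if confirm(bitstream, new_format):
            bitstream["format"] = new_format
            changed += 1
    return changed


store = [
    {"id": 1, "format": "Unknown"},
    {"id": 2, "format": "Local example format"},
    {"id": 3, "format": "Unknown"},
]
unknowns = select_by_format(store, "Unknown")
print(change_format(unknowns, "Registry example format"))
```

Because unidentified Bitstreams carry the Unknown format, selecting them needs no special case.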
Some administrative operations are needed for the BitstreamFormats themselves:
- Delete a BSF (provided no Bitstreams refer to it, of course).
- Add or modify the external identifiers mapped to a BSF.
- Edit descriptive and administrative metadata.
- Locate the BSF for a given external identifier.
Assessments and Reports
The new format infrastructure gives the archive administrator much more control over how formats are identified and even where the technical metadata comes from. In order to make intelligent decisions and monitor their outcome, she needs to gather data about the archive, so these reports will be available:
- Histogram of number of Bitstreams referencing each BSF.
- Counts of each format-identification quality for each type of BSF.
- Dump of all BSF table entries.
- Dump of all external identifiers bound to BSFs, organized by registry.
The histogram of BSF usage is especially important since it can be coupled with alerts about obsolete formats to gauge how serious the problem is. It can also show how effectively the format identification works by the frequency of precise format versions versus generic broadly-defined formats. The report of quality per BSF shows that more graphically and can help tune the format identification configuration.
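The first two reports are simple tallies. The sketch below assumes each Bitstream record carries a format name and an identification quality; the record layout and the quality labels in the sample data are hypothetical.

```python
from collections import Counter


def format_histogram(bitstreams):
    """Count the Bitstreams referencing each format."""
    return Counter(b["format"] for b in bitstreams)


def quality_by_format(bitstreams):
    """Count each (format, identification quality) combination."""
    return Counter((b["format"], b["quality"]) for b in bitstreams)


records = [
    {"format": "PDF 1.4", "quality": "strong"},
    {"format": "PDF 1.4", "quality": "weak"},
    {"format": "Unknown", "quality": "none"},
]
for name, count in format_histogram(records).most_common():
    print(name, count)
```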
Preservation Tasks
The following digital preservation tasks depend on features of the data format infrastructure:
Format Identification
Virtually all preservation tasks depend on knowing the exact data format of the digital object being preserved, so accurate format identification is the cornerstone of DSpace preservation.
We believe it is better to concentrate on making automatic format identification precise, accurate, and efficient, since it is likely to be more reliable than manual format identification. Few end-users understand the importance and subtleties of data format identification, or appreciate the advantage of having thousands of known formats to choose from. In our experience the average submitter, or even the average workflow editor, is not likely to give more than cursory attention to format technical metadata.
Newly-ingested Bitstreams
It is critical to correctly identify the formats of Bitstreams in new submissions at the time of ingestion; once the Item is in the archive, it is not guaranteed to get any more attention even from administrators.
This requirement can be satisfied by a configurable policy and the mechanism for enforcing it. For example, the policy would state the minimum acceptable format identification quality, and the consequence for failure. Bitstreams receiving a quality metric below the minimum would result in one of these alternatives:
- Ingestion operation fails.
- Ingested item is held (as in workflow) for administrative checking and approval.
- Warning is logged and sent to ingester, and owner of target Collection.
- No consequences.
Since the range of quality includes NONE, meaning no format was identified, setting the minimum acceptable quality to NONE is another way to allow failures with no consequence.
The proposed format-identification policy is configurable at each Collection and as a default for the entire archive.
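A policy of this kind could be modelled as a pair (minimum quality, consequence), with a per-Collection value falling back to the archive default. In the sketch below only NONE comes from the text; the other quality levels, the default policy, and all names are assumptions made for illustration.

```python
from enum import Enum, IntEnum


class Quality(IntEnum):
    NONE = 0    # no format identified (named in the text)
    WEAK = 1    # hypothetical intermediate level
    STRONG = 2  # hypothetical top level


class Consequence(Enum):
    FAIL = "ingestion operation fails"
    HOLD = "item held for administrative approval"
    WARN = "warning logged and sent"
    NOTHING = "no consequences"


# Hypothetical archive-wide default policy.
ARCHIVE_DEFAULT = (Quality.WEAK, Consequence.WARN)


def apply_policy(quality, collection_policy=None):
    """Return the consequence for a Bitstream, or None if it is acceptable."""
    minimum, consequence = collection_policy or ARCHIVE_DEFAULT
    if quality < minimum:
        return consequence
    return None


print(apply_policy(Quality.NONE))
print(apply_policy(Quality.WEAK, (Quality.STRONG, Consequence.FAIL)))
```

Setting the minimum to NONE means no quality can fall below it, which matches the observation above that it allows failures with no consequence.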
Tuning Format Identification
The machinery of automatic format identification is completely configurable, so the administrator of each DSpace instance can adapt it to suit his needs. It is implemented as a sequence plugin, in which the implementations are all called in a configured order. Each plugin may recognize only some formats, and it can also see and leverage the results of previously-called plugins.
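The sequence-plugin idea can be sketched as an ordered list of callables, each of which receives the results accumulated so far. Everything below is hypothetical: the two toy plugins, the numeric quality scores, and the rule of keeping the highest-scoring hit are assumptions, not the proposal's actual interface.

```python
def by_extension(name, data, previous):
    """Weak, generic hit based on the filename extension alone."""
    if name.lower().endswith(".xml"):
        return ("XML", 1)
    return None


def by_xml_content(name, data, previous):
    """Refine an earlier generic XML hit by looking at the content."""
    already_xml = any(fmt == "XML" for fmt, _ in previous)
    if already_xml and b"<svg" in data:
        return ("SVG", 2)
    return None


def identify(name, data, plugins):
    """Call every plugin in configured order; keep the best-scoring result."""
    results = []
    for plugin in plugins:
        hit = plugin(name, data, results)
        if hit is not None:
            results.append(hit)
    if not results:
        return ("Unknown", 0)
    return max(results, key=lambda r: r[1])


svg = b"<svg xmlns='http://www.w3.org/2000/svg'/>"
print(identify("drawing.xml", svg, [by_extension, by_xml_content]))
print(identify("drawing.xml", svg, [by_xml_content, by_extension]))
```

The two calls differ only in plugin order, which shows why ordering is part of the configuration: the content plugin can only refine a result that an earlier plugin has already produced.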
Tuning the sequence of format identification plugins lets the archive administrator keep up with new data formats and the constant improvements in the technology of identifying them. As well, each archive has its own requirements of format identification, based on the types of material it ingests and the requirements for its preservation. To tune format identification:
- Determine the range of formats that need to be identified, and the desired precision.
  - E.g. is "XML" adequate for all XML-based formats, or do some need to be identified as e.g. SVG, XHTML, METS?
- Select which plugin implementations to include and the most advantageous ordering.
- Test against samples of expected submissions, and revise if necessary.
- Keep up-to-date on format identification developments:
  - Watch for news and exchange information within the DSpace community.
  - Revise configuration as needed.
Detect Obsolete Formats
When preserving digital objects, it is essential to know when their format is becoming obsolete and thus needs attention. See the AONS II project for an example of an application that does this.
First, you need very fine-grained format identification that discriminates between versions of a family of formats (e.g. PDF). Often, older versions of a format will become unsupportable while the later versions are still viable.
This task also illustrates another advantage of using external data format registries as the archetype of DSpace format definitions: we automatically leverage the work of preservation specialists maintaining and using those external registries, e.g. when they announce obsolete formats.
When a format in an external registry is declared obsolete, the DSpace administrator can easily locate Bitstreams in that format using the same tools as for updating and changing formats.
The archive's actions should also be governed by policy, namely the support level property of the BitstreamFormat. If the support level is anything less than SUPPORTED, then the archive may ignore the obsolescence of that format or just issue a warning to owners of affected resources. Otherwise it is obligated to migrate or otherwise preserve the affected Bitstreams.
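The policy in the last paragraph amounts to sorting affected Bitstreams into two groups. This sketch assumes support levels are stored as strings and that only the value SUPPORTED carries the obligation to act; the data layout and format names are hypothetical.

```python
def plan_obsolescence(bitstreams, support_levels, obsolete_formats):
    """Group Bitstreams in obsolete formats by the action policy calls for.

    bitstreams: iterable of (bitstream id, format name) pairs
    support_levels: dict of format name -> support level string
    obsolete_formats: set of format names declared obsolete
    """
    plan = {"migrate": [], "warn": []}
    for bitstream_id, fmt in bitstreams:
        if fmt not in obsolete_formats:
            continue
        if support_levels.get(fmt) == "SUPPORTED":
            # Obligated to migrate or otherwise preserve.
            plan["migrate"].append(bitstream_id)
        else:
            # May ignore, or just warn the owners.
            plan["warn"].append(bitstream_id)
    return plan


holdings = [(1, "Old format A"), (2, "Old format B"), (3, "Current format")]
levels = {"Old format A": "SUPPORTED", "Old format B": "KNOWN"}
print(plan_obsolescence(holdings, levels, {"Old format A", "Old format B"}))
```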
Selection of Applications
A primary use of file format technical metadata is to match Bitstreams to the applications and filters that can accept their format. The preservation-tool framework within DSpace helps manage this process by finding tools that support a Bitstream's format.
An application is configured in the framework by listing:
- The plugin interface it implements (e.g. MediaFilter)
- File formats it accepts, in the form of namespaced external format identifiers.
  - Should include the appropriate format(s) for each external registry in use.
- Optionally, output format it produces.
If an application is capable of producing multiple output formats, it would be configured as multiple instances, since each instance will also need some sort of parameter to tell it which format to emit.
Matching Format Families and Supertypes
One problem in configuring an application is when it claims to accept all versions of a format, or simply is not specific beyond a generic description like "MS Word documents": how can you translate that to format entries in a registry? To configure it properly in DSpace, it seems you'd have to hunt down all the specific format definitions that fit the broad profile of what it accepts -- or else be able to configure a non-specific format, as described here.
Allowing an application to configure its "accepted formats" as generic or non-specific formats has these advantages:
- Much easier and more likely to be done correctly by the DSpace administrator.
- As formats are added to the registry, e.g. new version of a format within a family, the configuration will still be correct without any updates.
- Reflects the reality of unspecific input requirements of the application.
Some external format registries, such as GDFR and PRONOM, have a hierarchical type model for formats. They document formats which are subtypes of other "supertypes", or belong to a family of formats headed by a generic format. Each registry has different subtle distinctions in its relationship model, which makes it difficult to create a "normalized" view of it in the DSpace BitstreamFormat model. Another reason not to model it in BitstreamFormat is that the supertype mentioned in the configuration may not have any entry in the BitstreamFormat table, since only formats referenced by Bitstreams are included there.
Rather than attempt a flawed normalization, we will interface to external type hierarchies through the FormatRegistry plugin. Since we only need to answer the question, "Is this format acceptable as input to an application that says it accepts Format X?", we can just add a method to ask that question directly of the external registry: Is this format equivalent to or a subtype of X?
Here is an illustrative example:
- Start with a document identified as PRONOM "fmt/98" (HTML 3.2) format.
- We want to get plain text out of it (e.g. for full-text indexing).
- There is a filter configured that accepts the PRONOM "fmt/96" (generic "HTML").
- Asking the registry, we discover "fmt/98" is a subtype of "fmt/96".
- The Bitstream is therefore acceptable to that filter, so processing can start.
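The question put to the registry in this example ("is this format equivalent to or a subtype of X?") can be sketched as a walk up a supertype chain. The class and method names are hypothetical and do not reflect the FormatRegistry plugin interface; the single parent link is taken from the example above rather than looked up in PRONOM.

```python
class RegistrySketch:
    """Toy registry holding identifier -> supertype identifier links."""

    def __init__(self, parents):
        self.parents = dict(parents)

    def conforms_to(self, fmt, target):
        """True if fmt equals target or has target among its supertypes."""
        seen = set()
        while fmt is not None and fmt not in seen:
            if fmt == target:
                return True
            seen.add(fmt)  # guards against a cyclic hierarchy
            fmt = self.parents.get(fmt)
        return False


pronom = RegistrySketch({"fmt/98": "fmt/96"})
print(pronom.conforms_to("fmt/98", "fmt/96"))
print(pronom.conforms_to("fmt/96", "fmt/98"))
```

A single-parent chain is the simplest case; a registry with richer relationships would answer the same question with its own logic, which is the reason for delegating it.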
Data Format Validation
Validation is a necessarily distinct task from format identification. Not only is it a waste of time to try validating a format when it has not been precisely identified yet, there is also the possibility of false positives. Validators and format identification have different goals, anyway: for example, a PDF validator only has to ensure the document conforms to the specification of the PDF version(s) it validates, while the identifier must accept any arbitrary byte stream without crashing and identify all the formats it knows. A loose validator may not discriminate between different versions of a format, while the identifier must do so. It is valuable to be able to identify formats even if we cannot validate them.
Validation is probably most valuable as part of the ingest process. Governed by the archive's or collection's policy, submissions could be rejected or queued for administrative review if any Bitstreams do not pass validation for their identified formats. This helps ensure that the contents of those Bitstreams will be readable to users when disseminated, and that preservation operations (like migration) will be successful.
Note the similarity to monitoring quality-of-identification on ingest; a validation policy could be implemented by the same policy mechanism.
Migration, and Verifying Integrity of Format Migration
Archive administrators often rely on migration to preserve obsolete formats. It is also necessary to verify that a migration succeeded, i.e. compare the old and new versions of the Bitstream, making sure they are equivalent.
There are various techniques and software packages to migrate and verify digital objects. They can be modeled as application programs with an input BitstreamFormat (or range of formats), and an expected output BitstreamFormat.
The migration tool is matched to the subject of a migration by its input format, and by whether it can produce a non-obsolete output format.
The validation tool is matched to an existing pair of source and target Bitstreams, presumably the source and result of a migration. It must match both formats. It returns a Boolean value, true if the target accurately represents the source. It may also generate a report or stream of warnings which should be handled the same way as e.g. warnings on ingest procedures.
In the DSpace configuration, migration and validation tools are listed with their input and output data formats described as namespaced external format identifiers. External format registries are the source of stable and persistent format identifiers.
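Matching tools to Bitstreams by format could be sketched as below. The tool records, their field names, and the "EX:" identifiers are all hypothetical placeholders; real configurations would use identifiers from the configured external registries.

```python
def pick_migration_tool(tools, source_format, obsolete_formats):
    """Return the first tool that accepts the source format and
    produces an output format that is not itself obsolete."""
    for tool in tools:
        if source_format in tool["inputs"] and tool["output"] not in obsolete_formats:
            return tool["name"]
    return None


def pick_verifier(verifiers, source_format, target_format):
    """Return the first verifier that matches both formats of the pair."""
    for verifier in verifiers:
        if source_format in verifier["sources"] and target_format in verifier["targets"]:
            return verifier["name"]
    return None


tools = [
    {"name": "old-converter", "inputs": {"EX:doc-v1"}, "output": "EX:doc-v2"},
    {"name": "new-converter", "inputs": {"EX:doc-v1", "EX:doc-v2"}, "output": "EX:doc-v3"},
]
obsolete = {"EX:doc-v1", "EX:doc-v2"}
print(pick_migration_tool(tools, "EX:doc-v1", obsolete))
```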
Media Filter
The existing MediaFilter mechanism relies on data formats for two purposes: first, it looks for Bitstreams whose formats match the input format configured for each filter; and second, it depends on setting the format (as well as the name) of output Bitstreams to a certain known value which can be checked later to confirm that a filter has already been run.
Like the migration and verification tools, media filters are configured with formats named by stable and persistent format identifiers from external format registries.
The media filters in DSpace 1.4.x were configured and designed to work with fairly generic data formats, e.g. "PDF", but not a specific version of PDF. They should be mapped to similarly generic formats in the registries in use.
On the output side, the MediaFilter is configured with an external format identifier to impose on its output, and later to look for as a cue to detect the Bitstreams it created. (Though this is a poor technique for tracing its actions; administrative metadata documenting the relationship between source and output bitstreams would be easier to detect reliably and also more obvious to administrators and other applications.)
