TechMDExtractor

From DSpace Wiki

author: Grace Carpenter
date: August 2006

TechMDExtractor is a command-line tool for running Jhove on the DSpace asset store. It will determine if each bitstream is a valid and/or well-formed instance of the format it purports to be. If an identifier is specified, processing will be limited to the given Community, Collection, or Item. If verbose processing is specified, all the extracted technical metadata will be sent to standard output.

About the Design

To make Jhove work with DSpace, I had to create two classes that take the place of two of the main Jhove classes. These classes, org.dspace.app.techmdextractor.jhove.DSJhoveBase and org.dspace.app.techmdextractor.jhove.DSConfigHandler, essentially rewrite the code in the corresponding Jhove classes (edu.harvard.hul.ois.jhove.JhoveBase and edu.harvard.hul.ois.jhove.ConfigHandler, respectively). DSJhoveBase initializes the Jhove modules and provides the main entry points that DSpace uses to parse bitstreams. DSConfigHandler parses the DSpace-specific elements of the Jhove configuration file (jhove.conf).
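
To illustrate the kind of parsing described for DSConfigHandler, here is a minimal sketch using a plain SAX handler. It is not the actual DSConfigHandler code. The class name FormatNameSketch, the sample XML, and the assumption that <dspace:format-name> sits inside <module> right after <class> are all illustrative; only the element names come from this page.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: maps each Jhove module class to a DSpace format name.
class FormatNameSketch {

    static class Handler extends DefaultHandler {
        final Map<String, String> formatByModule = new LinkedHashMap<>();
        private final StringBuilder text = new StringBuilder();
        private boolean inModule;
        private String moduleClass;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            text.setLength(0);
            if ("module".equals(qName)) {
                inModule = true;
                moduleClass = null;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            String value = text.toString().trim();
            text.setLength(0);
            if (!inModule) {
                return; // ignore <class> elements outside <module>
            }
            if ("class".equals(qName)) {
                moduleClass = value;
            } else if ("dspace:format-name".equals(qName) && moduleClass != null) {
                formatByModule.put(moduleClass, value);
            } else if ("module".equals(qName)) {
                inModule = false;
            }
        }
    }

    // Parses a configuration string; the default factory is not namespace-aware,
    // so elements are matched on their qualified names.
    static Map<String, String> parse(String xml) {
        try {
            Handler handler = new Handler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.formatByModule;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse configuration", e);
        }
    }

    public static void main(String[] args) {
        // Illustrative fragment only; the namespace URI is made up.
        String xml =
            "<jhoveConfig xmlns:dspace='http://example.org/dspace'>"
          + "<module><class>edu.harvard.hul.ois.jhove.module.PdfModule</class>"
          + "<dspace:format-name>Adobe PDF</dspace:format-name></module>"
          + "</jhoveConfig>";
        System.out.println(parse(xml));
    }
}
```

A map keyed by module class is just one convenient shape for the result; the real handler may store this information differently.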


Configuring TechMDExtractor

  1. Apply the dspace-preingest patch to your DSpace installation, and follow the instructions for configuring it. (TechMDExtractor has build-time dependencies on the Pre-ingest project: it implements two of its interfaces, org.dspace.workflow.PreIngestFilter and org.dspace.workflow.FilterResult.)

  2. Check out the TechMDExtractor project from CVS.

  3. Modify the file TechMDExtractor/config/jhove.conf to reflect the specifics of your DSpace installation. In particular, the things that must be modified are:

    • the <tempDirectory> element must contain a directory with appropriate permissions for the Jhove executable to write to
    • the <dspace:format-name> element that follows each <module>/<class> element must contain the short description of the format as it appears in your bitstreamformatregistry table. The supplied jhove.conf contains the default short descriptions for DSpace formats, so you don't have to worry about this if you haven't edited the bitstreamformatregistry table.
  4. If you wish, configure logging for the non-DSpace-specific code in Jhove by editing TechMDExtractor/config/jhoveLogging.properties.

    Note that the Jhove code actually uses two different logging APIs: Java logging (java.util.logging) for most of Jhove, and log4j for the DSpace-specific initialization and top-level execution code. For debugging set-up problems, you should be able to get most of the information you need from the regular DSpace logs. If you want to debug format-specific parsing issues, modify the file TechMDExtractor/config/jhoveLogging.properties, which will be placed in your [dspace]/config/ directory at build-time.

  5. From the TechMDExtractor directory, type ant install. After the build process has completed, verify that the following jars are in your [dspace]/lib directory:

    • tmdExtractor.jar
    • jhove.jar
    • jhove-handler.jar
    • jhove-module.jar

    Running ant install should also place the above jars in your [dspace-source]/lib directory, for use in the Workflow Pre-ingest step.

    The files TechMDExtractor/config/jhove.conf and TechMDExtractor/config/jhoveLogging.properties should have been copied into your [dspace]/config directory.

  6. Don't forget that the dspace.cfg file in your [dspace]/config directory must be modified, as specified in the Workflow Pre-ingest instructions.

    Note that the Jhove initialization code (in org.dspace.app.techmdextractor.jhove.JhoveExtractor) also checks for the configuration variable jhove.sax.class. This is because parsing jhove.conf always produces SAX parser errors for me, although they don't cause the code to fail. See the "Known Issues" section of the documentation for more information.
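
The jhove.sax.class check mentioned in step 6 could look roughly like the sketch below. This is an assumption about the general approach, not the code in JhoveExtractor: the class name SaxClassSketch and the method readerFor are hypothetical, and a java.util.Properties object stands in for the values read from dspace.cfg.

```java
import java.util.Properties;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;

// Hypothetical sketch: choose a SAX parser based on jhove.sax.class.
class SaxClassSketch {

    // Returns the configured parser if jhove.sax.class is set,
    // otherwise whatever parser the JAXP factory provides by default.
    static XMLReader readerFor(Properties config) {
        String saxClass = config.getProperty("jhove.sax.class", "").trim();
        try {
            if (!saxClass.isEmpty()) {
                // The named class must implement org.xml.sax.XMLReader and
                // have a public no-argument constructor.
                return (XMLReader) Class.forName(saxClass)
                        .getDeclaredConstructor().newInstance();
            }
            return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        } catch (Exception e) {
            throw new IllegalStateException(
                    "cannot create SAX parser '" + saxClass + "'", e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for dspace.cfg with no jhove.sax.class entry.
        Properties config = new Properties();
        System.out.println(readerFor(config).getClass().getName());
    }
}
```

Falling back to the default parser when the property is absent keeps the setting optional, which matches the note that the errors do not cause the code to fail.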

Running TechMDExtractor

From the [dspace]/bin directory, type

dsrun org.dspace.app.techmdextractor.ExtractorManager -h

You'll get a list of command-line options for running the program. Note that the code for TechMDExtractor is based heavily on (OK, stolen from ;) ) the MediaFilter code, so many of the options are similar.

Files Changed

  • config/dspace.cfg

Files Added

The source code may be found online under CVS here:
http://libaxis1.mit.edu/viewcvs/sandbox/TechMDExtractor/

  • config/jhove.conf
  • config/jhoveLogging.properties
  • src/org/dspace/app/techmdextractor/jhove/DSConfigHandler.java
  • src/org/dspace/app/techmdextractor/jhove/DSJhoveBase.java
  • src/org/dspace/app/techmdextractor/jhove/JhoveExtractor.java
  • src/org/dspace/app/techmdextractor/jhove/JhoveFilterResult.java
  • src/org/dspace/app/techmdextractor/jhove/JhovePreIngestFilter.java
  • src/org/dspace/app/techmdextractor/jhove/JhoveTechMD.java
  • src/org/dspace/app/techmdextractor/jhove/ExtractorManager.java
  • src/org/dspace/app/techmdextractor/jhove/TechMDExtractorException.java
  • build.xml

Known Issues

  • SAXParser problem: the SAX parser complains when it parses the jhove.conf file. The messages I get are:

    [Warning] jhove.conf:6:39: SchemaLocation: schemaLocation value = 'http://hul.harvard.edu/oi/xml/xsd/jhove/1.3/jhoveConfig.xsd' must have even number of URI's.
    [Error] jhove.conf:6:39: cvc-elt.1: Cannot find the declaration of element 'jhoveConfig'

    If you use Jhove 'out of the box', you won't see these messages. I believe that stand-alone Jhove uses the default Java SAX parser (Crimson?), whereas DSpace uses Xerces, and the two parsers probably need to be configured differently. I don't think the messages indicate a problem with the config file itself, but I'm not sure how this affects the parsing of XML documents submitted to Jhove. I started to experiment with this: the TechMDExtractor code checks the dspace.cfg file to see whether a parser is specified (jhove.sax.class=sax parser class name). Needs investigation.
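
    One possible reading of the warning: XML Schema expects the schemaLocation attribute to hold whitespace-separated pairs of namespace URI and schema location, and the value quoted in the message contains a single URI. The sketch below only illustrates that pairing rule. The class SchemaLocationSketch and the namespace http://example.org/ns/jhoveConfig are hypothetical, and whether adding a namespace to jhove.conf silences the messages has not been verified.

```java
// Hypothetical sketch: check that a schemaLocation value is made of URI pairs.
class SchemaLocationSketch {

    // Counts the whitespace-separated URIs in a schemaLocation value.
    static int uriCount(String schemaLocation) {
        String trimmed = schemaLocation.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    // True when the value consists of complete namespace/location pairs.
    static boolean isWellPaired(String schemaLocation) {
        int count = uriCount(schemaLocation);
        return count > 0 && count % 2 == 0;
    }

    public static void main(String[] args) {
        // Value quoted in the warning above: a single URI.
        String fromWarning =
            "http://hul.harvard.edu/oi/xml/xsd/jhove/1.3/jhoveConfig.xsd";
        // Same location preceded by a made-up namespace URI.
        String paired = "http://example.org/ns/jhoveConfig " + fromWarning;

        System.out.println(isWellPaired(fromWarning));
        System.out.println(isWellPaired(paired));
    }
}
```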
