TechMDExtractor
From DSpace Wiki
author: Grace Carpenter
date: August 2006
TechMDExtractor is a command-line tool for running Jhove on the DSpace asset store. It will determine if each bitstream is a valid and/or well-formed instance of the format it purports to be. If an identifier is specified, processing will be limited to the given Community, Collection, or Item. If verbose processing is specified, all the extracted technical metadata will be sent to standard output.
About the Design
In order to make Jhove work with DSpace, I had to create two classes that wrap two of the main Jhove classes. These classes, org.dspace.app.techmdextractor.jhove.DSJhoveBase and org.dspace.app.techmdextractor.jhove.DSConfigHandler, essentially re-write the code in the corresponding Jhove classes (edu.harvard.hul.ois.jhove.JhoveBase and edu.harvard.hul.ois.jhove.ConfigHandler, respectively). DSJhoveBase initializes the Jhove modules, and also provides the main entry points for DSpace to parse bitstreams. DSConfigHandler has code to parse the DSpace-specific elements of the Jhove configuration file (jhove.conf).
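To give a feel for what parsing the DSpace-specific elements involves, here is a minimal sketch of a SAX handler that pairs each module class with its DSpace format name. It is not the actual DSConfigHandler code: the class name FormatNameSketch, the sample XML, the placeholder namespace URI, and the assumption that <dspace:format-name> sits inside <module> right after <class> are all made up for illustration.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class FormatNameSketch {

    // Hypothetical jhove.conf fragment; the namespace URI is a placeholder
    // and the element nesting is an assumption, not copied from the real file.
    static final String SAMPLE =
          "<jhoveConfig xmlns:dspace='urn:example:dspace'>"
        + "<module><class>edu.harvard.hul.ois.jhove.module.PdfModule</class>"
        + "<dspace:format-name>Adobe PDF</dspace:format-name></module>"
        + "<module><class>edu.harvard.hul.ois.jhove.module.JpegModule</class>"
        + "<dspace:format-name>JPEG</dspace:format-name></module>"
        + "</jhoveConfig>";

    /** Returns a map from DSpace format short description to Jhove module class name. */
    static Map<String, String> parse(String xml) {
        final Map<String, String> formatToModule = new LinkedHashMap<>();
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();
            private String currentClass;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                text.setLength(0); // start collecting text for the new element
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                String value = text.toString().trim();
                if ("class".equals(qName)) {
                    currentClass = value; // remember the module class
                } else if ("dspace:format-name".equals(qName) && currentClass != null) {
                    formatToModule.put(value, currentClass);
                } else if ("module".equals(qName)) {
                    currentClass = null; // leaving this module entry
                }
            }
        };
        try {
            // The default factory is not namespace-aware, so qName carries the prefix.
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse configuration", e);
        }
        return formatToModule;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> entry : parse(SAMPLE).entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```

A map like this is one way DSpace code could look up which Jhove module to run for a bitstream, given the short description of its format.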
Configuring TechMDExtractor
Apply the dspace-preingest patch to your DSpace installation, and follow the instructions for configuring it. (TechMDExtractor has some build-time dependencies on the Pre-ingest project--it implements two of its interfaces: org.dspace.workflow.PreIngestFilter and org.dspace.workflow.FilterResult.)
Check out the TechMDExtractor project from CVS.
Modify the file TechMDExtractor/config/jhove.conf to reflect the specifics of your DSpace installation. In particular, the things that must be modified are:
- the <tempDirectory> element must contain a directory with appropriate permissions for the Jhove executable to write to
- the <dspace:format-name> element that follows each <module>/<class> element must contain the short description of the format as it appears in your bitstreamformatregistry table. jhove.conf contains the default short descriptions for DSpace formats, so you don't have to worry about this if you haven't edited the bitstreamformatregistry table.
If you wish, configure logging for the non-DSpace-specific code in Jhove by editing TechMDExtractor/config/jhoveLogging.properties. Note that the Jhove code actually uses two different logging APIs: Java logging for most of Jhove, and log4j for the DSpace-specific initialization and top-level execution code. For debugging set-up problems, you should be able to get most of the information you need from the regular DSpace logs. If you want to debug format-specific parsing issues, you should modify the file TechMDExtractor/config/jhoveLogging.properties, which will be placed in your [dspace]/config/ directory at build time.
From the TechMDExtractor directory, type ant install. After the build process has completed, verify that the following jars are in your [dspace]/lib directory:
- tmdExtractor.jar
- jhove.jar
- jhove-handler.jar
- jhove-module.jar
Running ant install should also place the above jars in your [dspace-source]/lib directory, for use in the Workflow Pre-ingest step.
The files TechMDExtractor/config/jhove.conf and TechMDExtractor/config/jhoveLogging.properties should have been copied into your [dspace]/config directory.
Don't forget that the dspace.cfg file in your [dspace]/config directory must be modified, as specified in the Workflow Pre-ingest instructions.
Note that the Jhove initialization code (in org.dspace.app.techmdextractor.jhove.JhoveExtractor) also checks for the configuration variable jhove.sax.class. This is because I always get errors when parsing the jhove configuration file, although they don't cause the code to fail. See the "Known Issues" section of the documentation for more information.
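The jhove.sax.class check can be pictured roughly as follows. This is only a sketch under assumptions, not the code in JhoveExtractor: the class name SaxReaderSketch is hypothetical, and a plain java.util.Properties object stands in for however the real code reads dspace.cfg. If the variable is set, the named class is instantiated as the SAX reader; otherwise the JAXP default is used.

```java
import java.util.Properties;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;

public class SaxReaderSketch {

    /**
     * Picks a SAX reader. The Properties argument is a stand-in for the
     * DSpace configuration (an assumption made for this sketch).
     */
    static XMLReader createReader(Properties config) {
        String saxClass = config.getProperty("jhove.sax.class");
        try {
            if (saxClass != null && !saxClass.trim().isEmpty()) {
                // Instantiate the configured parser class by name.
                return (XMLReader) Class.forName(saxClass.trim())
                        .getDeclaredConstructor().newInstance();
            }
            // No parser configured: fall back to the JAXP default.
            return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        } catch (Exception e) {
            throw new IllegalStateException(
                    "Could not create SAX reader for jhove.sax.class=" + saxClass, e);
        }
    }

    public static void main(String[] args) {
        XMLReader reader = createReader(new Properties());
        System.out.println("Using SAX reader: " + reader.getClass().getName());
    }
}
```

Running the main method with an empty configuration prints the name of whichever parser the JVM supplies by default, which is a quick way to see what DSpace would be using if jhove.sax.class is left unset.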
Running TechMDExtractor
From the [dspace]/bin directory, type
dsrun org.dspace.app.techmdextractor.ExtractorManager -h
You'll get a list of command-line options for running the program. Note that the code for the TechMDExtractor is based heavily on (OK, stolen from ;) ) the MediaFilter code, so many of the options are similar.
Files Changed
- config/dspace.cfg
Files Added
The source code may be found online under CVS here:
http://libaxis1.mit.edu/viewcvs/sandbox/TechMDExtractor/
- config/jhove.conf
- config/jhoveLogging.properties
- src/org/dspace/app/techmdextractor/jhove/DSConfigHandler.java
- src/org/dspace/app/techmdextractor/jhove/DSJhoveBase.java
- src/org/dspace/app/techmdextractor/jhove/JhoveExtractor.java
- src/org/dspace/app/techmdextractor/jhove/JhoveFilterResult.java
- src/org/dspace/app/techmdextractor/jhove/JhovePreIngestFilter.java
- src/org/dspace/app/techmdextractor/jhove/JhoveTechMD.java
- src/org/dspace/app/techmdextractor/jhove/ExtractorManager.java
- src/org/dspace/app/techmdextractor/jhove/TechMDExtractorException.java
- build.xml
Known Issues
SAXParser problem: the SAX parser complains when it parses the jhove.conf file. The messages I get are:
[Warning] jhove.conf:6:39:SchemaLocation: schemLocation value = 'http://hul.harvard.edu/oi/xml/xsd/jhove/1.3/jhoveConfig.xsd' must have even number of URI's.
[Error] jhove.conf:6:39: cvc-elt.1: Cannot find the declaration of element 'jhoveConfig'
If you use Jhove 'out of the box', you won't receive these errors. I believe that Jhove as a stand-alone uses the default Java SAX parser (Crimson?), whereas DSpace is using Xerces. It seems that the different parsers probably need to be configured differently. I don't think the error messages are a problem for the config file, but I'm not sure how this affects the parsing of XML docs submitted to Jhove. I started to play around with this, and the TechMDExtractor code actually checks the dspace.cfg file to see if a parser is specified (jhove.sax.class=sax parser name). Needs investigation.
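One possible lead on the warning: in XML Schema, an xsi:schemaLocation value is a whitespace-separated list of pairs, each a namespace URI followed by a schema location, so a value holding a single URI has an odd number of entries. The sketch below only checks that property. The class name SchemaLocationSketch and the placeholder namespace urn:example:jhove-config are assumptions, and whether supplying a proper pair also clears the cvc-elt.1 error has not been verified.

```java
public class SchemaLocationSketch {

    /** True if the value splits into an even, non-zero number of whitespace-separated URIs. */
    static boolean hasCompletePairs(String schemaLocation) {
        String trimmed = schemaLocation.trim();
        if (trimmed.isEmpty()) {
            return false;
        }
        return trimmed.split("\\s+").length % 2 == 0;
    }

    public static void main(String[] args) {
        // Value quoted in the warning message: a single URI.
        String fromWarning = "http://hul.harvard.edu/oi/xml/xsd/jhove/1.3/jhoveConfig.xsd";
        // Hypothetical paired form; the namespace URI here is a placeholder.
        String paired = "urn:example:jhove-config " + fromWarning;

        System.out.println("value from warning has pairs: " + hasCompletePairs(fromWarning));
        System.out.println("paired value has pairs: " + hasCompletePairs(paired));
    }
}
```

If the check fails for the value in your jhove.conf, comparing that attribute against the configuration file shipped with your Jhove release would be a reasonable next step.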
