Configure media filters

From DSpace Wiki

Files:

  • [dspace]/bin/filter-media
  • [dspace]/config/dspace.cfg
  • A scheduling program (e.g. Unix cron or Windows Scheduled Tasks)

Instructions:

  1. In DSpace, “media filters” control both full-text indexing and the automated creation of thumbnail images. Both can be scheduled by calling the filter-media script (which in turn calls the org.dspace.app.mediafilter.MediaFilterManager class).
  2. The list of all currently enabled “media filters” is in your dspace.cfg configuration file, under the section labeled:
    #### Media Filter plugins (through PluginManager) ####
  3. To disable or enable a specific “media filter”, remove it from or add it to the sequence list in that section:
    plugin.sequence.org.dspace.app.mediafilter.MediaFilter = \
        org.dspace.app.mediafilter.PDFFilter, org.dspace.app.mediafilter.HTMLFilter, \
        org.dspace.app.mediafilter.WordFilter, org.dspace.app.mediafilter.JPEGFilter
     (When the list spans several lines, make sure that every line except the last ends with a backslash \ character. A list of what each “media filter” does is available in the Notes section.)
  4. On Linux or Mac OS X, you can schedule the filter-media shell script by adding a cron entry similar to the following to the crontab of the user who installed DSpace:
    0 2 * * * [dspace]/bin/filter-media
     (The above entry schedules filter-media to run nightly at 2am. Replace [dspace] with the full path of your DSpace installation directory.)
  5. On Windows, the filter-media shell script cannot be used. Instead, use Windows Scheduled Tasks to schedule the following command to run at the appropriate time of day:
    [dspace]/bin/dsrun.bat org.dspace.app.mediafilter.MediaFilterManager
     (The above command should appear on a single line.)
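
Before editing the sequence list, it can help to see which filters are currently enabled. The following is a minimal sketch, not part of DSpace: the function name list_media_filters is hypothetical, and it assumes the plugin.sequence property is written as described above, optionally continued across lines with trailing backslashes.

```shell
# Print the media filters enabled in a dspace.cfg-style file, one per line.
# list_media_filters is a hypothetical helper, not a DSpace command.
# $1 = path to the configuration file
list_media_filters() {
    awk '
        # Start of the property: drop everything up to and including "=".
        /^[ \t]*plugin\.sequence\.org\.dspace\.app\.mediafilter\.MediaFilter[ \t]*=/ {
            in_prop = 1
            sub(/^[^=]*=/, "")
        }
        in_prop {
            line = $0
            # A trailing backslash means the value continues on the next line.
            more = (line ~ /\\[ \t]*$/)
            sub(/\\[ \t]*$/, "", line)
            n = split(line, parts, ",")
            for (i = 1; i <= n; i++) {
                gsub(/^[ \t]+|[ \t]+$/, "", parts[i])
                if (parts[i] != "") print parts[i]
            }
            if (!more) in_prop = 0
        }
    ' "$1"
}
```

For example, list_media_filters [dspace]/config/dspace.cfg (with [dspace] replaced by your installation path) prints one class name per line. Lines commented out with # are ignored because they do not start with the property name.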

Notes:

  • The currently available media filters, and what each one does:
    • HTMLFilter – extracts the full text of HTML documents for full text indexing
    • JPEGFilter – creates thumbnail images of GIF, JPEG and PNG files
    • BrandedPreviewJPEGFilter – creates a branded preview image for GIF, JPEG and PNG files (disabled by default)
    • PDFFilter – extracts the full text of Adobe PDF documents (only if text-based or OCRed) for full text indexing
    • WordFilter – extracts the full text of Microsoft Word or Plain Text documents for full text indexing
  • By default, both filter-media and MediaFilterManager update the DSpace search index automatically when they finish (see Re-index DSpace). This is the recommended way to run them. If you wish to disable the index update, pass the -n flag to either one.
  • The following options are available for both filter-media and MediaFilterManager:
    • -v = verbose mode (prints out all extracted text and additional messages)
    • -f = forces reprocessing of all bitstreams (by default only unprocessed bitstreams are processed by this script)
    • -n = do not update the search index after completion
    • -i <handle> = only process bitstreams within a particular community/collection/item represented by the <handle>
    • -m <#> = only process up to a maximum number of bitstreams (specified by <#>). This is useful if you want to process bitstreams a few at a time to avoid overloading your server.
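
The -m and -i options above can be combined to work through a large repository in small batches. The sketch below only builds and prints the command line so that it can be checked before it is scheduled. The function name filter_media_cmd, the default path /dspace and the handle 123456789/42 are assumptions made for illustration; only the flags come from the list above.

```shell
# Hypothetical install path; replace with your own [dspace] directory.
DSPACE_DIR="${DSPACE_DIR:-/dspace}"

# Build (but do not run) a filter-media command line for a limited batch.
# $1 = maximum number of bitstreams to process (-m)
# $2 = optional handle of a community/collection/item (-i)
filter_media_cmd() {
    max="$1"
    handle="${2:-}"
    cmd="$DSPACE_DIR/bin/filter-media -m $max"
    if [ -n "$handle" ]; then
        cmd="$cmd -i $handle"
    fi
    printf '%s\n' "$cmd"
}

# Print a command for at most 500 bitstreams across the whole repository.
filter_media_cmd 500
# Print a command for at most 500 bitstreams under one (made-up) handle.
filter_media_cmd 500 123456789/42
```

Printing the command instead of running it keeps the sketch safe to try on any machine. Once the output looks right, the printed line can be pasted into a crontab entry or run directly.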