Configure media filters
From DSpace Wiki
Files:
- [dspace]/bin/filter-media
- [dspace]/config/dspace.cfg
- A scheduling program (e.g. Unix cron or Windows Scheduled Tasks)
Instructions:
- In DSpace, “media filters” control both full-text indexing and the automated creation of thumbnail images. Both can be scheduled by calling the filter-media script (which in turn calls the org.dspace.app.mediafilter.MediaFilterManager class).
- The list of all currently enabled “media filters” is available in your dspace.cfg configuration file under the section labeled:
#### Media Filter plugins (through PluginManager) ####
- If you wish to disable or enable a specific “media filter”, you can remove it from, or add it to, the sequence list in that section:
plugin.sequence.org.dspace.app.mediafilter.MediaFilter = \
    org.dspace.app.mediafilter.PDFFilter, \
    org.dspace.app.mediafilter.HTMLFilter, \
    org.dspace.app.mediafilter.WordFilter, \
    org.dspace.app.mediafilter.JPEGFilter
- (Make sure that every line of the list that is continued on the next line ends with a backslash \ character! A list of what each “media filter” does is available in the Notes section.)
- For Linux or Mac OS X, you can schedule the filter-media shell script to run by adding a cron entry similar to the following to the crontab for the user who installed DSpace:
0 2 * * * [dspace]/bin/filter-media
- (The above entry would schedule filter-media to run nightly at 2am. You would need to change [dspace] to the full path of your DSpace installation directory.)
- For Windows, you will be unable to use the filter-media shell script. Instead, you should use Windows Scheduled Tasks to schedule the following command to run at the appropriate time of day:
[dspace]/bin/dsrun.bat org.dspace.app.mediafilter.MediaFilterManager
- (The above command should appear on a single line.)
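The cron entry above calls filter-media directly. If a run can take longer than the interval between scheduled runs, a small wrapper can skip a run while the previous one is still active and keep a log. The following is a minimal sketch, not part of DSpace: the function name run_filter_media, the default /dspace install path, and the log and lock locations are all assumptions to adapt to your system.

```shell
#!/bin/sh
# Hypothetical wrapper around [dspace]/bin/filter-media (not shipped with DSpace).
# All three defaults below are assumptions; override them via the environment.
DSPACE_DIR="${DSPACE_DIR:-/dspace}"
LOG_FILE="${LOG_FILE:-/tmp/filter-media.log}"
LOCK_DIR="${LOCK_DIR:-/tmp/filter-media.lock}"

# Run filter-media once, passing any arguments through.
# Skips the run if a previous run still holds the lock.
run_filter_media() {
    # mkdir either creates the directory or fails, so it works as a simple lock.
    if ! mkdir "$LOCK_DIR" 2>/dev/null; then
        echo "filter-media already running, skipping this run" >> "$LOG_FILE"
        return 0
    fi

    status=0
    "$DSPACE_DIR/bin/filter-media" "$@" >> "$LOG_FILE" 2>&1 || status=$?

    # Release the lock whether or not filter-media succeeded.
    rmdir "$LOCK_DIR"
    return "$status"
}
```

To use it from cron, you could save the function in a script of your own that ends with the line run_filter_media "$@", and point the crontab entry at that script instead of at [dspace]/bin/filter-media.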
Notes:
- Below is a listing of all currently available Media Filters, and what they actually do:
  - HTMLFilter – extracts the full text of HTML documents for full text indexing
  - JPEGFilter – creates thumbnail images of GIF, JPEG and PNG files
  - BrandedPreviewJPEGFilter – creates a branded preview image for GIF, JPEG and PNG files (disabled by default)
  - PDFFilter – extracts the full text of Adobe PDF documents (only if text-based or OCRed) for full text indexing
  - WordFilter – extracts the full text of Microsoft Word or Plain Text documents for full text indexing
- Please note that filter-media or MediaFilterManager will automatically update the DSpace search index by default (see Re-index DSpace). This is the recommended way to run these scripts. Should you wish to disable the index update, you can pass the -n flag to either script.
- The following options are available for either the filter-media or MediaFilterManager scripts:
  - -v = verbose mode (prints out all extracted text and additional messages)
  - -f = forces reprocessing of all bitstreams (by default only unprocessed bitstreams are processed by this script)
  - -n = do not update the search index after completion
  - -i <handle> = only process bitstreams within a particular community/collection/item represented by the <handle>
  - -m <#> = only process a maximum number of bitstreams (specified by <#>). This is useful if you want to process bitstreams little-by-little in order to avoid taxing your server too much.
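As an illustration of processing bitstreams little-by-little with -m, the sketch below runs filter-media in a fixed number of capped batches with a pause in between. It relies on the default behaviour described above (only unprocessed bitstreams are processed), so each run should pick up remaining work. The function name filter_media_batches, the default /dspace path, and the example handle 123456789/2 are assumptions, not part of DSpace.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with DSpace); DSPACE_DIR default is an assumption.
DSPACE_DIR="${DSPACE_DIR:-/dspace}"

# Usage: filter_media_batches BATCH_SIZE BATCHES PAUSE_SECONDS [extra filter-media options]
filter_media_batches() {
    if [ "$#" -lt 3 ]; then
        echo "usage: filter_media_batches BATCH_SIZE BATCHES PAUSE_SECONDS [options]" >&2
        return 2
    fi
    batch_size=$1
    batches=$2
    pause=$3
    shift 3

    i=0
    while [ "$i" -lt "$batches" ]; do
        # -m caps the number of bitstreams handled in this run;
        # any extra options (e.g. -i <handle>) are passed through unchanged.
        "$DSPACE_DIR/bin/filter-media" -m "$batch_size" "$@" || return 1
        i=$((i + 1))
        # Pause between batches, but not after the last one.
        if [ "$i" -lt "$batches" ]; then
            sleep "$pause"
        fi
    done
}
```

For example, filter_media_batches 500 4 60 -i 123456789/2 would run four batches of at most 500 bitstreams each within that (made-up) handle, waiting 60 seconds between batches.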
