Talk:About Data Formats

From DSpace Wiki


Jodi A. Schneider

Lots of good detailed info here. The only improvement I would suggest: please provide more detailed citations for external links, so that they can be tracked down later if they move. Jodi.a.schneider 08:18, 28 June 2007 (EDT)

Robert Tansley

No argument that formats need revamping; in particular the 'FormatIdentifier' based on extensions was supposed to have been ditched within weeks of 1.0 coming out :-) However, I don't think the (only) problem with the format model is the lack of linkage to external registries or granularity. BitstreamFormats can be more granular than MIME types and were intended to be from the start, which is why MIME type isn't a key; the shortfalls of MIME types were recognised from day one. The reason for the lack of granularity is that no one has updated the FormatIdentifier.

There is a further problem: it is assumed that each file has one format. This often is not the case; many are 'container' formats (e.g. zip, tar.gz, .avi etc) where properties of constituent files/parts (compressed files, stream encoding) may be needed. Many require further representation information -- e.g. how useful is it to know that a file is 7-bit ASCII? XML? This means that any 'put in file, get out format ID' approach is going to have limitations.

Further, there are 'multi-file' formats, though perhaps that's a higher level of representation information than you're talking about here. (We haven't even started on how to deal with a .exe file, or even a .java file!)

Re files with "multiple" formats: What is the use case for identifying formats within a container like Zip or tar? Containers are antithetical to preservation, anyway, so anyone who cares about the actual files (as opposed to the container as a whole, e.g. a JAR library is most useful as a unit) is better off unpacking it before or as part of submission. I'd say the same for wrapper formats that apply compression, encryption, and metadata to one file -- treat them as packages, perhaps, filing the metadata as DSpace metadata and unwrapping the content. LarryStone 18:45, 28 June 2007 (EDT)
RT: That doesn't sound like a decision that should be made at a platform level; uncompressing every package file could cause some objects to stop working. Some JARs contain resources (e.g. images, data etc) which are useful in their own right but need to be in the JAR to function. Some users might not *know* the file is a compressed format (.msi, .rpm, .pak etc.). The other example I gave was .avi files, which will have multiple streams with different encodings, and it clearly won't scale to have a separate registry entry for each permutation. PREMIS did some work on the 'onion model' talking about this stuff (which was the part of PREMIS I was involved in). I don't think containers can be ignored, and 'you must uncompress all containers' sounds unreasonable -- though that doesn't necessarily mean you need to tackle it head-on in the first iteration! And it sounds like we're in violent agreement on all the other stuff!
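
To make the container point concrete, here is a minimal sketch of describing a file as nested layers rather than as a single format ID, without unpacking it in storage. It is written in Python only for brevity (DSpace itself is Java), it assumes ZIP is the only container recognised, and `FormatNode`, `describe` and `zip_bytes` are hypothetical names for illustration, not DSpace or PREMIS APIs.

```python
import io
import zipfile
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FormatNode:
    """One layer of a layered format description (hypothetical structure)."""
    name: str
    format_name: str
    parts: List["FormatNode"] = field(default_factory=list)


def describe(name: str, data: bytes) -> FormatNode:
    """Describe a file; if it is a ZIP container, describe its members too."""
    buffer = io.BytesIO(data)
    if zipfile.is_zipfile(buffer):
        node = FormatNode(name, "ZIP container")
        with zipfile.ZipFile(buffer) as archive:
            for info in archive.infolist():
                if info.is_dir():
                    continue  # directories carry no content of their own
                node.parts.append(describe(info.filename, archive.read(info)))
        return node
    # Assumption: anything that is not a ZIP is left for other identifiers.
    return FormatNode(name, "unidentified")


def zip_bytes(members: Dict[str, bytes]) -> bytes:
    """Demo helper: build an in-memory ZIP from a name -> content mapping."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for member_name, content in members.items():
            archive.writestr(member_name, content)
    return buffer.getvalue()
```

The stored object stays intact as a container; only the description is recursive, so a JAR that must remain packed to function can still have its constituent parts recorded.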

Another comment: the means of identifying formats, and particularly the granularity (and 'depth') to which we are able to identify them, will change over time. It seems reasonable to assume that our ability to do this will improve.

Hence I think we should approach this from a service-oriented point of view: i.e. that we think about building services that can identify formats and technical format properties to the best current knowledge, and that for a given file what we know about it may change and improve over time.

You've anticipated my design for the renovated BSFs. There will be a separate service to handle identification, since it is a distinct problem as you say. It'll actually be implemented as a stack of possibly cooperating methods since there is no one solution and it has to be easily customized and extended. LarryStone 18:45, 28 June 2007 (EDT)
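
A minimal sketch of what a "stack of cooperating methods" could look like, in Python for brevity (DSpace itself is Java). Everything here is an assumption made for illustration: the names `Identification`, `by_magic`, `by_extension` and `identify`, the format labels, and the confidence numbers are hypothetical and not part of DSpace.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Identification:
    format_name: str   # hypothetical label, not a real registry key
    confidence: float  # 0.0 (pure guess) to 1.0 (certain); scale is assumed
    method: str        # which identifier produced this answer


# An identifier takes (filename, leading bytes) and may decline by returning None.
Identifier = Callable[[str, bytes], Optional[Identification]]


def by_magic(filename: str, data: bytes) -> Optional[Identification]:
    """Identify by well-known leading byte signatures."""
    if data.startswith(b"%PDF-"):
        return Identification("PDF", 0.9, "magic")
    if data.startswith(b"PK\x03\x04"):
        return Identification("ZIP container", 0.8, "magic")
    return None


def by_extension(filename: str, data: bytes) -> Optional[Identification]:
    """Fallback: guess from the filename extension, with low confidence."""
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    table = {"pdf": "PDF", "zip": "ZIP container", "txt": "Plain text"}
    if ext in table:
        return Identification(table[ext], 0.3, "extension")
    return None


def identify(filename: str, data: bytes,
             stack: List[Identifier]) -> Optional[Identification]:
    """Ask every method in the stack and keep the most confident answer."""
    best: Optional[Identification] = None
    for method in stack:
        result = method(filename, data)
        if result is not None and (best is None or result.confidence > best.confidence):
            best = result
    return best
```

New methods are added by appending a function to the stack, so a site can customise identification without touching the callers. Picking the highest confidence is only one possible way for the methods to cooperate.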

Thus I think we should consider format information essentially as a cache of our best knowledge at a given moment in time; the latest format sniffer service is 'authoritative' (will give the best answer).

I also think that worrying too much about capturing and storing exactly which format a particular bitstream is on ingest is a red herring, as our ability to automatically identify formats will only improve over time. The only reason to do this is to identify apparently 'corrupted' files and to indicate to the depositor how robust to changes in technology the file they're depositing is considered.

All true; one of my anticipated use cases is re-identifying or improving the identification of Bitstreams. The Bitstream will include a measure of confidence or "quality" of the identification, to make it easy to judge improvements. LarryStone 18:45, 28 June 2007 (EDT)
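
As a rough illustration of treating stored format information as a cache that later identification runs can improve, here is a minimal Python sketch (DSpace itself is Java). The names `CachedFormat` and `FormatCache`, the `identifier_version` field, and the rule "replace only when confidence is strictly higher" are all assumptions made for the example, not an existing DSpace design.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class CachedFormat:
    format_name: str         # hypothetical label
    confidence: float        # assumed 0.0 to 1.0 "quality" of the identification
    identifier_version: str  # which identification run produced the entry


class FormatCache:
    """Holds the best-known identification for each bitstream ID."""

    def __init__(self) -> None:
        self._entries: Dict[str, CachedFormat] = {}

    def get(self, bitstream_id: str) -> Optional[CachedFormat]:
        return self._entries.get(bitstream_id)

    def offer(self, bitstream_id: str, candidate: CachedFormat) -> bool:
        """Store the candidate only if it improves on what is cached.

        Returns True when the cache entry was created or replaced.
        """
        current = self._entries.get(bitstream_id)
        if current is None or candidate.confidence > current.confidence:
            self._entries[bitstream_id] = candidate
            return True
        return False
```

A periodic job could re-run the current identifiers over old bitstreams and call `offer` with each result, so identifications made at ingest are never treated as final.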

In practical terms I guess this doesn't affect your work too much (it certainly doesn't invalidate it in any way) other than to suggest that ways to integrate and provide format identification services are the most important factor here (more important than the way formats are stored).

User:RobertTansley
