Posted by: Ian | January 7, 2011

When a good idea turns bad

When I started on the Broker side of the OA-RJ work, I had the idea that we should package up the data to be transferred in a common standard, something that most repositories already understood. This would lead to an easy win for repository managers, meaning it would be relatively easy for repositories to join a Brokerage service… and thus we (the OA-RJ people) could see a healthy interest in our service.

The principle is sound.

The practice is that there is actually no common packaging system that works and is understood by both EPrints & DuraSpace (let alone the dozen other players in the Repository market!).

I had high hopes for METS: One can create a .zip with a manifest in a METS+EPDCX format that both EPrints & DSpace can import (yay!) but then one runs into a number of difficulties (boo!).

METS

The first issue I considered was the size of the zip file being passed. Given that the Broker is potentially transferring a large number of items (Nature Publishing Group alone has over 100 journals, and Funding councils fund thousands of grants annually) to a large number of repositories (my record so far is 147 authors in one journal article), it makes sense to investigate “pass by reference” rather than “pass by content” for the binary objects. If one can pass a URL for each of the files, and rely on the importer at the far end to collect those files when it wants them, one can save a lot of data being sent, and this should speed up the “posting” part of the Broker’s transfer. Neither EPrints nor DSpace supports “pass by reference” out of the box; however, it was not hugely difficult to write an EPrints SWORD importer that handled this (EPrints can already do the “get remote content” part).
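
To make the “pass by reference” idea concrete, here is a minimal Python sketch of the importer side. It is not the EPrints SWORD importer mentioned above: the manifest fragment, the broker.example.org URLs, and the function names remote_files and collect are all assumptions made up for illustration. It only shows the principle that the package carries URLs and the importer decides when to collect the content.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"

# Hypothetical manifest fragment: each FLocat holds a URL instead of a path
# inside the .zip, so the package itself carries no binary content.
MANIFEST = f"""
<mets xmlns="{METS_NS}" xmlns:xlink="{XLINK_NS}">
  <fileSec>
    <fileGrp>
      <file ID="eprint-160-document-59-0">
        <FLocat LOCTYPE="URL"
                xlink:href="http://broker.example.org/files/59/Spectator_safety.gif" />
      </file>
      <file ID="eprint-160-document-60-0">
        <FLocat LOCTYPE="URL"
                xlink:href="http://broker.example.org/files/60/Broker_imported.xml" />
      </file>
    </fileGrp>
  </fileSec>
</mets>
"""


def remote_files(manifest_xml):
    """Return {file ID: URL} for every file that is passed by reference."""
    root = ET.fromstring(manifest_xml)
    refs = {}
    for file_el in root.iter(f"{{{METS_NS}}}file"):
        flocat = file_el.find(f"{{{METS_NS}}}FLocat")
        if flocat is None:
            continue
        href = flocat.get(f"{{{XLINK_NS}}}href")
        if href and href.startswith(("http://", "https://")):
            refs[file_el.get("ID")] = href
    return refs


def collect(refs, fetch):
    """Collect each referenced file with the caller-supplied fetch(url)."""
    return {file_id: fetch(url) for file_id, url in refs.items()}


refs = remote_files(MANIFEST)
# A stand-in fetcher so the sketch runs without a network; a real importer
# would download the URL here, whenever it chooses to.
fetched = collect(refs, lambda url: url.rsplit("/", 1)[-1].encode())
```

Passing the fetcher in as a parameter keeps the "when to collect" decision with the importing repository, which is the point of passing by reference.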

This led to another interesting wee twist – EPrints supports the idea of items having multiple documents associated with them, and documents which can have multiple files (think of a web page: the (x)html page you see, plus supporting images, .css files, and possibly <!--#include> files that are used to render the whole). Again, not a huge issue: METS has a <fileSec> section that lists all the files & where to get them, and a <structMap> that defines the relationship between the files. The next task is to ensure that the “main file” for a given document is identified – which we do in the <structMap> area:

<!-- Most attributes removed for clarity -->
<structMap>
 <div>
  <div>
   <fptr FILEID="eprint-160-document-59-0" />
  </div>
  <div>
   <fptr FILEID="eprint-160-document-60-0" />
  </div>
  <div>
   <seq>
    <fptr FILEID="eprint-160-document-98-1" />
    <fptr FILEID="eprint-160-document-98-0" />
    <fptr FILEID="eprint-160-document-98-2" />
    <fptr FILEID="eprint-160-document-98-3" />
   </seq>
  </div>
 </div>
</structMap>

The three documents are contained within three <div> elements, and each file is referenced by the FILEID attribute (which links to an ID for the specific file within the <fileSec> area).

The first two documents are easy; however, the third one, the one with four files, is a tad more complex: we need to identify the “main file”, and I chose to also indicate to the receiving importer that there is going to be a set of files, and to number the files sequentially. The importer can see this, and act accordingly. To this end, the files are defined as a <seq>-uence, and the position in that sequence indicates the order, with the first file being the “main file”.
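
The "first file is the main file" convention can be sketched in a few lines of Python. This is an illustration only, not the actual EPrints importer: the function name documents is hypothetical, and the snippet uses the trimmed, namespace-free <structMap> shown above (a real METS file would need the METS namespace in the lookups).

```python
import xml.etree.ElementTree as ET

# The trimmed structMap from the post, attributes and namespaces removed.
STRUCT_MAP = """
<structMap>
 <div>
  <div>
   <fptr FILEID="eprint-160-document-59-0" />
  </div>
  <div>
   <fptr FILEID="eprint-160-document-60-0" />
  </div>
  <div>
   <seq>
    <fptr FILEID="eprint-160-document-98-1" />
    <fptr FILEID="eprint-160-document-98-0" />
    <fptr FILEID="eprint-160-document-98-2" />
    <fptr FILEID="eprint-160-document-98-3" />
   </seq>
  </div>
 </div>
</structMap>
"""


def documents(struct_map_xml):
    """Return one entry per document: its main file ID and the rest, in order."""
    root = ET.fromstring(struct_map_xml)
    outer = root.find("div")  # the wrapper <div> around all documents
    docs = []
    for doc_div in outer.findall("div"):
        # iter() walks in document order, so a <seq> keeps its sequence and
        # a lone <fptr> is simply a one-file document.
        file_ids = [fptr.get("FILEID") for fptr in doc_div.iter("fptr")]
        if file_ids:
            docs.append({"main": file_ids[0], "others": file_ids[1:]})
    return docs


docs = documents(STRUCT_MAP)
```

For the third document this gives eprint-160-document-98-1 as the main file, because it is listed first in the <seq>, regardless of the numeric suffix in its ID.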

The next poser to address came from a discussion with the JorumOpen people: it is eminently possible to have a deposit whose contents actually include a file called ‘mets.xml’. This is a problem as the METS importers for DSpace & EPrints assume that all files in the .zip package are at the top level… there is no concept of subdirectories or folders… and obviously one cannot have two files called ‘mets.xml’ in the same directory! Simple to solve: put the files for each document in a sub-directory.

[Figure: a directory tree of files]

This directory structure actually ties in quite neatly with the multiple documents structure of EPrints (which, as it happens, also allows one to upload multiple documents, where files can have the same name, so long as they are in different documents!): The files for each document are placed in a separate sub-directory within the .zip file, and the <fileSec> lists the full filename for each file:

<!-- Most attributes removed for clarity -->
<fileSec>
 <fileGrp>
  <file ID="eprint-160-document-59-0">
   <FLocat xlink:href="f1/Spectator_safety.gif" />
  </file>
  <file ID="eprint-160-document-60-0">
   <FLocat xlink:href="f2/Broker_imported.xml" />
  </file>
  <file ID="eprint-160-document-98-1">
   <FLocat xlink:href="f3/pdf3.pdf" />
  </file>
  <file ID="eprint-160-document-98-0">
   <FLocat xlink:href="f3/mets.xml" />
  </file>
  <file ID="eprint-160-document-98-2">
   <FLocat xlink:href="f3/pdf2.pdf" />
  </file>
  <file ID="eprint-160-document-98-3">
   <FLocat xlink:href="f3/pdf1.pdf" />
  </file>
 </fileGrp>
</fileSec>

Again, EPrints doesn’t support this hierarchical structure when decoding .zip files (it assumes all files are at the top level), however it wasn’t a massive job to make the changes to the code to support it.
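
As a rough illustration of the sub-directory layout, the Python sketch below builds a package in memory and reads it back grouped by document. It is not the change made to the EPrints code: the helper names build_package and group_by_document and the file contents are assumptions for the example. It does show that a deposited ‘mets.xml’ can sit alongside the manifest once each document has its own sub-directory.

```python
import io
import zipfile
from collections import defaultdict


def build_package(manifest, documents):
    """Zip a top-level manifest plus one sub-directory per document.

    documents maps a sub-directory name to {filename: bytes}.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("mets.xml", manifest)
        for subdir, files in documents.items():
            for name, data in files.items():
                zf.writestr(f"{subdir}/{name}", data)
    return buf.getvalue()


def group_by_document(package_bytes):
    """Return {sub-directory: [filenames]}, skipping top-level entries."""
    groups = defaultdict(list)
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as zf:
        for path in zf.namelist():
            subdir, sep, name = path.partition("/")
            if sep and name:  # only entries that live inside a sub-directory
                groups[subdir].append(name)
    return dict(groups)


# Placeholder contents; the paths mirror the <fileSec> listing above.
package = build_package(
    "<mets/>",
    {
        "f1": {"Spectator_safety.gif": b"gif bytes"},
        "f2": {"Broker_imported.xml": b"<xml/>"},
        "f3": {"pdf3.pdf": b"pdf bytes", "mets.xml": b"<deposited/>"},
    },
)
grouped = group_by_document(package)
```

The manifest stays at the top level as ‘mets.xml’, while the deposited file of the same name lives at ‘f3/mets.xml’, so the two never collide.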

So – given that we can now support multiple files per document and the same filename appearing in different documents, the only issues left are:

  1. Do we revert to “pass by content”, and accept the time & bandwidth costs in exchange for the benefits of an easier importer?
  2. Writing importers for the various Repository platforms (such that there is a low cost of installation)

EPDCX

The final problem to address is the descriptive metadata within the METS package. EPDCX is a solution that works in simple cases; however, it has failings: it is missing data that is probably useful for the importing repository.

Examples of missing data include such obvious things as the Journal name, Volume & Issue numbers, and page-range details. Bizarrely, the Eprint Application Profile assumes these are part of the Bibliographic Citation, however Citations come in a number of formats and are notoriously difficult to un-pick.

It also misses the ability to define the date an item was published, and a date the item is embargoed until.

Finally, it could be useful to have email and institutional affiliations for each author included in the metadata.

The OA-RJ Broker will pass a record that conforms to the EPDCX standard and adds some extra fields in its own namespace, meaning a standard EPDCX importer will be able to interpret the bulk of the data unimpeded.
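
The "extra fields in a separate namespace" approach can be illustrated with a small Python sketch. Everything in it is an assumption for the example: the two example.org namespaces are placeholders (not the real EPDCX namespace or the Broker's actual one), the flat record layout is far simpler than real EPDCX, and the field values are made up. The point is only that an importer which knows just the standard namespace can skip the extensions and still read the rest.

```python
import xml.etree.ElementTree as ET

STD_NS = "http://example.org/standard-profile/"   # placeholder, NOT real EPDCX
EXT_NS = "http://example.org/broker-extensions/"  # hypothetical Broker namespace

# Simplified, made-up record: standard fields plus extension fields.
RECORD = f"""
<record xmlns:std="{STD_NS}" xmlns:ext="{EXT_NS}">
  <std:title>An example article</std:title>
  <std:creator>A. N. Author</std:creator>
  <ext:journal>An Example Journal</ext:journal>
  <ext:volume>12</ext:volume>
  <ext:embargoUntil>2011-07-07</ext:embargoUntil>
</record>
"""


def read_fields(record_xml, known_namespaces):
    """Return {field name: text} for children in a namespace the importer knows."""
    root = ET.fromstring(record_xml)
    fields = {}
    for child in root:
        # ElementTree tags look like "{namespace}localname".
        namespace, _, local = child.tag[1:].partition("}")
        if namespace in known_namespaces:
            fields[local] = child.text
    return fields


# A standard importer only knows the standard namespace...
standard_view = read_fields(RECORD, {STD_NS})
# ...while a Broker-aware importer also picks up the extra fields.
extended_view = read_fields(RECORD, {STD_NS, EXT_NS})
```

The standard view contains only the title and creator, and the extended view adds the journal, volume and embargo date without the standard importer needing any change.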

