Posted by: mmewisse | March 11, 2015

New blog for the Jisc Publications Router

The latest phase of the projects documented in this blog has moved to a new blog.


Our new blog will be used to outline the developments and benefits of the Jisc Publications Router service. It begins with an introductory post that includes links to the service page and information on interacting with the Router.

The Publications Router is a free-to-use, standalone middleware tool that automates the delivery of research publications from data suppliers to institutional repositories. The Router extracts authors’ affiliations from the metadata provided to determine appropriate target repositories, then transfers the publications to those repositories registered with the service. In the increasingly collaborative world of research publications, a single research output would otherwise have to be recorded separately by each institution involved; the Router removes this duplication of effort. It is intended to minimise effort on the part of potential depositors while maximising the distribution and exposure of research outputs.

The Router has its origins in the Open Access Repository Junction project. A brief recap of the various stages of evolution can be found in a post on the history of the project.

If you wish to find out more about the service the Router offers please see the about page.

Posted by: Ian | October 10, 2013

EPrints importers made publicly available

We are really pleased to say that RJ Broker now has importers available for both EPrints 3.2 and 3.3.

The EPrints 3.2 code is essentially unchanged; however, it is now publicly available on GitHub (https://github.com/edina/RJ_Broker_Importer_3.2) and has been proven to work in a number of installations.

The EPrints 3.3 plugin is available on GitHub (https://github.com/edina/RJ_Broker_epm), and it can also be installed directly from the central EPrints Bazaar (http://bazaar.eprints.org/332/). This code has been tested in EPrints 3.3.5 and 3.3.12 (and I believe the patch required to correct the returned atom:id will be applied in EPrints 3.3.13).

The two code-bases are almost identical, and we would be delighted to receive feedback/improvements on either.

Posted by: Ian | July 3, 2013

Setting up redundancy in a live service

At EDINA, we strive for resilience in the services we run. We may not be competing with the massive international server-farms run by the likes of Google, Facebook, Amazon and eBay… however, we try to have systems that cope when things become unavailable due to network outages, power cuts, or even hardware failures: the services continue and the user may not even notice!

For read-only services, such as ORI, this is relatively easy to do: the datasets are effectively static (any updates happen in the background, without any interaction from the user), so two completely separate services can be run in two completely separate data centres, and the systems can switch between them without anyone noticing. Repository Junction Broker, however, presented a whole different challenge.

The basic premise is that there are two instances of the service running – one at data centre A, the other at data centre B – and the user should be able to use either, and not spot any difference.

This first part is easy: the user connects to the public name for the service. The public name is answered by something called a load-balancer, which directs the user to whichever service it perceives as being least loaded.

In a read-only system, we can stop here: we can have two entirely disconnected systems, let them respond independently, and use basic admin commands to update each service as needed.

For RJ Broker, this is not the end of the story: RJB has to take in data from the user, and store that information in a database. Again, a relatively simple solution is available: run the database through the load-balancer, so both services are talking to the same database, and use background admin tools to copy database A to database B in the other data-centre.

Again, RJB needs more: as well as storing information in the database, the service also writes files to disk. Here at EDINA, we make use of Storage Area Network (SAN) systems – which means we mount a chunk of disk space into the file-system, across the network.

The initial solution was to get both services to mount disk space from SAN A (at data centre A) into /home/user/san_A, and then get the application to store the files into this branch of the file-system – meaning the files are instantly visible to both services.

This, however, still has a “single point of failure”: SAN A. What is needed is to replicate the data from SAN A in SAN B, and make SAN B easily available to both installations.

The first part of this is simple enough: mount disk space from SAN B (in data centre B) into /home/user/san_B. We then use an intermediate symbolic link (/home/user/filestore) that points to /home/user/san_A, and get the application to store data in /home/user/filestore/... Now, if there is an outage at data centre A, we simply swap the symbolic link /home/user/filestore to point at /home/user/san_B, and the application is none the wiser.

The only thing that needs to happen is the magic to ensure that all the data written to SAN A is duplicated in SAN B (and database A is duplicated into database B).
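To make that concrete, here is a minimal sketch of the symbolic-link arrangement (the paths follow the /home/user/san_A and /home/user/san_B examples above; the SAN mounting and the replication “magic” are handled separately):

  # Normal operation: the application writes via /home/user/filestore,
  # which points at the SAN A mount
  ln -sfn /home/user/san_A /home/user/filestore

  # Data centre A goes away: re-point the link at the SAN B mount.
  # The application keeps writing to /home/user/filestore/... and is none the wiser.
  ln -sfn /home/user/san_B /home/user/filestore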

On Wednesday 29th May 2013, Muriel Mewissen presented a Webinar on the Repository Junction Broker (RJ Broker) for the Repository Support Project (RSP).

This presentation discusses:

  • the need for a broker to automate the delivery of research output to Institutional Repositories
  • the development of the middleware tool for RJ Broker
  • the data deposit trials involving a publisher (Nature Publishing Group) and a subject repository (Europe PubMed Central) which have recently taken place
  • what the future holds for the RJ Broker.

A recording of the webinar and the presentation slides are available on the RSP website.

Posted by: Ian | February 14, 2013

Embargoes in real metadata, take 2

Following on from the earlier discussion, we have ruled out the first option (where we add an attribute to METS):

<div ID="sword-mets-div-2" oarj_embargo="2013-05-29">
  <fptr FILEID="eprint-191-document-581-0"/>
</div>

as the METS schema doesn’t allow additional attributes to be added (and the investigation into writing a validating schema with additional attributes was fun in its own right) – so this leaves us with the xmlData-within-the-amdSec solution.

To recap, the amdSec will read something like:

<amdSec ID="sword-mets-adm-1" LABEL="administrative" TYPE="LOGICAL">
  <rightsMD ID="sword-mets-amdRights-1">
    <mdWrap MDTYPE="OTHER" OTHERMDTYPE="RJ-BROKER">
      <xmlData>
        <epdcx:descriptionSet xmlns:epdcx="http://purl.org/eprint/epdcx/2006-11-16/"
                              xsi:schemaLocation="http://purl.org/eprint/epdcx/2006-11-16/
                                                  http://purl.org/eprint/epdcx/xsd/2006-11-16/epdcx.xsd ">
          <epdcx:description epdcx:resourceId="sword-mets-div-3" 
                             epdcx:resourceURI="http://devel.edina.ac.uk:1203/191/">
            <epdcx:statement epdcx:propertyURI="http://purl.org/dc/terms/available"
                             epdcx:valueRef="http://purl.org/eprint/accessRights/ClosedAccess">
              <epdcx:valueString epdcx:sesURI="http://purl.org/dc/terms/W3CDTF">2013-05-29</epdcx:valueString>
            </epdcx:statement>
          </epdcx:description>
          <epdcx:description epdcx:resourceId="sword-mets-div-2"
                             epdcx:resourceURI="http://devel.edina.ac.uk:1203/191/">
            <epdcx:statement epdcx:propertyURI="http://purl.org/dc/terms/available"
                             epdcx:valueRef="http://purl.org/eprint/accessRights/ClosedAccess">
              <epdcx:valueString epdcx:sesURI="http://purl.org/dc/terms/W3CDTF">2013-05-29</epdcx:valueString>
            </epdcx:statement>
          </epdcx:description>
        </epdcx:descriptionSet>
      </xmlData>
    </mdWrap>
  </rightsMD>
</amdSec>

One of the questions I have been asked a few times is “why don’t you put the actual file URL with the embargo date”, and I refer you to the explanation in the original article:

  • A document may actually be composed of multiple files (consider a web page – the .html file is the primary file, however there are additional image files, stylesheet files, and possibly various other files that combine to present the whole document)

In other words, whilst 99% of cases will be a single file for a single document, it’s not always that simple, and I don’t believe the metadata should lead you into a false understanding of what is there – that way, things don’t break when the simple assumption fails.

Posted by: Ian | January 16, 2013

SWORD 1.3 vs SWORD 2

What’s the difference, and how do they compare?

In summary

SWORD 1.3 is a one-off package deposit system: the record is wrapped up in some agreed format, and dropped into the repository. SWORD 1.3 uses an HTTP header to define what that package format is, and the individual repositories use that header to determine how to unpack the record. Every deposit is a new record.

SWORD 2.0 is a CRUD-based (Create, Read, Update, Delete) system, where the emphasis is on being able to manage existing records as well as creating new ones. SWORD 2 uses the URL to identify the record being manipulated, and the mime-type of the object being presented to know what to do with it.

In detail (EPrints-specific)

This is, perforce, EPrints-specific, as I am an EPrints user with no experience coding in DSpace/Fedora/etc.

SWORD 1.3

With a SWORD 1.3 system, one defines a mapping between the X-Packaging header URI and the importer package to handle it:

  $c->{sword}->{supported_packages}->{"http://opendepot.org/broker/1.0"} =
  {
    name => "Open Access Repository Junction Broker",
    plugin => "Sword::Import::Broker_OARJ",
    qvalue => "0.6"
  };

The importer routine then has some internal logic to ensure it only tries to manage records of the right type (XML files, Word Documents, Spreadsheets, Zip files, etc).

In the case of compressed files, it is customary to also indicate the routine to uncompress the file. For example, the same importer could manage .zip, .tar, and .tgz files – which are all variations on a compressed collection of files – with the following mime-types:

application/x-gtar
application/x-tar
application/x-gtar-compressed
application/zip

Therefore our importer would have code like this:

   our %SUPPORTED_MIME_TYPES = ( "application/zip"    => 1, "application/tar"               => 1,
                                 "application/x-gtar" => 1, "application/x-gtar-compressed" => 1,);

   our %UNPACK_MIME_TYPES = ( "application/zip"               => "Sword::Unpack::MyNewZip",
                              "application/tar"               => "Sword::Unpack::MyNewTar",
                              "application/x-gtar"            => "Sword::Unpack::MyNewTar",
                              "application/x-gtar-compressed" => "Sword::Unpack::MyNewTar");

So, a basic SWORD 1.3 deposit is a simple POST request to a defined URL, with a set of headers to manage the deposit, and the record as the body of the request:

  curl -X POST \
       -i \
       -u username:password \
       --data-binary "@myFile.zip" \
       -H 'X-Packaging: http://opendepot.org/broker/1.0' \
       -H 'Content-Type: application/zip'  \
       http://my.repo.url/sword-path/collection

This will deposit the binary file myFile.zip into the collection point in the repository, using the importer identified by the packaging URI http://opendepot.org/broker/1.0.

SWORD 2.0

This is much vaguer, as I’ve not really got a good working example of a SWORD 2 sequence available (the Broker doesn’t do CRUD).

With SWORD 2, the idea is to be able to update existing records, piecemeal:

  • Create a blank record
  • Add some basic metadata (title, authors, etc)
  • Add the rough-draft file
  • Add the post-review article
  • Delete the rough-draft file
  • Add the abstract
  • Add the publication metadata (journal, issue, pages, etc)

With SWORD 2, the routine used to process a request is determined by the mime-type given in the headers.

Within each importer, there is a new function:

sub new {
 my ( $class, %params ) = @_;
 my $self = $class->SUPER::new(%params);
 $self->{name} = "Import RJBroker SWORD (2.0) deposits";

 $self->{visible}   = "all";
 $self->{advertise} = 1;
 $self->{produce}   = [qw( list/eprint dataobj/eprint )];
 # the mime-type(s) this importer claims to handle
 $self->{accept}    = [qw( application/vnd.broker.xml )];
 $self->{actions}   = [qw( unpack )];
 return $self;
}

So, to create a new record, one posts a file with no record id:

  curl -X POST -i \
       -u username:password \
       --data-binary "@MyData.xml" \
       -H 'Content-Type: application/vnd.broker.xml' \
       http://my.repo.url/id/content

This will find the importer that claims to understand ‘application/vnd.broker.xml’, and use it to create a new record. The server response will include URLs for updating the record.

To add a file to a known record:

  curl -X POST -i \
       -u username:password \
       --data-binary "@MyOtherFile.pdf" \
       http://my.repo.url/id/eprints/123

This will use the default application/octet-stream importer, and add the file MyOtherFile.pdf to the record with the id 123.

To add more metadata:

  curl -X POST -i \
       -u username:password \
       --data-binary "@MyData.xml" \
       -H 'Content-Type: application/vnd.broker.xml' \
       http://my.repo.url/id/eprints/123

This will find the importer that claims to understand ‘application/vnd.broker.xml’, and use that code to add the metadata to the record with the id 123.

Note: there is a difference between PUT and POST:

  • POST adds contents to any existing data. Where a field already exists, the action is determined by the importer
  • PUT deletes the data and adds the new information – it replaces the whole record (see the sketch below).
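For example, here is a hedged sketch of the difference, reusing the URLs and mime-type from the examples above (whether a particular repository accepts PUT at this URL depends on its SWORD 2 implementation):

  # POST: add this metadata to record 123, alongside whatever is already there
  curl -X POST -i \
       -u username:password \
       --data-binary "@MyData.xml" \
       -H 'Content-Type: application/vnd.broker.xml' \
       http://my.repo.url/id/eprints/123

  # PUT: replace the whole of record 123 with the contents of MyData.xml
  curl -X PUT -i \
       -u username:password \
       --data-binary "@MyData.xml" \
       -H 'Content-Type: application/vnd.broker.xml' \
       http://my.repo.url/id/eprints/123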

Summary

  • SWORD 1.3 uses the X-Packaging header to determine which importer routine to use, and the importer uses the mime-type to confirm suitability
  • SWORD 2 uses the mime-type to determine which importer routine to use.
  • The URLs for making deposits are different.

Posted by: mmewisse | January 10, 2013

RJ Broker: a Research Output Delivery Service

Back in August 2009, the idea for a system to deposit research output directly into Institutional Repositories (IRs) was formulated (like many other great ideas) on the back of a napkin, and presented in this ‘Basic Premise’ post. Development work on the Open Access Repository Junction finished in March 2011 and was followed a year later by the current project on the Repository Junction Broker (RJ Broker).

Development work has progressed over these last three years and a prototype RJ Broker has been designed, but many questions were raised along the way. With many representatives of the RJ Broker stakeholders attending the 7th International Conference on Open Repositories (OR2012) in Edinburgh in July 2012, we decided to take advantage of the occasion to refine our vision with direct input from them. An evening workshop was organised on 9th July, and representatives of our stakeholders – IR managers, funders, publishers, and IR software and service developers from the UK, Europe, the US and Australia – were invited to take part. A summary of this brainstorming is presented here.

RJ Broker: A delivery service for research output

The RJ Broker Team first set the scene with a short presentation on the current and intended RJ Broker functionality.

The RJ Broker is in effect a delivery service for research output. It accepts deposits from data providers (institutional and subject repositories, funder and publisher systems). For each deposit, it uses the deposit’s metadata to identify the organisations, and any associated repositories, that are suitable recipients. It then transfers the deposit to those repositories that have registered with the RJ Broker service. The metadata acts as the address card for the deposit, which is the parcel.

In order to receive deposits from the RJ Broker, IRs have to register their SWORD credentials with the service to give the RJ Broker an access point to their systems for data input, just as a letter box lets mail into the house.

At this stage, the focus for the type of deposit (or content of the parcel) is research publications. However, the way the RJ Broker works is independent of the deposit type: it will transfer parcels of any type, big or small. For example, a deposit can be a publication with several supporting data files, just an article, or just the data files. The parcel can even be empty or taped shut. Indeed, in the case of an article published under Gold Open Access, the address card is the only information needed to provide a notification of the availability of that article in the publisher’s system. When an article is subject to an embargo period, a sealed parcel allows the delivery to take place straight away even though it can only be opened later, like presents sent by faraway relatives, placed under the tree to be opened on Christmas day!

The mind map below was used to inform the discussion of all the questions we were seeking to answer.

Metadata: The all-important label

Like the address card on the top of a parcel, the metadata is available for all to see. Indeed, the metadata is always fully open access, regardless of any embargo period imposed on the data itself. The metadata is owned by the person who creates it (author, publisher or IR manager), but there is no copyright on it. The metadata can even be considered to act as an advertisement flyer for the data itself, which benefits its owner (whether author, publisher or IR manager) and explains why owners support open access for metadata.

Standards are generally a good thing, improving quality and facilitating exchange. For example, the use of a funder code field in the metadata would significantly ease reporting on return on investment for the funding agencies. Several metadata standards are currently being developed, for example CERIF, RIOXX, COUNTER and the OpenAIRE Guidelines. The RJ Broker will support these standards, but it is not its duty to ensure these standards are adhered to, or that, for example, all required fields have been entered. In the same way that one expects an address card to provide enough space for the information required to deliver the parcel, one does not expect the postman to fill in a missing house number or any other missing information.

Deposit: What is in that parcel?

The RJ Broker has to assume some responsibility for the object it is trusted with transferring to IRs, mainly the correct identification of appropriate IRs and the subsequent delivery to these IRs. The RJ Broker is also responsible for the safe keeping of the deposit while it is in transit.

Once it has been successfully transferred to the registered IRs, the responsibility of the RJ Broker ends. It may seem tempting to extend the functionality of the RJ Broker to store a copy of every deposit in order to allow later downloads by newly registered IRs, or simply to provide a safety backup. However, this is not the purpose of a delivery service. This would also turn the RJ Broker into a repository that could grow to a massive size. Therefore the RJ Broker will only keep recently transferred deposits for a limited period of time, to allow IRs time to accept and process these deposits. Similarly, the postman is not required to scan each postcard he delivers for future safekeeping, but undelivered items will be returned to the sorting office and held for a while to allow collection.

If none of the identified IRs have registered with the RJ Broker, then no delivery is possible. This still constitutes successful processing of a deposit by the RJ Broker. Future developments could consider transferring such deposits to open repositories like OpenDepot.org, or sending a notification to the IRs advising them to register with the RJ Broker should they wish to receive direct deliveries of research output.

The RJ Broker will transfer every deposit it receives. It does not provide an inspection or validation service, and therefore will not flag an empty, duplicate, incomplete or badly formatted deposit.

Dealing with embargo

The RJ Broker aims to support Open Access (OA) by enabling the dissemination of research output across the UK and beyond. It does not matter for the delivery process whether this OA is gold or green. However, it is important that any embargo period is dealt with appropriately.

For each data provider that requires embargo periods to be respected, a legal agreement between the RJ Broker and that provider will be signed before any of its data is transferred by the RJ Broker. Each IR will in turn have to accept a similar agreement before it can receive data, through the RJ Broker, from providers enforcing an embargo. Data providers have to ensure that embargo periods are correctly noted in the metadata; IRs have to respect any embargo specified in the metadata. The RJ Broker acts as a trusted, enabling technology between both parties, not as a control point: it does not have any responsibility for the enforcement of embargoes. Legal agreements are currently being put in place for the early adopters of the RJ Broker. The hope is that a set of standard agreements can be derived from these to promote take-up and ease the administration process.

Beyond the legal agreement, the RJ Broker will not perform additional checks or require further certification or accreditation from IRs. The aim of the RJ Broker is to disseminate research output widely; it is not its purpose to rate IRs for trust or reliability, which is best left to the appropriate authorities.

Tracking a Deposit

The RJ Broker assigns a tracking ID to each deposit, which enables data suppliers to check on the onward progress of the deposit after it has been successfully delivered to an IR’s SWORD endpoint.

As mentioned previously, the responsibility of the RJ Broker ends once the deposit has been successfully transferred to the registered IRs. Institutions follow different procedures, workflows and timetables when it comes to processing deposits left for inclusion in their repositories. Therefore asserting that a deposit has been successfully ingested by an IR is a complex task which is not part of the RJ Broker’s remit as a delivery service. However, the RJ Broker will provide data suppliers with a send receipt, including a tracking ID, as proof that the deposit has been processed by the RJ Broker. The data supplier can later use this ID to check on the status of the deposit with the IRs to which it has been transferred, i.e. received, queued for processing, accepted, live or rejected.

Keep it simple!

The discussion was very productive: all the topics set out in our mind map were covered, and answers to all questions regarding the functionality of the RJ Broker were agreed. The unanimous conclusion was to keep it simple!


The RJ Broker should aim to be a delivery service only. It will follow a “push only” model. Deposits will be pushed to the RJ Broker by data suppliers and the RJ Broker will push the deposits to the IRs. This enables the RJ Broker to have a streamlined workflow.

Specifically, the RJ Broker will NOT:

  • provide any reporting or statistics
  • filter incoming data
  • improve data or metadata
  • enforce standard compliance
  • be a repository
  • collect (“pull”) data from suppliers

I would like to thank everyone who took part in the workshop and helped us shape the functionality of the future RJ Broker service! Development and trials are ongoing, with a first version of the RJ Broker due for release to UK RepositoryNet+ in spring 2013. Watch this space!

List of Attendees

Tim Brody (University of Southampton, UK), Yvonne Budden (University of Warwick, UK), Thom Bunting (UKOLN, UK), Peter Burnhill (UK RepositoryNet+, UK), Pablo de Castro Martin (UK RepositoryNet+, UK), Andrew Dorward (UK RepositoryNet+, UK), Kathi Fletcher (Shuttleworth Foundation, USA), Robert Hilliker (Columbia University, USA), Richard Jones (Cottage Labs, UK), Stuart Lewis (The University of Edinburgh, UK), John McCaffery (University of Dundee, UK), Paolo Manghi (OpenAIRE, Italy), Muriel Mewissen (RJ Broker, UK), Balviar Notay (JISC, UK), Tara Packer (Nature Publishing Group, USA), Marvin Reimer (Shuttleworth Foundation, USA), Anna Shadbolt (University of Melbourne, Australia), Terry Sloan (UK RepositoryNet+, UK), Elin Strangeland (University of Cambridge, UK), Ian Stuart (RJ Broker, UK), James Toon (The University of Edinburgh, UK), Jin Ying (Rice University, USA)

Posted by: Ian | December 11, 2012

Embargoes in real metadata

One very important function the RJ Broker needs to support is that of embargoes: publishers and other data suppliers are going to be much more willing to be involved in a dissemination programme if they believe their property is not being abused. To be blunt about it: most journals make money from selling copies – if they’re giving articles away for free, who would buy them…?

So the RJ Broker needs to ensure that embargo periods are clearly defined in the data that’s passed on… and that’s why recipients of data from the RJ Broker need to sign an agreement: to assert that they will actually honour any embargo periods for the records they receive.

We know from previous conversations that one cannot embargo metadata, so an embargo only applies to the binary objects attached to the metadata record.

The first question is “Is there a blanket embargo for all files, or can different files have different embargoes?”, and the second question is “Is there a difference between ‘document’ and ‘file’?”

Actually, thinking about it, a blanket embargo can be mimicked by giving every file the same embargo, whereas variable embargoes cannot be (sensibly) implemented using a single field. The distinction between “files” and “documents” comes from the EPrints platform, which has the concept that a document may actually be composed of multiple files (consider a web page: the .html file is the primary file, but there are additional image files, stylesheet files, and other files that combine to present the whole document).

The third question is how to encode this embargo information.

The Broker has already defined its basic metadata file as being a METS file, with the record metadata encoded in epdcx (see SWAP and epdcx).

Looking round the net, there are several formal structures for defining administrative metadata, archival metadata, preservation metadata, etc. – but none seemed to actually define a nice, simple embargo date.

In the end, I have loaded up two options, and we’ll investigate which one makes more sense as things get used.

The easier one, but the one that breaks the METS standard, is to add an attribute to each structure element in the METS file:

<structMap ID="sword-mets-struct-1" LABEL="structure" TYPE="LOGICAL">
  <div ID="sword-mets-div-1" DMDID="sword-mets-dmd-eprint-191" TYPE="SWORD Object">
    <div ID="sword-mets-div-2" oarj_embargo="2013-05-29">
      <fptr FILEID="eprint-191-document-581-0"/>
    </div>
    <div ID="sword-mets-div-3" oarj_embargo="2013-05-29">
      <fptr FILEID="eprint-191-document-582-0"/>
    </div>
  </div>
</structMap>

This has the beauty that the embargo metadata is directly linked to the document it belongs to, and that information is immediately available to any import routines.

The second is to write a more convoluted, but formally correct, structure within the METS Administrative section:

<amdSec ID="sword-mets-adm-1" LABEL="administrative" TYPE="LOGICAL">
  <rightsMD ID="sword-mets-amdRights-1">
    <mdWrap MDTYPE="OTHER" OTHERMDTYPE="RJ-BROKER">
      <xmlData>
        <epdcx:descriptionSet xmlns:epdcx="http://purl.org/eprint/epdcx/2006-11-16/"
                              xsi:schemaLocation="http://purl.org/eprint/epdcx/2006-11-16/
                                                  http://purl.org/eprint/epdcx/xsd/2006-11-16/epdcx.xsd ">
          <epdcx:description epdcx:resourceId="sword-mets-div-3" 
                             epdcx:resourceURI="http://devel.edina.ac.uk:1203/191/">
            <epdcx:statement epdcx:propertyURI="http://purl.org/dc/terms/available"
                             epdcx:valueRef="http://purl.org/eprint/accessRights/ClosedAccess">
              <epdcx:valueString epdcx:sesURI="http://purl.org/dc/terms/W3CDTF">2013-05-29</epdcx:valueString>
            </epdcx:statement>
          </epdcx:description>
          <epdcx:description epdcx:resourceId="sword-mets-div-2"
                             epdcx:resourceURI="http://devel.edina.ac.uk:1203/191/">
            <epdcx:statement epdcx:propertyURI="http://purl.org/dc/terms/available"
                             epdcx:valueRef="http://purl.org/eprint/accessRights/ClosedAccess">
              <epdcx:valueString epdcx:sesURI="http://purl.org/dc/terms/W3CDTF">2013-05-29</epdcx:valueString>
            </epdcx:statement>
          </epdcx:description>
        </epdcx:descriptionSet>
      </xmlData>
    </mdWrap>
  </rightsMD>
</amdSec>

As you can see, this follows the rules for the epdcx structure. This was a deliberate choice: epdcx is already used for the primary metadata, so the importers will already have routines for parsing it.

What will be interesting is which is more usable when it comes to writing importers.

Posted by: Ian | October 18, 2012

“Will Triplestore replace Relational Databases?”

It is not possible to give a definitive answer; however, it is important to look at this technology, which has been causing a stir in the informatics field.

Basically, a triplestore is a purpose-built database for the storage and retrieval of triples (Jack Rusher, Semantic Web Advanced Development for Europe). Rather than comparing the main features of a triplestore against a traditional relational database, we will give a brief overview of how and why we chose, installed, and maintain ours, with a practical example covering not only the installation phase but also the customisation of the graphical interface and some security policies that should be considered for the SPARQL endpoint server.

Choosing the application

The first task, obviously, is choosing the triplestore application.

Following Richard Wallis’ advice (see reference), 4Store was considered a good tool for our needs. There are a number of reasons why we like this application: firstly, as Richard says, it is an open source project and it “comes from an environment where it was the base platform for a successful commercial business, so it should work”. In addition, as their website suggests, 4store’s main strengths are its performance, scalability and stability.

It may not provide many features beyond RDF storage and SPARQL queries, but if you are looking for a scalable, secure, fast and efficient RDF store, then 4store should be on your shortlist. We did investigate other products (such as Virtuoso) but none were as simple and efficient as 4Store.

Hardware platform

At EDINA, we tend to manage our services in hosted systems (similar to the concept that many web-hosting companies use).

After considering the application framework for a SPARQL service, and various options for how triplestores could be used within EDINA, we decided to create an independent host for the 4Store application. This would let us both keep the application independent of other services and evaluate the performance of the triplestore.

It was configured with the following features:

  • 64bit CPU
  • 4 GB of RAM (with the possibility to increase this amount if needed)
  • Linux (64-bit RedHat Enterprise Level 6)

Application installation

At EDINA, we try to install services as a local (non-root) user. This allows us the option of multiple independent services within one host, and reduces the chance that an “exploit” (or cyber break-in) gains significant access to the host’s operating system.

Although some libraries were installed at system level, almost everything was installed at user level. Overall, the 4Store installation was quick and easy to configure: installing the software as a normal user required the installation paths to be specified (i.e. --prefix=~~~), but there were no significant challenges. We did fall foul of the raptor and rasqal libraries (the first provides a set of parsers that generate triples, while the second handles query language syntaxes [such as SPARQL]): there are fundamental differences between v1 and v2 – and we required v2.
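For illustration, a typical user-level build of an autotools-style package looks something like the sketch below (the paths are examples only; the exact configure options for 4store, raptor and rasqal may differ):

  # build and install into the service user's home directory rather than /usr
  ./configure --prefix=$HOME/local
  make
  make install

  # make sure the locally installed binaries and libraries are found first
  export PATH=$HOME/local/bin:$PATH
  export LD_LIBRARY_PATH=$HOME/local/lib:$LD_LIBRARY_PATH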

Configuration and load data

Once the installation is finished, 4Store is ready to store data.

  1. The first operation is to set up a dataset, which (for this example) we will call “ori”:
    $ 4s-backend-setup ori
  2. Then we start the database with the following command:
    $ 4s-backend ori
  3. Next we need to load the triples from the data file. This step could take a while, depending on the system and the amount of data.
    $ 4s-import -v ori -m http://localhost:8001/data/ori /path/to/datafile/file.ttl

    • This command line includes some options useful for the storing process. 4store’s import command produces no output by default unless something goes wrong (ref: 4Store website).
    • Since we would like more verbose feedback on how the import is progressing, or just some reassurance that it’s doing something at all, we add the option “-v”.
    • The option “-m” (or “--model”) defines the model URI: it is, effectively, a namespace. Every time we import data, 4Store defines a model used to map the imported data to a SPARQL graph. By default, the name of your imported model (SPARQL graph) in 4store will be a URI derived from the import filename, but we can use the --model flag to override this with any other URI.

The model definition is important when looking to do data-replacement:
$ 4s-import -v ori -m http://localhost:8001/data/ori /path/to/datafile/file2.ttl

By specifying the same model, we replace all the data present in the “ori” dataset with the information contained in the new file.

Having imported the data, the server is ready to be queried; however, this is only via local (command-line) clients. To be useful, we need to start an HTTP server to access the data store.

4Store includes a simple SPARQL HTTP protocol server, which can answer SPARQL queries using the standard SPARQL HTTP query protocol among other features.
$ 4s-httpd -p 1234 ori
will start a server on port 1234 which responds to queries for the 4Store database named “ori”.

If you have several 4store databases they will need to run on separate ports.
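For example (assuming a second, hypothetical dataset called “another_kb” has been set up and started in the same way as “ori”):

  $ 4s-httpd -p 1234 ori
  $ 4s-httpd -p 1235 another_kb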

HTTP Interface for the SPARQL server

Once the server is running we can see an overview page from a web browser at http://my.host.com:1234/status/
There is a simple HTML query interface at http://my.host.com:1234/test/ and a machine-to-machine (m2m) SPARQL endpoint at http://my.host.com:1234/sparql/

These last two links provide the means to execute queries to retrieve information present in the database. Note that neither interface allows INSERT, DELETE, or LOAD operations, or any other operation which could modify the data in the database.

You can send SPARQL 1.1 Update requests to http://my.host.com:1234/update/, and you can add or remove RDF graphs using a RESTful interface at the URI http://my.host.com:1234/data/
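As a quick check that the endpoint is answering, a query can also be issued from the command line. This is a hedged example, assuming the endpoint follows the standard SPARQL HTTP query protocol (which the 4store server implements):

  # ask for the first ten triples in the store, with the results returned as JSON
  curl -H 'Accept: application/sparql-results+json' \
       --data-urlencode 'query=SELECT * WHERE { ?s ?p ?o } LIMIT 10' \
       http://my.host.com:1234/sparql/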

GUI Interface

The default test interface, whilst usable, is not particularly welcoming.

We wanted to provide a more powerful, and visually more appealing, tool that would allow users to query the SPARQL endpoint without affecting the 4Store performance. From the Developer Days promoted by JISC and Dev8D, we were aware of Dave Challis’ SPARQLfront project. SPARQLfront is a PHP- and JavaScript-based frontend to RDF SPARQL endpoints. It uses a modified version of ARC2 (a nice PHP RDF library) for most of the functionality, Skeleton to provide the basic HTML and stylesheets, and CodeMirror to provide syntax highlighting for SPARQL.

We installed, configured and customized SPARQLfront under the default system Apache2/PHP server: http://my.host.com/endpoint/

The main features of this tool are the syntax highlighting, which helps during query composition, and the option to select different output formats.

Security

EDINA is very conscious of security issues relating to services (as noted, we run services as non-root where possible).

It is desirable to block access to anything that allows modification of the 4Store dataset; however, the 4Store server itself provides no Access Control Lists, and is therefore unable to block update and data connections directly. The simplest way to block these connections is to use a reverse proxy in the Apache2 server to pass on the calls we do want to allow through to the 4Store server, and to completely block direct access to the 4Store server using a firewall.

Thus, we added proxy-pass configuration to the Apache server:

 #--Reverse Proxy--
 ProxyPass /sparql http://localhost:1234/sparql
 ProxyPassReverse /sparql http://localhost:1234/sparql
 ProxyPass /status http://localhost:1234/status
 ProxyPassReverse /status http://localhost:1234/status
 ProxyPass /test http://localhost:1234/test
 ProxyPassReverse /test http://localhost:1234/test

 

Note: do NOT enable ProxyRequests or ProxyVia, as this opens your server up as a “forwarding” proxy for all and sundry to use your host to access anything on the web! (See the Apache documentation.)

We then use the firewall to ensure that all ports we are not using are blocked, whilst allowing localhost to connect to port 1234:

 # allow localhost to do everything
 iptables -A INPUT -i lo -j ACCEPT
 iptables -A OUTPUT -o lo -j ACCEPT
 # allow connections to port 80 (http) & port 22 (ssh)
 iptables -A INPUT -p tcp -m multiport -d 12.23.34.45 --destination-ports 22,80 -m state --state NEW -j ACCEPT

 

Your existing firewall may already have a more complex configuration, or may be configured using a GUI. Either way, you need to allow localhost to connect to port 1234, and block everything else from most things (especially port 1234!).


Cesare

Posted by: Ian | September 17, 2012

We’re live!!

Yes folks, after a long gestation, name changes, and a jiggle of personnel… we are now LIVE!!

http://ori.edina.ac.uk

Full public documentation is coming soon – however, there is some bare-bones stuff on the website, and some example clients are available at http://lucas.ucs.ed.ac.uk/ori/

 
