Posted by: Ian | October 5, 2010

The Broker process, take 3

Working out how the broker actually flows is one of the critical pieces of work for the project, and we’ve been through several designs & concepts over the past few months.

The latest scheme seems to be stable, and scalable 🙂

The whole process is in three parts:

  1. Ingest
  2. Transfer
  3. Confirm

Ingest

The Ingest process is when the item comes into the broker, and where we work out which repositories to send it to.

In essence, this is a loop over each author listed in the article, where we:

  • Identify the organisation associated with that author
    • Identify the [appropriate] repository for that organisation
      • Note whether there is a SWORD endpoint for that repository.
      • If we have cascaded down to here, then:
        • we can increment a Transferable field for every Organisation/Repository combination
        • We set the Transferable flag for the appropriate author to be true
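The loop above could be sketched roughly as follows. This is only an illustration in Python, not the Broker's actual code: the names `TargetRow`, `Item`, `ingest`, and the three lookup arguments (`author_org`, `org_repo`, `sword_repos`) are all made up for the example, and it assumes the Transferable counter goes up once per Organisation/Repository row rather than once per author.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TargetRow:
    """One Organisation/Repository line in the item's table (hypothetical shape)."""
    org_id: str
    repo_id: Optional[str] = None
    sword: bool = False


@dataclass
class Item:
    title: str
    rows: dict = field(default_factory=dict)                 # org_id -> TargetRow
    author_transferable: dict = field(default_factory=dict)  # author -> bool
    transferable: int = 0


def ingest(item, authors, author_org, org_repo, sword_repos):
    """Cascade author -> organisation -> repository -> SWORD endpoint."""
    for author in authors:
        item.author_transferable[author] = False
        org_id = author_org.get(author)
        if org_id is None:
            continue                      # no organisation known for this author
        is_new_row = org_id not in item.rows
        row = item.rows.setdefault(org_id, TargetRow(org_id))
        repo_id = org_repo.get(org_id)
        if repo_id is None:
            continue                      # organisation has no known repository
        row.repo_id = repo_id
        row.sword = repo_id in sword_repos
        if not row.sword:
            continue                      # repository has no SWORD endpoint
        if is_new_row:
            item.transferable += 1        # count each org/repo combination once
        item.author_transferable[author] = True
    return item
```

Keeping the three lookups as plain dictionaries and a set is just a convenience for the sketch; in practice they would presumably be queries against whatever the Broker stores.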

Let us set up an example: assume we have an item for deposit, with 6 authors. We identify that two authors come from the same institution and that one does not associate with any organisation we know about, which leaves four organisations. We also identify that each organisation [fortuitously] resolves to a single applicable repository.

Having processed this example item, we have a table like this:

Org ID  | Repo ID  | SWORD? | Transfer Date | Returned URI | Live date | Live URI | Note
OrgID_1 | RepoID_A |        |               |              |           |          |
OrgID_2 |          |        |               |              |           |          |
OrgID_3 | RepoID_B | Y      |               |              |           |          |
OrgID_4 | RepoID_C | Y      |               |              |           |          |

Transferable: 2

Transfer

There will be a regular process that runs to transfer items from the Broker to the target repositories. This process could be kicked off by a timer (every n hours) or by an event (such as an ingest process finishing).

Here we find every transferable item (the “transferable” counter is positive).

  • For every table entry with a SWORD Endpoint flag and no Transfer Date
    • Transfer the item. After a successful transfer:
      • Set the date of the transfer
      • Set the returned URI for the item in the target repository
      • Decrease “Transferable” by one
      • Increase “InTransfer” by one
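As a rough sketch of that pass (again in Python, and again with invented names): `Row`, `Item` and `transfer_pass` are hypothetical, and `deposit` is a stand-in callable for the real SWORD deposit, assumed to return the URI the repository hands back, or `None` if the transfer failed.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Row:
    org_id: str
    repo_id: Optional[str] = None
    sword: bool = False
    transfer_date: Optional[date] = None
    returned_uri: Optional[str] = None


@dataclass
class Item:
    rows: list
    transferable: int = 0
    in_transfer: int = 0


def transfer_pass(items, deposit, today):
    """Send each transferable item to every SWORD row not yet transferred."""
    for item in items:
        if item.transferable <= 0:
            continue                         # nothing waiting for this item
        for row in item.rows:
            if not row.sword or row.transfer_date is not None:
                continue                     # no endpoint, or already sent
            uri = deposit(item, row)         # hypothetical SWORD deposit call
            if uri is None:
                continue                     # failed; retry on the next run
            row.transfer_date = today
            row.returned_uri = uri
            item.transferable -= 1
            item.in_transfer += 1
```

Because a row is only touched when it has no transfer date, running the pass twice does not deposit the same copy twice.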

Returning to our example, we successfully transfer the item to both target repositories (we only know that the copies have arrived in each repository, not that they have been accepted by the repository and made available).

We would have a table something like this:

Org ID  | Repo ID  | SWORD? | Transfer Date | Returned URI  | Live date | Live URI | Note
OrgID_1 | RepoID_A |        |               |               |           |          |
OrgID_2 |          |        |               |               |           |          |
OrgID_3 | RepoID_B | Y      | 01/02/2010    | http:/……./123 |           |          |
OrgID_4 | RepoID_C | Y      | 01/02/2010    | http:/……/456  |           |          |

Transferable: 0
InTransfer: 2

Confirm

In an ideal world, repositories would tell us when they accept an item. In reality they won’t, which means that the Broker needs to find out for itself when the item becomes available.

This is relatively easy to do: we know the URL for the deposited item (the SWORD transfer gave us that), so all we need to do is poke that location and see if the title of the item appears in the text of the page.
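The "poke" could look something like the sketch below. The function names are invented for illustration, and the matching rule (case-insensitive, whitespace-collapsed, HTML entities decoded) is an assumption about what "the title appears in the text of the page" should mean in practice; it only uses the Python standard library.

```python
import html
import re
from urllib.request import urlopen


def _normalise(text):
    """Decode HTML entities, collapse whitespace, and ignore case."""
    return re.sub(r"\s+", " ", html.unescape(text)).strip().casefold()


def title_in_page(page_text, title):
    """True if the item's title appears somewhere in the page text."""
    wanted = _normalise(title)
    return bool(wanted) and wanted in _normalise(page_text)


def item_is_live(uri, title, timeout=10):
    """Fetch the returned URI and look for the title; any failure counts as 'not yet'."""
    try:
        with urlopen(uri, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        return False                 # unreachable, HTTP error, or malformed URI
    return title_in_page(body, title)
```

Splitting the text check from the fetch keeps the matching rule easy to test without touching the network.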

Find every item “InTransfer”

  • For each repository that has a transfer date, but no live date and no note
    • Prod the location the item should appear in. If there is a valid response
      • Store <today> as the “Live Date”
      • Store the “Live URI” (both of these could be set by another process, if needed)
      • Decrease the “InTransfer” count
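A minimal sketch of the confirm pass, with hypothetical names throughout: `is_live` is whatever check decides that the item is visible at its URI, passed in as a callable, and the sketch assumes the Live URI is simply the URI returned by the transfer, which is what the example table suggests.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Row:
    returned_uri: Optional[str] = None
    transfer_date: Optional[date] = None
    live_date: Optional[date] = None
    live_uri: Optional[str] = None
    note: str = ""


@dataclass
class Item:
    title: str
    rows: list
    in_transfer: int = 0


def confirm_pass(items, is_live, today):
    """Check every transferred-but-not-live row and record the ones that are live."""
    for item in items:
        if item.in_transfer <= 0:
            continue                          # nothing outstanding for this item
        for row in item.rows:
            if row.transfer_date is None or row.live_date is not None or row.note:
                continue                      # not sent, already live, or annotated
            if is_live(row.returned_uri, item.title):
                row.live_date = today
                row.live_uri = row.returned_uri   # assumption: same as returned URI
                item.in_transfer -= 1
```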

If, in our example, we discover that one item has gone live, we would update the records thus:

Org ID  | Repo ID  | SWORD? | Transfer Date | Returned URI  | Live date  | Live URI     | Note
OrgID_1 | RepoID_A |        |               |               |            |              |
OrgID_2 |          |        |               |               |            |              |
OrgID_3 | RepoID_B | Y      | 01/02/2010    | http:/……./123 | 02/03/2010 | http:/…../123 |
OrgID_4 | RepoID_C | Y      | 01/02/2010    | http:/……/456  |            |              |

Transferable: 0
InTransfer: 1

Other features/benefits for this system

  • A target repository could tell us when they decline a deposit, in which case we can record that in the Notes column
  • If a target repository doesn’t respond within some long time period, the broker could stop looking for it, storing something like “stale” in the Notes column
  • By adding a Transferable field to the author record within the repository, and manipulating the field when a target repository gains a recognised SWORD-endpoint, we can determine when every author for an item has been identified, cross-referenced to an organisation, and had a copy of that item transferred to the appropriate repository… or, to put it another way: we can find records where there are still authors where we’ve not been able to identify a transfer location.
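Two of these ideas are simple enough to sketch. Everything here is an assumption for illustration: rows and items are plain dictionaries, the function names are invented, and the 90-day cut-off is an arbitrary placeholder for "some long time period". The post does not say what should happen to the InTransfer counter when a row goes stale, so the sketch leaves it alone.

```python
from datetime import date, timedelta


def mark_stale(rows, today, max_age=timedelta(days=90)):
    """Write 'stale' into the note of rows that were sent long ago and never went live."""
    for row in rows:
        sent = row.get("transfer_date")
        waiting = sent is not None and not row.get("live_date") and not row.get("note")
        if waiting and today - sent > max_age:
            row["note"] = "stale"


def items_with_unplaced_authors(author_flags_by_item):
    """Return ids of items where at least one author's Transferable flag is still false."""
    return [item_id
            for item_id, flags in author_flags_by_item.items()
            if not all(flags.values())]
```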

When a new Organisation/Repository/SWORD-endpoint comes into the Broker

When a new SWORD-endpoint, repository, or organisation becomes known to the Broker, then the broker could actually send a back-catalogue of known items to the new repository.

  • If it’s an organisation, then the Broker needs to scan every item in the repository for an author without an associated Broker_OrgID, and check whether that author is associated with the new organisation, and therefore whether there’s a repository for that org, and therefore a SWORD endpoint… etc.
  • If it’s a repository, then the broker finds all the items with the identified organisation associated with that repository, and does the org -> repo -> SWORD -> Transferable sequence of checks.
  • If a known repository gains a SWORD endpoint, then the same flow applies: find all occurrences of that repository, update the line to be Transferable, and wait for the Transfer process to, well… transfer.
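The last case is the simplest to sketch. The dictionary layout and the function name `repo_gains_sword` are hypothetical, and the sketch only handles the per-row flag and the item's Transferable counter; updating the per-author flags would follow the same pattern.

```python
def repo_gains_sword(items, repo_id):
    """Flag every row for repo_id as having a SWORD endpoint and queue it for transfer."""
    updated = 0
    for item in items:
        for row in item["rows"]:
            if row.get("repo_id") != repo_id or row.get("sword"):
                continue                       # other repository, or already flagged
            row["sword"] = True
            if row.get("transfer_date") is None:
                item["transferable"] += 1      # the Transfer process picks it up later
                updated += 1
    return updated
```

Nothing is transferred here: the function only marks the rows, and the regular Transfer run does the rest.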
