Posted by: Ian | January 11, 2010

Similarities, and when they bite you in the behind

Way back in July, the repository webometrics people told us of their database, and I thought “Yippee: someone has done the hard work of merging OpenDOAR and ROAR together… I’ll just use them as my reference, rather than reinvent the wheel and do the merging here.”

Unfortunately they only update their data twice a year, which ruled that out… and I cheerfully went off to write my own code to merge the two datasets (after all, how difficult can it be? Both OpenDOAR & ROAR list the same things: collections for Scholarly Research & Academic Publications…)

It took me a while to figure out what the common fields were, how to structure a data-set that would allow me to find common repositories, and how to munge’n’merge things reasonably quickly… but realisation has struck: this can never be fully automated!

Just looking at the names for repositories (which I can link as they have matching repository URIs) – which one is right?

This is fairly easy… if you know the first is from ROAR, and spot that “AURA” is listed as an Acronym in the OpenDOAR data (which has the second name listed)
– Aberdeen University Research Archive: AURA
– Aberdeen University Research Archive
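The acronym case is about the only one that looks mechanical. A minimal sketch of the idea, assuming the acronym arrives as a separate field from one source and is tacked onto the name in the other; the function name and the list of separators are my own illustration, not anything either registry defines:

```python
def strip_known_acronym(name: str, acronym: str) -> str:
    """Drop a trailing acronym (e.g. ': AURA') from a repository name.

    The separators tried here are guesses at common styles, not a list
    taken from the ROAR or OpenDOAR data.
    """
    name = name.strip()
    if not acronym:
        return name
    for suffix in (f": {acronym}", f" ({acronym})", f" - {acronym}"):
        if name.endswith(suffix):
            return name[: -len(suffix)].rstrip()
    return name


roar_name = "Aberdeen University Research Archive: AURA"
opendoar_name = "Aberdeen University Research Archive"
opendoar_acronym = "AURA"

# After stripping, the two names compare equal.
names_agree = strip_known_acronym(roar_name, opendoar_acronym) == opendoar_name
```

This only helps when one side actually supplies the acronym; without it, the script has no way of knowing that “AURA” is not part of the name.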

This one is tricky: how do you tell if “Giuliano” should be included or excluded?
– Archivio Marini
– Archivio Giuliano Marini

Are these two the same? As a human, I read them as being different
– CCLRC ePublication Archive
– STFC ePublication Archive

In the following two pairs, the second name appears to be in the local language, and the first name carries the important information in trailing brackets
– Academic Archive On-line (Jönköping University, Sweden)
– Publikationer från Högskolan i Jönköping
– Academic Archive On-line (Karlstad University, Sweden)
– Publikationer från Karlstads Universitet

But what on earth does one do here? The words in the brackets are a guide to the content and not, of themselves, significant
– e-Print archive (physics, mathematics, related fields)
– e-Print archive
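A script cannot decide whether the bracketed words matter, but it can at least separate them out so a human sees them side by side. A rough sketch, assuming the qualifier is always the last thing in the name and is not nested; the function name is hypothetical:

```python
import re

# Matches one trailing "( ... )" group with no nested brackets.
_TRAILING_BRACKETS = re.compile(r"\s*\([^()]*\)\s*$")


def split_trailing_qualifier(name):
    """Return (base_name, qualifier); qualifier is None if there are no trailing brackets."""
    match = _TRAILING_BRACKETS.search(name)
    if match is None:
        return name.strip(), None
    base = name[: match.start()].strip()
    qualifier = match.group(0).strip()[1:-1].strip()
    return base, qualifier


base, qualifier = split_trailing_qualifier(
    "e-Print archive (physics, mathematics, related fields)"
)
# base is "e-Print archive"; qualifier is "physics, mathematics, related fields"
```

For the e-Print pair the base names then match exactly. For the Jönköping and Karlstad pair the same split leaves two identical base names and puts the only distinguishing information in the qualifier, so the split on its own still cannot say which half to trust.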

… and I just despair over this one:
– Cracow University of Technology Digital Library
– Biblioteka Cyfrowa Politechniki Krakowskiej (Digital Library of Cracow University of Technology)

Now I know why the repository webometrics people only update twice a year: they are having to hand-check things.

… but I am determined… I must be able to reduce the number of hand-checks to below the 500+ I currently have!
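One possible way to shrink the hand-check pile is to score each URI-matched pair of names and only queue the low scorers for a human. This is a sketch using Python's standard `difflib`, not the code I actually run; the `triage` helper and the 0.9 threshold are assumptions picked for illustration:

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Case-insensitive similarity between two names, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def triage(pairs, threshold=0.9):
    """Split (name_a, name_b) pairs into auto-accepted and hand-check lists.

    The threshold is an arbitrary starting point and would need tuning
    against pairs that have already been checked by hand.
    """
    auto, manual = [], []
    for a, b in pairs:
        if name_similarity(a, b) >= threshold:
            auto.append((a, b))
        else:
            manual.append((a, b))
    return auto, manual


auto, manual = triage([
    ("Aberdeen University Research Archive", "aberdeen university research archive"),
    ("Archivio Marini", "Archivio Giuliano Marini"),
])
# The first pair differs only in case and is auto-accepted;
# the second falls below the threshold and is queued for a hand-check.
```

A score like this cannot be the last word: names such as the CCLRC and STFC pair differ by only a few characters yet may not be the same thing, so anything near the threshold probably still deserves a look.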

