Added by Ellen Collins on 20 January 2010 16:15

As the RIN's new research officer, I've spent a bit of time over the last few weeks familiarising myself with the higher education sector. One issue that's really jumped out at me (and which the RIN is doing quite a bit of work on at the moment) is access to datasets.

We recently had a very interesting presentation from Adam Farquhar of the British Library about DataCite, a collaboration designed to make data discoverable and useful for academics. But issues around data sharing were unexpectedly brought home to me when I started fiddling around with some of the datasets which relate to the higher education sector.

I was taking a quick look at the SCONUL statistics to try and get a sense of how the sector has changed over the last ten years. This dataset helpfully groups institutions into, for example, RLUK members, the pre- and post-1992 universities and so on. Great. But the crucial bit of information needed to make these statistics really useful is missing from the SCONUL website. It's interesting to know that in 1998, RLUK members had, between them, 217 libraries while the post-1992 universities had 243, but without knowing how many institutions are part of each group it's impossible to make a valid comparison about library provision. Similarly, when the post-1992 total went down to 223 the following year, was that due to some wholesale library closure programme, or did one institution simply forget to return the survey that year?
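To make the point concrete, here is a minimal Python sketch of the comparison that the missing information would allow. The library totals are the 1998 figures quoted above; the membership counts are purely hypothetical placeholders, since the real group sizes are not on the SCONUL website, and the function name is invented for illustration.

```python
# Library totals for 1998, as quoted in the post.
libraries_1998 = {"RLUK": 217, "post-1992": 243}

# HYPOTHETICAL membership counts: placeholders only, NOT real SCONUL figures.
members_1998 = {"RLUK": 20, "post-1992": 40}


def libraries_per_institution(totals, members):
    """Divide each group's library total by its number of institutions."""
    return {group: totals[group] / members[group] for group in totals}


rates = libraries_per_institution(libraries_1998, members_1998)
for group, rate in rates.items():
    print(f"{group}: {rate:.2f} libraries per institution")
```

With these made-up group sizes, the group with the larger raw total ends up with fewer libraries per institution, which is exactly the kind of reversal the raw totals alone cannot reveal.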

SCONUL and LISU, who compile the statistics, couldn't have been more helpful, and sent me through a comprehensive list of group memberships for the whole ten-year period. But for me, this really illuminated some of the issues that need to be considered when making datasets widely available. A great deal of tacit knowledge is involved when undertaking research: it can be easy to forget what you know about your project that others may not. Unless this explanatory information is considered and made explicit, the data could become useless or, even worse, misleading.

The RIN is currently midway through a project looking at the value of UK data centres. It strikes me that one of the important things that a repository devoted to looking after data can do is to enforce standards around contextual information. The UK Data Archive, for example, has very clear guidelines for researchers wanting to share and upload information. But it's not yet clear that all data repositories have this approach.

So my question to the academic community is this. Do we need to take a look at the standards of contextual information for data? Is this a problem in the public arena, but not the academic one? Or have you experienced challenges when trying to use somebody else's dataset? Let us know in the comments.

