Yesterday I was reading an article in The New Yorker which contained a statistic that shocked me.
But first, a tangent. Citing things on the web is a challenge. The New Yorker article I want to refer to here is called ‘The Cobweb’, subtitled ‘Can the Internet be archived?’, by Jill Lepore. When I first came across it on Twitter I’m sure it was called ‘What the web said yesterday’, but the title has since been changed. The old title is still there in the page source though (at least it is today, 21 January 2015) and is visible in my browser tab, and all over Twitter. I really hope this is a carefully crafted in-joke for the few who notice. If it is, it’s quite clever; but given what I’ve seen elsewhere I’m not willing to place a bet.
And, as a tangent to my tangent, the only visible date for ‘The Cobweb’ (including in the URL) is 26 January 2015, but here I am talking about it on 21 January 2015, and I first saw a link to it on 20 January 2015. To quote the article: “The Web dwells in a never-ending present. It is—elementally—ethereal, ephemeral, unstable, and unreliable.” Agreed. Introducing future dating into the equation doesn’t help. I know print issues of magazines are often released before the cover date, but this is the web – at least add a ‘Date published’ field. As I said, citing things on the web is a challenge.
Now back to that statistic in ‘The Cobweb’, the article formerly known as ‘What the web said yesterday’, dated early next week.
The average life of a Web page is about a hundred days.
This causes the archivist in me to break out in a cold sweat, and makes the PhD student in me wish he had spent more time listening to the archivist in me.
I hope the academic community helps pull the average up a little, but we are far from immune. Some of you likely saw the recent article by Martin Klein et al., ‘Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot’. From the abstract:
We find one out of five STM [Science, Technology and Medicine] articles suffering from reference rot, meaning it is impossible to revisit the web context that surrounds them some time after their publication. When only considering STM articles that contain references to web resources, this fraction increases to seven out of ten.
These concerns are part of my work life as well as my research. The Find & Connect web resource – to take the largest project I have worked on in recent years – contains over 15,000 web pages of its own and links to thousands more external pages. Guarding against link rot is a Sisyphean task. I could point to a set of Senate submissions where we have had to update all our links twice due to shifting content. I could also point to a large GLAM [galleries, libraries, archives and museums] organisation where a change in their repository meant we had to manually search for over 80 digital resources so we could update their ‘persistent’ links. There was no way to do this programmatically. I asked.
The Centre where I work is by no means perfect, but we try to remain aware of the problem. An example is Bright Sparcs (now part of The Encyclopedia of Australian Science), “a register of people involved in the development of science, technology, engineering and medicine in Australia, and related resources”. The site went live in August 1994, and – through careful use of identifiers and simple URL structures – every one of those URLs still resolves to a web page about the same subject today.
The Bright Sparcs data on the web turned 20 last year (which, if it were an Australian Terrier, would be nearly 100 according to this commercially sponsored dog age calculator). We held an event to celebrate. For those interested in finding out more, Tim Sherratt – who was unable to attend but was key to the successful development of the site in the 1990s – Storified the Twitter stream from the event and included links to some interesting articles from the time. Online resilience and Bright Sparcs have also been covered by my colleagues, for example in Helen Morgan et al., ‘Standing the Test of Time: Building better resilience into online archival descriptive networks’, presented at ICA 2012 in Brisbane.
(I have been wondering if there are any comparable resources online where URLs from the early days of the web – let’s say pre-1995 – still work today. If you think of any please let me know.)
Enough trumpet blowing – as I said, we are far from perfect. So what should we do?
First, when trying to track down missing content we can try internet archives, the largest being the Wayback Machine. It’s fantastic; but it’s also huge, often unwieldy and sometimes imprecise. And if you don’t have a URL it’s completely useless.
Second, people working on the web – particularly those of us working for large, persistent organisations (I’m looking at you, universities, governments and the GLAM sector), and even more so those of us supported by public funds (still looking at you) – must be actively tackling the issue of impermanence online.
If changing your repository changes your links, you’re doing it wrong. If the public can’t find a persistent link to your content, you’re doing it wrong. If you write a specification that doesn’t cover the way URLs or URIs or DOIs or handles or whatever are generated and maintained through time (including beyond the life of the technology being specified) you’re doing it wrong. We all have legacy issues and systems to deal with, but if you don’t have persistence on your radar for when development day comes, you’re doing it wrong.
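One way to picture the first of those points is a layer of indirection between the public identifier and wherever the content currently lives. What follows is only a minimal sketch, not a description of how Bright Sparcs or any real repository works: the `Resolver` class, the identifiers and the example.org addresses are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Resolver:
    """Maps stable public identifiers to whatever location currently serves them."""

    locations: Dict[str, str] = field(default_factory=dict)

    def register(self, identifier: str, location: str) -> None:
        """Record (or update) the current location for an identifier."""
        self.locations[identifier] = location

    def migrate(self, old_prefix: str, new_prefix: str) -> int:
        """Repoint every location under old_prefix; the identifiers never change."""
        moved = 0
        for identifier, location in self.locations.items():
            if location.startswith(old_prefix):
                self.locations[identifier] = new_prefix + location[len(old_prefix):]
                moved += 1
        return moved

    def resolve(self, identifier: str) -> Optional[str]:
        """Return the current location, or None if the identifier is unknown."""
        return self.locations.get(identifier)


# Hypothetical identifiers and hosts, for illustration only.
resolver = Resolver()
resolver.register("P000001", "http://old-repo.example.org/items/1")
resolver.migrate("http://old-repo.example.org/", "http://new-repo.example.org/")
print(resolver.resolve("P000001"))
```

The point of the design is that a repository change touches only the lookup table. Anything already published, cited or bookmarked keeps pointing at the identifier, which stays the same.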
Third, those of us who are researchers and students should always assume others are doing it wrong. If we need to access content on the web tomorrow, or the next day, or in a few months, or next year, we have to save offline copies, bring them into our personal or institutional workflows, and back them up using systems and processes we control. And we need to be teaching this to other students, including undergraduates. Remember the 100-day average? That’s only about ten days longer than a fairly typical university semester.
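For anyone who wants a starting point, saving an offline copy can be as simple as the rough sketch below. It assumes Python's standard library only; the function names (`fetch`, `save_snapshot`), the file naming scheme and the `snapshots` folder are my own illustrative choices, not an established tool. It keeps the raw page alongside a small record of the URL and the date accessed, which is exactly what a citation needs.

```python
import hashlib
import json
import urllib.request
from datetime import date
from pathlib import Path


def fetch(url: str, timeout: float = 30.0) -> bytes:
    """Download the raw bytes of a page (needs network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def save_snapshot(url: str, content: bytes, dest_dir: Path, accessed: date) -> Path:
    """Write the content plus a JSON sidecar recording where and when it was captured."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha256(content).hexdigest()
    # File name combines the access date with a short content hash.
    snapshot = dest_dir / f"{accessed.isoformat()}_{digest[:12]}.html"
    snapshot.write_bytes(content)
    metadata = {
        "url": url,
        "date_accessed": accessed.isoformat(),
        "sha256": digest,
    }
    sidecar = snapshot.with_suffix(".json")
    sidecar.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return snapshot
```

Typical use would be something like `save_snapshot(url, fetch(url), Path("snapshots"), date.today())`, with the `snapshots` folder then included in whatever backup process you already control. This captures only the page's HTML, not its images or scripts, so treat it as a floor rather than a full archive.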
Finally, if you come across this problem – particularly if a publicly funded institution, GLAM organisation or government makes a change that breaks a lot of your links, or lets content drift, or keeps throwing you 404 errors, or insists on landing you on a “We can’t find the content you’re looking for – why not try searching for it?” page as if we hadn’t thought of that – and why should we even need to search if we have a URL? I mean what’s the point of citing things in the first place if you’re just going to keep moving or removing them whenever you feel like it? *deep breath*
And finally, if you come across this problem – particularly with a publicly funded institution, GLAM organisation or government – don’t just rant, take the time to let them know. We need to inform people of the downstream impact of their policies. The web is not much older than Bright Sparcs and we all still have a lot to learn.
So many aspects of our work, our hobbies and our leisure are mediated by and captured in online structures and content. The transitory nature of some of these digital interactions is part of their value. I am not an ‘archive everything’ archivist, in part because I know it is futile, in part because I recognise the value of the ephemeral. The problem is, too much of the web is ephemeral by accident rather than by design. We need to stop the rot.
References:
Lepore J (2015) The Cobweb [also known as ‘What the web said yesterday’]. The New Yorker. January 26, 2015 Issue. [Published online c.20 January 2015] URL: http://www.newyorker.com/magazine/2015/01/26/cobweb (date accessed: 21 January 2015)
Klein M, Van de Sompel H, Sanderson R, Shankar H, Balakireva L, et al. (2014) Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot. PLoS ONE 9(12): e115253. doi:10.1371/journal.pone.0115253 (date accessed: 21 January 2015)
Morgan H, Smith A, Evans J (2012) Standing the Test of Time: Building better resilience into online archival descriptive networks. International Council on Archives Congress. Brisbane, Australia, 20-24 August 2012. URL: http://ica2012.ica.org/files/pdf/Full%20papers%20upload/ica12Final00185.pdf (date accessed: 21 January 2015)
January 22, 2015 at 11:42 am
Great post. Please have a look at Creating Pockets of Persistence, which proposes a combination of proactively archiving what is linked to and decorating links to address the reference rot problem. See also the Hiberlink project.
January 22, 2015 at 11:59 am
Thanks Herbert, and thanks too for the links. It’s great to see people actively working in this space.
January 22, 2015 at 10:09 pm
Agree, excellent post which highlights a lot of odd practice. As follow-up to Herbert’s comment, the PhD student in you might be interested in some grey literature on the web from the Hiberlink project that provides an analysis of rotten links in doctoral theses,
http://www.slideshare.net/edinadocumentationofficer/reference-rotandetheses
that forms part of the ETD2014 conference,
http://www2.le.ac.uk/library/etd2014/plenaries/reference-rot-and-etheses
January 22, 2015 at 10:21 pm
Thank you Peter. I look forward to looking at the slides and presentation on rotten links in doctoral theses too. It’s something that’s been on my mind!