On The Atlantic, Meredith Broussard was inspired by an earlier The Atlantic article on the difficulty of preserving web news content online to take a look at the difficulty of preserving that specific web news content online. (David Rothman covered the earlier article for TeleRead last month.) Given that the web content that article talked about was a news series on a 46-year-old event that had itself been forgotten, the whole issue seems to partake of several layers of self-referentialism. But that only serves to underscore its severity.

I was just starting out at college when the World Wide Web entered use, and I still remember the days when web pages consisted of simple HTML. (Indeed, my own fossilized ’90s-vintage homepage dates directly back to my first Lynx bookmarks file, which I simply hand-edited into an index.html file.) But compared to the modern web, that’s effectively the equivalent of stone tablets. Modern web sites are governed by one or more content management systems (CMS) which weave together multiple sets of source files to create slick-looking sites that catch the eye.

The problem is that the complicated nature of this process makes archiving extremely difficult. Many news sites change their CMSes for new ones over time, which can change the site’s entire URL schema in one fell swoop—and suddenly extant links to older stories no longer work. Sometimes even the sites themselves are no longer able to find them. Broussard points out that the Internet Archive’s Wayback Machine does a great job in preserving older content, but it doesn’t have the kind of indexing and search function needed to make it useful if you don’t know exactly where or when the article you want was published.

Anyone who’s ever had to search for older stories on the web has learned how this works. Many is the time I’ve found an older story by googling it, clicking a link that no longer works, and then copying the link’s URL to take to the Wayback Machine and see if it works there. But even that only works if there are still enough links to the older content out there to give you a handy way marker.
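For readers who would rather script that dead-link-to-Wayback-Machine routine than do it by hand, here is a minimal sketch. It assumes the Internet Archive's public availability endpoint at archive.org/wayback/available and the JSON shape it has documented (an `archived_snapshots` object holding a `closest` entry). The function names are my own and purely illustrative, and the Archive could change the service at any time.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: the Internet Archive's public availability lookup.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_lookup_url(dead_url, timestamp=None):
    """Build the query URL for a dead link.

    timestamp is optional, in YYYYMMDDhhmmss form (a prefix such as
    "2006" also works), and asks for the snapshot closest to that date.
    """
    params = {"url": dead_url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(response_text):
    """Return the archived copy's URL from a JSON response, or None."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


def find_archived_copy(dead_url, timestamp=None):
    """Ask the Archive over the network for a saved copy of dead_url."""
    with urlopen(build_lookup_url(dead_url, timestamp)) as response:
        return closest_snapshot(response.read().decode("utf-8"))
```

Calling `find_archived_copy` with the broken link (and, optionally, a rough date) should hand back a web.archive.org address if a copy was saved, or `None` if not. Note that this only automates the step described above: you still need the old URL to begin with, which is exactly the limitation Broussard describes.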

And this is a problem that we’re no strangers to here at TeleRead. The very earliest articles in our present database go back to 2002, leading me to misestimate how old TeleRead was in an article I wrote about another early e-book site. As David pointed out in response, earlier versions are available in the Internet Archive, but how would I have known that without going to look? And if I wanted to find some specific post from that era, how could I do so since that version of the site no longer even exists to site-search?

Apart from that, some things got shuffled around or lost in our move to NAPCO’s servers and back. In researching older TeleRead articles for backlinks in current stories, I still run across situations where I know I wrote some article about a particular subject, but a Google search on a keyword that should find it doesn’t pull it up, and I have to use WordPress’s internal post search function to locate it instead. Even in cases where older articles can be found, the images that used to go with them are often missing without a trace. If that sort of problem can affect as relatively simple a site as TeleRead, it’s easy to see how badly it could hit far more complex news sites, especially ones with decades of history to keep track of. And who knows how it will affect TeleRead if we should make any other site changes in the future?

This is an important issue for future historians and researchers. After all, our culture is by and large a digital one now, and many of our most important day-to-day news sources don’t even have print versions to archive anymore. If we lose track of that aspect of our culture, how will we get it back?

One bright spot is that, unless blocked by robots.txt files, the data is still being stored on the Internet Archive in some form. That means that, sooner or later, if the Archive is able to implement a better search and indexing function on it, historians will derive more benefit from all those saved pages. But that may be small consolation for people who want to dig up ancient Internet history now.
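As an aside on that robots.txt caveat, here is a small sketch of how a site's robots.txt rules can be checked against a given crawler, using Python's standard `urllib.robotparser` module. The user-agent name `ia_archiver` is an assumption on my part (it is the name historically associated with the Internet Archive's crawling), and the sample rules and URL are made up for illustration. Whether the Archive honors a given robots.txt today is a policy question this code cannot answer.

```python
from urllib.robotparser import RobotFileParser

# Assumed crawler name; historically associated with the Internet Archive.
ARCHIVE_AGENT = "ia_archiver"


def archiving_allowed(robots_txt, page_url, agent=ARCHIVE_AGENT):
    """Report whether robots_txt permits agent to fetch page_url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, page_url)


# Hypothetical robots.txt that shuts out only the archive crawler.
SAMPLE_RULES = """\
User-agent: ia_archiver
Disallow: /
"""

blocked = archiving_allowed(SAMPLE_RULES, "http://example.com/old-story")
```

With the sample rules above, `blocked` comes out `False` for the archive crawler, while a crawler under any other name would still be allowed in.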


  1. That’s why print media matters so much and, in fact, why many of the new technologies seem to come with a self-destruct mechanism built in. Check YouTube and you’ll find videos of people managing to repair and get operational a 100-year-old diesel engine. Shops can still build those parts on lathes and the like. But a decade or so after the start of the space shuttle, NASA was having trouble getting parts for its test equipment. Late 1970s computer chips not only weren’t getting built, they couldn’t be built.

    NASA ran into a similar problem with its data from the Mars landers of the 1970s. The data, stored on computer tape, was either lost or not readable. They were only able to get to the original data because a NASA engineer had broken the rules, printed out the results, and stored them in his garage.

    Biographers and historians are running into similar problems. The golden age for writing biography was when letter writing was common and telephones had not come into general use. Educated people wrote letters almost daily and often those letters were preserved. When phones came along, their local contacts moved there, but long-distance contacts continued via mail. Now email, texting and essentially free long distance mean that few people write letters any more outside formal contexts. That door into people’s lives has been closed.

    Even the preservation of emails is becoming doubtful. There was a window in which people planned all sorts of stuff they didn’t want to get out via email. When lawyers began using that in court, lessons were learned. Whatever it was that Hillary Clinton wanted to hide of her activities as Secretary of State, she knew that if she wanted to delete them, she needed to move all her email onto a private server that she controlled.

    That may or may not work for her, given her lack of tech savvy. Email is stored at both ends, the sender and the receiver. That may do her in—that is assuming that our legal system will ever prosecute important people, which is doubtful. More and more people are coming to believe that laws are now only for “the little people.”

    Paper documents have never been like that. Even before copy machines, carbon copies were made and filed away as a matter of course. Accessing them was a bit harder than mass deletes of files on a hard drive in your home. Most important people, those with the most to hide, did not even understand those physical filing systems.

    And once those paper files are opened up, interesting things can appear. Only recently have enough of the diplomatic and military files from World War One (not Two) been open long enough for them to be indexed and studied by scholars. And as a result, the real dynamics of how WWI developed are only just now coming out. As I heard one historian remark yesterday, arguments that the military pushed the war aren’t credible any more. In every case and in every country, it was the civilian authorities who ordered the war. The military may have limited the options of czars and prime ministers, but in the end the civilian leaders made choices that turned out terribly.

    What I worry about is that the increasing use of digital will make it far easier to amend history in an Orwellian fashion. Do you really think that the NY Times wants contemporary historians to check out the coverage of the 2008 presidential campaign? It’d be much less embarrassing if certain stories could simply disappear. Ditto the global warming hysteria of about a decade ago. Now we’re supposed to think it’s always been just about climate change, as if the climate wasn’t always changing.

    The upside is that it is now easier than ever to take a snapshot of these articles and preserve them, especially with tools like Instapaper. For every move there is a countermove.
