Internet Archive archives digital texts… on paper. WTF.

The Internet Archive reports on its blog that it is concerned that the original copies of books digitized for libraries and other institutions are being discarded or moved to “off site repositories” when they are returned. Its solution is to take these original books and archive them for future use:

A reason to preserve the physical book that has been digitized is that it is the authentic and original version that can be used as a reference in the future. If there is ever a controversy about  the digital version, the original can be examined. A seed bank such as the Svalbard Global Seed Vault is seen as an authoritative and safe version of crops we are growing. Saving physical copies of digitized books might at least be seen in a similar light as an authoritative and safe copy that may be called upon in the future.

While I applaud Internet Archive’s dedication to archiving and storage of backup material, I say they’re taking a step backward here. You don’t preserve backups of microfiche newspaper articles by saving the newspapers. Likewise, storing digital documents on paper is wasteful and energy/storage-demanding; a single hard drive could save everything in those shipping containers depicted above (not to mention the headache of accessing a single book stored therein).

What the IA ought to be doing is working to improve and use digital storage and backup systems. Yes, they are not perfect as-is; but considering how easy it is to back up a single hard drive in multiple redundant systems, all of which can be designed to cross-check each other to eliminate “electron-flipping,” you could accomplish the same thing with just four drives placed in four safe sites. Want to be safer? Try eight drives.
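
The cross-checking described above can be sketched as a byte-by-byte majority vote across redundant copies. This is only an illustration under stated assumptions: the function name `majority_repair` is hypothetical, every copy is assumed to be the same length, and real storage systems typically use checksums or error-correcting codes rather than a plain vote.

```python
from collections import Counter


def majority_repair(replicas: list[bytes]) -> bytes:
    """Rebuild a file by taking the most common byte at each position.

    Hypothetical helper for illustration; assumes equal-length replicas.
    """
    if not replicas:
        raise ValueError("need at least one replica")
    length = len(replicas[0])
    if any(len(r) != length for r in replicas):
        raise ValueError("replicas differ in length; cannot vote byte by byte")

    repaired = bytearray()
    # zip(*replicas) yields one tuple per position, holding that byte from every copy.
    for column in zip(*replicas):
        byte, votes = Counter(column).most_common(1)[0]
        # Require a strict majority so that a 2-2 split among four drives is
        # reported instead of being silently guessed.
        if votes * 2 <= len(replicas):
            raise ValueError("no strict majority at some position")
        repaired.append(byte)
    return bytes(repaired)


# Usage: four copies, one of which has suffered a single flipped bit.
original = b"Domesday Book"
damaged = bytearray(original)
damaged[0] ^= 0x01
copies = [original, bytes(damaged), original, original]
assert majority_repair(copies) == original
```

With four drives, a vote like this survives one bad copy at any given position; a two-against-two split is flagged rather than resolved, which is one argument for an odd number of copies or for the eight drives suggested above.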

Let’s face it: Paper is far from the perfect storage medium, as those shipping containers ably illustrate. Let’s be sensible about archiving and storage, and not let romanticism over paper lead us astray.

15 Comments on Internet Archive archives digital texts… on paper. WTF.

  1. Mastering from digital media is too wishful for me. Try reading digital media a century from now. The efficacy of paper masters is their dependability, both eye and machine readable. Paper supports screen display by
    • BACK-UP: providing capacity for regeneration of the screen copy as may be needed due to systems failure (e.g., proprietary take-down, copyright infringement take-down)
    • MASTERING: providing capacity for augmentation, enhancement or perfecting of faulty screen copy (e.g., adding foldouts or color to Google book copy, missing image of binding, or increased image enhancement)
    • AUTHENTICATION: providing capacity for resolution of forensic, production or provenance questions (e.g., cotton content of 19th century paper, distinction between copy and source faults, evidence of copy manipulation or sophistication)

  2. Ross Presser // June 14, 2011 at 11:34 pm //

    @Gary Frost: I agree 1000%. Paper is much more dependable than electronic media (at least until we invent memory diamonds). But clay tablets are even more dependable than paper.

  3. I vote for the clay tablets, with a digital copy on minidisc and an audio file, bundled together. A truly 3D book.

    😀

    Seriously, there is no right or wrong way, but the idea of the Internet Archive stacking up reams and reams of paper and ink–that once were digital books–does twang the Ironic Lyre a bit…

  4. Meredith, you bring up an interesting thread concerning print-outs of digital material, such as paper copies of e-mails. But the shared print repository movement of research libraries is another issue. What will be stored is print for which a screen copy has been produced, as well as some print which has not yet been imaged. So the materials were born print, in eras of analog production.

    Librarians know that both print and screen collections are faulty. They also know that validation and certification of any collection is costly. The next challenge that we face is confirmation that any title will be displayed both on paper and on screen. This composite display is characteristic of the great majority of books now produced by digital technologies. After all, publishers actually love selling single book titles twice to different markets.

  5. What about the fact that digital formats can become obsolete? From http://www.dpconline.org/events/previous-events/306-digital-longevity:

    The 1086 Domesday Book, instigated by William the Conqueror, is still intact and available to be read by qualified researchers in the Public Record Office. In 1986 the BBC created a new Domesday Book about the state of the nation, costing £2.5 million. It is now unreadable. It contained 25,000 maps, 50,000 pictures, 60 minutes of footage, and millions of words, but it was made on special disks which could only be read in the BBC micro computer. There are only a few of these left in existence, and most of them don’t work. This Domesday Book Mark 2 lasted less than 16 years.
    Digital media have to be stored, and the physical medium they are stored on, for instance a computer’s hard disk drive or a CD-rom have finite lifespans. But the primary problem is of obsolescence. Computer formats sink into oblivion very rapidly. Howard Besser, of the UCLA School of Education & Information Studies says: “Fifteen years ago Wordstar had (by far) the largest market penetration of any word processing program. But few people today can read any of the many millions of Wordstar files, even when those have been transferred onto contemporary computer hard disks. Even today’s popular word processing applications (such as Microsoft Word) typically cannot view files created any further back than two previous versions of the same application (and sometimes these still lose important formatting). Image and multimedia formats, lacking an underlying basis of ascii text, pose much greater obsolescence problems, as each format chooses to code image, sound, or control (synching) representation in a different way.”

  6. Digital files can become obsolete. http://www.dpconline.org/events/previous-events/306-digital-longevity

    “Digital media have to be stored, and the physical medium they are stored on, for instance a computer’s hard disk drive or a CD-rom have finite lifespans. But the primary problem is of obsolescence. Computer formats sink into oblivion very rapidly. Howard Besser, of the UCLA School of Education & Information Studies says: “Fifteen years ago Wordstar had (by far) the largest market penetration of any word processing program. But few people today can read any of the many millions of Wordstar files, even when those have been transferred onto contemporary computer hard disks. Even today’s popular word processing applications (such as Microsoft Word) typically cannot view files created any further back than two previous versions of the same application (and sometimes these still lose important formatting). Image and multimedia formats, lacking an underlying basis of ascii text, pose much greater obsolescence problems, as each format chooses to code image, sound, or control (synching) representation in a different way.”

  7. It seems many people assume that, once loaded into a digital medium, a digital file will thereafter not be touched for millennia; at which point, it will be discovered to be so much digital slag. Not only is that just plain silly, it’s counter to the point of storage and backup.

    Only a fool would put files onto a digital medium, then not update that file when a newer medium comes along. The vast advantage of digital mediums is that, when a new medium does come along, the transfer to it is far easier than it was with previous mediums.

    A regular example given is floppy disk files: If someone tried to recover them today, the lack of available floppy drives would make that incredibly difficult or impossible. However, if the floppy disk content was copied to a newer medium while the disks were still usable, say, the period when PCs could read floppies, Zip disks and CDs, the transfer would be effortless (if a bit time-consuming at the small sizes of floppies).

    As digital mediums have improved, the ability to transfer files from one format or medium to another, during the transition period from one medium to another, has only become simpler.

    The key is to be proactive in archiving files and updating them to new formats, not waiting for 30 years and then trying to kluge together a workaround. I’ve been updating files like that for the past two decades as formats and storage mediums have evolved, and still have documents originally stored on those same floppies, now still readable on a modern PC.

  8. Digital media and their handheld device displays have already proven themselves incapable of reliable culture transmission. The famous base-line here is handwriting, carbon ink and papyrus, which reliably conveyed Gnostic gospels across 16 centuries.
    Which is not to say that digital media and their handheld device displays have not created new culture. There is some opinion that digital communication, network and display mode will change culture and modify perceptions as much as previous culture shifts, including new gospels.
    Perhaps the way forward is compilation. Economies, governance and communications are trending that way.

  9. While many old texts have passed down information from the past, there are other cases where ancient languages are being lost, leaving no one to translate all that paper-based text. And just like digital media, only properly-stored paper media lasts very long.

    I don’t mean to imply that digital media are perfect; just that modern media have the better potential for archiving more and archiving it longer, as technology develops, and assuming we don’t get lazy about our responsibilities to our archives. You want to be lazy? Use paper… it’s about the laziest medium around, and still difficult to store and access over long periods of time.

    But if you’re willing to do the job right, digital is the way to go: Multiple redundant copies are child’s play; files can be updated to new formats in batches and done in seconds; and they can be accessed worldwide in seconds. Do that with paper.

  10. Good points Steven. I certainly didn’t say that access routines are equivalent in paper and screen. Actually they are very different. Paper access was based on human classification. Screen access is based on automated extraction of content and metadata. One gives you very few index results while the other gives you too many.
    Generally the attributes of paper print are its constraints. It is far from a “lazy” medium since it requires persistence and determination to extract information from paper. Some argue that this very effort enhances comprehension. It’s a kind of “Google last” approach.
    The culture transmission issues include overt content. Screen-based display does not confirm absence, deletion or modification of content; paper copy does.

  11. Paper copy can only “confirm absence, deletion or modification of” its own content if you’re dealing with the original manuscript, which, in many cases, archivists are not. Screen-based display can confirm absence, deletion or modification of content if it is compared to redundant copies of itself as a cross-check, and it can do it much faster than the same checks applied to digital and paper.

    I call paper “lazy” because textual data can be stored away “as-is” for long periods of time, without worrying about periodic updating to new mediums and formats. On the other hand, if languages change, the need for periodic updating returns, and you lose paper’s chief advantage over digital media.
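
The cross-check between redundant digital copies mentioned in this comment can be sketched by comparing cryptographic digests. This is a minimal illustration, not a description of any archive's actual procedure: the function name `find_divergent_copies` and the site labels are hypothetical, and the most common digest is simply assumed to be the correct one.

```python
import hashlib
from collections import Counter


def find_divergent_copies(copies: dict[str, bytes]) -> list[str]:
    """Return the names of copies whose SHA-256 digest differs from the most common one.

    Hypothetical helper for illustration; assumes the majority digest is trustworthy.
    """
    if not copies:
        return []
    digests = {name: hashlib.sha256(data).hexdigest() for name, data in copies.items()}
    consensus, _ = Counter(digests.values()).most_common(1)[0]
    return sorted(name for name, digest in digests.items() if digest != consensus)


# Usage: three stored copies, one of which has been altered.
stored = {
    "site_a": b"It was the best of times",
    "site_b": b"It was the best of times",
    "site_c": b"It was the worst of times",
}
assert find_divergent_copies(stored) == ["site_c"]
```

A digest mismatch only shows that the copies differ; deciding which copy is authentic still depends on trusting the majority or on some outside reference.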

  12. Wow! Interesting Steven; we may be engaging some live chat that is so rare at TeleRead. The quick scrolling here and high traffic have always seemed to me an adverse environment for discussion of the future of books.
    My idea of overt content of print is a forensic issue. If I pick up a paper document I can, for example, inspect both sides as well as view it by transmitted light. With a printed book I can inspect the binder’s endpapers and sewn structure and a thousand other features to confirm that it is not derived from facsimile or modified collation or enhanced imaging or erroneous color or cropped margins or a thousand other modifications unapparent in the screen simulation.
    I have never encountered your remark that evolution of language would limit reliable print transmission. The Gnostic gospels in Greek letter Egyptian Coptic were read right off. It would seem that shifts of meaning would equally influence all display modes. If you are referring to automated translation you encounter immediate legibility quality. Here also screen display is especially crippled as network interruption, browser default line length, screen drawing errors, rights management encryption, application compatibilities, inept navigational routines and many other obstacles to immediacy of meaning produce illegibility of screen display.
    As for chief advantages of print over screen; with paper both storage and display functions are provided for a single, one-time cost. Try that with screen transmission.

  13. I applaud the actions of the Internet Archive. Saving paper copies of books and newspapers that have been digitized is absolutely vital and reflects admirable foresight. Throwing away digitized items would be disastrous.

    As part of my research I use multiple massive databases of text every day. These repositories are enormously valuable, but they are also deeply flawed. For example, a major digital archive of newspapers contains many newspaper pages that are unreadable. The image quality is so poor and the text is so degraded that the print is indecipherable.

    If the original newspaper pages are still available somewhere then in some cases it will be possible to salvage the information. Images can be captured under multiple wavelengths of light. Future technologies might allow direct assaying of ink remnants non-destructively. The availability of the original paper artifacts keeps the door to the future open.

    The Google Books archive contains volumes with pages that are missing, pages that are folded, pages that are blocked with hands, pages that have been scanned while in motion. Keeping the paper books in a reliable and safe storage location is essential so that pages can be rescanned. Indeed, Google indicates that it will attempt to rescan defective pages when they are reported.

    Garson O’Toole
    QuoteInvestigator_com

  14. A digital archive can only be as good as the legibility of the original content, and the care taken to properly digitize it. A great many digital archives were badly done in the first place, or done from bad originals. But this is hardly a reason to step backward and dote over the originals… it’s a reason to improve digitization methods.

    When I referred to lost languages, I was speaking of actual languages:

    Every 14 days a language dies. By 2100, more than half of the more than 7,000 languages spoken on Earth—many of them not yet recorded—may disappear, taking with them a wealth of knowledge about history, culture, the natural environment, and the human brain.

    -National Geographic

    Many of these languages include written languages, manuscripts left behind by those who originally spoke the language. As those people die, and languages are lost, the ability to read old language texts goes with them.

    Digital archives provide for the keys to translating old languages alongside the languages themselves, can be updated as modern languages evolve, and more easily translated into modern languages for scholars’ use.

  15. Archive the books on DVD. Then archive a few computers that can read DVDs, a few printers that can connect to them, some spare parts, lots of consumables, and a wood-driven generator that can produce the power to run it. Problem solved.
