‘Digital Text Masters’ (Digitizing the classic public domain books)

[Image: Self Portrait of Rembrandt van Rijn]

The recent TeleBlog articles about the Project Gutenberg (PG) text Tarzan of the Apes (see 1, 2) suggest that not all is well in the existing corpus of public domain digital texts. My personal experience over the last twelve years digitizing several public domain books has helped me to see a number of problems, which I’ve mentioned in various forums, including the PG forums and The eBook Community. For the sake of not turning this already long article into a whole book, I won’t cover here the complete list of problems I found, plus those found by others. To summarize what I believe should be done to resolve most of the known problems: when it comes to creating a digital text of any work in the public domain, we should first produce and make available what we call a “digital text master,” which meets a quite high degree of textual accuracy to an acceptable and known print source. From the “master,” various display formats and derivative types of texts (e.g., modernized, corrected, composite, bowdlerized, parodied, etc.) can then be produced to meet a variety of user needs. (Btw, what better example to illustrate the concept of a “digital text master” than to show the self-portrait of the great 17th century Dutch master painter, Rembrandt van Rijn, whose attention to detail and exactness is renowned.)

We must have some fixed frame of reference by which we produce digital texts of the public domain, otherwise it leads to problems of all kinds (a couple of these problems, but by no means all of them, are illustrated by the Tarzan of the Apes PG etext.) This is especially true for projects which have the intent of producing unqualified texts of public domain works (thereby implying faithfulness and accuracy) — such projects have an obligation to offer a digital text faithful and reasonably accurate to a known source book so the user knows what they are getting. This is somewhat like food labeling, so one knows what ingredients they are getting in their food.

Fortunately, Distributed Proofreaders (DP) is dedicated to this very goal, and their finished digital texts are being donated to the Project Gutenberg collection at a fairly fast clip. However, DP came on the scene relatively late in the game, so the most popular, classic works were already in the PG collection when DP arrived. As a result, DP has mostly focused on the lesser-known works, many of which are good, but will never be widely read compared to the great classics.

Unfortunately, however, in the PG Collection the great classic works, such as Tarzan of the Apes, are of unknown faithfulness and accuracy to an unknown (not recorded) source work (is that enough unknowns?) Even if they were digitally transcribed with “rigor” (to be clear, I believe a number of the earlier PG texts are of high-quality), how does one know? In effect, PG does not support the concept of a “digital text master,” preferring to be a “free-for-all archive” of whatever someone wants to submit. Until recently when the policy was changed for new texts, PG wouldn’t even tell you the provenance of what had been submitted — that information was intentionally stripped out.

The ultimate losers here are the users of the digitized public domain texts. They are, by and large, a trusting group, and simply assume that those who created the digital texts did their homework and faithfully transcribed the best sources. One reason for this TeleBlog article is to point out to users that if it is of concern to them, they should be more demanding and wary of the digital texts they find and use on the Internet. Be good consumers!

Especially beware of boilerplate statements that say such-and-such a text may not be faithful to any particular source book — if not, what is it “faithful” to? Should you spend a significant part of your valuable free time reading something of unknown provenance and faithfulness? This is especially true in education, where it is important the digital texts students use be of known provenance, and that the process of text digitization was guided by experts (and in effect “signed” by them) to assure faithfulness and accuracy — to be trustworthy.

The “Digital Text Masters” Project

For the above reasons, a few of us are now studying a non-profit project to digitally remaster the most well-known public domain works of the English language (including translations.) We will focus on about 500 to 1000 works in the next decade or so, so unlike DP which is understandably focused on numbers because there’s a lot to digitally transcribe (and they will do a good job at getting those books done), our focus will be on a very small number of the great works, and we will give them the full, royal treatment with little compromise. We will “do them right” and when in doubt, will come down on the side of rigor even if it appears to some to be overkill.

Here’s what we tentatively have in mind:

  1. Of course, we have to begin generating the ranked list of works we’d like to digitally master over time. This list will not be etched in concrete, but will continue to morph. We will not only focus on fiction (although fiction may dominate the early works due to fiction’s overall simpler text structure and layout), but we will consider some of the great works of non-fiction which had significant influence on human progress.

  2. For each Work, we will consult with scholars and lay enthusiasts to select the one (or more) source books that should be digitally mastered. The Internet now makes it very easy to bring together a large number of experts and enthusiasts and draw upon their collective wisdom.

    (Note, importantly, that for some Works there may be more than one source edition selected to be digitally mastered. We do NOT plan to choose one particular edition and call that “canonical” and then eschew all others. Selection of source books to digitally master is on a case-by-case basis. If someone wants to put in the work/resources to focus on a particular source book, their work of love, we won’t stop them so long as what they do follows all the requirements and the resources are there to properly get the job done.)

  3. We will find, or make ourselves, archival-quality scans of the selected source books. (Purchasing source books will be considered.) Archival quality means the master scans will likely be done at about 600 dpi full color with minimum distortion, and saved in lossless format. Calibration chart scans should accompany each scan set allowing for quality checking and normalization. Care will be taken to assure complete, quality scans of all pages, including the cover, back and spine. In essence, we don’t want someone to hurry to scan the source book, but rather take their time and do it right. Derivative page scan images (such as lower-rez versions) will be made available online alongside, and linked from, the digital text masters.

  4. We will use a variety of processes to generate a very highly accurate text transcription of the source book. Such processes include OCR, multiple key entry, and a mix of the two in various combinations, along with running machine-checking algorithms for anomalies. The goal is a very low error rate. DP may be used, if DP agrees to participate (they are overwhelmed as it is with their current focus on the more obscure works), but we need to investigate multiple ways to do the actual textual transcription and to get a good measure of the likely error rate. Preservation of the actual characters used in the source books (including accented and special characters) will be done using UTF-8 encoding.

    The process used to digitally master a given text will be meticulously recorded in a metadata supplement, including special notes particular to the source book. (Unusual and unique exceptions requiring special decisions and handling are likely to be encountered in most source books — this is where the consulting expertise of DP will be of great help.)

  5. An XML version of the digital master will be created using a high-quality, structurally-oriented vocabulary, such as a selected subset of TEI. Original page numbers, exact line breaks, unusual errors which have to be corrected rather than flagged, and other information will be recorded right in the markup.

  6. Library-quality metadata/cataloging will be produced for each digital master.

  7. Several derivative user formats will be generated and distributed for each original digital master.

  8. A database archive of all the digital text masters, associated page scan images, and all derivatives will be put together to allow higher level searching, annotation, and other kinds of interactivity.

  9. A robust system will be set up to allow continued error reporting and correction of the existing digital text masters in the archive. Even though we plan very low error rates, we know some errors will slip through.

  10. The project will bring together scholars and enthusiasts to build a library of annotations for each digital text master (especially useful for educational purposes), as well as encourage the addition of derivative editions (fully identified as such) for each Work.

  11. We also would like to build real communities around the various works. For example, for each of the most popular works we may build an island in Second Life (or whatever will supplant Second Life in the future — Google is rumored to be working on a “Second Life killer.”) We want to make the books come alive, and not just be staid XML documents sitting in a dusty repository following the old-fashioned library model.

  12. We will heavily promote the Digital Text Masters archive, especially for education and libraries since the collection will find ready acceptance there because of its quality, trustworthiness and metadata/cataloging. It will also be easier to produce and sell authoritative paper editions.
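As an illustration of item 5 above (page numbers, exact line breaks, and corrected errors recorded in the markup), here is a minimal sketch in Python. It assumes the TEI elements `pb` (page break), `lb` (line break), and `choice`/`sic`/`corr` would be part of whatever subset is eventually selected; the article does not commit to these. The helper names and the sample text, including its misprint, are invented for the example.

```python
import xml.etree.ElementTree as ET


def append_text(parent, text):
    """Add text after the last child of parent, or as its text if it has no children."""
    if len(parent):
        last = parent[-1]
        last.tail = (last.tail or "") + text
    else:
        parent.text = (parent.text or "") + text


def encode_page(page_no, printed_lines):
    """Encode one printed page as a paragraph with TEI-style markup.

    printed_lines is a list of lines; each line is a list of segments.
    A segment is either a plain string or a (sic, corr) tuple holding the
    reading as printed and the corrected reading.
    """
    p = ET.Element("p")
    ET.SubElement(p, "pb", n=str(page_no))  # original page number
    for line in printed_lines:
        ET.SubElement(p, "lb")  # exact line break in the source book
        for segment in line:
            if isinstance(segment, tuple):
                sic, corr = segment
                choice = ET.SubElement(p, "choice")
                ET.SubElement(choice, "sic").text = sic    # as printed
                ET.SubElement(choice, "corr").text = corr  # corrected reading
            else:
                append_text(p, segment)
    return p


# Hypothetical sample page: two printed lines, one misprint.
page = encode_page(12, [
    ["An invented line of printed text"],
    ["with one ", ("mispirnt", "misprint"), " in it."],
])
markup = ET.tostring(page, encoding="unicode")
print(markup)
```

Because both the printed reading and the correction are kept, a derivative edition could choose either one without going back to the page scans.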

I could go on and list a few more, and expand on each one in more detail, but that should give a representative overview of the general vision.

We are looking at a few funding/revenue models (one of which is quite innovative) to help launch and maintain the project. The highest costs may be for double or triple key entry should we have that done commercially for any if not all source books — the remaining major cost may be for the design and maintenance of the database as well as developing other tools, some of which might be useful for other projects to digitize texts such as DP, and of course to benefit PG as well.
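To make the double key entry idea concrete, here is a rough sketch of how two independent keyings of the same passage could be compared by machine so that only the disagreements go to a human checker. It uses Python's standard `difflib` module; the function name, the word-level comparison, and the sample strings are assumptions for illustration, not a description of any process the project has settled on.

```python
import difflib


def flag_disagreements(keying_a, keying_b):
    """Compare two independent transcriptions word by word.

    Returns a list of (word_index_in_a, words_from_a, words_from_b) for
    every place where the two keyings differ.
    """
    words_a = keying_a.split()
    words_b = keying_b.split()
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    flags = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            flags.append((i1, words_a[i1:i2], words_b[j1:j2]))
    return flags


# Hypothetical sample: the second typist mis-keyed one word.
first = "the cat sat on the mat"
second = "the cat sat on tbe mat"
for index, from_a, from_b in flag_disagreements(first, second):
    print(index, from_a, from_b)
```

The premise of multiple key entry is that two typists rarely make the same mistake in the same place, so agreement is treated as evidence of accuracy. The comparison works on Unicode strings, so accented characters preserved in UTF-8 are compared as typed ("café" and "cafe" would be flagged).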

The project plans to start small and controlled, especially in the early phase where R&D to shake out remaining unknowns will be conducted, and work our way up from there. Proper governance and management will be put into place as early as possible. Ties to academia and education, the library community, and various organizations involved with digitizing or adding to the public domain (such as the Internet Archive, Open Content Alliance, the Wikipedia, etc.) will be actively pursued, but the general vision must not be compromised by those ties. I am in informal talks now with several organizations.

We need you!

[Image: We Need You!]

Of course, the most important thing we need is you. If you agree with the general goals and approach of the “Digital Text Masters” project, and are interested in being involved in some capacity, then step forward. We are especially looking for people who enjoy the great classics, are detail-oriented, and believe in doing things right the first time even if it takes more effort. If interested, contact me in private, jon@noring.name, and let’s talk about your thoughts and what interests you. A teleconference among the interested people (who will be known as the founders) is being planned once we get a minimum critical mass brought together.

I look forward to hearing from you. And it would not surprise me if we see a number of comments below, both critical and supportive, of the idea (if you support it, I hope you will comment!) I’m already anticipating some of the arguments to points which were not covered in this article for brevity’s sake.

(p.s., the use of the Digital Text Masters collection as a “test suite” to improve the quality of processes to auto-generate text from page scan images is discussed in my comment to the original Tarzan of the Apes TeleBlog article.)

(p.p.s., the “We Need You!” graphic is by Ben Bois. Although associated with OpenOffice, it is nevertheless a cool graphic!)

11 Comments on ‘Digital Text Masters’ (Digitizing the classic public domain books)

  1. Karen Lofstrom // February 13, 2007 at 1:51 am //

    Quietly, without much fanfare, DP has already started redoing some works that were done before DP joined Project Gutenberg. I don’t think that TPTB (The Powers That Be in DP) are opposed to this in principle; they’re just moving carefully.

    I should think that DP would want to participate in the project. We already have the infrastructure and now, a five round system that is producing extremely high-quality works. Indeed, a Dickens novel would be a welcome break from working on The Diseases of the Horse and other such odd works meandering through the system.

    We aren’t limited to offering works to PG, so far as I know. I would guess that TPTB would want to make any results available both to you and to PG. But that’s just a guess.

    I can’t speak for TPTB. However, you shouldn’t fear to approach them.

  2. Karen, that is very good news about DP remastering some of the older PG works! It’s something several of us, for a long time, have encouraged them doing. I’ve mentioned it at least two times in the TeleBlog, and several times on gutvol-d over the last few years. With the new PG policy of allowing provenance information in the new texts (bravo to Michael Hart and Greg Newby), this is definitely a great development where the winners are the readers.

    And I’ve not heard about the five round system. Does this mean five proofreading rounds? Looking at the DP home page, it still shows only three. One thing that interests me is how accurate is the three proofreading system? That’s one thing I hope DTM will be able to answer.

    For others reading this, DP is focused on the production side of the texts. As far as I know they still only donate their work product to PG, and make no other use of the texts. In addition, they have not (as far as I know) instituted an archival scanning requirement or recommendation, although individuals scanning books could do so if they want. (For one book I submitted to them, an original of Burton’s Kama Sutra of Vatsyayana which still needs to be proofed, the scans were done at archival quality and then downsampled to meet DP’s preferences.)

    I’ve even toyed for a while with the idea of starting a “Distributed Scanners” group (which could still be launched to support the “Digital Text Masters” project) — just find a bunch of persnickety people (maybe with a touch of obsessive-compulsiveness to assure things are done right <laugh/>) with suitable scanners (e.g., high-grade sheet feed scanners and the Plustek OpticBook) and steer them to scan the great classics according to a set of requirements and guidelines. DS would, of course, build a database for archiving the scan sets along with donating copies of the originals to the Internet Archive (the scan sets, at archival quality, will take up a lot of space, like five or more gigs per book, but with terabyte drives getting dirt cheap, and burning DVD’s also getting cheap, space is no longer an issue.)
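The "five or more gigs per book" figure can be sanity-checked with some back-of-the-envelope arithmetic. The sketch below assumes 24-bit color (3 bytes per pixel) at 600 dpi; the page size (6 x 9 inches), page count (300), and 2:1 lossless compression ratio are purely illustrative assumptions, since real books and real compression results vary widely.

```python
def uncompressed_scan_bytes(width_in, height_in, dpi=600, bytes_per_pixel=3):
    """Size in bytes of one uncompressed page scan (24-bit color by default)."""
    return round(width_in * dpi) * round(height_in * dpi) * bytes_per_pixel


def book_scan_gigabytes(pages, width_in, height_in, dpi=600, compression_ratio=2.0):
    """Estimated size in GB (10**9 bytes) of a full scan set after lossless compression.

    compression_ratio is an assumed figure, not a measured one.
    """
    raw = pages * uncompressed_scan_bytes(width_in, height_in, dpi)
    return raw / compression_ratio / 1e9


# Assumed example: a 300-page book with 6 x 9 inch pages.
print(uncompressed_scan_bytes(6, 9))      # bytes for one raw page
print(book_scan_gigabytes(300, 6, 9))     # GB for the whole book
```

Under these assumptions one raw page is about 58 MB and the whole book lands near 9 GB, which is in line with the "five or more gigs" estimate.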

    So our vision for DTM, should it get launched, is much more comprehensive and wide-ranging. It certainly could take advantage of the DP system and if DP wants to be involved. (I don’t want to be presumptuous here — Juliet will understandably require that there be meat and real potential in the DTM proposal for her to commit any official DP mindshare to it.) A copy would go to PG, which is important, and we would get a copy as well. DTM would, of course, take the lead in securing the archival-quality and QC’d scans.

    And DTM would work to start building the text annotation project (either alone, or preferably as part of David Rothman’s LibraryCity venture should he succeed at getting that funded in time) — important to make the texts much more useful in the educational market. Btw, since DTM will probably do a similar approach to mastering as what DP will do in the future with PGTEI, I see the works done by DP, which DTM won’t do, to augment the DTM works in the database we set up — again more synergies between the projects.

    Since DP and PG are closely linked, PG will benefit from this as well. DTM is not intended to compete, but to complement PG and DP, and possibly open up doors the other projects will have more difficulty in opening. Visualize the association as a Venn diagram, with three circles intersecting with each other — DTM will have some of its circle outside of the others. The result is that the projects in toto will cover a bigger area.

  3. Karen Lofstrom // February 13, 2007 at 1:56 pm //

    It’s three proofreading rounds, then two formatting rounds — that’s five — and then a post-processor pulls everything together, makes sure that it’s consistent, and then sends it on to someone who checks and uploads the text.

    As for redoing old texts, someone commented on it in the forums, but I don’t think there’s been any announcement. I’m not sure what’s been tackled other than Robert Louis Stevenson. We’re doing a multi-volume collection of his complete works.

    The question of displaying the images as well as the etext is a vexing one. Many people in DP would like to do so and I understand that we’re holding the scans that we made and we control. However, we’ve done a fair number of books based on borrowed scans from Gallica, Google Books, and other such sites. Pictures of two-dimensional texts or pictures that are out of copyright can’t be copyrighted and if we were to display the “borrowed” pictures, we’d probably be OK but … it seems better not to push it.

    I hope that I haven’t gotten myself in hot water here, by seeming to speak for DP. I’ve been volunteering there for four years, but I’m just a foot-soldier, not one of the generals.

  4. Thanks, Karen, for clarifying.

    We’ll see how this unfolds. Already I’ve gotten a couple private messages from “high placed people” with “well-known organizations”. We’ll see if there’s any legs to these and other inquiries.

    No matter what happens, I think the idea of “digital text masters”, no matter how it is implemented, is striking a positive chord with a lot of people. I believe a lot of people do care about the faithfulness and provenance of the public domain texts they read, especially when it is brought to their attention. The typical reaction appears to be “I hadn’t thought about it before, but yeah, maybe I should be more concerned.”

  5. Michael Lockey (Vasa) // February 13, 2007 at 7:27 pm //

    First of all, I am no part of any cabal; just another foot soldier at DP. By and large, this whole argument seems rather futile: was something done “wrong” by anyone? New tools and methodologies evolve; and that which was available years ago, now is antiquated. It does NOT mean that work is wasted: if nothing else, it served as the inspiration to demand better. How did the first, poor, translations of Gilgamesh make it into print? OTOH, should those first translators feel their work has been discounted? Egos, egos.

    We throw words like ‘bowdlerized’ around as insults, but has anyone ever read his introduction? Poor Tom did his best, and is treated as the worst. We each fall somewhat shy of others’ expectations.

    Right now, I’m trying to get my head around the Dublin Protocols which, I feel, would solve a lot of these issues. Shall we all, then, stop work on everything, until we know we’re working on the ultimate edition?

    At least book burning is easier in this day:

    pip *.*/del

    Come on: we’re supposed to be the good guys! There’s lots of real enemies. We don’t have to emulate a Canadian Prime Minister, who described his party as like the early settlers who would find themselves attacked by Indians, circle their wagons, and start firing inwards…

  6. Project Gutenberg and DP quality standards are improving, and so are the tools and procedures in use. Although I believe there will be a market for re-done, certified very high-quality versions of what some may consider the canon of English literature, I also think that a lot of other works warrant attention, in particular those with considerable added value, such as reference works or old magazine runs, which are often much more difficult to get access to. The canon of English literature is relatively easy to get hold of, but if you could scan and process a set of early magazines, you would add materials that are currently difficult to access.

    Projects like Google Books are changing the playing field as well: a tremendous number of books are now becoming available, which means our focus can move from scanning to harvesting and adding value (by proofreading and proper tagging) to what is already available.

    Another point where PG is gradually improving is metadata. We now have a working catalog. It would also be very nice to build a working ‘reading room’ application around the collection, where people can read, annotate works, and share these annotations with friends or the community.

    For languages other than English, we are where PG was 10 to 20 years ago. For both Dutch and Philippine-related works, where I have been very active in growing the collection, we are still not through the canon of literature. That is, if such a thing can be demarcated: I have my reservations about an elitist view of literature, and would be just as happy with an obscure penny novel. Sometimes, these things can be hidden gems, and sometimes they have rightly been relegated to the realm of obscurity. Especially in developing countries, like the Philippines, works have become very difficult to obtain, due to low printing volumes, low quality paper, two devastating wars, and adverse climate conditions. Similarly, Dutch language literature from what is now Indonesia is extremely difficult to obtain, as it was often printed in low volumes on cheap paper, and very few copies have ever reached Europe.

    The current long copyright terms are also very harmful for the promotion of literature. In the time-span between commercial non-profitability and entering the public domain, where they are again free to build upon, works get so far out-dated and out-of-touch with living culture, that they have effectively died… One of PG’s purposes, in my opinion, is reviving such works from oblivion.

  7. Thanks, Jeroen, for bringing up some important points.

    I agree that among the huge corpus of “non-canon” public domain books (in whatever language), there are a significant number which are truly gems that people should know about, and should warrant special digital mastering. They should be added to the “canon.”

    Two further comments on this point:

    1. The proposed “Digital Text Masters” will not etch in stone the “canon”, but rather will be quite flexible as to what works it encompasses. The emphasis, though, especially in the early years, will be on those public domain books oft-used in education (both K-12 and post-secondary) and regarded by experts and lay enthusiasts to represent the best or the most influential of the public domain.

      If someone believes a particular book should be added to the DTM collection, when it might otherwise not be considered, they may make their case. If they are willing to share in some of the costs (if any) and human effort of the digital mastering, they may get the go-ahead to include that book in the DTM collection. The details of the whole selection process still need to be shaken out, and will probably evolve over time.

    2. Once DTM is established and most of the bugs shaken out of it, DTM can certainly diversify into other books and periodicals that make sense to digitize to DTM quality and for inclusion in the DTM database/archive. It’s hard to know the future, really, but we should definitely be ready to diversify. I have some ideas, but they are still nascent so I won’t describe them here.

    Finally, DTM is not intended to replace Distributed Proofreaders, which we fully support in its task to digitize the large number of public domain books without regard to “canon.” In fact, I see some sort of working relationship develop between the two organizations (even if DTM proofing ends up not using DP or its system), and with PG which appears to be evolving to a general text archive from many sources.

    For example, since DTM is intended to be a formal, more heavily-funded organization (while DP will necessarily always be more of a grassroots, volunteer-driven group even though it does have 501c3 status), I see DTM advancing the various technologies which could be shared with DP and with PG. So in some respects DTM might become the “technology development” arm of the large, multi-organizational community to digitize the public domain texts.

    Something to think about, at least. If DTM becomes as successful at generating sustaining revenue as I believe it can, I envision DTM donating funds to DP and PG, preferably in a matching donation sense in order to spur others to donate to those organizations.

    Well, I am getting well ahead of myself, since we haven’t yet organized, have no funds, nor produced anything! But I wanted to share how I see DTM relate to DP and PG, and of course to share some long-term goals should DTM get launched.

    If nothing else, this exercise provides one vision of the future of the effort to digitize public domain texts, and it will add to the “public domain idea database” that others may draw from.

    A related point I think is important to mention which I haven’t yet:

    I observe that the public mindshare these days is on the scanning of public domain texts (which is great to see happen!) The downside of this is that in various quarters this quite public focus on scanning is hiding the advantages to users of having structured digital texts. Even though we know the advantages, we are so far not getting the message out — many see having scanned images plus raw, unproofed OCR text as more than sufficient to use these texts.

    I see DTM as helping to get the message out that there’s significant benefit to society to create proofed and structured digital texts of many if not all the public domain works. Focusing on the great classics, though they be few in number compared to the entire corpus of public domain texts, helps in explaining the benefits of structured digital texts. It’s a little tougher when the texts being showcased are really obscure, arcane, and in some cases truly bizarre works. But talk about Mark Twain, or the Brontë sisters, and everyone recognizes them.

  8. I believe Google has been using book scanners that use infrared 3D scanning to measure the distance to the pages, including the curve of the pages, so that the scanned pages appear as flat images with few or none of the black depth marks that often come with book scanning. We usually carry out scanning using both methods. But the fastest way is always to slice the book and feed-scan the pages, if you are able to.


  9. I have to admire the dedication and enthusiasm that you and the few others give to transcribing the works of the English language.

  10. Great article. Has the project ever got off the ground? It’s 2013 today. Is it realistic that a project like this can ever succeed in our lifetimes? And, if a dominant language like English is struggling in this way, what do you think the situation is for digital literature in other languages, especially smaller languages? “Bleak” is a severe understatement.

    I for one am ready to give up, though not in the sense of reading whatever edition of a classic work comes to hand. Nope: I’m just as strict in that regard today as I used to be. I painstakingly search for the most reliable edition of *any* book I’m about to read, and read that edition exclusively.

    What has changed the overall situation in recent years, I think, is the emergence of mobile devices and platforms, and specifically the iPad. I love reading books in the *optimal* digital format, which definitely is EPUB. But, as mentioned above, is it *realistic* for us to see reliable EPUB editions of the canons of *all* the world’s languages within our lifetimes? It must be feared that it is not realistic.

    Therefore, much as I love the EPUB format, I must frequently turn to PDF files to read photographed, reliable original editions of classic works. I sometimes need to create those photographed PDF files myself, in my local library, using the great Scanner Pro app on the iPhone. (Who would have thought that a *telephone* might one day serve as a device to digitize classic literature? Today, this is real.)

    And this is where the situation has changed drastically when you compare 2013 and 2007. Back in 2007, the Kindle e-reader device was merely getting off the ground, but reading photographed PDF files on Kindle devices has always been a struggle. Certainly, the Kindle DX device with the larger screen (I purchased it right away) made PDF files more palatable, but it was still a huge pain reading books in that way. (Kindle DX is no longer developed today.) Particularly when it comes to *annotating* (to me, a crucial component of reading books), the Kindle e-ink devices are next to unusable when you talk about annotating *photographed* PDF files, rather than PDF files created through text conversion. An e-ink Kindle device gives you practically zero functionality in that regard.

    But all of those concerns have been alleviated through the arrival of the iPad in 2010. On the iPad, reading photographed PDF files of reliable, original paper-printed editions is a thing of beauty. It’s *not* as convenient as reading EPUB editions (for which I use the superb Marvin software), but it *is* very much practicable (unlike reading from a traditional computer or notebook with a vertically oriented screen), and if you use superior PDF software on your iPad (my preference is GoodReader), then, also, the most advanced annotation functionality is at your ready disposal.

    In fact, the GoodReader software is so superb that, paired with the iPhone’s exquisitely sharp “Retina” screen, it is perfectly usable even on the iPhone. Yes, believe it or not, the tiny screen is so sharp it makes reading photographed PDF files of original paper-book editions possible even on the iPhone (if you must, as “real life” issues are likely to demand for at least a few minutes or quarters of an hour every day). GoodReader keeps the two copies of your scanned digital book — one on the iPad, another on the iPhone — in sync, so that whatever annotation you happen to make on the iPhone while on the go, you will discover on your iPad when you resume reading and annotating that scanned PDF file there. Back in 2007, only 6 years ago, I would have thought any of this functionality and ease of use a miracle, but it appears we are living in an age of miracles: it’s all real today.

    And so, while I would very much welcome if a project like the Digital Text Masters were developed in future for all the world’s languages, this concern appears to be considerably less urgent from the perspective of 2013, than it used to be back in 2007.

    Instead, my main concern today would be with the *availability* of *at least* those photographed, reliable original paper-printed editions in PDF files. But, very frequently, they are still *not* readily available today — particularly for languages other than English. The Google Books (or, by extension, Internet Archive) online scans database helps, but it is far from comprehensive as of today.

    My second biggest concern nowadays is the excessive length of copyright protection, which really is detrimental to the appreciation and well-being of literature on a global scale, detrimental to the public domain, the world’s cultural heritage and therefore the public at large. I would have thought the original “life of author + 25 years” more than sufficient in terms of copyright protection, but that sounds like utopia today, as does successful suppression of online piracy, which to a large degree thrives due to the public’s perception of the excessive nature of contemporary copyright laws, along with excessive pricing of (many, though not all) e-books.

    Sometimes, the price is right, but the quality is not; Winston Churchill’s World War II tomes, which earned him a Nobel Prize for literature, are currently on sale from Amazon at $2 per volume, which is a *great* price, but I hear that those electronic editions are beset with dozens of typos, infesting every page. Who’s going to buy that? Conversely, at other times, the quality of an e-book is top-notch, but the price is blown out of all proportion. This appears to be the case for the electronic versions of books from the fabulous paper-bound “Library of America” series, where works of the American “canon” are published observing the reliability principles outlined by Jon Noring in the above article. But, again, who is going to spend dozens of dollars on an electronic edition of a public-domain book, when they can get the purportedly “same” book for free from Project Gutenberg? That the PG edition may be “corrupt” or at the very least “suspect” due to the missing specification of provenance will escape the notice of the general, non-expert reader.

    Another vexing aspect of commercially sold e-books is that they frequently, for no obvious reason, are not available for international purchase world-wide. This, sadly, applies to many or all of the electronic “Library of America” volumes; so the electronic editions of those typically public-domain literary works appear not only to be overpriced, but are not even available for purchase by those readers ready to make the substantial investment.

  11. Jon Noring sent me a note via e-mail earlier today, stressing his primary focus in the Digital Masters project was the production of the “master” format (“95% of the work”), from which “reading versions” could then later be produced with relative ease (“5% of the work”) — that the project’s focus wasn’t the production of the reading versions themselves.

    This provoked me to compose, in reply to Jon’s message, the following musings related to his concept of master formats, and how it chimes with my long-term goals regarding digital literature. I promise that this is my last long-winded TeleRead comment post for the time being!

    To me, the difference between 2013 and 2007 is the following: in 2007, I wouldn’t have even *considered* reading Leo Tolstoy’s _Anna Karenina_ in a scanned 19th century original Russian edition. But in 2013, I intend to do just that (I found this great edition: https://www.sugarsync.com/pf/D6495512_1736831_99821?directDownload=true and will be studying it on the iPad and iPhone in GoodReader). Who could seriously do that, though, with a novel of 1000 pages, using a traditional monitor or even a notebook screen? While it *can* be done, it is so extremely inconvenient and different from the typical book-reading experience, that it isn’t realistic to be engaged in a long-term study of literature in this way. Even health issues might arise, I’m afraid. This is where the introduction of the iPad has made all the difference. But the iPad, the hardware *by itself* would be useless, too, without the complement of first-class PDF software, such as GoodReader (which happens to be Russian, by the way), offering all the annotating functionality that a serious/scholarly reader of literature might need.

    So whereas in 2007, my *primary* focus would be on producing the digitized master format (because subsequently obtaining a “reading version” from it would be more or less trivial, as you say), my primary focus today, in 2013, would be *obtaining, or at least specifying the source material* for those future master formats. If we get as far as actually *producing* the master format — wonderful! If not — well, there would at least be the scanned files available as PDFs. So *that* is the dramatic change between 2007 and 2013: that nowadays, even the *source material* for a future master format can be a *reading version*, thanks to the iPad (or a similarly capable device) and GoodReader (or similarly capable software).

    In this sense, whereas in 2007, a “reading version” could only arise *from* a master format, today in 2013, a “reading version” can both *precede* a master format as well as *follow* it. :-) And that is truly a revolution. The reading version *following* from a master format would be much superior to the reading version *preceding* the master format, or in other words: EPUB is incomparably better than PDF, but *both* of those reading versions would be usable on a realistic, day-to-day basis, unlike in 2007, prior to the 2010 arrival of the iPad.

    Even the *initial*, very first, seemingly preparatory task — determining the *source material* for the future master format — shouldn’t be underestimated. Lots of controversies can arise as to which paper-bound original edition, exactly, it is that is to serve as the model for the master format. Yes, you provided for that, Jon, by allowing for simultaneous editions of the same work to be produced as master formats — even then, though, what a general reader would expect, I think, is some guidance as to *which* of the 2 (or more) editions of the same work would be *most* recommended for reading. One can hardly expect a reader to read a single literary work from several editions simultaneously. :-)

    So I believe that even these very first initial steps — determining the optimal source material for the future master formats — can be extremely demanding on time to accomplish. Let alone the actual physical scanning of books. While the iPhone (it truly is amazing), using the Scanner Pro app, is capable of scanning books in such high quality that they are then 100% sharp and perfectly legible on the larger iPad screen, the images created on the iPhone are unlikely to comply with the strict scanning quality requirements you mentioned in the blog article (600 dpi, archival quality, etc.). Still, I think that in the absence of “archival quality” scans, even scans produced by volunteers using telephone cameras, as long as the images are 100% sharp and legible on tablet-sized screens (and that *is* true of pictures made by the iPhone camera), are, for the time being, better than no scans at all.

    So, here is where I would see my focus in the upcoming years and decades: unlike back in 2007, the focus wouldn’t be on the actual *production* of the master format and the “reading versions proper” resulting from it — but on the steps *preceding* the production of the master format; all the *preparatory steps* needed so that the master format can, in later years, be actually produced. Whereas in 2007, after determining the source material for the master format, I would move right ahead into the production of the master format itself, because “abandoning” the project at that stage would have made no sense, due to an electronic “reading version” still not being available, in 2013, I very well can afford to stop right there: I can *determine* the source material for the future format, and I can also physically *scan* it (if it is not already scanned by Google Books, for example; and scanning it in a quality that is realistically achievable for a busy non-professional or volunteer digitizer, which will often mean those iPhone-type camera scans). But that’s it, I can stop right there: as soon as the scanned pages are available, a “reading version” is instantly available, too. *That’s* where 2013 differs from 2007.

    And so, while the preferred 2007 work-flow for digitizing books would be:

    Book 1: determine source material –> scan it –> produce master format –> produce reading version
    Book 2: determine source material –> scan it –> produce master format –> produce reading version
    Book 3: determine source material –> scan it –> produce master format –> produce reading version
    etc., etc. (the list could contain the *entire* canon)

    … in 2013 my preferred work-flow is as follows:

    Book 1: determine source material –> scan it
    Book 2: determine source material –> scan it
    Book 3: determine source material –> scan it
    Book 4: determine source material –> scan it
    Book 5: determine source material –> scan it
    Book 6: determine source material –> scan it
    etc., etc.

    I have “shortened the work-flow”, and that’s why I can do 6 books in 2013 instead of only 3 books in 2007. (That’s just a random example, of course.) This means I’m now skipping the last 2 steps: the production of the master format and of the “reading version proper” (EPUB). This does not mean I’m *abandoning* those steps for good, just *deferring* them to an unknown future (perhaps a very distant future) because the *priorities* between 2007 and 2013 have shifted significantly. Whereas in 2007, the priority would have been to get as many books as possible into the “reading version proper” (EPUB), the 2013 priority would be to have as many books as possible *scanned* (but only after determining the most reliable edition!), knowing that the last two steps can always be accomplished later on, and not just by me, but by anyone, using the provided scans. And if they are not, well, there is always that supplementary, “temporary” (but temporary for decades, centuries??) solution of reading the photographed PDF files on iPad-like devices using GoodReader-like annotation- and sync-enabled software.

    As to the “ultimate goal”, Jon, I would imagine a site like Wikipedia, but devoted to the literary “canons” (a questionable term, I know) of all the world’s literatures. Project Gutenberg and DP are fantastic efforts, but appear to be focused almost exclusively on literature in English. My model for the “canon(s) site” would be Wikipedia. When you open a typical Wikipedia page, say this one: http://en.wikipedia.org/wiki/Earth … in the left column you can *instantly, on-the-fly* switch over to the display of the same page in practically all the other languages of the world.

    My dream would be to have a similar site, but devoted to the “canons” of each of the world’s languages. With a single click of your mouse or tap of your finger, you could move from the English-language canon to the German-language canon, or French, Japanese… whatever! All the books would be listed and catalogued there with the emphasis on clarity of organization. *Some* books would already have gone through all 4 stages outlined above (determine source –> scan it –> master format –> reading version proper), while other books would only have gone through 3, 2, or just the initial 1st stage (= determining the optimal edition to scan, but the scan not yet being available).
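
    As a rough illustration of how such a catalogue could track each book's progress through the 4 stages, here is a minimal Python sketch. The names `Stage`, `CanonEntry` and `by_language` are hypothetical, and so are the stage assignments for the two sample books; nothing here describes an existing site.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List


class Stage(IntEnum):
    """The four workflow stages, in the order they are completed."""
    SOURCE_DETERMINED = 1
    SCANNED = 2
    MASTER_FORMAT = 3
    READING_VERSION = 4


@dataclass
class CanonEntry:
    title: str
    language: str   # language code of the canon the book belongs to
    stage: Stage    # highest stage completed so far

    def available_formats(self) -> List[str]:
        """Return what a reader could open at the entry's current stage."""
        formats = []
        if self.stage >= Stage.SCANNED:
            formats.append("PDF scan")   # page images, readable on a tablet
        if self.stage >= Stage.READING_VERSION:
            formats.append("EPUB")       # the "reading version proper"
        return formats


def by_language(entries: List[CanonEntry], language: str) -> List[CanonEntry]:
    """Select one language's canon, like switching languages on a wiki page."""
    return [entry for entry in entries if entry.language == language]


# Illustrative sample data only; the stages shown are assumptions.
catalog = [
    CanonEntry("Anna Karenina", "ru", Stage.SCANNED),
    CanonEntry("Tarzan of the Apes", "en", Stage.SOURCE_DETERMINED),
]

for entry in by_language(catalog, "ru"):
    print(entry.title, entry.stage.name, entry.available_formats())
```

    The point of storing only the highest completed stage is that a book which has merely been scanned already yields something readable, while a book stuck at stage 1 is still listed, just without a download.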

    The site’s 2 main differences from Project Gutenberg would be:

    1) It would be required to get *every* book at *least* through stage #1 (determining the optimal source edition), before a reading version (even the “inferior PDF” solution) or purchase link (see below) would be made available there. The precise problem of Project Gutenberg (or its Slovak counterpart project “Zlatý fond SME”) is that it *skips* this vital step #1. In fact, Project Gutenberg skips *all 3* initial steps, and moves right ahead to stage #4, providing a “reading version proper”, but at the expense of quality and reliability, which to me at least, makes most if not all Project Gutenberg releases “unfit for consumption”.

    2) The second difference from Project Gutenberg would be that, because it would be a “canon site”, it would also list literary works still under copyright protection. Never mind the squabbles as to what exactly should or shouldn’t be included in a language’s “canon”; that is an entirely different issue, for the settling of which mechanisms (preferably academic ones) could be devised (observing the “When in doubt, include rather than exclude” principle), just like Wikipedia has guidelines on “notability” determining who or what deserves or does not deserve to have his/her/its own Wikipedia entry. And so, the “canon site” would list copyright-protected books, too, with direct links to purchase them from respective electronic vendors (again, in the *most reliable electronic edition available*). If no reliable electronic edition is available, there would be no purchase link, either. This might (it would be hoped) *motivate* electronic publishers and copyright-holders to *release* a reliable electronic edition.

    In fact, I would be all for not only hyperlinking the copyright-protected book titles to the vendor sites, but for showing the book prices, too, directly on the canon pages. As I mentioned, e-books are frequently overpriced, and if canon site users observed something like (inventing an example here), “XYZ by Faulkner, $7; ABC by Hemingway, $16”, they might start asking questions: “Why is it $16 for Hemingway, but only $7 for Faulkner? They were contemporaries, weren’t they, and have both been dead for many decades now. Why is one of them a lot more expensive than the other?” I believe this might be a way to exert pressure on copyright-holders with the aim of the eventual lowering of e-book prices to reasonable levels. Let publishers overprice the current bestsellers (especially those of a questionable nature) for all I care, but for books which are held to be parts of a nation’s *canon*, of the nation’s cultural *heritage*, they shouldn’t be so unashamed as to keep overpricing *electronic* books by *classic* writers who have been dead for many decades. Yes, it’s a naive hope, supposing that by transparently listing prices of books by long-dead classic writers, one might perhaps induce a *drop* of those prices… but even if that hope didn’t materialize, listing the prices transparently directly on the canon site would be a convenient service for prospective canon book readers/buyers. Because e-book prices tend to fluctuate, go on-sale and off-sale, a dynamic mechanism would need to be devised to keep the prices fully updated on the canon site at all times; indeed, this could be a joint effort by the canon site along with e-book publishers, to offer “special sales” on canon books on certain days or times of year; the resulting revenues might even increase, if such marketing measures were properly executed.

    Enough of dreaming for today. :-) But that would, approximately, be my “e-book vision” for the upcoming years and decades.

