Have you scanned public domain books lately to read as PDF or DjVu files on your PDA? Why not share them through The Internet Archive? TIA will take any book it can legally distribute. I wrote a small how-to for Distributed Proofreaders volunteers who wish to (pre-)publish high-quality scans of the books they are processing, and it might be useful to others too.

In the USA, where The Internet Archive is based, a work is generally understood to be in the public domain if it was published before 1923.

12 COMMENTS

  1. Great post Branko Collin and thanks for your work creating a “how to” web page. I think that book scans have wonderful potential.

    A major problem with book scans is that they can be unreadable on small screens. Full-page scans are often too large to fit directly on a small screen. Although it is possible to resize a page scan by scaling, the resized page might yield type that is too small to read. Here is an idea for fixing the problem with software.

    Take a scanned page from a book and perform automated analysis to identify the locations of the words on the page. Next, break the scan into a collection of rectangles such that each rectangle contains one word. Note that each rectangle is simply a rectangular block of pixels. Next, reassemble the rectangles to yield a “reflow” of the scanned information. The reassembled page could have a smaller width so that it fits more easily onto a small screen and remains readable. (A fancy version of the software might try to fix hyphens by pasting rectangles together when appropriate, but that is not necessary, and the text should be readable even with redundant hyphens.)
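    The reassembly step described above can be sketched in a few lines. This is only a minimal illustration, assuming the word rectangles have already been found and are listed in reading order; the `WordBox` type, the `reflow` function, and the `gap` and `leading` parameters are hypothetical names made up for this sketch. It computes where each rectangle would go on a narrower page and leaves the actual copying of pixels to whatever imaging library is at hand.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class WordBox:
    """A word's rectangle in the original scan, in pixels (hypothetical type)."""
    x: int
    y: int
    w: int
    h: int


def reflow(words: List[WordBox], target_width: int,
           gap: int = 8, leading: int = 6) -> List[Tuple[int, int]]:
    """Return a new (x, y) top-left position for each word, in reading order.

    Greedy layout: words are placed left to right, and a new line starts
    whenever the next word would cross target_width.
    """
    positions = []
    x = 0            # left edge for the next word on the current line
    y = 0            # top edge of the current line
    line_height = 0  # tallest rectangle placed on the current line so far
    for word in words:
        # Wrap only if the line already holds a word; an over-wide word
        # still gets a line of its own instead of looping forever.
        if x > 0 and x + word.w > target_width:
            x = 0
            y += line_height + leading
            line_height = 0
        positions.append((x, y))
        x += word.w + gap
        line_height = max(line_height, word.h)
    return positions


if __name__ == "__main__":
    page = [WordBox(0, 0, 50, 20), WordBox(60, 0, 40, 20),
            WordBox(110, 0, 70, 20), WordBox(0, 30, 30, 20)]
    for box, pos in zip(page, reflow(page, target_width=120, gap=10, leading=5)):
        print(box, "->", pos)
```

    Each source rectangle would then be cropped from the scan and pasted at its new position. Because this sketch aligns rectangles by their top edges, tight boxes of different heights would not share a baseline.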

    All manipulations are performed on pixels and groups of pixels. This method does not perform full optical character recognition (OCR) on a scan. The reason for avoiding OCR is simple. Current OCR technology introduces too many typos.

    Here is my question. Is there some readily available software that does this? Are there any ebook readers with this capability built-in? (I thought of this idea more than ten years ago but I am sure that it must be a very old idea.)

  2. Thanks for your quick and informative response, Branko, and thanks for the links. The fact that DjVu can output bounding box data is great. However, there are some complications when attempting to reflow a document. If the bounding box for each word is tight, then a descender letter like “g” moves the bottom of the bounding box down, and an ascender letter like “b” moves the top of the bounding box up. This is undesirable if the bounding boxes are going to be used to guide reassembly.

    Instead, you want all the rectangular bounding boxes to have uniform height. Further, you want the baseline of each word to be the same distance from the bottom of its bounding box. In essence, the bounding boxes must be “regularized”. To make this easier, it would probably be desirable to preprocess the scan by rotating it so that each line of words is horizontal. I think that extracting regularized bounding boxes for each word from a scan is doable and desirable. (But other blog readers must have more knowledge than I do on this topic.)
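    As a rough illustration of that regularization, the sketch below stretches the tight boxes of one text line so that they all share the same top and bottom. It assumes the page has already been deskewed and that the baseline of the line is known from some earlier step; the `regularize_line` function and the `(left, top, right, bottom)` box layout are assumptions made for this example, not the output format of any particular tool.

```python
from typing import List, Tuple

# (left, top, right, bottom) in pixels; y grows downward, as in most image formats.
Box = Tuple[int, int, int, int]


def regularize_line(boxes: List[Box], baseline: int) -> List[Box]:
    """Give every word box on one text line the same top and bottom.

    The shared top is set by the tallest ascender on the line and the
    shared bottom by the deepest descender, so every box ends up with
    the same height and the same baseline-to-bottom distance.
    """
    if not boxes:
        return []
    ascent = max(baseline - top for _, top, _, _ in boxes)
    descent = max(bottom - baseline for _, _, _, bottom in boxes)
    return [(left, baseline - ascent, right, baseline + descent)
            for left, _, right, _ in boxes]


if __name__ == "__main__":
    # Three tight boxes on a line whose baseline sits at y = 20:
    # a word with no ascenders or descenders, one with an ascender,
    # and one with a descender.
    line = [(0, 10, 30, 20), (35, 4, 60, 20), (65, 10, 90, 26)]
    print(regularize_line(line, baseline=20))
```

    To get one uniform height for a whole document rather than per line, the same maximum ascent and descent could be taken across all lines before rewriting the boxes.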

    The overall goal is to allow scanned ebooks to be read on screens of different sizes and pixel resolutions. Perhaps the actual reflow of a document could be performed by an archival organization like archive.org. The archive could prepare more than one version of a document, using different font sizes and different pixel widths for a line.
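    If an archive did keep several pre-rendered versions, choosing one for a given device could be as simple as the sketch below. The function name `pick_version` and the example widths are hypothetical; nothing here reflects how archive.org actually serves files.

```python
from typing import Optional, Sequence


def pick_version(available_widths: Sequence[int], screen_width: int) -> Optional[int]:
    """Return the widest pre-rendered line width (in pixels) that fits the screen.

    Returns None when even the narrowest version is too wide.
    """
    fitting = [width for width in available_widths if width <= screen_width]
    return max(fitting) if fitting else None


if __name__ == "__main__":
    versions = [240, 320, 480, 640]  # made-up line widths, in pixels
    print(pick_version(versions, screen_width=400))
```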

  3. Oh, OK. Sure, he could, but finding those boxes isn’t my area of expertise. By the way, most OCR systems will give you word bounding-box information if you give them the right options. Unfortunately, the information you can get easily from DjVu’s OCR system is a bit spare; it doesn’t give you info like whether the text is in an italic or bold font. Most OCR systems will do that, though. Very useful for semantic analysis.

    You can also look at this example of how our system worked.

  4. The papers that Bill Janssen referenced look great! Thanks for pointing them out. They seem to address the exact topic that I was suggesting. This type of work/research could be wonderful for improving the value of scans without relying on inexact OCR or painstaking human proof-reading. Sorry that I have not responded more quickly, but I have some high-priority demands on my time right now, and I want to look at the papers before commenting further.

  5. The Google Library Project and the Open Content Alliance are at last beginning to scan millions of books. This means that one of the most important data formats for books in the coming years will be a collection of page scans. It is essential to maximize the value of this information by using creative and innovative strategies. The work done at Xerox on reflowing data from scans is very valuable. In my opinion, the organizations running ebook repositories and the groups designing ebook readers should take the importance of reflowing scans into account. (Maybe they already are and I am just unaware?)

    There are technical questions, such as “Where should the computation needed for a reflow be performed?” For example, the computation might be performed by the book repository server. An individual who wants to read an ebook would indicate the type of screen he or she plans to use, either by giving the model of a PDA or ebook machine or by giving the size and resolution of the screen. Next, the book repository would generate, on the fly, an ebook that is readable on the target screen. Alternatively, the ebook reader device itself could perform the reflow computation. Here are some questions that I think are relevant:

    Is it possible for an ebook device to perform a reflow or will the computation be excruciatingly slow?

    What is the best format and specification for the data that is extracted from scans that allows for reflow? Can this data be practically shoe-horned into an existing open specification?

    Is Xerox or some other organization asserting patent rights in this domain?

    What does Brewster Kahle of the Open Content Alliance think about reflowing data obtained from scans?
