Adding books to The Internet Archive

By Branko Collin

June 27, 2006


Have you scanned public domain books lately to read as PDF or DjVu files on your PDA? Why not share them through The Internet Archive? TIA will take any book it can legally distribute. I wrote a small how-to for Distributed Proofreaders volunteers who wish to (pre-)publish high-quality scans of the books they are processing, and this how-to might be useful to others too.

In the USA, where The Internet Archive is based, a work is generally understood to be in the public domain if it was published before 1923.


12 COMMENTS

David Rothman June 27, 2006 at 8:40 pm

I couldn’t agree more with the above, Branko. Better to have the images for sharing than nothing at all! Meanwhile I’m delighted to see that how-to link up there. David

Garson Poole June 28, 2006 at 12:07 am

Great post Branko Collin and thanks for your work creating a “how to” web page. I think that book scans have wonderful potential.

A major problem with book scans is that they can be unreadable on small screens. Full-page scans are sometimes too large to fit directly on a small screen. Although it is possible to resize a page scan by scaling, the resized page might yield typefaces that are too small to read. Here is an idea for fixing the problem with software.

Take a scanned page from a book and perform automated analysis to identify the locations of the words on the page. Next, break the scan into a collection of rectangles such that each rectangle contains a word. Note each rectangle is simply a collection of pixels that form a rectangular shape. Next, reassemble the rectangles to yield a “reflow” of the scanned information. The reassembled page could have a smaller width so that it could more easily fit onto a small screen and retain readability. (A fancy version of the software might try to fix hyphens by pasting rectangles together when appropriate, but that is not necessary and the text should be readable even with redundant hyphens.)

All manipulations are performed on pixels and groups of pixels. This method does not perform full optical character recognition (OCR) on a scan. The reason for avoiding OCR is simple. Current OCR technology introduces too many typos.
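The reassembly step described above can be sketched in a few lines. This is only an illustration under stated assumptions, not an existing tool: the `Box` type, the `reflow` function, and the `gap` and `line_gap` spacing values are hypothetical names chosen for the example, and the word rectangles are assumed to have been found already and to be listed in reading order. The sketch only computes where each rectangle would go on the narrower page; copying the pixels is left out.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """A word rectangle in the source scan (hypothetical type for this sketch)."""
    x: int  # left edge, in pixels
    y: int  # top edge, in pixels
    w: int  # width
    h: int  # height


def reflow(words, target_width, gap=8, line_gap=6):
    """Place word boxes, given in reading order, on lines no wider than target_width.

    Returns a list of (source_box, new_x, new_y) tuples. Only positions are
    computed; no pixel data is touched.
    """
    placed = []
    x, y, line_height = 0, 0, 0
    for box in words:
        # Wrap to a new line when the word would overflow, unless the line is empty.
        if x > 0 and x + box.w > target_width:
            x = 0
            y += line_height + line_gap
            line_height = 0
        placed.append((box, x, y))
        x += box.w + gap
        line_height = max(line_height, box.h)
    return placed


if __name__ == "__main__":
    words = [Box(0, 0, 50, 20), Box(60, 0, 50, 20), Box(120, 0, 50, 20)]
    for box, new_x, new_y in reflow(words, target_width=110):
        print(box, "->", (new_x, new_y))
```

A word wider than the target width is simply placed alone on its own line in this sketch; a real implementation would have to scale or split it.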

Here is my question. Is there some readily available software that does this? Are there any ebook readers with this capability built-in? (I thought of this idea more than ten years ago but I am sure that it must be a very old idea.)

Branko Collin June 28, 2006 at 3:37 am

I am not aware of any software that does this, although Brewster Kahle once mentioned that there is software that will link an OCR-ed word to its position in a scan, so that its position (and presumably the rest of its dimensions) is known.

Branko Collin June 28, 2006 at 3:46 am

Yeah, see this thread at TIA. Apparently the DJVU OCR has the ability to output an XML file that contains the bounding box information of each word (sample here).
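For readers who want to experiment with such a file, here is a minimal sketch of pulling word boxes out of XML using Python's standard library. The XML layout is an assumption made for illustration: the sample below uses `WORD` elements with a comma-separated `coords` attribute, and the exact element names and coordinate order in a real file should be checked against the actual OCR output. To stay independent of the coordinate order, the sketch just takes the minimum and maximum of each axis.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; a real OCR export may use different names or ordering.
SAMPLE = """
<OBJECT>
  <LINE>
    <WORD coords="100,250,180,220">Internet</WORD>
    <WORD coords="190,250,260,220">Archive</WORD>
  </LINE>
</OBJECT>
"""


def word_boxes(xml_text):
    """Return a list of (text, left, top, right, bottom) for every WORD element."""
    root = ET.fromstring(xml_text.strip())
    boxes = []
    for word in root.iter("WORD"):
        parts = (word.get("coords") or "").split(",")
        if len(parts) < 4:
            continue  # skip words without usable coordinates
        x1, y1, x2, y2 = (int(p) for p in parts[:4])
        # Normalize so the result does not depend on the corner order in the file.
        boxes.append((word.text or "", min(x1, x2), min(y1, y2),
                      max(x1, x2), max(y1, y2)))
    return boxes


if __name__ == "__main__":
    for entry in word_boxes(SAMPLE):
        print(entry)
```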

Garson Poole June 28, 2006 at 7:02 am

Thanks for your quick and informative response Branko and thanks for the links. The fact that DJVU can output bounding box data is great. However, there are some complications when attempting to reflow a document. If the bounding box for each word is tight then the existence of a descender letter like “g” would move the bottom of the bounding box down. Also, the existence of an ascender letter like “b” would move the top of the bounding box up. This is undesirable if the bounding boxes are going to be used to guide reassembly.

Instead, you want all the rectangular bounding boxes to have uniform height. Further, you want the baseline for each word to be the same distance from the bottom of the rectangular bounding box. In essence the bounding boxes must be “regularized”. To make this easier it would probably be desirable to preprocess the scan to rotate it and make sure that each line of words is horizontal. I think that extracting regularized bounding boxes for each word from a scan is doable and desirable. (But, other blog readers must have more knowledge than I do on this topic.)
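The regularization described here could look roughly like the sketch below. It is an illustration only, and it rests on two assumptions not stated above: the scan has already been deskewed so each line of text is horizontal, and the baseline of a line can be approximated by the median bottom edge of its tight word boxes (most words have no descender, so their bottom edge sits on the baseline). The function name `regularize_line` is hypothetical.

```python
from statistics import median_low


def regularize_line(boxes):
    """Give every word box on one text line the same top and bottom edge.

    boxes: list of tight (left, top, right, bottom) tuples with y growing downward.
    Returns (regularized_boxes, baseline). The baseline is estimated as the
    median bottom edge, which is an assumption of this sketch.
    """
    if not boxes:
        return [], None
    baseline = median_low(b[3] for b in boxes)
    # Tallest ascender above the baseline and deepest descender below it.
    ascent = max(baseline - b[1] for b in boxes)
    descent = max(0, max(b[3] - baseline for b in boxes))
    top, bottom = baseline - ascent, baseline + descent
    return [(left, top, right, bottom) for left, _, right, _ in boxes], baseline


if __name__ == "__main__":
    # Second word has a descender; third word has a tall ascender.
    line = [(0, 10, 40, 30), (50, 14, 90, 36), (100, 6, 140, 30)]
    print(regularize_line(line))
```

After this step every box on the line has the same height and the same distance from baseline to bottom edge, so boxes can be placed side by side without words jumping up and down.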

The overall goal is to allow scanned ebooks to be read on screens with different sizes and different pixel resolutions. Perhaps the actual reflow of a document could be performed by an archival organization like archive.org. The archive could prepare more than one version of a document using different font sizes and different pixel widths for a line.

Bill Janssen June 28, 2006 at 11:53 am

There’s a great deal of prior art on this. UpLib, for instance, does what Garson suggests, and extracts bounding boxes for each word of each page, using a variety of techniques. Those interested might want to look at the paper entitled “Reflowable Document Images”, which describes a general technique for finding word bounding boxes developed by Tom Breuel.

Branko Collin June 28, 2006 at 6:17 pm

Bill, is the William C. Janssen who has co-authored that paper related to you?

Bill Janssen June 28, 2006 at 10:50 pm

Yeah, but the wordbox finding was all Tom’s work. For a more extensive treatment (that doesn’t have me as a co-author :-), take a look at Two Geometric Algorithms for Layout Analysis.

Branko Collin June 29, 2006 at 4:39 am

I meant: if you co-authored these papers, you are allowed to say so. 🙂 Now I was wondering what you were “hiding”.

I guess it would also be nice for Garson Poole to realize he could direct questions to you.

Bill Janssen June 29, 2006 at 1:56 pm

Oh, OK. Sure, he could, but finding those boxes isn’t my area of expertise. By the way, most OCR systems will give you word bounding-box information if you give them the right options. Unfortunately, the information you can get easily from DjVu’s OCR system is a bit spare; it doesn’t give you info like whether the text is in an italic or bold font. Most OCR systems will do that, though. Very useful for semantic analysis.

You can also look at this example of how our system worked.

Garson Poole June 29, 2006 at 3:48 pm

The papers that Bill Janssen referenced look great! Thanks for pointing them out. They seem to address the exact topic that I was suggesting. This type of work/research could be wonderful for improving the value of scans without relying on inexact OCR or painstaking human proof-reading. Sorry that I have not responded more quickly, but I have some high-priority demands on my time right now, and I want to look at the papers before commenting further.

Garson Poole July 1, 2006 at 1:06 am

The Google Library Project and the Open Content Alliance are at last beginning to scan millions of books. This means that one of the most important data formats for books in the coming years will be a collection of page scans. It is essential to maximize the value of this information by using creative and innovative strategies. The work done at Xerox on reflowing data from scans is very valuable. In my opinion, the organizations running ebook repositories and the groups designing ebook readers must take into account the great importance of reflowing scans. (Maybe they already are and I am just unaware?)

There are technical questions such as “Where should the computation needed to perform a reflow be performed?” For example, the computation might be performed by the book repository server. An individual that wants to read an ebook would indicate the type of screen he or she plans to use. He or she might give the model type of a PDA or ebook machine. Alternatively, he or she might give the size and resolution of a screen. Next the book repository would generate on-the-fly an ebook that is readable on the target screen. Alternatively, the ebook reader device itself could perform the reflow computation. Here are some questions that I think are relevant:

Is it possible for an ebook device to perform a reflow or will the computation be excruciatingly slow?

What is the best format and specification for the data that is extracted from scans that allows for reflow? Can this data be practically shoe-horned into an existing open specification?

Is Xerox or some other organization asserting patent rights in this domain?

What does Brewster Kahle of the Open Content Alliance think about reflowing data obtained from scans?


TeleRead.com is now a static archival site, but we're very much alive at TeleRead.org. Big thanks to Nate Hoffelder of The-Digital-Reader.com, who teamed up on the preservation project with ReclaimHosting.com.