Top: when wrapping plain text, Microsoft’s text editor Notepad displays jagged edges. Bottom: an HTML version of Anna Karenina in Microsoft IE.

Last week I demonstrated how to remake Project Gutenberg’s (PG) e-texts into well laid-out p-texts or e-paper texts by converting an HTML document from the project into a PDF file using a word processor. But what if the e-text is only available in PG’s much maligned text format (what I call PVT: Plain Vanilla Text)? In that case you need to put in an extra conversion step, from PVT to HTML.

Such a conversion step can also be handy if you just want to read the book on your PC, handheld, or phone, because unlike PVT, HTML wraps lines to the width of the screen or window.
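To make the idea concrete, here is a minimal sketch in Python of about the simplest PVT-to-HTML step imaginable. It assumes that paragraphs in the PVT are separated by blank lines and that every other line break is merely a hard wrap. The function name is my own invention for illustration and is not taken from any existing tool. Real PVTs also contain verse, tables and headings, which this sketch would happily mangle.

```python
import html
import re


def pvt_to_html_paragraphs(text: str) -> str:
    """Turn blank-line-separated, hard-wrapped text into <p> elements."""
    # Normalize line endings first, since plain text files may use CRLF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    paragraphs = []
    # Assumption: one or more blank lines mark a paragraph break.
    for block in re.split(r"\n\s*\n", text.strip()):
        # Join the hard-wrapped lines so the browser can re-wrap them.
        joined = " ".join(
            line.strip() for line in block.split("\n") if line.strip()
        )
        if joined:
            # Escape &, < and > so the text cannot be mistaken for mark-up.
            paragraphs.append("<p>" + html.escape(joined) + "</p>")
    return "\n".join(paragraphs)
```

The output is only a run of paragraphs; you would still have to wrap it in a proper HTML document with a head, a title and a body before opening it in a browser.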

First, don’t make things harder than they need to be. There are several projects on the web that have already performed the PVT-to-HTML step for you.

Blackmask used to be such a site; of the ones that still exist, Manybooks and Gutentalk have a good reputation. Project Gutenberg now also produces automatically converted e-texts, from PVT to Plucker, with more formats hopefully to follow.

Manybooks recently started to take things a step further: with its Custom Iliad PDFs and Custom HTMLs it lets you influence the conversion step, by letting you set margins, fonts, font sizes and so on.

If you want a bit more control, there are a number of conversion tools that you can download and run at home. For instance:

There are probably more. The last of these, GuiGuts, is a highly specialized tool that many Distributed Proofreaders use. It may look a bit daunting, but try it anyway; it comes with a manual. The other tools are all command-line based. Since GuiGuts expects the raw output from Distributed Proofreaders’ initial rounds, it isn’t a perfect fit for PVTs. For instance, instead of expecting underscores as mark-up for italicized phrases, GuiGuts expects HTML-like i-tags (<i> and </i>).
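The gap between the two italics conventions can be bridged with a small pre-processing step. What follows is a rough sketch in Python, not something that ships with GuiGuts: the function name and the regular expression are my own assumptions. The main assumption is that an underscore directly before a word opens an italic phrase and an underscore directly after a word closes it, while underscores inside a word are left alone.

```python
import re

# Assumed convention: _phrase_ marks an italicized phrase.
# The opening underscore must not follow a word character and must be
# followed by non-whitespace; the closing one must follow non-whitespace
# and must not be followed by a word character.
ITALIC = re.compile(r"(?<!\w)_(?=\S)(.+?)(?<=\S)_(?!\w)", re.DOTALL)


def underscores_to_i_tags(text: str) -> str:
    """Replace _phrase_ with <i>phrase</i>, also across line breaks."""
    return ITALIC.sub(r"<i>\1</i>", text)


if __name__ == "__main__":
    print(underscores_to_i_tags("She was reading _Anna Karenina_, again."))
```

If you also escape characters such as & and < for HTML, do that before calling this function, otherwise the freshly inserted i-tags get escaped as well. Expect to check the result by hand: a stray underscore used for something other than italics will confuse any pattern this simple.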

PG’s HTML

If you have a choice, go for the Project Gutenberg HTML version. I can almost guarantee that it delivers a richer text than the conversion tools can produce, because it was produced by hand or, in the case of a Distributed Proofreaders text, from a richer source format. PG’s HTMLs can contain images, music, well-formatted tables, hyperlinked footnotes and much more. See for example The Dead Men’s Song that I wrote about earlier. Choosing the HTML file, if available, will often be a no-brainer.

Share and enjoy

Once you have created your prettified classic, you may wish to share it with others. After all, a lot of hard work can go into such a conversion, and it would almost be a waste if you were the only one to enjoy the fruits of that labour.

The most logical place to make your PDF book available would perhaps be Project Gutenberg itself. But although PG will gladly accept many formats for a given book, it will typically refuse what it calls “blind format conversions”. A blind format conversion is one where the person producing the new version does not have access to the original scans.

Luckily, there are projects that will accept your PDFs, such as the Internet Archive (in its Open Source Books section) and Lulu. With the latter you can even make some money on the side, should your version be particularly attractive.

Bob Russell at Mobileread has set up a thread to discuss the best place to keep track of these prettified e-texts.

Of course, once you start hand-tweaking your rich format versions, it helps to know what exactly makes up PVT. In a week or so I will be taking a very, very close look at this format.

2 COMMENTS

  1. I just wish the PG HTML wasn’t so bass-ackwards. A lot of times, for example, rather than just use the blockquote tag, whoever has done the html will define a “blockquote” span section in CSS that does the same damned thing. Now I wouldn’t object if they were using blockquote tag and then defining a CSS appearance for that, but they’re using CSS to replace the functionality of the blockquote tag. And that actually makes it harder to convert to other formats than it needs to be (since most converters know what a blockquote tag is, but notsomuch some completely idiosyncratic “blockquote” span).

  2. I just wish the PG HTML wasn’t so bass-ackwards. A lot of times, for example, rather than just use the blockquote tag, whoever has done the html will define a “blockquote” span section in CSS that does the same damned thing.

    You are right, that is silly.

    However, HTML is an incomplete language for marking up books. Often the PG volunteer has to make up styles anyway, in order to capture more esoteric elements. And so you end up making an inventory of the styles and elements used anyway. Marking up a blockquote as a DIV is an inconvenience, but it appears to me a minor one.
