HTML is simple and, in conjunction with CSS, offers enough markup to format almost any text for satisfactory reading.

So why do we need XML?

XML is a data format, one that allows every user to create identifiable structures that match the content rather than forcing it into some arbitrary, ill-fitting one.

If formatting alone were all we needed, maybe HTML would be sufficient. But in HTML we never know whether the “1999” we see in the text is a number in an address, a price, part of a phone number, the year, or a song title.

XML is self-documenting, and though “price,” say, has its ambiguities (wholesale price or retail price, in U.S. dollars or yen, etc.), the semantics of a <price> tag convey a great deal more than those of a <p> or <strong> tag.
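
To make the contrast concrete, here is a small sketch (the attributes on the <price> element below are my own invention for illustration, not any particular publisher’s schema). The HTML can only say how “1999” should look; the XML says what it is:

    <!-- HTML: formatting only; nothing in the markup says what 1999 means -->
    <p>List price: <strong>1999</strong></p>

    <!-- XML: an invented vocabulary whose markup documents the meaning -->
    <price kind="retail" currency="JPY">1999</price>

A program sorting a catalog, converting currencies, or building an index can act on the second form; with the first it can only guess.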

In a text, we have two types of information — that contained directly in the content, the stuff you get when you read and comprehend the language, and also information about the text. Some of this latter kind is contained in the presentation or formatting; the size, location, alignment, font, and so on, of some text reveal a heading, or perhaps that a phrase is a book title. Some of this information is pure metadata, information about the text that may not even be displayed or printed. And a great deal of information about the structure of the content is carried in the markup.

We can’t use HTML to carry this information, and we shouldn’t expect anyone who already uses XML to discard this structural information when their texts are published.

*   *   *

When proponents of a hobbled OpenReader argue that an HTML-based vocabulary is what publishers want or need in order to get into e-books, I’m not sure who they mean. They aren’t talking about commercial publishers, whose texts are marked up in a more informative manner than HTML provides.

They aren’t talking about information companies, which pull information out of databases (some of which is indeed offered up as HTML on the web); that information is all elaborately detailed in database schemas whose complexity surpasses HTML’s after only a few moments’ examination.

They aren’t talking about scholarly work, which requires clear distinctions in the markup for texts to be studied. Nor similar texts in government — laws, regulations, court cases, and such. Nor similar texts in business, especially in large corporations whose documentation efforts comprise large publishing entities in their own right — IBM, Boeing, and so on are the standard examples, but really any company that has any material whatsoever that validates against a DTD or other schema.

Of course HTML is comfortable and familiar, and famously easy enough to master. Someone new to the web can use it without pause, and there are so many tools for working with it that I wouldn’t dare to estimate their number. These people might like e-books to be in HTML. So would people who are publishing web pages as e-books.

But an OpenReader that lets users specify their own markup vocabulary will allow anyone to use HTML if that’s what they have and want. An open, XML approach doesn’t restrict any users. Only a requirement to take rich data, dumb it down to HTML, and prevent readers from ever seeing or understanding the information in the markup hurts anyone.

So that’s a major reason why I say OpenReader should have no restrictions on markup vocabulary.*

*   *   *

There are two visions of the e-book that inform our expectations. One models itself after the print book — an e-book is the vehicle for disseminating information electronically. This is a one-sided vision because it says the only information useful to the reader is of the first sort described above.

The other vision says everything about the e-book should be shared with the reader. This is an open approach, one in which “open” means everything is open. The most famous example of this type of publishing would be the World Wide Web, in which every page is revealed in all of its detail to every single reader who wants to look.

Raise your hand if you’ve never once looked at the source of a web page. Well, you probably won’t ever want to look at the source of an e-book either. But the rest of us will likely find occasions when this information is wanted. We must reject any arguments that casually dismiss access to the source as unnecessary or unwanted.

*   *   *

Before microcomputers — before word processing and then desktop publishing, before cheap, numerous fonts, before laser printers — the number of books published in the U.S. was something like 50,000 a year. When we made the tools of publishing available to anyone, the number of books published began to expand. There were 700,000 books published last year**. That’s an increase of more than a thousand percent in two decades. Almost all of the increase has to be attributed to the new tools.

I say that E Ink devices and UMPCs and Internet Tablets will make e-book reading easy and popular. And that within a decade or two we’ll see another ten-fold increase in publishing.

*   *   *

If we make e-books right, then an e-reader will be the program we read anything in. This will be the program*** that most readily adapts to our reading needs — formatting for sure, but also annotation, bookmarks, library control. Bookmarks in a browser work for dozens of websites, but get into the hundreds or thousands (yes) and wow are bookmarks hard to navigate. Really capable library tools meant for managing e-books will see us through.

If every document we read went through the e-reader — books, of course, but also work documents, research papers, email messages — then you can see why I envision another ten-fold increase in publishing and maybe a ten-fold increase on top of that. Of course “publishing” means something different under these circumstances and I don’t expect the number of books that require payment to increase this way.

But documents meant to be read — as opposed to being edited — ought to be considered apt material, just as apt as the books from HarperCollins, Simon & Schuster, and Random House. When we consider annotation as something that ought to be shared among e-readers, then you might think of the e-reader as the tool for best displaying a document and for most easily creating and reviewing annotations.

And is XML needed for that? Not necessarily for display, and obviously annotations can be effected in HTML. But that metadata, the information that’s going to help us with our library features and with sharing loads of information about the information, is best handled the way XML was designed to handle it: extensible when necessary, with markup that is self-documenting and self-explanatory.
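
As a sketch of what I mean (the element names here are invented for illustration, not a proposed standard), an e-book’s metadata block could carry the bibliographic basics and then be extended with whatever the library features require:

    <!-- an illustrative metadata block; the vocabulary is made up -->
    <book-metadata>
      <title>An Example Title</title>
      <author>An Example Author</author>
      <subject>e-book formats</subject>
      <!-- an extension a library manager could read to govern shared annotations -->
      <annotations sharing="allowed" last-reviewed="2006-05-01"/>
    </book-metadata>

The point is not these particular names but that the vocabulary can grow, in a self-documenting way, as the library features grow.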

Since our e-reader isn’t going to edit the markup, you have to consider the markup as information in the text — in our e-book — that provides compatibility with many other tools, editors among them. This is no insignificant feature. We want as versatile a format as possible. And we want our readers to have the information they need when they read the content, when they study the text and its structure, and when they want to edit it.

When I say anything other than XML will “hobble” the text, I mean it is the reader who will be hobbled, unable to use the single copy for all purposes. In the age of software, of electronic access, “open” has come to mean “the source is available to all.” It should be true of e-books, which are more than just the surface appearance of the text.

*   *   *

Restricting OpenReader to an HTML vocabulary benefits few, not many. I don’t see the market as demanding it and I dispute the contention that commercial publishers seek the limited utility of HTML.

Is XML adoption just a matter of timing? Do the limited thing now and add the fuller capability later, because it won’t be needed for years or decades (according to Jon Noring)? Will using XML now truly slow down acceptance of OpenReader by e-reader makers and publishers?

Or will it accelerate OpenReader’s adoption?

I think the more useful, more open format will speed things up. In 1999, HTML as the e-book basis was the only viable compromise among contending e-book makers. In 2006, with XML becoming the basis for Microsoft Word documents and other Microsoft programs, for OpenOffice documents of every sort, and for government documents, it’s just a matter of keeping pace with the need, not running out ahead of the pack.

OpenReader needs to accept any XML vocabulary, not just the restricted elements from HTML. Readers should get a text’s full source and not just a deracinated display version.


* Jon Noring likes to belittle this by calling it “roll your own” vocabulary. That, of course, encompasses the major design goal of XML — a non-restrictive, malleable markup. It is only when you require in advance that documents be shared among different groups that you want a rigid vocabulary. So various industries might specify terms and their meanings — XBRL, for example, or DocBook. And things like MathML or Chemical Markup Language (CML) pretty much also require everyone to use the same vocabulary for specific aspects. Using XML namespaces, however, I can use CML for my formulas and then my own vocabulary for my text. If I call it “individually tailored” or “adjusting to the content” or “flexible,” do you get a different image as to its practicality than labeling it “roll your own”?
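
A rough sketch of that mixing (the default namespace and element names are invented; the CML namespace URI is given from memory, so treat it as illustrative):

    <!-- my own, made-up vocabulary is the default namespace; CML is mixed in by prefix -->
    <essay xmlns="http://example.org/my-essay-vocabulary"
           xmlns:cml="http://www.xml-cml.org/schema">
      <para>Water, marked up here as <cml:molecule id="water"/>,
        sits inside text that uses my own element names.</para>
    </essay>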

** I’m publishing this essay without checking this fact from my memory. If I have misremembered this figure and consequently overstate the case, I think the argument still holds true.

*** Lest there be any mistake, I’m not saying that your e-reader will be foisted onto any and every program. But think about Microsoft Word for a moment. It has a special reading view, in which all the extraneous, distracting controls are removed and nice wide margins are provided as well as optimal line lengths and so on. You might be able to read every single type of document in your favorite e-reader but that doesn’t mean you won’t also read your Word documents in the reading view and your email in that program’s reading view and so on.


Note: For space reasons, I have not discussed the issues of accessibility and of displaying text when an e-reader does not know in advance what elements a text might contain. I will address them in forthcoming posts. Suffice it to say that these are not barriers at all to using author-selected XML vocabularies in e-books.

4 COMMENTS

  1. Well stated, Roger.

    It frustrates me watching “easier” known standards like HTML get picked up because the flexibility offered by XML is too “complicated”; the “easier” option *always* grows in an inelegant way to cover the same functionality, as the implementors begin to encounter the same problems that XML is meant to address. The example you gave in a previous article, over the usage of one element over the necessity of another, captured this perfectly.

  2. Of the text-based e-book formats (I’m excluding PDF and Sophie’s predecessor, TK3), the only non-HTML-based format I’m familiar with is FictionBook2. And yet none of them provide the capabilities you can find in a web browser. I just don’t understand why we are having to argue so strenuously at this point in time to move on from HTML.
