I have been quite perplexed in reading the many comments about IDPF’s “ePub” format following the release late last year of its underlying specs. A number of very smart people, including several developers who naturally dig deeply into tech specs, have painted ePub as a dark and mysterious digital publication (e-book) format, unlike anything else in the Universe™.

The way some have discussed ePub, if Indiana Jones were to explore the deep caverns of ePub, he would probably find something exotic and other-worldly, maybe even the remnants of a long-lost civilization. [note 1]

In reality, though, the opposite is true. ePub is internally quite recognizable and familiar, very similar to traditional web content that we all know and love.

ePub and web content share a number of important commonalities:

  1. ePub represents the textual content of the publication using standard, ordinary XHTML 1.1. So may web content. [note 2] [note 3]

  2. ePub styles the XHTML documents with CSS. So does web content. It is expected that ePub reading systems will interpret the CSS pretty much as web browsers do.

  3. ePub may include JPEG and PNG images. So does web content. ePub may even include multimedia – the same multimedia that we love to view and listen to as we surf the Internet with our web browser.

So what is the main difference between ePub and “web content”?

The answer to this question, given in the next section, puts ePub into proper perspective, making it much less mysterious, and a lot more familiar. But to answer in brief: there’s very little difference between the two where it really matters.

It also suggests a next-generation successor to ePub that is much more compatible with the “web paradigm”, while offering the advantages the current ePub offers for both the digital publishing industry and the reading experience. A merger of the two “worlds”, benefitting both and compatible with each other, is certainly possible – and intriguing to explore.

I see today’s web browsers evolving to be the center of this merger – to be the platform for reading all types of digital publications. Today’s “wall of separation” between the web browser and “e-book readers” is, in reality, quite artificial, and in some ways even unfortunate.

Techese Alert: what follows will be somewhat technical, but hopefully non-technical folk will be able to understand the general points.

The primary difference: How the content is “stitched together”

Most web sites (which can and should be thought of as a type of online “digital publication”) consist of multiple HTML content documents. ePub may also comprise multiple XHTML content documents.

The fundamental difference between the two “formats” (yes Martha, there is an implied “web site format”) is the mechanism of how the content documents are “stitched together” or organized into a coherent whole. That is, how does the user agent (browser; reading system) know which content document to display next when multiple documents represent the “publication”?

There are two ways to instruct a user agent as to what content document to display next:

  1. Internal Reference (IR)

    This is the “web browser model” where all the documents are linked together using hypertext links hard-coded within content. No need to explain this any further – it’s how the whole Web is put together!

  2. Document Organization Template (DOT)

    This is the method used by ePub. A separate file, not part of the publication content, contains information which “sews together” or organizes the publication’s content documents. In the case of ePub, the document organization is accomplished using the <spine> element in the OPF “Package” document. (ePub also allows hard links for the reader to optionally jump to other content.)

    In general, we could term any externally-designated means of organizing content documents a “document organization template” or DOT for short. [note 4]

    The user agent will use the DOT information to create a seamless reading experience, to build the links for the end-user to actuate, or both. With a DOT, it is possible to construct an entire book from a number of independent content documents, yet the book could be presented to the reader as if it were one document (depending upon the sophistication of the reading system), without the reader needing to actuate a link.
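
To make the DOT idea concrete, here is a minimal Python sketch of how a reading system might recover the reading order from the <spine>. It assumes the OPF 2.0 namespace and the usual manifest/spine layout; the sample Package, its file names, and the reading_order function are hypothetical and trimmed for illustration, not taken from any real reading system.

```python
import xml.etree.ElementTree as ET

# Namespace used by OPF 2.0 Package documents.
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}

# Hypothetical, trimmed Package document (metadata omitted for brevity).
SAMPLE_OPF = """
<package xmlns="http://www.idpf.org/2007/opf" version="2.0">
  <manifest>
    <item id="ch2" href="chapter2.xhtml" media-type="application/xhtml+xml"/>
    <item id="ch1" href="chapter1.xhtml" media-type="application/xhtml+xml"/>
    <item id="css" href="style.css" media-type="text/css"/>
  </manifest>
  <spine>
    <itemref idref="ch1"/>
    <itemref idref="ch2"/>
  </spine>
</package>
"""


def reading_order(opf_xml):
    """Return content document hrefs in the order given by the spine."""
    root = ET.fromstring(opf_xml.strip())
    # The manifest maps ids to files; the spine lists ids in reading order.
    hrefs = {
        item.get("id"): item.get("href")
        for item in root.findall("opf:manifest/opf:item", OPF_NS)
    }
    return [
        hrefs[ref.get("idref")]
        for ref in root.findall("opf:spine/opf:itemref", OPF_NS)
    ]


if __name__ == "__main__":
    print(reading_order(SAMPLE_OPF))
```

Note that the order comes entirely from the spine, not from the order of the manifest entries or from any links inside the content documents. That is the essence of the DOT approach.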

Obviously, there are advantages and disadvantages to each method, and variations on each (ePub is not the final word on the DOT approach), which I won’t delve into here. [note 5]

What does this mean? (The bigger picture…)

From the discussion above we can make a few diverse observations:

  1. ePub itself is essentially a flavor of a more general class of digital publication frameworks where a DOT is used to “stitch” together multiple XML documents (such as XHTML, Digital Talking Book, etc.), all styled using CSS. Hard links may also be added within documents to allow the user to optionally move to a new document.

    The OpenReader format developed a few short years ago is an example of a quite similar DOT-based framework whose DOT is more powerful than that of ePub. Our analysis showed that if one were to build either an OpenReader or an ePub reading system, the jump required to read the other format would be very, very small.

    Thus, competing “formats” that use some sort of DOT to organize XHTML content documents (styled with CSS) are quite compatible with each other. It is actually not correct to call them incompatible or that they contribute to the Tower of eBabel since the same reading application for one can be adapted with relative ease for another, and inter-conversion is possible.

  2. Plug-ins for current web browsers to handle ePub and similar DOT-based frameworks should not be difficult to develop. Future web browsers may even natively incorporate the ability to seamlessly present ePub and similar DOT-based framework publications.

    The simplest approach to leverage today’s web browsers for reading ePub Publications, and which avoids the plug-in route, is a straightforward ePub to “web site” converter. This allows the end-user to be able to read their ePub Publication on any device they own which has a web browser. (I’ve thought about such a converter and will be happy to talk to anyone interested in developing this tool. This could be a collaborative open source project.)

  3. As noted in the previous section, we may contemplate merging the “web paradigm” with the “DOT paradigm”, thus supporting both, and getting the advantages of both.

    For example, I see a new generalized DOT standard which future web browsers and specialized digital publishing reading systems can use to organize and present documents to the user. This generalized DOT should be able to handle much more non-linear publications than the quite limited (read: mostly linear) DOT used in ePub.

    Of course, I don’t see DOT as a replacement for the Internal Referencing used today on the web, but rather as an option to use when suitable – both could be used simultaneously, as OpenReader demonstrated.
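
As a rough illustration of the ePub to “web site” converter mentioned in observation 2, the core step is turning the DOT reading order into hard-coded “previous” and “next” links, that is, converting DOT into Internal Reference. The sketch below is only one possible approach; the function names and the naive string-based insertion are assumptions made for illustration, and a real converter would use a proper XML parser.

```python
from html import escape


def neighbours(order):
    """Map each document to its (previous, next) neighbour in reading order."""
    links = {}
    for i, href in enumerate(order):
        prev_href = order[i - 1] if i > 0 else None
        next_href = order[i + 1] if i + 1 < len(order) else None
        links[href] = (prev_href, next_href)
    return links


def nav_block(prev_href, next_href):
    """Return an XHTML fragment holding the links a plain browser needs."""
    parts = []
    if prev_href:
        parts.append('<a href="%s">Previous</a>' % escape(prev_href, quote=True))
    if next_href:
        parts.append('<a href="%s">Next</a>' % escape(next_href, quote=True))
    return '<div class="nav">%s</div>' % " | ".join(parts)


def add_navigation(xhtml, prev_href, next_href):
    """Insert the navigation fragment just before </body> (naive string edit)."""
    return xhtml.replace("</body>", nav_block(prev_href, next_href) + "</body>", 1)


if __name__ == "__main__":
    order = ["ch1.xhtml", "ch2.xhtml", "ch3.xhtml"]  # hypothetical spine order
    prev_href, next_href = neighbours(order)["ch2.xhtml"]
    print(add_navigation("<html><body><p>Text</p></body></html>", prev_href, next_href))
```

Once every content document carries these links, the result is an ordinary “web site” that any browser can read without knowing anything about the Package file.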

Summary

It is important that we not erect an artificial wall between ePub and web content. Certainly they are different in some ways, particularly how the content is organized. But they share many more similarities than differences.

It is my hope that we will begin to focus on the similarities, and enterprising individuals in the developer community will start building ePub reading systems based on leveraging existing web browsers. This is not to say effort should not be put into building powerful, stand-alone ePub reading systems which optimize the e-book reading experience and take full advantage of the feature-set in ePub.

Anything which promotes the direct reading of ePub Publications on a variety of devices will help ePub to more quickly reach critical mass in the digital publication community. The winners will be both publishers and consumers.


Referenced notes:

Note 1

At the risk of appearing a little too facetious, a couple of folk have even put forth “Grassy Knoll” type conspiracy theories centered around ePub and [insert the name of your most hated company]. It would not surprise me if these conspiracists next tie ePub to the Trilateral Commission, the Bilderberg Club, or the Bohemian Grove, claiming that the proof of such a nefarious connection is patently obvious to all. Maybe Coast to Coast AM needs to cover the ePub Conspiracy? <smile/>

Note 2

XHTML is basically a strict, XML-based flavor of HTML. Web browsers seamlessly handle all kinds of flavors of HTML, including XHTML.

Note 3

ePub may also include content documents formatted using the Digital Talking Book markup vocabulary. Since DTBook is quite similar to XHTML and can be rendered with CSS in most standards-compliant browsers, I have left it out of the main section to avoid needlessly complicating the discussion.

Note that ePub was specifically designed to be fully DTBook-compatible, even embracing DTBook’s NCX (the Navigation Control file for XML).

Note 4

I’m not wedded to the term “document organization template” (“DOT”) as no doubt someone with greater imagination will come up with a more accurate and better-sounding term. DOT will suffice for use in this article.

Note 5

As mentioned in the main article, the DOT information for ePub is placed into the <spine> element located in the OPF Package file.

It is important to note that the OPF Package performs a number of other tasks for digital publication use not easily done with the “bare” web paradigm, such as assigning “publication-level” metadata, and providing machine-understandable (and accessible) navigation aids, such as a table of contents, using DAISY’s NCX.

To provide a historical perspective, back in 1999, when the original OEBPS format was designed, the first proposal was simply a “web site” packaged into some container for single-file distribution. But as the specific requirements came pouring in from publishers, retailers, developers, and other stakeholders, it became clear that a new construct, which became the Package file, was necessary. This explains the early divergence of OEBPS/OPS from the “pure web paradigm”.

Today’s ePub (specifically the underlying OPS, OPF and OCF specs) meets a long list of requirements which were established by publishers and several other stakeholder groups.
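
For the container side (OCF), a minimal Python sketch of the packaging step might look like the following. It relies on two OCF requirements: the “mimetype” entry comes first in the ZIP and is stored uncompressed, and META-INF/container.xml points to the Package file. The build_epub function, the default “content.opf” path, and the sample file names are hypothetical choices for illustration; the result is not guaranteed to pass a validator without a complete Package document.

```python
import io
import zipfile


def container_xml(opf_path):
    """Return the META-INF/container.xml text that points at the Package file."""
    return (
        '<?xml version="1.0"?>\n'
        '<container version="1.0" '
        'xmlns="urn:oasis:names:tc:opendocument:xmlns:container">\n'
        "  <rootfiles>\n"
        '    <rootfile full-path="%s" '
        'media-type="application/oebps-package+xml"/>\n'
        "  </rootfiles>\n"
        "</container>\n" % opf_path
    )


def build_epub(files, opf_path="content.opf"):
    """Zip a dict of {archive name: text} into ePub container bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # The mimetype entry must be first and must not be compressed.
        zf.writestr("mimetype", "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        zf.writestr("META-INF/container.xml", container_xml(opf_path),
                    compress_type=zipfile.ZIP_DEFLATED)
        for name, data in files.items():
            zf.writestr(name, data, compress_type=zipfile.ZIP_DEFLATED)
    return buf.getvalue()


if __name__ == "__main__":
    # Placeholder content; a real publication needs a complete Package file.
    epub_bytes = build_epub({
        "content.opf": "<package/>",
        "chapter1.xhtml": "<html/>",
    })
    with open("sample.epub", "wb") as out:
        out.write(epub_bytes)
```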


25 COMMENTS

  1. Thank you. I was starting to wonder if I had somehow missed reading the “real” specification, since what I saw seemed pretty straightforward.

    Leveraging XHTML makes so much sense, not just in bootstrapping readers (as you say), but on the back-end too. When I wanted to export epub for my platform I just grabbed existing open tools that generated XHTML and wrapped those in some custom code to generate the OPF package document and zip things up. It only took a few hours to throw together something that validated, and that time included reading the epub spec.

  2. Jon, while I agree with much that you say here, you’re trying to slip a whopper past your audience (IMO):

    It is actually not correct to call them incompatible or that they contribute to the Tower of eBabel since the same reading application for one can be adapted with relative ease for another, and inter-conversion is possible.

    It’s exactly the proliferation of these very-similar-but-not-mine formats that causes the Tower of eBabel that David is so fond of. And clearly they are incompatible, because there is no app developed for either which also works with the other.

    Close-isn’t-bad is exactly the wrong way to think about ebook formats. Different-but-not-really-different is a killer for the industry.

  3. It would be nice if there were some sort of directory of ePub-supporting software. I’m interested in the format, but when I went looking for a converter the other day, the only one I could find any information about was the BookGlutton one which does not currently support CSS, rendering it somewhat less-than-useful for me. Also, what apps can read it? Sure, there’s Adobe Digital Editions for the PC, but (for example) are there any J2ME readers that’ll run on my new mobile phone?

  4. Good question, Dan.

    First, the YahooGroup ePub-Community is a place where information about ePub-related tools may be posted.

    Second, the tools to produce ePub from a given XHTML document (or set of documents) and associated CSS and JPEG/PNG images, will be quite easy to construct.

    This is the amazing thing that everyone is overlooking, and which I tried to address in this article. ePub is, for all intents and purposes, a packaged XHTML 1.1 “web site”. Yes, one has to add a Package file (which is easy to compose — a tool to auto-generate the Package which queries for the metadata and such should be almost trivial to write) and then ZIP it properly (again, a tool to do so should be easy to build — already existing open-source code has probably been written that can be adapted.)

    I’d be happy to consult with anyone who is contemplating building the simple ePub creation tools when the XHTML and CSS are already created.

    I also note that for books the XHTML documents are very easy to create in a text editor since one doesn’t have to worry about adding all the stuff today’s web pages typically have, such as various menus, multiple panes, and such.

    Liza’s comment above illustrates how easy it is to create ePub from text editors and such. With a couple of tools to speed up the process, it can be made easy even for novices to produce ePub using humble Notepad for authoring the XHTML and CSS (which can start off as “fill-in-the-blank” templates).

    To summarize, let’s not view ePub as this complicated format that requires some commercial tool (such as InDesign) to create. This is the primary purpose of my “demystified” article, to empower small publishers and others who wish to produce ePub to go ahead and learn the few things they need to learn to do this, as well as to encourage developers to write the simple tools as just noted. Let’s get moving!

  5. OK, it’s just a packaged “web site”. But it uses the “Package” instead of what a web site uses, which is typically an “index.html” or “default.html” file. Fair enough.

    On the eBabel front, let’s examine the other web site packaging formats. Googling for “web page archive format” gives one an interesting list. Microsoft’s MHTML, introduced in IE 5, documented in RFC 2557, and apparently also supported by the Opera browser. And Apple’s WebArchive, used by Safari. The Library of Congress’ WARC, which they’re using to preserve Web site captures, and is a draft ISO standard. WARC is based on ARC_IA, the format used by the Internet Archive. There’s also MAF, the Mozilla Archive Format for Firefox 3, Christopher Ottley’s project.

    Can you say something about how similar and/or different ePub is to these other pre-existing and widely used formats?

  6. Here’s information on the epub converter that I developed, which expects TEI as the source XML format. It leverages entirely free tools and is free itself. Just give it a TEI document (it expects a chaptered work) and it spits out an epub book with each chapter as an individual XHTML page.

    If there’s interest I’d be happy to adapt it to other markup formats, or package it as a binary executable (right now it requires some installation of tools in the Python language).

  7. Thanks Bill for pointing out the other “web site packaging” formats. Other than MHTML, I’ve not heard of the others. So give me a few days to study them and compare them with IDPF’s OCF (which of course is based upon what the OpenOffice folk use, closely related to JAR.)

    It definitely would behoove IDPF to look again at packaging technologies.

  8. “Second, the tools to produce ePub from a given XHTML document (or set of documents) and associated CSS and JPEG/PNG images, will be quite easy to construct.”

    I’m not denying that the concepts involved in ePub seem pretty easy to grasp, and that it probably wouldn’t be terribly difficult for those with the right skills to write a program to do the conversion.

    My point, however, is that I shouldn’t have to. Unlike other standards (such as the JPEG and PNG image formats you mention) there does not seem to be any “reference implementation” for ePub. If it’s not hard to write code to handle, then it would be advantageous if someone were to step up and create an open source ePub-handling library which developers can simply plug in to their application. Sure, it will have some problems to begin with, but every project does, and by making it open source, anyone with the right skills could join in and help instead of it becoming the sort of proprietary spaghetti we all dread.

    As far as I myself am concerned, at the moment my ebook library consists mainly of HTML documents (usually exploded from Microsoft Reader files, although I also have some Baen books which are natively HTML) which are of varying code standards. Some have CSS and images and some do not. That means that as well as an ePub packaging utility I’d also need something which can take any old HTML file and process it into whichever XHTML version ePub is expecting, rather than simply saying “hey, this isn’t XHTML, rejected!”. In addition, it also needs to be able to follow links, as for example Baen books are often split into chapter files, whereas MS Reader books are a mixture depending on the application that was used to author them.

  9. @Dan: The BookGlutton tool currently supports inline CSS as well as external link references to CSS files. This is mandated by the OPS spec:

    “The link element allows for the specification of various relationships with other documents. Reading Systems must recognize external style sheet references specified via the href attribute and the associated rel attribute (for the values rel="stylesheet" and rel="alternate stylesheet").”

    This means single file XHTML conversions can still use CSS, either in external files (via URL) or inline (via style elements).

    As for the flavor of XHTML needed for epub, XHTML 1.1 is bleeding-edge enough that not only will you not find many web sites out there using it, you’ll have trouble conforming to it. It’s extremely intolerant of much current Web content. The BookGlutton tool will convert HTML to XHTML Strict, which for all practical purposes (such as being viewable in Digital Editions) is enough.

    Soon we’re releasing an upgrade to the converter which will support multi-file zip archives with JPG, SWF, GIF, XML, CSS and PNG files.

    Aaron

  10. To comment on Aaron’s comment, the more precise way to describe XHTML 1.1 is that it does not support all the elements and attributes that one may use in legacy (“tag soup”) HTML. XHTML 1.1 is very close to XHTML 1.0 Strict, close enough that many XHTML 1.0 Strict documents can be DOCTYPE “rebranded” (for those who want to validate to a DTD) and they will work “as is”. Any differences can readily be fixed using a simple substitution script or even a quick edit using a text editor.

    In fact, here are the actual differences between XHTML 1.0 Strict and XHTML 1.1:

    1. ‘lang’ attribute becomes ‘xml:lang’ for XHTML 1.1.

    2. On the <a> and <map> elements, the ‘name’ attribute is not supported in XHTML 1.1. Use ‘id’ instead.

    For those with really legacy HTML, “HTML Tidy” may be used to get you much of the way to XHTML 1.0 Strict.

    Btw, here’s an interesting article describing the benefits of XHTML 1.0 Strict over “tag soup” HTML. I probably don’t need to comment why IDPF, from the beginning in 1999, specified XHTML.

  11. To make another point on Aaron’s comment, ePub (and the underlying OPS spec) is an e-book format, not a web page. Typical HTML authoring for web pages is a whole lot different than that used for books. In fact, the markup needed for book use is almost trivial compared to that needed for today’s complex “space shuttle cockpit layouts” used in most web pages, which almost always require a tool to author. So one has to be careful not to compare apples with oranges – both are fruit, but are quite different fruit from each other.

  12. Comically, in my work with epub that id/name change is the only problem I hit.

    The publicly-available TEI/XHTML stylesheet will automatically convert xml:id (from TEI) to HTML id attributes. It will also helpfully create anchor links to those paragraphs.

    If you ask it to generate HTML 4.0, you get:

    <p id="id123"><span name="id123"/>

    which as Jon points out is not valid XHTML 1.1 because ‘name’ has been removed from the spec. (In fact it wasn’t valid anyway because ‘name’ was never allowed on a span.)

    …but if you ask it to generate XHTML you get:

    <p id="id123"><span id="id123"/>

    …which is also invalid XHTML 1.1 because of the duplicate IDs.

    I couldn’t find the right combination of arguments to the base TEI stylesheets to get the behavior I wanted so I had to modify them directly. I’ll be passing a bug report on to the TEI folks.

  13. To put a finer point on Jon’s comments, HTML Tidy will not reliably produce epub files which validate in epubcheck. It will produce XHTML Strict, but not strictly, if you get my drift. In other words, it might leave 574 name attributes and disallowed br tags (between divs, for example). Removing these by hand to get a compliant epub is not worth the effort. Even a script would be wasting cycles trying to conform to XHTML after converting from “tag soup.”

    This is a conflict between declared and actual doctype, as seen in the validation errors for the OPS spec itself:

    http://validator.w3.org/check?uri=http%3A%2F%2Fwww.idpf.org%2F2007%2Fops%2FOPS_2.0_final_spec.html&charset=%28detect+automatically%29&doctype=XHTML+1.1&group=0

  14. I disagree that it’s not worth the effort. For one thing, removing all the name attributes is trivial in XSLT. I do agree that cleaning up arbitrary content might be intractable, but most publishers will have a workflow that starts with at least moderately-consistent source documents, and any inconsistencies can be tweaked at the end of the pipeline.

    But the larger point is, if validation is not part of your workflow because it always produces error messages (even if they’re non-critical), then people learn to ignore the validation output altogether, or even disable it. The whole process of validation becomes meaningless, and ultimately systems will produce seriously non-conforming documents.

    This is bad for everyone: the consumer, because some ostensibly-portable documents now look bad on some devices; the device-maker, because they get blamed for not handling invalid markup; and the publisher, because readers are turned off by the whole ebook experience.

    Software engineers are the first line of defense here, and we should whenever possible build tools that include validation, and push hard for documents that can be accurately encoded.

  15. Of course validation is part of our workflow. It should be for everyone, I agree. And publishers should be expected to conform, sure. But I disagree that it’s worth the effort to strip name attributes. In many cases, it has no effect on the appearance of documents. In others, it’s actually necessary for legacy systems, or new systems based on legacy engines.

    Being strict for strictness’ sake stifles innovation and development. The e-book format crowd needs to learn lessons from the Web and loosen up.

  16. When we have HTML with <br> at the block level, that HTML is badly broken.

    Publishers will discover as they move to ePub that the XHTML they produce will likely be better formed and much more repurposeable.

    It’s about time that the e-book industry grows up and does things right – the winners will be the publishers, small and large.

  17. Jon Noring said “Future web browsers may even natively incorporate the ability to seamlessly present ePub and similar DOT-based framework publications.”

    That would be great! The acceptance of e-books would accelerate if Firefox, Safari, Opera, IE and other browsers could read a dominant e-book format. This approach would leverage the pre-existing wide deployment of web browsers on large and small devices.

    The publisher Tor has been releasing e-books in a simple format that is readable by browsers. It consists of a zipped file containing an HTML file together with a folder containing image files. The images are in JPG and GIF formats. However, reading this format raises a difficulty. How can a reader using a browser create a “bookmark” or “stopping point” within a very long HTML file when there is no convenient anchor point?

    The “bookmark” would point to a phrase within a text or an offset within an HTML file. Do any browsers implement this function? I think this should be doable in the current generation of web browsers for simple HTML files. It probably would be tricky in complicated HTML files.

    I have not used an ePub reader. How do you create a “bookmark” within a text while reading? Does it automatically paginate? Do page numbers change when fonts are resized?

    Much thanks to the posters on this thread. I appreciate the expertise that is displayed and I am grateful for your attempts to move e-books forward.

  18. The Openberg Lector addon for Firefox (https://addons.mozilla.org/en-US/firefox/addon/5275) used to allow one to read epub documents (among other formats) in the browser.

    Lector was reviewed on Teleread a year ago (http://newteleread.com/wordpress/blog/2007/12/04/an-e-reader-that-accepts-any-xml/).

    Unfortunately it has not been updated for the latest version of Firefox, and the link to its homepage redirects to an empty blog. It’s a great shame, because this was a promising project. Does anyone know any more about it?
