Part one in a series exploring the state of e-book publishing today. This installment is one of several by New York editor Roger Sperberg about the publishing industry’s failure to use XML markup as the base for creating an electronic future for the book industry.

eReadster — eFail!

What’s XML for? Perhaps if publishers understood that question we would be farther along the road to e-books — and to whatever the thing is that subsumes e-books into a richer medium without forgoing book-ness.

I was speaking with Jess Lawson of Oxford University Press earlier this week about using XML in book production. The desirability of an XML workflow comes across more clearly, I observe, when it’s called “XML first,” as OUP does. Adding XML markup for web or e-book delivery after a standard birth — inception, editing, production — enables electronic delivery, but it seems to be worth only about as much as the trouble it takes. After-the-fact XML brings little additional benefit.

I remember a slide that Tommie Usdin of Mulberry Technologies showed at an XML conference ten years ago. It stated simply, “Markup is expensive.” And at about the same time Jon Bosak of Sun did some back-of-the-envelope calculations that put the break-even point for the extra cost of adding markup at about 1.8 uses of the content.

What’s that mean? By Bosak’s rule of thumb, if I were to publish ten books with ten chapters each, the additional cost incurred by structured markup like XML or SGML would be met by the simple re-use of 80 of the 100 chapters — on the web, in advertising, in custom publications, in re-purposed derivative works. If I remember correctly, Jon’s data came from Sun’s own experiences, in which material describing computer subsystems would be used in documentation for many different final products, with some descriptions even making their way into marketing handouts.
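
Bosak’s rule of thumb reduces to simple arithmetic, which a few lines of code can make concrete. This is a sketch of my reading of the figure, not Bosak’s actual model:

```python
# Back-of-the-envelope sketch of the 1.8-uses break-even rule described above.
# If markup pays for itself at ~1.8 uses per chapter, then 100 chapters need
# 180 total uses; the first publication counts as one use each, leaving 80
# re-uses -- matching the "80 of the 100 chapters" figure in the text.

BREAK_EVEN_USES = 1.8   # Bosak's rule-of-thumb figure cited above

def reuses_needed(chapters: int, uses_per_chapter: float = BREAK_EVEN_USES) -> int:
    """Re-uses required, beyond first publication, to recoup markup costs."""
    total_uses = chapters * uses_per_chapter
    return round(total_uses - chapters)  # subtract the initial use of each chapter

print(reuses_needed(100))  # -> 80
```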

Trade publishing doesn’t have so many opportunities for repackaging, but re-use is as simple as utilizing the same source for different editions. So the added cost of XML markup is met if, say, I publish four of my ten texts in hardcover, mass-market, and large-print editions. Fifteen years after the internet’s appearance and well into the second coming of e-books, this seems a rather crude justification. But fifteen years ago, those three editions likely would have been produced — keyboarded, formatted, proofed — in three entirely separate editorial workflows. In 1994 I was working with Ballantine Books, and even then setting the paperback from the hardcover text files was the exception rather than the rule.

Single-source “P- and E-” publishing appears to drive the publishing industry’s slow turn to XML workflows. Markup is expensive, and the uncertain economics of our electronic future mean that the sight line for payback on that extra expenditure must be short, direct, and obvious.

Perhaps this is one reason “electronic” books are scarcely electronic at all, but only scantily draped in the most superficial of markups, our ever-present HTML. With its ready use in even the most rudimentary web pages, HTML markup must seem like a no-brainer to those publishers venturing into e-books. Who wants to invest millions in markup with no way of assuring its return?

To return to my opening question — Why XML? — we won’t understand the answer until we first realize that the responses publishers most often rely upon are really answers to Why HTML? or Why single-source? Years ago, Bob Stein argued that we couldn’t exploit the electronic side of publishing unless authors understood what that meant (and then he set about building new authoring tools).

Today, I would argue that we can’t exploit E- (yes, “E hyphen” is my abbreviation for e-publishing, and fiddle for anyone else’s use of “electronic” as a situational attribute) until editors understand XML as well as they do English grammar, and regard metadata as being as valuable as a plug on Oprah. Only then will e-books have the structural elements that will make them more valuable than p-books.

This is only the first broadside of many that I will launch here from TeleRead’s ramparts. I also splutter as @eReadster on Twitter.

17 COMMENTS

  1. I agree that editors should be as conversant with XML as they are with grammar. But the one thing you haven’t addressed is: Who will pay for that expertise?

    Publishers are currently budgeting fewer and fewer dollars for editing. One company tried to hire experienced STM (Science, Technical, and Medical) editors for 80 cents a page. That rate is lower than the rate publishers were paying in 1984 and, from what I have been told, lower than they were paying in 1973.

    Publishers have asked me if I can do editorial work in InCopy and InDesign. Yes, I can; I invested a lot of money into acquiring and learning the software and in having computers capable of working efficiently in those programs. But then it came down to price. Because I was being asked to edit, the publishers were unwilling to compensate me for my skill and investment. Publishers viewed it as my problem.

    So now you want XML thrown into the mix. What incentive is there for an editor to learn yet another skill for which the editor will not be compensated? Or for the editor already skilled in XML to use that skill? I do not view this as an editor’s responsibility. An editor should not have to master a nonediting skill without compensation.

  2. This article makes some good points but seems to omit some very important reasons for XML beyond single-source and metadata. One is accessibility for readers with disabilities. This is becoming a big deal for education publishers. Another is adding richer content to ebooks and other electronic forms: embedding movies, graphs that can be manipulated, mathematical equations that can be copied into calculation apps, etc. When ebooks become more than just digitized text, then XML’s descriptive power will be a requirement.

  3. I guess I just don’t get it. Maybe my books are simple and miss the point, but I find HTML has all the markup I need for creating eBooks. What, specifically, does XML offer that HTML doesn’t? What tools exist (free or nearly free) to convert from XML to the various formats in which my product is delivered?

    Remember, authors aren’t writing in XML (or HTML); they’re writing in Word or OpenOffice. So you never have the chance for XML first. Editors and publishers begin with the submitted Word or RTF files, work with these as a basis, and move on from there. The days of re-coding author-submitted text from paper manuscripts are long (and thankfully) gone.

    I’m truly not trying to be argumentative. I’d be happy to learn XML if I could see a direct payoff. But religious purity is not important to me.

    Rob Preece
    Publisher

  4. Coming from a research-based professional publishing background where granular content re-use is a business necessity, the business justification for upstream XML conversion always held water.

    But I can understand the continuing reluctance in, for example, the fiction and “joy read” non-fiction markets, where the reuse potential is still very limited and focused on established product “packages.” Content disaggregation from these packages, even at the chapter level for non-fiction, and reassembly into custom-assembled topical products would be clunky at best, not to mention the headache of rights issues. Unless, of course, authors start WRITING with this disaggregation potential in mind.

    Until eBooks/eReaders become fully interactive, and start to prove to authors and publishers that a new, more granularly-written genre holds new monetization potential, the traditional reasons for NOT investing in XML-centric enriched tagging and metadata will continue.

    Thanks for the post!

  5. Writers who write in OpenOffice.org write natively in XML, as the .odt file format is just XML, a CSS file, and other files (say, image files) in a zipped archive. Writers who write with the defaults in recent editions of MS Word also write in the XML-based .docx format. Both are “open,” as approved by international standards bodies. (Microsoft’s purchase of that approval was widely criticized, and its .docx format is so complex that it’s claimed nobody but Microsoft can use or understand it; we can hope that translators will come into being, though.)
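
    The zipped-XML point is easy to verify with a few lines of Python. This sketch builds a toy .odt-style archive in memory and reads the XML back out; real ODF files carry more members (styles.xml, meta.xml, images) and use the real ODF namespaces, which are omitted here:

```python
# An .odt file is a ZIP archive whose text lives in an XML member named
# content.xml. Build a toy archive in memory, then read the XML back out.
# The <doc>/<p> markup below is illustrative, not real ODF.
import io
import zipfile
import xml.etree.ElementTree as ET

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as odt:
    odt.writestr("mimetype", "application/vnd.oasis.opendocument.text")
    odt.writestr("content.xml", "<doc><p>Hello, e-book world.</p></doc>")

with zipfile.ZipFile(buf) as odt:
    root = ET.fromstring(odt.read("content.xml"))

print(root.find("p").text)  # -> Hello, e-book world.
```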

    OpenOffice.org’s file format is supported by Apple in Pages and the latest editions of TextEdit, as well as by KOffice and AbiWord, though these implementations are not complete; it’s claimed that only OpenOffice.org fully implements its own standard.

    I’d also like to read Roger’s further explanations of why HTML is insufficient for the purpose, and why he seems to think that a proprietary set of XML tools is better than the host of widely available, free, and open-source tools capable of producing clean HTML. What do publishers need to do that our current standards don’t support? What do publishers need from their authors?

    I can perhaps understand extra tools needed for the ‘text as video game’ proposed by Linda Gruber in her book. But maybe most authors (who seem not too pleased with the very idea that their words won’t be read on pulped trees) don’t need or want to create video games.

    In short, this was a great rant, but I wish Roger had written out his whole series of rants ahead of time, and then posted an introductory explanation first of the bunch. Because I, like others, am at a bit of a loss to see what he’s getting at here — specifically, and in actual practice.

  6. I would hazard that, at a minimum, what makes a reasonable XML platform preferable to ad-hoc HTML is that XML is data and HTML is formatting, down to page layout if you fiddle with it. That really decreases the ease of transferring a book to ‘n’ e-platforms. The core idea behind XML is that it is meant to be fully semantic. Toss in a complex stylesheet for print and XSLT it into a beautiful PDF; or another one for “large format” print; or any number for various e-book readers that support options (a, b, c, q). Need to do a revision on the book? Edit the XML and re-publish; you don’t have to muck about with “well, this got this bit of style applied here, so we have to undo that, move it around here, then redo it like this.” That’s the dream, anyway. Or _a_ dream… 🙂
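
    A minimal sketch of that one-source, many-outputs idea, using made-up element names (chapter, title, para) rather than any real schema. A production pipeline would use XSLT, but plain Python shows the same separation of content from presentation:

```python
# One semantic XML source, two renderings: the content carries no formatting,
# so each output decides its own presentation. Element names are illustrative.
import xml.etree.ElementTree as ET

SOURCE = """<chapter>
  <title>Why XML?</title>
  <para>Markup is expensive.</para>
</chapter>"""

def to_html(xml_text: str) -> str:
    """Render the chapter as a fragment of HTML."""
    ch = ET.fromstring(xml_text)
    parts = ["<h1>%s</h1>" % ch.findtext("title")]
    parts += ["<p>%s</p>" % p.text for p in ch.findall("para")]
    return "\n".join(parts)

def to_plain(xml_text: str) -> str:
    """Render the same chapter as plain text (say, for a large-print draft)."""
    ch = ET.fromstring(xml_text)
    lines = [ch.findtext("title").upper()]
    lines += [p.text for p in ch.findall("para")]
    return "\n".join(lines)

print(to_html(SOURCE))
print(to_plain(SOURCE))
```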

  7. Er, meant to say also: my understanding is that the XML that OpenOffice uses is nowhere near “semantic.” XML by itself doesn’t mean much, not much more than “a more extensible version of a CSV for delineating data.” Not all spreadsheet templates are made equal, I’m sure we’ll agree.

    And apologies for any off-sounding exposition. I haven’t slept in longer than I should…

  8. Yawn. Is Russian idea; Soviets had first before lackeys of Wall Street brought down Socialist Paradise with corruption and illegal use of chocolate bars. .fb2 format is great XML ebook format of Heartland of Eurasia and will yet lead to dominance of world by Dialectical Materialism. More vodka for everyone!

    Regards,
    Jack Tingle

  9. To those who recognize that some mark-up is needed but wonder why XML: in my experience the benefit of using XML is that you can use XSLT to transform your content into various outputs: HTML for the Web, Word documents, InDesign Tagged Text, PDF, you name it. XSLT provides a much safer and more predictable transformation than, say, scripting a ton of finds and replaces using regular expressions. XSLT can be difficult to learn, but once you get it, the question “Why use XML?” vanishes. Being able to navigate the DOM comes in handy, too.
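
    As a taste of what such a transform looks like, here is a minimal XSLT stylesheet; the source-side element names (chapter, title, para) are invented for illustration, not any real publisher schema. Feed it and a matching source document to any XSLT 1.0 processor (xsltproc, Saxon, lxml) to get HTML out:

```xml
<!-- Minimal sketch: map semantic elements to HTML presentation. -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="chapter">
    <html><body><xsl:apply-templates/></body></html>
  </xsl:template>
  <xsl:template match="title">
    <h1><xsl:value-of select="."/></h1>
  </xsl:template>
  <xsl:template match="para">
    <p><xsl:apply-templates/></p>
  </xsl:template>
</xsl:stylesheet>
```

    Swap in a different stylesheet (XSL-FO for print, stripped-down HTML for an e-reader) and the same source yields a different output, which is the whole point of the comment above.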

  10. I guess the question hinges on whether firms have the in-house knowledge or willingness to make the XSLT/XSL-FO modifications they need. Even O’Reilly was an MS Word shop for a long while (although I think they switched to DocBook).

    It makes perfect sense for academic publishing to invest in reuse; they have longer-range goals. Much as I can appreciate xml, I also appreciate the ease of doing cut-and-paste from one application to another. I am working on 2 docbook projects, and confess that sometimes I copy content from a web browser into OO, then save as docbook xml. Sounds idiotic, I know, but sometimes it’s the easiest thing to do.

    For a recent project I had to convert LaTeX to DocBook. There are scripts for that, but because of special circumstances, it made sense just to convert it to XHTML and fix all the image references manually.

    Semantic markup and web content management don’t really play well together; I’ve seen few examples of good implementations for content originating on the web.

    Also, an XML workflow tends to reduce the emphasis on GUI-based design, so it limits your design options somewhat (how many designers know XSL-FO?). I’m sure integrated and user-friendly solutions exist, but their costs put them out of reach for smaller publishers.

    For some publishers it makes sense to tackle semantic markup and layout, but for many projects it seems like overkill. For those willing to invest, it seems there are many turnkey XML solutions to reduce the learning curve.

  11. More forward-thinking academic publishers have been using XML workflows for decades, starting with SGML and moving to XML when XML displaced SGML. These publishers are reaping long-term benefits from that.

    Many of the benefits have to do with what the publisher can do with the data and don’t necessarily translate to “reading” benefits. When rendered online, the book or article markup is significantly dumbed down to support the browser or e-reader technology.

  12. I agree with SPost. My company, Design Science, works with many publishers with XML workflows. As SPost says, most use XML internally only. It is my hope that this will change soon. Once education moves more toward ebooks, competition between publishers will cause an arms race in interactive features for their books. This will require more substantial XML workflows and, sometimes, delivery of XML to the reader software.

  13. A great, thought-provoking piece that reflects what I’ve been thinking for several years now.

    I have spent most of my career in the technical documentation industry, where applying markup like XML has been standard practice for decades. The cost is recovered through reuse, repurposing, reduced translation costs, and easier sharing of content across different delivery platforms.

    Over the last few years I have been lucky enough to have several non-fiction books published, and the more involved I have become in the ‘traditional’ publishing model, the more I have wondered why it is still stuck with a process that has changed little since the days of the typewriter. Yes, the tools have changed, but the process is still fundamentally the same, largely because traditional publishers still see the physical book as the product, and not the content.

    But today content is king, and we need to make that content available across all platforms, and that means mark-up. Using XML can also allow you to add value to the content: imagine reading a book on an eBook reader and, when the story mentions a particular place, being able to click on the word and get the Wikipedia entry combined with a Google Street View map of the location.

    As for the cost of adding that mark-up – is it any higher than adding an index marker?

    As for asking editors to learn XML: sure, they need to be aware of it and its power, but there is a whole profession of people out there who already know about applying XML mark-up to content: the technical publications industry. Oh, and a lot of them know about XSLT and XSL-FO too, and are skilled in the tools that use these standards.

    I am writing my current non-fiction book using a wiki as the editing tool. From that we will export the underlying XML-tagged content; it will then go through an XSL-FO transform for the printed version and will also be ready to be converted to an eBook.

    And believe you me, for people who have spent years tagging things like aircraft manuals and software user guides, tagging a trade mass market book is not too great of a challenge.

  14. I think this piece is spot on, albeit a little funny. It’s funny because many of the publishers that could benefit from XML have been paying it lip service for years and making sorry excuses for not adopting it. Ironically, the very companies that could benefit from using it to re-engineer their publishing process and trim all the extra fat from their customer-facing publishing process are already using it internally for training and technical documentation. Not all publishers, mind you, but a fair number are using it (and have been for years now) to create multi-channel, platform-, browser-, and OS-agnostic content — things like policies, procedures, training, technical support, and product documentation. Often, they translate that content into multiple languages, and XML helps them there as well.

    The other irony is that books in the trade press would be some of the easiest to XMLify (for lack of a better term) because, compared to manuals designed to help folks assemble a 747 or run a nuclear power plant, these books are very, very simple and could easily be modeled and styled for multiple outputs, including EPUBs, Amazon’s soon-to-become-extinct proprietary format, PDF, etc., all with the click of a mouse.

    It’s high time publishers start leading their industry and providing content in standardized formats instead of allowing device manufacturers to tell them how to prepare their eBooks, for example. By becoming leaders and focusing on the content creation, management, and delivery processes, instead of formatting and print-paradigm processes, publishers could provide their customers with new, engaging, interactive content.

    And, it’s also time for publishers to realize that the “build it and they will come” mentality is so 1999. No one wants to come to your website. They want to grab the content they want and go — from wherever they are, whenever they want to, on whatever device they choose. This means teaming up with distribution networks like iTunes — the largest entertainment retailer in the world — to make eBooks affordable, consumable, shareable, tweetable, and enjoyable.

    There’s so much to be excited about in the book publishing space, but consumers are not going to sit around and wait for each company to get their act together. I expect some major changes are coming down the digital publishing pipe… and soon.

  15. When you said, “Today, I would argue that we can’t exploit E-… until editors understand XML as well as English grammar, and regard metadata as valuable as a plug on Oprah.” my heart skipped a beat…you’re exactly right.

    Of course, there’s a whole lot more to the e-format (see what I did there?) than metadata. Descriptive data has been a ghost for nearly 15 years, and it is getting better, but the real power of “e” is to move beyond the sequential and flat formats of books and journals altogether. While it is a topic for another time, I believe glyphs on a page are inefficient in any form. We are visual, non-linear creatures… a medium that meets our basic nature is what will help information become truly accessible to the greatest number. In the meantime, XML is a great start at improving visibility and exchange.
