OpenReader and the open web

By Roger Sperberg, New York Editor for TeleRead

March 14, 2006


Reading about Google’s purchase of Writely and its impact on Microsoft, I ran across this observation from Gary Edwards, quoted at ZDNet:

For Google to thrive, information must be cut loose from [its] application ties and set free using open, Internet-ready file formats.

Wow. That sounds a lot like the kind of principle that should be underlying the creation of an OpenReader format. “Open” meaning non-proprietary, “Internet-ready” meaning shareable, postable, current — my kind of features. And single-format for work, collaboration and distribution, not one version for me to work with and another for you to look at.

Like Sophie, the Institute for the Future of the Book platform, by the by.

Edwards had more to say:

Microsoft made their billzillions from tying information to specific application and platform versions, and tying those to hardware and API references, charging a premium for the licenses needed to facilitate the exchange and interchange of documents. Just the opposite of the Internet-centric Google model.

Something else to consider. Web 1.0 information was document-bound and unstructured. Meaning it hardly qualifies as machine-readable. You can beat it to death with a browser, but forget about computationally working it to extract information-rich components and data streams.

Web 2.0 changes that. Enter XML, RDF and the world of highly structured, highly interactive component information and data-rich documents. So highly structured that our computational machines can run wild with metadata, conceptual tagging and ontology armed engines to aggregate, re-use, re-purpose information components as they zig-zag across vast stores of documents.

This sounds very much like a path to resolving the “Tower of eBabel” problem much discussed here at TeleRead.

Edwards goes on to point out the opportunities provided by use of the OpenDocument (ODF) file format. His view is even more expansive than David Rothman’s.

This might lead us to ask, Should OpenReader be a subset of OpenDocument? Well, I’m not headed in that direction yet. But I do think that OpenReader should consider OpenDocument’s goals and principles as it works toward becoming a widely adopted file format.

24 COMMENTS

bowerbird March 14, 2006 at 3:39 pm

it’s entertaining to me to see people suggest that
a complex file-format which is an “open” one can
solve the problems which were caused by the
complex file-formats that are “proprietary”…

it is _not_ the “proprietary” nature of the file-format
of ms-word and its ilk that causes all the problems,
it is their _complexity_ and unintelligent design…

strip the file-format back to basics — plain-text,
structured in a consistent manner — and the apps
of tomorrow (whether based offline or online) will
be able to figure out the “semantic structure” of
the document all by themselves, without us to
have to do _any_ heavy markup at all…

but hey, you don’t have to believe little old me.

just pay close attention to heavy-markup people
as they get bogged down in their own quicksand.

in the meantime, though, do _not_ get suckered
into doing any of their heavy markup for them,
or you will only be wasting your valuable time…

-bowerbird

Jon Noring March 14, 2006 at 6:06 pm

All the conversion houses (which do a significant amount of books for major publishers) are now switching over to XML-based workflows (many are already there, such as Rosetta Solutions). They are not bogged down — rather, they are doing very well since, when done right, it makes things easier.

Your attempt at a strawman and then tearing it down is plain for all to see.

Anyway, we all await with bated breath for ZML. How long have we waited, two years now? (*laugh*)

Roger Sperberg March 14, 2006 at 7:45 pm

Anyone with experience will tell you: “Markup is expensive.” There’s no way around that. It’s a lot of work to put markup into text. The bowerbird is right about that.

And he’s right that you can get a lot of what e-readers want from simpler markup that masquerades as “no” markup. (What’s the difference between putting four blank lines at the end of a chapter — 5 keystrokes — and putting in some symbol or </ch>?)
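As a rough illustration of the "four blank lines" idea, here is a minimal Python sketch of a splitter that treats a run of four or more blank lines as a chapter boundary. The function name `split_chapters` and the sample text are made up for this example, and the four-line convention is taken from the comment above, not from any published ZML rule.

```python
import re

def split_chapters(text: str) -> list[str]:
    """Split plain text into chapters at runs of four or more blank lines."""
    # Normalize Windows line endings so the newline count is reliable.
    text = text.replace("\r\n", "\n")
    # Four blank lines after a line of text means five newlines in a row.
    parts = re.split(r"\n{5,}", text)
    # Drop empty pieces and trim stray leading/trailing newlines.
    return [part.strip("\n") for part in parts if part.strip()]

# Hypothetical sample: two chapters, the second with an ordinary paragraph break.
sample = "Chapter one text.\n\n\n\n\nChapter two text.\n\nStill chapter two."
chapters = split_chapters(sample)
print(len(chapters))
```

One limitation worth noting: a "blank" line that contains spaces or tabs would not count toward the run here, which is exactly the kind of ambiguity an explicit end tag avoids.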

But even wikis show us that “almost-no” markup has its appeal, so b-bird has good reason to pursue his course.

What his demonstrations of “zero” markup don’t show us, though, is the network effect of markup. When my text, your database, this programming language and that graphics language all use XML markup, I can have graphics appear with the text and be updated in real time, all while being parsed by the same engine that handles the text formatting. I don’t think bowerbird’s parser, which knows how many lines make a chapter end, is versatile enough to cope with a zero term programming language and zero term graphics language to achieve the same thing.

And of course he’s never claimed that it could. Is it necessary or even a good idea? If you think so, stand on my side of the room. If you don’t, please stand over there by bowerbird.

If you’re going to go the markup route, you’d better have a good reason. Bowerbird didn’t say that, but it’s implicit in what he has said, and it’s certainly something that I believe.

You can read something about the network effect as it applies to markup at http://www.w3.org/People/cmsmcq/2002/whatmatters.html

bowerbird March 15, 2006 at 2:43 pm

roger said:
> (What’s the difference between
> putting four blank lines at the end of a chapter
> — 5 keystrokes — and putting in some symbol
> or </ch>?)

well, i think roger meant this as a rhetorical question,
with “none” being the answer he wanted you to give.

and in some senses, the difference is slight enough
that it won’t really matter very much, i grant that.

but most of the time, there _is_ a difference.
and sometimes that difference can be big,
_especially_ cumulated over many chapters,
or a wide variety of other semantic structures.

you can see this for yourself, just by doing a
“view source” on this very page. do it now…

i’m serious. do a “view source” in your browser
right now, to see where a heavy-markup mentality
takes you, once you let a simple “/ch” in the door.

kind of daunting, isn’t it?

is that the kind of “open” future you want?
one where the content is obstructed with
markup gobbledygook? not me, thank you!

> If you’re going to go the markup route,
> you’d better have a good reason.

_and_ a huge budget…

-bowerbird

Roger Sperberg March 15, 2006 at 2:48 pm

Bowerbird, you go stand on the other side of the room where the deluded folk go!

We addled folk will stand on this side.

bowerbird March 15, 2006 at 2:56 pm

well _that_ is a sparkling example of discourse! :+)

-bowerbird

Roger Sperberg March 15, 2006 at 3:02 pm

Oh, that’s just my way of acknowledging that going any further could lead us into a mudfight. “You’re an idiot!” “No, you’re an idiot!”

I may concede some of your points — even accept the validity of your viewpoint — but that doesn’t mean I draw the same conclusions as to the best course of action.

bowerbird March 15, 2006 at 5:02 pm

jon said:
> All the conversion houses (which do
> a significant amount of books for major publishers)
> are now switching over to XML-based workflows
> (many are already there, such as Rosetta Solutions).

nice bluff, jon.

do you have contact info on these “conversion houses”?
because i would like to get some price quotes from them.
let’s see _exactly_ how expensive heavy markup can be…

i don’t think you know many, though, or else you
wouldn’t have put out messages on the listserves
recently asking if anyone could give some quotes
on the cost of digitizing paper-books…

but hey, maybe you got some answers from people,
and now you know of some actual conversion houses.
so if you do, let us know, so we can collect some facts.

(i will update this thread periodically, to see if you have
come up with any info, so do keep trying to find some
even if you aren’t successful right away, ok? because
with the nifty “recent comments” links on the front page,
even long-scrolled-off entries can be brought to the top…)

***

roger said:
> going any further could lead us into a mudfight.

why? simply because you have no counterargument
to show people why the expense of markup is justified?

you had a good effort there — “because everyone else
will be doing it, so that’s where all the tools will be” —
but of course once everyone else figures out this is
a road fraught with more expense than they imagined
(and fewer benefits than they had been led to believe),
everyone _won’t_ be doing it for very long…

> that doesn’t mean I draw the same conclusions
> as to the best course of action.

of course.

but “the best course of action” will reveal itself to be
patently obvious once people can see all the evidence,
after paying the real costs out in the real world…

if light markup gives benefits equal to heavy markup,
without requiring the added expense of heavy markup,
then its cost-benefit ratio is going to be superior…

obviously superior. to anyone who cares to look.
and most especially to the people paying the bills.

and yes, i can see why your only counterargument
to _that_ logic would be “a mudfight”…

but hey, i like playin’ in mud as much as the next guy.

-bowerbird

Roger Sperberg March 15, 2006 at 6:41 pm

Please explain how I should enter my text when the book I want to share with other readers is this polyglot Bible, with texts in Greek, Latin and Hebrew above, annotations referencing other books of the Bible in Hebrew in the right margin, and at the bottom, a different arrangement of texts.

The equivalent or translated lines in each text of course need to align so that I can check the Latin as I work through the Greek or Hebrew. Likewise the annotations.

I’m willing to use asterisks for bold and underscores for italic. And, of course, sometimes I have an illustration for a drop cap.

Here is an image of what the first page of my e-book should represent (this is, btw, Genesis 1:1). Small as it is, I imagine you understand everything I’m trying to get here.

Now please tell me how I should enter my text when what I want to do is show the differences between several versions of the same book (a Bible, a Shakespearean play, Beowulf) so that I can view each verse or section from the different versions. At any point, the texts that agree with each other are often different ones. Here two have “eat,” a third “ear,” a fourth “hear.” I want to use the e-book so I can quickly discover the variations in printed form and figure out for myself which reading is the most likely and which the corruption.

Also, I was wondering — how does the no-markup method handle forms and tables? I was wanting to reprint some of the statistical work on population growth and bond prices I used to work with. Should I use underscores or dashes for lines?

Of course, all the questions/examples I have here are from my own experience, from real books. People invested the money to typeset and print them; people paid money to buy them. I sure hope e-books don’t exclude this kind of content. By the way, in the children’s book I want to do next, how do I indicate the color of the type? It’s different on every page, to match the color in the facing illustration.

bowerbird March 15, 2006 at 7:07 pm

the old “parallel texts” problem, eh?

is that the best you can do roger? :+)

you want people to infer that
a light-markup system cannot
handle that kind of difficulty,
while a heavy-markup one can.

so please, show us your solution!

once i see the costs you paid
in the form of your markup, and
the benefit you got from them
in terms of the user-interface
that resulted from that markup,
i’ll have a much better idea how
to give a superior cost-benefit ratio
from my own light-markup solution…

again, take your time with it, but
i’ll be bringing this entry back to
the top of the “comments” links
on a periodic basis, so please do
get around to it sooner or later…

-bowerbird

Roger Sperberg March 15, 2006 at 7:27 pm

I notice that you decline to answer the question but instead pretend it’s a problem for me. As it happens, the cost to me is nothing, because the markup for these projects has already been done. And because each person studying those text variations in Shakespeare or Beowulf used self-describing markup, even when they did different things I can unambiguously locate and, if desired, change them.

But with no markup, would each person have to invent their own solutions for complex issues? And how do they document that so my computer can understand their method without my intervention?

Heck, that’s something like the problem of wiki markup, isn’t it? Where one wiki uses an asterisk to indicate bullets and another uses it for bold. In fact, bold can be indicated a lot of ways in different wikis. Here’s one list:
*bold text*
**bold text**
##bold text##
||bold text||
__bold text__
'''bold text'''

So I guess there’s a lot of guessing in the no-markup model to figure out what someone means if they used somebody else’s version and not yours, right? Bold seems pretty easy to track. But what about that interlinear commentary? How do you standardize that for everyone without making up some absolute laws?
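The guessing problem can be made concrete with a small Python sketch. The dialect names and the `BOLD_DELIMITERS` table below are hypothetical, chosen only to mirror a few of the conventions in the list above; no real wiki engine is being modeled. The same line of text yields different "bold" spans depending on which convention the reader assumes.

```python
import re

# Hypothetical dialects, each using one of the bold conventions listed above.
BOLD_DELIMITERS = {
    "single_star": "*",
    "double_star": "**",
    "double_underscore": "__",
    "triple_quote": "'''",
}

def find_bold(text: str, dialect: str) -> list[str]:
    """Return the spans a given dialect would treat as bold."""
    delim = re.escape(BOLD_DELIMITERS[dialect])
    # Non-greedy match between an opening and a closing delimiter.
    return re.findall(f"{delim}(.+?){delim}", text)

line = "a **very** important *note*"
as_double = find_bold(line, "double_star")
as_single = find_bold(line, "single_star")
print(as_double)
print(as_single)
```

Nothing in the text itself says which reading is intended, so the two results disagree. A file that declared its own conventions, or used self-describing tags, would remove the guess.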

But let’s get back to my simple questions at the end of my comment.

How do you represent a table? How do you indicate color? In a non-markup style solution, of course.

Roger Sperberg March 15, 2006 at 7:37 pm

I’m reminded of how some database people dismissed XML when it first came out, saying, “We can transmit data between companies already. It’s called EDI. All XML is doing is sticking flags between fields.”

They were right. XML didn’t do more than EDI did. It just made it unambiguous where one thing ended and another thing started, in a way a program could easily check. And if there was an error in the consistency, it got flagged at the initiating end, where the people who could answer the question would know. Whereas with EDI, you never knew where an error occurred, and you didn’t always know that one did.

And when you wanted to exchange information with a different company, mapping your system to their system was flat out easier (I know; I did this). Differences in field size and names were trivial to sort out; getting to a working exchange was about one-tenth the time.

And really, the only difference was unambiguous marking of the start and end of different pieces.

I think it’s the same with text.
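A minimal Python sketch of that point, using the standard library's `xml.etree.ElementTree`. The order record, its field names, and the pipe-delimited stand-in are invented for illustration; the flat record is not real EDI syntax, only a simplified positional format. The XML with a missing end tag is rejected by the parser, while the flat record with a dropped field still splits without complaint.

```python
import xml.etree.ElementTree as ET
from typing import Optional

def check_well_formed(document: str) -> Optional[str]:
    """Return None if the XML parses, otherwise the parser's error message."""
    try:
        ET.fromstring(document)
        return None
    except ET.ParseError as err:
        return str(err)

# Hypothetical order record, once correct and once with </sku> missing.
good_xml = "<order><sku>A100</sku><qty>3</qty></order>"
bad_xml = "<order><sku>A100<qty>3</qty></order>"

# Simplified positional stand-in (not actual EDI): sku|qty|date.
flat_good = "A100|3|2006-03-15".split("|")
flat_bad = "A100|2006-03-15".split("|")  # qty dropped, yet it still splits

print(check_well_formed(good_xml))
print(check_well_formed(bad_xml) is not None)
print(len(flat_good), len(flat_bad))
```

The sender can run a check like this before transmitting, which is the "flagged at the initiating end" behavior described above. With the flat record, the date silently lands in the quantity position unless someone writes a separate validator.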

bowerbird March 15, 2006 at 11:07 pm

roger said:
> And really, the only difference was
> unambiguous marking of the
> start and end of different pieces.
> I think it’s the same with text.

i have moved your “closing statement”
to the very top of my response to you,
because it answers your questions to me.

in order to represent a table, or colors,
or _anything_ else, using light markup,
you’ll give an “unambiguous marking”
to the “start and end” of those pieces.

to see what any no-markup viewer-app
considers as “unambiguous marking”,
you’ll need to get experience with it…

i could tell you what _my_ viewer-app
considers as “unambiguous marking”
in terms of the variables you’ve raised,
but that is a simple technicality that
can be learned by perusing the texts
on the “rules” of zen markup language,
and/or the z.m.l. test-suite, located here:
> http://snowy.arsc.alaska.edu/bowerbird/test-suite/zml11rules.txt
> http://snowy.arsc.alaska.edu/bowerbird/test-suite/test-suite.zml

> How do you represent a table?

it depends on how complex your table is.

for an ordinary table with one-line cells,
using tabs between the cells will work.

or you can line up the columns using a
monospaced font, and the viewer-app
will “pretty it up” with a nicer font…
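For readers who want to see what the tab convention amounts to, here is a minimal Python sketch. It is not bowerbird's viewer-app and does not follow the actual z.m.l. rules; the function names and the sample rows are placeholders. It splits one-line cells on tabs and then lines the columns up for a monospaced display.

```python
def parse_tab_table(block: str) -> list[list[str]]:
    """Turn tab-separated lines into rows of cells, skipping blank lines."""
    rows = []
    for line in block.splitlines():
        if line.strip():
            rows.append([cell.strip() for cell in line.split("\t")])
    return rows

def render_aligned(rows: list[list[str]]) -> str:
    """Pad each column to its widest cell so the columns line up."""
    n_cols = max(len(row) for row in rows)
    widths = [
        max(len(row[i]) for row in rows if i < len(row))
        for i in range(n_cols)
    ]
    return "\n".join(
        "  ".join(cell.ljust(widths[i]) for i, cell in enumerate(row)).rstrip()
        for row in rows
    )

# Placeholder data, not taken from any real table.
sample = "Item\tCount\napples\t3\npears\t12"
rows = parse_tab_table(sample)
print(render_aligned(rows))
```

This covers the "ordinary table with one-line cells" case only. Cells that wrap over several lines, spanned columns, or header rows would need some further convention.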

basically, since z.m.l. is built on the
conventions used in e-mail and in the
existing e-texts from project gutenberg,
you can look to those other venues to
get a general idea of the z.m.l. “solution”
for any particular problem you encounter.

> How do you indicate color?
> In a non-markup style solution, of course.

type-color is one of the factors that
i generally consider in the purview of
the end-user, not the content-creator.

as many print designers learned as they
moved to the electronic world of the web,
you must come to grips with the realization
that you no longer control all the variables…

after all, “in the library of the future”, as you
have told us, the _user_ controls font-color.
right? right.

good thing, too, because some users are
color-blind, so all your fancy coloring is
lost on them.

but if you’re really determined, you can
use the “wild-card” form of emphasis to
try to force a particular color. but know
that the end-user has “the last say” on
if/how a particular emphasis is rendered.
much like they can edit a css stylesheet.

still, the answer to your question remains:
whatever you want, mark it unambiguously.

***

oh yeah, as to your other question, about
parallel texts, the answer should be obvious.

it _will_ be obvious when i tell it to you.
painfully so. you might even respond by
saying, “well yes, that’s obvious, but…”

however, then you won’t have anything to
_follow_ that “but”… or maybe you will,
but _that_ will have an obvious answer too.

i can say with total and absolute certainty
that you have already seen many ways of
dealing with this particular issue, in fact.
you just didn’t _realize_ that it was that…

but in the end, you’ll wonder why this ever
seemed like it was some difficult problem.

so maybe you’ll wanna think some more,
see if you can come up with the answer
yourself, rather than have me tell you,
and then have you say, “oh, d’uh…”, ok?

but i’m not trying to stall you, so if you
ask me again, i’ll tell you right away, ok?

but really, think about it first, before you do.

-bowerbird

John N. March 17, 2006 at 3:29 pm

I sorta stumbled into this discussion, so I’ll
apologize in advance if I rehash something
that’s been beaten to death elsewhere on this site.

It seems to me that bowerbird and Roger are basically
looking at different problems; or equivalently, are working
from different principles.

Bowerbird’s ‘zen markup’ is, as far as I can see, designed
under the assumption that content is largely represented
by the sequence of words. There are some few generic
kinds of formatting that are much like
HTML styles, but represented with special punctuation so
they don’t get in the way of the words; and there are
a few special classes of ‘block formatting’ representable
by a special keyword set off by itself, the *presentation*
of which is not part of the content. Simple tables and
lists occur frequently enough in literature that they
have some clues in the e-text for layout, but again striving
to not get in the way of the words. The goal of the
e-text formatting under this assumption is to represent
the content in a way that is ‘good enough’ for any
reader that derives from one of the simplest forms of
our technology to do an acceptable job of rendering
the content. The exact nature of that rendering is left
entirely up to the device.

This assumption is actually fine and dandy for a large
chunk of literature out there, probably *most* literature
out there; but not all. Certainly over 99% of the fiction
I read could be dealt with this way; possibly over 50% of
the nonfiction I encounter could, as well. However, the model
starts to fail when the *content* takes on a two-dimensional
nature.

As an example, consider “The Mouse’s Tale”, in Alice’s
Adventures in Wonderland. How faithful is the Gutenberg
representation to the original? Ans: not particularly.
(http://www.gutenberg.org/dirs/etext91/alice30.txt)
For example, the font should be getting smaller as
one progresses down the tail; there is no non-rich
markup I can think of that would deal with this
correctly. Bowerbird might argue (at the risk of
putting words in bowerbird’s mouth) that the font is not
vital to the poem, that the simple formatting still
contains the essence of the poem, and that if it were
truly vital, we could accompany the simple formatting
by a picture. For the first, I wonder if the author
would agree; for the second, my personal belief is that
the poem loses some charm presented the Gutenberg way;
and for the third, that’s simply punting the issue to
some other black-box format, worse than any rich-markup.

Of course, it’s worth asking whether rich markup really
does deal with this case correctly; the answer is ‘probably so’,
as it was published. However, at risk of obscuring the issue,
I discovered as I was writing this that Lewis Carroll’s
*original* version — which is a completely different poem than
the one in print! — was handwritten and formatted in a
much fancier way, such that only a very rich markup (such as,
say, an Adobe Illustrator format) could correctly render it:
the text on the tail curves around to the tip!
http://www.bl.uk/onlinegallery/ttp/ttpbooks.html
One wonders at the transformation of the Mouse’s Tale from this
original to the published version… how much was dictated
by the demands of the media, and how much was the author’s
concept?

For another example, consider the vast amount of scientific
literature involving mathematics. (Something I encounter on
a daily basis, as a programmer dealing with geophysics.)
‘Simple’ formatting *cannot* do a good job of representing
complex math; rich formatting is vital. You can again punt
to pictures, but that sacrifices the usefulness of a
textual representation that is searchable and able to be
manipulated.

I’ll agree with bowerbird that the ‘side-by-side’ text issue
isn’t such a problem, especially if you’ve relaxed your
notion of formatting to allow simple tables: you can get an
adequate side-by-side representation via tables of text. Will
an arbitrary text reader handle this correctly? Maybe, maybe not;
but under bowerbird’s basic premise, this is a red herring.

I find bowerbird’s dismissal of the ‘colored text’ problem
to be one of the keys to the premise: the exact nature of
the formatting is not key to the content. For Roger’s
specific example of a children’s book, I find this premise
to be specious; it’s clear that the presentation and
formatting is quite important in this case. Of course, in
this case, it’s quite likely that the color and formatting
is related to images, and that the text should probably be
in images as well; for indexing purposes, an accompanying
plain-text is probably ‘good enough’ again. A rich markup
that does a better job at presenting images would be useful,
and the problem of having a format that can be propagated
indefinitely in the future is non-trivial (don’t tell me
PDF, that’s still too proprietary despite GNU’s efforts, IMHO).
But at that point, I think we pass beyond the problem that
bowerbird allows us to handle at all.

Essentially, it feels to me that bowerbird’s claim is that
content can usually be divorced from formatting; and that
in many of the cases where it isn’t, it’s the author’s
responsibility to make that separation. If I as an author
really believe that a custom set of drop caps, text that
is specifically underlined and not some viewer-specific
’emphatic’ form, and arrows for list bullets are
vital to the content I am presenting, I am not allowed to
live in bowerbird’s world. (OK, I’m not allowed to have my
content represented the way I want; to some authors, I
suspect that’s much the same sensation.) If I want an
e-text that closely resembles what I can create on a sheet
of paper, I need a richer format and more powerful renderer
than that provided by bowerbird’s zen; but for a large
number of cases, I can get by without. Simple formatting,
nevertheless, will not handle all desirable cases.

–John N.

Roger Sperberg March 18, 2006 at 12:56 pm

I would just point out that I can use XML to mark up my text and in the middle of it I can insert SVG vocabulary in XML to precisely size and position my text, so that the mouse’s tail/tale, in roman type or script can be precisely replicated.
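A minimal sketch of what that could look like, written in Python with the standard library's `xml.etree.ElementTree` to emit SVG `text` elements. The function name `tail_svg`, the starting size, the shrink factor, and the positions are all assumptions made for illustration; this is not a faithful layout of the printed poem, only a demonstration that shrinking, positioned text stays searchable text inside an XML file.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

def tail_svg(lines, start_size=16.0, shrink=0.9, x=40.0, sway=15.0):
    """Emit SVG in which each successive line is set in a smaller font."""
    # Serialize SVG elements with a default namespace instead of a prefix.
    ET.register_namespace("", SVG_NS)
    svg = ET.Element(f"{{{SVG_NS}}}svg", {"width": "300", "height": "400"})
    y, size = 20.0, start_size
    for i, line in enumerate(lines):
        text = ET.SubElement(svg, f"{{{SVG_NS}}}text", {
            # Alternate the horizontal offset to suggest a winding tail.
            "x": f"{x + sway * (i % 2):.1f}",
            "y": f"{y:.1f}",
            "font-size": f"{size:.1f}",
        })
        text.text = line
        y += size * 1.2   # advance by roughly one line height
        size *= shrink    # shrink the font for the next line
    return ET.tostring(svg, encoding="unicode")

markup = tail_svg(["Fury said to", "a mouse, That", "he met in the", "house"])
print(markup)
```

The words remain ordinary character data, so they can be indexed and searched, while the size and position live in attributes that a simpler reader could ignore.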

But really the point is that it’s not a question of “oh, you can do that in a lot of other ways, that’s not special” but of the network effect: I can do it in a text editor, without going to binary, without leaving my XML file, without invoking a jpeg file. If markup by itself were miraculous, SGML would have been all we needed. If all XML brought to the table was a simplification of SGML, then it would be an equivalently fringe technology.

The talk cited above quotes Murata Makoto: “XML is interesting primarily and possibly only because it is a language plausible for use both with data and in documents.” It is a mistake, the speaker notes, to look at the benefit of XML (and of markup) through the lens of texts alone.

If this so-called zero markup can do tricks with etexts, then good for it. If it’s not useful to me beyond texts, I’m uninterested. It’s stuck back in 1985. And so-called heavy markup is mostly machine-generated so the evils it’s painted with are caricatures.

Thank you, John N. for your insights.

Jon Noring March 19, 2006 at 11:59 am

Of course, Bowerbird also does not define what he means by “heavy” markup. Is all XML markup “heavy”, or only some markup? He also talks about cost of markup vis-a-vis plain text. Most people these days don’t even know how to use a text editor (like vi), so ZML formatting is not necessarily easy for them.

bowerbird March 19, 2006 at 1:44 pm

oh gee, now we’re getting ridiculous.

i’ve got a long response already written
to the excellent questions john n. raised,
i’m just working to finish up a demo-book
and post it, so that i can point to it…

if people want an example of heavy markup,
just do a “view source” on this very page…
(remember the days when .html was simple?)

and yes, all x.m.l. markup is “heavy”, you betcha.

and are you really saying, jon n., that people
“don’t even know how to use a text editor”?

really? well, i suppose those people aren’t
doing any writing or book-digitizing anyway.

but as for the “difficulty” of zen markup,
just take a look at these example-books
and tell me if the formatting looks hard:
> http://www.greatamericannovel.com/mabie.zml
> http://www.greatamericannovel.com/myant.zml
> http://www.greatamericannovel.com/sgfhb.zml

or gawk the “11 rules of zen markup language”, at:
> http://snowy.arsc.alaska.edu/bowerbird/test-suite/zml11rules.txt

my goal was to make z.m.l. so simple that a 4th-grader
could understand it. if you’ve got a 4th-grader who
finds any part of it “difficult”, let me know what part.

-bowerbird

p.s. then see if your 4th-grader can understand the
.html coding for this web-page!

Jon Noring March 19, 2006 at 10:25 pm

XHTML markup for books (particularly that used in OEBPS) is a lot simpler than the markup of a typical web page, since web pages are usually made to be quite complex. For one example, refer to My Antonia. Once one gets past the table of contents section, it is pretty dirt simple.

Anyway, you answered my question about what you meant by “heavy”, an adjective you’ve added simply to spread FUD. Typical Bowerbird droppings.

bowerbird March 20, 2006 at 4:57 am

jon said:
> Typical Bowerbird droppings.

typical noring mud-slinging…

-bowerbird

bowerbird March 20, 2006 at 2:00 pm

ok, let’s see if we can restore
some semblance of dialog…

john n., yes, i’m afraid that
if your subject matter requires
a lot of equations, you’re likely
to be stuck with heavy markup.

at present are you using latex?
have you done any math-ml?

i’ll have more to say on this later,
but i’m curious about something.

when communicating with your
colleagues via plain-text e-mail,
how do you handle this problem?

or do you just throw up your hands
and do without?

-bowerbird

bowerbird March 21, 2006 at 3:51 pm

ok, let’s update:

jon noring, got any pointers to “conversion houses”
where i can get a price quote for some x.m.l. markup?

-bowerbird

Jon Noring March 21, 2006 at 7:02 pm

Rosetta Solutions, CodeMantra, Apex, etc.

bowerbird March 21, 2006 at 8:18 pm

jon-

thanks!

but please do fill out the “etc.”,
since the more price-quotes i get,
the better it will be…

-bowerbird

Jon Noring March 22, 2006 at 12:57 am

You can add DataConversion Laboratory. There’s a few other smaller ones, but that should suffice. All the ones listed do conversion work for most of the major and not-so-major publishers.
