Suvudu.com recently gave away a bunch of ebooks [1] in pdf format. I was pretty surprised by the reaction: people were angry because the free stuff wasn’t the right free stuff. They didn’t want pdf, that was pretty clear. Now Suvudu are offering a bunch of other formats, but it got me thinking about why I like pdf ebooks, and why so many people seem not to.

Just like there’s a contrast between compiled and interpreted programming languages, there’s a contrast between compiled and interpreted ebook formats. [2] Pdf is at the compiled end, html is at the interpreted end. The two extremes are good at different things (just like compiled and interpreted languages); I’ll give some examples below, which point to why the reactions to the suvudu release were so extreme. The analogy with compiler technology suggests a third alternative: a level of intermediate code which is cheap to interpret but contains more compiled-in information than html and its relatives. As a concrete application, I’ll argue that language-specific discretionary hyphenation belongs in this layer.

The first complaint about pdf as an ebook format is always that it doesn’t support reflow: the page dimensions are fixed, and if they don’t match the size of the screen you’re reading on, that’s too bad. [3] Html, on the other hand, is delivered to your computer as an unbroken stream of text and it’s up to your web browser to decide how to split that text into lines that fit on your screen. (I’m ignoring css; it’s a separate technology from html proper and badly-written css tends to make html pages that behave more like pdf anyway.)
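
To make the interpreted half concrete, here’s a toy sketch in Python of the greedy, first-fit line breaking a browser-style interpreter performs at view time (the function and its details are my own illustration, not any browser’s actual algorithm): the text arrives as one stream, and lines only come into existence once a display width is known.

```python
def greedy_reflow(text, width):
    """First-fit line breaking: append each word to the current line
    if it fits, otherwise start a new line. Roughly what a browser
    does with an html text stream at view time."""
    lines, current = [], ""
    for word in text.split():
        candidate = (current + " " + word).strip()
        if len(candidate) <= width:
            current = candidate
        else:
            if current:
                lines.append(current)
            current = word  # an over-long word simply overflows here
    if current:
        lines.append(current)
    return lines

# The same stream reflows to whatever width the reader's screen dictates:
text = "It is up to your web browser to decide how to split that text into lines."
for w in (25, 40):
    print("\n".join(greedy_reflow(text, w)), "\n")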

I could go on talking about ‘pdf-style’ and ‘html-style’ ebook formats, but there’s a more principled distinction available: pdf is a compiled format and html is an interpreted format. By that I mean that the bulk of the layout work for displaying a pdf document is done when the document is created (‘compiled in’ to the file you download), while the bulk of the layout work for html is done when the document is viewed (‘interpreted online’).

Where the work gets done is also where the decisions get made — that is, where the control is. Here’s a reason a publisher might prefer to produce compiled rather than interpreted texts: they get to make all the decisions about how those texts appear. If a textbook is produced in both paper and ebook versions, it might be helpful if the page numbering is the same for both. [4] Comics, too, are unlikely to benefit much from interpreted delivery: in drawing a strip or a page spread, the author will probably want to control all the decisions that might be left up to the interpreter in a more text-heavy document.

So far our analogy hasn’t done much for us. Now I want to put it to work, by introducing another reason to prefer compiled over interpreted documents: typographic quality.

Go have a look at Dario Taraborelli’s page The Beauty of LaTeX; in particular, the section on line breaks, justification and hyphenation (I’ll use ‘justification’ as shorthand for all three of these processes). Taraborelli’s point is different to mine: he’s comparing different kinds of document compilers. And in fact some wysiwyg systems do do the clever and complicated things TeX does to get justification right; Adobe InDesign apparently has this feature, and for all I know the latest Microsoft Word might include it also. It’s unlikely, though, that we’ll get this kind of quality in an interpreted document. [5]

Justification as TeX does it is a complex process of simultaneously adjusting all the lines in a paragraph, to get the best possible overall quality. This is expensive, among other reasons, because it can involve backtracking: an early tentative decision about the first line of the paragraph might have to be reversed if it would force things to turn out badly on later lines. The more globally sensitive the algorithm is, the more expensive it gets (should it notice page breaks and avoid widows and orphans?), but the higher the typographic quality of the result. [6]

Another thing making TeX’s justification expensive is that it invokes automatic hyphenation. Doing this right involves knowing which language the document is written in: different languages allow hyphenation at different points. It involves a complex system of (language-dependent) hyphenation rules, and some mechanism for marking by hand the cases where these do not apply. All this is expensive not in processor time (as a backtracking line-break algorithm is) but in the complexity of the system itself. TeX, as a document compiler, can include hyphenation rules for many different languages and call on the particular one that is needed for the particular document being compiled. The same flexibility and scope is hard to imagine in a document interpreter that might have to run under severe hardware restrictions; it is no surprise that your web browser doesn’t even attempt hyphenation.
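
To make the cost concrete, here’s a simplified sketch in Python of whole-paragraph line breaking. It’s the classic minimum-raggedness dynamic programme, not TeX’s actual badness-and-demerits machinery, and all the names are my own, but it shows the key property: the break chosen for the first line is only settled once every later line has been accounted for.

```python
def optimal_breaks(words, width):
    """Whole-paragraph line breaking: choose breaks minimising the
    total squared slack (trailing space) over all lines, with the
    last line free. A cheap stand-in for TeX's badness minimisation."""
    n = len(words)
    INF = float("inf")

    def slack(i, j):
        # Space left over if words[i:j] share one line; None if they overflow.
        length = sum(len(w) for w in words[i:j]) + (j - i - 1)
        return width - length if length <= width else None

    # best[i] = (cost of laying out words[i:], index of the next break)
    best = [None] * (n + 1)
    best[n] = (0.0, n)
    for i in range(n - 1, -1, -1):
        best[i] = (INF, n)
        for j in range(i + 1, n + 1):
            s = slack(i, j)
            if s is None:
                break  # adding more words only makes the line longer
            cost = (0 if j == n else s * s) + best[j][0]
            if cost < best[i][0]:
                best[i] = (cost, j)

    lines, i = [], 0
    while i < n:
        j = best[i][1]
        lines.append(" ".join(words[i:j]))
        i = j
    return lines
```

Greedy first-fit wrapping (as in the earlier sketch) decides each line once and moves on; this version revisits its decisions, and TeX proper layers hyphenation, stretchable glue and page-break interactions on top. That is exactly the kind of work a publisher can afford to do once at compile time, but that a constrained reading device cannot comfortably repeat at every reflow.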

So here’s a reason why both publishers and readers might prefer compiled over interpreted documents. Publishers, because it lets them produce documents of high typographic quality — or more exactly, because it gives them control over the typographic quality of the documents they produce. A lot of design effort may go into a textbook, with clever placement of different elements on the page (main text, side notes, boxed ‘expert information’ sections, pointers to related material and so on), careful choice of fonts and colour schemes and what-have-you. A publisher might easily refuse to put that work to waste in the electronic version of the book, by giving control over to the interpreter doing the display. There is rather less evidence, I’m afraid, for readers who appreciate typographic quality. The Suvudu reaction shows that quite clearly: these folk wanted the content, and forget about whether the form is pretty or not. [7] That’s the interpreted-ebook market; high-quality typesetting is for the compiled-ebook market. I hope I’m not the only person in that market… (Feedbooks seem to be using LaTeX to generate pdfs with custom page dimensions, from free sources such as Project Gutenberg. Unfortunately the typographic quality is still low; there’s no way their offerings are a replacement for a LaTeX production that has had some human design attention. Still it’s a definite cut above html or plain text. If only they offered the LaTeX sources as well…)

I promised some concrete payoff from the compiled/interpreted analogy, and here it is. There’s a way to get some of the benefits of both approaches in language design: you compile your handwritten programs to intermediate code, which gets interpreted on the platform the program runs on. The idea is that the compiler part of the process can do expensive things like type-checking and optimisation, while the interpreter can be tightly matched to the runtime environment. We shouldn’t stretch the analogy too far, but here’s a suggestion for something that belongs in the intermediate code level: discretionary hyphens.

Suppose we had a compiler that did TeX-style hyphenation analysis: it knew what language the document was written in, and it used hyphenation patterns for that language to find all the valid hyphenation points in the document. The intermediate-code level would be something like html, but with those hyphenation points marked with ‘discretionary hyphens’. And the interpreter level could use that hyphenation information to improve the quality of on-the-fly justification (reflow) when the display area or font size changes; I imagine that just adding reliable hyphenation would already improve the line-by-line justification enormously.
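
Here’s a sketch in Python of what that compiler pass might look like. It leans on the pyphen library, which wraps the same Liang-style hyphenation patterns TeX uses; the function name, and the choice of U+00AD soft hyphens as the ‘discretionary hyphen’ marker in the intermediate code, are my own assumptions.

```python
import re
import pyphen  # Liang-style hyphenation patterns, per language

def compile_soft_hyphens(text, lang="en_GB"):
    """'Compile' language-specific hyphenation into the document:
    mark every valid break point with a U+00AD soft hyphen, so a
    downstream interpreter can hyphenate during reflow without
    carrying per-language pattern tables of its own."""
    dic = pyphen.Pyphen(lang=lang)
    mark = lambda m: dic.inserted(m.group(0), hyphen="\u00ad")
    return re.sub(r"[A-Za-z]+", mark, text)

marked = compile_soft_hyphens("language-specific discretionary hyphenation")
# Soft hyphens are invisible unless the renderer breaks a line there,
# so the intermediate form still reads normally at any width.
```

Html even has a vehicle for this already: the &shy; entity is exactly a discretionary hyphen, so the intermediate code could be ordinary html with the language-dependent work compiled in.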

What was the point again?

I’ve wandered, it’s true. Here are the main points. Compiled and interpreted formats (static design like pdf and reflowable like html) are good at different things, just like compiled and interpreted languages. Compiled formats put the control in the hands of the document designer, and give high typographic quality — at least when the designer puts the effort into making this happen! Interpreted formats respond to the details of the environment in which they’re viewed (display area, the user’s preferred fonts and font sizes, etc.). On the downside, compiled formats are inflexible (in particular, viewing them on devices with different display areas is usually unpleasant) while interpreted formats tend to produce low typographic quality and enforce boring design. The analogy suggests that ‘intermediate code’ might be a way to increase the typographic quality of interpreted formats, and discretionary hyphenation is a specific example where this could work.

Notes:

  1. This was a great marketing move: they took several sf/f series that have proven popularity, and gave away the first book in each. Way to get new readers hooked: “First time’s free…”
  2. Sure, this is more general than ebooks. Let’s keep this concrete and specific though, otherwise I’ll end up drowning in abstraction and generality.
  3. Actually pdf can do some sort of reflow, these days. Nobody understands or uses it yet though.
  4. My argument for pages instead of an unbroken column of text… will have to wait for another day.
  5. A Word file is a compiled document, but it’s one you can edit. The analogy only goes so far.
  6. Only to a point, presumably: we are automating the process of aesthetic judgement, which —I hope!— can only be approximated.
  7. As an aside, interpreted formats tend to be easier to convert losslessly than compiled ones, as anyone who has tried to copy-paste from pdf will attest. That’s another reason those folk may have preferred an interpreted format: it’s easier to share across different devices, by making the conversions as needed.