Notes about the XML exported by Diogenes

The legacy CD-ROM databases that Diogenes operates upon were created using a variety of obsolete systems of markup, all idiosyncratic to these databases and all with gaps in their documentation. Most of these systems do not map straightforwardly onto modern XML-based ideas of markup, so the XML output produced by Diogenes cannot be anything other than a compromise.

All of the texts will validate against the full TEI schema, but although Diogenes adheres to the letter of the P5 TEI guidelines, in many cases it is obliged to violate their spirit, simply because the necessary information is lacking in the databases. This page tries to explain the reasons for some of the idiosyncrasies of Diogenes’ XML output.

If you are interested in playing with the XML output, there is a command-line tool which provides more options than the application defaults. It is called xml-export.pl and it should be run from the server subdirectory of the Diogenes application. It includes the full, uncustomized TEI schema and the Jing validator (requires Java runtime).

Semantic vs Presentational Markup

The most fundamental problem in translating these databases to XML is that they were marked up in an almost strictly presentational style, which is the opposite of good practice these days. That is, the markup records how the text looked on the page rather than what those features might mean. This has the advantage that the markup and proofreading could be done by completely untrained workers who needed to know nothing about the texts and their language, but it puts the burden on the reader of knowing how to interpret the typographical features of each individual text.

There are a few cases where the databases do contain semantic markup (e.g. the supplements of Servius Danielis and some of the ancient Greek lexica have their lemmata marked as such), but these are exceptions. In most cases, semantic information has to be inferred by the user from the presentational markup on a work-by-work basis.

Prose vs Verse and Other Linewise Texts

The most significant problem created by the strict focus on the way the text appears on the page concerns prose texts. All databases have structural markup down to the level of the individual line, which makes sense for verse and for documents like papyri where the layout of the text, line by line, is significant. But the line-breaks in a given edition of a normal prose text are not interesting. Moreover, the hyphenation in such texts is duly recorded. Not only is this information uninteresting; it actively interferes with searching for the words that have been hyphenated.

Diogenes responds to this by removing hyphens in prose texts and rejoining hyphenated words. It does not mark up the individual lines of prose texts, but, on the other hand, it does not re-flow the text, leaving it as-is, so that the reader can see where the line breaks were. This is useful, for example, when a prose author quotes a passage of verse and the layout of the lines suddenly becomes important. Unfortunately, there is no easy general way to identify such embedded verse quotations in order to mark these lines up appropriately.
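The rejoining step can be illustrated with a short sketch. This is not Diogenes' actual code (which is written in Perl and handles many more cases); it simply shows the basic idea of carrying the fragment before an end-of-line hyphen down to the next line so that the word is searchable as a whole:

```python
def rejoin_hyphenated(lines):
    """Rejoin words split by end-of-line hyphenation in a prose text.

    A minimal sketch: when a line ends in a hyphen, the fragment
    before the hyphen is carried down and joined to the start of the
    next line.  Real texts need more care (e.g. genuine compound-word
    hyphens must be preserved).
    """
    out = []
    carry = ""
    for line in lines:
        line = carry + line.lstrip() if carry else line
        carry = ""
        if line.endswith("-"):
            # Split off the word fragment before the hyphen
            head, _, frag = line[:-1].rpartition(" ")
            out.append(head)
            carry = frag
        else:
            out.append(line)
    if carry:
        out.append(carry)
    return out
```

Note that the line layout is otherwise preserved, as described above; only the hyphenated word migrates to the following line.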

The lowest structural element of verse (and papyrus) texts is the line, while for prose texts it is the paragraph, so these categories must be treated differently in the XML export process. But how to tell prose and verse apart? Since all works in the databases are marked up as if lines of text were significant, there can be no general rule. Diogenes applies a heuristic based upon the frequency of hyphens in the text and a few other indications, but this is not foolproof. There is also a mechanism to specify manually whether a given text is prose or verse/papyrus. Please let me know if you notice any text that has been misidentified.
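A heuristic of the kind described might look like the following sketch. The threshold value here is an illustrative guess of my own, not the one Diogenes actually uses, and the real heuristic also weighs other indications:

```python
def looks_like_prose(lines, threshold=0.02):
    """Crude prose/verse classifier based on end-of-line hyphenation:
    prose texts in the databases carry hyphens at line ends, verse
    texts generally do not.  The threshold is illustrative only.
    """
    if not lines:
        return False
    hyphenated = sum(1 for l in lines if l.rstrip().endswith("-"))
    return hyphenated / len(lines) >= threshold
```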

Structural Metadata

There is one aspect of the original markup that maps well onto XML, namely the structural citation information that is embedded in the non-ASCII, binary metadata of the files. These are the levels that you see when Diogenes asks you what point in a text you want to jump to and when it identifies the location of search hits. These levels are strictly hierarchical and are translated into <div>s, except that the lowest level for verse and related texts is the line, represented as an <l>.

Unfortunately, prose sections often end in the middle of a line, typically at the end of a sentence. But the binary structural metadata can only occur between lines, so there is no way for the databases to indicate in these cases where the prose section actually ends. The export code tries to guess, based upon punctuation, at what point in a given line the current section of a prose text ends. But this is not infallible.
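The kind of guess involved can be sketched as follows. This is not Diogenes' actual logic, just an illustration of splitting a section's final line at its last sentence-final punctuation mark and assigning the remainder to the next section:

```python
import re

def guess_section_break(line):
    """Given the last line of a prose section, guess where the section
    really ends: split after the final sentence-ending punctuation
    mark and assign the remainder to the next section.  A sketch of
    the kind of punctuation-based guess described above.
    """
    matches = list(re.finditer(r"[.!?;\u00b7]\s*", line))
    if not matches:
        return line, ""          # no punctuation: keep the whole line
    cut = matches[-1].end()
    return line[:cut].rstrip(), line[cut:].lstrip()
```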

Paragraphs

At the start of the conversion process, the lowest level of <div> in prose texts contains a single <p> element, but this often does not represent a paragraph. For example, when the citation system is based upon the pagination of a specific edition (e.g. Stephanus), the structural metadata may not have any nesting relationship with paragraphing or any other intrinsic aspect of the text.

For this reason, the export code tries to guess where real paragraphs are located in prose texts. It does this by looking for indentation. The initial <p> element within a given <div> is then broken up into multiple paragraphs, indicated with <p rend="indent(1)">. Indentation is not, however, an infallible indicator, so some of this guesswork may be wrong.
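The indentation-based splitting can be sketched like this. Again, this is an illustration of the approach rather than Diogenes' own code; each returned group of lines would become one <p rend="indent(1)"> element:

```python
def split_paragraphs(lines):
    """Break the lines of a prose <div> into paragraphs wherever a
    line is indented, mirroring the guesswork described above.
    A sketch; the real export code has more cases to handle.
    """
    paras, current = [], []
    for line in lines:
        if line.startswith((" ", "\t")) and current:
            paras.append(current)    # indentation starts a new paragraph
            current = []
        current.append(line.strip())
    if current:
        paras.append(current)
    return paras
```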

Other Markup

The other kinds of markup found in the databases are part of the non-binary, ASCII text. There are several overlapping systems within this scheme, which is documented in the TLG Beta Code Manual. These systems do not nest at all with respect to the binary, structural markup, nor do they nest consistently with each other. This poses major problems when converting to a strictly hierarchical format like XML.

The conversion code tries to do the sensible thing, but there are cases when it is impossible to know what the original intention was, such as when markup is unterminated. A set of exceptions is coded into the conversion code to deal with problematic instances. If you come across any others – for example, a passage of Greek that has not been translated into Unicode – please report them.

Divisions Containing Only Headings

One of the peculiarities of Diogenes' XML output is that there are <div>s which contain nothing more than a heading, such as the title of a poem or a book. This is simply an artefact of the way the texts exist in the databases. It might seem easy to delete these <div>s and to convert their contents to a <head> for the following, real <div>.

This works in simple cases, but it is hard to do in a way that always produces valid TEI XML. Moreover, some of these <div>s contain important information. For example, in some texts of the kind mentioned above, where the citation structure is based upon a historic edition, the n attribute of the <div> tells you the scope of the text to which the heading pertains. This situation is also frequent in texts that collect fragments from many works: without the information attached to the heading it is not possible to know which fragments pertain to which ancient text.

Given the difficulty of coding a general solution to the problem that does not throw away important information, I have decided to leave these <div>s as-is. They can always be distinguished by a letter, often "t" (presumably for "title"), attached to the section number, such as n="t2" or n="t22-44".
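If you want to post-process the output yourself, heading-only divisions are straightforward to collect. The sketch below assumes the exported files use the standard TEI namespace and that the "t" prefix on the n attribute is the marker, as described above; it is a post-processing aid of my own devising, not part of Diogenes:

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

def heading_only_divs(root):
    """Collect <div> elements whose n attribute carries the 't'
    prefix described above (e.g. n="t2"), i.e. divisions that exist
    only to hold a heading.  Assumes the standard TEI namespace.
    """
    return [d for d in root.iter(TEI_NS + "div")
            if d.get("n", "").startswith("t")]
```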

Support for Documentary Corpora

There is no support at the moment for exporting the documentary corpora (papyri and inscriptions) as XML. It would not be hard to add it, but it is not a high-priority item on the assumption that these are already available elsewhere in superior digital editions. The DDP corpus is available in a much richer form (in EpiDoc XML) from papyri.info.

The situation with the inscriptions is less clear, but many of the collections included in the legacy databases are now available in improved form in various places online. See epigraphy.info for a list of websites. The PHI also has an online version of their Greek inscriptions. Any attempt by me to convert the inscriptions in the legacy databases to XML would face limits, for I do not currently have access to documentation for many of their encoding conventions.