[M2L1] Creating Digital Editions with TEI

In our first module, we took a look at HTML, the markup language of the web. Any Digital Humanities project delivered over the web ends up being delivered to your browser as HTML. Digital editions of texts are no different; whatever their purpose, you encounter them on the web as HTML.

Let’s take a look at one such publication, an interactive edition of an inquisition manuscript from southern France created as an accompaniment to a PhD thesis. The manuscript records around 6,000 witness statements given before inquisitors during an investigation into heresy in villages of the Languedoc. Visit the edition online here and explore some of the features. In particular, note that:

    • on the right-hand side, under the heading ‘Explore document features’, you can highlight people and places
    • within the text people and places appear as clickable links
    • each person and place has their own web page that lists other appearances within the manuscript

This is a truly interactive digital edition that allows us to explore the manuscript. If we encode the text in sufficient detail, such an edition also allows us to subject the manuscript to automated analysis. For example, we might have the computer count the number of times a particular place or person appears and make assumptions about their relative importance. Or we might look at the relationships between different individuals, counting who is mentioned by whom and how frequently, in an attempt to make sense of the social world which the manuscript describes.

Markup

How is this text marked up? Let’s have a look at the first line:

Item. Anno et die quo supra.1Poncius Rainardi testis iuratus dixit quod vidit in domo Peire de Sancto Andrea Iohannem Cambitorem et socium eius, hereticos. Et vidit ibi cum eis Willelmum Vital; Bernardum, dominum del Mas; Arnaldum de Rozengue; Raimundum de Causit; et plures alios de quibus non recolit. Et omnes et ipse testis adoravit ibi dictos hereticos. Et sunt XIIII anni vel circa.

(this is translated as)

Item. Year and day as above, the sworn witness Pons Rainard said that he saw Johan Cambiaire and his companions, heretics, in the house of Peire de Sancto Andrea, and saw with them Guilhem Vidal, Bernard lord of Mas-Saintes-Puelles, Arnald de Rosengue, Raimund de Causit and many others he did not recall. And the witness and everyone else adored said heretics. This was about 14 years ago.

If we use the View Page Source command of our web browser, we can see that this section is made up of the following HTML:

 
 <p class="MS609-0006-1 doc-segment-yellow">Item.
 Anno et die quo supra.<sup>1</sup><a href="/person/Pons_Rainart_MSP-AU" class="change_link_colour Pons_Rainart_MSP-AU doc-person-blue">Poncius Rainardi</a> testis iuratus dixit quod      vidit
 <a href="/place/home_of_Cap-de-Porc" class="change_link_colour home_of_Cap-de-Porc doc-place-pink">in domo
 </a><a href="/person/Peire_Cap-de-Porc_MSP-AU" class="change_link_colour Peire_Cap-de-Porc_MSP-AU doc-person-blue">P<span class="supplied">eire</span> de Sancto Andrea</a>
 <a href="/person/Johan_Cambiaire_MSP-AU" class="change_link_colour Johan_Cambiaire_MSP-AU doc-person-blue">Iohannem Cambitorem</a>
 et socium<span>&nbsp;</span>eius, hereticos. Et vidit ibi cum eis
 <a href="/person/Guilhem_Vidal_MSP-AU" class="change_link_colour Guilhem_Vidal_MSP-AU doc-person-blue">W<span class="supplied">illelmum</span> Vital</a>;
 <a href="/person/Bernard_del_Mas_Senior_MSP-AU" class="change_link_colour Bernard_del_Mas_Senior_MSP-AU doc-person-blue">B<span class="supplied">ernardum</span>, dominum del Mas</a>;
 <a href="/person/Arnald_de_Rosengue_MSP-AU" class="change_link_colour Arnald_de_Rosengue_MSP-AU doc-person-blue">Arnaldum de Rozengue</a>;
 <a href="/person/Raimund_de_Causit_MSP-AU" class="change_link_colour Raimund_de_Causit_MSP-AU doc-person-blue">R<span class="supplied">aimundum</span> de Causit</a>; et
 <a href="/person/others_unrecalled" class="change_link_colour others_unrecalled doc-person-blue">plures alios de quibus<span>&nbsp;</span>non recolit</a>. Et omnes et
 <a href="/person/Pons_Rainart_MSP-AU" class="change_link_colour Pons_Rainart_MSP-AU doc-person-blue"></a>ipse testis adoravit ibi dictos hereticos. Et
 sunt XIIII anni vel circa.
 </p>

This fragment of HTML uses only a small number of tags:

  • <p></p> to wrap the paragraph
  • <a></a> to link to the summary page for each individual and place
  • <span></span> to mark certain words
  • <sup></sup> to mark superscript sections

Does HTML help us understand the structure of the original text or the editor’s process? Well, sort of. We can see that the <a></a> elements each link to a person or place.

We can also see that the HTML makes extensive use of class attributes. For instance, ‘Poncius Rainardi’ has the classes change_link_colour Pons_Rainart_MSP-AU doc-person-blue. The doc-person-blue class hints at what kind of link the marked text is: a link to a person. Likewise, the <span></span> elements which use a class of supplied seem to have been used to mark up text supplied by the editor that was not in the original.

To ‘reverse engineer’ the HTML in this way we have to make a set of assumptions about the author’s intentions. These are our best guesses, but this is not really what class attributes are for. As we learned in Module 1, CSS classes allow us to associate CSS styling rules with the element – that is to say that they are intended just for presentation.

We could probably reverse engineer this HTML to understand something about the editor’s approach, but it would take a lot of time-consuming detective work. And what if we wanted to compare the HTML content of different editions whose authors had made different decisions about how to present their text visually? We would have to go through the whole process again, with no guarantee of success. At best, we would have a very limited understanding of the complex intellectual process of producing a digital edition. HTML is a fantastic general purpose language for presenting material online, but the range of meanings it can convey is rather limited; this is why digital editions are not typically produced using HTML.
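To make the guesswork concrete, here is a minimal Python sketch of this kind of reverse engineering, using only the standard library. It reads a shortened copy of the HTML fragment above and guesses what each link refers to from its colour class. The rule that doc-person-blue means ‘person’ and doc-place-pink means ‘place’ is our own assumption, not something the HTML guarantees, and the class name LinkGuesser is purely illustrative.

```python
from html.parser import HTMLParser

# A shortened copy of the HTML fragment shown above.
HTML = '''<p class="MS609-0006-1 doc-segment-yellow">Item. Anno et die quo supra.
<a href="/person/Pons_Rainart_MSP-AU" class="change_link_colour Pons_Rainart_MSP-AU doc-person-blue">Poncius Rainardi</a>
testis iuratus dixit quod vidit
<a href="/place/home_of_Cap-de-Porc" class="change_link_colour home_of_Cap-de-Porc doc-place-pink">in domo</a>
<a href="/person/Johan_Cambiaire_MSP-AU" class="change_link_colour Johan_Cambiaire_MSP-AU doc-person-blue">Iohannem Cambitorem</a>
</p>'''


class LinkGuesser(HTMLParser):
    """Collect (guessed_kind, href, text) for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current = None  # [kind, href, text_parts] while inside an <a>

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        # Guesswork: we ASSUME the colour classes encode the kind of entity.
        if "doc-person-blue" in classes:
            kind = "person"
        elif "doc-place-pink" in classes:
            kind = "place"
        else:
            kind = "unknown"
        self._current = [kind, attrs.get("href"), []]

    def handle_data(self, data):
        if self._current is not None:
            self._current[2].append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            kind, href, parts = self._current
            # Normalise whitespace inside the link text.
            self.links.append((kind, href, " ".join("".join(parts).split())))
            self._current = None


guesser = LinkGuesser()
guesser.feed(HTML)
guesser.close()
for link in guesser.links:
    print(link)
```

The sketch works for this page only because we hard-coded two class names that happen to be used here. A different edition with different styling choices would need a different set of guesses.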

TEI-XML

In fact, this digital edition, like most these days, is produced in TEI-XML. Before we go much further into what that is, let’s have a look at the fragment of markup which represents the excerpt of the edition we have been looking at.

(You can download the whole XML source of the document we have been examining by choosing the XML option from the download links on the right-hand side of its page, or you can download it directly here.)

The structure of our fragment (which begins on line 46 and ends on line 62 of the XML document) will be rather familiar:


            <seg type="dep_event" subtype="event" xml:id="MS609-0006-1">
               <lb break="y" n="5"/>Item.  
               <date type="deposition_date" sameAs="#MS609-0001.xml" cert="medium">Anno et die quo supra.</date> 
               <persName nymRef="#Pons_Rainart_MSP-AU" role="dep">Poncius Rainardi</persName> testis iuratus dixit quod vidit 
               <placeName type="event_loc" nymRef="#home_of_Cap-de-Porc">in domo 
                  <persName nymRef="#Peire_Cap-de-Porc_MSP-AU" role="own">P<supplied reason="expname">eire</supplied> de Sancto Andrea</persName>
               </placeName>
               <persName nymRef="#Johan_Cambiaire_MSP-AU" ana="#pAdo" role="par">Iohannem Cambitorem</persName> 
               et socium<lb break="y" n="6"/>eius, hereticos. Et vidit ibi cum eis 
               <persName nymRef="#Guilhem_Vidal_MSP-AU" ana="#pAdo" role="par">W<supplied reason="expname">illelmum</supplied> Vital</persName>; 
               <persName nymRef="#Bernard_del_Mas_Senior_MSP-AU" ana="#pAdo" role="par">B<supplied reason="expname">ernardum</supplied>, dominum del Mas</persName>; 
               <persName nymRef="#Arnald_de_Rosengue_MSP-AU" ana="#pAdo" role="par">Arnaldum de Rozengue</persName>; 
               <persName nymRef="#Raimund_de_Causit_MSP-AU" ana="#pAdo" role="par">R<supplied reason="expname">aimundum</supplied> de Causit</persName>; et 
               <persName nymRef="#others_unrecalled" ana="#pAdo" role="par">plures alios de quibus<lb break="y" n="7"/>non recolit</persName>. Et omnes et 
               <persName nymRef="#Pons_Rainart_MSP-AU" ana="#pAdo" role="par"/>ipse testis adoravit ibi dictos hereticos. Et 
               <date type="event_date" when="1231">sunt XIIII anni vel circa</date>.
            </seg>

This code looks quite a lot like HTML. There are the familiar angle brackets for <element></element>s and attribute="value" pairs. It also acts like HTML in that the elements surround the original transcribed text. Were we to strip the elements out, we would be left with a perfectly readable plain-text transcription.
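Stripping the elements out can be sketched in a few lines of Python with the standard library’s ElementTree module. The fragment below is an abridged copy of the markup above. Two assumptions are ours: the function name plain_text is illustrative, and we treat each <lb/> (line break) element as a space, because in this fragment the words on either side of it would otherwise run together. Note also that in a complete TEI file the elements sit in the TEI namespace, so the tag names would need to be namespace-qualified.

```python
import xml.etree.ElementTree as ET

# An abridged copy of the TEI fragment shown above.
TEI = '''<seg type="dep_event" subtype="event">
<persName nymRef="#Pons_Rainart_MSP-AU" role="dep">Poncius Rainardi</persName> testis iuratus dixit quod vidit
<placeName type="event_loc" nymRef="#home_of_Cap-de-Porc">in domo
<persName nymRef="#Peire_Cap-de-Porc_MSP-AU" role="own">P<supplied reason="expname">eire</supplied> de Sancto Andrea</persName>
</placeName>
<persName nymRef="#Johan_Cambiaire_MSP-AU" ana="#pAdo" role="par">Iohannem Cambitorem</persName>
et socium<lb break="y" n="6"/>eius, hereticos.
</seg>'''


def plain_text(elem):
    """Return the text inside elem with all markup removed."""
    parts = []
    if elem.tag == "lb":
        parts.append(" ")  # assumption: a manuscript line break separates words
    if elem.text:
        parts.append(elem.text)
    for child in elem:
        parts.append(plain_text(child))
        if child.tail:  # text that follows the child's closing tag
            parts.append(child.tail)
    return "".join(parts)


seg = ET.fromstring(TEI)
# Collapse line breaks and indentation from the XML file into single spaces.
transcription = " ".join(plain_text(seg).split())
print(transcription)
```

The result is the bare Latin transcription, with the editor’s expansion of ‘P’ to ‘Peire’ silently folded in.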

While the structure of the markup (technically the syntax) is familiar, though, the elements are probably less so. Here are a few of them:

  • <seg></seg> elements are used to mark up a segment of the manuscript (rather than the p element used in HTML).
  • <persName></persName> elements mark up the name of a person.
  • <placeName></placeName> elements mark up the name of a place.
  • <date></date> elements mark up dates.
  • <supplied></supplied> elements show text that has been added by the editor.

Already we can see that this TEI markup gives the editor a much broader range of elements, which help readers understand the function of the text that has been marked up, whether those ‘readers’ are humans or computer programs. The generic HTML elements used in the snippet above are much less useful in this regard. We can also see that the attributes add even more detail. For instance, the second <date></date> element has a when attribute which supplies the year in a modern format. What’s more, each <persName></persName> and <placeName></placeName> element has a nymRef attribute. These attributes point to a unique ID for each person and place in the document (handy, as surnames were not yet common and Arnaudus, Petrus and Poncius were very common names).
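Because the meaning is carried by named elements and attributes, a program no longer has to guess. The following Python sketch (standard library only, run on an abridged copy of the fragment above) counts how often each person is referred to via nymRef and reads the modern year from the when attribute. The variable names are ours, and a complete TEI file would additionally require namespace-qualified tag names.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# An abridged copy of the TEI fragment shown above.
TEI = '''<seg type="dep_event" subtype="event">
<persName nymRef="#Pons_Rainart_MSP-AU" role="dep">Poncius Rainardi</persName> testis iuratus dixit quod vidit
<persName nymRef="#Johan_Cambiaire_MSP-AU" ana="#pAdo" role="par">Iohannem Cambitorem</persName>
et socium<lb break="y" n="6"/>eius, hereticos. Et omnes et
<persName nymRef="#Pons_Rainart_MSP-AU" ana="#pAdo" role="par"/>ipse testis adoravit ibi dictos hereticos. Et
<date type="event_date" when="1231">sunt XIIII anni vel circa</date>.
</seg>'''

seg = ET.fromstring(TEI)

# Count references to each person by their ID, not by the spelling in the text.
mentions = Counter(
    p.get("nymRef").lstrip("#")
    for p in seg.iter("persName")
    if p.get("nymRef")
)

# Collect the normalised years from any <date> that carries a when attribute.
event_years = [d.get("when") for d in seg.iter("date") if d.get("when")]

print(mentions.most_common())
print(event_years)
```

Pons Rainart is counted twice here even though his name is written out only once: the second reference is the empty <persName/> element attached to ‘ipse testis’, something we could not have recovered from the spelling alone.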

Using this kind of specialised markup allows editors to describe documents in much more detail and without ambiguity. The encoded documents can then be subjected to automated analysis. For example, the inquisition manuscript’s editor has published a study of the text using network analysis techniques in the Open Library of Humanities. This study was based in part upon an automated analysis of the relationships between elements and attributes in a fully marked-up edition.
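As a rough illustration of how markup can feed a network analysis, the Python sketch below turns one segment into a list of ‘who names whom’ links. It is not the method used in the published study. It rests on our own reading of the role attribute, namely that role="dep" marks the person giving the deposition and that other roles mark people they mention. The names person_id and edges are illustrative.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# An abridged copy of the TEI fragment shown above.
TEI = '''<seg type="dep_event" subtype="event">
<persName nymRef="#Pons_Rainart_MSP-AU" role="dep">Poncius Rainardi</persName> testis iuratus dixit quod vidit
<persName nymRef="#Johan_Cambiaire_MSP-AU" ana="#pAdo" role="par">Iohannem Cambitorem</persName>
et socium<lb break="y" n="6"/>eius, hereticos. Et vidit ibi cum eis
<persName nymRef="#Guilhem_Vidal_MSP-AU" ana="#pAdo" role="par">W<supplied reason="expname">illelmum</supplied> Vital</persName>. Et omnes et
<persName nymRef="#Pons_Rainart_MSP-AU" ana="#pAdo" role="par"/>ipse testis adoravit ibi dictos hereticos.
</seg>'''


def person_id(elem):
    """Return the ID a persName points to, without the leading '#'."""
    return (elem.get("nymRef") or "").lstrip("#")


root = ET.fromstring(TEI)
edges = Counter()  # (deponent, person mentioned) -> number of segments

for seg in root.iter("seg"):
    people = list(seg.iter("persName"))
    # Assumption: role="dep" identifies the witness giving the statement.
    deponents = {person_id(p) for p in people if p.get("role") == "dep"}
    # Everyone else named in the segment, excluding the witness themselves.
    mentioned = {person_id(p) for p in people if p.get("role") != "dep"} - deponents
    for source in sorted(deponents):
        for target in sorted(mentioned):
            edges[(source, target)] += 1

for (source, target), weight in sorted(edges.items()):
    print(source, "->", target, weight)
```

Run over every segment of a fully encoded manuscript, a table of this kind could be loaded into network analysis software as a weighted edge list.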

This is the power of TEI-XML: it provides a huge array of elements and attributes for marking up digital texts; it is a toolbox from which humanities scholars can draw to produce rich digital editions.

About TEI

We have seen a snippet of TEI-XML encoding, but what actually is TEI-XML? First let’s deal with the TEI part. TEI stands for the Text Encoding Initiative. Strictly speaking, the name refers to the Text Encoding Initiative consortium, a group of academics and organisations in the digital humanities who collaborate to produce a set of guidelines for representing texts digitally.

TEI has been in development since the 1980s, and in widespread use in the humanities since the publication of the first full version of the guidelines in 1994. TEI developed at the same time as and in parallel to HTML. The two markup languages share some similarities – for example both use the <p></p> element to denote paragraphs of text – but these overlaps are coincidental and the two languages cannot be mixed. (We will explore later how editions marked up following the TEI guidelines are turned into HTML documents for reading on the web.)

The latest edition of the TEI standard is known as the P5 Guidelines. It defines a set of elements common to all editions, as well as a number of separate bundles of elements and attributes that authors can opt to use or not depending on the needs of their project. These bundles of elements (called ‘modules’ in the guidelines) allow editors to perform domain-specific tasks like:

  • identifying people and places in a document (as in our example)
  • encoding lexical information in dictionaries
  • creating editions of poetry and verse
  • editing music
  • distinguishing the particular features of transcribed speech
  • marking additions, deletions and emendations to a manuscript over the years of its life, and embedding metadata about the authorship and dating of these

There are many more such uses of TEI. The guidelines aim to provide a comprehensive menu of choices from which individual editors can pick individual elements to suit their needs. It is a rich and very flexible set of standards. This flexibility is the great strength of TEI; the toolbox it provides can be used by editors in a great variety of tasks.

The TEI guidelines are available in full on the TEI website. There is also a very active and friendly TEI mailing list, currently hosted by Brown University. It is worth noting that the encyclopaedic nature of TEI means that these guidelines can often be somewhat intimidating. Similarly, the flexibility of the markup (and the questioning nature of academics) means that there may often be more than one ‘right’ way to mark up a TEI-encoded edition of any given text fragment. If you enjoy esoteric conceptual or linguistic debate, or if you just love reading manuals, do feel free to dive straight into the TEI guidelines. If you are more pragmatically minded, you may prefer to hold back for now.

What is XML

TEI markup is technically a dialect of another language, eXtensible Markup Language, or XML. XML is an even more generic markup language. The specification defines some rules for how code must be formatted (using the now-familiar <element></element>s and attribute="value" pairs) and how computer programs should treat that code. The XML specification itself defines almost no vocabulary of its own (just a handful of reserved attributes such as xml:lang), but it is (as the name suggests) extensible: dialects such as TEI define their own elements and attributes on top of it.

The syntax of XML is for the most part similar to that of HTML. Indeed, HTML was at one point looking likely to become an XML dialect (this version of HTML still exists; it’s called XHTML). However, HTML has developed in a different direction. The key difference for authors is that the HTML guidelines are less strict, and HTML interpreters like web browsers are allowed to be forgiving in how they deal with malformed HTML. Web browsers will for the most part try hard to make sense of it. If the author has missed out a closing tag, or used an element the browser’s HTML parser hasn’t heard of, it will either try to guess what the author meant or ignore that tag completely.

By contrast, an XML parser will refuse to process a document unless it is syntactically perfect. The document must be well-formed: this means that every element must have a start and end tag (or be self-closing), and that elements cannot overlap. If it is not, interpreters cannot handle it, and a single syntax error halts processing completely. A TEI document should also be valid, meaning that the author uses only the elements and attributes permitted by the schema the document is checked against. The schema is usually associated with the document by a declaration at the start of the file, while the namespace attribute on the root element identifies the vocabulary as TEI. A well-formed but invalid document can still be read by a parser, but validating editors and TEI-aware tools will flag or reject it.
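The difference in strictness is easy to see in a short Python sketch using only the standard library. The same fragment, with its closing </persName> tag deliberately removed, is given first to an XML parser and then to Python’s lenient HTML parser, which here merely stands in for the forgiving behaviour of a web browser. The helper names is_well_formed and TextCollector are ours.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

well_formed = '<seg><persName nymRef="#Pons_Rainart_MSP-AU">Poncius Rainardi</persName> testis</seg>'
# The same markup, but the closing </persName> tag is missing.
malformed = '<seg><persName nymRef="#Pons_Rainart_MSP-AU">Poncius Rainardi testis</seg>'


def is_well_formed(xml_text):
    """Return True if an XML parser can read xml_text, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # The parser gives up at the first syntax error it meets.
        print("XML parser stopped:", err)
        return False
    return True


class TextCollector(HTMLParser):
    """A lenient HTML parser that simply gathers whatever text it finds."""

    def __init__(self):
        super().__init__()
        self.text = []

    def handle_data(self, data):
        self.text.append(data)


print(is_well_formed(well_formed))
print(is_well_formed(malformed))

collector = TextCollector()
collector.feed(malformed)  # no error is raised for the missing tag
collector.close()
recovered = "".join(collector.text)
print(recovered)
```

The XML parser rejects the second fragment outright, while the HTML parser carries on and still hands back the text. Note that this only tests well-formedness; checking validity against a TEI schema needs a validating tool and is not shown here.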

Uses of TEI XML

We have already seen one function of TEI XML: to produce an interlinked digital edition of a manuscript which can be subject to automated analysis. Properly marked-up texts enable other kinds of projects, too. Students on the Digital Editing of Medieval Manuscripts training programme, out of which this Hackathon grew, produced a number of digital editions exploring a range of themes:

  • treatment of ‘named entities’ (people and places)
  • the relationship between a physical artefact and its textual content
  • exploring different versions of a manuscript
  • presenting editions for non-specialist audiences

You can see these editions from DEMM here.

These projects focus on presenting TEI-encoded texts online, in a process through which automated software ‘transforms’ the TEI-XML into HTML and CSS that we can view in a web browser. This separates the production of knowledge about a text from its presentation, and means that document experts can focus on the intellectual questions behind the digital editing of their text, while those with web design expertise focus on designing the ‘front end’. A well-defined schema like TEI for describing digitised texts allows scholars to re-use these documents beyond the context of their initial publication online, building on the work of the original editors while asking questions that they had not envisaged.
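The idea of a transformation can be sketched in Python, although real projects typically use dedicated tools such as XSLT, and we do not know how the edition we looked at earlier is actually built. The toy function below, whose name to_html is our own, maps three TEI elements onto HTML that resembles the page source we inspected: <seg> becomes <p>, <persName> becomes a link, and <supplied> becomes a <span>. The URL pattern and class names are copied from that page source; everything else is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

# One clause from the TEI fragment shown earlier, wrapped in its own <seg>.
TEI = ('<seg>Et vidit ibi cum eis '
       '<persName nymRef="#Guilhem_Vidal_MSP-AU" ana="#pAdo" role="par">'
       'W<supplied reason="expname">illelmum</supplied> Vital</persName>;</seg>')


def to_html(elem):
    """Recursively turn a TEI element into an HTML string (toy mapping)."""
    # Start with the element's own text, then add each child and the text after it.
    inner = escape(elem.text or "")
    for child in elem:
        inner += to_html(child) + escape(child.tail or "")

    if elem.tag == "seg":
        return f"<p>{inner}</p>"
    if elem.tag == "persName":
        ref = escape((elem.get("nymRef") or "").lstrip("#"))
        return f'<a href="/person/{ref}" class="doc-person-blue">{inner}</a>'
    if elem.tag == "supplied":
        return f'<span class="supplied">{inner}</span>'
    # Any element we have no rule for is dropped, keeping only its content.
    return inner


html_output = to_html(ET.fromstring(TEI))
print(html_output)
```

All of the presentation decisions (which elements become links, which classes they carry) live in the transformation. The TEI source stays the same if the design of the website changes.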

I hope you have enjoyed this brief introduction to TEI-XML.

Further resource on TEI-XML (optional):

Digital Scholarly Editions: Manuscripts, Texts and TEI Encoding

Course by Marjorie Burghart and Elena Pierazzo

https://teach.dariah.eu/course/view.php?id=32