You may already be familiar with HTML (HyperText Markup Language), which underlies most of the web. HTML is primarily a presentational encoding language, meaning that its tags are intended to shape how the document looks on the web (with some exceptions in the age of CSS that we can discuss if you are interested). Today, however, we’ll be working with TEI, the XML vocabulary maintained by the Text Encoding Initiative and developed by literary scholars, linguists, and historians for digital, scholarly editions of texts. Unlike HTML tags, TEI tags designate formal features of the document: they are metadata that describe
- the structure of the text being encoded,
- the structure of the digital file being created, or
- the editorial decisions made by the encoder about the document.
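To make these three kinds of metadata concrete, here is a minimal sketch in Python that parses a tiny TEI file and pulls out one example of each. The sample text is invented for illustration (it is not a transcription of any real document), and the element choices (`teiHeader`, `div`, `head`, `choice`/`sic`/`corr`) are just one plausible way to encode it.

```python
import xml.etree.ElementTree as ET

# TEI elements live in this XML namespace.
NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented placeholder text, written only for this example.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>A Sample Letter</title></titleStmt>
      <publicationStmt><p>Unpublished classroom example.</p></publicationStmt>
      <sourceDesc><p>Invented for illustration.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div type="letter">
        <head>To a Friend</head>
        <p>I <choice><sic>recieved</sic><corr>received</corr></choice> your note.</p>
      </div>
    </body>
  </text>
</TEI>"""

root = ET.fromstring(SAMPLE)

# Structure of the digital file: the header describes the file itself.
file_title = root.find("tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title", NS).text

# Structure of the text: divisions, headings, paragraphs.
div_type = root.find("tei:text/tei:body/tei:div", NS).get("type")
heading = root.find(".//tei:head", NS).text

# Editorial decisions: the original spelling and the encoder's correction.
original = root.find(".//tei:sic", NS).text
corrected = root.find(".//tei:corr", NS).text

print("File title:", file_title)
print("Division type:", div_type, "| heading:", heading)
print("Encoder corrected", repr(original), "to", repr(corrected))
```

Notice that nothing in the sample says how the letter should look on screen; every tag records what a piece of text is or what the encoder decided about it.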
In order to understand precisely what that means, we’re going to talk first about metadata itself (and what that term could possibly mean when discussing literary-historical texts) and then move on to investigate how the TEI tries to encode and represent metadata.
- We’ll start with the “Memoir of Florence Hall” from Northeastern’s own Early Caribbean Digital Archive project (and here’s the transcription), which is using TEI to encode pre-twentieth-century Caribbean literature so that scholars around the world can access texts that were previously available only in a few special-collections libraries. Look at the images of this document and imagine you were tasked with translating these images into machine-readable text. What features of the physical document would you want somehow represented in that new, digitized edition of the text? What information about the document would you want to include in your new digital edition? What aspects of the historical document do you think would be difficult or impossible to translate into a digital edition? Why?
- Now let’s look at a few simple TEI documents from my digital edition of Nathaniel Hawthorne’s “The Celestial Railroad.” This will give us a chance to see what encoding looks like and begin understanding how it works.
- The original Democratic Review printing of Hawthorne’s story.
- A political speech referencing the story.
Spend 5-10 minutes looking through these files. What tags stand out? Can you discern anything about the structure of the tags, and how they relate to the literary content? What metadata do the tags add to the texts? How do you think that metadata would help scholars work with this text, either by itself or in a larger collection of texts (a corpus)?
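As one possible answer to the corpus question, the sketch below shows how tagged names become searchable across files. Everything in it is hypothetical: the file names, the sentences, and the people and places are invented, and the TEI header is omitted for brevity, so these are bare fragments and not complete TEI documents. Only `persName` and `placeName` are real TEI elements.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI_NS = "http://www.tei-c.org/ns/1.0"


def tei_fragment(paragraph_xml):
    """Wrap paragraph markup in a bare TEI skeleton (header omitted for brevity)."""
    return (
        f'<TEI xmlns="{TEI_NS}"><text><body>'
        f"<p>{paragraph_xml}</p>"
        "</body></text></TEI>"
    )


# A two-file "corpus" of invented sentences with invented names.
CORPUS = {
    "story.xml": tei_fragment(
        "A traveller met <persName>Mr. Example</persName> "
        "in <placeName>Sampletown</placeName>."
    ),
    "speech.xml": tei_fragment(
        "The speaker quoted <persName>Mr. Example</persName> "
        "and <persName>Ms. Placeholder</persName>."
    ),
}


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def tag_counts(xml_text):
    """Count how often each element appears in one document."""
    root = ET.fromstring(xml_text)
    return Counter(local_name(el.tag) for el in root.iter())


def files_mentioning(corpus, person):
    """List the files in which a given name is tagged as a person."""
    hits = []
    for filename, xml_text in corpus.items():
        root = ET.fromstring(xml_text)
        names = [el.text for el in root.iter(f"{{{TEI_NS}}}persName")]
        if person in names:
            hits.append(filename)
    return sorted(hits)


print(tag_counts(CORPUS["story.xml"]))
print(files_mentioning(CORPUS, "Mr. Example"))
```

A plain-text search could find the string "Mr. Example" too, but only the tags tell the computer that it names a person instead of, say, a ship or a book title.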
- Next, let’s look at “The Short and Sweet TEI Handout” developed by Alex Gil at Columbia. We’ll work through this document together and talk about the structure of TEI.
- Finally, you will work in groups to encode the “Memoir of Florence Hall.” I promise this won’t be as daunting as it perhaps sounds.
Some practical tips for working with TEI:
- You should not encode documents in a word processor like Microsoft Word. Word’s .docx files are themselves XML-encoded documents (that’s what the “x” at the end signals) that use XML tags to create the various styles, such as italics and bold, that you assign via the interface. If you try to write encoded text in Word, you’ll actually be producing tags-within-tags that will not be usable. Instead, use a plain-text editor (Notepad on Windows, or TextEdit on Mac set to plain-text mode) or, even better, download the free trial of the oXygen editor, which is designed for working with XML.
- TEI by Example provides a nice walkthrough of encoding different kinds of texts in TEI. You can also browse and search the current TEI guidelines to find the right tags for particular tasks.
- In my opinion, the best way to figure out TEI is to look at encoded documents and steal liberally from them (in fact, I think this is the best way to learn any encoding, including HTML and CSS).
- If you are familiar with CSS stylesheets, you can use TEI Boilerplate to style your TEI documents and publish them on your personal website. There are more complex ways to publish large collections of TEI, but Boilerplate is a quick and easy way to get TEI-encoded documents online.
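If you do work in a plain-text editor without oXygen's built-in checking, a few lines of Python can at least tell you whether your tags open and close properly. This is a minimal sketch using Python's standard library; the function name `check_well_formed` and the two sample strings are made up for illustration. It tests only XML well-formedness, not whether the file is valid TEI, which requires checking against the TEI schema in a tool such as oXygen.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return (True, 'well-formed') or (False, parser error message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return False, str(err)
    return True, "well-formed"


# Tags are properly nested: <hi> closes before <p> does.
GOOD = "<p>A <hi rend='italic'>properly nested</hi> paragraph.</p>"

# Tags overlap: <p> closes while <hi> is still open.
BAD = "<p>A <hi rend='italic'>mis-nested</p></hi>"

for label, sample in [("GOOD", GOOD), ("BAD", BAD)]:
    ok, message = check_well_formed(sample)
    print(label, "->", message)
```

To check your own file, read it with `open("yourfile.xml", encoding="utf-8").read()` and pass the result to the function; the error message usually includes the line and column where the parser gave up.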