Introduction: The promise and peril of the digital edition
A frequent topic of discussion at past meetings of the International Editors of Diplomatic Documents has been how programs should best harness the indisputable advantages of digital distribution without compromising the proud traditions of diplomatic documentary editing, a discipline rooted in print publishing. On the one hand, digital distribution offers our programs numerous advantages. If the purpose of a documentary edition is to expertly compile and annotate a selection of key primary source documents on a topic to give readers (including the many who may lack access to the archive) a nuanced and contextually informed understanding of the subject matter, then these advantages are hard to pass up. They include:
- Instant worldwide dissemination.
- The ability to reach all interested readers, including taxpayers who fund the edition and policy makers whose decisions should be informed by it.
- Search tools that empower readers to pinpoint relevant sources.
In addition, in a digital environment, the reader’s experience of a documentary edition is no longer bound by the covers of a book, the length of a shelf, or distance to the library. The reader gains visibility into the period covered by a sub-series of volumes or even the entire series. Hyperlinks allow readers to traverse cross-references without turning a page; scanners allow editions to include high resolution images of documents, not just typeset transcriptions; and search engines and databases allow readers to perform research in ways that were never before practical.
Despite the great potential of digital dissemination, many of our programs have approached it with considerable caution. Programs face real challenges in pursuing a digital strategy, such as how to fund these initiatives, how to find the expertise to select standards and technologies, and how to turn these digital dreams into a reality, all without detracting from vital ongoing projects or derailing well-established editorial workflows. With programs being small (some with as few as one editor) and enmeshed in large organizations (ministries or institutes) with established infrastructures and goals for digital outreach, the obstacles to implementing such changes may seem insurmountable. Furthermore, with the pace of technological change speeding up rather than slowing down, programs worry that any approach could rapidly become obsolete. For an example, look no further than our oft-lamented experiments with microfiche or CD-ROM distribution, which were once seen as “the future” but are now increasingly inaccessible. Even worse, in retrospect, these editions sacrificed editorial standards in order to conform to the limitations of the medium. No program would risk making a mistake like that again.
Is the only sound solution to forswear digital publishing, focus solely on the print edition, and perhaps, as a result, consign the digital edition to for-profit publishers who perform limited if any editorial services while charging handsomely for the print edition and/or erecting paywalls to extract subscription revenue from taxpayers, with none of the profits supporting the edition itself? Are there ways to both retain the reliability of print and offer the benefits of digital to our readers?
We believe there is hope and a sound path to digital editions for programs interested in offering them. Several programs have proven solutions to the core challenges. They have surveyed the field, identified suitable technologies, prototyped and tested solutions, sought and incorporated reader feedback, and launched offerings that balance the rigorous standards of documentary editing and the promises of technology. They have even presented their results here at past ICEDD meetings. But our meetings are few and far between, presentations at the conferences are short, and attendees have reported that they have struggled to capture the information they need to bring home to their colleagues. Therefore, while certain members of our community have discovered compelling solutions, our community as a whole has lacked a handbook describing these best practices for programs to consult as they plan and secure resources to carry out these reforms.
To answer this challenge, two of the most mature digital programs in the IEDD community, the Diplomatic Documents of Switzerland (Dodis) and the Office of the Historian (editors of the Foreign Relations of the United States, or FRUS, series), set out last year to create a handbook describing best practices in the field of digital diplomatic documentary editions. These two programs make a particularly suitable pairing for this effort, because they began over a decade ago with strong but surprisingly different approaches to their digital editions, and, in recent years, they have converged around a set of common approaches. Through dialog between the two programs, they realized that they had identified best practices that other programs might benefit from learning about. The result is this document.
This handbook introduces the concepts that programs should understand and considerations they should grapple with as they set out to improve their existing efforts or to initiate a digital publishing strategy. We recommend resources for further study. Each program will need to make its own choices, but we hope that there is utility in this handbook for all programs. That said, this is not a comprehensive guide. It focuses on the fundamentals of digital publication of diplomatic documents, and does not include discussion of metadata management, an important related topic. Perhaps this can be covered in a future revision or a companion handbook. (For a more comprehensive guide, see A Guide to Documentary Editing, 3d edition by Mary-Jo Kline and Susan Holbrook Perdue.)
Finally, lest the authors of this document be misunderstood, we neither advocate nor harbor hidden desires for the demise of print. None of the methods described here prevent programs from retaining print editions while also publishing a top-notch digital edition. In fact, the standards described here may even enhance programs’ editorial facilities, yielding higher quality publications, simplifying publishing workflows, and reducing costs, which allows programs to focus on their core expertise and eliminate distribution agreements that limit citizens’ access to the final product. The purposes of offering a digital edition are manifold: to offer taxpayers (who ultimately pay our archivists’ and editors’ salaries) and readers around the world free and open access to the annotated and contextualized records of the actions of their government; to foster public understanding and maximize the impact of our efforts; to advance scholarship; to reinforce the value of diplomatic documentary editions and our uncompromising standards; and to ensure not just survival but a brighter future for our programs, which are vital for the healthy functioning of democracy and sound policy making. A digital edition also opens up possibilities for repurposing our work and engaging in international scholarly exchange of documents and data, which are discussed below.
Conceptual Foundations
The basics of a digital edition
The most basic distinction between a print edition and a digital edition is that a digital edition is accessible on the internet. Making an edition available on the internet can be as simple as the following options:
- Full book PDFs: Assuming that documents are transcribed and annotated in preparation for printing as a book, books can be distributed as downloadable PDFs from the program’s web site. This maximizes fidelity with the printed book, allowing readers to read a book cover to cover, search its contents, and confidently cite page numbers in the book in their writing. There is no simpler method of getting a book to readers than posting a PDF. Keep in mind, however, that it is difficult to search across PDFs, so readers searching for a topic in your books may be forced to download and repeat the same search in multiple files. Also, because the size of text in a PDF is fixed, the rapidly increasing number of readers using mobile devices with small screens may find it difficult to read your PDFs on their device. However, if PDF is all that your program can manage, posting a PDF is a wonderful start to a digital edition, opening possibilities rather than forestalling them. Be sure to retain a copy of all source files (e.g., the Microsoft Word files you used to transcribe and annotate the documents), since building a more sophisticated digital edition may well be easier and cheaper from these source files than from PDF files.
- Document images: If documents are not transcribed for print, then scanned images of the documents can be posted as downloadable PDFs on the program’s web site. This gives readers an unvarnished look at the source documents that the editor selected. Keep in mind, however, that PDFs of scanned images contain no machine-readable text unless it is added, so they may not be searchable, and readers who have visual disabilities and rely on screen readers will find them difficult or impossible to read. Optical character recognition (OCR) software can be applied to PDFs to make them somewhat usable for these purposes, but it is difficult to correct the typos that OCR software inevitably produces, especially for archival documents. Annotating image-only documents is also difficult. In lieu of footnotes, editors can systematically index the document, revealing extensive information about the key entities (people, places, and organizations) mentioned in the document. These indexes can be a powerful aid for readers, as they help them locate documents on the primary actors and topics for their research.
- Full text documents: The transcribed text of each document can be posted to the program’s web site. In contrast with the two previous options, this method allows users to directly read and search the document in their web browser. Footnotes can easily be added to this text (either via pop-up windows or beneath the document, or both), text can be hyperlinked to editorial apparatuses (such as glossaries), and readers can be shown sidebars with links to the key entities (people, places, organizations), terms and abbreviations, etc. This method also allows search engines to ingest the full content of the document, making it far more likely that readers on the internet will discover the document.
Whereas the 1st method above does meet the basic criteria of a digital documentary edition by giving readers the ability to access publications over the internet, only the 2nd and 3rd methods, which can also be combined, begin to unlock the potential of digital documentary editions:
- Every document has its own web page and its own URL. Readers can cite individual documents by their URL and even share them on social media.
- The document’s editorial apparatus is no longer limited to what is possible on the printed (or scanned) page, but rather can build on World Wide Web technologies:
  - HTML can be used for basic layout, hyperlinks, and typography.
  - CSS can provide richer typography and layout.
  - JavaScript or Web Components can provide dynamic, interactive features, such as footnotes, timelines, and visualizations.
- The document can also be enriched or supplemented with indexes of people, places, and organizations. Hyperlinks can lead readers to more information about these entities (such as biographical databases) and to other documents that feature the same entities.
While programs may still publish documents as books, many readers may begin to see the series-wide corpus of documents as a database, rather than as a collection of books meant to be read cover to cover.
Indeed, we, as editors, should start thinking about an edition as a collection of individual documents that might be combined in a variety of novel ways. Our documents might be printed in a book, as they have been traditionally. But a professor might select several documents into a reading packet for her students. A policy maker might request a customized collection of documents organized differently than the print edition; for example, a series organized into annual volumes might create a customized collection of documents on a single topic from across a decade. Your edition might contribute documents to a bilateral or multilateral publication, or you might contribute your biographical data to an international scholarly network. Researchers with expertise in data science or computational linguistics may be able to apply their discipline to our documents to answer questions the editors never considered. Publishing documents in the right “open” digital format enables these possibilities, rather than closing them off by locking your documents in “closed” formats.
A prerequisite for taking advantage of these possibilities is not technical, but legal: you must retain or reclaim the full digital distribution rights to your publication and its archive. If you allow a commercial publisher to distribute the print edition of your documents, insist on retaining the full rights to distribute the digital edition from day one. If you lack these rights currently, consider working with your legal department to renegotiate the terms of the arrangement or find a different vendor. Every day without full, free, digital access to your publication is a day on which it has less impact on public and scholarly discourse than it could.
The value of an edition, then, should be measured by both the quality of its editorial apparatus and the ability of readers to access it. The program may consider the printed book as the ideal form for consuming the documents, but the digital medium can easily preserve and reinforce the edition’s original organization and vision, with footnote-based annotations and references that can be cited in scholarship, etc. There is no longer any need to compromise on editorial standards when adopting a digital publication strategy. We can produce stellar print editions while offering our documents to readers online.
The remainder of the handbook assumes that programs understand these advantages and are looking for advice about how best to realize them: how best to maximize the potential of their digital editions through this hybrid method of full document text and rich digital annotations.
Selecting a format for a full text edition
A key question when embarking on a digital edition is the format for the text. The ideal format would be non-proprietary, so that your edition is not locked in a format tied to one company’s software. Not all readers will own a license for Microsoft Word, for example, so while you may use this software for preparing your publication, it is not an ideal format for distributing your publication. Despite the benefits of PDF described above, its limitations (also described above) disqualify it from consideration as the master digital format for our publication. There are many other requirements for our “ideal” format, which we will explore as we discuss various candidate formats, before arriving at our final recommendation. Knowing a few details about the alternative formats will prepare you for discussions with your technical team, who may be more familiar with them than with the ideal format we recommend.
HTML
The lingua franca of the World Wide Web is HTML, the Hypertext Markup Language. Every web browser has robust support for HTML, and thus it is an attractive format for any web-oriented publication. While digital documentary editions could target HTML directly, it has some downsides:
- HTML is a rapidly evolving standard, so HTML authored today may need to be updated to maintain its presentation on future web browsers. The ideal format would allow you to store your documents in a far more stable fashion, which is unlikely to require dramatic reformatting.
- HTML is a presentation-oriented format, designed to style text for presentation on the web. As a result, documentary editions may find themselves leveraging typography to indicate semantic differences—much as when laying out a book for the printed page. For example, when preparing a document’s dateline, you might simply right-align that block of text. Similarly, a long quotation might be left-indented by one inch. The ideal format would let you identify a dateline as a dateline, a quotation as a quotation. While HTML can be coerced into conveying semantic information like this (with “class attributes,” for example), this is not a wise use of HTML. The ideal format would let you identify the key portions of your document—encode your text—without typography as a proxy. The ideal format would also give you some facility to transform your encoded text into the appropriate typographic and stylistic presentation for the web.
- The appearance of HTML-encoded text may look drastically different on different browsers or screen sizes. The ideal format would allow you to focus your efforts on correctly encoding your text, rather than styling it to look good on different devices.
Markdown
Another popular format for encoding text is Markdown. This format uses simple conventions to prepare text for publishing on the web. For example, italicized text is surrounded with asterisks or underscores (like _this_ or *this*), and bolded text is surrounded with two of these (like __this__ or **this**). While Markdown’s simplicity is attractive for working with text and may play some productive role in an editing workflow, it is even more presentation-oriented than HTML, supporting a narrow subset of HTML’s methods for encoding different parts of a text. For example, Markdown has no mechanism to right-align text, much less to describe that text as a dateline.
Markdown is the emblematic “plain text” format. The advantage of plain text formats is that they are likely to be readable by computers far into the future, because they do not depend on any particular application (unlike, for example, Microsoft Word files). A Markdown file can be edited in any basic text editor application, like Notepad (Windows) or TextEdit (Mac). (HTML is also a plain text format.) The ideal format for our documents would be a plain text-based format. Why not adopt Markdown for our publications?
Unfortunately, there is no single standard for Markdown. As a result, there is a profusion of different “flavors” of this format. Certain flavors support features that documentary editions need, such as footnotes, while others do not. The absence of a standard for Markdown (including an unambiguous specification that software developers can reliably craft their software to support) means it is difficult to find software solutions that guarantee conformance with your documents, and it is also difficult for editions to cooperate with each other, since they may use different flavors, and there is no way to describe project-specific conventions. The ideal format would allow programs to document their project-specific conventions.
Text Encoding Initiative (TEI)
The Text Encoding Initiative is the de facto standard in digital humanities for encoding text and is the format we recommend that diplomatic documentary editions adopt for their text. TEI has been in active development for over 30 years (since 1987) and is built by and for humanities projects whose texts are of prime importance. It is a free, open standard that responds to the needs of the community.
Like HTML and Markdown, TEI is a plain-text based format. It also uses an HTML-like syntax for encoding text, called XML (Extensible Markup Language), which is easy to learn and which many software packages support. In contrast to HTML and Markdown, TEI makes an explicit distinction between the content of a text and its presentation. And it provides a mechanism for projects to define their project-specific conventions. It is truly the ideal format for documentary editions.
The TEI Guidelines define over 500 “tags” for encoding different parts of a text, including one for our running example, dateline. These include tags to identify the following concepts in the text of our documents:
- People: Names, titles, birth and death dates, nationality, languages
- Places: Place names, type of places, geolocations
- Dates: Dates, times, time zones, from-to ranges, indeterminate ranges (not before, not after)
- Portions of a document: opener, closer, signatures, headings, footnotes, paragraphs, lists, tables
- Breaks: page breaks, column breaks
- Manuscript description: repository, collection
- Correspondence: Sender, recipient
- Links and references: within a document or to other documents
- Corrections: insertions, deletions, regularized spellings vs. original spellings
If you do not need all of these tags, the TEI lets projects create a project-specific subset of tags. If you need a tag that TEI has not considered, the TEI has a mechanism for adding new tags.
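To make the list above more concrete, the following is a minimal sketch, in Python with only the standard library, of how tagged people, places, and dates can be pulled out of a TEI-encoded text. The sample fragment, the name in it, and the function names are invented for illustration, and the fragment omits the TEI header for brevity. The elements persName, placeName, and date and the attributes when, notBefore, and notAfter are taken from the TEI Guidelines.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample fragment (not from any real edition; header omitted).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div>
        <opener>
          <dateline><placeName>Bern</placeName>, <date when="1989-11-10">10 November 1989</date></dateline>
        </opener>
        <p><persName>Ambassador Example</persName> reported from <placeName>Berlin</placeName>
        on talks held <date notBefore="1989-11-01" notAfter="1989-11-09">in early November</date>.</p>
      </div>
    </body>
  </text>
</TEI>"""


def text_of(element):
    """Return an element's text content with runs of whitespace collapsed."""
    return " ".join("".join(element.itertext()).split())


def extract_entities(tei_xml):
    """Collect tagged people, places, and date attributes in document order."""
    root = ET.fromstring(tei_xml)
    people = [text_of(e) for e in root.iterfind(".//tei:persName", NS)]
    places = [text_of(e) for e in root.iterfind(".//tei:placeName", NS)]
    # Keep the machine-readable attributes (exact date or indeterminate range).
    dates = [dict(e.attrib) for e in root.iterfind(".//tei:date", NS)]
    return {"people": people, "places": places, "dates": dates}


entities = extract_entities(SAMPLE)
print(entities["people"])
print(entities["places"])
print(entities["dates"])
```

Because the encoding says what each piece of text is, the exact date and the indeterminate range can both be read by software without parsing the wording of the document itself.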
Here we have taken a document that was included in When the Wall Came Down, and we show screenshots of (1) a scan of the original document, (2) the TEI transcription, and (3) an HTML transformation of the TEI:
Because TEI is built on XML, TEI benefits from a rich ecosystem of tools and technologies that support the publication of documentary editions, for example by transforming our semantically encoded text into formats suitable for presentation (like HTML or PDF). Several such tools are listed in a section below.
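As a rough illustration of such a transformation, the sketch below turns a small TEI fragment into HTML using Python's standard library. It is a stand-in only: production projects more commonly use XSLT or the tools described later in this handbook. The sample fragment, the RENDER_RULES mapping, and the CSS class names are all hypothetical choices made for this example. Note that the TEI remains the master copy, and the HTML is generated output that can be regenerated whenever the presentation needs to change.

```python
import xml.etree.ElementTree as ET
from html import escape

TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical mapping from TEI element names to (HTML tag, CSS class).
RENDER_RULES = {
    "div": ("article", "document"),
    "head": ("h1", "head"),
    "opener": ("div", "opener"),
    "dateline": ("p", "dateline"),
    "p": ("p", "paragraph"),
    "hi": ("em", "hi"),
}

# Invented sample fragment, written without indentation to keep the output compact.
SAMPLE = (
    '<div xmlns="http://www.tei-c.org/ns/1.0">'
    "<head>Memorandum of Conversation</head>"
    "<opener><dateline><placeName>Washington</placeName>, "
    '<date when="1989-11-10">November 10, 1989</date></dateline></opener>'
    '<p>The wall is <hi rend="italic">open</hi>.</p>'
    "</div>"
)


def render(element):
    """Recursively convert a TEI element and its children to an HTML string."""
    local_name = element.tag.replace(TEI, "")
    # Elements without a rule fall back to a <span> named after the TEI element.
    tag, css_class = RENDER_RULES.get(local_name, ("span", local_name))
    parts = [f'<{tag} class="tei-{css_class}">', escape(element.text or "")]
    for child in element:
        parts.append(render(child))
        parts.append(escape(child.tail or ""))  # text that follows the child
    parts.append(f"</{tag}>")
    return "".join(parts)


html_out = render(ET.fromstring(SAMPLE))
print(html_out)
```

The dateline is right-aligned (or not) by a stylesheet attached to the tei-dateline class, so a change of design never requires touching the encoded documents.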
Learning TEI
It is beyond the scope of this short handbook to provide a thorough introduction to TEI. For that, we recommend reading the TEI Guidelines, which are freely available on the TEI Consortium’s website and downloadable as PDF and ebooks. For beginners to scholarly text encoding, we strongly suggest watching the video series in the first bullet below.
- Digital Scholarly Editions: Manuscripts, Texts and TEI Encoding: A series of videos on YouTube by Dariah.EU.
- Teach Yourself TEI: A list of resources for teaching yourself TEI.
- Digital Textedition with TEI: Also by Dariah, but in German.
While it is certainly possible to teach oneself TEI, we recommend finding a course or institute that can provide guided instruction, such as:
- TEI Introductory School, September 10-13, 2019, at University of Graz (Austria), immediately before the 19th annual Conference and Members’ Meeting of the Text Encoding Initiative Consortium.
- Digital Humanities Summer Institute, University of Victoria (Canada).
- Digital Humanities at Oxford Summer School, Oxford University (UK).
- Seminars on Scholarly Text Encoding, Women’s Writers Project, Northeastern University, Boston (USA).
For specific questions regarding the handling of TEI in daily work, you can refer to the TEI-L Mailing List. Its archive dates back to 1990. You can subscribe to the mailing list here.
Tools
Having selected TEI as the format for your digital edition (as Dodis and FRUS have), you are free to use a broad range of tools. After all, TEI is a non-proprietary, plain-text, XML-based format. Thus, any tool that can edit or manipulate plain text XML files will work. For the purposes of this handbook, we will recommend several good tools for editing and publishing your TEI:
Tools for editing TEI documents
The most commonly used editor in the TEI community is oXygen XML Editor. Think of oXygen as Microsoft Word for your TEI, a program that allows you to write, edit, search, and enrich your documents. It comes with built-in support for the TEI Guidelines, providing contextual hints about TEI tags and links to the Guidelines for expanded definitions. While “editor” suggests one of oXygen’s abilities—facilitating the editing of documents—it is, in fact, a veritable “Swiss Army knife” for XML-based projects. It has robust support for other XML technologies, for searching within your document, transforming it into other formats, and even defining and enforcing highly customizable rules for the structure and content of your documents. While there are other XML-capable text editors, oXygen is the most popular one in the TEI community.
In addition to the local version, oXygen offers a web-based editor. This editor can be useful when working on documents with several colleagues from different locations. This solution can also be considered if there are regulations that prevent the installation of local software on the computers at the workplace.
Another web-based editor is XEditor.
Tools for publishing TEI editions
The exciting moment for a digital edition is when the documents become searchable and browsable through a website. TEI Publisher is perhaps the easiest way to take your TEI documents and make them into a website, with a drag-and-drop, form-based interface for uploading your documents and customizing your website, while adhering to the TEI Guidelines. It is free and open source, runs on all major operating systems (Macintosh, Windows, and Linux), and can also be installed on web servers. TEI Publisher empowers TEI-literate projects to publish their content on the web.
A forthcoming edition of TEI Publisher offers the ability to upload Microsoft Word files and convert these automatically to TEI. This offers the possibility that programs that use Microsoft Word to transcribe and annotate their documents could achieve the benefits of TEI without switching entirely to a dedicated XML Editor such as oXygen.
TEI Publisher is built on eXist-db, an open source database and web server that, in contrast to most databases or search engines, has native support for XML and XML technologies. eXist is a powerful tool for searching, analyzing, and transforming your TEI documents.
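To suggest what searching encoded documents makes possible, here is a small sketch in plain Python. It does not use eXist-db or its query language (XQuery); it only imitates, on an invented two-document corpus, the kind of structure-aware query an XML-native database supports. The document identifiers, names, and the function documents_mentioning are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented two-document corpus keyed by hypothetical document identifiers.
CORPUS = {
    "d1": (
        '<div xmlns="http://www.tei-c.org/ns/1.0"><p><persName>Jane Doe</persName> '
        "met the minister in <placeName>Geneva</placeName>.</p></div>"
    ),
    "d2": (
        '<div xmlns="http://www.tei-c.org/ns/1.0"><p>The delegation left '
        "<placeName>Geneva</placeName> for <placeName>Vienna</placeName>.</p></div>"
    ),
}


def documents_mentioning(corpus, element_name, value):
    """Return the ids of documents in which a tagged entity equals `value`."""
    hits = []
    for doc_id, xml_text in corpus.items():
        root = ET.fromstring(xml_text)
        names = {
            "".join(e.itertext()).strip()
            for e in root.iterfind(f".//tei:{element_name}", NS)
        }
        if value in names:
            hits.append(doc_id)
    return sorted(hits)


print(documents_mentioning(CORPUS, "placeName", "Geneva"))
print(documents_mentioning(CORPUS, "persName", "Jane Doe"))
```

Unlike a plain full-text search, a query of this kind distinguishes a place called Geneva from any other occurrence of the same string, because it matches on the encoding and not just on the letters.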
Encoding diplomatic documents with TEI
Even more important than the selection of tools is the choice of elements used for the transcription. This depends very much on the needs and priorities of each project. However, we suggest a list of elements that we think are useful for editions of diplomatic documents.
Elements of a diplomatic document
Documentary editions of diplomatic documents share many elements in common:
- Document type (e.g., memorandum, telegram)
- Heading
- Correspondence (sender, recipient)
- Dateline (place, date)
- Provenance
- Classification and handling markings
- Opener (e.g., salutation)
- Subject lines
- Body paragraphs, lists, tables
- Closers (e.g., signatures)
- Footnotes
Documents also contain references to several kinds of entities:
- People
- Organizations
- Places
- Events
- Agreements
TEI has mechanisms to handle all of these elements and entities.
Links to TEI editions
Dodis
- The TEI transcriptions of all documents in Quaderni di Dodis 12 can be downloaded from http://dodis.ch/T1443.
FRUS
- The TEI transcriptions of all FRUS volumes can be downloaded from https://github.com/HistoryAtState/frus. The “volumes” folder contains the TEI file containing each volume.
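Once a TEI file has been downloaded, a few lines of code are enough to begin exploring it. The sketch below lists the documents in a volume-like file. It rests on an assumption that this handbook does not confirm: that each document is encoded as a div element with type="document" and an xml:id. Inspect the actual files before relying on it. The sample volume is invented, and list_documents is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"  # how ElementTree names xml:id

# Invented stand-in for a downloaded volume file (header omitted for brevity).
SAMPLE_VOLUME = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div type="chapter" xml:id="ch1">
        <head>Invented chapter</head>
        <div type="document" xml:id="d1"><head>First invented document</head><p>Text.</p></div>
        <div type="document" xml:id="d2"><head>Second invented document</head><p>Text.</p></div>
      </div>
    </body>
  </text>
</TEI>"""


def list_documents(volume_xml):
    """Return (xml:id, heading) pairs for every div assumed to be a document."""
    root = ET.fromstring(volume_xml)
    documents = []
    for div in root.iterfind(".//tei:div[@type='document']", NS):
        head = div.find("tei:head", NS)
        title = "".join(head.itertext()).strip() if head is not None else ""
        documents.append((div.get(XML_ID), title))
    return documents


for doc_id, title in list_documents(SAMPLE_VOLUME):
    print(doc_id, title)
```

To run the same function on a file saved to disk, read the file's contents into a string first, or adapt the function to use ET.parse on the file path.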
Possible Structure of documents
teiHeader
The teiHeader contains the metadata of a document and additional information about the edition and its standards.
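As a hedged illustration, the sketch below checks that a file contains the core parts of a header: a fileDesc with a titleStmt (holding a title), a publicationStmt, and a sourceDesc, which the TEI Guidelines require. The sample file, the program name in it, and the helper missing_header_parts are invented. The check is far weaker than validating against a TEI schema, which dedicated XML tools can do.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Paths, relative to the root <TEI> element, that this sketch looks for.
REQUIRED_HEADER_PATHS = [
    "tei:teiHeader",
    "tei:teiHeader/tei:fileDesc",
    "tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title",
    "tei:teiHeader/tei:fileDesc/tei:publicationStmt",
    "tei:teiHeader/tei:fileDesc/tei:sourceDesc",
]

# Invented file whose header deliberately lacks a sourceDesc.
INCOMPLETE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Invented document</title></titleStmt>
      <publicationStmt><publisher>Hypothetical Editing Program</publisher></publicationStmt>
    </fileDesc>
  </teiHeader>
  <text><body><p>Text.</p></body></text>
</TEI>"""


def missing_header_parts(tei_xml):
    """Return the required header paths that are absent from the file."""
    root = ET.fromstring(tei_xml)
    return [path for path in REQUIRED_HEADER_PATHS if root.find(path, NS) is None]


print(missing_header_parts(INCOMPLETE))
```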
Document Head
“Contains any type of heading, for example the title of a section, or the heading of a list, glossary, manuscript description, etc.”
For editions of diplomatic documents, this can be e.g. the original title, editorial title, etc.
Document Opener
The opener “groups together dateline, byline, salutation, and similar phrases appearing as a preliminary group at the start of a division”. It can contain classification information, the place of origin and a date.
Paragraphs
“Marks paragraphs in prose”. Used for the transcription of the main part of documents.
Notes
“Contains a note or annotation.”
Notes can be added in all parts of the document.
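Putting these parts together, the following sketch shows one possible shape for a complete, minimal TEI file, embedded in a short Python script that confirms the file is well-formed and lists the parts of the document in order. Every title, name, and sentence in the sample is invented, and the layout is only one of many that the TEI Guidelines permit. The script does not validate the file against the TEI schema.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented minimal file: header, heading, opener, one paragraph, one note.
MINIMAL_DOCUMENT = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Invented telegram, for illustration only</title></titleStmt>
      <publicationStmt><publisher>Hypothetical Editing Program</publisher></publicationStmt>
      <sourceDesc><p>Invented source description.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div>
        <head>Telegram from the Embassy to the Ministry</head>
        <opener>
          <dateline><placeName>Bern</placeName>, <date when="1989-11-10">10 November 1989</date></dateline>
        </opener>
        <p>The situation remains calm.<note>Editorial annotation goes here.</note></p>
      </div>
    </body>
  </text>
</TEI>"""

# Parsing raises an error if the file is not well-formed XML.
root = ET.fromstring(MINIMAL_DOCUMENT)
div = root.find("tei:text/tei:body/tei:div", NS)

# Strip the namespace to show the structural parts in document order.
parts = [child.tag.split("}")[1] for child in div]
print(parts)

note = div.find("tei:p/tei:note", NS)
print(note.text)
```

The note sits inside the paragraph it annotates, so a publishing tool is free to display it as a footnote, a pop-up, or both without any change to the encoding.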
Element | Definition (from the TEI Guidelines) |
<teiHeader> | (TEI header) supplies descriptive and declarative metadata associated with a digital resource or set of resources. One of the few elements unconditionally required in any TEI document. |
<fileDesc> | (file description) contains a full bibliographic description of an electronic file |
<titleStmt> | (title statement) groups information about the title of a work and those responsible for its content. |
<title> | contains a title for any kind of work |
<author> | in a bibliographic reference, contains the name(s) of an author, personal or corporate, of a work; for example in the same form as that provided by a recognized bibliographic name authority. |
<publicationStmt> | (publication statement) groups information concerning the publication or distribution of an electronic or other text |
<publisher> | provides the name of the organization responsible for the publication or distribution of a bibliographic item. |
<sourceDesc> | (source description) describes the source from which an electronic text was derived or generated, typically a bibliographic description in the case of a digitized text, or a phrase such as “born digital” for a text which has no previous existence. |
<text> | contains a single text of any kind, whether unitary or composite, for example a poem or drama, a collection of essays, a novel, a dictionary, or a corpus sample |
<body> | (text body) contains the whole body of a single unitary text, excluding any front or back matter |
<div> | (text division) contains a subdivision of the front, body, or back of a text. |
<head> | (heading) contains any type of heading, for example the title of a section, or the heading of a list, glossary, manuscript description, etc. |
<ref> | (reference) defines a reference to another location, possibly modified by additional text or comment |
<title> | contains a title for any kind of work |
<note> | contains a note or annotation |
<idno> | (identifier) supplies any form of identifier used to identify some object, such as a bibliographic item, a person, a title, an organization, etc. in a standardized way. |
<orig> | (original form) contains a reading which is marked as following the original, rather than being normalized or corrected |
<opener> | groups together dateline, byline, salutation, and similar phrases appearing as a preliminary group at the start of a division, especially of a letter |
<add> | (addition) contains letters, words, or phrases inserted in the source text by an author, scribe, or a previous annotator or corrector. The <add> element should not be used to mark editorial changes, such as supplying a word omitted by mistake from the source text or a passage present in another version. |
<dateline> | contains a brief description of the place, date, time, etc. of production of a letter, newspaper story, or other work, prefixed or suffixed to it as a kind of heading or trailer |
<placeName> | contains an absolute or relative place name. |
<date> | contains a date in any format |
<p> | (paragraph) marks paragraphs in prose. |
<list> | contains any sequence of items organized as a list |
<item> | contains one component of a list |
<emph> | (emphasized) marks words or phrases which are stressed or emphasized for linguistic or rhetorical effect. |
<hi> | (highlighted) marks a word or phrase as graphically distinct from the surrounding text, for reasons concerning which no claim is made |
Further elements | |
<label> | |
<num> | |
<address> | |
<msDesc> | |
<orgName> | |
<persName> | |
<quote> |