A Not-So-Modest Proposal: No More Legal PDFs
Introducing the GOLD Standard. It's better than the old standard.

January 16, 2014


Following my recent modest proposal, a question arose about whether the proposed "LREF" hyperlinks, which are really unique identifiers, would be suitable as a general markup language that could encode metadata. This leads into a complex nest of questions that lies at the heart of much that is wrong with legal technology today.

The short answer is that the LREF proposal is not suitable to encode metadata of various kinds, but also that it should not be. That's not its purpose; its purpose is to uniquely and accurately locate legal data. The OASIS committee that seeks to address all of these issues has apparently decided to start off by conflating two very different problems: the first, quickly and uniquely identifying legal information; and the second, consistently representing that information in machine-readable format. While these problems are obviously related, they are not the same, and they demand very different solutions.

Fortunately, solutions are not in short supply, and ample tools exist in our current technological landscape. Among friends, I have been informally calling the ideas in this proposal the "GOLD Standard," with GOLD being an acronym for Global Objective Legal Data. It of course must be "Global" because otherwise it would just be the OLD standard, which is precisely what we're trying to do away with.

As a programmer, I am generally not fond of object-oriented software development (and I know many people are) because for many tasks, at least involving web development, it seems overly complex and flexible to the point of absurdity. Sometimes you just need a hammer to solve a problem, not a blob of play-doh that you can fashion into a hammer by using forty other hammers, a knife, and a kiln. Yet in the case of representing legal information, object-oriented principles have some instructive lessons to offer—hence "Objective" Legal Data.

When one looks at a contract, a legal brief such as any kind of motion before a court, an agency office action, or a court opinion, they all have one thing in common: text. (In addition, they are all likely to be provided as PDF documents in our current environment.) The observation that text is common to all of these documents would not seem especially useful, except that it is actually quite remarkable how different these documents can manage to appear despite usually being just text. Once you clear away the differences in fonts, numbering systems, indentation, and headings, you find that the lowest common denominator is that they are broken up into clauses or paragraphs that each try to encompass some discrete meaning. Again, this just seems par for the course as with any document, but it's important, because it gives us the fundamental building block (or brick, if one prefers to stick with the gold analogy) upon which an entire standard can be built.

In short, the problem with standardizing around PDF files is that PDFs do an excellent job of reproducing layout information thanks to their roots in PostScript (the page description language on which the PDF format is based), but layout information is precisely what needs to be dissociated from the underlying content of legal documents to make them machine-readable and infinitely more useful. Any underlying legal data should ideally be exchanged free of any inherent formatting save for a limited set of markup symbols (for bold, italic and underline), and the formatting should then be applied later on as needed. In this way, the same legal document could be presented as a web page, as a PDF, as a Rich Text Format (RTF) file, or in any other format that has yet to come about.
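To make the separation of content and formatting concrete, here is a minimal Python sketch. It assumes the content is nothing more than a list of paragraph strings; the names paragraphs, render_html, and render_plain are hypothetical and not part of any proposal. The same content is rendered two ways without the content itself carrying any layout.

```python
from html import escape

# Hypothetical formatting-free content: just a list of paragraph strings.
paragraphs = [
    "Plaintiff & Defendant entered into a written agreement.",
    "Defendant failed to perform under the agreement.",
]


def render_html(paras):
    """Apply web formatting after the fact: one <p> element per paragraph."""
    return "\n".join(f"<p>{escape(p)}</p>" for p in paras)


def render_plain(paras):
    """Apply plain-text formatting: paragraphs separated by a blank line."""
    return "\n\n".join(paras)


html_version = render_html(paragraphs)
plain_version = render_plain(paragraphs)
```

A PDF or RTF renderer would be one more function of the same shape, which is the point: the output format becomes a late decision rather than a property of the data.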

To compare legal documents to a more progressive document-centric paradigm, the World Wide Web, HTML started off with most of its formatting cues embedded as tags; over time, Cascading Style Sheets (CSS) have evolved to replace many of those older tags, which are now discouraged (even though having learned HTML in 1997, I still feel wedded to many of them). The legal profession has the same general problem, but its evolution has been non-existent, and most practitioners haven't yet figured out what a tag even is. Most courts are struggling mightily just to keep up basic web sites, let alone handle e-filing of PDFs.

The GOLD Standard would encapsulate legal data free of formatting as just described so that it could be read, exchanged and analyzed by computers as well as humans. In a sense, GOLD is Microsoft Word in a strait jacket. As proposed, each block in a GOLD document is by default a paragraph in a conceptual sense—but it does not have to be a paragraph. GOLD blocks are fundamentally objects with properties (clearly, I relent to object-oriented principles in this instance). One such property is the type of block:

Required Block Property type: enumerated { Text, Heading, Image, Video, Data, Claim, Charge }

For the type property, "Text" is the default. A "Heading" is a special kind of text that will be emphasized but have no legal meaning, inserted only for clarity and organization. "Image" and "Video" are self-explanatory, but how they are stored is actually not a simple matter. Either could be stored in an absolute (embedded) or relative (hyperlinked) manner. "Data" essentially means binary data that could be used to encode any kind of proprietary digital document, including but not limited to PDFs, Word and Excel documents, PowerPoint slides, etc. Finally, the "Claim" and "Charge" types would still represent text, but text with special weight in the legal world. These two types would simultaneously also be Headings.

Optional Block Property indent: integer

The indent property would allow the user to present some text as indented from the left (or for Hebrew/Arabic, right) margin; the integer value would represent the number of tab stops to indent.

Optional Block Property number: string

Many paragraphs and clauses in legal documents are numbered. How they are numbered should be up to the user. The numbering scheme could be 1, 2, 3; A, B, C; i, ii, iii; etc. The number property would encode the actual number applied to that block, leaving the scheme implied for maximum flexibility.

Optional Block Property date: YYYY-MM-DD date

Frequently, a block of text in a legal document concerns events that transpired on a specific date. This property allows the author to explicitly specify a date to associate with a block of text. Such explicit specification could enable the later automatic generation of timelines, written histories, and the like, and make searching easier.
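As a rough illustration of the timeline idea, the following Python sketch sorts blocks by their date property. It assumes blocks are plain dictionaries and that undated blocks are simply left out of the timeline; both are assumptions of this example, and build_timeline is a hypothetical helper name.

```python
from datetime import date

# Hypothetical blocks; only the "date" and "text" properties matter here.
blocks = [
    {"text": "Defendant terminated the agreement.", "date": "2013-06-01"},
    {"text": "Both parties are Delaware corporations."},  # no date property
    {"text": "The parties signed the agreement.", "date": "2012-03-15"},
]


def build_timeline(blocks):
    """Return the dated blocks in chronological order, skipping undated ones."""
    dated = [b for b in blocks if "date" in b]
    return sorted(dated, key=lambda b: date.fromisoformat(b["date"]))


timeline = build_timeline(blocks)
for entry in timeline:
    print(entry["date"], entry["text"])
```

Because the property uses the YYYY-MM-DD form, the values parse directly with date.fromisoformat and no guessing about date formats is needed.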

Required Block Property text: string

The text property is the actual meat of the block, encoding whatever the author wants to say.
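The properties described so far could be modeled along the following lines. This is only a sketch in Python: the class names BlockType and Block and the is_heading helper are my own illustrative choices, and treating a missing optional property as None is an assumption rather than anything the proposal specifies.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class BlockType(Enum):
    """The enumerated values of the type property."""
    TEXT = "Text"
    HEADING = "Heading"
    IMAGE = "Image"
    VIDEO = "Video"
    DATA = "Data"
    CLAIM = "Claim"
    CHARGE = "Charge"


@dataclass
class Block:
    text: str                            # required: the content of the block
    type: BlockType = BlockType.TEXT     # required, but "Text" is the default
    indent: Optional[int] = None         # optional: number of tab stops
    number: Optional[str] = None         # optional: e.g. "1", "A", "ii"
    date: Optional[str] = None           # optional: YYYY-MM-DD

    def is_heading(self) -> bool:
        """Claim and Charge blocks are simultaneously Headings."""
        return self.type in (BlockType.HEADING, BlockType.CLAIM, BlockType.CHARGE)


paragraph = Block(text="The parties signed the agreement.", number="12", date="2012-03-15")
claim = Block(text="Count I", type=BlockType.CLAIM)
```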

There are a few other optional properties that would be extremely useful in legal systems. These properties concern the relationships between different pieces of legal data, and could make use of (proposed) LREF-style URIs.

Optional Block Property key: integer

The key property is intended as a unique identifier for each block in the GOLD data. If not specified, the keys would be automatically assigned in order of presentation.

Optional Block Property parent: integer

If one block is actually a subset of another block conceptually (consider a complaint containing a claim with several numbered paragraphs), the parent key of the given block can be specified in the parent property to link them together.
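A small Python sketch of how key and parent might work together follows. It assumes blocks are plain dictionaries, that automatic keys start at 1, and that automatic assignment skips any key an author has already used; none of these details is fixed by the proposal, and assign_keys and children_of are hypothetical helper names.

```python
def assign_keys(blocks):
    """Give each block lacking a key the next unused integer, in order of presentation."""
    used = {b["key"] for b in blocks if "key" in b}
    next_key = 1  # assumption: automatic keys start at 1
    for b in blocks:
        if "key" not in b:
            while next_key in used:
                next_key += 1
            b["key"] = next_key
            used.add(next_key)
    return blocks


def children_of(blocks, parent_key):
    """Return the blocks whose parent property points at parent_key."""
    return [b for b in blocks if b.get("parent") == parent_key]


# A claim with two numbered paragraphs beneath it.
blocks = assign_keys([
    {"key": 1, "type": "Claim", "text": "Count I"},
    {"parent": 1, "number": "12", "text": "The parties signed the agreement."},
    {"parent": 1, "number": "13", "text": "Defendant failed to pay."},
])
claim_paragraphs = children_of(blocks, 1)
```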

Optional Block Property linktype: string

In conjunction with the next property, link, the linktype property is perhaps the most novel and useful property of all. It could be set to any related type of LREF-encodable data. For a block with type "Claim", this property could be set to "Statute", and...

Optional Block Property link: string

...the link property could be set to the LREF-style URI for the statute the claim concerns, such as law://usc.18.1960 for 18 U.S.C. § 1960. In this manner, it would be possible to instantly index every complaint's claim by statute.
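Indexing claims by statute could then be as simple as the following Python sketch. Blocks are assumed to be plain dictionaries, index_claims_by_statute is a hypothetical helper, and the second URI (law://usc.18.1343) is made up by analogy with the example above purely for illustration.

```python
from collections import defaultdict


def index_claims_by_statute(blocks):
    """Map each statute URI to the Claim blocks that link to it."""
    index = defaultdict(list)
    for b in blocks:
        if b.get("type") == "Claim" and b.get("linktype") == "Statute" and "link" in b:
            index[b["link"]].append(b)
    return dict(index)


# Hypothetical blocks drawn from several complaints.
blocks = [
    {"type": "Claim", "text": "Count I", "linktype": "Statute", "link": "law://usc.18.1960"},
    {"type": "Text", "text": "Defendant operated the business without a license."},
    {"type": "Claim", "text": "Count II", "linktype": "Statute", "link": "law://usc.18.1343"},
    {"type": "Claim", "text": "Count I", "linktype": "Statute", "link": "law://usc.18.1960"},
]
by_statute = index_claims_by_statute(blocks)
```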

Optional Block Property entry: string

The entry property would use LREF encoding to specify the document in any docket that the GOLD file was intended to represent. For example, if the given GOLD file was an embodiment of Document 7 in California Northern District Case No. 5:12-cv-03123 (which may or may not really exist), the value would be docket://gov.uscourts.cand.5-12-cv-03123.7.
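For illustration, the entry value above could be assembled from its parts as in this Python sketch. The rule it applies (replace the colon in the case number with a hyphen, then join court, case, and document number with periods) is inferred from the single example above and is an assumption, not a statement of the LREF encoding rules; docket_entry_uri is a hypothetical helper.

```python
def docket_entry_uri(court, case_number, document):
    """Build a docket:// URI from a court identifier, case number, and document number.

    Assumption: the only transformation needed is turning ":" into "-"
    in the case number, as the example in the text suggests.
    """
    case_part = case_number.replace(":", "-")
    return f"docket://{court}.{case_part}.{document}"


entry = docket_entry_uri("gov.uscourts.cand", "5:12-cv-03123", 7)
```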

GOLD Standard-compliant files could be encoded in XML or JSON (preferably the latter), and derived automatically by extracting plain text from Microsoft Word documents. We have already developed an internal beta of this technology, which has not yet been released, and PlainSite will natively support GOLD data in place of PDF links on its docket pages. The ultimate goal is to have every court use GOLD Standard systems so that one could examine a particular claim in a complaint starting in the lowest court, and follow its progress through subsequent appeals just by looking at an automatically-generated timeline.
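As a sketch of what a JSON encoding might look like, the Python below serializes a tiny document using the property names from this proposal. The overall file layout (a flat JSON array of block objects, with entry carried on the first block) is an assumption of this example, and the block text is invented.

```python
import json

# A hypothetical three-block document using the properties proposed above.
blocks = [
    {
        "key": 1,
        "type": "Heading",
        "text": "COMPLAINT",
        "entry": "docket://gov.uscourts.cand.5-12-cv-03123.7",
    },
    {
        "key": 2,
        "type": "Claim",
        "text": "Count I",
        "linktype": "Statute",
        "link": "law://usc.18.1960",
    },
    {
        "key": 3,
        "parent": 2,
        "type": "Text",
        "number": "1",
        "indent": 1,
        "date": "2012-03-15",
        "text": "The parties signed the agreement.",
    },
]

encoded = json.dumps(blocks, indent=2)  # what would be stored or exchanged
decoded = json.loads(encoded)           # what a court system would read back
```

Nothing in the encoded form says anything about fonts, margins, or page breaks; those are left to whatever later presents the document.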

This is a complex area of legal technology with many stakeholders and this proposal is just designed to be a starting point. Feedback, as always, is welcome.
