by Kevin Leser, VP, Project Management, Planet Data
Though the terminology is perceived as new – “Technology Assisted Review” or TAR – there’s really little or nothing new about it. Anyone who has been using such search tools as Concordance and Summation since the 1990′s can attest that it’s just new verbiage wrapped around recent legal work-flow enhancements. Lawyers have been applying keyword searching to screen discovery files since the 1970′s when three companies, Aspen Systems, Informatics and Control Data (later rebranded as Quorum), were the first to point search engines at huge volumes of discovery files in an effort to whittle the mass down to a manageable pile of business records potentially relevant to a litigation.
If you’re under forty-five you probably won’t recognize those three names or, at least, remember much about their role as the founding elements of the litigation support industry. Their operations were all housed in the Maryland suburbs of DC. They all had mainframes running inverted file text search engines. Aspen used AspenSearch, its own creation, and likely the first of these unique tools. Informatics used Inquire, and Quorum used Basis. These all worked in a similar manner. The text of a document was broken down into unique words, which were assigned word IDs, which were, in turn, strung together in huge blocks of bits-and-bytes to make the contents yield to a search. If the search was for “Man bites dog”, the results would return exactly that group of words, and none other. It had to be a man not a woman, boy or girl; had to be a dog, not a poodle or a mutt; had to be a bite not a nip, gnaw or nibble.
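To make the inverted file idea concrete, here is a minimal sketch in Python. It is not how AspenSearch, Inquire or Basis were actually built (I am assuming a simple word-to-positions map rather than their word ID blocks), and the document IDs and texts are made up for illustration. It does show why a literal search returns "exactly that group of words, and none other".

```python
from collections import defaultdict

def build_index(docs):
    """Build a positional inverted file: word -> {doc_id: [word positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        # Naive tokenization (lowercase, split on whitespace); an assumption.
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index

def phrase_search(index, phrase):
    """Return sorted ids of documents containing the exact word sequence."""
    words = phrase.lower().split()
    if not words or any(w not in index for w in words):
        return []
    hits = []
    for doc_id, starts in index[words[0]].items():
        for start in starts:
            # Every later word must sit at the next position in the same doc.
            if all(start + i in index[w].get(doc_id, [])
                   for i, w in enumerate(words)):
                hits.append(doc_id)
                break
    return sorted(hits)

# Hypothetical three-document collection.
docs = {
    "DOC-001": "man bites dog",
    "DOC-002": "woman bites dog",
    "DOC-003": "man nips poodle",
}
index = build_index(docs)
print(phrase_search(index, "Man bites dog"))  # ['DOC-001']
```

The woman and the poodle never surface, which is the whole problem in miniature.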
These tools were all fed by armies of document coders and information analysts who sat arranged in rows of tables with blank document control forms to their right and stacks of Bates numbered paper to their left. In the days prior to electronic discovery (or even basic scanning and OCR), litigation support databases were built one hand-printed document control form at a time. College students looked at each document and painstakingly recorded the author, addressee, copyee, date, document type and document title. They also noted specific conditions such as the presence of marginal notes, illegible scrawl, and even ink blots caused when some scribe tipped over his ink well while penning a document. Yes, this last bit is a bit of an exaggeration, but the point is that today’s catchphrase, TAR, describes a process that, at a minimum, dates back to the Gerald Ford era.
So, if TAR is not actually a contemporary notion, what’s new enough in this realm to compel me to pen this article?
What’s new is that now concept engines are reshaping the completeness and accuracy curve of a search result. Basic keyword search tools are notoriously ineffective. The methodologies deployed in the 1970′s, and still largely active today, remain at the core of Concordance, Summation and dtSearch. The efficacy of the searches relies almost totally on the quality of the keyword list. However, the English language, like most any language, doesn’t innately lend itself to being probed with precision by simple keyword lists, even those that smart attorneys and litigation support professionals labor over for hours. There’s a famous (and still relevant) study done by David Blair and M. E. Maron in the early 1980′s, and published in the Computing Practices section of the March 1985 issue of Communications of the ACM. Although the article contains myriad charts and equations as support for its conclusions, the methodology of the study and its results were elegantly simple. They took a ton of paper documents, had them accurately keyed into machine readable form and comprehensively indexed in IBM’s text search tool, STAIRS, which was considered relatively powerful at the time.
The Blair and Maron team (including lawyers and paralegals) took this database and a set of keyword searches they expended an inordinate amount of time perfecting, and ran them several times, tweaking the results with each iteration until they were fairly confident that they found the vast majority of the relevant documents. Sort of sounds familiar, huh? Sounds similar to one of the keyword search driven reviews we supported over the last couple of months.
The team was then charged with manually reviewing all the original paper documents and flagging those they felt were relevant. When they compared the respective stacks, the keyword searches identified less than 20 percent of the documents determined to be relevant during the ‘Big Read’ of the roughly 350,000 pages that made up the original input to the database. For those of us who had been building litigation support databases since the 1970′s, there were no surprises here, only validation of what we had learned in the early years of using those first-generation tools. Paraphrasing a current political truism: It’s the language, stupid! There are many, many ways to say the same thing, particularly in English, and having a Roget’s Thesaurus at your side when you’re probing a database doesn’t help much.
Back then we got around these limitations by developing retrieval thesauri and taxonomies for specific litigations, which enabled document analysts to apply codes that reflected a document’s content. The distinction here is that legal issues can change over the course of a matter, but document content does not.
These content categorization aids were really elegantly designed tools that let an analyst objectively index each document so an attorney could then plug in a search code and more readily locate the documents that might be relevant to a production request. The only problem with something being elegant is that it’s usually also really, really expensive to develop and then apply; like tens of thousands of dollars to design, and $15 to $20 per document to apply, all in 1980′s dollars. But if you needed your retrieval to be comprehensive and accurate, you paid the freight.
These tools were often applied in asbestos and other mass torts products cases in the 80′s, where the accuracy and completeness of the retrieval and production process were critical to a solid defense. In the absence of the content indexing technology that has emerged over the last few years, the issue of accessing content was as relevant then to paper documents as it is to today’s electronic documents. Keyword search tools simply find documents, with an emphasis on simply. They don’t hunt them down, sift them and selectively offer morsels up for qualified review. Content engines do that.
I’m not going to drill down into the pros and cons of competing content indexing technologies. My company uses one content engine and other people in other litigation support firms use it too, and still others use different, but not altogether dissimilar tools. There are similar-but-different keyword search tools, and arguably, the same can be said of them. That’s a sidebar we can leave to the sales guys and the information scientists to postulate on. But, in a nutshell, content engines create intricate matrices of, well, content. You know, the stuff you might have said or written yourself, but maybe a little differently or a lot differently, using different language, maybe even a different premise or set of facts, but which when you read it you recognize it, and say, “hey, that’s what I’m looking for”. Or, in our business, maybe it’s “yikes, wish I hadn’t found that”.
The reality is that we think in concepts, not in keywords or phrases. We’re assaulted daily by politicians and advertisers who would like to think we function in a world dominated by buzzwords. Simply put, we’re conceptual beings and constantly filter input to get at the core of what we’re looking for or need at any given moment. This axiom holds true whether it’s where you can find the best ribeye steak or where that pesky little document that can help you or hurt you is hiding. That’s why the participants in the Blair and Maron study found the missing 80 percent of the relevant information in the test population when they serially reviewed all the 350,000 pieces of paper, page by page by page.
In the late 1980′s, Bell Labs applied something called Singular Value Decomposition (SVD) to textual material in an effort to replicate this core human capability inside the circuits of one of their neat Unix-driven boxes. SVD was, and is, used widely in statistical applications. Basically, it’s a way of building a two-dimensional matrix of something, such as the universe. In the early to mid-1990′s, related techniques were deployed in text search applications like Excalibur to look for patterns in documents, and in effect, pump up the volume of hits returned by a search. More is usually better, but for those of us who were forced to mull over the results, more was often just more. It’s all about the “accuracy versus completeness” curve, today known as “Precision and Recall”. Pattern recognition techniques pushed up the numbers of documents retrieved, but their relevance was often disproportionate to that extra volume. You looked at more documents, found more that were relevant, but only marginally more relevant. The added cost of reviewing those additional documents was often disproportionate to their value.
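For readers who want the accuracy versus completeness trade-off in numbers, here is a minimal sketch in Python. Precision is the share of retrieved documents that are relevant (accuracy); recall is the share of relevant documents that were retrieved (completeness). The document IDs and the two searches are hypothetical, chosen only to show how a broader search can raise recall while precision falls.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical collection: documents 1-10 are the truly relevant ones.
relevant = set(range(1, 11))

# A tight keyword search: 3 documents back, 2 of them relevant.
narrow_search = {1, 2, 11}

# A broad pattern-style search: 28 documents back, 8 of them relevant.
broad_search = set(range(1, 9)) | set(range(11, 31))

for name, retrieved in [("narrow", narrow_search), ("broad", broad_search)]:
    p, r = precision_recall(retrieved, relevant)
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

The broad search finds four times as many relevant documents, but the reviewer has to wade through 20 irrelevant ones to get them.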
Then Latent Semantic Indexing (LSI) arrived during the first decade of this century out of that sylvan-looking office park in Langley or those 10-story buildings at Ft. Meade, topped by arrays of satellite dishes. LSI was derived from the SVD model and is used widely by intelligence agencies to constantly screen the terabytes of data they grab hourly from “The Cloud”. Forgiving the pun, the results are spooky. You can input a chunk of verbatim text from one document and get back lots of verbatim text from other documents that virtually align conceptually without any readily apparent shared text.
For example, consider the following paragraph:
‘We would like to actively promote people into positions of power and influence to effect change in the legislative and regulatory process. This involves using lobbyists and personal contacts to move Congressional and Senatorial committees to change the regulations and laws to benefit Enron. In particular our connections with the George Bush Administration, office of the President and Vice President as well as congressman, senators and agency heads should be used to get policies changed on our behalf.’
When this exemplar text is used to probe an LSI-enabled database containing the EDRM sample set of Enron files, the content engine hits on the following paragraph in an internal Enron memo:
‘This memo is a follow up to your phone conversation with Roger Enrico regarding Enron contributing $250,000 to The President’s Dinner. The President’s Dinner is a joint fundraising effort by the National Republican Congressional Committee (NRCC) and the National Republican Senatorial Committee (NRSC). We contacted both Congressman Tom DeLay and the House Senate Dinner committee to ensure that Enron could fully participate in The President’s Dinner and receive credit for money we have already committed to give to the Committees earlier this year.’
Pretty amazing. LSI, along with its derivatives, is at the heart of conceptual search eDiscovery applications. Unlike keyword tools, LSI-based engines convert the textual content of documents into vector mathematics, creating multi-dimensional models that can be used to identify how the documents relate to each other based on the co-occurrence and frequency of all the words in all the documents, rather than on just shared keywords or vague patterns. The search results depend on how and where ideas and concepts co-join across documents.
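Here is a toy illustration of the idea in Python, using NumPy's SVD. This is a textbook-style sketch of LSI, not Content Analyst's implementation; the six mini-documents, the raw word counts (no weighting) and the choice of two concept dimensions are all my assumptions. The query shares no words with the third document, yet the two land together in concept space because the documents in between link their vocabularies.

```python
import numpy as np

# Hypothetical mini-collection: three "politics" and three "energy" documents.
docs = [
    "lobbyist senator congress",
    "senator congress fundraising",
    "fundraising dinner congress",
    "pipeline gas",
    "gas trading",
    "pipeline trading",
]
vocab = sorted({w for d in docs for w in d.split()})
row = {w: i for i, w in enumerate(vocab)}

def to_vector(text):
    """Raw term counts over the known vocabulary; unknown words are ignored."""
    v = np.zeros(len(vocab))
    for w in text.split():
        if w in row:
            v[row[w]] += 1.0
    return v

# Term-by-document matrix, then a truncated SVD keeping k concept dimensions.
A = np.column_stack([to_vector(d) for d in docs])
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2  # assumed; real engines keep far more dimensions
Uk = U[:, :k]
doc_coords = (np.diag(s[:k]) @ Vt[:k, :]).T  # one row per document

def concept_scores(query):
    """Cosine similarity of the query to each document in concept space."""
    q = to_vector(query) @ Uk  # fold the query into the same space
    norms = np.linalg.norm(doc_coords, axis=1) * np.linalg.norm(q)
    return doc_coords @ q / norms

query = "lobbyist senator"
scores = concept_scores(query)
shared = set(query.split()) & set(docs[2].split())
print("words shared with docs[2]:", shared)
print("concept score vs docs[2]:", round(float(scores[2]), 3))
print("concept score vs docs[3]:", round(float(scores[3]), 3))
```

A keyword search for "lobbyist" or "senator" would never return the fundraising dinner document; the concept score puts it right next to the query while the pipeline documents score near zero.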
So, how does LSI factor into the realities of today’s text-rich litigation environment? Content Analyst is a leading LSI-based tool that has been integrated into Relativity, among other review tools. We use Relativity at Planet Data, and we have also integrated Content Analyst into our early case assessment platform, Exego. Exego is utilized early in the ESI food chain, and is the perfect spot to inject LSI capabilities since this is the juncture in our workflow where a comprehensive pool of document text is first available to end-users. All the original email and edocs have been ingested and deduped, and had their metadata and body text extracted. This process includes files for which no text exists, such as image only PDFs, which we detect, render as TIFFs and OCR to maximize the depth of the searchable text pool. From here, our current best practice unfolds along these lines:
- We run the agreed upon keywords using our dtSearch integration to identify potentially relevant documents. Our sampling tool then carves out a statistically defensible subset of documents that can be reviewed directly in Exego or pushed to Relativity. Either way, our client puts these documents in front of a review team, who then separate the wheat from the chaff by flagging documents as “responsive”, “responsive but potentially privileged”, or “non-responsive”.
- Next, we take the responsive but potentially privileged documents and feed their content into the Content Analyst engine deployed in Exego. This step finds conceptually similar documents within the remaining document population. We then repeat this work-flow for the responsive documents. These two distinct sets are then exported, loaded to Relativity and batched for full-up review. Depending on the matter and the results of the initial sampling, sometimes a second sampling pass is applied in between the initial sampling and the broad export to Relativity for pre-production review.
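The two passes above can be sketched in a few lines of Python. To be clear about the assumptions: `expand_from_seeds`, the 0.5 threshold and the sample texts are hypothetical, word-overlap (Jaccard) similarity is only a stand-in for the concept engine, and I am assuming that documents pulled in by the privilege pass are set aside before the responsive pass runs. None of this reflects the actual Exego or Content Analyst interfaces.

```python
def jaccard(a, b):
    """Stand-in similarity: word overlap. A concept engine would go here."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def expand_from_seeds(seeds, population, similarity, threshold):
    """Return ids in `population` that are similar enough to any seed text."""
    return sorted(
        doc_id for doc_id, text in population.items()
        if any(similarity(text, seed) >= threshold for seed in seeds)
    )

# Hypothetical reviewed sample: id -> (text, reviewer tag).
reviewed = {
    "S1": ("advice from outside counsel on the lease dispute", "privileged"),
    "S2": ("change order for the concrete pour schedule", "responsive"),
    "S3": ("lunch menu for friday", "non-responsive"),
}
# Hypothetical remaining keyword-hit population, not yet reviewed.
remaining = {
    "R1": "outside counsel advice on the lease dispute settlement",
    "R2": "revised concrete pour schedule and change order",
    "R3": "parking garage closed friday",
}

def seeds_for(tag):
    return [text for text, t in reviewed.values() if t == tag]

# Pass 1: responsive but potentially privileged seeds.
privileged_batch = expand_from_seeds(seeds_for("privileged"), remaining, jaccard, 0.5)

# Pass 2: responsive seeds, run against what pass 1 left behind (assumption).
leftover = {k: v for k, v in remaining.items() if k not in privileged_batch}
responsive_batch = expand_from_seeds(seeds_for("responsive"), leftover, jaccard, 0.5)

print(privileged_batch)  # ['R1']
print(responsive_batch)  # ['R2']
```

Each batch would then be exported and loaded for full review; the third document stays in the residual pool.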
If this sounds simplistic, that’s because it sort of is. The steps are well defined and require little in the way of execution, with the exception of the actual nose-to-the-grindstone document review step. In terms of the processing, Content Analyst does all the heavy lifting associated with sifting the pool of documents down much more accurately than can be achieved (or even vaguely approached) by a mere keyword search.
And the results our clients are seeing have been promising. On one recent project, a law firm with a thriving document review practice applied this scenario to a construction matter. The collection was the usual mix of 40 GBs of email and edocs, comprising just under 148,000 documents that de-duped down to just over 136,000 documents. We created three random sample sets based on keyword hits. Combined, these sets ran to 4,600 documents. They were reviewed and the files determined to be responsive were pushed back against the balance of the files containing a keyword hit. The net result was that Content Analyst identified just 5,355 additional documents that were potentially responsive. These formed the primary review set. Following review, slightly fewer than 1,500 documents of that set were determined to be responsive.
Both to validate these results and formulate a supplemental production if warranted, our client then reviewed the balance of the documents that hit on the keyword searches but were not tagged by the concept engine. That population ran to just over 50,000 documents. Following a review of the residual keyword hits, another small subset of approximately 1,500 documents was found to be relevant (out of 50,000 that hit on the keywords alone). It is important to note that the bulk of those documents would likely have been identified by Content Analyst had a follow-on pass been applied.
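A quick back-of-the-envelope calculation in Python puts the two review passes side by side. The inputs are the rounded figures quoted above ("slightly less than 1,500", "approximately 1,500", "just over 50,000"), so the outputs are approximations, and `review_yield` is just a name I made up for responsive documents found per document reviewed.

```python
def review_yield(responsive_found, documents_reviewed):
    """Fraction of reviewed documents that turned out to be responsive."""
    return responsive_found / documents_reviewed

# Rounded figures from the construction matter described above.
concept_yield = review_yield(1_500, 5_355)    # concept-engine review set
residual_yield = review_yield(1_500, 50_000)  # residual keyword-only hits

print(f"concept set yield:  {concept_yield:.1%}")   # about 28%
print(f"residual set yield: {residual_yield:.1%}")  # about 3%
print(f"ratio: {concept_yield / residual_yield:.1f}x")
```

In other words, a reviewer working the concept-engine set hit a responsive document roughly nine times as often as one working the residual keyword hits.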
As I stressed early in this article, it’s all about the accuracy vs. completeness model of the search. Or, more simply put, that point on the curve where the number of documents you reviewed more closely intersects the number of documents that were then determined to be relevant. Is it arguable that the most successful document retrieval-review-production cycle is one where the least amount of time is spent identifying and reviewing the most relevant and potentially responsive documents? Probably not. The alternative is to continue to use basic keyword searching to over-retrieve irrelevant documents and, perhaps worse, overlook what may possibly be the majority of responsive documents.
To end where we started, keep in mind the recurring unmentionable in our little world: Keyword searches are inaccurate and incomplete … but equally so on both sides of a matter. We’re at a point in time where all parties need to move back to the future.