Standards for Competency in eDiscovery on the Rise – What’s Your Best Defense?

By Howard Reissner, Esq., CEO, Planet Data

The recently issued opinion in Branhaven, LLC v. Beeftek, Inc. et al., 2013 WL 388429 (D. Md. Jan. 4, 2013) highlights the requirement that attorneys continuously keep abreast of changes in professional standards of competence in their fields of practice. The bar for minimum competency is rapidly rising in the e-discovery universe. A significant percentage of federal judges have become well enough educated in this area to confidently determine which attorneys who appear in their courts both comply with the FRCP and have adequately investigated their clients’ data systems and infrastructure.

In “Branhaven” the court sanctioned both the client and counsel under FRCP 26(g) for the incorrect certification of a signed response to a request for production. As of the date of the certification, counsel had not made a reasonable effort to assure that the client had provided all of the available information and documents responsive to the discovery demand, yet he represented that he had done so. The decision noted that pursuant to Rule 26(g)(3) “if a certification violates this rule without substantial justification, the court … must impose an appropriate sanction on the signer, the party on whose behalf the signer was acting, or both …”

In a second recent federal court decision, In re Delta/AirTran Baggage Fee Antitrust Litigation, 846 F. Supp. 2d 1335 (N.D. Ga. 2012), Delta Airlines was sanctioned pursuant to Rule 26(g) for failing to ensure that all relevant hard drives and other ESI were searched, after repeatedly assuring the court that a reasonable inquiry had been made.

“Branhaven” and “In re Delta” send another clear signal to practicing attorneys that they will be held to a higher standard of professional competence, and that their conduct will be scrutinized by a judiciary that has become much more educated about technology and e-discovery over the past few years.

Along the same line of reasoning, counsel may not escape potential negative consequences by having relied upon an outside vendor to manage part of the discovery process. In Brookfield Asset Management, Inc. v. AIG Products Corp., 2013 U.S. Dist. LEXIS 29543 (S.D.N.Y. Jan. 7, 2013), the defendant was allowed to claw back documents that had been inadvertently produced because a FRE 502(d) agreement was in place. However, the damage was already done: due to vendor error, the redacted text was visible to the plaintiff when viewing the metadata. I believe the lesson here is that attorneys should be confident that they have the knowledge to retain vendors that have significant professional expertise, utilize high quality software, and have developed work-flows and quality controls to minimize these types of painful errors. See also: Peerless Industries, Inc. v. Crimson AV, LLC, 2013 U.S. Dist. LEXIS 2985 (N.D. Ill. Jan. 8, 2013), where counsel was held responsible for the incomplete collection of data by a vendor.

As a reminder to in-house counsel that they are responsible for monitoring the actions of their outside law firms, in Coquina Investments v. Rothstein, 2012 U.S. Dist. LEXIS 108712 (S.D. Fla. Aug. 3, 2012) the court imposed sanctions under Rule 37 against both the defendant and outside counsel. The findings of fact in the judge’s order will likely have substantial negative impacts for the defendant in future litigations brought by other plaintiffs.

Having participated in many legal education forums over the past year, I have found it apparent that the federal judiciary has significantly enhanced its expertise in many of these technical areas, perhaps well beyond that of many of the lawyers who appear before it. I believe that it is good advice to encourage litigators who are still unfamiliar with their fundamental obligations in e-discovery to quickly get themselves up to professional standards. It should be apparent today that a large percentage of litigation will include some aspect of ESI. Lack of technical knowledge, or the inability to employ others who have it, is no longer an excuse for discovery lapses. In addition to the various types of sanctions and malpractice actions that can result from these professional lapses, there is the real possibility of incurring disciplinary proceedings from the state or federal Bar. See: In re Disciplinary Proceedings Against McGrath, 174 Wash. 2d 813, 280 P.3d 1091 (2012).

Although there has been a steady climb up the technology learning curve for many federal judges, there still is a wide disparity in expertise within the group. As such, an attorney is well advised to spend some time researching a particular jurist’s level of e-discovery knowledge and the professional standards that have been imposed in their courtroom.  A review of the judge’s prior published opinions (and other precedent from the jurisdiction) should be a mandatory requirement. Over the past two years a substantial number of opinions have addressed attorney cooperation, data preservation, litigation holds, processing, searching, technology assisted review (TAR), and production. 

So, what actions should an attorney take before appearing in front of a judge for the first time? At the most basic level, all of the judge’s published opinions that include discovery issues should be read. In addition, any speeches, articles or other publications authored by the judge should be reviewed. Does the judge attend CLE and other professional conferences that address e-discovery? It would be prudent to seek out other counsel who have appeared before that court and ask about their experiences with that judge. Inquire as to the level of the judge’s technological savvy. Does the judge become directly involved in discovery disputes or does she keep a “hands off” approach and let the parties work it out between themselves? Is the judge a proponent of TAR and has she allowed or mandated its use in prior cases?

So, what steps can an attorney take to get off on the right foot with the judge? First, cooperate with opposing counsel from the outset as much as is practicable. Recently, the judiciary has taken a more active role in encouraging cooperation between counsel; see: Carrillo v. Schneider Logistics, Inc., 2012 WL 4791614 (C.D. Cal. Oct. 5, 2012), where the court awarded monetary sanctions for defendants’ repeated failures to cooperate in the discovery process. Also see: Easley v. Lennar Corp., 2012 WL 2244206 (D. Nev. June 15, 2012), where the court urged direct personal contact between counsel prior to filing motions to compel discovery. Finally, see: Kleen Products LLC v. Packaging Corp. of Am., 2012 WL 449865 (N.D. Ill. Sept. 28, 2012), where the judge commended the lawyers and their clients for conducting discovery in a collaborative manner.

Judges have made it clear that they do not want to be involved in “ministerial” discovery disputes. Attorneys who appear to be taking the extra steps to avoid these types of conflicts will have elevated themselves in the mind of the judge.

Secondly, make the effort to carefully consider your discovery requests, both as to scope and form of production. As data volumes continue to grow at an accelerating pace, the issue of proportionality has taken a more central role; see: Boeynaems v. LA Fitness Int’l, 2012 U.S. Dist. LEXIS 115272 (E.D. Pa. Aug. 16, 2012), ordering plaintiffs to pay for additional discovery costs prior to class certification, and Juster Acquisition Co. v. North Hudson Sewerage Authority, 2013 U.S. Dist. LEXIS 18372 (D.N.J. Feb. 11, 2013), where the court granted plaintiff’s discovery request as being reasonable and not creating a cost burden that outweighed the benefits of defendant’s compliance as considered within the scope of the case. These decisions emphasize that judges want cases to be decided on the merits and that discovery requests should take into consideration the value of the cases and issues under dispute.

Finally, if the judge is not as sophisticated in the technology issues as you would prefer, then provide educational resources and professional support that will validate your positions.

Technology Assisted Review is NOT New … Just Improved

by Kevin Leser, VP, Project Management, Planet Data

Though the terminology is perceived as new – “Technology Assisted Review” or TAR – there’s really little or nothing new about it. Anyone who has been using such search tools as Concordance and Summation since the 1990s can attest that it’s just new verbiage wrapped around recent legal work-flow enhancements. Lawyers have been applying keyword searching to screen discovery files since the 1970s when three companies, Aspen Systems, Informatics and Control Data (later rebranded as Quorum), were the first to point search engines at huge volumes of discovery files in an effort to whittle the mass down to a manageable pile of business records potentially relevant to a litigation.

If you’re under forty-five you probably won’t recognize those three names or, at least, remember much about their role as the founding elements of the litigation support industry. Their operations were all housed in the Maryland suburbs of DC. They all had mainframes running inverted file text search engines. Aspen used AspenSearch, its own creation, and likely the first of these unique tools. Informatics used Inquire, and Quorum used Basis. These all worked in a similar manner. The text of a document was broken down into unique words, which were assigned word IDs, which were, in turn, strung together in huge blocks of bits-and-bytes to make the contents yield to a search. If the search was for “Man bites dog”, the results would return exactly that group of words, and none other. It had to be a man not a woman, boy or girl; had to be a dog, not a poodle or a mutt; had to be a bite not a nip, gnaw or nibble.
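The word-ID approach described above can be sketched in a few lines of Python. This is only an illustration of the general inverted-file idea, not a reconstruction of AspenSearch, Inquire or Basis; the function names and the tiny document set are hypothetical, and the search here requires every query word to be present (it does not check word order).

```python
from collections import defaultdict

def build_index(docs):
    """Assign each unique word a word ID and record which documents contain it."""
    word_ids = {}
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            wid = word_ids.setdefault(word, len(word_ids))
            postings[wid].add(doc_id)
    return word_ids, postings

def search(query, word_ids, postings):
    """Return IDs of documents containing every query word, matched literally."""
    hits = None
    for word in query.lower().split():
        wid = word_ids.get(word)
        if wid is None:  # a word the index has never seen cannot match anything
            return set()
        hits = set(postings[wid]) if hits is None else hits & postings[wid]
    return hits or set()

# Hypothetical three-document collection
docs = {
    1: "man bites dog",
    2: "woman nips poodle",
    3: "dog bites man again",
}
word_ids, postings = build_index(docs)
print(search("man bites dog", word_ids, postings))     # documents 1 and 3
print(search("man bites poodle", word_ids, postings))  # empty set: no synonym matching
```

Document 2 says much the same thing in different words, and a literal index of this kind never finds it.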

These tools were all fed by armies of document coders and information analysts who sat arranged in rows of tables with blank document control forms to their right and stacks of Bates numbered paper to their left. In the days prior to electronic discovery (or even basic scanning and OCR), litigation support databases were built one hand-printed document control form at a time. College students looked at each document and painstakingly recorded the author, addressee, copyee, date, document type, document title. They also noted specific conditions such as the presence of marginal notes, illegible scrawl, and even ink blots caused when some scribe tipped over his ink well while penning a document. Yes, this last bit is a bit of an exaggeration, but the point is that today’s catchphrase TAR describes a process that, at a minimum, dates back to the Gerald Ford era.

So, if TAR is not actually a contemporary notion, what’s new enough in this realm to compel me to pen this article?
What’s new is that now concept engines are reshaping the completeness and accuracy curve of a search result. Basic keyword search tools are notoriously ineffective. The methodologies deployed in the 1970s, and still largely active today, remain at the core of Concordance, Summation and dtSearch. The efficacy of the searches relies almost totally on the quality of the keyword list. However, the English language, like almost any language, doesn’t innately lend itself to being probed with precision by simple keyword lists, even those that smart attorneys and litigation support professionals labor over for hours. There’s a famous (and still relevant) study done by David Blair and M. E. Maron in the early 1980s, and published in the March 1985 issue of Communications of the ACM (in its Computing Practices section). Although the article contains myriad charts and equations as support for its conclusions, the methodology of the study and its results were elegantly simple. They took a ton of paper documents, had them accurately keyed into machine readable form and comprehensively indexed in IBM’s text search tool, STAIRS, which was considered relatively powerful at the time.

The Blair and Maron team (including lawyers and paralegals) took this database and a set of keyword searches they expended an inordinate amount of time perfecting, and ran them several times, tweaking the results with each iteration until they were fairly confident that they found the vast majority of the relevant documents. Sort of sounds familiar, huh? Sounds similar to one of the keyword search driven reviews we supported over the last couple of months.

The team was then charged with manually reviewing all the original paper documents and flagging those they felt were relevant. When they compared the respective stacks, the keyword searches identified less than 20 percent of the documents determined to be relevant during the ‘Big Read’ of the roughly 350,000 pages that made up the original input to the database. For those of us who had been building litigation support databases since the 1970s, there were no surprises here, only validation of what we had learned in the early years of using those first-generation tools. Paraphrasing a current political truism: It’s the language, stupid! There are many, many ways to say the same thing, particularly in English, and having a Roget’s Thesaurus at your side when you’re probing a database doesn’t help much.

Back then we got around these limitations by developing retrieval thesauri and taxonomies for specific litigations, which enabled document analysts to apply codes that reflected a document’s content. The distinction here is that legal issues can change over the course of a matter, but document content does not.

These content categorization aids were really elegantly designed tools that let an analyst objectively index each document so an attorney could then plug in a search code and more readily locate the documents that might be relevant to a production request. The only problem with something being elegant is that it’s usually also really, really expensive to develop and then apply; like tens of thousands of dollars to design, and $15 to $20 per document to apply, all in 1980s dollars. But if you needed your retrieval to be comprehensive and accurate, you paid the freight.

These tools were often applied in asbestos and other mass torts products cases in the 1980s, where the accuracy and completeness of the retrieval and production process were critical to a solid defense. In the absence of the content indexing technology that has emerged over the last few years, the issue of accessing content was as relevant then to paper documents as it is to today’s electronic documents. Keyword search tools simply find documents, with an emphasis on simply. They don’t hunt them down, sift them and selectively offer morsels up for qualified review. Content engines do that.

I’m not going to drill down into the pros and cons of competing content indexing technologies. My company uses one content engine and other people in other litigation support firms use it too, and still others use different, but not altogether dissimilar tools. There are similar-but-different keyword search tools, and arguably, the same can be said of them. That’s a sidebar we can leave to the sales guys and the information scientists to postulate on. But, in a nutshell, content engines create intricate matrices of, well, content. You know, the stuff you might have said or written yourself, but maybe a little differently or a lot differently, using different language, maybe even a different premise or set of facts, but which when you read it you recognize it, and say, “hey, that’s what I’m looking for”. Or, in our business, maybe it’s “yikes, wish I hadn’t found that”.

The reality is that we think in concepts, not in keywords or phrases. We’re assaulted daily by politicians and advertisers who would like to think we function in a world dominated by buzzwords. Simply put, we’re conceptual beings and constantly filter input to get at the core of what we’re looking for or need at any given moment. This axiom holds true whether it’s where you can find the best ribeye steak or where that pesky little document that can help you or hurt you is hiding. That’s why the participants in the Blair and Maron study found the missing 80 percent of the relevant information in the test population when they serially reviewed all the 350,000 pieces of paper, page by page by page.

In the late 1980s, Bell Labs applied something called Singular Value Decomposition (SVD) to textual material in an effort to replicate this core human capability inside the circuits of one of their neat Unix-driven boxes. SVD was, and is, used widely in statistical applications. Basically, it’s a way of building a two-dimensional matrix of something, such as the universe. In the early to mid-1990s, related techniques were deployed in text search applications like Excalibur to look for patterns in documents, and in effect, pump up the volume of hits returned by a search. More is usually better, but for those of us who were forced to mull over the results, more was often just more. It’s all about the “accuracy versus completeness” curve, today framed as “Precision and Recall”. Pattern recognition techniques pushed up the numbers of documents retrieved, but their relevance was often disproportionate to that extra volume. You looked at more documents, found more that were relevant, but only marginally more relevant. The added cost of reviewing those additional documents was often disproportionate to their value.
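The accuracy-versus-completeness trade-off can be made concrete with the standard definitions of precision (the share of retrieved documents that are relevant) and recall (the share of relevant documents that were retrieved). The sketch below uses a made-up collection; the document IDs and the two searches are hypothetical and chosen only to show the two measures pulling in opposite directions.

```python
def precision_recall(retrieved, relevant):
    """Precision = accuracy of what was retrieved; recall = completeness of retrieval."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_hits = retrieved & relevant
    precision = len(true_hits) / len(retrieved) if retrieved else 0.0
    recall = len(true_hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical collection in which documents 1 through 10 are truly relevant
relevant = set(range(1, 11))

# A narrow keyword search: everything it returns is relevant, but it misses most
narrow = {1, 2}

# A broad pattern search: finds 8 of the 10, buried among 32 irrelevant documents
broad = set(range(1, 9)) | set(range(100, 132))

for name, result in (("narrow", narrow), ("broad", broad)):
    p, r = precision_recall(result, relevant)
    print(f"{name}: precision={p:.0%} recall={r:.0%} reviewed={len(result)}")
```

The broad search is far more complete, but reviewers must read forty documents to find eight, which is the "more was often just more" problem in miniature.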

Then Latent Semantic Indexing (LSI) arrived during the first decade of this century out of that sylvan-looking office park in Langley or those 10-story buildings at Ft. Meade, topped by arrays of satellite dishes. LSI was derived from the SVD model and is used widely by intelligence agencies to constantly screen the terabytes of data they grab hourly from “The Cloud”. Forgiving the pun, the results are spooky. You can input a chunk of verbatim text from one document and get back lots of verbatim text from other documents that virtually align conceptually without any readily apparent shared text.

For example, consider the following paragraph:

‘We would like to actively promote people into positions of power and influence to effect change in the legislative and regulatory process. This involves using lobbyists and personal contacts to move Congressional and Senatorial committees to change the regulations and laws to benefit Enron. In particular our connections with the George Bush Administration, office of the President and Vice President as well as congressman, senators and agency heads should be used to get policies changed on our behalf.’

When this exemplar text is used to probe an LSI-enabled database containing the EDRM sample set of Enron files, the content engine hits on the following paragraph in an internal Enron memo:

‘This memo is a follow up to your phone conversation with Roger Enrico regarding Enron contributing $250,000 to The President’s Dinner. The President’s Dinner is a joint fundraising effort by the National Republican Congressional Committee (NRCC) and the National Republican Senatorial Committee (NRSC). We contacted both Congressman Tom DeLay and the House Senate Dinner committee to ensure that Enron could fully participate in The President’s Dinner and receive credit for money we have already committed to give to the Committees earlier this year.’

Pretty amazing. LSI, and its derivations, is at the heart of conceptual search eDiscovery applications. Unlike keyword tools, LSI-based engines convert the textual content of documents into vector mathematics, creating multi-dimensional models that can be used to identify how the documents relate to each other based on the syntax and frequency of all the words in all the documents, rather than on just shared keywords or vague patterns. The search results depend on how and where ideas and concepts co-join across documents.
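To make the vector idea less abstract, here is a minimal sketch of the textbook LSI recipe: build a term-by-document count matrix A, keep only the top k singular directions, and compare documents in that reduced space. It is an illustration under stated assumptions, not a description of how Content Analyst or any other commercial engine works. To stay dependency-free it finds the right singular vectors as eigenvectors of AᵀA with simple power iteration; a real system would use a linear algebra library, term weighting, and hundreds of dimensions. The five short documents are invented.

```python
import math
from collections import Counter

def term_counts(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def raw_cosine(doc_a, doc_b):
    """Cosine similarity on plain word counts (the keyword view)."""
    ca, cb = term_counts(doc_a), term_counts(doc_b)
    vocab = sorted(set(ca) | set(cb))
    return cosine([ca[t] for t in vocab], [cb[t] for t in vocab])

def gram_matrix(docs):
    """G = A^T A, where A is the term-by-document count matrix."""
    counts = [term_counts(d) for d in docs]
    return [[float(sum(ci[t] * cj[t] for t in ci)) for cj in counts] for ci in counts]

def top_eigenpairs(G, k, iters=500):
    """k largest eigenpairs of symmetric G via power iteration with deflation."""
    n = len(G)
    G = [row[:] for row in G]                 # work on a copy
    pairs = []
    for _ in range(k):
        v = [float(i + 1) for i in range(n)]  # fixed, asymmetric start vector
        for _ in range(iters):
            w = [sum(G[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(v[i] * G[i][j] * v[j] for i in range(n) for j in range(n))
        pairs.append((lam, v))
        for i in range(n):                    # deflate: remove the direction just found
            for j in range(n):
                G[i][j] -= lam * v[i] * v[j]
    return pairs

def lsi_vectors(docs, k):
    """Document coordinates in a k-dimensional latent space (rows of V_k * Sigma_k)."""
    pairs = top_eigenpairs(gram_matrix(docs), k)
    return [[math.sqrt(max(lam, 0.0)) * v[d] for lam, v in pairs]
            for d in range(len(docs))]

# Invented mini-collection: documents 0-2 are about lobbying, 3-4 about gas
docs = [
    "lobbyist senator",
    "senator fundraising",
    "fundraising dinner",   # shares no word with document 0
    "pipeline gas",
    "gas storage",
]
vecs = lsi_vectors(docs, k=2)
print("keyword similarity, doc 0 vs doc 2:", raw_cosine(docs[0], docs[2]))
print("latent similarity,  doc 0 vs doc 2:", round(cosine(vecs[0], vecs[2]), 3))
print("latent similarity,  doc 0 vs doc 3:", round(cosine(vecs[0], vecs[3]), 3))
```

Documents 0 and 2 have no words in common, so a keyword comparison scores them as unrelated; in the reduced space they land together because document 1 links their vocabularies, while the gas documents stay apart.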

So, how does LSI factor into the realities of today’s text-rich litigation environment? Content Analyst is a leading LSI-based tool that has been integrated into Relativity, among other review tools. We use Relativity at Planet Data, and we have also integrated Content Analyst into our early case assessment platform, Exego. Exego is utilized early in the ESI food chain, and is the perfect spot to inject LSI capabilities since this is the juncture in our workflow where a comprehensive pool of document text is first available to end-users. All the original email and edocs have been ingested and deduped, and had their metadata and body text extracted. This process includes files for which no text exists, such as image only PDFs, which we detect, render as TIFFs and OCR to maximize the depth of the searchable text pool. From here, our current best practice unfolds along these lines:

  • We run the agreed upon keywords using our dtSearch integration to identify potentially relevant documents. Our sampling tool then carves out a statistically defensible subset of documents that can be reviewed directly in Exego or pushed to Relativity. Either way, our client puts these documents in front of a review team, who then separates the wheat from the chaff by flagging documents as “responsive”, “responsive but potentially privileged”, or “non-responsive”.
  • Next, we take the responsive but potentially privileged documents and feed their content into the Content Analyst engine deployed in Exego. This step finds conceptually similar documents within the remaining document population. We then repeat this work-flow for the responsive documents. These two distinct sets are then exported, loaded to Relativity and batched for full-up review. Depending on the matter and the results of the initial sampling, sometimes a second sampling pass is applied in between the initial sampling and the broad export to Relativity for pre-production review.
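As a rough illustration of what “statistically defensible” sampling can mean in the first step above, the sketch below applies the textbook sample-size formula for estimating a proportion, with a finite population correction, and then draws a simple random sample. The confidence level, margin of error, function names and population size are all assumptions for the example; the article does not describe how the Exego sampling tool actually sizes or draws its samples.

```python
import math
import random
from statistics import NormalDist

def sample_size(population, confidence=0.95, margin=0.02, p=0.5):
    """Sample size for estimating a proportion, with finite population correction.

    p=0.5 is the most conservative assumption about the true responsive rate.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # about 1.96 for 95%
    n0 = (z ** 2) * p * (1 - p) / margin ** 2        # infinite-population size
    n = n0 / (1 + (n0 - 1) / population)             # finite population correction
    return min(population, math.ceil(n))

def draw_sample(doc_ids, n, seed=0):
    """Simple random sample without replacement; the seed makes the draw repeatable."""
    return random.Random(seed).sample(list(doc_ids), n)

# Hypothetical population of 136,000 keyword-searchable documents
population = 136_000
n = sample_size(population)
sample = draw_sample(range(population), n)
print(f"review {n} of {population} documents for 95% confidence, +/-2% margin")
```

Recording the seed and the parameters alongside the sample is one simple way to make the draw reproducible if the methodology is later questioned.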

If this sounds simplistic, that’s because it sort of is. The steps are well defined and require little in the way of execution, with the exception of the actual nose-to-the-grindstone document review step. In terms of the processing, Content Analyst does all the heavy lifting associated with sifting the pool of documents down much more accurately than can be achieved (or even vaguely approached) by a mere keyword search.

And the results our clients are seeing have been promising. On one recent project, a law firm with a thriving document review practice applied this scenario to a construction matter. The collection was the usual mix of 40 GBs of email and edocs, comprising just under 148,000 documents that de-duped down to just over 136,000 documents. We created three random sample sets based on keyword hits. Combined, these sets ran to 4,600 documents in total. They were reviewed and the files determined to be responsive were pushed back against the balance of the files containing a keyword hit. The net result was that Content Analyst identified just 5,355 additional documents that were potentially responsive. These formed the primary review set. Following review, slightly less than 1,500 documents of that set were determined to be responsive.

Both to validate these results and formulate a supplemental production if warranted, our client then reviewed the balance of the documents that hit on the keyword searches but were not tagged by the concept engine. That population ran to just over 50,000 documents. Following a review of the residual keyword hits, another small subset of approximately 1,500 documents was found to be relevant (out of 50,000 that hit on the keywords alone). It is important to note that the bulk of those documents would likely have been identified by Content Analyst had a follow-on pass been applied.
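The figures from this project can be turned into rough yield rates. The arithmetic below uses the rounded numbers quoted above (“slightly less than 1,500”, “just over 50,000”, “approximately 1,500”), so the percentages are approximations, and the function and variable names are only for illustration.

```python
def yield_rate(responsive, reviewed):
    """Share of reviewed documents that turned out to be responsive."""
    return responsive / reviewed

# Rounded figures as reported for the construction matter
concept_set, concept_responsive = 5_355, 1_500      # concept-engine review set
residual_set, residual_responsive = 50_000, 1_500   # remaining keyword-only hits

concept_yield = yield_rate(concept_responsive, concept_set)
residual_yield = yield_rate(residual_responsive, residual_set)

print(f"concept-engine set:   {concept_yield:.1%} responsive")
print(f"keyword-only residue: {residual_yield:.1%} responsive")
print(f"documents reviewed per responsive document: "
      f"{concept_set / concept_responsive:.1f} vs {residual_set / residual_responsive:.1f}")
```

On these approximate numbers, roughly one in four documents in the concept-driven set was responsive, against about one in thirty-three in the keyword-only residue.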

As I stressed early in this article, it’s all about the accuracy vs. completeness model of the search. Or, more simply put, that point on the curve where the number of documents you reviewed more closely intersects the number of documents that were then determined to be relevant. Is it arguable that the most successful document retrieval-review-production cycle is one where the least amount of time is spent identifying and reviewing the most relevant and potentially responsive documents? Probably not. The alternative is to continue to use basic keyword searching to over-retrieve irrelevant documents and, perhaps worse, overlook what may possibly be the majority of responsive documents.

To end where we started, keep in mind the recurring unmentionable in our little world: Keyword searches are inaccurate and incomplete … but equally so on both sides of a matter. We’re at a point in time where all parties need to move back to the future.

Reflections on the LegalTech Panel

Judicial, Industry, Legal, Media Perspectives on Where Legal Technology is Taking Litigation and How It Affects You

By Howard Reissner

This year at the LegalTech New York conference, Planet Data hosted a panel with the Hon. Michael Baylson, U.S. District Judge, along with an eDiscovery analyst, an attorney and a journalist.

The session filled up early and eager attendees lined the walls to hear about the most hotly debated current issues in e-discovery. The hypothetical scenario allowed the panel and Judge Baylson to explore “attorney-client privilege” and “attorney work product protection”, cost shifting, TAR protocols, vendor selection, and the extent of the role of the judiciary in the discovery aspects of a case.

By design, the hypothetical situation was intended to generate debate between counsel on the appropriateness of their actions during the outset of discovery in a complex case involving multiple parties, numerous potential custodians, and the efficacy and completeness of data collection, processing and searching, review and production.

Over the past year a number of actual cases (including, of course, Judge Baylson’s “LA Fitness” decision) have addressed many of these newly emerging issues. Some of the most pressing current concerns have revolved around the efforts to implement TAR on a wider basis. The complexities, strengths and limitations of these technologies have led to procedural challenges to their utilization. The hypo created a situation where the defendants implemented a TAR process and produced far more documents than the plaintiffs. Nonetheless, the plaintiff’s counsel inquired as to how the “seed sets” were developed, and how the methodology for review and production was developed. Along similar lines, the defense counsel demanded to know how the plaintiff identified and collected their documents in light of the relatively small number of documents produced.

“With his decisions in Rhoads Industries and LA Fitness having helped shape the current state of the law of electronic discovery, it was great having Judge Baylson with us live on the panel,” said attorney David Horrigan, e-discovery and information governance analyst at 451 Research, who served as moderator and hypothetical defense counsel on the panel. “Adding David Brown’s perspective from The National Law Journal and Ann Kershaw’s experience as a practicing e-discovery attorney helped us cover all the issues—with Judge Baylson keeping us all in line from the bench.”

As in the real world, these issues were then put before the judge, who was reluctant to be drawn into the underbelly of discovery work-flow and technology. It appeared evident from this exercise that the judge favors litigants resolving these issues between themselves before they reach his courthouse. Our scenario highlighted the real concern that in these very early days of TAR adoption, it is important to slow down a bit and …

Areas of High Risk for Counsel when Producing ESI

By Steven Bailey, Senior Case Manager, Planet Data

In the pre e-discovery age, there was only one aspect to a document production – the Bates stamped paper documents. Today, the majority of all discoverable information is created and stored electronically. Document productions often include a variety of electronically stored information (“ESI”). This piece addresses the areas of high risk for counsel and aims to improve awareness of the issues when a matter involves the production of ESI. Understanding the benefits and challenges of available production formats will allow counsel to create an e-discovery plan to best maximize case strategy and effectively manage costs.

The Federal Rules require lawyers on both sides to address all discovery issues at the outset of the litigation. Given the many variables and factors involved in producing ESI, counsel should involve technical team members as early as possible to consider what formats of production are available and what can be generated. It is also important to understand the capacity of the litigation support department or vendor who will process and/or produce the data. Knowing “turn-around” times will go a long way to ensure deadlines are met and enough lead time is factored in for proper quality control checks and to accommodate any last minute changes to the production set.

If ESI is requested in a different format than what the party expected, the parties should discuss what is feasible in terms of expense and logistics and who will bear the extra costs. Where the ESI discovery is unknown or to be produced on a rolling basis, an on-going discussion between the parties will be necessary. Also, logistically speaking, the determination about production formats should be considered up front, before the data is processed, as some processing methods may prevent some production outputs. Having to search through boxes of media to re-process certain file types at production time, for example, results in added time and increased costs.

The production of ESI in the format in which it was originally created is referred to as native production. Native format is commonly used for files not meant to be printed such as PowerPoint presentations, spreadsheets, small databases, and audio and movie files. Data contained in these applications works properly when produced natively and it also may be the only way to produce the files for the other side to review. Some attorneys prefer to produce in native format to save the time and expense of converting to static image files like TIFF or PDF. In other cases, tight discovery deadlines leave attorneys no other choice but to produce the documents natively.

While reviewing and producing documents in native format can save time and money, native productions often present case management challenges and risks that may outweigh the benefits. Some key risk factors counsel should consider: Native productions typically do not include separately extracted metadata or text and, as a result, cannot be searched or indexed by the receiving party as delivered. Additionally, redacting sensitive or privileged information is not possible on native files. The producing party also cannot control or restrict the metadata embedded in the files themselves, such as hidden comments, tracked changes or speaker notes. Lawyers don’t always realize that they are granting full access to all of the document metadata when they produce ESI in native format.

Native format productions can also adversely affect case management and make it difficult to manage evidence during discovery and at trial. Native files cannot be endorsed with a Bates number or confidentiality designation. Consequently, documents used at depositions will not have a shared, page-level Bates number and highly sensitive materials could lack the necessary confidentiality designations. Also, data produced with non-standard or proprietary software may not be able to be opened and viewed at all.

Certain types of files like most e-mail and databases cannot be reviewed or produced in true native format without first being converted. The process of converting ESI to a non-editable digital file is known as rendering. Rendering of the ESI is necessary in order for the parties to redact privileged information. Also, if stamping or designations are required the native files have to first be converted to electronic image format.

Counsel should be aware of the common issues and risks that exist with ESI converted to image format. Most important is the risk of altering or losing data during the conversion process. For certain types of ESI, the images generated may not accurately represent the native. Excel files, for example, often contain hidden cells, rows, columns, worksheets, and formulas that are not displayed on the image. Similarly, images of Word documents often do not display comments and tracked changes. PowerPoint files generally do not print speaker notes by default, and animations do not display properly. For e-mail, blind copies and the date read are not available by default. Embedded data that does not appear in the TIFF view is likely to be less guarded and therefore more revealing and potentially harmful. A sound workflow plan will ensure that these types of ESI are also reviewed in native format to avoid producing embedded data that was never reviewed.

In determining the form of production, parties should also consider whether they want to request the production of searchable metadata and, if so, which fields. There can be hundreds of metadata fields associated with a single file. The parties should clearly state in writing the metadata requested and any known problems or gaps in the metadata received from third parties. Aside from searchable text, metadata should include information about relationships between documents, e.g., parent-child relationships. Most typically, metadata is produced in a standard delimited load file for loading into most litigation support software platforms. Clear and concise communication regarding the load file format will save time and money for each party producing and receiving data. The more common load files include .dii (Summation), .lfp (IPRO), and .opt (Concordance/Opticon).
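As a minimal sketch of what a delimited metadata load file with parent-child information might look like, the Python example below writes one row per document. The field names (BegDoc, EndDoc, ParentId and so on), the pipe delimiter and the sample values are all assumptions made for illustration; the actual fields and delimiters are whatever the parties agree to in writing and whatever the review platform expects.

```python
import csv
import io

# Hypothetical field list; the real list is whatever the parties agree on.
FIELDS = ["BegDoc", "EndDoc", "ParentId", "Custodian", "FileName", "NativePath"]


def write_load_file(records, delimiter="|"):
    """Return a delimited load file as a string, one row per document."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=FIELDS, delimiter=delimiter, lineterminator="\n"
    )
    writer.writeheader()
    for record in records:
        writer.writerow(record)
    return buffer.getvalue()


def children_of(records, parent_id):
    """List the BegDoc numbers of attachments belonging to a parent document."""
    return [r["BegDoc"] for r in records if r["ParentId"] == parent_id]


# Illustrative data only: an e-mail (parent) with one spreadsheet attachment.
sample_records = [
    {"BegDoc": "ABC0001", "EndDoc": "ABC0002", "ParentId": "",
     "Custodian": "Smith", "FileName": "email.msg", "NativePath": ""},
    {"BegDoc": "ABC0003", "EndDoc": "ABC0003", "ParentId": "ABC0001",
     "Custodian": "Smith", "FileName": "budget.xlsx",
     "NativePath": "NATIVES/ABC0003.xlsx"},
]

if __name__ == "__main__":
    print(write_load_file(sample_records))
    print(children_of(sample_records, "ABC0001"))
```

Keeping the parent identifier in its own field is one simple way to let the receiving party rebuild e-mail families after loading.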

Given the pros and cons of each production format, different forms are often necessary to accommodate different types of ESI. In practice, it is common for a production to involve a combination of images, native files, extracted text, OCR and metadata. Parties often will agree to produce certain ESI in native format along with image files such as TIFFs or PDFs. Word documents are often produced as TIFF images and Excel and PowerPoint files as natives. Files requiring redaction are produced as images, while similar non-redacted file types are often produced as natives.
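The mixed-format approach described above can be sketched as a simple routing rule. This is only an illustration: the function name, the extension list and the two-way "image" or "native" result are assumptions, and a real production specification would be negotiated between the parties.

```python
from pathlib import PurePath

# Assumed list of file types produced natively when no redaction is needed.
NATIVE_EXTENSIONS = {".xls", ".xlsx", ".ppt", ".pptx"}


def production_format(file_name, needs_redaction):
    """Pick a production format for one file under the assumed rule set."""
    if needs_redaction:
        # Redactions cannot be applied to native files, so redacted files go to image.
        return "image"
    extension = PurePath(file_name).suffix.lower()
    return "native" if extension in NATIVE_EXTENSIONS else "image"


if __name__ == "__main__":
    for name, redact in [("Budget.XLSX", False), ("Budget.XLSX", True),
                         ("memo.docx", False)]:
        print(name, redact, production_format(name, redact))
```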

It is important to understand how redacted information is affected by production format, and special attention should be paid to redacted materials when producing. Redacted images will require extra time to process. The images are OCR’d after the redactions are burned in, and the re-OCR’d text is substituted for the original text. When producing extracted text and metadata for redacted documents, it is necessary to remove the redacted information from all parts of the production. Quality control should verify that redacted information is properly withheld on the image and from the extracted text and fielded metadata.
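One way to picture this quality control step is a check that searches the text and fielded metadata going out the door for strings that were supposed to be redacted. The sketch below assumes a hypothetical document layout (a dictionary with "text" and "metadata" keys) and a known list of redacted strings; it is not a substitute for visual review of the images.

```python
def find_redaction_leaks(document, redacted_strings):
    """Return (field, string) pairs where redacted content still appears.

    `document` is assumed to look like
    {"text": "...", "metadata": {"Subject": "...", ...}}.
    """
    searchable = {"text": document.get("text", "")}
    searchable.update(document.get("metadata", {}))

    leaks = []
    for field, value in searchable.items():
        for secret in redacted_strings:
            # Case-insensitive match so "Acme" also catches "ACME".
            if secret.lower() in str(value).lower():
                leaks.append((field, secret))
    return leaks


if __name__ == "__main__":
    produced = {
        "text": "Contact John at [REDACTED]",
        "metadata": {"Subject": "Call 555-0100 re: settlement", "Author": "J. Doe"},
    }
    # The phone number was redacted on the image but survives in the Subject field.
    print(find_redaction_leaks(produced, ["555-0100"]))
```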

A proper document management plan will also assist in mitigating the risks associated with producing ESI. Thorough documentation of the process of review and conversion of the native files to image format should be in place. Documentation defining the review team’s redaction process is also key to ensuring that everything produced was properly reviewed. Counsel should also document its privilege searches and verify their accuracy at both the beginning and the end of the production, because attorneys sometimes make coding changes after the documents have been added to the production queue.

Quality control review of the results will also help reduce the potential risks substantially. Each production should be thoroughly checked for quality assurance by the producing party prior to release. The scope and specifications of the production should be reviewed for both technical and legal conformity.

Relativity Analytics – Key Features Used to Improve Review Efficiency and Cut Costs

By Denise Atesoglu

Analytics is a dynamic tool that can dramatically enhance workflow in Relativity and contribute to substantial time and cost savings. This article aims to outline tactics that save time but do not require significant time or resource investments.

The Analytics platform can greatly improve workflow within Relativity.  It can be used to increase review efficiency, quickly isolate highly responsive or unresponsive documents and prioritize the review of particularly relevant documents.

The underlying technology behind Relativity Analytics is LSI (Latent Semantic Indexing).  This technology was originally developed for the U.S. Intelligence Community by the Content Analyst Company to offer conceptual analysis and organization for large repositories of unstructured data.  In general terms, LSI is a math-based approach to text analytics that uses algorithms to organize text into a multi-dimensional vector space.  The proximity of the text in this space is used to identify conceptual relationships among the indexed terms and documents.  It does not rely on external sources to classify the text; instead, it relies solely on the patterns and relationships identified when the data is indexed.
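For readers who want the math, LSI as it is usually presented in textbooks can be summarized in two formulas. This is the standard formulation, not a description of how Relativity implements it internally, and the symbols are assumptions introduced here: X is a matrix of term counts by document, k is the number of dimensions retained, and each document i is represented by a reduced vector taken from the decomposition.

```latex
% Truncated singular value decomposition of the term-by-document matrix X
X \approx X_k = U_k \Sigma_k V_k^{\top}

% Conceptual similarity of documents i and j: cosine of the angle between
% their reduced vectors \hat{d}_i and \hat{d}_j (rows of V_k \Sigma_k)
\operatorname{sim}(d_i, d_j) =
  \frac{\hat{d}_i \cdot \hat{d}_j}{\lVert \hat{d}_i \rVert \, \lVert \hat{d}_j \rVert}
```

Documents whose reduced vectors point in nearly the same direction score close to 1 and are treated as conceptually related, even if they share few exact words.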

Conceptual Searching (CA Search)

Unlike traditional keyword searching, CA search results will yield conceptually similar documents based on the conceptual correlation of search terms to other indexed terms.  CA search will find documents that would not have otherwise been identified using traditional keyword searching.  Simply put, concept searching can be used to find documents related to a known term or phrase that do not necessarily contain the exact term or phrase.  We have found this type of searching to be a tremendous benefit to our clients, aiding in identifying responsive or privileged documents that would not have been found with keyword searching.

We have used CA search to identify top priority documents to be batched for immediate review.  This is particularly useful when dealing with very large data sets. For example, we recently had a project that consisted of over 11 million records with very tight discovery deadlines.  Traditional linear document review simply was not an option for this team.  With CA search, we were able to target the most conceptually relevant documents in the database and create concept-focused priority review batches within several hours of the data being loaded into Relativity.

Finding Similar Documents

The “Find Similar Documents” feature can easily be used on-the-fly in Relativity from both the viewer and text modes.  This feature returns conceptually correlated documents based on the full text of an entire document.  It helps users quickly return a set of documents that are conceptually very similar to the key responsive or non-responsive documents at hand.  We have successfully used this feature to locate groups of non-responsive, potentially privileged and extremely relevant documents, facilitating a more targeted approach to review.
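The general idea of ranking documents by similarity to a seed document can be illustrated with a much simpler stand-in. The sketch below uses plain word counts and cosine similarity, which only measures shared vocabulary; it is not LSI and not Relativity's implementation, and the function names, threshold and sample documents are assumptions.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Count lower-cased word occurrences in a document."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def find_similar(seed_id, documents, threshold=0.5):
    """Return (doc_id, score) pairs at or above the threshold, best match first."""
    seed = term_vector(documents[seed_id])
    scored = [
        (doc_id, cosine(seed, term_vector(text)))
        for doc_id, text in documents.items()
        if doc_id != seed_id
    ]
    matches = [pair for pair in scored if pair[1] >= threshold]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    docs = {
        "spam1": "win a free cruise click here now",
        "spam2": "click here to win a free cruise today",
        "memo": "quarterly revenue forecast for the board meeting",
    }
    print(find_similar("spam1", docs))
```

Starting from one known spam message, a reviewer could use a ranking like this to pull similar messages out of the queue before batching.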

In one of our recent projects, we successfully used the “Find Similar Documents” feature to quickly identify a large number of spam emails prior to batching the documents for review.  This process resulted in our client reviewing 30 percent fewer documents and contributed to great time and cost savings.

Conceptual Near-Duplicate Detection

Quickly identifying conceptual near-duplicates is now common practice in Relativity databases when Analytics is enabled. Near-duplicate detection is based on conceptual similarity rather than on exact text and metadata matches.  Near-duplicate groupings can be integrated with advanced searching and automated batching in Relativity, as needed.

In practice, we have found that the identification of near-duplicates is particularly useful when MD5 values are not available to identify exact duplicates.  We were able to apply this technology in a recent project on a set of newly loaded third-party data.  After identifying the conceptual near-duplicates, we found that nearly 40 percent of the records had near-duplicates already coded in the database.  The client was then able to leverage its prior coding to code the new data more efficiently, resulting in significant cost savings.

Even in cases where MD5 hash duplicates are available, the addition of conceptual near-duplicates can improve review workflow.  Near-duplicates can aid in identifying potentially privileged documents to be flagged for a second-level privilege review.  Additionally, they can be useful when spot-checking coding consistency across documents.
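The difference between exact and near-duplicates can be sketched in a few lines. In this example, exact duplicates are detected by comparing MD5 hashes of the text, while near-duplicates are approximated by the overlap of three-word sequences (Jaccard similarity). That overlap measure is a simple textual stand-in, not the conceptual similarity Analytics uses, and the function names and the 0.6 threshold are assumptions.

```python
import hashlib


def md5_of(text):
    """MD5 hash of the text; identical text always yields an identical hash."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def shingles(text, size=3):
    """Set of overlapping word sequences of the given length."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Share of shingles the two sets have in common (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify_pair(text_a, text_b, threshold=0.6):
    """Label a pair of documents as 'exact', 'near' or 'distinct'."""
    if md5_of(text_a) == md5_of(text_b):
        return "exact"
    if jaccard(shingles(text_a), shingles(text_b)) >= threshold:
        return "near"
    return "distinct"


if __name__ == "__main__":
    original = "Please review the attached draft agreement before Friday"
    revised = "Please review the attached draft agreement before Monday"
    print(classify_pair(original, original))
    print(classify_pair(original, revised))
```

A single changed word defeats the hash comparison entirely, which is why near-duplicate grouping can still add value when hash values are available.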

Clustering

Clustering is a mass operation that automatically groups conceptually correlated documents into virtual folders displayed by topic.  Users are not required to define a set of exemplar documents upfront.  We frequently use clustering in conjunction with batching to generate conceptually similar review batches, aiding in review efficiency.

In a recent project clustering was applied to the full database consisting of around 80,000 records.  It took less than one hour for clustering to complete in Relativity.  The results allowed our client to quickly determine that around 45 percent of the documents were not relevant or eligible for review.  The non-relevant documents were then moved to a secure folder, allowing our client to focus on only the potentially relevant documents. This example clearly demonstrates the vast cost and time saving benefits associated with clustering.
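To make the idea of grouping without exemplar documents concrete, the sketch below performs a very simple single-pass grouping: each document joins the first existing group whose first member shares enough words with it, and otherwise starts a new group. This is a toy stand-in based on word overlap, not the concept-based clustering Relativity performs, and the function names, threshold and sample documents are assumptions.

```python
import re


def tokens(text):
    """Set of lower-cased words in a document."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(a, b):
    """Jaccard overlap between two word sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(documents, threshold=0.3):
    """Group document ids; no exemplar documents are supplied up front."""
    groups = []  # each entry: (word set of the group's first document, member ids)
    for doc_id, text in documents.items():
        doc_tokens = tokens(text)
        for leader_tokens, members in groups:
            if overlap(leader_tokens, doc_tokens) >= threshold:
                members.append(doc_id)
                break
        else:
            # No existing group was close enough, so start a new one.
            groups.append((doc_tokens, [doc_id]))
    return [members for _, members in groups]


if __name__ == "__main__":
    docs = {
        "d1": "merger agreement draft for review",
        "d2": "revised merger agreement draft",
        "d3": "office holiday party schedule",
        "d4": "holiday party catering schedule",
    }
    print(cluster(docs))
```

Groups produced this way could then be batched together or, as in the project described above, set aside in bulk when an entire group is clearly not relevant.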

Analytics is a versatile tool that can enhance workflow in Relativity and contribute to substantial time and cost savings.  Furthermore, the features outlined above do not require significant time or resource investments. Our clients have had notable success using Analytics to isolate priority documents for immediate review, locate highly responsive or unresponsive data, and improve overall coding efficiency with the use of clustering and near-duplicate identification.