
The Three-Legged E-Discovery Model

In-house counsel typically provide for their e-discovery needs through a partnership with a hosting vendor, who handles processing and hosting for major cases and generally bills based on the number of gigabytes or files processed. However, the best way to manage costs is to add a third role – a consultant who can suggest alternative technologies and processes to limit overall costs.

The tripartite model recognizes the inherent conflict of interest for hosting vendors, who want to maximize their revenue by maximizing the volume of data they process and host. Their revenues are their corporate clients’ costs – the two are literally two sides of the same coin. When bonuses and commissions for hosting vendor personnel are based on the volume hosted, the natural inclination is to recommend a “collect-everything-and-sort-it-out-in-the-review-platform” approach.

The bias towards maximizing the volume of data held by the review platform extends beyond just how much data is put into the system and can impact how that data is stored as well. For example, electronic documents like Word, PowerPoint, and Excel can have embedded graphics. Those embedded graphics can be extracted and stored as separate files despite the fact that in many cases they just clutter the review database and needlessly inflate the storage space being billed by the hosting vendor.

Hosting vendors understandably focus their attention on ways to use their existing system and may not take the time to learn of new technologies or new pricing models that could result in lower costs to clients but lower revenue for them. Vendors have finite resources to devote to market research especially if they’ve already made multi-year commitments to pay hefty licensing fees for their current offerings. Furthermore, some software licenses restrict the ability of vendors to share benchmarking data comparing the effectiveness of the primary technology with that obtained from other sources. 


Case-Sensitive Searching. We recently conducted a post-action audit of e-discovery for a client and found that the initial selection of files to be reviewed in the final hosting platform involved a key term that was an acronym appearing in all caps, e.g., ACT. The hosting vendor dutifully collected matching files without either knowing or disclosing that the key term search could have specified case sensitivity, which would have excluded the many files containing only the lower-case (“act”) or mixed-case (“Act”) versions of the word, none of which were responsive.

Analytics/Predictive Coding. Hosting vendors often tout the efficiency of their analytics and predictive coding technology in eliminating clutter. And they’re right: analytics and predictive coding/Technology Assisted Review are very effective. However, that type of technology can be licensed on terms other than per-gigabyte fees. Concept clustering, social network analysis, and other tools can be used iteratively to cull the non-relevant “noise” documents before the data goes to hosting vendors, without paying volume-based fees.

Having a knowledgeable consultant who is not compensated primarily based on the volume processed or hosted can result in the selection of more cost-effective e-discovery solutions.

The 10 Steps of Early Case Analytics

Quantum regularly employs the following “10 Steps of Early Case Analytics” in civil cases:

1. Run visual analytics and advanced pattern matching on a representative custodial sample to help test key terms (see the influential Blair & Maron study to help understand the limitations of key term search)

2.  Initial Search Term Report (STR)

3.  Run Analytics

4.  Examine outlier terms

5.  Examine bulk mail

6.  Examine concepts

7.  Verify: Look across Subjects and Sender Domains

8.  Counsel approval to remove non-relevant mail

9.  Counsel updates search term list, re-run

10.  Production & Removal of redundant, non-relevant file types

Big Files, Big E-Discovery Cost Savings (Sometimes)

Reducing e-discovery costs means being able to deploy various cost-reduction strategies depending on the circumstances of each case. One of those strategies is to be aware of the extent to which large files are being considered for processing in a final review platform. To use an extreme example, if one file accounted for half of the gigabytes in a case, it would make sense to examine that file before loading it into a final review platform.

The critical benchmark to know is what your hosting provider charges to ingest and host a gigabyte for a year. If it doesn’t cost anything, there’s far less impetus to examine file size. This posting uses a $200-per-gigabyte benchmark; you’ll obviously want to use your own metrics.

The basic idea is simple: get a listing of files sorted in descending file size and apply the first-year hosting cost to see if it might make sense to use an alternative approach for the largest files. Here are the file sizes for the largest 20 files in a recent case, referred to as Case 1 in the following content:
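As a sketch of that listing, here is how the sort-and-cost calculation might look. The file names and sizes below are invented for illustration; the $200/GB rate is simply the benchmark assumed in this posting.

```python
# Hypothetical file listing; names, sizes, and the rate are assumptions,
# not data from the actual case discussed above.
RATE_PER_GB = 200  # assumed ingest-plus-first-year-hosting rate, $/GB

files = [
    ("contract_scan.pdf", 0.1),
    ("video_deposition.mp4", 4.2),
    ("mailbox.pst", 0.6),
    ("cad_drawings.zip", 2.8),
    ("database_backup.bak", 1.9),
]

# Sort descending by size and estimate the first-year cost of each file.
for name, gb in sorted(files, key=lambda f: f[1], reverse=True):
    print(f"{name:22s} {gb:5.1f} GB  ${gb * RATE_PER_GB:,.2f}")

total_cost = sum(gb for _, gb in files) * RATE_PER_GB
print(f"First-year cost if all are loaded: ${total_cost:,.2f}")
```

The same two-column report, run against a real file listing, is all that is needed to decide which files merit a closer look before ingestion.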

As shown in the above table, the client could save over $3,000 in ingestion and hosting costs by examining those 20 files without putting them in an expensive final review platform.

To prepare this posting I examined files from four recent cases, two were relatively small, and two were mid-sized. The following scatter diagram shows that in the four cases examined, over 80% of the total gigabytes in those cases were accounted for by less than 20% of the files. The steeper the cumulative GB percentage curve, the more likely it is that examining the largest files in lieu of sending them to a hosted review platform could be cost effective.
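The cumulative-percentage check described above can be sketched in a few lines. The sizes here are invented; the point is the calculation, not the numbers.

```python
# Sketch of the cumulative-GB curve: what share of total gigabytes do the
# largest 20% of files hold? Sizes (in GB) are invented for illustration.
sizes_gb = sorted([60, 22, 5, 4, 3, 2, 1, 1, 1, 1], reverse=True)

total_gb = sum(sizes_gb)
cum_pct = []
running = 0.0
for size in sizes_gb:
    running += size
    cum_pct.append(100 * running / total_gb)

top20_count = len(sizes_gb) // 5        # the largest 20% of the files
top20_share = cum_pct[top20_count - 1]  # their share of the total GB
print(f"Largest 20% of files hold {top20_share:.0f}% of the gigabytes")
```

The steeper the front of the `cum_pct` curve, the more promising a largest-files-first strategy becomes.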

Here are some other metrics from those four cases:

As can be seen, three of the cases wouldn’t benefit significantly from treating the largest files outside the final review platform. The second and third cases involve such small total expenditures for ingestion and hosting that it’s simply not worth much time reviewing individual files. Those were cases that primarily involved email, and such collections have fewer large files because of email attachment size limitations. On the other hand, cases that involve large numbers of file share files, like Case 1, are more likely to have large numbers of large files.

As you can see in the table, the largest 100 files in Case 1 will cost on average $70 per file to ingest and host for a year. That provides a reasonable incentive to consider whether a lower-cost treatment outside a final review platform might be feasible, as those 100 files will cost $7,000 for year one. By contrast, Case 4 has a significant number of gigabytes and files, but its largest files don’t offer nearly the savings opportunities of Case 1.

No Single Answer
There’s no single best answer on how to treat large files in all cases. As we’ve seen, some cases are too small to warrant special processing, and in others the file size profile doesn’t hold much promise of significant savings – there is no low-hanging fruit. Sometimes there may simply not be time to evaluate alternative processing options, and in some situations lawyers may feel that the simplicity of putting everything in one platform outweighs the cost savings.

The best advice is to make it a practice to look at the largest-size files in a case and decide whether further analysis is warranted.

Upcoming: Triaging e-discovery by file type.


E-Discovery Issue Coding: Good Idea But…

The idea behind issue coding is a good one: when preparing for depositions, trial, or summary judgment, attorneys would be able to pull up the documents that support or refute, or maybe just provide good context for, specific issues. It’s a way to select key bits of evidence from large amounts of data, much of which is ultimately irrelevant. However, litigation teams ought to assess how consistently the issue codes are being applied before over-relying on them. Assessing consistency may well impact when the issue coding is done, by whom, and at what level of granularity.

Measuring Coding Consistency

Issue coding is not free. If done as part of a responsiveness review, it will reduce the number of documents that can be reviewed per hour. A complex issue code schema may cut throughput in half, meaning it costs as much as the responsiveness review itself. Companies can use estimates of the different throughput rates to estimate the cost of issue coding. It makes sense to invest a little time determining the value of the coding.

Control Sets. The best way to measure coding consistency of the people you are considering using for coding is to have each of them issue code the same set of documents. There are several ways to analyze the coding results, including using Excel, Access, or possibly a review platform. The most flexible way to analyze results may be to create a simple relational database with separate tables for issue codes, coders, and documents reviewed, something along the following lines:

The Coding Results Table has one row for each code assigned to each document. This will be much easier to use for analysis than a table that has a multiple-value field with all the issue codes assigned to the document in one aggregated field. If you have a multiple-issue code field to work with, you may find it easier to parse it into the suggested layout.
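As a minimal sketch of that layout, here is one way the three tables might be created and queried. All table names, column names, coders, and codes are illustrative, and SQLite is used only because it ships with Python.

```python
import sqlite3

# Sketch of the suggested relational layout: one row in coding_results for
# each issue code assigned to each document by each coder.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE issue_codes (issue_id INTEGER PRIMARY KEY, code TEXT);
    CREATE TABLE coders (coder_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE coding_results (coder_id INTEGER, doc_id TEXT, issue_id INTEGER);
""")
con.executemany("INSERT INTO coders VALUES (?, ?)", [(1, "A"), (2, "B")])
con.executemany("INSERT INTO issue_codes VALUES (?, ?)",
                [(10, "breach"), (11, "damages")])
con.executemany("INSERT INTO coding_results VALUES (?, ?, ?)",
                [(1, "D1", 10), (1, "D1", 11), (2, "D1", 10)])

# Coding intensity: total codes assigned by each coder.
rows = con.execute("""
    SELECT c.name, COUNT(*) FROM coding_results r
    JOIN coders c ON c.coder_id = r.coder_id
    GROUP BY c.name ORDER BY c.name
""").fetchall()
print(rows)  # [('A', 2), ('B', 1)]
```

Because each code-per-document is its own row, both the intensity and congruence analyses below become simple grouping queries instead of string parsing.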

Here’s how to analyze the results:

Coding Intensity – General Sense of Relevancy. At a gross level, how many issues does each coder assign? Some coders will see an issue lurking in every paragraph; others won’t see any unless the document would be entirely dispositive of an issue. There’s no one right answer; the best idea is to have the lead attorney or the principal litigators code the same document set and then use coders who share their general sense of relevance.

The output of this type of analysis would be a table with columns for coders and total codes assigned:

Coding Congruence – Thinking Alike. Identify which pairs of coders assign the same issue codes to the same document to see how congruent their results are – the extent to which their coding overlaps.

The end result of the congruence analysis would be a table with coders listed on both the column and row headers and the congruence measure for the pair placed in the intersecting cell:

This is how the calculations would be performed:

Congruence of A and B: BOTH/(BOTH + A_NOT_B + B_NOT_A) = 80/(80 + 5 + 15) = 80%

Note that the suggested database structure permits congruence comparisons at a higher level, e.g., people can be regarded as agreeing on high level codes even if the second- or third-level codes are different.
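The congruence calculation above can be sketched as set arithmetic. The counts below mirror the worked example (80 shared assignments, 5 unique to A, 15 unique to B); the document IDs and code names are invented placeholders.

```python
def congruence(a, b):
    # BOTH / (BOTH + A_NOT_B + B_NOT_A): overlap divided by union.
    return len(a & b) / len(a | b)

# Each coder's work is a set of (document, issue code) pairs.
shared = {(f"doc{i:03d}", "breach") for i in range(80)}
a = shared | {(f"a_only{i:02d}", "breach") for i in range(5)}
b = shared | {(f"b_only{i:02d}", "breach") for i in range(15)}

print(f"Congruence of A and B: {congruence(a, b):.0%}")  # 80%
```

Running `congruence` over every pair of coders fills in the row-by-column matrix described above.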

Who Does the Issue Coding

To state the obvious, the coders who think like the lead litigators ought to do the issue coding, or they ought to review the issue codes before the codes are finalized in the review platform.

Complementary Tools, e.g., Finding Conceptually-Similar Documents

If issue coding were the only tool available to attorneys for depo or trial prep, completeness of coding would be a major concern. Fortunately, there are many complementary tools and ways to expand on the documents initially tagged with issue codes. For example, most systems can identify conceptually-similar documents; attorneys can identify the people associated with a tagged document and examine other documents from, to, or about the same person for the same general time frame; or they can search for the key terms used to discuss the issues in the tagged document. All of which is to say that an issue coding system that tags documents that are highly relevant to specific issues will have value even if some relevant documents are not initially tagged.

When Coding is Done

The significance and interpretation of issues change as the case progresses. Highly detailed issue coding is better done closer to trial than when documents are first reviewed. To the extent issue coding is done earlier, it is better kept more general.


Having multiple levels of issue codes will slow the assignment of codes, and multi-level schemes can be difficult to keep updated as the lawyers’ understanding of the case develops over time. It may be more effective to have higher-level issue codes combined with a “hotness” rating, something like:


Issue coding can serve as a useful way to organize the documents that are most relevant to specific issues. Its usefulness can be maximized by evaluating those who will do the coding to ensure consistency.


E-Discovery Process Improvement: The After-Action Audit

Process improvement involves an ongoing effort to identify what’s working well and what could be improved. In e-discovery, it’s sometimes painfully obvious when things didn’t work well, e.g., a production deadline is missed or sensitive data is produced. However, it’s not always obvious what could be improved – it’s hard to identify potential improvements if results are about what people thought was achievable.

After-action audits can be eye-opening in identifying ways to improve the e-discovery process. However, the term “improve” is rather broad; more specific goals will provide better guidance for the audit. Jeff Carr, long-time legal cost expert, recommends SMART goals – those that are Specific, Measurable, Achievable, Realistic, and Timely, e.g.:

  • Lower outside counsel document review fees by 20%
  • Lower hosting fees for e-discovery review platform vendors by 30%.
  • Identify cost-effective ways to get early looks at potential discovery from the very start of potential litigation.

One process I find useful is to select a case representative in complexity and scope to those ordinarily encountered by the client, and reprocess the same documents using alternative tool sets. Using actual case data has several advantages:

  • Proving scalability of alternative tool sets. Some tools look nice on small select demo data sets (does “Enron” sound familiar?) but don’t scale well for large collections.
  • Identifying “gotchas” in alternative tools. There can be idiosyncrasies in data sets that cause problems in some tools. Nothing identifies these problems like running actual client data.
  • Validating original technology. Search and analytics tools that perform similar functions may not produce the same results, e.g., some full text search software may have problems indexing specific document types. The audit provides a way to potentially identify weaknesses.

In the ideal world, there would be production notes detailing the tools that were used to achieve the original volume reduction, and the decisions that were made, and there would be bills from attorneys, review providers, hosting providers, and software providers. All that detail provides a baseline for comparison.

Audit Deliverables

The audit report should cover:

  • Alternative Techniques. What techniques and tools could have provided the same functionality in terms of eliminating irrelevant content and identifying relevant content, but at a lower cost? For example, social network analysis, key term logic testing, concept clustering, visual similarity, and other functions are available in a variety of software packages that can be provided without per-gigabyte processing fees.
  • Dollar Impact of Alternative Tools. Culling irrelevant content early in the process saves considerable money downstream, e.g., reduced hosting fees, and reduced attorney review time. The report can estimate those savings.
  • Recommended Training. What training should be provided to either make better use of existing tools or to use new tools?


The direct costs of the after-action audit needn’t be very large when the tools used for auditing are provided on a flat- or no-fee basis, i.e., not charged on a per-GB, per-user, or per-search basis. Most audits can be performed using low-cost cloud storage or existing consultant infrastructure.


Audits needn’t take a long time to complete. Large savings are usually quickly obvious, and useful, actionable data can be available within about a month.


Eating at the E-Discovery Diner: Buffet or à la Carte?

There is a difference of opinion about the best way for corporations to buy e-discovery services. One view could be characterized as the “single-throat-to-choke” approach, which focuses on accountability – the corporation wants a single party to take complete responsibility for everything from collection through production so there’s no question who’s at fault if anything goes wrong. The other view is a more à la carte approach, where the corporation buys services as needed from different providers.

My view is that the single-throat approach results in overpaying for e-discovery services. Corporations can obtain more cost-effective results by having consultants who specialize in using the most appropriate tools for collection and initial culling, and contracting separately for the final review of the reduced data set. The corporation can retain full accountability through clearly delineated responsibilities and hand-offs between the two providers.

The collection-initial culling vendor is responsible for gathering initial content and applying early analytics and other tools to the content while keeping detailed logs of what was done, what tools were used, and what culling decisions were made.

In the shared responsibility model, the client specifies the format of files that will be handed over to the final review vendor as well as the method and date of delivery. The final review vendor is then tasked with documenting the steps taken to further cull the collection as well as the delivery date and method of production, including generating privilege logs.

The single vendor approach is like eating every meal and taking every coffee break at an all-you-can-eat buffet. You overpay and consume too much of the wrong things. As a matter of fact, there are many early analytics tools that give early insight into potentially-responsive documents while avoiding the per GB cost model used by most integrated approach vendors. Every GB that is screened out before going to final review can save the corporate client hundreds of dollars per year.

Final review vendors typically have large staffs for help desk, technical support, consulting, and sales personnel, and have major investments in licenses, processing infrastructure, advertising, trade shows, and office space. All those expenses have to be covered to stay in business. By contrast, discovery boutiques specializing in collection and initial culling can be nimbler and offer different pricing models for delivering a comparable or an extended range of early discovery analytics tools.

Note that time and data security are major considerations when deciding what approach to take when contracting for e-discovery services. Many early analytics tools can be deployed and yield results in the time it takes to set up a final review platform, administer passwords, conduct training, and begin loading the initial data. Furthermore, from a data security standpoint, it is much better to screen out as much content as possible before putting content on a final review platform where dozens of people will have access to some or all of it.

Five Low-Cost, Attorney-Friendly Ways to Cull Email in E-Discovery

E-discovery is expensive with email and its attachments typically being the most prevalent data types. Here are five low-cost, low-tech, lawyer-friendly tools that can be used to cull emails prior to going to a final review platform. Final review platforms, while powerful, are expensive and, compared to these five low-cost tools, are time-consuming to load and administer. In addition to achieving the immediate goal of culling unresponsive content, this set of tools also familiarizes lawyers with the collection and makes substantial progress on finalizing the key term list that will be used for final production and shared with opposing counsel.

Five Tools to Cull Email in E-Discovery

Here’s how to use these five tools in the e-Discovery process: Once potential custodians have been identified, collect their data and identify an initial key terms list. Before sending content off for final review, search for the key terms in the collected content and provide electronic reports of the results for attorneys to review. Each of these five reports takes a different look at the result set, giving the attorneys reviewing them a different perspective. In our approach, attorneys can sort the reports several ways (e.g., by date, sender, or topic) and flag emails that can be safely excluded.

The process is highly iterative: as the attorneys gain more understanding of the documents and the terms, the searching and report viewing are easily repeated to refine results.

Here are the reports which are run after the emails are deduped:

  1. Low Reply Rate Emails. These are emails sent that received very few or no replies. Examples of emails that fit in this category are:
  • Internal e-mails from IT
  • Emails from automated senders
  • Spam
  • Mass marketing
  2. Large Distribution Emails. Emails that are sent to large numbers of recipients tend to be distribution lists for standardized reports or other recurring content. Identifying which of those can be eliminated can remove substantial volumes from consideration.
  3. Visually-Similar Email Payloads. When people repeatedly send attachments containing the same types of information to other people, those attachments tend to look alike, even if the key terms in them differ from attachment to attachment. Grouping visually-similar attachments and then tying the groupings back to the emails that attached them can reveal subsets of documents that are either clearly nonresponsive or responsive.
  4. Textually-Similar Emails. Grouping emails based on their textual similarities is another way to pull together items where decisions can often be made in bulk to include or exclude items from production.
  5. Key Terms List. Attorneys are provided with a word frequency list of the words occurring in the search results as a way of familiarizing them with terms they may not have considered using for searches. The word frequency list can also suggest whether terms would be useful in identifying subsets of the documents. For example, a word that occurs in all the documents won’t help in selecting a subset of the documents.
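The word frequency report can be sketched in a few lines. The document snippets below are invented; the point is the per-document count, which shows whether a term can actually select a subset of the collection.

```python
import re
from collections import Counter

# Invented document snippets for illustration only.
docs = [
    "the merger agreement was signed in march",
    "please review the merger terms before the call",
    "the lunch order for friday",
]

term_hits = Counter()   # total occurrences across all documents
doc_count = Counter()   # number of documents each term appears in
for text in docs:
    words = re.findall(r"[a-z]+", text.lower())
    term_hits.update(words)
    doc_count.update(set(words))

for term, n in doc_count.most_common(4):
    note = "  <- in every doc, no culling value" if n == len(docs) else ""
    print(f"{term:10s} docs={n} hits={term_hits[term]}{note}")
```

A term like “the” that appears in every document gets flagged as useless for subsetting, while a term like “merger” appears in only some documents and can carve out a reviewable slice.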

The advantages of using these five tools in the e-Discovery process:

  1. Low Cost. The searching and reporting can be conducted without incurring per gigabyte or per user fees. Any items excluded will not incur the large initial ingestion and monthly hosting fees for content placed in the final review platform.
  2. Low Tech. Attorneys are accustomed to reviewing reports, and no special training is required to browse the reports. There are no passwords or licenses to set up or administer.
  3. Quick. The tools that generate these reports can process large volumes of data in a short time. Lawyers can be reviewing results on a TB of data within two days.
  4. Highly Iterative. As lawyers gain insight into the content and how key terms are distributed across documents and custodians, they can refine the key term list and search logic to exclude plainly irrelevant content and identify responsive content.
  5. Complementary to Other Tools. The report toolset can be used to identify where further insight would be obtained by using other tools on the collection or subsets of the collection, e.g., to perform a concept clustering analysis or find linguistically-similar content.

Using tools like these described, the volume of emails sent for final review can often be reduced by well in excess of 90% before going to the final review platform.

Why Are In-House Early Case Analytics Important?

In-house early case analytics are important to the corporation because they have the potential to significantly impact the total cost of litigation. According to RAND, attorney review typically accounts for about 73 percent of all eDiscovery production costs. The simple rule of thumb is this: the fewer documents you send to outside counsel, the more you will save on litigation costs.
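A back-of-the-envelope version of that rule of thumb can be sketched as follows. Since attorney review dominates production costs, every document culled before review saves nearly its full review cost. The collection size and per-document rate below are assumptions for illustration only.

```python
# Hypothetical savings calculation; the rate and volume are assumptions.
COST_PER_DOC = 1.50        # assumed blended attorney review rate, $/document
docs_collected = 100_000   # hypothetical collection size

results = {}
for cull_rate in (0.0, 0.50, 0.90):
    to_review = round(docs_collected * (1 - cull_rate))
    results[cull_rate] = to_review * COST_PER_DOC
    print(f"cull {cull_rate:.0%}: {to_review:,} docs -> ${results[cull_rate]:,.0f} review cost")
```

Plugging in your own rates and volumes turns this into the same comparison the spreadsheet below performs.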


I’ve built an online spreadsheet in Office 365 that calculates the savings that can be achieved via in-house early case analytics. It compares Quantum’s In-House Early Case Analytics Model to the Typical eDiscovery Model. Feel free to modify it to suit your needs. Here is a snapshot of the spreadsheet:

Early Case Analytics Model


Every company will have slight differences in their discovery workflow, so I am glad to spend some time with you and the spreadsheet to see how your company would benefit from in-house early case analytics.

For a closer look at In-House Early Case Analytics, see our explanation article here.


What Are In-House Early Case Analytics?

In-house early analytics are discovery intelligence gathering and reporting mechanisms that help in-house counsel and outside counsel understand a corpus of potentially-relevant documents and e-mail.

In-house early case analytics gives counsel the ability to make well-informed decisions about what documents and e-mail are clearly non-relevant so that these files can be removed prior to transfer to outside counsel for traditional review.

Said more specifically, the purpose of in-house early analytics is to educate and inform counsel as to the nature, scope, and potential size of the document request. In many legal cases, outside counsel is oblivious to the size of the burden a discovery request places upon a company. In-house early analytics brings transparency to outside counsel so that they can refine the request. Meanwhile, in-house early case analytics informs managing counsel as to the actual costs of discovery – prior to ESI being sent out the door.

In-house early analytics come to counsel in the form of informational reports and visualizations, three of which I list here:

  • Key term hit reports by custodian (see example below)

    Custodian Analysis by Term

  • Visual charts and graphs (concept maps, conversation clusters, clusters of similar documents, etc)

    • Concept cluster maps (a visualization that clusters similar documents together)
    • Conversation cluster maps (a visualization that shows the e-mail communications)
    • Interactive screen share sessions where outside counsel is able to view a file share firsthand.
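As a sketch of the first report type, a key-term hit report by custodian is essentially a small cross-tabulation. The custodian names, terms, and hit counts below are invented.

```python
# Invented hit counts: (custodian, term) -> number of matching documents.
hits = {
    ("Smith, J.", "merger"): 412,
    ("Smith, J.", "forecast"): 88,
    ("Jones, K.", "merger"): 37,
    ("Jones, K.", "forecast"): 290,
}

custodians = sorted({c for c, _ in hits})
terms = sorted({t for _, t in hits})

# Print a custodian-by-term grid.
print("custodian".ljust(12) + "".join(t.rjust(10) for t in terms))
for c in custodians:
    cells = "".join(str(hits.get((c, t), 0)).rjust(10) for t in terms)
    print(c.ljust(12) + cells)
```

Even this simple grid shows counsel at a glance which custodians drive the volume for each term, which is exactly the information needed to refine the request.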


In my personal experience, when outside counsel is educated and informed via early in-house analytics, they will often then have sufficient information necessary to refine and further perfect their key terms list.  This refinement will often have a significant impact on the total number of documents that end up in traditional attorney review.

Using Analytics for Pre-Review Data Reduction

According to RAND, review typically accounts for about 73 percent of all eDiscovery production costs.

Technology has changed the way we work and live in virtually every other aspect of our lives.  So how can technology help reduce discovery production costs?

Quantum’s initial array of technologies work alongside Office 365 and other mail archiving environments, bringing fast indexing, complex iterative search capability and reporting to bear upon the reduction effort.   We are also able to (very inexpensively) pass reduced copies of original data along to the review stage of the eDiscovery process (see blog post: How Early Analytics Enable You to Count the Cost).

But even after applying traditional metadata filters and key search terms in an iterative fashion (which in our experience reduces the data by an average of 93-94%), a substantial number of non-relevant documents always seems to remain.

Pre-Review Data Reduction

The warning here is that once documents have been put into a review platform, eDiscovery costs immediately escalate.  Here are some of  the fees that kick in right away:

 – Review vendor hosting fees

 – Attorney review fees

 – Premium data storage fees

Purveyors of rigid, assembly-line approaches to eDiscovery that do not apply the technical expertise needed to defensibly reduce the data before putting documents into a review platform will eventually find themselves at a competitive disadvantage at the corporate level. Corporations typically operate on fixed budgets and are more likely to form relationships with vendors who can help them solve the costly problem of sending tens of thousands of non-relevant documents out for attorney review. Defensible, objective culling of non-relevant documents before they reach a review platform also smooths out the spikes in discovery costs for budget-driven corporations.

The technical challenge is applying objective, defensible methods so that the remaining non-relevant documents are significantly reduced before the documents are put into a review platform.

While there is no “silver bullet” technology that can achieve this in each and every case, we select from an array of technologies that can perform the following objective tasks quickly – before  the documents are put into a review platform (in conjunction with guidance from Counsel, of course):

 – Identify non-relevant clusters of documents

 – Identify non-relevant dates and time periods

 – Identify non-relevant senders

 – Identify non-relevant domains

 – Identify e-mail with visually-similar attachments

 – Identify e-mail that are To and/or From specific custodians
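One of those objective tasks, culling by sender domain, can be sketched as a simple filter. The addresses and the domain list below are invented; in practice the non-relevant list would come from counsel’s review of a sender-domain report.

```python
# Invented domain list and messages for illustration only.
NON_RELEVANT_DOMAINS = {"newsletter.example.com", "hr-alerts.example.com"}

emails = [
    {"id": 1, "sender": "ceo@acme.example.com"},
    {"id": 2, "sender": "digest@newsletter.example.com"},
    {"id": 3, "sender": "payroll@hr-alerts.example.com"},
]

def domain(addr):
    # Everything after the final "@", lower-cased for comparison.
    return addr.rsplit("@", 1)[-1].lower()

keep = [e for e in emails if domain(e["sender"]) not in NON_RELEVANT_DOMAINS]
culled = [e for e in emails if domain(e["sender"]) in NON_RELEVANT_DOMAINS]
print(f"kept {len(keep)}, culled {len(culled)} pending counsel sign-off")
```

The culled set is logged rather than deleted, so the decision remains documented and defensible.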