Text Mining for Quality Control of Court Documents

Abstract

Attorneys across the United States use government-provided electronic databases to submit docket entries and associated case files for processing and archival in public judicial records. Data entry errors in these repositories, while rare, can disrupt the court process, confuse the public record, or breach privacy and confidentiality. Docket quality assurance is thus a high priority for the courts, but manual review remains resource-intensive. We have developed a prototype application of text mining and human language technologies to partially automate quality assurance review of electronic court documents. This solution uses document classification and named entity recognition to extract metadata directly from documents. A discrepancy between the extracted metadata and the user-provided metadata indicates a possible data entry error. On two independent samples of publicly available court documents, we find that for a small number of classes with a sufficient number of training documents, the document class can be predicted automatically with greater than 94% accuracy on one sample, but only 81% on the other. Our attempts to extract case numbers and the names of parties from documents via a conditional random field model were less successful. Future work with more extensive training data is necessary to evaluate both applications more accurately.
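The discrepancy check described in the abstract can be illustrated with a minimal sketch. It assumes the extracted metadata has already been produced by some classifier and entity extractor; the field names (`document_class`, `case_number`), the `normalize` helper, and the sample values are hypothetical and are not taken from the prototype itself.

```python
import re


def normalize(value: str) -> str:
    """Lowercase and collapse punctuation and whitespace, so that purely
    cosmetic differences are not reported as discrepancies."""
    return re.sub(r"[^a-z0-9]+", " ", value.lower()).strip()


def find_discrepancies(extracted: dict, submitted: dict) -> dict:
    """Return the fields where the extracted value disagrees with the
    user-provided value, mapped to (submitted, extracted) pairs.

    Fields missing from the submitted metadata are skipped: nothing was
    entered, so there is nothing to contradict.
    """
    flags = {}
    for field, extracted_value in extracted.items():
        submitted_value = submitted.get(field)
        if submitted_value is None:
            continue
        if normalize(extracted_value) != normalize(submitted_value):
            flags[field] = (submitted_value, extracted_value)
    return flags


# Hypothetical example: the filer chose "Motion", but the (assumed)
# classifier output says the document is an "Order".
submitted = {"document_class": "Motion", "case_number": "1:13-cv-00042"}
extracted = {"document_class": "Order", "case_number": "1:13-CV-00042"}

flags = find_discrepancies(extracted, submitted)
for field, (entered, found) in flags.items():
    print(f"Review {field}: entered {entered!r}, document suggests {found!r}")
```

A flagged field here is only a prompt for human review, not a confirmed error, since the extraction step itself may be wrong.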

Publication
In DocEng 2014 Workshop on the Semantic Analysis of Documents