Deduplication in a massive clinical note dataset

Bibliographic Details
Title:	Deduplication in a massive clinical note dataset
Authors:	Shenoy, Sanjeev; Kuo, Tsung-Ting; Gabriel, Rodney; McAuley, Julian; Hsu, Chun-Nan
Publication Year:	2017
Collection:	Computer Science
Subject Terms:	Computer Science - Databases, Computer Science - Information Retrieval
More Details:	Duplication, whether exact or partial, is a common issue in many datasets. In clinical notes data, duplication (and near duplication) can arise for many reasons, such as the pervasive use of templates, copy-pasting, or notes being generated by automated procedures. A key challenge in removing such near duplicates is the size of these datasets; our own dataset consists of more than 10 million notes. Detecting and correcting such duplicates requires algorithms that are both accurate and highly scalable. We describe a solution based on Minhashing with Locality Sensitive Hashing. In this paper, we present the theory behind this method and a database-inspired approach to make the method scalable. We also present a clustering technique using disjoint sets to produce dense clusters, which speeds up our algorithm. Comment: Extended from the Master project report of Sanjeev Shenoy, Department of Computer Science and Engineering, University of California, San Diego. June 2016
Document Type:	Working Paper
Access URL:	http://arxiv.org/abs/1704.05617
Accession Number:	edsarx.1704.05617
Database:	arXiv
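
Illustrative sketch (not taken from the paper): the abstract names three ingredients, Minhashing, Locality Sensitive Hashing, and clustering with disjoint sets. The Python below shows one common way these pieces can fit together on a small in-memory list of notes. Every function name, the word-shingle size k, the number of hash functions, and the band count are assumptions chosen for illustration; the authors' database-inspired, scalable implementation is not reproduced here.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=5):
    """Return the set of overlapping k-word shingles of a note (k is an assumed value)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, shingle):
    """Deterministic 64-bit hash of a shingle; the seed selects the hash function."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(shingle_set, num_hashes=64):
    """MinHash signature: the minimum hash value under each seeded hash function."""
    return tuple(min(_hash(seed, s) for s in shingle_set) for seed in range(num_hashes))


class DisjointSet:
    """Union-find with path halving; the smaller root index becomes the representative."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[max(ra, rb)] = min(ra, rb)


def cluster_near_duplicates(notes, num_hashes=64, bands=16, k=5):
    """Group note indices whose signatures agree on at least one LSH band."""
    rows = num_hashes // bands
    signatures = [minhash_signature(shingles(n, k), num_hashes) for n in notes]

    # LSH banding: notes that share an identical band slice land in the same bucket.
    buckets = defaultdict(list)
    for idx, sig in enumerate(signatures):
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].append(idx)

    # Merge every bucket into one disjoint-set component.
    ds = DisjointSet(len(notes))
    for members in buckets.values():
        for other in members[1:]:
            ds.union(members[0], other)

    clusters = defaultdict(list)
    for idx in range(len(notes)):
        clusters[ds.find(idx)].append(idx)
    return sorted(clusters.values())


if __name__ == "__main__":
    example_notes = [
        "patient denies chest pain and shortness of breath today",
        "patient denies chest pain and shortness of breath today",
        "follow up in two weeks for suture removal and wound check",
    ]
    print(cluster_near_duplicates(example_notes))
```

In this sketch, exact duplicates always share every band and therefore always fall into the same cluster. Near duplicates are grouped only with a probability that depends on their Jaccard similarity and on the assumed band and row settings, so a production pipeline would typically verify candidate pairs before merging them.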
