Data Cleaning for Machine Learning Systems: A Survey

Posted by Amit Rajan on Thursday, April 18, 2024

Data cleaning plays a pivotal role in ensuring the accuracy and reliability of machine learning (ML) systems. The goal of a data cleaning task is to improve dataset quality by identifying and rectifying errors, inconsistencies, and inaccuracies, so that subsequent data analysis and ML tasks are robust and effective. This survey examines existing data cleaning systems, focusing on three aspects: 1) integrity constraint violation detection, 2) identification and handling of outliers, missing values, anomalies, and adversarial examples, and 3) deduplication or entity matching techniques. Rather than giving a superficial overview of numerous methods, the survey covers representative approaches in depth, discussing their functionality and results. In doing so, it aims to provide a comprehensive understanding of the landscape of data cleaning techniques tailored for ML systems, helping researchers and practitioners select and implement appropriate solutions for their specific use cases.
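
To make the three problem classes concrete, here is a minimal Python sketch on a toy table. It is not taken from the survey or from any system it covers: the records, field names, and helper functions (`fd_violations`, `missing`, `mad_outliers`, `duplicate_groups`) are all hypothetical, and the techniques are deliberately simple stand-ins (a functional dependency check, a median-absolute-deviation rule with the commonly used 3.5 cutoff, and exact matching on a normalized key).

```python
from collections import defaultdict
from statistics import median

# Toy records; every field name and value here is made up for illustration.
records = [
    {"id": 1, "name": "Ann Lee",  "zip": "10001", "city": "New York", "age": 34},
    {"id": 2, "name": "ann lee ", "zip": "10001", "city": "New York", "age": 34},
    {"id": 3, "name": "Bob Roy",  "zip": "10001", "city": "Newark",   "age": 29},
    {"id": 4, "name": "Cy Diaz",  "zip": "60601", "city": "Chicago",  "age": None},
    {"id": 5, "name": "Dee Kim",  "zip": "60601", "city": "Chicago",  "age": 410},
]

def fd_violations(rows, lhs, rhs):
    """1) Integrity constraints: lhs values mapped to more than one rhs value,
    i.e. violations of the functional dependency lhs -> rhs."""
    seen = defaultdict(set)
    for r in rows:
        seen[r[lhs]].add(r[rhs])
    return {k: sorted(v) for k, v in seen.items() if len(v) > 1}

def missing(rows, field):
    """2a) Missing values: ids of rows where the field is None."""
    return [r["id"] for r in rows if r[field] is None]

def mad_outliers(rows, field, threshold=3.5):
    """2b) Outliers: ids whose modified z-score, based on the median absolute
    deviation (MAD), exceeds the threshold. Missing values are skipped."""
    vals = [r[field] for r in rows if r[field] is not None]
    med = median(vals)
    mad = median(abs(v - med) for v in vals)
    if mad == 0:
        return []  # no spread around the median, so the rule is undefined
    return [
        r["id"] for r in rows
        if r[field] is not None and 0.6745 * abs(r[field] - med) / mad > threshold
    ]

def duplicate_groups(rows, field):
    """3) Deduplication: group ids whose field matches after lowercasing
    and collapsing whitespace (a crude stand-in for entity matching)."""
    groups = defaultdict(list)
    for r in rows:
        key = " ".join(r[field].lower().split())
        groups[key].append(r["id"])
    return [ids for ids in groups.values() if len(ids) > 1]

print(fd_violations(records, "zip", "city"))
print(missing(records, "age"))
print(mad_outliers(records, "age"))
print(duplicate_groups(records, "name"))
```

On this toy table the checks flag zip 10001 as mapping to two cities, row 4 as missing an age, row 5 as an age outlier, and rows 1 and 2 as likely duplicates. Real systems of the kind the survey discusses typically replace each helper with something more robust, such as learned or probabilistic matching in place of the normalized-key comparison.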

Paper Link