Data Cleaning for Machine Learning Systems: A Survey - Amit Rajan Blog

Data cleaning plays a pivotal role in ensuring the accuracy and reliability of machine learning (ML) systems. The goal of a data cleaning task is to improve dataset quality by identifying and rectifying errors, inconsistencies, and inaccuracies, so that subsequent data analysis and ML tasks are robust and effective. This survey examines existing data cleaning systems, focusing on three aspects: 1) integrity constraint violation detection, 2) identification and handling of outliers, missing values, anomalies, and adversarial examples, and 3) deduplication, or entity matching, techniques. Rather than providing a superficial overview of numerous methods, the survey delves into representative approaches, discussing their functionality and results in depth. It aims to give a comprehensive understanding of the landscape of data cleaning techniques tailored for ML systems, helping researchers and practitioners select and implement appropriate solutions for their specific use cases.
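To make the three focus areas concrete, here is a minimal sketch in plain Python. It is not taken from the survey or from any system it covers. The toy records, the field names, the function names, and the choice of a functional-dependency check, a median-absolute-deviation rule, and a normalized-key match are all assumptions made purely for illustration.

```python
from statistics import median

# Hypothetical toy records; every field name and value here is an assumption.
records = [
    {"id": 1, "name": "Ann Lee", "zip": "10001", "city": "New York", "age": 34},
    {"id": 2, "name": "Bob Roy", "zip": "10001", "city": "Newark", "age": 29},
    {"id": 3, "name": "Cara Diaz", "zip": "94105", "city": "San Francisco", "age": 31},
    {"id": 4, "name": "Dev Patel", "zip": "94105", "city": "San Francisco", "age": 410},
    {"id": 5, "name": " cara  DIAZ", "zip": "94105", "city": "San Francisco", "age": 31},
]


def fd_violations(rows, lhs, rhs):
    """Integrity constraint check: find lhs values that map to more than
    one rhs value, i.e. violations of the functional dependency lhs -> rhs."""
    seen = {}
    for row in rows:
        seen.setdefault(row[lhs], set()).add(row[rhs])
    return {key: values for key, values in seen.items() if len(values) > 1}


def mad_outliers(rows, field, threshold=3.5):
    """Outlier check: flag ids whose modified z-score, based on the median
    absolute deviation (MAD), exceeds the threshold."""
    values = [row[field] for row in rows]
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread to measure against; flag nothing
    return [
        row["id"]
        for row in rows
        if 0.6745 * abs(row[field] - med) / mad > threshold
    ]


def duplicate_groups(rows):
    """Deduplication: group ids whose normalized (name, zip) keys match.
    Normalization lowercases the name and collapses whitespace."""
    groups = {}
    for row in rows:
        key = (" ".join(row["name"].lower().split()), row["zip"])
        groups.setdefault(key, []).append(row["id"])
    return [ids for ids in groups.values() if len(ids) > 1]


print(fd_violations(records, "zip", "city"))
print(mad_outliers(records, "age"))
print(duplicate_groups(records))
```

On this toy data, the dependency check reports that zip 10001 maps to two different cities, the MAD rule flags record 4 for its implausible age, and the key match groups records 3 and 5 as the same entity. Exact-key matching is only the simplest form of entity matching; real systems usually need fuzzier comparisons.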

Paper Link
