Automating Data Quality Management

Theodoros Rekatsinas / University of Wisconsin-Madison


Abstract: Data quality management is a bottleneck in modern analytics because high-effort tasks such as data validation and cleaning are essential to obtaining accurate results. This talk describes how to use machine learning to automate routine data quality management tasks. I will first introduce Probabilistic Unclean Databases (PUDs), a formal probabilistic framework for describing the quality of structured data, and demonstrate how data validation and cleaning correspond to learning and inference problems over structured data distributions. I will then show how PUDs form the basis of HoloClean, a state-of-the-art ML-based solution for automating data quality management for structured data. Finally, I will close with a discussion of lessons learned from HoloClean, with particular emphasis on when accurate, automated data cleaning is feasible.

Bio: Theodoros (Theo) Rekatsinas is an Assistant Professor in the Department of Computer Sciences at the University of Wisconsin-Madison, where he is a member of the Database Group. He is also a co-founder of inductiv, a startup focusing on automating data quality ops for analytical pipelines. Theo earned his Ph.D. in Computer Science from the University of Maryland and was a Moore Data Postdoctoral Fellow at Stanford University. His research interests are in data management, with a focus on data integration, data cleaning, and uncertain data. Theo's work has been recognized with an Amazon Research Award in 2017, a Best Paper Award at SDM 2015, and the Larry S. Davis Doctoral Dissertation Award in 2015.