Automation of Data Cleansing Methods for Covid19 Contact Tracing Data in the Philippines

Date of Award

12-2021

Document Type

Thesis

Degree Name

Master of Science in Computer Science

First Advisor

Ma. Regina Justina E. Estuar, PhD

Abstract

Data has become important in helping government and healthcare organizations create effective responses to mitigate the spread of the COVID19 virus. Using data as the basis for decision making leads to better and more grounded policies and response implementations. However, data quality is often overlooked during data collection, even with data handling guidelines in place, because of the immense scale of the data collected and unprepared eHealth systems. Data cleaning is the most crucial and the most time-consuming part of data mining. Dirty data and the time spent on data cleaning affect the performance of models and delay the results needed for decision making. This study developed data cleaning scripts to clean and improve the quality of the COVID19 data without consuming too much time. A data cleaning process framework was designed and used in developing the data cleaning scripts, analyzing and identifying data quality issues, and defining the transformation workflows. Data quality issues in validity, consistency, completeness, and uniqueness were found in the COVID19 data during data analysis, and the challenges and causes of these issues were identified to define the data transformation workflows. Validation scripts were also developed to validate the data before and after cleaning and to measure the improvement in data quality. The overall data quality of the COVID19 data is 90.91%, where the data is 95.94% valid, 99.59% consistent, and 71.13% complete, with 3.02% duplicates.
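
The four quality dimensions named in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's validation script. The record fields, the toy data, and the individual rules (age range, onset date not after confirmation date, exact-match duplicates) are hypothetical assumptions chosen only for illustration. The `overall` function is also an assumption: the abstract does not state how the overall score is computed, but an unweighted mean of validity, consistency, completeness, and uniqueness (100% minus the duplicate rate) reproduces the reported 90.91% from the reported component figures.

```python
from datetime import date

# Hypothetical toy records; field names are illustrative, not from the thesis.
RECORDS = [
    {"case_id": "C1", "age": 34, "onset": date(2021, 3, 1), "confirmed": date(2021, 3, 4)},
    {"case_id": "C2", "age": -5, "onset": date(2021, 3, 2), "confirmed": date(2021, 3, 5)},
    {"case_id": "C3", "age": 51, "onset": None, "confirmed": date(2021, 3, 6)},
    {"case_id": "C4", "age": 27, "onset": date(2021, 3, 9), "confirmed": date(2021, 3, 7)},
    {"case_id": "C1", "age": 34, "onset": date(2021, 3, 1), "confirmed": date(2021, 3, 4)},
]
FIELDS = ("case_id", "age", "onset", "confirmed")


def pct(passed, total):
    """Percentage of passing items; an empty check counts as fully passing."""
    return 100.0 * passed / total if total else 100.0


def completeness(records, fields):
    """Share of cells that are not missing (None)."""
    cells = [r.get(f) for r in records for f in fields]
    return pct(sum(c is not None for c in cells), len(cells))


def validity(records):
    """Assumed rule: a recorded age must lie in the range 0..120."""
    checked = [r for r in records if r.get("age") is not None]
    return pct(sum(0 <= r["age"] <= 120 for r in checked), len(checked))


def consistency(records):
    """Assumed rule: symptom onset must not come after confirmation."""
    checked = [r for r in records if r.get("onset") and r.get("confirmed")]
    return pct(sum(r["onset"] <= r["confirmed"] for r in checked), len(checked))


def duplicate_rate(records, fields):
    """Share of records that exactly repeat an earlier record."""
    seen, dupes = set(), 0
    for r in records:
        key = tuple(r.get(f) for f in fields)
        if key in seen:
            dupes += 1
        else:
            seen.add(key)
    return pct(dupes, len(records))


def overall(valid, consistent, complete, duplicates):
    """Assumed aggregate: unweighted mean of the four dimension scores."""
    return (valid + consistent + complete + (100.0 - duplicates)) / 4


scores = {
    "valid": validity(RECORDS),
    "consistent": consistency(RECORDS),
    "complete": completeness(RECORDS, FIELDS),
    "duplicates": duplicate_rate(RECORDS, FIELDS),
}
toy_overall = overall(**scores)

# Arithmetic check against the component figures reported in the abstract.
reported_overall = round(overall(95.94, 99.59, 71.13, 3.02), 2)
```

Running the same functions before and after a cleaning step gives a simple before-and-after comparison of each dimension, which is the kind of measurement the abstract describes.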
