Graph-partitioning entity resolution for resolving noisy product names in OCR scans of retail receipts
Abstract
In retail business intelligence, consistent and unambiguous product dimension information is critical. Ensuring this is challenging, especially when an organization does not have full control over the source of either transaction or master data. This lack of control arises when brands rely on data provided directly by consumers through images of receipts. Product name strings obtained by digitizing receipts often contain substitution, insertion, and deletion errors, which prevent product names from serving as a useful dimension for further analysis. This paper proposes a clustering-based approach that removes this noise by linking error-laden product names to their underlying SKUs. The problem can be modeled as an entity resolution problem in which each digitized product name is a reference to an underlying SKU entity. The entity resolution problem can in turn be modeled as a clique-partitioning problem, which can be solved in reasonable time with an agglomerative clustering heuristic. Results from clustering a synthetic data set show that the approach can successfully resolve product references to reveal coarse-grained (i.e., category, generic product) groupings. Future work includes implementing blocking strategies, optimizing the model parameters, and understanding the limits of the model for fine-grained (i.e., size variation) groupings.
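To make the modeling idea concrete, the following is a minimal sketch of how noisy product names might be grouped with an agglomerative heuristic. It is not the model evaluated in this paper. The similarity measure (normalized Levenshtein distance), the average-linkage merge rule, the threshold of 0.6, and the sample receipt strings are all assumptions chosen for illustration. If edge weights are taken to be similarity minus the threshold, merging two clusters whose average similarity is at least the threshold never decreases the total within-cluster weight, which is the quantity a clique-partitioning formulation would seek to maximize.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    substitutions, insertions, and deletions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def cluster_names(names, threshold=0.6):
    """Greedy average-linkage agglomerative clustering.

    Starts from singleton clusters and repeatedly merges the pair of
    clusters with the highest average pairwise similarity, stopping
    once no pair reaches the (assumed) threshold.
    """
    clusters = [[n] for n in names]

    def linkage(c1, c2):
        total = sum(similarity(x, y) for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        (i, j), best = max(
            (((i, j), linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        clusters[i].extend(clusters[j])  # i < j, so deleting j is safe
        del clusters[j]
    return clusters


# Hypothetical OCR readings of two underlying products.
names = [
    "ORGANIC MILK 1L", "0RGANIC M1LK 1L", "ORGANC MILK 1L",
    "WHEAT BREAD", "WHEAT BRAD", "WHFAT BREAD",
]
for group in cluster_names(names):
    print(group)
```

Each resulting cluster stands in for one resolved entity. This sketch compares every pair of clusters at every step, so it is only practical for small inputs; a blocking strategy of the kind mentioned under future work would reduce the number of comparisons.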