Quantitative Methods and Information Technology Faculty Publications

Graph-Partitioning Entity Resolution for Resolving Noisy Product Names in OCR Scans of Retail Receipts

Jose Ramon Ilagan, Ateneo de Manila University
Joseph Benjamin R. Ilagan, Ateneo de Manila University

Document Type

Conference Proceeding

Publication Date

1-1-2024

Abstract

In business intelligence for retail, it is critical to ensure consistent and unambiguous product dimension information. This is challenging, especially if an organization does not have full control over the source of either transaction or master data. Such lack of control is the case when brands rely on data provided directly by consumers through images of receipts. Product name strings obtained from the digitization of receipts often contain substitution, insertion, and deletion errors. These errors prevent product names from serving as a useful dimension for further analysis. This paper proposes a clustering-based approach to link error-laden product names to underlying SKUs to remove this noise. The problem can be modeled as an entity resolution problem: each digitized product name is a reference to an underlying entity SKU. The entity resolution problem can further be modeled as a clique-partitioning problem that can be solved in a reasonable time with an agglomerative clustering heuristic. The results of clustering a synthetic data set show that the approach can successfully resolve product references to reveal coarse-grained (i.e., category, generic product) groupings. Future work may be done on implementing blocking strategies, optimizing the model parameters, and understanding the limits of the model for fine-grained (i.e., size variation) groupings.
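To illustrate the idea in the abstract, the following is a minimal sketch of threshold-based agglomerative clustering over noisy product name strings. It is not the authors' implementation: the normalized Levenshtein similarity, the average-linkage merge rule, the 0.6 threshold, and all function names are assumptions chosen for illustration, and the sample strings are invented rather than taken from the paper's data set.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting substitutions, insertions, and deletions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def average_similarity(c1, c2) -> float:
    """Mean pairwise similarity between two clusters (average linkage)."""
    return sum(similarity(x, y) for x in c1 for y in c2) / (len(c1) * len(c2))


def cluster_names(names, threshold=0.6):
    """Greedy agglomerative heuristic (assumed variant, not from the paper).

    Starts with one cluster per name, repeatedly merges the pair of clusters
    with the highest average similarity, and stops once the best available
    merge falls below `threshold`.
    """
    clusters = [[name] for name in names]
    while len(clusters) > 1:
        (i, j), best = max(
            (((i, j), average_similarity(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda pair_score: pair_score[1],
        )
        if best < threshold:
            break
        # i < j always holds, so deleting index j leaves index i valid.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical noisy references to two underlying products.
names = [
    "COCA COLA 1.5L", "C0CA COLA 1.5L", "COCA C0LA 15L",
    "LUCKY ME PANCIT", "LUCKY NE PANCIT", "LUCKY ME PANC1T",
]
clusters = cluster_names(names)
```

Each resulting cluster stands for one candidate entity. In this sketch the threshold governs how coarse the groupings are, which is one way to picture the coarse-grained versus fine-grained distinction the abstract draws.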

Recommended Citation

Ilagan, J. R., & Ilagan, J. B. (2024). Graph-partitioning entity resolution for resolving noisy product names in OCR scans of retail receipts. Procedia Computer Science, 239, 338–345. https://doi.org/10.1016/j.procs.2024.06.180

Included in

Business Commons, Computer Sciences Commons
