case is the introduction of confidence values and the
comparison of the obtained results with similar approaches
by performing the described experiments on the datasets of
the WDC Gold Standards for product matching (Petrovski
et al., 2017). Future work will apply word embeddings
(Pennington et al., 2014) and character embeddings, similar
to Ristoski et al. (2016), to the problem of product
resolution and combine these approaches with the
preprocessing and filtering methods described in this paper.
ACKNOWLEDGEMENTS
This work has been made possible by the Eurostars
project E!10138 ReProsis "Big Data Product Analysis
in Real Time – Product Management System for
International Markets", sub-project "Intelligent Product
Data Extraction and Product Resolution", funded by the
German Bundesministerium für Bildung und Forschung
(BMBF) under the grant number 01QE1632B.
REFERENCES
Auger, A. and Hansen, N. (2005). A Restart CMA Evolution
Strategy With Increasing Population Size. In
2005 IEEE Congress on Evolutionary Computation,
volume 2, pages 1769–1776, Edinburgh, Scotland,
UK. IEEE.
Breiman, L. (2001). Random Forests. Machine Learning,
45(1):5–32.
de Bakker, M., Frasincar, F., and Vandic, D. (2013). A
hybrid model words-driven approach for web product
duplicate detection. In Advanced Information Systems
Engineering, volume 7908 of Lecture Notes in Computer
Science, pages 149–161. Springer, Berlin, Heidelberg.
Elmagarmid, A. K., Ipeirotis, P. G., and Verykios, V. S.
(2007). Duplicate Record Detection: A Survey. IEEE
Transactions on Knowledge and Data Engineering,
19(1):1–16.
Gopalakrishnan, V., Iyengar, S. P., Madaan, A., Rastogi, R.,
and Sengamedu, S. (2012). Matching product titles
using web-based enrichment. In Proceedings of the
21st ACM International Conference on Information
and Knowledge Management, pages 605–614, Maui,
Hawaii, USA. ACM.
Horch, A., Kett, H., and Weisbecker, A. (2015). Extracting
Product Unit Attributes from Product Offers by Using
an Ontology. In Proceedings of The Second International
Conference on Computer Science, Computer
Engineering, & Social Media, Lodz, Poland. IEEE.
Köpcke, H. (2014). Object Matching on Real-World Problems.
Dissertation, Universität Leipzig, Leipzig.
Londhe, N., Gopalakrishnan, V., Zhang, A., Ngo, H. Q.,
and Srihari, R. (2014). Matching titles with cross title
web-search enrichment and community detection.
Proceedings of the VLDB Endowment, 7(12):1167–1178.
Pennington, J., Socher, R., and Manning, C. (2014). GloVe:
Global Vectors for Word Representation. In Proceedings
of the 2014 Conference on Empirical Methods
in Natural Language Processing, pages 1532–1543,
Doha, Qatar. Association for Computational Linguistics.
Petrovski, P., Bryl, V., and Bizer, C. Learning Regular
Expressions for the Extraction of Product Attributes from
E-commerce Microdata. page 10.
Petrovski, P., Bryl, V., and Bizer, C. (2014). Integrating
product data from websites offering microdata
markup. In Proceedings of the 23rd International
Conference on World Wide Web - WWW ’14 Companion,
pages 1299–1304, Seoul, Korea. ACM Press.
Petrovski, P., Primpeli, A., Meusel, R., and Bizer, C. (2017).
The WDC Gold Standards for Product Feature Extraction
and Product Matching. In Bridge, D. and Stuckenschmidt,
H., editors, E-Commerce and Web Technologies,
volume 278, pages 73–86. Springer International
Publishing, Cham.
Powell, M. J. D. (2009). The BOBYQA algorithm for
bound constrained optimization without derivatives.
Technical Report, University of Cambridge,
Cambridge, UK.
Ristoski, P. and Mika, P. (2016). Enriching Product Ads
with Metadata from HTML Annotations. In Proceedings
of the 13th International Conference on The
Semantic Web. Latest Advances and New Domains
- Volume 9678, pages 151–167, Berlin, Heidelberg.
Springer-Verlag.
Ristoski, P., Petrovski, P., Mika, P., and Paulheim, H.
(2016). A machine learning approach for product
matching and categorization: Use case: Enriching
product ads with semantic structured data. Semantic
Web, 9(5):707–728.
Shah, K., Kopru, S., and Ruvini, J. D. (2018). Neural
Network based Extreme Classification and Similarity
Models for Product Matching. In Proceedings of the
2018 Conference of the North American Chapter of
the Association for Computational Linguistics: Human
Language Technologies, Volume 3 (Industry Papers),
pages 8–15, New Orleans, Louisiana. Association
for Computational Linguistics.
van Bezu, R., Borst, S., Rijkse, R., Verhagen, J., Vandic, D.,
and Frasincar, F. (2015). Multi-component similarity
method for web product duplicate detection. In Proceedings
of the 30th Annual ACM Symposium on Applied
Computing, pages 761–768, Salamanca, Spain.
ACM Press.
Vandic, D., Van Dam, J.-W., and Frasincar, F. (2012).
Faceted product search powered by the Semantic Web.
Decision Support Systems, 53(3):425–437.
Wilhelmstötter, F. (2018). Jenetics - Library User’s Manual
4.3. Technical Report, Vienna, Austria.
Applying Heuristic and Machine Learning Strategies to Product Resolution