A Visual Technique to Assess the Quality of Datasets - Understanding the Structure and Detecting Errors and Missing Values in Open Data CSV Files

Paulo Da Silva Carvalho, Patrik Hitzelberger, Fatma Bouali, Gilles Venturini

Abstract

More and more information is being published on the Web, and large datasets covering many fields and sectors are being made available. Open Data (OD) plays an important role in this trend. Thanks to the volume and variety of the released datasets, OD has high societal and business potential. To realize this potential, reuse of the datasets (e.g. in internal business processes) becomes essential. However, reusing OD also requires being able to assess its quality. This paper demonstrates how Information Visualization can help with this task and presents the Stacktab chart, a new chart for analysing and assessing CSV files in order to understand their structure, identify the location of relevant information and detect possible problems in the datasets.



Paper Citation


in Harvard Style

Da Silva Carvalho P., Hitzelberger P., Bouali F. and Venturini G. (2015). A Visual Technique to Assess the Quality of Datasets - Understanding the Structure and Detecting Errors and Missing Values in Open Data CSV Files. In Proceedings of 4th International Conference on Data Management Technologies and Applications - Volume 1: DATA, ISBN 978-989-758-103-8, pages 134-141. DOI: 10.5220/0005496601340141


in Bibtex Style

@conference{data15,
author={Paulo Da Silva Carvalho and Patrik Hitzelberger and Fatma Bouali and Gilles Venturini},
title={A Visual Technique to Assess the Quality of Datasets - Understanding the Structure and Detecting Errors and Missing Values in Open Data CSV Files},
booktitle={Proceedings of 4th International Conference on Data Management Technologies and Applications - Volume 1: DATA},
year={2015},
pages={134-141},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005496601340141},
isbn={978-989-758-103-8},
}


in EndNote Style

TY - CONF
JO - Proceedings of 4th International Conference on Data Management Technologies and Applications - Volume 1: DATA
TI - A Visual Technique to Assess the Quality of Datasets - Understanding the Structure and Detecting Errors and Missing Values in Open Data CSV Files
SN - 978-989-758-103-8
AU - Da Silva Carvalho P.
AU - Hitzelberger P.
AU - Bouali F.
AU - Venturini G.
PY - 2015
SP - 134
EP - 141
DO - 10.5220/0005496601340141