ated OGD (Open Government Data) portals to share
their data (e.g. (Data.gov, 2009), (UK Government,
2009), (data.gouv.fr, 2011)). However, looking at the
actual initiatives and platforms, OD is not without
issues.
For instance, several different OD policies exist,
meaning that the rules for opening data differ
from country to country, and sometimes even within
national borders. Australia, for example, developed its own
OD policy (Australian government, 2008), the main
idea of which is to create new public value, encouraging
the public to create and innovate. The United
Kingdom opted for another OD policy (UK Government,
2013) that places more emphasis on the role of citizens
in society and on promoting transparency. Europe
adopted another strategy (European Commission,
2011a) focused on the possible economic gains
of OD. Furthermore, some barriers concern the access to
and the publication of OD. Organizations still fear the
potential loss of control over their data and are
reluctant to open their datasets (Moore and Lopes,
2014).
Another important constraint on OD usage
is related to how the data is published and, directly
linked to this aspect, to doubts about OD
quality. When talking about OD, we focus not only
on datasets but also on the format used to
publish them, the accuracy of the data and so on. Another
major aspect to take into account is the metadata
used to describe these datasets in order to make them
searchable and findable. There is no common standard
used by all OD initiatives to build and publish
datasets. Often in the field of OD, no information
regarding data quality is provided, even in
cases where the quality and accuracy of the data entered
by the user in the dataset(s) is debatable (M. Janssen
and Zuiderwijk, 2012). OD datasets may be released
with inaccurate information, which
may be incomplete, unclear, incorrect or invalid.
Having access to OD files is important, but it is useless
if we are not able to read and process them
(Kitchin, 2014). Metadata, which is crucial for making
datasets searchable and findable, may or may not
be provided, however. Providing considerable metadata
will support and stimulate OD usage (A. Zuiderwijk
and Janssen, 2012).
3 CSV FORMAT AND TABULAR
DATA ANALYSIS
We have focused our work on a specific format: CSV
files. Our choice is based on the fact that CSV is
an open and machine-readable format and one
of the most widespread OD formats: in the Netherlands, a
study of the OD policy of seven countries (Zuiderwijk
and Janssen, 2014) has shown that standard formats,
and in particular CSV, are used most of the time. In
2014, a benchmark proposal regarding OD available
in the United States OD portal was presented
(Hoffman and Grinstein, 2012). This study
concluded that most of the OD datasets were
available as CSV, XLS and PDF files (N. Veljković
and Stoimenov, 2014). Finally, a recent study regarding
the OD policies applied in five different countries
(United States, United Kingdom, Netherlands, Kenya
and Indonesia) has confirmed that CSV is used in all
of these countries except Indonesia, where datasets
are only available as PDF files (Nugroho, 2013). The
simplicity of CSV, however, comes with a trade-off: the
semantic and syntactic interpretation of CSV files can
be difficult. Getting an overview of the structure
and/or the content of a CSV file is only weakly supported,
and the means are not standardized. Understanding
a short CSV file is normally simple.
The same is not always true as the size of the CSV
file grows: the number of columns and rows can be
very large, making the file more difficult to understand.
Figure 1: Simple CSV file.
Figure 2: More complex CSV file.
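To make the difficulty of getting an overview more concrete, the following is a minimal sketch of how the structure of a CSV file could be summarized programmatically. It is not the technique proposed in this paper; the function names (infer_type, summarize_csv), the type categories and the sample data are assumptions chosen purely for illustration, and only the Python standard library is used.

```python
import csv
import io
from collections import Counter


def infer_type(cell):
    """Classify one cell as 'missing', 'integer', 'float' or 'text'.

    The categories are an illustrative assumption, not a standard.
    """
    value = cell.strip()
    if value == "":
        return "missing"
    try:
        int(value)
        return "integer"
    except ValueError:
        pass
    try:
        float(value)
        return "float"
    except ValueError:
        return "text"


def summarize_csv(text, delimiter=","):
    """Return (header, per-column type counts, line numbers of ragged rows).

    A row is 'ragged' when its number of fields differs from the header's;
    such rows are reported but not counted in the per-column statistics.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    counts = [Counter() for _ in header]
    ragged = []
    for row in reader:
        if len(row) != len(header):
            ragged.append(reader.line_num)
            continue
        for counter, cell in zip(counts, row):
            counter[infer_type(cell)] += 1
    return header, counts, ragged


# Hypothetical sample: one missing value and one row with too few fields.
sample = "name,count,ratio\nalpha,10,0.5\nbeta,,1.25\ngamma,n/a\n"
header, counts, ragged = summarize_csv(sample)
for name, counter in zip(header, counts):
    print(name, dict(counter))
print("rows with unexpected field count:", ragged)
```

Even such a simple summary exposes missing cells, mixed data types within a column and structurally broken rows, which are hard to spot by eye once a file has many rows and columns.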
Methods to analyse tabular data already exist.
For instance, Table Lens is a technique to visualize
and understand the meaning of large tables using a
fisheye approach. The fisheye methodology
is based on a visual distortion in which the centre
of the visual perception is zoomed in while the
other displayed regions are zoomed out (Sundararajan
et al., 2011). This property makes Table Lens more
appropriate for the analysis of precise and small regions
of a table. Tableplot Graphics (W. A. Malik
and Gribov, 2010) is used to represent graphically
the cell values of a tabular dataset. It does not analyse
or show the type of the data analysed. Another related
work on this subject is Sopan et al., Exploring
Distributions - Design and Evaluation (A. Sopan,
M. Freire, M. TaiebMaimon, J. Golbeck, B. Shneiderman
and Ben. Shneiderman, 2010). However, in
this work, data types were not taken into account ei-