Figure 7: A visual analysis of a portion of the matching
scores, showing fading lines and uniform streaks.
case, the returned results were so few that the system
failed to retrieve relevant information.
The prototype for investigating businesses retrieves
official data from the Italian Business Registry, while
unofficial data is gathered through the Google Places
API for the whole province of Lucca, Italy. Users
involved in the preliminary tests reported being
satisfied with both the interface and the results the
system shows.
An analysis of the raw data produced by the matching
score computation is ongoing. Figure 7 shows a
fraction of the matching scores. Each row of pixels
represents a commercial activity retrieved from
Google. Each cell of the row is a color-coded matching
score describing the similarity between that activity
and a registered one. We assign colors to the scores
using a continuous color scale, light for high scores
and dark for low ones. Within each row we order the
scores from highest to lowest, so fading lines of color
become visible. Fading lines are expected, since they
are due to the intrinsic uncertainty of the system.
Streaks of the same color across a large number of
cells mean instead that exactly the same matching
score was computed for a large number of pairs. This
suggests that the matching score can be made fuzzier
in order to become less ambiguous.
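The row ordering, color mapping and streak detection described above can be sketched as follows. This is only a minimal illustration, not the code used to produce Figure 7: the function names, the toy data and the assumption that scores lie in [0, 1] are ours.

```python
from itertools import groupby

def sort_rows_descending(scores):
    """Order each row from highest to lowest score, as done for the figure."""
    return [sorted(row, reverse=True) for row in scores]

def to_gray(score, lo=0.0, hi=1.0):
    """Map a score to a gray level in 0..255: light for high, dark for low.
    Assumes scores fall in [lo, hi]; values outside are clamped."""
    if hi == lo:
        return 0
    t = min(1.0, max(0.0, (score - lo) / (hi - lo)))
    return round(255 * t)

def longest_streak(sorted_row):
    """Return (score, length) of the longest run of identical scores."""
    best_score, best_len = None, 0
    for score, run in groupby(sorted_row):
        n = sum(1 for _ in run)
        if n > best_len:
            best_score, best_len = score, n
    return best_score, best_len

# Hypothetical toy data: 2 Google-derived activities x 6 registry candidates.
scores = [
    [0.9, 0.2, 0.7, 0.5, 0.4, 0.1],  # becomes a fading line once sorted
    [0.3, 0.3, 0.8, 0.3, 0.3, 0.3],  # contains a long streak of tied scores
]
rows = sort_rows_descending(scores)
pixels = [[to_gray(s) for s in row] for row in rows]
streaks = [longest_streak(row) for row in rows]
```

Rows whose longest streak covers most of the cells are those in which many candidate pairs received exactly the same score, and are therefore the natural starting point for refining the score.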
To estimate the reliability of the matching score
algorithm, we randomly selected a set of 1,000
establishments from those extracted from Google Places,
of which 606 were manually annotated as having a sure
match in the Business Registry. We then ran the
algorithm on the same 1,000 entries and found that the
correct match appears among the top 10, top 3 and top 1
results in 83%, 78% and 62% of cases, respectively.
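A top-k evaluation of this kind can be sketched as below. The identifiers and the toy data are hypothetical. The sketch also assumes that the rate is computed over the manually annotated entries only, which the text does not state explicitly.

```python
def top_k_hit_rate(rankings, truth, k):
    """Fraction of annotated entries whose true registry match appears
    among the first k ranked candidates.

    rankings: dict mapping an establishment id to its ranked candidate ids
    truth:    dict mapping an annotated establishment id to its true match
    """
    if not truth:
        return 0.0
    hits = sum(
        1 for entry, true_id in truth.items()
        if true_id in rankings.get(entry, [])[:k]
    )
    return hits / len(truth)

# Hypothetical toy data (ids are invented for illustration).
rankings = {
    "bar_roma": ["R17", "R03", "R42"],
    "pizzeria_da_luca": ["R08", "R21", "R05"],
    "ferramenta_rossi": ["R30", "R31", "R99"],
    "hotel_x": ["R11", "R12", "R13"],  # not annotated, so ignored
}
truth = {
    "bar_roma": "R17",           # found at rank 1
    "pizzeria_da_luca": "R05",   # found at rank 3
    "ferramenta_rossi": "R77",   # never returned
}
rates = {k: top_k_hit_rate(rankings, truth, k) for k in (1, 3, 10)}
```

By construction the rate cannot decrease as k grows, which matches the ordering of the figures reported above for the top 1, top 3 and top 10 results.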
4 CONCLUSION
In this article, we presented an ongoing project
consisting of the design and development of an
investigation platform that supports tax inspectors in
their tax-evasion inquiries. The prototypes can be
improved in several ways. More sophisticated NLP
techniques may be adopted to obtain more accurate
results in the entity extraction phase. Machine
learning clustering algorithms may be tested to
mitigate the problem of homonymous people on the Web.
The fuzzy calculation introduced in Section 2.2 may be
changed by modifying the logic formula and even by
adding new fuzzy variables. The preliminary tests are
promising and show that the use of OSINF in the
investigation of tax evaders can be effective.
Nevertheless, a large-scale testing phase involving
users is essential for validating and refining the
overall platform.
ACKNOWLEDGEMENTS
We would like to thank the municipality of Fabbriche
di Vallico and ANCI Toscana for funding this work, and
Andrea D’Errico, Sergio Bianchi and Alessandro Prosperi
for their contribution to the project.
ASIA - An Investigation Platform for Exploiting Open Source Information in the Fight Against Tax Evasion