although we do not exclude its application to private
data.
Apart from the key functionalities that the EIDER
system provides, another important aspect worth
mentioning here is data freshness. Since the
information extracted by the EIDER system is Open
Data, freshness is crucial to maintaining data
quality. EIDER employs a patented method
(EP14176955.4) to retrieve data at system runtime,
thus ensuring that information in the EIDER system is
as fresh as possible.
4.2 Virtual Graphs and EIDER Model
We see two layers of data reconciliation in the
underlying system:
• EIDER System initial entity reconciliation
• Virtual Graphs enhanced entity reconciliation
The initial entity reconciliation is mainly focused on
text mining technology; its main purpose is to allow
accurate extraction of external entity identifiers
from multiple Open Data sources. It is a
pre-reconciliation step that makes the Big and Open
Data ready for integration.
At the end of the EIDER system processing chain,
Big and Open Data are transformed into a common
analyzable format, namely Linked Data. These data are
then stored as graphs. To further strengthen the
entity reconciliation function of the EIDER system,
we employ the Virtual Graphs technology as a
post-reconciliation step that can apply graph
algorithms to solve problems in the graph space.
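The transformation into a graph-shaped common format can be sketched as follows. This is only a minimal illustration and not the EIDER implementation: the source names, record fields, and the helper functions `to_triples` and `build_graph` are hypothetical, and plain Python tuples stand in for Linked Data triples.

```python
from collections import defaultdict

# Hypothetical helper: turn one flat record into
# (subject, predicate, object) triples.
def to_triples(source, record_id, record):
    subject = f"{source}:{record_id}"
    return {(subject, key, value) for key, value in record.items()}

# Hypothetical helper: index the triples by subject so that the
# result can be traversed like a graph.
def build_graph(triples):
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return graph

# Illustrative records from two made-up sources in different formats.
triples = set()
triples |= to_triples("sourceA", "001", {"name": "Example Corp", "country": "GB"})
triples |= to_triples("sourceB", "X9", {"label": "Example Corp", "ticker": "EXC"})

graph = build_graph(triples)
```

Once every source is expressed in the same subject-predicate-object shape, records that originally had different schemas can be stored and queried together.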
It is worth noting that we have conducted a
systematic literature review, and the results show that
there are currently no methodologies, tools, or
proposals that use Virtual Graphs to solve this kind
of problem.
To further explain our solution, we illustrate our
theory in the following figure:
Figure 3: Virtual Graphs and EIDER Model.
In this diagram, Big and Open Data from
heterogeneous sources in different data formats are
extracted (for example, DBpedia (Auer et al., 2007),
Yahoo Finance, GMEI Utility, etc.); these data are
then integrated into a graph storage, in which the
variety of entity identifications is managed by the
Virtual Graphs. Each node of the graph represents a
unique identifier from an external data source, and
the edges are the relationships between the nodes.
Thanks to the dynamic nature of the Virtual Graphs,
we are able to maintain an entity reconciliation
system that is capable of adding new entity identities
or removing old ones at any time, without breaking the
integrity of the whole data structure in the system.
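The idea of a graph whose nodes are external identifiers and whose edges are relationships between them can be sketched as below. This is a minimal sketch under stated assumptions, not the Virtual Graphs technology itself: the class `IdentifierGraph`, its methods, and the identifier strings are all hypothetical, and an edge is assumed to mean "refers to the same entity".

```python
class IdentifierGraph:
    """Undirected graph of identifiers (hypothetical illustration)."""

    def __init__(self):
        self._adj = {}

    def add_identifier(self, ident):
        # Adding a node never disturbs existing nodes or edges.
        self._adj.setdefault(ident, set())

    def link(self, a, b):
        # Assumption: an edge means both identifiers denote one entity.
        self.add_identifier(a)
        self.add_identifier(b)
        self._adj[a].add(b)
        self._adj[b].add(a)

    def remove_identifier(self, ident):
        # Drop the node and every edge that touches it.
        for neighbour in self._adj.pop(ident, set()):
            self._adj[neighbour].discard(ident)

    def equivalents(self, ident):
        # Depth-first traversal of the connected component of `ident`.
        if ident not in self._adj:
            return set()
        seen = {ident}
        stack = [ident]
        while stack:
            current = stack.pop()
            for neighbour in self._adj[current]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        return seen


g = IdentifierGraph()
g.link("dbpedia:Example_Corp", "yahoo:EXC")      # made-up identifiers
g.link("yahoo:EXC", "lei:HYPOTHETICAL-0001")
before = g.equivalents("dbpedia:Example_Corp")

g.remove_identifier("yahoo:EXC")                 # retire an old identifier
after = g.equivalents("dbpedia:Example_Corp")
```

In this sketch, removing an identifier that bridges two others also separates them, so a real system would need a policy for whether such links should be preserved.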
5 CONCLUSIONS
The entity identification problem is a relatively new
topic, which has emerged with the Big and Open Data
phenomena. The only reference we have found so far
is the white paper by Powell and Shadbolt (2014).
Nevertheless, the focal point of that paper is only to
raise a concern for organizations and individuals:
that they should be cautious whenever they invent a
new unique identifier within their system for
entities that already exist.
Our paper addresses the entity identification issue
from a research and engineering perspective. With
our solution, a system that consists of an intelligent
reconciliation platform, the EIDER model, and the
Virtual Graphs technology, we are able to reconcile
multiple entity identifications from heterogeneous
Open Data sources. Furthermore, the two layers of
reconciliation ensure the accuracy of the
reconciled entity identifiers. Reconciliation is not
only based on historical data; we are also able to
incorporate new identities dynamically at system
runtime. From a software engineering point of view,
our system is engineered in a generic manner, so that
the solution is applicable to many different domains
that share the same issue, as described in section III.
6 FUTURE WORKS
This paper presents an initial investigation of the
entity identification problem in the Big and Open
Data era. At the time of writing, we have completed
our initial system architecture design; some of the
core components in the EIDER system have also been
implemented, e.g. entity extraction, data integration,
and some basic entity reconciliation. For the forth-