3 RELATED WORKS
In the last decade, multi-dimensional and
high-dimensional indexing in decentralized peer-to-peer
(P2P) networks has received extensive research
attention. In (Aly, 2011), a distributed k-d tree
based on the MapReduce framework (Dean, 2008) is
proposed. In such index structures, queries are
processed as in the centralized approach, i.e., the
query starts at the root node and traverses the tree. These
methods exhibit logarithmic search cost, but face a
serious limitation. Peers that correspond to nodes
high in the tree can quickly become overloaded as
query processing must pass through them. In
centralized indexes this was a desirable property,
because maintaining these nodes in main memory
minimizes the number of I/O operations. In
distributed indexes it is a limiting factor
leading to bottlenecks. Moreover, this causes an
imbalance in fault tolerance: if a peer high in the tree
fails, then the system requires a significant amount of
effort to recover. MIDAS (Tsatsanifos, 2013) is
similar to these works: it implements a distributed
k-d tree in which leaves correspond to peers and
internal nodes dictate message routing. MIDAS
distinguishes between physical and virtual peers. A
physical peer is an actual machine, which may be
responsible for several virtual peers due to node
departures or failures, or for load balancing and
fault tolerance purposes. A virtual peer in MIDAS
corresponds to a leaf of the k-d tree and stores and
indexes all key-value tuples whose keys reside in the
leaf's rectangle. For any point in space,
there exists exactly one peer in MIDAS responsible
for it. Two algorithms for nearest neighbour queries
are described: the first has low expected latency
but involves a large number of peers; the second
has higher expected latency but involves far fewer
peers. The proposed algorithms process point and
range queries over the multidimensional indexed
space in a logarithmic number of hops in
expectation.
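The root bottleneck discussed above can be illustrated
with a minimal Python sketch. It is not the
implementation of any of the cited systems: the names
Node, build, route and simulate, the median-split
construction and the leaf size of 4 are assumptions
made only for illustration. Each leaf of a k-d tree is
assigned to one peer, point queries are routed from the
root, and every node counts the queries that pass
through it.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, ...]


@dataclass
class Node:
    depth: int
    axis: int = 0                   # splitting dimension (internal nodes only)
    split: float = 0.0              # splitting value (internal nodes only)
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    peer_id: Optional[int] = None   # set only on leaves: one peer per leaf
    visits: int = 0                 # number of queries routed through this node


def build(points: List[Point], depth: int = 0, leaf_size: int = 4,
          counter: Optional[List[int]] = None) -> Node:
    """Build a k-d tree by median splits; each leaf gets its own peer id."""
    if counter is None:
        counter = [0]
    if len(points) <= leaf_size:
        node = Node(depth=depth, peer_id=counter[0])
        counter[0] += 1
        return node
    axis = depth % len(points[0])   # cycle through the dimensions
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    node = Node(depth=depth, axis=axis, split=points[mid][axis])
    node.left = build(points[:mid], depth + 1, leaf_size, counter)
    node.right = build(points[mid:], depth + 1, leaf_size, counter)
    return node


def route(root: Node, q: Point) -> int:
    """Route a point query top-down and return the id of the responsible peer."""
    node = root
    node.visits += 1
    while node.peer_id is None:
        node = node.left if q[node.axis] < node.split else node.right
        node.visits += 1
    return node.peer_id


def simulate(n_points: int = 256, n_queries: int = 1000,
             dims: int = 2, seed: int = 0):
    """Build a tree over random points and route random point queries."""
    rng = random.Random(seed)
    pts = [tuple(rng.random() for _ in range(dims)) for _ in range(n_points)]
    root = build(pts)
    answers = [route(root, tuple(rng.random() for _ in range(dims)))
               for _ in range(n_queries)]
    return root, answers
```

After simulate() has run, root.visits equals the number
of queries, while the visits of the two children of the
root sum to the same value: the load is shared below the
root but concentrated on it. Each query reaches exactly
one leaf, in line with the property that exactly one
peer is responsible for any point in space.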
4 CONCLUSIONS
The main objective of this work is the proposal of
an index with the following characteristics: 1) it must
be usable on a large amount of data, under the
assumption that it is not possible or convenient to use
a single workstation to host all the data; 2) it is
distributed over a computer network and ensures the
greatest possible benefits in terms of efficiency
(search, insert, delete), i.e., its performance is
close to that of traditional indexes that use a single
workstation. The
basic ideas behind it are a data structure, called
Decentralized Random Trees (DRT), based on the k-d
tree, and a novel k-nearest neighbour algorithm,
named random k-nearest neighbour algorithm. The
Decentralized Random Trees represent the main
contribution of this work. With a DRT distributed
over a network of peers a randomly chosen peer can
start the propagation of a query in the network
without involving the peer containing the root of the
tree in about 65% of cases. Furthermore, the first peer
that determines that the search is complete will return
the result. With high probability (in more than 98% of
cases), that peer is not the peer containing the root.
Of course, due to the distributed nature of the DRT,
more than one query can be running at the same time.
The number of initiated queries is potentially
limitless, even if the number of peers limits the
number of running queries.
REFERENCES
Abele, A., McCrae, J.P., Buitelaar, P., Jentzsch, A.,
Cyganiak, R., 2017. Linking Open Data cloud diagram
2017. http://lod-cloud.net/
Corley, C., Mihalcea, R., 2005. Measuring the semantic
similarity of texts. In Proceedings of the ACL Workshop
on Empirical Modeling of Semantic Equivalence and
Entailment, pages 13–18. Association for Computational
Linguistics.
Faloutsos, C., Lin, K., 1995. FastMap: A fast algorithm for
indexing, data-mining and visualization of traditional
and multimedia datasets, volume 24. ACM.
Kruskal, J.B., Wish, M., 1978. Multidimensional scaling,
volume 11. Sage.
Gargiulo, F., Gigante, G., Ficco, M., 2015. A semantic
driven approach for requirements consistency
verification. International Journal of High Performance
Computing and Networking, 8(3):201–211.
Basile, P., De Gemmis, M., Gentile, A.L., Lops, P.,
Semeraro, G., 2007. Uniba: Jigsaw algorithm for word
sense disambiguation. In Proceedings of the 4th
International Workshop on Semantic Evaluations,
pages 398–401. Association for Computational
Linguistics.
Samet, H., 2006. Foundations of multidimensional and
metric data structures. Morgan Kaufmann.
Aly, M., Munich, M., Perona, P., 2011. Distributed k-d
trees for retrieval from very large image collections. In
British Machine Vision Conference, Dundee, Scotland.
Dean, J., Ghemawat, S., 2008. MapReduce: simplified data
processing on large clusters. Communications of the
ACM, 51(1):107–113.
Tsatsanifos, G., Sacharidis, D., Sellis, T., 2013.
Index-based query processing on distributed
multidimensional data. GeoInformatica,
17(3):489–519.
DATA 2018 - 7th International Conference on Data Science, Technology and Applications