vidual data points cannot be reproduced from it. Our
algorithm shows weak scalability with the number of
points/dimensions. We discussed the limitation that arises when
dimensions are non-orthogonal or overlapping. Such situations
should become rare as dimensionality grows.
Domain knowledge can benefit our algorithm by pro-
viding guidelines for collapsing particularly noisy or
non-orthogonal dimensions. Finally, this work shows
the potential power of Numba in high-dimensional
data analysis.
ACKNOWLEDGEMENTS
This research was supported by the National Science
Foundation under the grant entitled CAREER: Enabling
Distributed and In-Situ Analysis for Multidimensional
Structured Data (NSF ACI-1453430).
DATA 2017 - 6th International Conference on Data Science, Technology and Applications