low us to further reduce the memory used by the Com-
pact algorithm by using an efficient by-value compu-
tation with non-primitive types.
7 CONCLUSIONS
This paper addresses the situation when one has to
manipulate a large textual data set by reading it from
a file, transforming it into objects, processing it and
then writing it back to a file, and all these operations
must be performed in a single in-memory session.
We have analyzed a modern implementation of an algorithm
for processing SAM and BAM files, elPrep
(Herzeel et al., 2015; Herzeel et al., 2019), which
must handle input files up to 100 GB. The conclusion
of the elPrep authors was that a Java implementation
for this specific problem suffers from the memory
management offered by the JVM (Costanza et al.,
2019). However, when using an object-oriented programming
platform, one has to take into consideration
all aspects of memory allocation offered
by that specific platform and adapt the data model and
programming techniques accordingly.
We have shown that major improvements can
be obtained by using techniques aimed at
reducing the number of created objects. This
not only saves memory but also improves run-time
performance by decreasing the overhead of the
Garbage Collector. Using a column-based representation,
we have compacted the data set in a manner
that improved the overall score, calculated as the
product of used memory and running
time. The penalty incurred by the more elaborate
data model was compensated by a multi-threaded approach,
called chunking-batching, that allows
the algorithm to use all available machine cores when
processing the input file.
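To illustrate the idea of a column-based representation, the following is a minimal Java sketch, not the implementation evaluated in this paper. The class name ColumnStore, the three chosen fields (read name, position, mapping quality) and the growth policy are assumptions made for the example. Instead of allocating one object per record, each field is kept in a single primitive array, and all names are packed into one shared byte array, so the number of live objects stays constant regardless of the number of records.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical column-oriented store for three fields of an alignment record. */
public final class ColumnStore {
    private int[] positions = new int[16];       // one primitive column for positions
    private byte[] mapQualities = new byte[16];  // one primitive column (values 0..255)
    private int[] nameEnds = new int[16];        // end offset of each name in nameBytes
    private byte[] nameBytes = new byte[256];    // all names packed back to back
    private int nameLength;                      // bytes of nameBytes currently in use
    private int size;                            // number of stored rows

    /** Appends one record without creating a per-record object. */
    public void add(String name, int position, int mapQuality) {
        byte[] encoded = name.getBytes(StandardCharsets.US_ASCII);
        if (size == positions.length) {          // grow all row-indexed columns together
            int capacity = size * 2;
            positions = Arrays.copyOf(positions, capacity);
            mapQualities = Arrays.copyOf(mapQualities, capacity);
            nameEnds = Arrays.copyOf(nameEnds, capacity);
        }
        int needed = nameLength + encoded.length;
        if (needed > nameBytes.length) {         // grow the packed name buffer
            nameBytes = Arrays.copyOf(nameBytes, Math.max(nameBytes.length * 2, needed));
        }
        System.arraycopy(encoded, 0, nameBytes, nameLength, encoded.length);
        nameLength = needed;
        positions[size] = position;
        mapQualities[size] = (byte) mapQuality;
        nameEnds[size] = nameLength;
        size++;
    }

    public int size() {
        return size;
    }

    public int position(int row) {
        return positions[row];
    }

    /** Returns the mapping quality as an unsigned value. */
    public int mapQuality(int row) {
        return mapQualities[row] & 0xFF;
    }

    /** Materializes a name only when it is actually requested. */
    public String name(int row) {
        int start = row == 0 ? 0 : nameEnds[row - 1];
        return new String(nameBytes, start, nameEnds[row] - start, StandardCharsets.US_ASCII);
    }
}
```

In this sketch a String is created only when name(row) is called, so a pass that reads only positions or qualities allocates nothing for the Garbage Collector to trace.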
Given the hardware differences between the machine
used by the elPrep authors and ours, there are
limits on the testing that could be done with the techniques
used in this paper. However, using input files
ranging in size from 144 MB to 12 GB, we have
shown that our algorithms are scalable and should perform
as expected for files of any size, provided the
machine has sufficient memory.
REFERENCES
Abadi, D., Boncz, P., and Harizopoulos, S. (2013). The De-
sign and Implementation of Modern Column-Oriented
Database Systems. Now Publishers Inc., Hanover,
MA, USA.
Costanza, P., Herzeel, C., and Verachtert, W. (2019). Com-
paring ease of programming in C++, Go, and Java for
implementing a next-generation sequencing tool. Evo-
lutionary Bioinformatics, 15:1176934319869015.
Döring, A., Weese, D., Rausch, T., and Reinert, K. (2008).
SeqAn: an efficient, generic C++ library for sequence
analysis. BMC Bioinformatics, 9:11.
Eimouri, T., Kent, K. B., and Micic, A. (2017). Optimiz-
ing the JVM Object Model Using Object Splitting. In
Proceedings of the 27th Annual International Confer-
ence on Computer Science and Software Engineering,
CASCON '17, pages 170–179, USA. IBM Corp.
Gosling, J., Joy, B., Steele, G. L., Bracha, G., and Buckley,
A. (2014). The Java Language Specification, Java SE
8 Edition. Addison-Wesley Professional, 1st edition.
He, Q., Li, Z., and Zhang, X. (2010). Data deduplication
techniques. volume 21, pages 430–433.
Herzeel, C., Costanza, P., Decap, D., Fostier, J., and
Reumers, J. (2015). elPrep: High-performance preparation
of sequence alignment/map files for variant
calling. PLOS ONE, 10:e0132868.
Herzeel, C., Costanza, P., Decap, D., Fostier, J., and
Verachtert, W. (2019). elPrep 4: A multithreaded framework
for sequence analysis. PLOS ONE, 14(2):1–16.
Java Platform, Standard Edition (2019). Java De-
velopment Kit Version 11 API Specification.
https://docs.oracle.com/en/java/javase/11/docs/api.
Accessed: 2019-06-01.
Lindholm, T., Yellin, F., Bracha, G., and Buckley, A.
(2014). The Java Virtual Machine Specification, Java
SE 8 Edition. Addison-Wesley Professional, 1st edi-
tion.
Manogar, E. and Abirami, S. (2014). A study on data
deduplication techniques for optimized storage. pages
161–166.
Oaks, S. (2014). Java Performance: The Definitive Guide.
O’Reilly Media, Inc., 1st edition.
Oracle GC (2019). Java Garbage Collection Basics.
https://www.oracle.com/webfolder/technetwork/tutorials/obe/java/gc01/index.html.
Accessed: 2019-06-01.
Schatzl, T., Daynès, L., and Mössenböck, H. (2011). Optimized
memory management for class metadata in a JVM.
Valhalla, P. (2019). OpenJDK Project Valhalla.
https://openjdk.java.net/projects/valhalla/. Accessed:
2019-06-01.
Vigna, S. (2019). Fastutil 8.1.0. http://fastutil.di.unimi.it/.
Accessed: 2019-06-01.
VisualVM, O. (2019). Java VisualVM.
https://docs.oracle.com/javase/8/docs/technotes/guides/visualvm/index.html.
Accessed: 2019-06-01.
A Case Study on Performance Optimization Techniques in Java Programming