for hash table operations. A search generally takes
about one access, and the space utilization may be up
to 90%. This performance is superior to B+-trees for
key-based lookup operations. Linear hashing does not
require an index to look up bucket locations on storage
if the buckets are allocated contiguously on storage or
allocated in fixed-size regions. The address of a record
is computed by using the output of the hash function
applied to the key to identify the appropriate region
(if there are multiple) and the bucket within the region.
Thus, the memory consumed is minimal and consists
of information on the current number of buckets and
the next bucket to split.
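The address computation described above can be sketched as follows. This is a minimal illustration, assuming contiguously allocated buckets of a fixed size and the conventional modulo-based family of hash functions; the names LinearHashState, bucket_index, and bucket_offset are hypothetical and are not taken from the implementation evaluated here.

```python
from dataclasses import dataclass


@dataclass
class LinearHashState:
    """The only state kept in memory (names are illustrative)."""
    num_buckets: int  # current number of primary buckets
    next_split: int   # index of the next bucket to split


def bucket_index(state: LinearHashState, key_hash: int) -> int:
    """Map a key's hash value to a primary bucket index."""
    # Number of buckets at the start of the current round of splits.
    round_size = state.num_buckets - state.next_split
    idx = key_hash % round_size
    # Buckets before the split pointer were already split this round,
    # so their records are addressed with the doubled modulus.
    if idx < state.next_split:
        idx = key_hash % (2 * round_size)
    return idx


def bucket_offset(state: LinearHashState, key_hash: int, bucket_size: int) -> int:
    """Byte offset of the bucket, assuming contiguous allocation (assumption)."""
    return bucket_index(state, key_hash) * bucket_size
```

For example, with 6 buckets and the split pointer at 2, a hash value of 5 maps to bucket 5 (already split), while a hash value of 7 maps to bucket 3 (not yet split). No lookup table is consulted in either case.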
Collisions are handled using overflow buckets that
are chained to the primary (or home) bucket. The hash
file is dynamically resized when the storage utilization
(load factor) exceeds a set threshold. At that point,
a new bucket is added to the end of the hash file, and
records are divided between the new bucket and the
bucket currently designated for splitting. It is
this predefined, ordered splitting of buckets that is the
main contribution of linear hashing.
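The split procedure can be illustrated with a small in-memory sketch. It assumes Python lists stand in for buckets (so overflow chains are not modeled explicitly), a modulo-based hash family, and illustrative values for the bucket capacity and load threshold; the class and parameter names are hypothetical.

```python
class LinearHashTable:
    """In-memory sketch of linear hashing with ordered bucket splits."""

    def __init__(self, initial_buckets=4, bucket_capacity=4, max_load=0.85):
        self.buckets = [[] for _ in range(initial_buckets)]
        self.round_size = initial_buckets  # buckets at the start of this round
        self.next_split = 0                # next bucket to split
        self.bucket_capacity = bucket_capacity
        self.max_load = max_load
        self.count = 0

    def _index(self, key):
        idx = hash(key) % self.round_size
        if idx < self.next_split:  # bucket already split in this round
            idx = hash(key) % (2 * self.round_size)
        return idx

    def load_factor(self):
        return self.count / (len(self.buckets) * self.bucket_capacity)

    def insert(self, key, value):
        self.buckets[self._index(key)].append((key, value))
        self.count += 1
        if self.load_factor() > self.max_load:
            self._split()

    def _split(self):
        old = self.next_split
        records = self.buckets[old]
        self.buckets[old] = []
        self.buckets.append([])  # new bucket at index old + round_size
        # Rehash with the doubled modulus: each record stays in the old
        # bucket or moves to the newly appended one.
        for key, value in records:
            target = hash(key) % (2 * self.round_size)
            self.buckets[target].append((key, value))
        self.next_split += 1
        if self.next_split == self.round_size:  # round complete, file doubled
            self.round_size *= 2
            self.next_split = 0

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None
```

Note that the bucket that is split is chosen by the split pointer, not by which bucket received the insert. This fixed order is what allows addresses to be computed from two integers alone.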
Linear hashing was extended and generalized by
Larson (Larson, 1982) using partial expansions. It
was shown that performance improves if doubling of
the file size is done in a series of partial expansions,
with two generally being a good number. Search
performance increases at the cost of some additional
algorithmic complexity and the need for buffering and
splitting k + 1 buckets in memory, where k is the
number of partial expansions. Further
work (Larson, 1985) allowed for the primary buckets
and overflow buckets to use the same storage file by
reserving pre-defined overflow pages at regular inter-
vals in the data file. This work also added the ability
to have multiple overflow chains from a single pri-
mary bucket by utilizing several hash functions to de-
termine the correct overflow chain. Popular database
management systems such as PostgreSQL use imple-
mentations of linear hashing.
Variations of linear hashing optimized for flash
memory use the idea of log buffering to increase
performance. The Self-Adaptive Linear Hash (Yang
et al., 2016) buffers logs of successive operations be-
fore flushing the result to storage. This often de-
creases the total number of read and write operations
and allows for some random writes to be performed
sequentially. Self-Adaptive Linear Hash also adds
higher levels of organization to achieve more coarse-
grained writes to improve the bandwidth. Unfortu-
nately, the extra memory consumed is impractical for
embedded devices.
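The general idea of log buffering can be sketched as below. This is a generic illustration and not the design of the Self-Adaptive Linear Hash: it assumes a hypothetical write_bucket callback that performs one storage write per call, and an arbitrary buffer capacity.

```python
class BufferedBucketWriter:
    """Buffers record writes and flushes them grouped by bucket."""

    def __init__(self, write_bucket, capacity=8):
        self.write_bucket = write_bucket  # hypothetical callback(bucket_id, records)
        self.capacity = capacity          # buffered operations held in RAM
        self.log = []

    def put(self, bucket_id, record):
        self.log.append((bucket_id, record))
        if len(self.log) >= self.capacity:
            self.flush()

    def flush(self):
        # Group buffered records so each bucket is written once per flush.
        grouped = {}
        for bucket_id, record in self.log:
            grouped.setdefault(bucket_id, []).append(record)
        # Visit buckets in ascending order so writes proceed in one direction.
        for bucket_id in sorted(grouped):
            self.write_bucket(bucket_id, grouped[bucket_id])
        self.log.clear()
```

In this sketch, eight buffered inserts that touch three buckets result in three storage writes instead of eight. The buffer itself occupies RAM, which is the cost that limits this approach on small devices.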
Embedded systems come in a wide variety of con-
figurations and are often developed and deployed for
particular use cases, which results in software that
is often customized both to the hardware and to the
problem. Arduinos (Severance, 2014) have increased
in usage as their designs are open source and a builder
community has emerged with resources to help de-
velopers. The Arduino Mega 2560, one of the most
popular Arduino boards, has 8 KB of SRAM and a
clock speed of 16 MHz. It also has a microSD card in-
terface for non-volatile, flash-memory storage. With
such limited capabilities, many applications cannot
run on an Arduino without adapting them to the more
resource-constrained environment.
Data structures for these devices include specialized
index structures for flash memory (Gal and Toledo,
2005; Lin et al., 2006). Devices such as smart cards
and sensor nodes cannot afford the code space (often
less than 128 KB), memory (between 2 KB and 64 KB),
and energy requirements of typical database query
processing. Databases designed for local data storage
and querying on embedded devices, such as Ante-
lope (Tsiftes and Dunkels, 2011), PicoDBMS (Anci-
aux et al., 2003), and LittleD (Douglas and Lawrence,
2014), simplify the queries that are executable and the
data structures and algorithms used. Systems such as
TinyDB (Madden et al., 2005) and COUGAR (Bon-
net et al., 2001) are distributed data systems intended
to manage information over many networked sensors.
There has not been an experimental evaluation
of the performance and implementation requirements
of linear hashing on embedded devices.
3 IMPLEMENTATION
The implementation of linear hashing requires several
key decisions that are heavily influenced by the lim-
ited resources, flash memory properties, and embed-
ded use cases:
• Bucket Structure - Are buckets stored as a linked
list or in sequential addresses on storage?
• Overflow Buckets - Are overflow buckets in a sep-
arate file or in the data file?
• Deletions - How are deletions handled? How is
free space reclaimed?
• Caching and Memory Usage - How much of the
data structure is memory-resident? Is memory
usage tunable for devices with more memory?
The implementation was optimized for the specific
properties of embedded use cases. The goals are to
minimize RAM consumption, favor reads over writes on
flash, and optimize for sequential writing of records.
Many embedded systems run logging applications
where the device is collecting sensed data over