former two are mono-instance, whilst the latter two are multi-instance. The Storage Engine has two roles, data server (DM) and meta-data server (MS), both with multiple instances. The query engine is homogeneous and multi-instance. There is a manager (MNG) that is single-instance and single-threaded. Many of these components can be replicated to provide high availability, but their nature does not change. Since replication is an orthogonal topic, we do not mention it further.
4 FACTORS TO BE CONSIDERED
Leveraging the full potential of multi-core and NUMA shared-memory architectures requires understanding three key concepts, namely: processor affinity, data placement, and the notion of, and relation between, physical and virtual cores.
4.1 Processor Affinity
Processor affinity is the capability to bind the execution of a given task to a given processing unit. Usually, the selection of a CPU is governed by a scheduler that takes the system's state and several other policies into consideration in order to load-balance tasks across the available processors. When only one core is available, processes or threads are instructed to start and halt their execution so as to give way to other threads, ensuring that the resource is shared among all interested parties. When several CPUs are available, the scheduler splits the threads' work among the available instances and may decide to halt and reallocate task execution among processors to achieve load balancing or to comply with other policies. Under this scenario, NUMA architectures become problematic, since processors and their respective memory blocks become disassociated.
Processor affinity relies on a modified scheduler that systematically associates a given task with a given processor, regardless of the other available resources. During the lifetime of a given process or thread, the scheduler monitors the relevant metrics to ensure that memory allocation remains local to the processor on which that process or thread is running. By itself, this technique may significantly harm performance by preventing the task scheduler from spreading load among the available instances. It should therefore be accompanied by a smart monitoring and allocation scheduler that exploits the specificities of use cases that benefit from affinity. On its own, strict affinity may be detrimental for most use cases; alternatively, the scheduler may treat affinity as a hint, without preventing a given processing core from executing jobs outside of its strict affinity allowance.
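As a minimal sketch of hard affinity on Linux, assuming glibc and an illustrative core id, a thread can pin itself to a single core with pthread_setaffinity_np:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Pin the calling thread to one core (core id 2 is illustrative). */
static int pin_to_core(int core_id)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);
    /* Hard affinity: the scheduler may no longer migrate this thread. */
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main(void)
{
    int rc = pin_to_core(2);
    if (rc != 0) {
        fprintf(stderr, "pin_to_core failed: %d\n", rc);
        return 1;
    }
    printf("thread pinned to core 2\n");
    return 0;
}

A softer variant would keep more than one core in the mask, mirroring the hint-based affinity discussed above.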
4.2 Data Placement
Data placement is the capability to place data close to the processing cores that are responsible for executing a given task. It can be achieved implicitly, where memory blocks are assigned to specific processing units, or explicitly, where the application is hardware-aware and requests that ranges of memory be handled by specific processing units. Either way, the more often data is placed close to the processing unit performing the computation, the more performance can be expected, through the overall reduction in access time. Data placement thus aims to adapt traditional scheduling policies so as to favour fast memory with smaller access costs over remote memory accesses.
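As an illustration of the explicit form on Linux, the following sketch uses libnuma to request memory on a specific node (the node id and buffer size are illustrative; link with -lnuma):

#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma: NUMA not supported on this system\n");
        return 1;
    }
    size_t len = 1 << 20;   /* 1 MiB, illustrative */
    int node = 0;           /* illustrative node id */
    /* Explicit placement: back the buffer with memory on 'node'. */
    void *buf = numa_alloc_onnode(len, node);
    if (buf == NULL)
        return 1;
    memset(buf, 0, len);    /* touch the pages so they are faulted in on that node */
    numa_free(buf, len);
    return 0;
}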
4.3 Virtual Cores/HW Threads vs Physical Cores vs Sockets
A computer might have one or more sockets, also known as NUMA units. Each socket is basically a CPU that can have one or more cores and has a memory module attached to it. Sockets are structured in a NUMA hierarchy that can range from 2 levels to an arbitrary number of levels; in the new Bullion, the NUMA hierarchy is quite deep. Understanding the relationship between physical cores and sockets, and the NUMA distance across physical cores, is crucial to minimize the cost of communication across components. Components that interact frequently will communicate more efficiently if they are running on cores that are closer in the NUMA hierarchy.
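As a sketch of how a deployment tool might inspect this, libnuma exposes the firmware-reported node distance matrix (assuming Linux; smaller values mean closer nodes, with 10 conventionally denoting local access):

#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0)
        return 1;
    int max = numa_max_node();
    /* Print the node-to-node distance matrix reported by the firmware. */
    for (int i = 0; i <= max; i++) {
        for (int j = 0; j <= max; j++)
            printf("%4d", numa_distance(i, j));
        printf("\n");
    }
    return 0;
}

Components that exchange data frequently can then be placed on the pair of nodes with the smallest distance.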
Many CPUs have the concept of physical and virtual cores. Many INTEL CPUs feature hyper-threading, which provides two hardware threads per physical core. They have a superscalar architecture in which a subset of the instructions can operate on separate data in parallel. Additionally, when one thread blocks, the other can still run, guaranteeing that the physical core keeps performing useful work.
The operating system actually reports the hardware threads as virtual cores. AMD virtual cores are more complete than INTEL hyper-threads, since each has a full set of registers, so threads running on different virtual cores do not need to save the thread registers, as happens with hyper-threading. In any case, the operating system always reports virtual cores.
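The following sketch, assuming Linux sysfs paths, shows that the CPU count seen by software is a count of hardware threads, and which sibling threads share one physical core:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* The OS counts hardware threads (virtual cores), not physical cores. */
    long online = sysconf(_SC_NPROCESSORS_ONLN);
    printf("online virtual cores: %ld\n", online);

    /* Hardware threads sharing cpu0's physical core, e.g. "0,8" with hyper-threading. */
    FILE *f = fopen("/sys/devices/system/cpu/cpu0/topology/thread_siblings_list", "r");
    if (f != NULL) {
        char buf[64];
        if (fgets(buf, sizeof buf, f) != NULL)
            printf("cpu0 shares a physical core with: %s", buf);
        fclose(f);
    }
    return 0;
}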