entirely transparent to the applications. While such
transparency is desirable, it forces a tight integration
with the memory subsystem, either at the physical level
or at the hypervisor level. At the physical level, the
memory controller needs to be able to handle remote
memory accesses. To mask the impact of long remote
memory access latencies, we expect that a large
cache system is required. Disaggregated GPUs and
FPGAs can be accessed as I/O devices through
direct integration such as PCIe over Ethernet.
Similar to disaggregated memory, the programming
models remain unchanged once the disaggregated
resource is mapped into the I/O address space of
the local compute node.
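As a rough illustration of why a large cache is needed, the following sketch estimates the effective access time for a given local-cache hit rate; the remote-access latency used here is an assumed round number for illustration, not a measurement from this work.

    # Sketch: effective access time when local DRAM caches remote memory.
    # The latency figures are assumed round numbers, not measurements.
    LOCAL_DRAM_NS = 75        # typical local DRAM access time
    REMOTE_MEM_NS = 2000      # assumed round trip to a disaggregated pool

    def effective_access_ns(hit_rate):
        """Average access time for a given local cache hit rate."""
        return hit_rate * LOCAL_DRAM_NS + (1.0 - hit_rate) * REMOTE_MEM_NS

    for hit_rate in (0.90, 0.99, 0.999):
        print(f"hit rate {hit_rate:.3f}: {effective_access_ns(hit_rate):.0f} ns")

Under these assumptions a 90% hit rate more than triples the average access time, while a 99.9% hit rate keeps it within a few percent of local DRAM, which is why the cache must be large enough to keep misses rare.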
In the second approach, access to disaggregated
resources can be exposed at the hypervisor,
container, or operating system level. New
hypervisor-level primitives, such as getMemory,
getGPU, and getFPGA, need to be defined to allow
applications to explicitly request the provisioning
and management of these resources in a manner
similar to malloc. It is also possible to modify the
paging mechanism within the hypervisor/operating
system so that paging to HDD now goes
through a new memory hierarchy comprising
disaggregated memory, SSD, and HDD. In this
case, the application does not need to be modified at
all. Accessing a remote Nvidia GPU through rCUDA
(Duato 2010) has been demonstrated, and has been
shown to outperform a locally connected
GPU when there is appropriate network
connectivity.
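The sketch below illustrates what such hypervisor-level primitives might look like from the application's perspective. The names getMemory, getGPU, and getFPGA follow the text above, but the signatures, the Handle type, and the pool identifiers are illustrative assumptions rather than an existing API.

    # Sketch of hypothetical hypervisor-level primitives analogous to malloc.
    # Signatures, the Handle type, and pool ids are illustrative assumptions.
    from dataclasses import dataclass

    @dataclass
    class Handle:
        kind: str            # "memory", "gpu", or "fpga"
        resource_id: str
        size_bytes: int = 0

    def getMemory(size_bytes):
        """Request a region of disaggregated memory from the memory pool."""
        return Handle(kind="memory", resource_id="mem-pool-0", size_bytes=size_bytes)

    def getGPU():
        """Request an available GPU from the disaggregated GPU pool."""
        return Handle(kind="gpu", resource_id="gpu-pool-0")

    def getFPGA():
        """Request an available FPGA from the disaggregated FPGA pool."""
        return Handle(kind="fpga", resource_id="fpga-pool-0")

    # Applications provision resources explicitly, much as they call malloc:
    mem = getMemory(64 * 2**30)    # request 64 GiB of remote memory
    gpu = getGPU()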
Disaggregation details and resource remoteness
can also be directly exposed to applications.
Disaggregated resources can be exposed via high-
level APIs (e.g., put/get for memory). As an
example, it is possible to define GetMemory, in the
form of Memory as a Service, as one of the
OpenStack services. The OpenStack service sets up a
channel between the host and the memory pool
service through RDMA. Through the GetMemory
service, the application can explicitly control
which part of its address space is deemed remote, and
therefore controls, or is at least cognizant of, which
memory and application objects will be placed
remotely. In the case of GPU as a Service, a new
service primitive GetGPU can be defined to locate
an available GPU from the GPU resource pool and
a host from the host resource pool. The system
establishes the channel between the host and the
GPU through RDMA/PCIe and exposes GPU
access to applications via a library or a virtual
device.
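The outline below sketches the client-side flow of such services. GetMemory and GetGPU are the primitives named above; the service classes, their internals, the device path, and the put/get placeholders are hypothetical illustrations, not part of OpenStack today.

    # Client-side sketch of Memory as a Service and GPU as a Service.
    # The service classes and their internals are hypothetical.
    class MemoryService:
        def GetMemory(self, size_bytes):
            # Reserve a region in the disaggregated memory pool and set up
            # an RDMA channel between the host and the pool service.
            return {"pool": "memory-pool", "size": size_bytes, "offset": 0}

        def put(self, region, offset, data):
            """Placeholder for an RDMA write into the remote region."""
            pass

        def get(self, region, offset, length):
            """Placeholder for an RDMA read from the remote region."""
            return bytes(length)

    class GPUService:
        def GetGPU(self):
            # Locate an available GPU in the GPU pool, pair it with a host,
            # establish an RDMA/PCIe channel, and expose it to the
            # application via a library or a virtual device.
            return {"pool": "gpu-pool", "device": "/dev/vgpu0"}

    # The application explicitly chooses which objects are placed remotely:
    mem_svc = MemoryService()
    region = mem_svc.GetMemory(16 * 2**30)
    mem_svc.put(region, 0, b"object placed in remote memory")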
4 NETWORK CONSIDERATIONS
One of the primary challenges for a disaggregated
datacenter architecture is the latency incurred by the
network when accessing memory, SSD, GPU, and
FPGA from remote resource pools. The latency
sensitivity depends on how the disaggregated
resources are exposed to the programming model:
as direct hardware, through the hypervisor, or as
a service.
The most stringent requirement on the network
arises when disaggregated memory is mapped to the
address space of the compute node and is accessed
in a byte-addressable manner. The total
access latency across the network cannot be
significantly larger than the typical access time of
DRAM, which is on the order of 75 ns. As a
result, silicon photonics and optical circuit switches
(OCS) are likely to be the only options to enable
memory disaggregation beyond a rack. Large caches
can reduce the impact of remote access. When the
block sizes are aligned with the page sizes of the
system, the remote memory can be managed as an
extension of the virtual memory system of the local
hosts by the hypervisor and OS.
In this configuration, local DRAM is used as a cache
for the remote memory, which is managed in page-
size blocks and can be moved via RDMA
operations.
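A minimal sketch of this configuration is given below: local DRAM acts as a cache over remote memory managed in page-size blocks. The RDMA read/write functions are placeholders for the actual verbs, and the cache capacity and LRU eviction policy are illustrative assumptions.

    # Sketch of page-granularity management of remote memory, with local
    # DRAM acting as a cache. RDMA calls are placeholders; capacity and
    # eviction policy are assumptions.
    from collections import OrderedDict

    PAGE_SIZE = 4096          # block size aligned to the system page size
    LOCAL_CACHE_PAGES = 1024  # assumed capacity of the local DRAM cache

    local_cache = OrderedDict()   # page_number -> page bytes (LRU order)

    def rdma_read_page(page_no):
        """Placeholder for an RDMA read of one page from the remote pool."""
        return bytearray(PAGE_SIZE)

    def rdma_write_page(page_no, data):
        """Placeholder for an RDMA write of one page back to the remote pool."""
        pass

    def access_page(page_no):
        """Return a page, fetching it over RDMA on a local-cache miss."""
        if page_no in local_cache:
            local_cache.move_to_end(page_no)          # refresh LRU position
        else:
            if len(local_cache) >= LOCAL_CACHE_PAGES:
                victim, data = local_cache.popitem(last=False)
                rdma_write_page(victim, data)         # write back evicted page
            local_cache[page_no] = rdma_read_page(page_no)
        return local_cache[page_no]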
Disaggregating GPUs and FPGAs is much less
demanding, as each GPU and FPGA is likely to
have its own local memory and will often engage in
computations that last many microseconds or
milliseconds. The predominant communication
mode between a compute node and disaggregated
GPU and FPGA resources is therefore likely to be bulk
data transfer. Reano et al. (2013) have shown
that adequate bandwidth, such as that
offered by RDMA at the FDR data rate (56 Gb/s),
already yields performance superior to a
locally connected GPU.
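To put the bulk-transfer mode in perspective, the back-of-the-envelope estimate below compares the time to move a buffer at the FDR rate against a computation lasting milliseconds; the buffer size and computation time are assumed example values, not figures from the cited work.

    # Back-of-the-envelope bulk-transfer estimate at the FDR rate (56 Gb/s).
    # The buffer size and computation time are assumed example values.
    FDR_BITS_PER_SECOND = 56e9
    buffer_bytes = 16 * 2**20           # assumed 16 MiB input buffer
    compute_time_s = 10e-3              # assumed 10 ms GPU/FPGA computation

    transfer_time_s = buffer_bytes * 8 / FDR_BITS_PER_SECOND
    print(f"transfer {transfer_time_s*1e3:.1f} ms vs compute {compute_time_s*1e3:.1f} ms")

Under these assumptions the transfer takes roughly 2.4 ms and can be overlapped with a computation lasting several times longer, consistent with the bulk-transfer mode described above.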
Current SSD technologies offer on the order of
100K IOPS (or more) with access latencies of roughly
100 µs. Consequently, the network latency added when
accessing a non-buffered disaggregated SSD should be
on the order of 10 µs. This latency
may be achievable using conventional Top-of-
Rack (TOR) switch technologies if the
communication is limited to within a rack. A flat
network across a PoD or a datacenter, with a two-tier
spine-leaf model or a single-tier spline model, is
required in order to achieve less than 10 µs latency if
the communication between the local hosts and the
disaggregated SSD resource pools crosses a PoD
or a datacenter.
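A simple budget calculation, using the round numbers quoted above, shows why roughly 10 µs of network latency is the target; the 10% overhead figure that results is an illustrative consequence of those assumed numbers.

    # Rough latency budget for disaggregated SSD access, using the round
    # numbers quoted in the text (about 100 µs device, 10 µs network).
    ssd_latency_us = 100.0
    network_budget_us = 10.0

    total_us = ssd_latency_us + network_budget_us
    overhead_pct = network_budget_us / ssd_latency_us * 100
    print(f"total access: {total_us:.0f} µs, network overhead: {overhead_pct:.0f}%")

Keeping the network contribution to roughly one tenth of the device latency is what makes intra-rack TOR switching sufficient, while cross-PoD or cross-datacenter access requires the flatter topologies described above.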