Date: 2022-07-13 11:00:53
CXL (Compute Express Link) is set to be a transformative technology that will redefine how data centers are architected and built, because it provides a standardized protocol for cache coherency, memory scaling, and memory pooling across chips. In this article, we'll focus on what Microsoft is doing with CXL to help you understand what the technology means for the data center.
Data centers are extremely expensive. Microsoft says that up to 50% of their server costs come from DRAM alone. The capital expenditure required is huge, yet the servers being built do not serve a homogeneous demand. Workloads are not static; they are constantly growing and evolving, and the mix of compute resources, DRAM, NAND, and network type changes with the workload.
A one-size-fits-all model doesn't work, which is why cloud providers offer dozens, if not hundreds, of different instance types in an attempt to optimize their hardware for different workloads. Even so, many users end up paying for resources they don't really need.
Instance selection isn't perfect, and neither is the match between those instances and the underlying hardware. This leads to platform-level memory stranding: servers end up configured for instance mixes they don't actually serve, and once a server's cores are fully rented out, the leftover DRAM is stranded, provisioned and paid for but unsellable.
The solution to this problem is memory pooling: multiple servers share a pool of memory that can be dynamically allocated among them. Instead of over-provisioning every server, servers can be configured closer to the average DRAM-to-core ratio, with clients' excess DRAM demands absorbed by the pool. The pool communicates with the servers over the CXL protocol. In the future, with revisions to the CXL protocol, servers could even share the same memory for the same workload, which would further reduce DRAM requirements.
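To make the economics concrete, here is a minimal sketch, with invented numbers rather than Azure's, comparing today's per-server peak provisioning against provisioning each server near the average demand and absorbing bursts from a shared pool.

```python
# Hypothetical illustration of why pooling reduces total DRAM.
# All figures are invented for the example, not taken from Microsoft/Azure.

servers = 8
peak_gib_per_server = 1024   # DRAM needed to cover each server's worst case
avg_gib_per_server = 640     # average demand across the fleet
pool_headroom_gib = 1024     # shared pool sized to absorb coincident bursts

# Today: every server is provisioned for its own peak demand.
static_total = servers * peak_gib_per_server

# With pooling: servers carry roughly the average, bursts spill into the pool.
pooled_total = servers * avg_gib_per_server + pool_headroom_gib

savings = 1 - pooled_total / static_total
print(f"static provisioning: {static_total} GiB")   # 8192 GiB
print(f"pooled provisioning: {pooled_total} GiB")   # 6144 GiB
print(f"DRAM savings: {savings:.0%}")               # 25%
```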
Sophisticated operators with large-scale first-party applications can solve this problem by exposing multiple tiers of memory with different bandwidths and latencies to their developers. This is untenable for public cloud environments operated by Amazon, Google, Microsoft, and others.
Microsoft outlined three major functional challenges for memory pooling in a public cloud environment: the provider cannot modify customer workloads, including the guest operating system; the pooling system must remain compatible with virtualization acceleration technologies such as direct assignment of I/O devices to VMs and SR-IOV; and pooling must work on commodity hardware.
Microsoft had tried memory pooling in the past, but it required custom hardware designs, changes to VM guests, and a reliance on page faults, a combination that made it impractical to deploy in the cloud. This is where CXL comes in. Intel, AMD, and several Arm partners have joined the standard, and CPUs with CXL support will start shipping later this year. In addition, the three major DRAM manufacturers, Samsung, Micron, and SK Hynix, have committed to supporting the standard.
Even with broad support from hardware suppliers, there are still many questions that need to be answered. On the hardware side: How should memory pools be built and how do you balance pool size with the higher latency of larger pools? On the software side: How should these pools be managed and exposed to the guest OS, and how much additional memory latency can cloud workloads tolerate?
At the distributed-systems layer: how should providers schedule VMs on machines with CXL memory, which memory should live in the pool versus in directly attached DRAM, can memory behavior and latency sensitivity be predicted well enough to improve placement, and if so, how accurate are those predictions?
Microsoft has asked these questions and tried to answer them. We will outline their findings here. Their first generation of solution architecture has yielded impressive results.
These gains are likely to expand further as future CXL versions are released and latency is reduced.
First is the hardware layer. Microsoft's design attaches multi-ported external memory directly to pools of 8 to 32 CPU sockets. Memory expansion is handled by an external memory controller (EMC) that drives four 80-bit (ECC) DDR5 pooled-DRAM channels and exposes multiple CXL links so that several CPU sockets can reach the same memory. The EMC services requests and keeps track of which host owns each region of pooled memory.
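To picture the EMC's bookkeeping role, here is a minimal sketch of ownership tracking: a table mapping fixed-size pooled-memory regions to the host that currently owns them. The class, method names, and region size are illustrative assumptions, not a description of the real controller.

```python
# Toy model of EMC ownership tracking: pooled memory is divided into
# fixed-size regions, each assigned to at most one host at a time.
# Names and region size are assumptions for illustration only.

class ExternalMemoryController:
    def __init__(self, pool_gib: int, region_gib: int = 1):
        self.owner = {r: None for r in range(pool_gib // region_gib)}

    def assign(self, region: int, host: str) -> None:
        if self.owner[region] is not None:
            raise ValueError(f"region {region} already owned by {self.owner[region]}")
        self.owner[region] = host

    def release(self, region: int) -> None:
        self.owner[region] = None

    def check_access(self, region: int, host: str) -> bool:
        # A request is honored only if the region belongs to the requesting host.
        return self.owner[region] == host

emc = ExternalMemoryController(pool_gib=256)
emc.assign(0, "host-A")
assert emc.check_access(0, "host-A") and not emc.check_access(0, "host-B")
```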
A CXL x8 link has approximately the bandwidth of one DDR5 memory channel. Each CPU retains its own faster local memory, but it can also reach the CXL pooled memory at higher latency, comparable to a single NUMA hop. The added latency of 67ns to 87ns accumulates across the CXL controller and PHY, optional retimers, propagation delay, and the external memory controller.
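As a rough sanity check on that bandwidth claim, assuming CXL running over PCIe 5.0 signaling at 32 GT/s per lane and a DDR5-4800 channel (neither figure is stated in the article):

```python
# Back-of-the-envelope comparison; link and DIMM speeds are assumptions.
cxl_x8_gbs = 8 * 32 / 8      # 8 lanes x 32 Gb/s per lane: about 32 GB/s per direction
ddr5_4800_gbs = 4.8 * 8      # 4800 MT/s x 8 bytes: about 38.4 GB/s per channel
print(cxl_x8_gbs, ddr5_4800_gbs)  # 32.0 vs 38.4, the same ballpark
```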
The chart below shows a fixed percentage (10%, 30%, and 50%) of local DRAM being shifted into the pool. The larger the share of pooled memory relative to local memory, the greater the DRAM savings; the additional savings from increasing the number of sockets in a pool, however, fall off quickly.
While a larger pool spanning more sockets may seem like the obvious choice, there are performance and latency implications. With a smaller pool of 4 to 8 CPU sockets, no retimers are needed, which cuts the added latency from 87ns to 67ns. At these smaller socket counts, the EMC can also connect directly to every CPU socket.
The larger 32-socket pool connects each EMC to a different subset of CPUs. This allows sharing across a larger number of CPU sockets while keeping the number of EMC devices per CPU port fixed. Retimers are required here, adding about 10ns of latency in each direction.
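The two pool sizes differ in added latency by exactly the retimer cost, which is easy to check against the numbers quoted above:

```python
# Added latency over local DRAM, using the figures quoted in the article.
base_added_ns = 67          # CXL controller/PHY, propagation, EMC (no retimers)
retimer_ns_each_way = 10    # retimers needed for the 32-socket topology
print(base_added_ns + 2 * retimer_ns_each_way)   # 87 ns, the larger-pool figure
```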
On the software side, the solution is quite clever.
Microsoft frequently deploys multi-socket systems. In most cases, VMs are small enough that their cores and memory fit entirely on a single NUMA node. Azure's hypervisor tries to place all of a VM's cores and memory on one NUMA node, but in rare cases (about 2% of the time) a VM ends up with some of its resources spread across sockets. This is not exposed to the user.
Memory pooling works the same way functionally. The pooled memory device is exposed as a zero-core virtual NUMA node, a "zNUMA" node with no cores, only memory. Allocations are steered away from this zNUMA node by default, but spillover into it is allowed. Memory is managed in 1GB slices.
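A minimal sketch of that allocation behavior, assuming a VM that sees one ordinary NUMA node plus one memory-only zNUMA node and hands out memory in 1GB slices (this models the described policy, not Azure's hypervisor code):

```python
# Toy model of "prefer local, spill to zNUMA": a VM's memory demand is served
# in 1 GiB slices from the local node first, then from the memory-only zNUMA node.

def place_slices(demand_gib: int, local_gib: int, znuma_gib: int) -> dict:
    local = min(demand_gib, local_gib)
    spill = min(demand_gib - local, znuma_gib)
    if demand_gib - local - spill > 0:
        raise MemoryError("demand exceeds local + zNUMA capacity")
    return {"local_gib": local, "znuma_gib": spill}

print(place_slices(demand_gib=48, local_gib=32, znuma_gib=32))
# {'local_gib': 32, 'znuma_gib': 16}
```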
The distributed-systems software layer relies on a prediction of each VM's memory latency sensitivity. Untouched memory is called "frigid memory," and Azure estimates that the median (50th-percentile) VM has 50% frigid memory. That figure seems suspiciously round. VMs predicted to be insensitive to memory latency are backed entirely by pool DRAM; for latency-sensitive VMs, a zNUMA node is configured to hold only their frigid memory. Predictions are made at VM deployment time, monitored asynchronously, and VM placement is corrected when mispredictions are detected.
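Expressed as a sketch, the placement policy looks roughly like the following; the decision structure follows the description above, while the function name and inputs are invented stand-ins for Azure's actual predictors:

```python
# Sketch of the described placement policy. The latency-sensitivity flag and
# frigid-memory fraction would come from Azure's ML predictors; here they are
# simply passed in.

def plan_vm_memory(mem_gib: int, latency_sensitive: bool, frigid_fraction: float) -> dict:
    if not latency_sensitive:
        # VMs predicted to be insensitive are backed entirely by pool DRAM.
        return {"local_gib": 0, "pool_gib": mem_gib}
    # Latency-sensitive VMs keep hot memory local; only predicted-frigid
    # memory goes to the zNUMA (pooled) node.
    pooled = int(mem_gib * frigid_fraction)
    return {"local_gib": mem_gib - pooled, "pool_gib": pooled}

print(plan_vm_memory(64, latency_sensitive=True, frigid_fraction=0.5))
# {'local_gib': 32, 'pool_gib': 32}, matching the 50% frigid-memory estimate
```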
The accuracy of these algorithms is critical to saving infrastructure costs. If not done correctly, the performance impact can be significant.
Given how significant the potential performance impact can be, indiscriminately moving a cloud workload's resident memory into a pool that adds 67ns to 87ns of latency would be a poor choice.
Therefore, Microsoft benchmarked 158 workloads in two scenarios: one with only local DRAM as the control, and one with emulated CXL memory. It should be noted that, despite Intel's earlier claims that its CXL-enabled Sapphire Rapids platform would be available by the end of 2021, and then in early 2022, no such silicon was in hand, so Microsoft ran these tests on a 2-socket, 24-core Skylake-SP system.
On this system, local memory access latency is 78ns and memory bandwidth exceeds 80GB/s. When one CPU accesses the other CPU's memory across the NUMA boundary, it incurs an additional 64ns of latency, very close to the 67ns of extra latency expected from the external memory controller (EMC) in low-socket-count pools. Cross-NUMA access therefore serves as a reasonable stand-in for CXL-attached pool memory.
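Because the cross-socket penalty (64ns) lands so close to the projected EMC penalty (67ns), pool memory can be approximated on today's two-socket hardware by running a workload's threads on one socket while binding its memory to the other. On Linux this is commonly done with numactl; the workload binary below is a placeholder, and the article does not state that Microsoft used this exact tool.

```python
# Approximate CXL pool latency on a 2-socket box: pin compute to NUMA node 0
# and force all allocations onto node 1. "./workload" is a placeholder.
import subprocess

subprocess.run(["numactl", "--cpunodebind=0", "--membind=1", "./workload"], check=True)
```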
Twenty percent of the workloads saw no performance impact, and another 23% experienced less than a 5% slowdown. However, 25% of workloads suffered serious slowdowns of more than 20%, and 12% degraded by more than 30%. Depending on how much of a workload's memory sits in local versus pooled DRAM, these numbers can shift considerably.
This further underscores the importance of the predictive models. Microsoft's random-forest-based ML predictions are more accurate and produce fewer false positives, that is, fewer VMs unexpectedly slowed down. The more memory is pooled, the more important these predictions become.
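For illustration, a generic random-forest classifier of the kind mentioned above might look like this; the features, labels, and training data are entirely made up, and Microsoft's actual feature set and pipeline are not described here:

```python
# Illustrative only: a random forest that flags latency-sensitive VMs from
# coarse deployment-time features. Features and labels are invented.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Hypothetical features: [vCPU count, memory GiB, historical bandwidth score]
X = rng.random((1000, 3))
y = (X[:, 2] > 0.6).astype(int)   # pretend high-bandwidth VMs are sensitive

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(model.predict([[0.2, 0.5, 0.9]]))  # expect [1]: treat this VM as sensitive
```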
As the CXL specification improves, latency drops, and the predictive models get better, the potential memory-pooling savings could grow to double-digit percentages of cloud server costs.