Date: 2022-08-17 10:54:58
The biggest difference between the DPU and the earlier GPU and AI chips is that the DPU is an integrated acceleration platform: it combines acceleration for multiple domains in a single chip. If GPU and AI acceleration chips represent the trend of splitting computing into single CPU+xPU heterogeneous pairs, then the emergence of the DPU signals that the whole computing system is gradually moving from single, separated heterogeneous computing toward the integration of multiple heterogeneous engines.
Of course, the DPU is just the beginning; the fuller integration that follows is analyzed in detail in this article.
1 Characteristics of data center systems (compared with terminal systems)
1.1 Separation of software services from the hardware platform
Data center software has one very important requirement: high availability (HA). For example, the high availability of back-end services is achieved through load balancers, and the load balancer itself achieves its own high availability through a clustering mechanism. Underlying VMs achieve high availability through live migration, and containers achieve it by automatically pulling up new instances in a new environment.
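As a minimal sketch of the container-style pattern just described (replace failed instances in a new environment), the following Python is purely illustrative; Node, Instance, and schedule are hypothetical names, not any real orchestrator's API.

```python
# Illustrative only: a supervisor pass that keeps a service highly available by
# replacing failed instances on healthy nodes. Hypothetical types, not a real API.
import random
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    healthy: bool = True

@dataclass
class Instance:
    service: str
    node: Node
    alive: bool = True

def schedule(service: str, nodes: list[Node]) -> Instance:
    """Place a new instance of `service` on any currently healthy node."""
    return Instance(service, random.choice([n for n in nodes if n.healthy]))

def supervise(instances: list[Instance], nodes: list[Node]) -> None:
    """One reconciliation pass: pull up a fresh instance wherever one has died."""
    for i, inst in enumerate(instances):
        if not inst.alive or not inst.node.healthy:
            instances[i] = schedule(inst.service, nodes)

nodes = [Node("node-a"), Node("node-b"), Node("node-c")]
instances = [schedule("web-backend", nodes) for _ in range(3)]
nodes[0].healthy = False      # a host fails
supervise(instances, nodes)   # the service keeps three live instances on the remaining nodes
```

The upper layer only sees that three instances of the service are always alive; which physical node they land on is invisible to it, which is exactly the decoupling described next.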
If we focus on the underlying software and hardware, we find that the actually running software and the hardware are completely decoupled (setting aside the virtualization layer, which sits between software and hardware and is realized through hardware-software co-design).
A software entity can run on different hardware platforms and be migrated live; the upper-layer software sees it as always highly available and never notices changes in the underlying hardware (disk failures, host crashes, server replacements, and so on).
Conversely, a single complete hardware platform can be flexibly carved into many virtual hardware platforms through virtualization, each supporting a different software entity.
1.2 Single-server virtualization with multiple systems and tenants; multiple cluster systems cross-mixed across the data center
As discussed before, systems can be divided into four categories. Compared with intelligent terminal systems, the defining features of cloud and edge data center systems are virtualization and servitization. The biggest differences between data center servers and terminal devices are as follows.
First, the coexistence of multiple tenants' systems on a single hardware platform is achieved through virtualization. The hardware platform must support virtualization and high scalability, providing multiple personalized virtual hardware platforms, and then support many different, even wildly different, software systems running on their own independent virtual hardware platforms.
Second, a terminal device can be treated as a single system, while a data center server is a mix of multiple systems. We can therefore define a server that supports virtualized, multi-system, multi-tenant operation as a macro system.
If a single server is a multi-tenant macro system, then an entire data center, or several data centers connected together, forms a mega-cluster of hundreds of thousands or even millions of servers; the systems running on it form a mega-macro system in which the cluster systems of many different tenants are mixed together and run in parallel.
1.3 Consistency of physical hardware and diversity of virtual "hardware"
Many engineers who work on hardware can't resist hardware "innovation". For example:
Connecting different numbers of CPUs, GPUs, storage disks, network cards, and other I/O devices through PCIe switches to form different types of hardware servers.
Freely combining multiple computing nodes of different specifications and performance levels through powerful capabilities such as smart NICs and DPUs.
Enhancing system capabilities by extending ToR switch functions, moving much of the DPU's work into the ToR switch to realize a smart-ToR or ToR-DPU approach.
Various other hardware innovation projects.
But my personal view has always been that hardware should be as simple as possible, with software providing all the complex diversity. For example, a data center should not carry servers and network equipment of many different grades and specifications; it should be boiled down, as simply and clearly as possible, to two types of physical equipment: compute nodes and core network equipment.
A compute node, i.e. a server, has the computation and processing of various kinds of data as its core function; its network functions should, as far as possible, be limited to high-performance network input and output.
A switch, as an efficient core network device, focuses on network-related processing, gets involved in the user's computation as little as possible, and stays transparent to it.
When the AWS website describes the benefits of the Nitro System, the first benefit it lists is "faster innovation":
The Nitro System brings together a wide variety of building blocks that can be combined in different ways, giving us the flexibility to design and quickly deliver EC2 cloud server instance types with ever-expanding compute, storage, memory, and networking options. This innovation also enables bare-metal instances where customers can use their own hypervisor or no hypervisor at all.
Let me explain what this means. With the Nitro System, AWS first eliminates the various software overheads of virtualization by moving virtualization entirely into hardware acceleration. It then becomes easy to fully virtualize a server's CPU, accelerator, memory, I/O, and other resources, recombine them at will, and quickly and efficiently offer users virtual machine instances of all shapes and sizes: compute-optimized, memory-optimized, storage-optimized, network-optimized, GPU/FPGA/DSA-accelerated, and so on.
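To make the "recombine at will" point concrete, here is a hedged sketch: once virtualization costs are offloaded to hardware, an instance type is little more than a named slice of pooled resources. The shapes and numbers below are invented for illustration and are not AWS's actual Nitro interfaces or instance specifications.

```python
# Illustrative only: instance types as recombinations of pooled server resources.
from dataclasses import dataclass

@dataclass(frozen=True)
class InstanceShape:
    vcpus: int
    memory_gib: int
    local_ssd_gib: int
    network_gbps: int
    accelerators: int = 0

# The same physical server resources, carved into differently "optimized" virtual shapes.
CATALOG = {
    "compute-optimized": InstanceShape(16, 32, 0, 12),
    "memory-optimized":  InstanceShape(8, 128, 0, 12),
    "storage-optimized": InstanceShape(8, 64, 1900, 25),
    "gpu-accelerated":   InstanceShape(8, 64, 0, 25, accelerators=1),
}

def provision(kind: str) -> InstanceShape:
    """Look up a shape; a real control plane would also reserve host capacity."""
    return CATALOG[kind]

print(provision("memory-optimized"))
```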
To summarize, from the perspective of a cloud service provider, a CSP wants the data center network architecture to be as simple and controllable as possible and the server hardware specifications to be as simple and consistent as possible (so that operation and maintenance is easiest and system stability is highest), and then uses (software) virtualization mechanisms to implement a variety of virtual hardware platforms that support VMs and containers.
2 Data center processors: from combined to divided, and then from divided to combined
2.1 A closer look at the von Neumann architecture
The von Neumann architecture is the classic computing architecture, from which we get the three main components of computing: the computation unit, the storage unit, and the input/output (I/O) unit.
2.2 Phase 1: CPU single computing platform
As shown in the figure above, green represents the computation unit plus the storage unit, where the computation unit is the CPU. To simplify the system analysis, we omit the storage unit and treat it as attached to the computation unit by default.
2.3 Phase 2: From combined to separate; CPU + other computing chips form a heterogeneous computing platform
With Moore's Law faltering for CPUs, CPU performance now improves very slowly, at less than 3% per year, so doubling it would take more than 20 years. The demand for performance, however, keeps rising, so CPU+xPU heterogeneous computing is gradually moving to center stage.
No xPU, including accelerators such as GPUs and AI DSAs, can work alone; each requires a host CPU, forming the CPU+xPU heterogeneous computing approach to complete the computation.
Problems with a single CPU+xPU heterogeneous computing system itself:
The acceleratable part is a limited proportion of the whole system; for example, with an 80% accelerable share, the overall speedup is capped at 5x (the Amdahl's Law bound is worked out after this list).
Data is carried back and forth between the CPU and the accelerator, which discounts the acceleration ratio; in some scenarios the combined acceleration is not significant.
Heterogeneous acceleration explicitly introduces new entities: the computation becomes an explicit collaboration between two or more entities, which increases the complexity of the overall system.
Although GPU performance is much better than CPU performance, there is still a significant gap to DSA/ASIC performance; the problem with DSAs/ASICs, in turn, is that they cannot adapt to the business flexibility that complex scenarios demand, which creates a huge barrier to large-scale deployment.
The CPU+xPU architecture is CPU-centric, with a long I/O path throughout, so I/O becomes a performance bottleneck.
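The 5x cap in the first bullet is just Amdahl's Law. With an accelerable fraction p of the workload and an accelerator speedup s on that fraction, the overall speedup stays bounded no matter how large s becomes:

$$\text{Speedup}(s) = \frac{1}{(1-p) + p/s} \;\xrightarrow{\;s \to \infty\;}\; \frac{1}{1-p} = \frac{1}{1-0.8} = 5.$$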
If multiple CPU+xPU heterogeneous computing systems are combined, new problems appear:
Essentially, each CPU+xPU pair is an island; communication between different xPUs is cumbersome, always requires CPU involvement, and is therefore inefficient and low-performance.
Within the physical space of a server, usually only one type of accelerator card can be installed; there simply is not enough space for so many types and quantities of accelerator cards, and the server's power budget does not allow for them either.
The CPU-centric architecture requires CPU participation in every xPU interaction (P2P transfers can relieve some of the CPU pressure, but crossing at least two buses is still inefficient, and the essential problem remains unsolved). The CPU is the linchpin of the entire system, and because its performance improves slowly, it becomes the system-wide bottleneck and drags overall performance down significantly.
2.4 Phase 3: The starting point of moving from separate back to combined; a heterogeneous computing platform of DPU + other computing chips
Let's look at the DPU-centric architecture again.
A clarification is needed: a CPU-centric architecture is essentially control-centric, and a DPU-centric architecture is not automatically data-centric either; if the DPU is still implemented as a traditional SoC, it remains essentially control-centric. To be truly data-centric, the entire system architecture needs to be overhauled.
Let's look at which specific problems a DPU-centric mix of multiple heterogeneous computing systems actually solves.
There is a saying that the DPU is just a switch that performs no specific function and merely connects the many PUs. In that case the DPU would actually do nothing, and such a DPU would be meaningless.
The consensus view of the DPU is that it needs to accelerate virtualization, networking, storage, and security processing. In this way the DPU itself takes on the computation over large volumes of I/O data, reducing the burden on the CPU and significantly improving the performance of the entire system.
The emergence of the DPU marks a gradual move from separation back to consolidation: multiple CPU+xPU heterogeneous accelerators are gradually integrated into a single processing chip. It heralds the evolution of the server big-chip system from the single-CPU stage, through the stage of splitting into CPU+xPU heterogeneous computing, toward a new stage in which multiple CPU+xPU systems are continuously merged into one.
What problems does the DPU not solve?
Computational tasks include not only I/O-type processing but also other system-level and even application-level computation. If the DPU is treated as a comprehensive computing acceleration platform, it can continue to integrate more acceleration functions.
The DPU replaces the CPU as the central node, "the dragon-slaying boy becomes the dragon": communication among the CPU, GPU, and other xPUs is still troublesome, and each remains a silo.
If the underlying architecture of the whole system is not updated, the DPU-centric architecture is still essentially control-centric rather than data-centric, and still cannot deliver an order-of-magnitude improvement in overall data throughput and computation.
The problem of limited physical space remains, especially for 2U or 1U servers that still need sufficient computing power. And as green data centers become the norm, the power constraint on a single server will only grow tighter. At that point, the problem of separate CPUs, GPUs, and various independent accelerators urgently needs to be solved.
Changing from a CPU-centric to a DPU-centric architecture makes the DPU the linchpin, the key "bottleneck" of the system, and an unbearable burden on it.
2.5 Phase 4: From Division to Integration: A More Efficient Converged Computing Platform
In this new stage there is no so-called central node; the CPU, GPU, and DPU (the DPU itself can be seen as a collection of multiple xPUs) are integrated into a single chip. This chip with many fused functions is the hyper-heterogeneous computing chip (HPU, Hyper-heterogeneous Processing Unit) we have been referring to.
The HPU can be thought of as a fusion of CPU+GPU+DPU, but it cannot simply be seen as packaging the three together; the HPU needs to decouple the functions of the CPU, GPU, and DPU, restructure the whole system, and form a new data-centric, dataflow-driven computing architecture.
It is important to emphasize that the hyper-heterogeneous convergence chip should not be confused with the hyper-convergence often talked about in cloud computing.
Hyper-convergence is about compressing the large clusters behind cloud IaaS services into small-scale clusters, making them easier to deploy in private and enterprise clouds.
The hyper-heterogeneous convergence chip, on the other hand, emphasizes optimizing the entire system stack: the system running on the server is consolidated into a single chip whose multiple engines mix with high efficiency and performance. Such a chip can support both hyper-converged and non-converged deployments (i.e., extreme disaggregation, with very many tenants and very many systems coexisting).
We can simply divide the system into two planes:
Control and management plane: the software that still runs on the CPU.
Computation and data plane: here the CPU, GPU, other xPUs, and even the I/O can all be treated as peer computing engines; each does the work it is best at, and together they interact fully to form a more efficient, higher-performance whole (a conceptual sketch follows).
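A conceptual sketch of this split, assuming nothing about any vendor's hardware: the control plane (on CPU cores) wires peer engines into a pipeline once, and on the data plane the data then streams engine to engine without detouring through a central processor at every step. The engine names are placeholders.

```python
# Illustrative only: peer engines composed on the control plane, data streamed
# engine-to-engine on the data plane. Placeholder engines, not a real driver API.
from typing import Callable

Engine = Callable[[bytes], bytes]             # each engine transforms data and passes it on

def crypto_engine(buf: bytes) -> bytes:       # e.g. security offload
    return buf

def compression_engine(buf: bytes) -> bytes:  # e.g. storage offload
    return buf

def ai_engine(buf: bytes) -> bytes:           # e.g. GPU / AI-DSA
    return buf

def build_pipeline(engines: list[Engine]) -> Engine:
    """Control-plane step: compose the engines once, ahead of time."""
    def run(buf: bytes) -> bytes:
        for engine in engines:                # data-plane step: data flows peer to peer
            buf = engine(buf)
        return buf
    return run

pipeline = build_pipeline([crypto_engine, compression_engine, ai_engine])
pipeline(b"incoming packet")
```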
3 Background conditions for big chip convergence
3.1 Condition 1: More than 90% of server systems are relatively lightweight and can fit on a single chip
Heavyweight scenarios require separate CPU, GPU, and DPU chips, while lightweight scenarios can use a standalone single-chip fusion solution that, within the same die area as a traditional CPU chip, offers the possibility of an order-of-magnitude performance improvement and can cover the computing power and complexity that lightweight systems require.
Lightweight scenarios also account for roughly 90% of all servers (a rough tally follows the list below):
Edge servers. Data centers include cloud data centers and edge data centers, and according to relevant analyses, edge computing will in the future account for 80% of the entire data center scale.
Enterprise-class servers. Enterprise cloud scenarios need to support virtualization but generally do not need multi-tenancy, and their demand for computing power is lower than in the public cloud; this, too, is a lightweight scenario.
Cloud data center servers. These fall broadly into two categories: heavyweight business servers, and lightweight resource-pooling servers for storage and other services. The lightweight resource-pooling scenarios can be covered by a single hyper-heterogeneous chip.
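A rough tally of the ~90% figure, reading the author's numbers: if edge servers make up about 80% of the total, and lightweight enterprise servers plus resource-pooling cloud servers contribute roughly another 10% (an illustrative split of the remaining 20%, not a sourced statistic), the lightweight share comes to about 90%:

$$\underbrace{80\%}_{\text{edge}} \;+\; \underbrace{\sim 10\%}_{\text{enterprise + pooled-resource cloud}} \;\approx\; 90\%.$$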
In addition, a standalone hyper-heterogeneous convergence chip can also play the role of a DPU, working alongside a CPU and GPU.
3.2 Condition 2: Chiplet technology is mature, enabling a single chip to cover heavyweight scenarios
Chiplet technology makes it possible to scale the system up quickly, so that we can provide a larger-scale hyper-heterogeneous fusion chip that covers heavyweight computing scenarios, such as typical cloud business-computing servers and heterogeneous-computing servers.
As a result, a single hyper-heterogeneous computing chip (whether a single-die chip or a chiplet-based chip composed of multiple dies) can cover all the complex computing scenarios of cloud-network-edge-terminal convergence (the distinguishing marks of complex computing being virtualization and servitization).
4 Convergence, the trend of big chips
4.1 Case 1: NVIDIA BlueField DPU integrating the GPU
Figure: NVIDIA DPU roadmap
According to NVIDIA's DPU roadmap, NVIDIA plans to integrate the DPU and GPU into a single chip starting with the fourth-generation BlueField.
With the NVIDIA DPU and GPU integrated, the standalone Grace CPU already in place, and chiplet technology already mature, how far away is integrating the CPU as well to form a CPU+GPU+DPU hyper-heterogeneous chip? (Not far, because it already exists on the autonomous driving side.)
4.2 Case 2: NVIDIA's Atlan hyper-heterogeneous fusion chip for autonomous driving
Figure: NVIDIA's self-driving chip Atlan, planned for 2024
The Atlan chip, due in 2024, will fully integrate an Arm Neoverse-series Grace CPU (NVIDIA's data center CPU), possibly a Hopper-architecture GPU (NVIDIA's data center GPU), and a BlueField DPU (NVIDIA's data center DPU), reaching 1000 TOPS on a single chip.
So we can see that a complete hyper-heterogeneous fusion chip mixing multiple processing engines has already been achieved in the autonomous driving space, and Atlan uses the same processing engine architectures as the data center, allowing seamless cloud-edge-terminal collaboration and even fusion.
From quantitative to qualitative change: as the number of integrated units grows, as performance demands rise, and as system complexity and the need for general, flexible programmability increase, Atlan requires a new architecture and a new round of integration and restructuring.