Customers Will Demand Near Real-Time Performance When Using LLMs
There is a lot of confusion around inference with LLMs and its impact on networking. Users expect real-time answers, similar to what they get from search platforms. However, LLMs do not produce results the way search platforms do: they generate a response one token at a time, and each token requires a pass through the model. With the variety of LLMs in the market and their growing parameter counts, LLM inference will remain a multi-GPU, multi-server problem if it is to provide a high-quality experience.
Training vs. Inference Background
Training and inference have different compute requirements, but both put significant demands on networking. Training is a deeply compute-bound workload that requires high-performance RDMA networking to ingest petabytes of training data, distribute it across a GPU cluster, perform all-to-all model weight reduction operations, and regularly write checkpoints to storage. It runs as synchronous bursts of a unified workload across tens of thousands of distributed GPUs, so a high-performance network is critical to sustain computational throughput and minimize model training time, which can stretch into months.
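To give a sense of why checkpointing alone stresses the network and storage, here is a back-of-envelope sketch, not a measurement. It assumes a 405B-parameter model, the common rule of thumb of roughly 16 bytes of state per parameter for mixed-precision training with an Adam-style optimizer, and an illustrative 30-minute checkpoint interval; all of these figures are assumptions, not values from this article.

```python
# Back-of-envelope: checkpoint size and sustained write bandwidth for a large training run.
# Assumptions (illustrative): 405B parameters, ~16 bytes of state per parameter
# (16-bit weights + fp32 master copy + Adam moments), checkpoint every 30 minutes.
PARAMS = 405e9
BYTES_PER_PARAM_STATE = 16          # rough rule of thumb for mixed-precision Adam
CHECKPOINT_INTERVAL_S = 30 * 60

checkpoint_bytes = PARAMS * BYTES_PER_PARAM_STATE
avg_write_gbps = checkpoint_bytes * 8 / CHECKPOINT_INTERVAL_S / 1e9

print(f"Checkpoint size: {checkpoint_bytes / 1e12:.1f} TB")                      # ~6.5 TB
print(f"Average write rate to finish within one interval: {avg_write_gbps:.0f} Gb/s")  # ~29 Gb/s
```

In practice the checkpoint is written as a burst that should complete much faster than the interval so GPUs are not left stalled, so peak demand on the storage network is considerably higher than this average.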
By contrast, a single inference workload is less compute-intensive; however, it requires high-performance networking to achieve low latency and to serve thousands of concurrent users efficiently on shared, distributed GPU cluster infrastructure. Time to first output token matters to users, so high-performance ingest of prompt tokens, model weights, and user context drives demand on the network. Because interaction and prompt refinement happen at slow human timescales, efficient infrastructure utilization requires continuously swapping user context across the classical memory hierarchy. There are three critical elements to efficient, low-latency inferencing:
- Large-scale single-node memory, achieved via a tightly coupled, cache-coherent CPU-GPU chip-to-chip interconnect. This enables transparent spilling of context between fast but limited GPU HBM and slower but more abundant, power-efficient LPDDR5 CPU memory (a rough sizing sketch follows this list).
- NVLink cache-coherent scale-up connectivity, enabling large models and many user contexts to be transparently distributed across many CPU-GPU nodes.
- High-performance, low-latency RDMA networking, needed to provide an auto-scaling, scaled-out distributed inference platform and efficient loading and storing of context data to memory and storage.
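To make the memory pressure concrete, here is a minimal sizing sketch for the per-user context (KV cache) of a large decoder-only model. The model dimensions loosely follow published Llama 3.1 405B figures (126 layers, 8 KV heads of dimension 128 with grouped-query attention); the precision, context length, and concurrent-user count are illustrative assumptions.

```python
# Minimal sketch: per-user KV-cache size and aggregate context footprint.
# Model dims loosely follow published Llama 3.1 405B figures; precision,
# context length, and user count are illustrative assumptions.
LAYERS = 126
KV_HEADS = 8          # grouped-query attention
HEAD_DIM = 128
BYTES_PER_ELEM = 2    # 16-bit KV cache (assumption)
CONTEXT_TOKENS = 32_768
CONCURRENT_USERS = 128

per_user_bytes = 2 * LAYERS * KV_HEADS * HEAD_DIM * CONTEXT_TOKENS * BYTES_PER_ELEM
total_bytes = per_user_bytes * CONCURRENT_USERS

print(f"KV cache per 32k-token user context: {per_user_bytes / 1e9:.1f} GB")  # ~16.9 GB
print(f"Total for {CONCURRENT_USERS} users: {total_bytes / 1e12:.2f} TB")     # ~2.2 TB
```

Even with grouped-query attention keeping per-user state comparatively small, the aggregate working set of a busy service quickly outgrows the HBM of a few GPUs, which is why transparent spilling to CPU memory and fast scale-up/scale-out fabrics matter.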
Model Size Will Continue to Increase
LLMs come in different sizes. For example, Meta’s Llama 3.1 is available in 405B, 70B, and 8B parameter variants, and the industry will see 1T (trillion) parameter models shortly. In general, larger models give better answers and a better user experience. A fair view is that as a model grows, it moves from generalist to specialist.
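A quick weight-memory calculation shows why the larger variants are inherently a multi-GPU problem. The precision and per-GPU HBM capacity below are assumptions for illustration (16-bit weights, 80 GB of HBM per GPU), and KV cache, activations, and framework overhead are ignored, so real deployments need more.

```python
# Sketch: how many GPUs are needed just to hold the weights of each Llama 3.1 variant.
# Assumes 16-bit (2-byte) weights and 80 GB of HBM per GPU; KV cache, activations,
# and framework overhead are ignored, so real deployments need more.
import math

HBM_PER_GPU_GB = 80        # assumption; varies by GPU generation
BYTES_PER_PARAM = 2        # 16-bit weights

for name, params_b in [("8B", 8), ("70B", 70), ("405B", 405)]:
    weight_gb = params_b * 1e9 * BYTES_PER_PARAM / 1e9
    gpus = math.ceil(weight_gb / HBM_PER_GPU_GB)
    print(f"Llama 3.1 {name}: ~{weight_gb:.0f} GB of weights -> at least {gpus} GPU(s)")
```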
Real-Time is Key to Satisfactory Human Experience
Operators need to optimize both performance and user experience. While a larger model may fit in a single high-end GPU, token throughput will be limited and the human user will see lag. A good rule of thumb: under 10 tokens/s is slower than typical human reading and will not feel real-time; at 40-60 tokens/s, results reach the user in real time. To optimize user experience and performance, operators must run LLMs across multiple GPUs.
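To put the tokens/s rule of thumb in wall-clock terms, the snippet below shows how long a user waits for a moderately long answer at different generation rates. The 500-token answer length is an illustrative assumption; the thresholds simply mirror the rule of thumb above.

```python
# Sketch: wall-clock time to stream a 500-token answer at different generation rates.
# The 500-token answer length is an illustrative assumption; the thresholds mirror the
# rule of thumb above (<10 tok/s lags the reader, 40-60 tok/s feels real-time).
ANSWER_TOKENS = 500

for rate in (5, 10, 40, 60):   # tokens/s
    seconds = ANSWER_TOKENS / rate
    feel = "lags the reader" if rate < 10 else ("real-time" if rate >= 40 else "borderline")
    print(f"{rate:>2} tok/s -> {seconds:5.1f} s for a {ANSWER_TOKENS}-token answer ({feel})")
```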
Networking is Key
The network is critical to stitching these GPUs together. To meet user-experience and performance requirements, LLMs will need all-to-all connectivity and powerful interconnects. Emphasis on the “s,” as most LLM deployments benefit from multiple fabrics. In an NVIDIA deployment, this is a mix of an Ethernet/InfiniBand back-end network and NVLink as the two fabrics. Running with just the Ethernet/InfiniBand network and relying on the server’s PCIe interconnect results in slower processing times. Performance increases significantly when both Ethernet/InfiniBand and NVLink (soon NVSwitch spanning multiple servers) are used, allowing more concurrent queries against the LLM at a lower cost to the operator. The combination of networks built for different goals is what enables the performance advantage: without one all-to-all fabric and one non-blocking fabric, more cycles would be spent transporting packets instead of on the actual LLM workload.
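For a rough sense of why the GPU-to-GPU fabric matters, the sketch below estimates per-GPU all-reduce traffic for Megatron-style tensor parallelism, where each transformer layer performs two all-reduces of the hidden activations per generated token. The model dimensions loosely follow Llama 3.1 405B; the degree of tensor parallelism, activation precision, ring all-reduce cost model, and token rate are assumptions.

```python
# Sketch: per-GPU interconnect traffic for tensor-parallel decoding.
# Assumptions: Megatron-style tensor parallelism with 2 all-reduces of the hidden
# activations per layer per generated token, ring all-reduce moving 2*(N-1)/N of the
# buffer per GPU, 16-bit activations, model dims loosely from Llama 3.1 405B.
LAYERS = 126
HIDDEN = 16_384
TP = 8                     # tensor-parallel GPUs (assumption)
BYTES_PER_ELEM = 2         # 16-bit activations
ALLREDUCES_PER_LAYER = 2   # one after attention, one after the MLP
TOKENS_PER_S = 50          # per-user generation rate (assumption)

buffer_bytes = HIDDEN * BYTES_PER_ELEM                 # one token's hidden activations
per_token_per_gpu = (LAYERS * ALLREDUCES_PER_LAYER
                     * 2 * (TP - 1) / TP * buffer_bytes)
gbit_per_s_per_user = per_token_per_gpu * TOKENS_PER_S * 8 / 1e9

print(f"Per-GPU traffic per generated token: {per_token_per_gpu / 1e6:.1f} MB")          # ~14.5 MB
print(f"Per-GPU traffic for one user at {TOKENS_PER_S} tok/s: {gbit_per_s_per_user:.1f} Gb/s")  # ~5.8 Gb/s
```

When decode is batched across many concurrent users, these activation buffers (and thus the traffic) scale with the batch size, which is roughly why a dedicated NVLink-class scale-up fabric pays off over a shared PCIe path.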
Where the Market Goes in the Next 12-18 Months
As users consume more LLMs via cloud operators, SaaS providers, and consumer devices, the industry should expect LLMs to be served from large cluster deployments rather than single GPUs. From a networking perspective, this will show up as multiple networks in each rack, with the amount of networking bandwidth per server increasing significantly. For example, most large LLM deployments will have an Ethernet front-end network, an Ethernet/InfiniBand back-end network, and an NVLink/UALink/other GPU fabric. These multiple networks are key to unlocking the value of LLMs and GPUs and to accelerating AI adoption.