“The Blackwell platform delivered up to 4x more performance than the Hopper architecture on MLPerf’s biggest LLM [large language model] workload, Llama 2 70B, thanks to its use of a second-generation transformer engine and FP4 tensor cores,” according to the company. “The H200 Tensor Core GPU delivered leading results on every test in the data center category, including the latest addition to the benchmark: the 56-billion-parameter Mixtral 8x7B MoE [mixture of experts] LLM.”
Mixture of experts?
MoEs have gained popularity as a way to bring more versatility to LLM deployments, said Nvidia, as they are capable of answering a variety of questions and performing more diverse tasks in a single deployment. “They’re also more efficient since they only activate a few experts per inference, meaning they deliver results much faster than dense models of a similar size.”
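To illustrate the sparse-activation idea behind that efficiency claim, here is a minimal, hypothetical Python sketch of top-k expert routing. The gating network, expert count and layer sizes are toy values chosen for illustration, not Nvidia’s or Mixtral’s actual implementation.

```python
# Minimal sketch of mixture-of-experts routing (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

NUM_EXPERTS = 8   # Mixtral 8x7B routes across 8 experts
TOP_K = 2         # only a few experts are activated per token
D_MODEL = 16      # toy hidden size for illustration

# Each "expert" is a small feed-forward layer; a dense model would run all of them.
experts = [rng.standard_normal((D_MODEL, D_MODEL)) for _ in range(NUM_EXPERTS)]
gate_w = rng.standard_normal((D_MODEL, NUM_EXPERTS))

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route a single token vector to its top-k experts and mix their outputs."""
    logits = x @ gate_w                     # gating score for each expert
    top = np.argsort(logits)[-TOP_K:]       # indices of the k best-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                # softmax over the selected experts only
    # Only TOP_K of NUM_EXPERTS weight matrices are used, which is why an MoE
    # runs faster at inference than a dense model of similar total size.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.standard_normal(D_MODEL)
print(moe_layer(token).shape)  # (16,)
```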
Blackwell GPUs can be operated in clusters of up to 72 using Nvidia’s NVLink and NVSwitch hardware.
Its ‘GB200 NVL72’ cluster connects 36 Grace CPUs and 72 Blackwell GPUs in a pair of liquid-cooled racks. The GPUs “can act as a single massive GPU to deliver 30x faster real-time trillion parameter LLM inference than the prior generation”, the company said.