New MLPerf Training Benchmark Results Highlight Hardware and Software Innovations in AI Systems

Two new benchmarks added - highlighting language model fine-tuning and classification for graph data.

SAN FRANCISCO - Today, MLCommons® announced new results for the MLPerf® Training v4.0 benchmark suite, including first-time results for two benchmarks: LoRA fine-tuning of Llama 2 70B and graph neural network (GNN) node classification.

MLPerf Training v4.0

The MLPerf Training benchmark suite comprises full system tests that stress machine learning (ML) models, software, and hardware for a broad range of applications. The open-source and peer-reviewed benchmark suite provides a level playing field for competition that drives innovation, performance, and energy efficiency for the entire industry.

MLPerf Training v4.0 includes over 205 performance results from 17 submitting organizations: ASUSTeK, Dell, Fujitsu, Giga Computing, Google, HPE, Intel (Habana Labs), Juniper Networks, Lenovo, NVIDIA, NVIDIA + CoreWeave, Oracle, Quanta Cloud Technology, Red Hat + Supermicro, Supermicro, Sustainable Metal Cloud (SMC), and tiny corp.

MLCommons would like to especially welcome first-time MLPerf Training submitters Juniper Networks, Oracle, SMC, and tiny corp.

Congratulations to first-time participant SMC for submitting the first-ever set of power results for MLPerf Training. These results highlight the impact of SMC's immersion cooling solutions for data center systems. Our industry-standard power measurement works with MLPerf Training and is the first and only method to accurately measure full system power draw and energy consumption for both cloud and on-premises systems in a trusted and consistent fashion. These metrics are critical for the entire community to understand and improve the overall efficiency of training ML models, which will ultimately reduce the energy use and environmental impact of AI in the coming years.

The Training v4.0 results demonstrate broad industry participation and showcase substantial performance gains in ML systems and software. Compared to the last round of results six months ago, this round brings a 1.8X speed-up in training time for Stable Diffusion. Meanwhile, the best results in the RetinaNet and GPT-3 tests are 1.2X and 1.13X faster, respectively, thanks to performance scaling at increased system sizes.

“I’m thrilled by the performance gains we are seeing especially for generative AI,” said David Kanter, executive director of MLCommons. “Together with our first power measurement results for MLPerf Training we are increasing capabilities and reducing the environmental footprint - making AI better for everyone.”

New LLM fine-tuning benchmark

The MLPerf Training v4.0 suite introduces a new benchmark to target fine-tuning a large language model (LLM). An LLM that has been pre-trained on a general corpus of text can be fine-tuned to improve its accuracy on specific tasks, and the computational costs of doing so can differ from pre-training.

A variety of approaches to fine-tuning an LLM at lower computational costs have been introduced over the past few years. The MLCommons Training working group evaluated several of these algorithms and ultimately selected LoRA as the basis for its new benchmark. First introduced in 2021, LoRA freezes original pre-trained parameters in a network layer and injects trainable rank decomposition matrices. Since LoRA fine-tuning trains only a small portion of the network parameters, this approach dramatically reduces the computational and memory demands compared to pre-training or supervised fine-tuning.
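The idea described above can be illustrated with a minimal sketch in plain Python. It is not the benchmark's reference implementation: the function names, the layer dimensions, and the rank are hypothetical values chosen for illustration, and the zero initialization of one adapter matrix follows the convention of the original LoRA paper rather than anything stated in this release.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(x, w_frozen, lora_a, lora_b, scale=1.0):
    """Compute y = x W + scale * (x A) B.

    W is the frozen pre-trained weight; only the low-rank pair (A, B)
    would receive gradient updates during fine-tuning.
    """
    base = matmul(x, w_frozen)
    update = matmul(matmul(x, lora_a), lora_b)
    return [[b + scale * u for b, u in zip(rb, ru)] for rb, ru in zip(base, update)]


def trainable_fraction(d_in, d_out, rank):
    """Fraction of a d_in x d_out layer's parameters that LoRA trains."""
    return rank * (d_in + d_out) / (d_in * d_out)


random.seed(0)
d_in, d_out, rank = 8, 6, 2  # hypothetical toy sizes
w = [[random.gauss(0, 1) for _ in range(d_out)] for _ in range(d_in)]
a = [[random.gauss(0, 0.01) for _ in range(rank)] for _ in range(d_in)]
b = [[0.0] * d_out for _ in range(rank)]  # zero init: the adapter starts as a no-op
x = [[random.gauss(0, 1) for _ in range(d_in)]]

y = lora_forward(x, w, a, b)

# For a hypothetical 4096 x 4096 layer at rank 16, under 1% of weights are trained.
print(trainable_fraction(4096, 4096, 16))  # 0.0078125
```

Because the second adapter matrix starts at zero, the adapted layer initially reproduces the frozen layer's output exactly, and fine-tuning only moves the small matrices `a` and `b`.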

“Fine-tuning LLMs is a notable workload because AI practitioners across many organizations make use of this technology. LoRA was the optimal choice for a state-of-the-art fine-tuning technique; it significantly reduces trainable parameters while maintaining performance comparable to fully fine-tuned models,” said Hiwot Kassa, MLPerf Training working group co-chair.

The new LoRA benchmark uses the Llama 2 70B general LLM as its base. This model is fine-tuned with the Scrolls dataset of government documents with a goal of generating more accurate document summaries. Accuracy is measured using the ROUGE algorithm for evaluating the quality of document summaries. The model uses a context length of 8,192 tokens, keeping pace with the industry’s rapid evolution toward longer context lengths.
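As a rough illustration of how a ROUGE-style score compares a generated summary with a reference, the sketch below computes a simplified ROUGE-L F1 from the longest common subsequence of whitespace-separated tokens. The release does not say which ROUGE variant or tokenization the benchmark uses, so the function names, the lowercasing, and the example sentences here are assumptions, and standard ROUGE tooling applies additional preprocessing that is omitted.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """Simplified ROUGE-L F1 over lowercased, whitespace-split tokens."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of the candidate that matches
    recall = lcs / len(ref)      # share of the reference that is covered
    return 2 * precision * recall / (precision + recall)


# Hypothetical example sentences, not drawn from the benchmark dataset.
score = rouge_l_f1(
    "the agency approved the new budget",
    "the agency approved a budget",
)
print(round(score, 4))  # 0.7273
```

A higher score means the generated summary shares a longer in-order run of words with the reference summary.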

The LLM fine-tuning benchmark is already achieving widespread adoption, with over 30 submissions in its initial round.

Learn more about the selection of the LoRA fine-tuning algorithm for the MLPerf Training benchmark in this blog.

New GNN benchmark for node classification

MLPerf Training v4.0 also introduces a graph neural network (GNN) benchmark for measuring the performance of ML systems on problems that are represented by large graph-structured data, such as those used to implement literary databases, drug discovery applications, fraud detection systems, social networks, and recommender systems.

“Training on large graph-structured datasets poses unique system challenges, demanding optimizations for sparse operations and inter-node communication. We hope the addition of a GNN-based benchmark in MLPerf Training broadens the challenges offered by the suite and spurs software and hardware innovations for this critical class of workload,” said Ritika Borkar, MLPerf Training working group co-chair.

The MLPerf Training GNN benchmark is used for a node classification task where the goal is to predict a label for each node in a graph. The benchmark uses an R-GAT model and is trained on the 2.2 terabyte IGBH full dataset, the largest available open-source graph dataset with 547 million nodes and 5.8 billion edges. The IGBH database is a graph showing the relationships between academic authors, papers, and institutes. Each node in the graph can be classified into one of 2,983 classes.
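To make the node classification task more concrete, the sketch below shows one attention-weighted aggregation step over a tiny graph with typed edges, followed by a label prediction. It only loosely follows the general idea behind relational graph attention and is not the benchmark's R-GAT model: the function names, the per-relation weights, the toy graph, and the two-class classifier are all hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(m, v):
    return [dot(row, v) for row in m]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def relational_attention(node, features, edges, weights, attn, slope=0.2):
    """One single-head aggregation step for `node`.

    edges: (source, relation, target) triples; messages flow source -> target.
    weights: relation -> projection matrix; attn: relation -> attention vector.
    """
    incoming = [(src, rel) for src, rel, dst in edges if dst == node]
    if not incoming:
        return list(features[node])  # no neighbors: keep the node's own features
    messages, scores = [], []
    for src, rel in incoming:
        h_dst = matvec(weights[rel], features[node])
        h_src = matvec(weights[rel], features[src])
        raw = dot(attn[rel], h_dst + h_src)  # `+` concatenates the two lists
        scores.append(raw if raw > 0 else slope * raw)  # leaky ReLU
        messages.append(h_src)
    alphas = softmax(scores)
    return [sum(a * m[i] for a, m in zip(alphas, messages))
            for i in range(len(messages[0]))]


def predict_label(hidden, classifier):
    """Return the index of the highest-scoring class."""
    logits = matvec(classifier, hidden)
    return max(range(len(logits)), key=logits.__getitem__)


# Hypothetical toy graph: one author and two papers.
features = {"paper_a": [1.0, 0.0], "paper_b": [0.0, 1.0], "author": [0.5, 0.5]}
edges = [("paper_b", "cites", "paper_a"), ("author", "writes", "paper_a")]
identity = [[1.0, 0.0], [0.0, 1.0]]
weights = {"cites": identity, "writes": identity}
attn = {"cites": [0.0] * 4, "writes": [0.0] * 4}  # zero attention: uniform weighting

hidden = relational_attention("paper_a", features, edges, weights, attn)
label = predict_label(hidden, classifier=identity)
```

With the attention vectors set to zero, every neighbor receives equal weight, so `hidden` is simply the mean of the two neighbor feature vectors. In a trained model the weights and attention vectors would be learned, and the classifier would score each of the available classes.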

The MLPerf Training team recently submitted MLPerf R-GAT to the Illinois Graph Benchmark (IGB) leaderboard, which helps the industry keep track of the state of the art for GNN models and encourages reproducibility. We are pleased to announce that their submission is currently #1 with a 72% test accuracy.

Learn more about the selection of the GNN benchmark in this blog.

View the results

To view the full results for MLPerf Training v4.0 and find additional information about the benchmarks, please visit the Training benchmark page.

About MLCommons

MLCommons is the world leader in building benchmarks for AI. It is an open engineering consortium with a mission to make AI better for everyone through benchmarks and data. The foundation for MLCommons began with the MLPerf benchmarks in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. In collaboration with its 125+ members, global technology providers, academics, and researchers, MLCommons is focused on collaborative engineering work that builds tools for the entire AI industry through benchmarks and metrics, public datasets, and measurements for AI Safety.

For additional information on MLCommons and details on becoming a member or affiliate, please visit MLCommons.org or contact participation@mlcommons.org.

Contacts

For press inquiries contact: Kelly Berschauer kelly@mlcommons.org

Release Summary

Announcing benchmark results for MLCommons MLPerf Training v4.0, including results for two new benchmarks: LoRA fine-tuning of Llama 2 70B and GNN.
