MLPerf Releases Latest Inference Results and New Storage … – EnterpriseAI

MLCommons this week issued the results of its latest MLPerf Inference (v3.1) benchmark exercise. Nvidia was again the top performing accelerator, but Intel (Xeon CPU) and Habana (Gaudi1 and 2) performed well. Google provided a peek at its new TPU (v5e) performance. MLCommons also debuted a new MLPerf Storage (v0.5) benchmark intended to measure storage performance under ML training workloads. Submitters in the first Storage run included: Argonne National Laboratory (ANL), DDN, Micron, Nutanix, and Weka.

Digging through the latest Inference results (more than 12,000 performance and 5,000 power inferencing results from 120 systems) is a challenge. There were a more modest 28 results in the storage category. From a usefulness perspective, MLCommons provides direct access to results spreadsheets that permit potential system users/buyers to drill down into specific system configurations and benchmark tests for comparison. (Links to Inference Datacenter and Edge v3.1 results and Storage v0.5 results)

In the past, HPCwire has tended to try to cover the full exercise in a single article. The rising number of results and introduction of a new category make this less tenable. Instead, we'll present a broad overview in this article and drill deeper into some vendor-specific results in separate articles (Nvidia and Intel/Habana). By now, you may be familiar with the MLPerf release cadence, which is twice yearly for training and inference, with each released in alternate quarters. So, inference results are released in spring and (early) fall; training results are released in winter and summer. The HPC Training benchmark is released just once yearly, close to the annual SC conference.

Broadly, inferencing and training are the foundational pieces of ML applications, with training deemed the more computationally intense of the two (think of training LLMs with trillions of parameters). Inferencing, though, is the volume workhorse, sitting behind every chatbot and similar application.

MLPerf Inference v3.1 introduced two new benchmarks to the suite. The first is a large language model (LLM) using the GPT-J reference model to summarize CNN news articles; it garnered results from 15 different submitters, reflecting the rapid adoption of generative AI. The second change is an updated recommender, meant to be more representative of industry practices, using the DLRM-DCNv2 reference model and larger datasets; it had 9 submissions. These new tests, says MLCommons, help advance AI by ensuring that industry-standard benchmarks represent the latest trends in AI adoption to help guide customers, vendors, and researchers.

In a pre-briefing, David Kanter, MLCommons executive director, said, "We added our first generation recommender a couple of years ago and are now updating it. The LLM (inference) benchmark is brand new and reflects the explosion of interest in what people are calling generative AI, large language models." An LLM had been added to the MLPerf Training benchmark in the spring (see HPCwire coverage, MLPerf Training 3.0 Showcases LLM; Nvidia Dominates, Intel/Habana Also Impress).

No ML benchmarking effort today would be complete without LLM coverage, and MLCommons (parent organization for MLPerf) now has that.

"It's important to understand large language models operate on tokens. A token is typically a piece of a word. An LLM simply takes a set of tokens as input and predicts the next token. Now, you can chain this together to actually build a predicted sentence. In practice, LLMs are used in a wide variety of applications. You can use them in search and in generating content, like essays or summaries. Summarization is what we do here," said Kanter.
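To make the "predict the next token, then chain" idea concrete, here is a minimal Python sketch. It is not the GPT-J reference implementation: the lookup table, the `<s>` and `</s>` marker tokens, and the function names are all hypothetical stand-ins, and a real LLM would score every token in its vocabulary rather than consult a table.

```python
from typing import List

# Hypothetical toy "model": maps the most recent token to the next one.
# A real LLM would compute a probability for every token in its vocabulary.
TOY_NEXT_TOKEN = {
    "<s>": "ml",
    "ml": "perf",
    "perf": "measures",
    "measures": "inference",
    "inference": "</s>",
}


def predict_next(tokens: List[str]) -> str:
    """Return the next token given the tokens so far (toy stand-in for an LLM)."""
    return TOY_NEXT_TOKEN.get(tokens[-1], "</s>")


def generate(prompt: List[str], max_new_tokens: int = 16) -> List[str]:
    """Chain single-token predictions until an end marker or the length limit."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        next_token = predict_next(tokens)
        if next_token == "</s>":  # assumed end-of-sequence marker
            break
        tokens.append(next_token)  # the output becomes part of the next input
    return tokens


if __name__ == "__main__":
    print(generate(["<s>"]))
```

Each loop iteration feeds the growing token list back into the predictor, which is the chaining Kanter describes; summarization works the same way, with the article text supplied as the prompt.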

The MLPerf LLM inference benchmark is quite different from the training benchmark, he emphasized.

"One of the critical differences is the inference LLM is fundamentally performing a generative task. It's writing fairly lengthy sentences, multiple sentences, [but] it's also actually a different and smaller model," he said. "Many folks simply don't have the compute or the data to really support a really large model. The actual task we're performing with our inference benchmark is text summarization. So we feed in an article and then tell the language model to summarize the article."

As is MLCommons' practice, submitting organizations are invited to submit brief statements on their submissions. These range in quality from pure marketing to more granular technical descriptions of a submission's distinguishing features. Given the high number of results, a fast review of the vendor statements can be informative in conjunction with consulting the spreadsheet.

Both Inference and Storage submitter statements are appended to the end of this article. As examples, here are snippets from a few vendor statements in the MLPerf Inference v3.1 exercise:

Azure promoted how its online H100 instances fared versus on-premises systems. "Azure was the only submitter to publish results for virtual machines in the cloud, while matching the performance of on premises and bare metal offerings. This has been possible thanks to innovative technologies including: AI supercomputing GPUs: Equipped with eight NVIDIA H100 Tensor Core GPUs, these VMs promise significantly faster AI model performance than previous generations, empowering businesses with unmatched computational power; Next-generation computer processing unit (CPU): Understanding the criticality of CPU performance for AI training and inference, we have chosen the 4th Gen Intel Xeon Scalable processors as the foundation of these VMs, ensuring optimal processing speed."

CTuning Foundation, the non-profit ML tool developer, noted that it "[delivered] the new version of the open-source MLCommons CM automation language, CK playground and modular inference library (MIL) that became the 1st and only workflow automation enabling mass submission of more than 12000 performance results in a single MLPerf inference submission round with more than 1900 power results across more than 120 different system configurations."

Google touted its new TPU v5e. "TPU v5e systems use multiple accelerators linked together by a high-speed interconnect and can be configured with a topology ranging from 1x1 to 16x16 (256 chips), giving the user the flexibility to choose the system that best meets their needs. This wide range of topology options offered by TPU systems allows users to run and scale AI inference workloads cost-effectively, without compromising on performance.

"In this submission, Google Cloud used a TPU v5e system with a 2x2 topology (4 TPU chips) to run the 6-billion-parameter GPTJ benchmark. This benchmark demonstrates both the ease of scaling and the cost-efficiency offered by the TPU v5e systems for inference of large language models. Users can easily add more TPU v5e instances to achieve higher total queries per second (QPS), while maintaining the same performance per dollar advantage."

HPE reported, "In the datacenter category, HPE Cray systems with eight (8) NVIDIA GPUs led our portfolio in performance, delivering more than 340,000 samples per second throughput for ResNet-50 Computer Vision, and more than 28,000 samples per second throughput for Bert 99.0 NLP. HPE also submitted for the first time the newly available HPE ProLiant DL380a Gen11 and HPE ProLiant DL320 Gen11 servers with NVIDIA H100 and L4 GPUs. The HPE ProLiant DL380a Gen11 with four (4) NVIDIA H100 GPUs is ideal for NLP and LLM inference. The HPE ProLiant DL320 Gen11 with four (4) NVIDIA L4 GPUs is a 1U server positioned for computer vision inference."

Intel discussed Gaudi2 accelerators, 4th Gen Intel Xeon Scalable processors and Intel Xeon CPU Max Series. "Gaudi2 performance on both GPT-J-99 and GPT-J-99.9 for server queries and offline samples are 78.58/second and 84.08/second, respectively. These outstanding inference performance results complement our June training results and show continued validation of Gaudi2 performance on large language models. Performance and model coverage will continue to advance in the coming benchmarks as Gaudi2 software is updated continually with releases every six to eight weeks.

"Intel remains the only server CPU vendor to submit MLPerf results. Our submission for 4th Gen Intel Xeon Scalable processors with Intel AMX validates that CPUs have great performance for general purpose AI workloads, as demonstrated with MLPerf models, and the new and larger DLRM v2 recommendation and GPT-J models."

You get the general flavor. It's necessary to dig into the spreadsheet for meaningful comparisons.

The new storage benchmark (v0.5) has been in the works for two years. MLCommons says, "It's the first open-source AI/ML benchmark suite that measures the performance of storage for ML training workloads. The benchmark was created through a collaboration spanning more than a dozen leading industry and academic organizations and includes a variety of storage setups including: parallel file systems, local storage, and software defined storage. The MLPerf Storage Benchmark will be an effective tool for purchasing, configuring, and optimizing storage for machine learning applications, as well as for designing next-generation systems and technologies."

Although it's being introduced along with the latest inference results, storage performance in ML is typically a more sensitive system element in training. MLCommons notes, "Training neural networks is both a compute and data-intensive workload that demands high-performance storage to sustain good overall system performance and availability. For many customers developing the next generation of ML models, it is a challenge to find the right balance between storage and compute resources while making sure that both are efficiently utilized."

MLPerf Storage is intended to help overcome this problem by accurately modeling the I/O patterns posed by ML workloads, providing the flexibility to mix and match different storage systems with different accelerator types. The new benchmark reports results in samples/s and MB/s. Of course, the choice of storage hardware, protocol/filesystem, and network all influence performance.

The MLPerf Storage benchmark suite is built on the codebase of DLIO, a benchmark designed for I/O measurement in high performance computing, adapted to meet current storage needs.

Talking about the motivation and goals for the new benchmark, Kanter said, "I'd heard about pretty large hyperscalers, who deployed really large training clusters, that could not hit their peak utilization because they didn't have enough storage. That [suggested] there's fundamentally a hard problem in storage and one that's underappreciated. Most hyperscalers that are buying 1000s, or tens of 1000s of accelerators also have engineers on staff to design proper storage subsystems."

"The key accomplishment is we created a tool that represents ML training IO patterns, that doesn't require having any compute or accelerators," said Kanter. "That's important, because if you want to size a storage subsystem for 1000 accelerators, you don't want to have to buy 1000 accelerators. Another interesting thing is it's a dynamic tool that is coupled to compute. The metric for MLPerf storage is how many samples per second can be streamed out, for a given compute utilization; so we model a compute subsystem. If your storage falls behind too much, the compute subsystem will be idle, and we only allow 10% idle due to storage."
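The pass/fail idea Kanter describes can be sketched in a few lines of Python. This is an illustration only, not the MLPerf Storage tooling or its official formula: the `RunStats` fields, the function names, and the assumption that utilization equals compute time divided by compute time plus storage stall time are all hypothetical. Only the 10% idle allowance comes from the quote above.

```python
from dataclasses import dataclass

# From the quote: at most 10% idle time due to storage is allowed.
MIN_UTILIZATION = 0.90


@dataclass
class RunStats:
    """Hypothetical summary of one benchmark run against an emulated accelerator."""
    samples: int            # samples streamed out of storage
    compute_seconds: float  # time the emulated accelerator spent "computing"
    stall_seconds: float    # time it sat idle waiting on storage


def utilization(run: RunStats) -> float:
    """Fraction of wall time spent computing (assumed definition)."""
    total = run.compute_seconds + run.stall_seconds
    if total <= 0:
        raise ValueError("run must have a positive duration")
    return run.compute_seconds / total


def samples_per_second(run: RunStats) -> float:
    """Throughput over the whole run, stalls included."""
    return run.samples / (run.compute_seconds + run.stall_seconds)


def is_valid(run: RunStats) -> bool:
    """A run counts only if storage kept the emulated accelerator busy enough."""
    return utilization(run) >= MIN_UTILIZATION


if __name__ == "__main__":
    fast = RunStats(samples=10_000, compute_seconds=95.0, stall_seconds=5.0)
    slow = RunStats(samples=10_000, compute_seconds=95.0, stall_seconds=30.0)
    for name, run in (("fast", fast), ("slow", slow)):
        print(name, round(utilization(run), 2), samples_per_second(run), is_valid(run))
```

In this toy example the slow storage system still streams 80 samples per second, but its run would not count because the emulated accelerator is idle for more than 10% of the time.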

"If the storage system is too slow, you can't run the benchmark," said Kanter. Obviously, these are early days for MLPerf Storage and it will take some time for the community to take its full measure. There are already plans for additions. Given its newness, it's best to look through MLCommons documentation. (Link to MLPerf Storage Benchmark Rules)

Link to MLCommons, https://mlcommons.org/en/

ASUSTeK

ASUSTeK recently benchmarked its new AI servers using the MLPerf Inference v3.1 suite, aiming to highlight their performance across varied deep learning tasks. Our results exhibit our system's competency in inferencing some of the most demanding models with remarkable efficiency.

In the modern era of AI, speed and efficiency in deploying machine learning models to production are paramount. Enter ASUS GPU Server portfolios - designed to redefine the standards of inference, as validated by our recent MLPerf Inference benchmarks. Harness the power of AI frameworks like TensorFlow, PyTorch, and more. ASUS servers are not just about raw power; they're about smart power. Optimized software-hardware integrations ensure that you get the most out of every tensor operation. Power doesn't have to come at the cost of the planet. ASUS GPU servers not only boast top-tier performance metrics but do so with impressive energy efficiency ratings, as highlighted in the MLPerf power efficiency results. Seamlessly scale your AI workloads. With our multi-GPU configurations and optimized hardware and software, ASUS GPU servers are built to handle increasing data demands, ensuring you're always ahead of the curve.

System Configuration:

Hardware: ASUS flagship AI Server ESC8000A-E12 with dual AMD Genoa CPUs and up to 8 NVIDIA H100 GPUs, and ESC4000A-E12 with dual AMD Genoa CPUs and up to 8 L4 GPUs

The results signify the DL system's enhanced performance and capability to address contemporary deep learning challenges, making it an apt choice for researchers and industries requiring accelerated inferencing workloads.

Azure

Microsoft Azure announced the general availability of the ND H100 v5-series for Generative AI at scale. This series of virtual machines varies in size, ranging from eight to thousands of NVIDIA H100 GPUs interconnected by NVIDIA Quantum-2 InfiniBand networking. Azure was the only submitter to publish results for virtual machines in the cloud, while matching the performance of on premises and bare metal offerings. This has been possible thanks to innovative technologies including:

The ND H100 v5 is now available in the East United States and South Central United States Azure regions. Enterprises can register their interest in access to the new VMs or review technical details on the ND H100 v5 VM series at Microsoft Learn.

CTuning

As a founding member of MLCommons, cTuning.org is committed to democratizing MLPerf benchmarks and making them accessible to everyone to deliver the most efficient AI solutions while reducing all development, benchmarking and optimization costs.

We are proud to deliver the new version of the open-source MLCommons CM automation language, CK playground and modular inference library (MIL) that became the 1st and only workflow automation enabling mass submission of more than 12000 performance results in a single MLPerf inference submission round with more than 1900 power results across more than 120 different system configurations from different vendors (different implementations, all reference models and support for DeepSparse Zoo, Hugging Face Hub and BERT pruners from the NeurIPS paper, main frameworks and diverse software/hardware stacks) in both open and closed divisions!

This remarkable achievement became possible thanks to open and transparent development of this technology as an official MLCommons project with public Discord discussions, important feedback from Neural Magic, TTA, One Stop Systems, Nutanix, Collabora, Deelvin, AMD and NVIDIA, and contributions from students, researchers and even school children from all over the world via our public MLPerf challenges. Special thanks to cKnowledge for sponsoring our developments and submissions, to One Stop Systems for showcasing the 1st MLPerf results on Rigel Edge Supercomputer, and to TTA for sharing their platforms with us to add CM automation for DLRMv2 available to everyone.

Since it's impossible to describe all the compelling performance and power-efficient results achieved by our collaborators in a short press-release, we will make them available with various derived metrics (power efficiency, cost, etc) and reproducibility reports at the MLCommons CK playground (x.cKnowledge.org), github.com/mlcommons/ck_mlperf_results and github.com/mlcommons/ck/blob/master/docs/news-mlperf-v3.1.md shortly after official release.

We continue enhancing the MLCommons CM/CK technology to help everyone automatically co-design the most efficient end-to-end AI solutions based on their requirements and constraints. We welcome all submitters to join our public MLCommons Task Force on Automation and Reproducibility if you want to automate your future MLPerf submissions at scale.

Connect Tech Inc

As a new member of MLCommons, Connect Tech ran performance and accuracy benchmarks in the Inference: Edge category in its recent MLPerf submission. Connect Tech's feature-rich Hadron carrier board with the NVIDIA Jetson Orin NX, a high-performance, energy-efficient platform, showcased remarkable levels of performance across various AI workloads.

Connect Tech additionally supports NVIDIA Jetson Orin NX with Photon and Boson carrier boards, and system devices like Polaris and Rudi-NX. By deploying on Connect Tech's production-ready hardware, customers can take immediate advantage of Jetson Orin NX for performance improvements and enhanced user experience with robotics and other edge AI applications.

Connect Tech's involvement in MLCommons signifies more than just technical achievement. It reflects the company's commitment to pushing the envelope of what's possible in the world of AI at the edge. The seamless integration of Connect Tech's hardware with NVIDIA's cutting-edge technology presents engineers and scientists with the tools to drive AI and machine learning innovations across diverse industries, including robotics, industrial automation, and healthcare.

Connect Tech is a hardware design and manufacturing company, specializing in rugged, small form factor solutions. As an Elite NVIDIA Jetson ecosystem partner, Connect Tech designs carrier boards, enclosures, and embedded systems for each Jetson generation. With a rich history of innovation, Connect Tech integrates edge AI solutions within various industries, empowering engineers and scientists to harness the potential of machine learning.

Connect Tech remains at the forefront as the world delves deeper into AI and machine learning. Navigating the complex landscape of embedded AI computing is made easier by using NVIDIA and Connect Tech's innovative products.

Dell

Enterprise IT is bracing for the most transformative technology trend in decades: generative AI. Dell Technologies is ready to meet this demand with the world's broadest Generative AI solutions portfolio from desktop to edge to data center to cloud, all in one place.

For the MLPerf inferencing v3.1 benchmark testing, Dell submitted 230 results, including the new GPT-J and DLRMv2 benchmark results, across 20 system configurations. Dell Technologies works with customers and collaborators, including NVIDIA, Intel, and Qualcomm, to optimize performance and efficiency, boosting inferencing workloads, including generative AI.

The Dell PowerEdge XE accelerated server family continues to deliver tremendous performance gains across several benchmarks. Here are some of the latest highlights:

Generate higher quality, faster time-to-value predictions and outputs while accelerating decision-making with powerful solutions from Dell Technologies. Take a test drive in one of our worldwide Customer Solution Centers. Collaborate with our Innovation Lab and tap into one of our Centers of Excellence.

Fujitsu

Fujitsu offers a fantastic blend of systems, solutions, and expertise to guarantee maximum productivity, efficiency, and flexibility, delivering confidence and reliability. Since 2020, we have been actively participating in and submitting to inference and training rounds for both data center and edge divisions.

In this round, Fujitsu demonstrated the performance of PRIMERGY CDI with four A100-PCIe-80GB GPUs installed in an external PCIe BOX and measured the benchmark program only for the data center closed division. Fujitsu Server PRIMERGY CDI is expertly engineered to deploy the necessary resources according to each customer's unique workload, releasing them when no longer needed. CDI stands for Composable Disaggregated Infrastructure, a next-generation technology that supports the diversification of data processing. This results in an efficient operation that maximizes resource utilization, while providing user-friendly services that eliminate the drawbacks of traditional physical servers.

As demonstrated by the impressive results of this round, the PRIMERGY CDI confirms that even with GPUs mounted in an external PCIe BOX, it delivers outstanding performance and remarkable scalability for PCIe components.

Our purpose is to make the world more sustainable by building trust in society through innovation. With a rich heritage of driving innovation and expertise, we are dedicated to contributing to the growth of society and our valued customers. Therefore, we will continue to meet the demands of our customers and strive to provide attractive server systems through the activities of MLCommons.

Giga Computing

Giga Computing Technology, a subsidiary wholly owned by GIGABYTE, is the enterprise unit that split off from GIGABYTE and that designs, manufactures, and sells servers, server motherboards, immersion solutions, and workstations. As the GIGABYTE brand is widely recognized, Giga Computing will continue to use and promote it, and that includes at expos where we will join as GIGABYTE. Although the company name has changed, our customers can still expect the same quality and services as before. Giga Computing strives to do better, and that includes a greater push for efficiency and cooling with immersion and DLC technology, as well as providing public AI benchmarks.

As one of the founding members of MLCommons, GIGABYTE has continued to support the community's efforts in benchmarking server solutions for various AI training & inference workloads. In the latest round of MLPerf Inference v3.1, Giga Computing submitted a powerful GIGABYTE system for platforms: Intel Xeon & NVIDIA H100 SXM5, and the results speak for themselves while showing great efficiency as measured in performance/watt. We did find that our system achieved excellent performance in some tests such as rnnt-Server and bert99-offline. We would have liked to submit more benchmarks, but due to resource limitations we were not able to; however, we found that our partners NVIDIA, Qualcomm, and Krai chose our GIGABYTE servers to do their own testing.

Google

Google Cloud recently launched an expansion to its AI infrastructure portfolio - Cloud TPU v5e - and is proud to announce its performance results in this round of MLPerf Inference (data center category). TPU v5e systems use multiple accelerators linked together by a high-speed interconnect and can be configured with a topology ranging from 1x1 to 16x16 (256 chips), giving the user the flexibility to choose the system that best meets their needs. This wide range of topology options offered by TPU systems allows users to run and scale AI inference workloads cost-effectively, without compromising on performance.

In this submission, Google Cloud used a TPU v5e system with a 2x2 topology (4 TPU chips) to run the 6-billion-parameter GPTJ benchmark. This benchmark demonstrates both the ease of scaling and the cost-efficiency offered by the TPU v5e systems for inference of large language models. Users can easily add more TPU v5e instances to achieve higher total queries per second (QPS), while maintaining the same performance per dollar advantage.

We are looking forward to seeing what Google Cloud customers achieve with the new TPU v5e systems.

HPE

HPE successfully submitted results in partnership with Intel, NVIDIA, Qualcomm, and Krai. HPE demonstrated a range of high-performing inference systems for both the datacenter and edge in Computer Vision, natural language processing (NLP), and large language models (LLM).

In the datacenter category, HPE Cray systems with eight (8) NVIDIA GPUs led our portfolio in performance, delivering more than 340,000 samples per second throughput for ResNet-50 Computer Vision, and more than 28,000 samples per second throughput for Bert 99.0 NLP.

HPE also submitted for the first time the newly available HPE ProLiant DL380a Gen11 and HPE ProLiant DL320 Gen11 servers with NVIDIA H100 and L4 GPUs. The HPE ProLiant DL380a Gen11 with four (4) NVIDIA H100 GPUs is ideal for NLP and LLM inference. The HPE ProLiant DL320 Gen11 with four (4) NVIDIA L4 GPUs is a 1U server positioned for computer vision inference. The HPE ProLiant DL380a Gen11 showed strong inference performance using 4th Gen. Intel Xeon Scalable Processors in CPU-only inference scenarios. The HPE ProLiant DL385 Gen10 Plus v2 with eight (8) Qualcomm Cloud AI 100 Standard accelerators remained well balanced for over-network inference compared to offline datacenter performance. Qualcomm Cloud AI 100 Standard is ideal for both computer vision and NLP inference.

In the Edge category, HPE Edgeline e920d powered by four (4) Qualcomm Cloud AI 100 Standard accelerators remains one of the lowest latency systems in the Edge category for SingleStream and MultiStream inference scenarios. The HPE Edgeline e920d also achieved strong performance improvements in throughput and energy efficiency.

Many thanks to Krai for its collaboration in achieving high performance and energy efficiency for Qualcomm Cloud AI 100 accelerators.

IEI

IEI Industry Co., LTD is a leading provider of data center infrastructure, cloud computing, and AI solutions, ranking among the world's top 3 server manufacturers. Through engineering and innovation, IEI delivers cutting-edge computing hardware design and extensive product offerings to address important technology arenas like open computing, cloud data center, AI, and deep learning.

In MLCommons Inference v3.1, IEI submitted the NF5468M6 system.

NF5468M6 is a highly versatile 4U AI server supporting between 4 and 16 NVIDIA single and double-width GPUs, making it ideal for a wide range of AI applications including AI cloud, IVA, video processing and much more. NF5468M6 offers ultra-high storage capacity and the unique function of switching topologies between Balance, Common and Cascade in one click, which helps to flexibly adapt to various needs for AI application performance optimization.

Intel

Intel is pleased to report MLPerf Inference v3.1 performance results for our Gaudi2 accelerators, 4th Gen Intel Xeon Scalable processors and Intel Xeon CPU Max Series. These results reinforce Intel's commitment to delivering the full spectrum of products to address wide-ranging customer AI requirements.

Gaudi2 performance on both GPT-J-99 and GPT-J-99.9 for server queries and offline samples are 78.58/second and 84.08/second, respectively. These outstanding inference performance results complement our June training results and show continued validation of Gaudi2 performance on large language models. Performance and model coverage will continue to advance in the coming benchmarks as Gaudi2 software is updated continually with releases every six to eight weeks.

Intel remains the only server CPU vendor to submit MLPerf results. Our submission for 4th Gen Intel Xeon Scalable processors with Intel AMX validates that CPUs have great performance for general purpose AI workloads, as demonstrated with MLPerf models, and the new and larger DLRM v2 recommendation and GPT-J models.

The results confirm that 4th Gen Intel Xeon Scalable processor with optimized data pre-processing, modeling and deployment tools and optimizations, is an ideal solution to build and deploy general purpose AI workloads with the most popular open source AI frameworks and libraries.

For the GPT-J 100-word summarization task of a news article of approximately 1,000 to 1,500 words, 4th Gen Intel Xeon processors summarized two paragraphs per second in offline mode and one paragraph per second in real-time server mode.

This is the first time we've submitted MLPerf results for our Intel Xeon CPU Max Series, which provides up to 64GB of high-bandwidth memory. For GPT-J, it was the only CPU able to achieve 99.9% accuracy, which is critical for usages for which the highest accuracy is of paramount importance.

With our ongoing software updates, we expect continued advances in performance and productivity, and reporting new training metrics with the November training cycle.

For more details, please see MLCommons.org.

Notices & Disclaimers

Performance varies by use, configuration and other factors. Learn more at http://www.Intel.com/PerformanceIndex .

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure. Your costs and results may vary.
