High-Performance Computing: HPE launches high-performance system for machine learning

The purchase of Determined AI appears to be bearing its first fruit: Hewlett Packard Enterprise (HPE) has combined the technology of the Californian specialist for high-performance distributed deep learning, which it acquired in June 2021, with its existing development environment and brought a fully integrated product to market, the HPE Machine Learning Development System.
According to the announcement, the new high-performance system is tailored specifically to machine learning (ML) and offers all the elements needed for highly scalable deep learning in a complete package: in addition to the software platform for machine learning, it includes compute hardware, built-in accelerators and suitable network components.

High performance computing for machine learners

Especially for small and medium-sized companies, the high infrastructure requirements are often an obstacle to building their own AI applications or integrating artificial intelligence (AI) into existing software. According to the manufacturer, HPE's scalable technology should now enable AI teams to develop their ML models faster and at a larger scale than before, and to monitor multiple training jobs together. This is said to significantly shorten the path to completing and deploying practical AI applications. According to the announcement, the time required to get from a concept to a production-ready trained model could be reduced from several weeks or months to a few days, as Evan Sparks, Vice President for AI and high-performance computing (HPC) at HPE, stressed at a press conference ahead of the release.
HPE Machine Learning Development System: Schematic of the process flows – from data preparation to development and training to deployment and inference of the ML models

According to Sparks, the HPE Machine Learning Development System is meant to open up high-performance computing to the entire AI field. Experimenting with machine learning, building prototypes, developing large-scale models, and training and scaling them all require special hardware and software that can process as many workflows as possible in parallel, quickly and reliably.
According to the Vice President, sufficiently robust computing power, the necessary storage capacity, connectivity and speed (through accelerators) are prerequisites for getting started. Acquiring such hardware is expensive, and setup and management can be complex; they also tie up creative capacity when staff meant for research, development or engineering are occupied with infrastructure tasks instead. This is where Hewlett Packard Enterprise says its system comes in: at the press conference ahead of the launch, the company promised to reduce the usual complexity of an ML-ready environment.

Machine Learning on a grand scale: Aleph Alpha

One reference customer for the HPE Machine Learning Development System is the Heidelberg-based AI company Aleph Alpha, which relies on HPE infrastructure for its large multimodal AI models developed in Germany. These currently have up to 200 billion parameters, as CEO Jonas Andrulis announced at the Supercomputing Conference. The start-up now has its own data center for high-performance training of its models. According to Justin Hotard, Executive Vice President at HPE, Aleph Alpha was able to start training ML models just two days after the HPE system was installed. The first results have already reached the market: the multimodal AI model LUMINOUS was officially launched in mid-April.
Evan Sparks emphasized that a partner like Aleph Alpha is of particular interest to HPE, since training the large models developed by the Heidelberg-based company is a special case that inherently demands high-end performance. The challenge in this use case was to design an overall system that is suitable for large NLP models and capable of both training and inference. The following graphic breaks down the technical setup that Aleph Alpha chose for its large-scale requirements:
Technical setup at Aleph Alpha: The Heidelberg AI experts needed a particularly powerful system for training and inference with their large NLP models, which HPE supplied.

Configurable infrastructure scales as needed

The heart of the infrastructure is the Apollo 6500 Gen10 system, which is equipped with eight Nvidia A100 Tensor Core graphics processors (GPUs) with 80 GB of memory each. The reference customer uses 64 of these GPUs together with a parallel storage system. This generation of the Apollo system is considered particularly reliable, stable and highly available. It includes NVLink for fast communication between the GPUs and supports, among other things, Intel Xeon Scalable processors and a range of fabrics for high-bandwidth, low-latency networking. The system is optimized for deep learning workflows but is also said to be suitable for complex simulation and modeling workloads. AI teams are likely to appreciate that the hardware can be customized, since the GPU topology is highly configurable depending on workload and needs.
At the heart of the fully integrated ML development system is the HPE Apollo 6500 Gen10 System.
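The figures quoted above can be turned into a rough capacity estimate. The following is a minimal sketch, not an HPE sizing tool: it assumes that the article's "64 of these" means 64 GPUs, and that model weights are stored at 2 bytes per parameter (16-bit precision), which the article does not state. All function and constant names are hypothetical.

```python
# Back-of-the-envelope sizing from the figures in the article.
GPUS_PER_NODE = 8        # article: eight A100 GPUs per Apollo 6500 system
MEM_PER_GPU_GB = 80      # article: 80 GB per GPU
TOTAL_GPUS = 64          # assumption: "64 of these" is read as 64 GPUs
MODEL_PARAMS = 200e9     # article: models with up to 200 billion parameters
BYTES_PER_PARAM = 2      # assumption: 16-bit weights, not stated in the article


def cluster_summary(total_gpus: int, gpus_per_node: int, mem_per_gpu_gb: int) -> dict:
    """Return node count and aggregate GPU memory for a homogeneous cluster."""
    nodes, remainder = divmod(total_gpus, gpus_per_node)
    if remainder:
        raise ValueError("total_gpus must be a multiple of gpus_per_node")
    return {
        "nodes": nodes,
        "mem_per_node_gb": gpus_per_node * mem_per_gpu_gb,
        "total_mem_gb": total_gpus * mem_per_gpu_gb,
    }


def weights_size_gb(params: float, bytes_per_param: int) -> float:
    """Memory needed for the raw weights only (no gradients or optimizer state)."""
    return params * bytes_per_param / 1e9


summary = cluster_summary(TOTAL_GPUS, GPUS_PER_NODE, MEM_PER_GPU_GB)
weights_gb = weights_size_gb(MODEL_PARAMS, BYTES_PER_PARAM)
print(summary)
print(f"weights alone: {weights_gb:.0f} GB")
```

Under these assumptions the weights of a 200-billion-parameter model already exceed the memory of any single 80 GB GPU, and training additionally needs room for gradients, optimizer state and activations, which is why the model has to be distributed across many GPUs.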

To monitor and control performance, the system offers sophisticated central cluster administration in the form of the HPE Performance Cluster Manager. The software is said to be compatible with all of HPE's high-performance clusters and supercomputers and to support setup as well as monitoring and management of hardware and images; it should also handle software updates and power management. The infrastructure additionally comes with tools for managing the individual system components, such as the stackable switches of the Aruba CX 6300 series, which among other things regulate access to the company network and enable top-of-rack (ToR) deployments in the data center. The system can be expanded as required from a basic set of components. The following overview shows the technical components of the smallest version of the ML Development System:
Technical overview of the infrastructure with software and service stack – here the smallest version of the HPE Machine Learning Development System

The switches' area of application also covers IoT, mobile and cloud computing (exact performance figures can be found in the data sheet). The system also uses HPE ProLiant DL325 servers and includes Nvidia InfiniBand HDR switches, which provide high-speed serial communication between the compute and storage components in the servers, clusters and data centers.
The HPE Machine Learning Development System is available now. More information can be found in the blog entry that accompanies the publication.

Swarm intelligence: Framework for GDPR-compliant AI models

In parallel with the new development system, HPE introduced Swarm Learning, a framework for distributed machine learning on a permissioned blockchain. The new approach to decentralized deep learning is intended to protect the privacy of data owners to a particularly high degree: Swarm Learning allows AI models to be trained without pooling personal data, since it shares only the insights learned from the underlying data, not the data itself.
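The principle of sharing learned parameters instead of raw data can be illustrated with a small example. The sketch below is not HPE's Swarm Learning API and leaves out the blockchain-based coordination entirely; it uses plain weighted parameter averaging (in the style of federated averaging) on a toy linear model as a stand-in. All names, data values and hyperparameters are hypothetical.

```python
# Toy illustration: each site trains locally and exchanges only (w, b).

def local_update(weights, data, lr=0.1, epochs=50):
    """Train y = w*x + b on one site's private data with gradient descent.

    Only the resulting parameters leave this function, never the data.
    """
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def merge(updates, sizes):
    """Average the sites' parameters, weighted by local sample count."""
    total = sum(sizes)
    w = sum(u[0] * s for u, s in zip(updates, sizes)) / total
    b = sum(u[1] * s for u, s in zip(updates, sizes)) / total
    return w, b


def training_round(shared_weights, sites):
    """One round: every site trains locally, then only parameters are merged."""
    updates = [local_update(shared_weights, data) for data in sites]
    return merge(updates, [len(data) for data in sites])


# Two sites hold disjoint private samples of the same relation y = 2x + 1.
site_a = [(0.0, 1.0), (0.5, 2.0), (1.0, 3.0)]
site_b = [(0.2, 1.4), (0.8, 2.6)]

weights = (0.0, 0.0)
for _ in range(20):
    weights = training_round(weights, [site_a, site_b])

print(f"merged model: w={weights[0]:.3f}, b={weights[1]:.3f}")
```

In this toy setup the merged parameters approach the underlying relation even though neither site ever sees the other's samples. A real deployment would differ in how peers are coordinated and how updates are merged and secured.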
The first use cases are in medical diagnostics. Cooperating organizations can keep sensitive data in-house while merging the relevant knowledge into more powerful models. The approach opens up a significantly broader data basis and is intended to promote global collaboration across organizational boundaries. More information is available in the iX report on the launch of HPE's Swarm Learning framework.
