Meeting modern expectations and unlocking the power of rack-scale design

Today’s data centres are being reshaped by the drive for greater computing efficiency and performance. Michael McNerney, Vice President Marketing and Network Security, Supermicro, explores how embracing cloud-native strategies and liquid cooling transforms scalability, maintenance costs and high-end processing in the digital age.

Michael McNerney, Vice President Marketing and Network Security, Supermicro

As data centres increasingly become the backbone of our digital society, the paradigm for building a modern data centre has evolved. It is no longer just about acquiring individual servers, but about considering entire racks of servers and how they interact. This shift towards rack-scale design is transforming how computing and storage equipment is specified, leading to better-optimised configurations, faster results and lower maintenance costs.

Meeting the needs of modern data centres

Today’s applications are often built as a number of separate executables with defined APIs, each performing a specific task or set of functions. This development in how applications are created is where the cloud-native approach comes into play: each service can be developed and improved at a more manageable scale, while relying on other services to perform tasks as needed. The cloud-native approach is a game-changer, enabling complex applications to scale when required, allowing individual components to be upgraded as new features are added to one part of the code base, and supporting continuous integration and continuous delivery (CI/CD).
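As a minimal, hedged sketch of that idea, the toy example below runs one small service with a defined HTTP API and has a second component call it to perform a task. The service name, port and endpoint are hypothetical and chosen purely for illustration; they do not describe any particular product or deployment.

```python
# Minimal sketch of the cloud-native pattern described above: one small,
# independent service exposing a defined API, and a second component that
# relies on it to perform a task. Names, ports and endpoints are hypothetical.
import json
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

class PricingService(BaseHTTPRequestHandler):
    """Would be a separate executable in a real deployment; returns a price for an item."""
    def do_GET(self):
        item = self.path.lstrip("/")
        body = json.dumps({"item": item, "price_eur": 9.99}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging for the demo
        pass

if __name__ == "__main__":
    # Run the service in the background; a separate 'checkout' step then calls
    # its API over the network, just as two servers on the same switch would.
    server = ThreadingHTTPServer(("127.0.0.1", 8081), PricingService)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    time.sleep(0.5)  # give the toy service a moment to start listening
    with urllib.request.urlopen("http://127.0.0.1:8081/widget") as resp:
        print("checkout received:", json.loads(resp.read()))
```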

When applications are built using this approach, the servers that need to communicate with each other should be networked on the same switch. This reduces any communication and data movement delays from one server to another. By creating a rack-scale design based on the anticipated software architecture, Service Level Agreements (SLAs) can be met, leading to greater customer satisfaction.

Adopting a rack-scale approach comes with its own set of benefits, but it’s not as straightforward as simply filling racks to the top with servers and switches. There are several key considerations to keep in mind, such as the maximum power that can be delivered to the rack and the air and liquid cooling layouts. Additionally, data centre designers need to evaluate the exact communication requirements between server units and the speed at which installation can be completed to maximise customer return while minimising errors. These factors can significantly affect the computing density or storage capacity per square metre; lower-density racks may be the better fit for the many data centres that lack high-capacity forced-air cooling.

Efficiencies of racks at scale

Racks are designed with various heights to accommodate different numbers of servers. From a square-metre perspective, filling a rack with servers can reduce rack (or server) sprawl and reduce the number of power connections. Filling racks with compute and storage systems – or a combination of the two – is beneficial when the different types of systems need to interact with minimal latency (on the same switch). Higher-density racks allow more computing to fit within a given area, but data centre designers also need to be aware of cooling issues.

Currently, many data centres are equipped for air cooling and will be for the foreseeable future. Many application workloads do not require the most performant CPUs or GPUs and thus can remain air cooled. However, the data centre design and rack design are essential to keeping the systems at their design temperatures. Hot and cold aisles must remain separated for air cooling to work efficiently. Nevertheless, the heat load can be spread over a greater area when racks are not filled to their maximum, which may allow a lower airflow in cubic feet per minute (CFM) and is therefore less taxing on the data centre’s air distribution. Another consideration is the inlet temperature for the systems in the rack: it is essential to understand the CPUs’ heat generation and the ability to move enough air into the racks.
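As a rough illustration of that relationship, the sketch below estimates the airflow needed to remove a given rack heat load for an assumed cold-aisle-to-hot-aisle temperature rise. The heat loads and the 12°C rise are illustrative assumptions, not figures from this article; only the underlying physics (heat carried = airflow × air density × specific heat × temperature rise) is standard.

```python
# Rough airflow estimate for an air-cooled rack (illustrative assumptions).
AIR_DENSITY = 1.2         # kg/m^3, air near sea level at room temperature
AIR_SPECIFIC_HEAT = 1005  # J/(kg*K)
M3S_TO_CFM = 2118.88      # 1 m^3/s expressed in cubic feet per minute

def required_airflow_cfm(heat_load_w: float, delta_t_c: float) -> float:
    """Airflow (CFM) needed to remove heat_load_w with a delta_t_c air temperature rise."""
    m3_per_s = heat_load_w / (AIR_DENSITY * AIR_SPECIFIC_HEAT * delta_t_c)
    return m3_per_s * M3S_TO_CFM

# A fully loaded 20kW rack vs. the same heat spread over two half-filled racks,
# both assuming a 12 C rise from cold aisle to hot aisle.
print(round(required_airflow_cfm(20_000, 12)))  # ~2,928 CFM through one rack
print(round(required_airflow_cfm(10_000, 12)))  # ~1,464 CFM through each of two racks
```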

Liquid cooling is becoming a requirement for high-end servers (those using the fastest CPUs and GPUs). CPUs will soon be in the 500W (TDP) range, and GPUs are each in the 700W range. A complete server can easily be configured to consume 1kW once memory, storage and networking are included. Air cooling is struggling to keep up with these cooling demands; thus, liquid cooling is needed moving forward.
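To put those figures in context, here is a back-of-the-envelope sketch of how a server’s components add up and how quickly such servers consume a rack’s power budget. The component counts, wattages and the 40kW rack budget are illustrative assumptions, not specifications for any particular system.

```python
# Back-of-the-envelope server and rack power budget (illustrative assumptions).
server_components_w = {
    "cpus (2 x 350W)": 700,
    "memory": 150,
    "storage": 60,
    "networking": 40,
    "fans and losses": 100,
}
server_w = sum(server_components_w.values())        # ~1,050 W per server
rack_power_budget_w = 40_000                        # assumed power deliverable to the rack
servers_per_rack = rack_power_budget_w // server_w  # how many such servers the rack can feed

print(f"per-server power: {server_w} W")
print(f"servers supported by a {rack_power_budget_w // 1000} kW rack: {servers_per_rack}")
```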

The ‘plumbing’ required for Direct-To-Chip liquid cooling is best contained within a rack and includes a Cooling Distribution Unit (CDU), a number of Cooling Distribution Manifolds (CDMs), and the hoses needed for both the cold and hot liquid. Since a CDU can handle up to about 100kW of cooling capacity, it is best dedicated to a single rack, although it can be configured with longer hoses, at extra cost, to reach servers in a different rack. Filling the rack with servers that need to be liquid cooled lowers the cost of the CDU per server.
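A similar calculation, sketched below, shows why filling the rack improves the economics: the CDU is a fixed per-rack component, so its cost per server falls as more liquid-cooled servers share it. The ~100kW capacity comes from the paragraph above; the per-server liquid heat load and the CDU price are illustrative assumptions.

```python
# CDU cost amortisation across a rack (illustrative assumptions, except the
# ~100 kW CDU capacity cited above).
CDU_CAPACITY_W = 100_000
CDU_COST = 20_000             # assumed price of the in-rack CDU
SERVER_LIQUID_LOAD_W = 4_000  # assumed heat captured by direct-to-chip loops per server

max_servers = CDU_CAPACITY_W // SERVER_LIQUID_LOAD_W  # 25 servers within CDU capacity

for servers_in_rack in (8, 16, max_servers):
    print(f"{servers_in_rack:>2} servers -> CDU cost per server: "
          f"{CDU_COST / servers_in_rack:,.0f}")
```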

Optimising longevity and future high-end processing performance

For a number of applications, such as High-Performance Computing (HPC) and AI, the servers are expensive, which means that keeping them busy is critical to reducing the total cost of ownership (TCO). With applications that scale across servers, installing these large and power-hungry servers in the same rack is preferable to spreading the installation across a data centre. The system-to-system communication needs to be of the highest performance so that applications can scale as needed.

In addition, since these high-end platforms are the types of servers that may require liquid cooling, having the systems in the same rack will be beneficial. As server architectures for AI and HPC continue to be the most resource-demanding hardware components, a software backplane that controls resources at rack scale will be needed.

The basic server continues to evolve with new CPUs, GPUs and communication upgrades. While the number of cores per CPU and clock rates increase, work per watt is even more critical. The progression of this metric enables the Digital Transformation that is going on today, including AI Computing. While the electronics get faster, the server itself must be redesigned every few years to accommodate the latest generation of memory connectivity and the communication paths to storage and network devices.

New technologies allow the GPUs to talk to the network directly without involving the CPU. In addition, GPUs can now talk to each other directly without using a relatively slow PCIe path. When these systems are installed in a rack, scaling for many applications becomes more straightforward and does not have to involve a switch. Disaggregated computing may require racks to be redesigned so that a composable infrastructure operates efficiently without extraneous cabling and inherent delays.
