David Watkins, Solutions Director at VIRTUS, walks us through the catalysts behind liquid cooling developments and their influence on modern data centres.
As Artificial Intelligence continues to revolutionise industries – delivering enhanced efficiency, improved decision-making and accelerated innovation – the computational demands of AI applications are growing rapidly. These requirements translate into substantial heat generation within data centres, demanding more sophisticated and sustainable ways to maintain optimal operating conditions.
Traditional air-cooled systems, commonly employed in many existing data centres, may struggle to effectively dissipate the heat density associated with AI workloads. As AI applications continue to evolve and push the boundaries of computational capabilities, innovative cooling technologies are becoming indispensable.
This article explores the cooling solutions designed to meet the unique thermal management needs of AI-driven data centres, highlighting their benefits and implementation strategies.
An introduction to liquid cooling
Liquid cooling has garnered a lot of attention recently. It is a sophisticated method employed to manage the heat generated by computer hardware, such as servers and Graphics Processing Units (GPUs), by circulating a liquid coolant around or through these components.
This coolant absorbs heat directly from the hardware and carries it away efficiently, offering an effective complement or alternative to traditional air-cooling methods.
Liquid cooling systems not only enhance efficiency by effectively dissipating heat but also help to maintain optimal operating conditions for hardware. Additionally, they can contribute to space savings and reduced noise levels compared to their air-cooled counterparts, making them well-suited for modern data centres and computing facilities that prioritise energy efficiency and operational stability.
Liquid cooling can also be a more sustainable option than other thermal management technologies. It can reduce a facility's overall energy use, because removing heat from a server with liquid requires less electricity than doing so with air. And whilst liquid cooling supports far denser compute deployments, it need not be detrimental to a facility's Water Usage Effectiveness (WUE) performance.
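The physics behind that efficiency claim can be illustrated with a back-of-envelope calculation: water holds roughly 3,000 times more heat per unit volume than air, so far less coolant needs to be moved to carry away the same load. The sketch below (with illustrative, not vendor-specific, figures: a hypothetical 100 kW load and a 10 K coolant temperature rise) compares the volumetric flow each medium would require:

```python
# Back-of-envelope comparison: coolant flow needed to carry away a given
# heat load with water vs air, using Q = rho * c_p * flow * delta_T.
# All load and temperature figures below are illustrative assumptions.

RHO_WATER = 998.0   # kg/m^3, water at ~20 C
CP_WATER = 4186.0   # J/(kg K), specific heat of water
RHO_AIR = 1.204     # kg/m^3, air at ~20 C
CP_AIR = 1005.0     # J/(kg K), specific heat of air

def flow_m3_per_s(heat_w: float, delta_t_k: float, rho: float, cp: float) -> float:
    """Volumetric flow (m^3/s) needed to remove heat_w watts at a delta_t_k rise."""
    return heat_w / (rho * cp * delta_t_k)

heat = 100_000.0  # hypothetical 100 kW rack-scale load
dt = 10.0         # 10 K coolant temperature rise

water = flow_m3_per_s(heat, dt, RHO_WATER, CP_WATER)  # ~2.4 L/s
air = flow_m3_per_s(heat, dt, RHO_AIR, CP_AIR)        # ~8.3 m^3/s

print(f"water: {water * 1000:.1f} L/s, air: {air:.1f} m^3/s, ratio ~{air / water:.0f}x")
```

Moving a couple of litres of water per second takes far less fan and pump energy than moving over eight cubic metres of air per second, which is why liquid cooling can lower a facility's electricity draw at high rack densities.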
It’s important to note there are several different liquid cooling methods. Immersion cooling involves submerging specially designed IT hardware (servers and GPUs) in a dielectric fluid, such as mineral oil or synthetic coolant. The fluid absorbs heat directly from the components, providing efficient and direct cooling. It can be deployed as a single method of cooling without the need for traditional air-cooled systems, or in addition, depending on the requirements of the infrastructure. This method significantly enhances energy efficiency and reduces running costs, making it ideal for AI workloads that produce substantial heat.
Direct-to-chip cooling, also known as cold plate cooling, delivers coolant directly to the heat-generating components of servers, such as Central Processing Units (CPUs) and GPUs. This targeted approach maximises heat transfer, efficiently dissipating heat at the source and improving overall performance and reliability.
By directly cooling critical components, the direct-to-chip method helps to ensure that AI applications operate optimally, minimising the risk of thermal throttling and hardware failures. This technology is ideal for data centres managing high-density AI workloads.
A mix-and-match strategy
The versatility and flexibility of liquid cooling technologies provide data centre operators with the option of adopting a mix-and-match approach tailored to their specific infrastructure and AI workload requirements.
Each cooling technology has unique strengths and limitations. Some advanced data centre operators can offer different types of liquid cooling that can be deployed in the same data centre or even the same hall. By combining immersion cooling, direct-to-chip cooling and/or air cooling, providers can leverage the benefits of each method to achieve optimal cooling efficiency across different components and workload types.
As AI workloads evolve and data centre requirements change, a flexible cooling infrastructure that supports scalability and adaptability becomes essential. Integrating multiple cooling technologies provides scalability options and facilitates future upgrades without compromising cooling performance.
For example, air cooling can support High-Performance Computing (HPC) and AI workloads to a degree, and most AI deployments will continue to require supplementary air-cooled systems for networking infrastructure. All cooling types ultimately require waste heat to be removed or re-used, so the main heat rejection system (such as chillers) must be sized appropriately and enabled for heat reuse where possible.
Addressing challenges
Innovative liquid cooling technologies represent a critical frontier in addressing the challenges posed by AI workloads in data centres. But with this innovation comes challenges to adoption, not least of which is the initial investment required for implementing the liquid cooling infrastructure.
Liquid cooling systems can involve higher upfront costs compared to traditional air-based cooling solutions, which can present a barrier to adoption for some data centre operators. Careful cost-benefit analysis and long-term planning to demonstrate the Return On Investment (ROI) of liquid cooling are required to enable successful adoption.
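A minimal sketch of that cost-benefit analysis might weigh the extra capital expenditure against annual energy savings to find a payback period. All figures below are hypothetical placeholders, not vendor pricing or VIRTUS data:

```python
# Illustrative payback-period calculation for liquid vs air cooling.
# Every figure here is a hypothetical assumption for demonstration only.

def payback_years(capex_delta: float, annual_saving: float) -> float:
    """Years for the extra upfront spend to be recouped by energy savings."""
    if annual_saving <= 0:
        return float("inf")
    return capex_delta / annual_saving

extra_capex = 400_000.0   # assumed additional upfront cost of liquid cooling
power_saved_kw = 300.0    # assumed average reduction in cooling power draw
price_per_kwh = 0.15      # assumed electricity price

# kW saved, over 8,760 hours in a year, at the assumed tariff
annual_saving = power_saved_kw * 24 * 365 * price_per_kwh

print(f"payback ~{payback_years(extra_capex, annual_saving):.1f} years")
```

Under these assumed numbers the extra spend is recovered in roughly a year; in practice the inputs vary widely by site, tariff and workload, which is exactly why the long-term planning described above matters.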
Another challenge is the complexity of liquid cooling system design and integration. Unlike air-based cooling systems, liquid cooling solutions require specialised components, such as Cooling Distribution Units (CDUs), which must be carefully integrated into existing data centre infrastructure. This means that retrofitting older data centres can be expensive as well as complex.
New data centres are likely to be better suited to support HPC and AI workloads because they have been built with these new demands in mind. Data centre providers need to invest in skilled personnel and training to effectively design, deploy and maintain liquid cooling systems tailored to the unique requirements of AI workloads.
Cooling the future
Effective cooling solutions are paramount if data centres are to meet the ever-growing demands of AI workloads. Liquid cooling technologies play a pivotal role in enhancing performance, increasing energy efficiency and improving the reliability of AI-centric operations – although we should note that air-cooling looks likely to remain in the data centre, in some form, for the foreseeable future.