Advanced cooling advances science: One university’s immersion-cooled, supercomputing journey

Texas Advanced Computing Center (TACC) needed to upgrade its powerful Lonestar5 supercomputer to keep up with its several thousand projects and continue pushing the boundaries of supercomputing. The organisation realised immersion cooling was the only practical way forward. Tommy Minyard, Ph.D., TACC’s Director of Advanced Computing Systems, explains how GRC’s liquid immersion cooling solution addressed its concerns, enabled it to overcome obstacles, reduce its backlog and accomplish its goals.

Texas Advanced Computing Center’s (TACC) unassuming name belies the ground-breaking work for which it is widely known. Located at The University of Texas at Austin, its mission is to enable discoveries that benefit science and society through the application of advanced computing technologies. For this it receives funding from the National Science Foundation (NSF), along with other important research and education institutions.

Since its inception in 2001, TACC has evolved its capabilities while making the best use of its existing hardware investment whenever possible. As a result, it has employed a variety of cooling strategies that include CRAC and chiller, in-row, liquid-to-chip, as well as single-phase liquid immersion cooling. Immersion cooling in particular has allowed TACC to continue achieving key scientific advancements by pushing the limits of computing power. Thus, it comes as no surprise that it is now the home of the world’s longest running immersion cooling system. This system was designed by GRC, which pioneered single-phase liquid immersion cooling for data centres.

Background

TACC was busier than ever. As designer and operator of some of the world’s most powerful computing resources, it had several thousand projects in the queue and needed to upgrade its already powerful Lonestar5 supercomputer to keep up. But major obstacles stood in the way of the massive scientific leaps for which it was famous, in fields like quantum mechanics, astrophysics, photovoltaics and biological research:

  • Available space
  • Available power
  • Available budget

With the launch of its new Lonestar6 supercomputer, TACC overcame each of those obstacles, reduced its backlog and accomplished its goals with the help of GRC, along with partners, Dell Technologies, OEM Solutions and AMD.

TACC’s partnership with GRC

Faced with increasing power demands, the advent of advanced server and processor technologies and higher operating temperatures, TACC foresaw the limits of air-cooling early on. Largely out of necessity, it soon discovered liquid immersion cooling’s potential to address these challenges.

Starting with a single-rack installation in 2009, TACC has continued stretching the boundaries of supercomputing using Austin-based GRC’s innovative liquid immersion cooling solutions. It has since quadrupled the deployments of ICEraQ-cooled supercomputing systems leading up to Lonestar6.

Lonestar6: TACC’s ‘brightest’ supercomputer yet

The newest in TACC’s Lonestar series of high-performance computing systems, Lonestar6 was deployed specifically to support Texas researchers. Clocking in at an amazing three petaflops, it is three times as powerful as its predecessor and one of the fastest supercomputers at a US university.

Of course, where data centres are concerned, with increased performance comes greater heat production. Working closely with TACC, along with partners, Dell Technologies, OEM Solutions and AMD, GRC has evolved its single-phase immersion cooling systems to overcome the heat dilemma. Because of GRC’s proven performance, TACC chose to cool Lonestar6 with the latest evolution in GRC’s ICEraQ line of single-phase immersion cooling solutions: the ICEraQ Series 10 Quad.

Challenges

While housing a succession of acclaimed supercomputing systems, the Texas Advanced Computing Center has never been immune from the many challenges less celebrated centres face every day. Perhaps the biggest is discovering that air cooling is simply incapable of handling the kind of GPU-heavy compute loads that are in growing demand today — notably HPC, AI and AR/VR applications.

Despite its renown, TACC is not exempt from dealing with issues like finite space or limited funding. In the latter case, the ICEraQ Series 10 Quad helped the organisation optimise GPU processing with the allotted grant monies. TACC has faced other distinctive challenges as well, starting with the triple-digit weather that is common to its region. Complicating matters even more, “TACC is very unique in that [it has] a number of different cooling technologies and a number of different computing evolutions, all in one facility,” said Brandon Moore, GRC’s Senior Solutions Architect. GRC’s liquid immersion cooling solution addressed all these concerns. Thus, for TACC, immersion cooling soon emerged as the only practical way forward.

Why GRC

GRC’s ICEraQ Series 10 Quad enabled TACC to triple its raw computing power within the same space and power envelope. That alone stood as an overriding reason to choose GRC. But other factors also influenced its decision.

A single ICEraQ Quad with 336 high-performance nodes now carries the bulk of the Lonestar6 compute load. More compute, less space and no infrastructure upgrades made immersion cooling the ideal solution for Lonestar6.

“We had the budget to install 600 nodes,” said Tommy Minyard, Ph.D., TACC’s Director of Advanced Computing Systems. “But we didn’t have the corresponding cooling capacity for it. We evaluated several different vendors and cooling technologies and cost was a huge consideration.”

For both environmental and cost reasons, TACC also wanted to minimise the extent of modifications to the data centre, which was another reason the organisation opted for the GRC solution. So was GRC’s long-standing willingness to partner with TACC and Dell Technologies to solve problems and create ideal solutions.

Immersion cooling delivers impressive results

Thanks in no small part to single-phase immersion cooling, TACC’s Lonestar6 supercomputer delivers three times the performance of its predecessor, with less space, power and expense. That level of productivity has proven critical to TACC’s ability to continue extending the boundaries of scientific discovery.

“When running parallel simulations, you need to squeeze every bit of performance out of these computers,” said Dan Stanzione, Associate Vice President for Research at The University of Texas at Austin, and TACC’s Executive Director. “The only other option would be to run air cooling at hurricane speed, or else slow the chips down.”

The latter was not really feasible, considering that doing so would add to project lead times, reduce the number of projects that can be completed, increase costs due to longer run times and be a very inefficient use of Lonestar6’s processors.

“GRC’s immersion cooling solution has given us the ability to use the densest servers from Dell Technologies and hottest chips from AMD,” Minyard said. “These chassis have 280 W CPUs that run so hot they cannot be cooled by air.”

Other key benefits:

Sustainability

For all its raw power, TACC’s new Lonestar6 supercomputer scores highly when it comes to sustainability, too, and is playing a huge role in helping The University of Texas at Austin meet its ESG goals.

Cost-efficiency

When it comes to cost-efficiency, TACC banked on its favourable experiences with earlier GRC immersion cooling systems and could not have chosen a better way to cool Lonestar6’s servers (core processing). The organisation knew that GRC’s ICEraQ systems have been proven to cut data centre cooling energy consumption by up to 95%, increasing compute without increasing power.

Minyard continued: “The fact that we didn’t have to worry about additional infrastructure improvements saved us a significant amount of upfront costs.” Down the road, TACC can look forward to an overall TCO reduction of up to 44%, based on data gleaned from GRC’s immersion cooling installations worldwide.

Reliability and safety

Starting with its immersion cooling prototype back in 2009, TACC has found that single-phase cooling reduces the failure rate of immersed equipment, due largely to the fact that the system runs at a relatively constant temperature.

What’s more, the ICEraQ system itself is highly reliable because it has fewer moving parts than other cooling technologies. GRC’s ElectroSafe fluids serve as an excellent insulator and protectant.

Flexibility

The innovative, flexible design of the ICEraQ Series 10 Quad makes it ideal for data centre refreshes because it helps make the best use of existing infrastructure. In fact, the Lonestar6 installation required no retrofitting whatsoever. Although raised floors are not a requirement, the Series 10 easily rests atop TACC’s existing raised floor with no reinforcement required beneath it. The system is also very compact.

Servicing and convenience

GRC’s ICEraQ Series 10 Quad has made maintenance easy for TACC. Servers are simply pulled from the rack, coolant is allowed to drip back into the rack and they are then serviced just like any other type of IT equipment. The system also includes new and improved switch mounting, along with more integrated cabling and power distribution. Plus, as Minyard remarked: “The containment is located inside the racks, so it all fits very nicely and looks very neat.”

Helping Texas researchers eclipse the competition

For reasons of reliability, sustainability, servicing, flexibility and sheer cooling power, TACC has been very pleased with GRC’s ICEraQ Series 10 Quad system for Lonestar6. “The users have been happy and the performance has been fantastic,” said Minyard.

TACC continues working closely with GRC, developing new strategies to reliably cool ever-increasing power density needs. Not surprisingly, TACC’s continued track record of performance has translated into more funding.

What the future holds

Data centre hardware typically has a five-year lifespan due to changing technology and advancements in energy efficiency. While it is always difficult to predict the future, the recent past can be very instructive: performance and sustainability demands on data centres are unlikely to decrease, which means power density per server must keep climbing.

One thing is for certain: GRC will rise to the challenge by relentlessly innovating while working closely with Dell Technologies and other Tier I and Tier II technology providers.

GRC will continue being a dedicated partner to support TACC as it continues pushing the limits of supercomputing.
