Michael McNerney, Senior Vice President of Marketing and Network Security at Supermicro, says data centre operators must focus on developing more robust solutions and maintaining the balance of their systems.
While AI is not a new concept, recent technological advancements have required a shift in how businesses across industries manage their workloads. This evolution has empowered organizations to tackle complex computational challenges and to handle routine tasks more efficiently. At the same time, data centre operators face mounting pressure to adapt their infrastructure.
AI technologies, ranging from natural language processing to ML models, have transformed traditional data centre configurations, enabling them to better meet business demands. The growing reliance on AI requires a fundamental rethink of compute, storage and networking infrastructure to accommodate vast data-processing volumes and stringent SLA requirements.
To address these challenges effectively, data centre operators must focus on developing more robust solutions and keeping their systems balanced, since an imbalance in any one area can create bottlenecks.
One of the most pressing challenges of AI workloads is the demand for specialized, accelerated computing hardware. Traditional CPUs are increasingly inadequate for the massive parallel processing that AI models require, particularly when training deep learning algorithms. Consequently, data centres are turning to hardware accelerators such as graphics processing units (GPUs), tensor processing units (TPUs) and other AI-specific chips. These specialized processors are designed to meet the high-throughput and low-latency requirements of AI workloads.
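As a rough illustration, the sketch below (assuming PyTorch is installed) times the same large matrix multiply on a CPU and, where one is present, on a GPU. The absolute figures depend entirely on the hardware, but the gap between the two is why accelerators now dominate AI compute.

```python
# Minimal sketch, assuming PyTorch: time one large matrix multiply on the
# CPU and, if available, on a GPU. Illustrative only; results vary by hardware.
import time
import torch

def time_matmul(device: str, n: int = 4096) -> float:
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    _ = a @ b                      # warm-up: triggers lazy initialization
    if device == "cuda":
        torch.cuda.synchronize()   # wait for the warm-up kernel to finish
    start = time.perf_counter()
    _ = a @ b
    if device == "cuda":
        torch.cuda.synchronize()   # GPU kernels launch asynchronously
    return time.perf_counter() - start

print(f"cpu: {time_matmul('cpu'):.3f}s")
if torch.cuda.is_available():
    print(f"gpu: {time_matmul('cuda'):.3f}s")
```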
However, transitioning to AI-specific hardware involves more than merely replacing CPUs with GPUs. Operators must also consider the heightened power and cooling demands that come with more powerful hardware. AI workloads significantly increase power consumption, so data centres must work out how to cool these denser systems efficiently.
Additionally, operators must ensure that their infrastructure is scalable. AI workloads can grow rapidly, often exponentially, as data accumulates and models are refined. Achieving scalability may involve adopting modular data centre designs or deploying disaggregated infrastructure that allows compute, storage and networking resources to scale independently.
AI compute workloads differ from traditional applications in their requirement for vast amounts of data. ML models must be trained on extensive datasets, often comprising terabytes or even petabytes of information. This creates a pressing need for storage solutions that deliver both greater capacity and higher performance.
Traditional storage architectures, such as spinning-disk systems, are unlikely to meet the demands of AI workloads. As a result, many data centres are transitioning to high-performance storage solutions based on solid-state drives (SSDs) and non-volatile memory express (NVMe) protocols. These technologies provide the low-latency, high-throughput performance essential for feeding data to AI models at the required speed.
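As a simple illustration, the sketch below measures sequential read throughput for a file on a given device. The mount points are hypothetical, and repeat runs can be served from the operating system's page cache rather than the device, so treat the numbers as indicative only.

```python
# Minimal sketch: sequential read throughput of a file, in MB/s.
# Paths and block size are illustrative assumptions; note that repeat
# runs may hit the OS page cache rather than the underlying device.
import time

def read_throughput_mb_s(path: str, block_size: int = 4 * 1024 * 1024) -> float:
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while chunk := f.read(block_size):
            total += len(chunk)
    return (total / 1e6) / (time.perf_counter() - start)

# e.g. compare the same dataset shard on flash versus spinning disk:
# print(read_throughput_mb_s("/mnt/nvme/shard-000.bin"))
# print(read_throughput_mb_s("/mnt/hdd/shard-000.bin"))
```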
Storage systems for AI workloads must also be highly flexible. Operators must accommodate the unstructured nature of AI data, which can originate from diverse sources, including text, images and sensor data, so storage systems must be capable of managing varied data types. The increasing need for real-time data processing and the growing adoption of edge computing in AI applications also require data centres to move data dynamically between storage tiers, from low-latency flash for active datasets to cost-effective archival storage for long-term retention.
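A tiering policy can be as simple as demoting datasets that have not been read recently and promoting them back on access. The sketch below is a minimal illustration; the tier names, the 30-day threshold and the catalogue entries are all assumptions, not a reference implementation.

```python
# Minimal sketch of an age-based tiering policy. The tier names,
# threshold and catalogue below are hypothetical examples.
from datetime import datetime, timedelta

HOT_WINDOW = timedelta(days=30)  # illustrative threshold

def choose_tier(last_accessed: datetime, now: datetime) -> str:
    """Keep recently read datasets on flash; demote the rest to archive."""
    return "flash" if now - last_accessed <= HOT_WINDOW else "archive"

# e.g. sweep a (hypothetical) dataset catalogue nightly:
catalogue = {
    "clickstream-2024": datetime(2025, 1, 10),
    "sensor-logs-2023": datetime(2024, 3, 2),
}
now = datetime(2025, 1, 15)
for name, accessed in catalogue.items():
    print(f"{name} -> {choose_tier(accessed, now)}")
```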
Data locality is another critical consideration. AI workloads often perform better when located close to the data they process, minimizing latency and enhancing performance. This trend drives the development of distributed storage architectures, where data is stored across multiple physical locations closer to its point of use.
As AI workloads become more prevalent, networking infrastructure emerges as a crucial area for data centre operators to address. The substantial data volumes involved in AI training and inference require networking systems capable of handling significant traffic within and between data centres and the cloud. Traditional Ethernet-based networks may struggle to meet the latency and bandwidth requirements of AI, pushing operators towards higher-speed, lower-latency network fabrics.
Another vital aspect of networking is ensuring high levels of redundancy and reliability. Given that AI workloads are often mission-critical, any service interruption can have severe consequences. This reality has prompted some data centre operators to adopt software-defined networking (SDN) solutions, which offer enhanced flexibility and control over network traffic, making it easier to manage AI workloads across distributed infrastructure and to maintain consistent performance under heavy load.
Finally, data centre operators must be equipped to support hybrid and multi-cloud environments, as many organizations opt to run AI workloads across both on-premises data centres and public cloud platforms. This necessitates robust networking solutions that seamlessly connect private and public cloud resources, allowing data and workloads to transition fluidly between environments without compromising security or performance.
As AI workloads push data centre infrastructure to its limits, the associated power and cooling requirements escalate dramatically. High-density compute nodes, such as GPUs and AI accelerators, generate significantly more heat than traditional servers, necessitating more sophisticated cooling solutions, such as liquid cooling.
Liquid cooling systems can dissipate heat effectively and scale to meet the demands of dense AI workloads. Liquid cooling also delivers long-term savings by reducing overall energy consumption and extending hardware lifespan. Combined with energy-efficient hardware, these solutions support strategies to minimize carbon footprints and improve the data centre's power usage effectiveness (PUE).
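For context, PUE is total facility power divided by the power delivered to IT equipment, so values closer to 1.0 mean less energy spent on overheads such as cooling. The figures in the sketch below are illustrative only, not measured data.

```python
# PUE = total facility power / IT equipment power. Illustrative figures:
# a facility drawing 1,500 kW overall while delivering 1,200 kW to IT
# equipment has a PUE of 1.25.
total_facility_kw = 1500.0  # IT load plus cooling, power distribution, lighting
it_equipment_kw = 1200.0    # servers, storage and networking alone
print(f"PUE = {total_facility_kw / it_equipment_kw:.2f}")  # PUE = 1.25
```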
The rise of AI workloads presents both challenges and opportunities for data centre operators. To seize these opportunities, operators must rethink their infrastructure strategies, encompassing compute power, storage, networking and cooling. By investing in scalable, high-performance systems capable of addressing the unique demands of AI, data centres can position themselves to meet the increasing demand for AI services while maintaining efficient and reliable operations.
AI is not merely another workload; it is rapidly becoming a cornerstone of modern enterprise IT. Data centre operators who can adapt their infrastructure to meet the demands of AI will be well-positioned to support the next wave of innovation and digital transformation.