The Real Costs of AI

Nov 22, 2023
Daniel Pointon, Group Chief Technology Officer, STT GDC

[Photo: Pexels. The growth of artificial intelligence is also requiring a rethink of data centre engineering and design, as more servers and components have to be packed into the same physical space to allow for greater computational power.]

Artificial intelligence (AI) is here to stay, at least according to research firm Gartner’s latest report on the top 10 technology trends of 2024, with three of them revolving around AI. The promise of AI as a catalyst for transformations across industries is an exciting one that is finally materialising after decades of anticipation, prompting technology companies and infrastructure operators to take notice.

Technology reporters noted that “AI” was mentioned more than 140 times in a two-hour keynote address at Google I/O 2023. The term was used 59 times during Meta’s second-quarter earnings call and over 70 times on Alphabet’s earnings call for Q2.

In sectors as varied as healthcare, finance, transportation, entertainment and productivity, AI’s potential is now tangible. It is no surprise that market intelligence firm IDC projects global AI revenue to reach US$154 billion in 2023 and surpass US$300 billion by 2026.

However, the AI boom comes with hidden costs, including energy consumption, greater strain on existing digital infrastructure and sustainability challenges. This necessitates a closer examination of these hidden concerns and, conversely, the innovations that are propelling us towards a more sustainable AI-driven future.

AI’s energy footprint and environmental impact

Training state-of-the-art AI models demands vast computational resources. Generative AI training needs large graphics processing unit (GPU) clusters, in effect modern-day supercomputers, working overtime and consuming large amounts of energy.
The costs involved are not for the faint-hearted either, with one research house forecasting that generative AI data server infrastructure plus operating costs will exceed US$76 billion by 2028.

Furthermore, the environmental implications of AI’s energy demands are increasingly difficult to stomach if sustainability isn’t a core consideration. While data on the carbon footprint of generative AI queries is scarce, a recently published paper attempts to illustrate its potential impact. Alphabet’s chairman John Hennessy has stated that interacting with a large language model is likely to cost 10 times more than a standard keyword search. It is evident that as generative AI continues to gain popularity, both energy costs and the carbon footprint will be in the spotlight, giving a clear advantage to those who can support these future workloads sustainably.

Beyond energy and environmental demands, the growth of AI is also requiring a rethink of data centre engineering and design. As computational needs surge, the trend is shifting towards higher rack densities within data centres. This means packing more servers and components into the same physical space, allowing for greater computational power.

This isn’t a new trend, given that the proliferation of general-purpose public cloud computing had already started down this path of densifying computing. However, the emergence of AI serves to significantly accelerate this trend, with exploding demand for training, machine learning and inference applications combining with chip thermal design power (TDP) breaking new limits. TDP is a measure (in watts) of the maximum heat that a computer chip, such as a central processing unit (CPU) or GPU, is expected to generate under sustained load, and which its cooling system must therefore be able to remove.

Modern data centres with high-density racks typically offer 10-30 kilowatts (kW) per rack, with typical use cases including cloud computing and content delivery at scale such as video, gaming, and social media.
These rack densities are well-suited to CPU-dominant computing workloads and perhaps modest GPU applications. However, when considering latest-generation GPUs such as Nvidia’s H100, a key ingredient of the next wave of AI, it becomes evident that we will require rack densities well above 30 kW, soon reaching 100 kW or more per rack. These densities go far beyond the practical and economical limits of air-based cooling and are driving an industry-wide transition to liquid cooling at an accelerated pace.

These ultra-high-density set-ups demand specialised infrastructure design considerations, novel power distribution methods and, crucially, advanced cooling mechanisms such as liquid cooling. The new requirements will be a challenge for older data-centre operators whose facilities weren’t initially built with the flexibility to adapt to these applications.

Interestingly, Meta announced late last year that it would pause construction on two data centre projects to redesign them for the deployment of AI infrastructure as part of an overall shift of resources towards AI. Given that data-centre builds typically span a few years, Meta clearly views this rework as a small price to pay in exchange for the substantial potential that AI offers its business.

Innovations for sustainability, and at scale

With the rise in TDP and rack densities, traditional cooling methods are proving inadequate as we reach the limits of conventional air cooling. Today’s data centres, catering to the demands of AI, must lean on alternative and more sophisticated solutions such as liquid cooling, which can offer direct and more effective ways to cool the computing infrastructure while boosting efficiency.
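
To put the rack densities discussed above in perspective, the short Python sketch below estimates how many GPU servers fit within a given rack power budget. It is a back-of-the-envelope illustration, not a sizing tool: the 700 W figure is the published maximum TDP of the H100 SXM part, while the eight-GPU server layout and the overhead factor of 1.8 (covering CPUs, memory, networking and power-conversion losses) are assumptions chosen purely for illustration, and the function names are hypothetical.

```python
import math

H100_SXM_TDP_W = 700  # published maximum TDP of the H100 SXM part, in watts


def server_power_kw(gpus_per_server: int, gpu_tdp_w: float, overhead_factor: float) -> float:
    """Estimate one server's power draw in kW.

    overhead_factor scales the GPU power up to cover CPUs, memory,
    networking and power-conversion losses. Its value is an assumption.
    """
    return gpus_per_server * gpu_tdp_w * overhead_factor / 1000.0


def servers_per_rack(rack_budget_kw: float, per_server_kw: float) -> int:
    """Number of whole servers that fit within a rack's power budget."""
    return math.floor(rack_budget_kw / per_server_kw)


if __name__ == "__main__":
    # Assumed server: 8 GPUs, with an illustrative overhead factor of 1.8.
    per_server = server_power_kw(
        gpus_per_server=8, gpu_tdp_w=H100_SXM_TDP_W, overhead_factor=1.8
    )
    for budget_kw in (10, 30, 100):
        fit = servers_per_rack(budget_kw, per_server)
        print(f"{budget_kw:>3} kW rack: {fit} server(s), {fit * 8} GPU(s)")
```

Under these assumptions each server draws roughly 10 kW, so a 10 kW rack cannot power even one of them, a 30 kW rack holds two, and a 100 kW rack holds nine. Real deployments depend on the actual server design and on how much power and cooling the facility can deliver to each rack.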
That said, implementing liquid cooling comes with its own set of challenges, including the nascent supply chain, the close coupling required between servers and cooling systems, the ability to provide redundancy in case of failure, the lack of standards and, very importantly, the risk of leakage.

Modular and scalable set-ups can go a long way towards addressing some of these issues. One example is our recently announced partnership with Firmus Technologies to deploy “sustainable AI factories” using our immersion-cooled HyperCube platform. These platforms integrate powerful GPU clusters running on Nvidia L40S, A100 and H100 GPUs with purpose-designed liquid immersion cooling tanks filled with a safe, non-conductive liquid. They serve as the foundation for our infrastructure-as-a-service offering called Sustainable Metal Cloud (SMC).

Hosted within ST Telemedia Global Data Centres’ facilities, SMC also signals a broader trend towards specialised infrastructure offerings that democratise access to AI compute. These offerings serve businesses and organisations that want to tap AI but find it daunting to source GPUs amid a global supply shortage, let alone build and run their own GPU clusters and data centres. This lets even smaller enterprises leverage the right type of computing infrastructure for sophisticated AI applications without the need for colossal infrastructure investments.

Equally important is the use of immersion cooling at scale, which is ideally placed to address the macro sustainability of the data-centre sector. The combination of Nvidia’s accelerated computing platform and the HyperCube’s advanced cooling capabilities already improves the energy efficiency of the compute by up to 30 per cent through the elimination of server fans and improved power supply. The overall package also has the benefit of significantly cutting both carbon emissions and power usage effectiveness (PUE), a metric used to determine how efficiently a data centre uses energy.
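
For readers unfamiliar with the metric, PUE is conventionally defined as the total energy a facility consumes divided by the energy consumed by its IT equipment, so a value of 1.0 would mean every unit of energy goes to computing. The Python sketch below shows how a lower PUE compounds with more efficient servers. The only figure taken from this article is the "up to 30 per cent" compute efficiency gain; the baseline load and both PUE values are hypothetical numbers chosen for illustration, not measurements from any facility, and the function names are likewise illustrative.

```python
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Power usage effectiveness: total facility energy / IT equipment energy."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return total_facility_kwh / it_equipment_kwh


def facility_energy_kwh(it_equipment_kwh: float, pue_value: float) -> float:
    """Total facility energy implied by an IT load and a PUE value."""
    return it_equipment_kwh * pue_value


if __name__ == "__main__":
    baseline_it_kwh = 1000.0       # hypothetical IT load for an air-cooled set-up
    baseline_pue = 1.5             # hypothetical air-cooled PUE (assumption)
    immersion_pue = 1.1            # hypothetical immersion-cooled PUE (assumption)
    compute_saving = 0.30          # upper bound quoted in the article

    baseline = facility_energy_kwh(baseline_it_kwh, baseline_pue)
    immersed = facility_energy_kwh(baseline_it_kwh * (1 - compute_saving), immersion_pue)
    saving = 1 - immersed / baseline

    print(f"Baseline facility energy:  {baseline:.0f} kWh")
    print(f"Immersion facility energy: {immersed:.0f} kWh")
    print(f"Overall reduction:         {saving:.0%}")
```

With these illustrative inputs, total facility energy falls from 1,500 kWh to about 770 kWh, a reduction of roughly 49 per cent. The point of the sketch is that savings at the server and savings in the cooling overhead multiply rather than simply add.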

Elsewhere, in a poetic twist, as much as AI is the driving force behind the challenge of growing energy consumption within the data centre, AI can also be used to help. Using predictive analytics and real-time monitoring, AI algorithms can optimise data-centre cooling by adjusting it dynamically to demand, improving not only efficiency but also reliability. To achieve this goal, we partnered with leading control-system vendors to use AI and machine-learning technology within our data centres to improve cooling-system efficiency.

Solving real problems while ensuring a sustainable and innovative future

Technology and, indeed, AI are playing a critical role in helping to solve the world’s climate change challenges. The insights gleaned from analysing climate data, for example, can help build better models of climate patterns. These in turn allow more informed predictions, which can improve the effectiveness of potential mitigation strategies.

Many in the industry are already hard at work applying deep-learning models trained on Nvidia GPUs to solve real-world problems. For example, researchers from the University of California at Berkeley and Santa Cruz, and the Technical University of Munich, are tackling earthquake prediction. Meanwhile, a medical-device company is developing technology that enables surgeons to evaluate tissue biopsies in the operating room, with AI-accelerated insights in just minutes, versus what would otherwise take weeks going through a pathology lab.

The benefits of AI to society are both radical and immense. It has the potential not only to transform economies but also to revolutionise how we work, live and play. While the current challenges posed by AI’s expansion are formidable, they are not insurmountable. Through ingenuity, innovation and a commitment to sustainability, we can democratise access and ensure a cost-effective path for organisations both large and small to leverage AI.
This will pave the way for an AI-driven world while still prioritising environmental responsibility. By understanding the true costs of AI, from energy to infrastructure, we can ensure that our journey into the future is as sustainable as it is innovative.

The writer is group chief technology officer of ST Telemedia Global Data Centres.

This article was first published in The Business Times on 16 November 2023.