Six Core Challenges in the Scalable Deployment of Intelligent Computing Centers

Release Date:

2026-02-25

As artificial intelligence advances worldwide, intelligent computing centers, the “digital foundation” underpinning large-model training, inference, and a wide range of intelligent applications, have moved from pilot projects at isolated sites to large-scale deployment. As of 2025, more than 250 intelligent computing centers had been built or were under construction nationwide, with a total computing power of 280 EFLOPS, spanning eight major hub nodes including the Beijing–Tianjin–Hebei region, the Yangtze River Delta, and the Guangdong–Hong Kong–Macao Greater Bay Area.

In fact, the large-scale deployment of intelligent computing centers is far from a mere stacking of GPUs; it is a complex systems engineering endeavor that spans project initiation, solution design, resource procurement, construction and deployment, system go-live, and final acceptance and delivery. Behind this undertaking lie not only rapid technological iteration but also challenges such as operational and maintenance complexity and an immature industry ecosystem. “Building” an intelligent computing center well is therefore merely the starting point; the real test is whether the center can be delivered ready for use and operated sustainably over the long term.

This article systematically analyzes the core challenges of large-scale delivery for intelligent computing centers along six dimensions: technical architecture, engineering implementation, energy efficiency management, software stack maturity, industry collaboration, and business models.

Complexity of the Technical Architecture: Challenges in Heterogeneous Computing Power and High-Speed Interconnectivity

The foundation of an intelligent computing center is computational power; however, compared with traditional cloud data centers, its technical architecture is significantly more complex, primarily in two key aspects:

In the realm of heterogeneous computing power management, AI accelerator chips are currently in a phase of diversified competition: NVIDIA GPUs continue to dominate the market, while domestically developed GPUs, NPUs, DPUs, and FPGAs are steadily emerging. However, chips from different vendors exhibit significant differences in programming interfaces, driver optimization, and software ecosystems, making it exceedingly challenging to pool and centrally schedule computing resources.

How can we efficiently integrate chips with different architectures within a single intelligent computing center? How can we prevent resource fragmentation? And how can we abstract away the underlying heterogeneity for upper-layer AI developers? To this day, no fully mature solutions exist for these challenges.

In terms of high-speed interconnectivity and cluster scale, training large models requires parallel collaboration across tens of thousands of GPUs, which in turn demands networks with extremely high bandwidth and low latency. Currently, InfiniBand and high-speed Ethernet (such as RoCE) are the mainstream choices; however, as the number of nodes scales to several thousand or even tens of thousands, network bottlenecks, topology design, and traffic scheduling all become critical constraints.

For example, training a large model at the GPT-4 scale requires a GPU cluster on the order of tens of thousands of accelerators, with network architecture complexity and engineering reliability requirements that far exceed those of traditional internet applications. The inherent uncertainty in the technical architecture means that scaling up and delivering intelligent computing centers is not a matter of “copy-and-paste”; rather, each deployment necessitates a fresh balance among compute resource provisioning, interconnect design, and software adaptation, leading to progressively longer delivery cycles and higher risks.
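To make the interconnect constraint above more concrete, here is a rough sketch of the standard latency-plus-bandwidth cost model for a ring all-reduce, the collective commonly used to synchronize gradients. The function name, the 10 GB payload, the 400 Gbit/s link speed, and the 5 microsecond per-step latency are illustrative assumptions, not figures from any particular cluster.

```python
def ring_allreduce_seconds(num_gpus: int,
                           payload_bytes: float,
                           link_bytes_per_s: float,
                           per_step_latency_s: float = 0.0) -> float:
    """Estimate one ring all-reduce with the alpha-beta cost model.

    A ring all-reduce takes 2 * (N - 1) steps (reduce-scatter, then
    all-gather); each step moves payload / N bytes per GPU. Congestion and
    overlap with compute are ignored, so this is an optimistic estimate.
    """
    if num_gpus < 2:
        return 0.0
    steps = 2 * (num_gpus - 1)
    chunk_seconds = payload_bytes / (num_gpus * link_bytes_per_s)
    return steps * (per_step_latency_s + chunk_seconds)


# Hypothetical inputs: 10 GB of gradients, 400 Gbit/s per GPU, 5 us per step.
PAYLOAD_BYTES = 10e9
LINK_BYTES_PER_S = 400e9 / 8
STEP_LATENCY_S = 5e-6

for n in (8, 1024, 16384):
    bandwidth_only = ring_allreduce_seconds(n, PAYLOAD_BYTES, LINK_BYTES_PER_S)
    with_latency = ring_allreduce_seconds(n, PAYLOAD_BYTES, LINK_BYTES_PER_S,
                                          STEP_LATENCY_S)
    print(f"{n:>6} GPUs: bandwidth-only {bandwidth_only:.3f} s, "
          f"with latency {with_latency:.3f} s")
```

In this model the bandwidth term stays nearly flat as the cluster grows, while the latency term grows linearly with the number of GPUs. That is one reason topology design and traffic scheduling, rather than raw link speed alone, become the constraint at scale.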

Complexity of Project Implementation: Full-Stack Challenges from the Data Center to Liquid Cooling

The core challenges in building traditional data centers lie in power supply and thermal management; in intelligent computing centers, both become considerably more severe.

First, high power density strains power supply and distribution. A high-end AI server can consume more than 10 kW, whereas a conventional general-purpose server typically draws only 2–3 kW. This means that, for the same cabinet footprint, the power demand of an intelligent computing center increases several-fold. Ensuring the stability of large-scale power supply and properly designing redundant power paths have thus become bottlenecks in scaling up deployment.
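The "several-fold" increase can be checked with simple arithmetic. The sketch below uses the per-server figures quoted above (10 kW and the midpoint of 2 to 3 kW); the number of servers per cabinet is a hypothetical value chosen only for illustration.

```python
def rack_power_kw(servers_per_rack: int, kw_per_server: float) -> float:
    """Total IT load of one cabinet, ignoring switches and conversion losses."""
    return servers_per_rack * kw_per_server


AI_SERVER_KW = 10.0        # from the text: "more than 10 kW"
GENERAL_SERVER_KW = 2.5    # midpoint of the 2 to 3 kW range in the text
SERVERS_PER_RACK = 4       # hypothetical: same server count in both cabinets

ai_rack = rack_power_kw(SERVERS_PER_RACK, AI_SERVER_KW)
general_rack = rack_power_kw(SERVERS_PER_RACK, GENERAL_SERVER_KW)
power_ratio = ai_rack / general_rack

print(f"AI cabinet: {ai_rack:.0f} kW, general cabinet: {general_rack:.0f} kW, "
      f"ratio: {power_ratio:.1f}x")
```

Under these assumptions the AI cabinet draws 40 kW against 10 kW, a fourfold increase that every upstream element (busbars, PDUs, redundant feeds) has to be sized for.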

Second, liquid-cooling systems are complex to implement. Air cooling can no longer meet the thermal-management requirements of ultra-high-power servers, making liquid cooling the standard configuration for intelligent computing centers. However, liquid cooling entails a host of intricate challenges, including pipeline installation, coolant circulation, and operational safety: How can coolant leaks be prevented? How can chip reliability be ensured in high-humidity environments? And how can equipment from multiple vendors be integrated with diverse liquid-cooling solutions? As a result, the construction of data-center facilities for intelligent computing centers has evolved from the traditional “civil engineering plus air cooling” model to a multidisciplinary engineering system.

Third, there is the challenge of delivery timelines and engineering coordination. From design to go-live, the delivery cycle for an intelligent computing center typically spans 12 to 18 months. This creates an inherent mismatch with the rapid evolution of the AI industry: by the time a center is delivered, chips may already have moved on to a newer generation, and new architectural optimization requirements then drive the need for retrofitting. This “delivery–obsolescence” paradox represents a major practical hurdle in scaling up such projects.

Energy Efficiency and Green, Low-Carbon Development: The Sustainability Challenges Behind Scaling Up

According to estimates, an AI computing cluster with a capacity of 10,000 GPUs can consume hundreds of millions of kilowatt-hours of electricity annually—equivalent to the residential electricity consumption of a medium-sized city. As the number of intelligent computing centers grows rapidly, energy consumption and carbon emissions are becoming increasingly pressing concerns.
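A back-of-the-envelope calculation shows how an estimate of this magnitude arises. The sketch below is not the source of the figure above; the per-GPU power (a 10 kW server shared by an assumed 8 GPUs), the PUE of 1.2, and continuous full-load operation are all assumptions.

```python
HOURS_PER_YEAR = 8760


def annual_energy_kwh(num_gpus: int,
                      kw_per_gpu_all_in: float,
                      pue: float,
                      utilization: float = 1.0) -> float:
    """Facility-level energy per year for a GPU cluster.

    kw_per_gpu_all_in is the server power divided by its GPU count, so it
    includes CPUs, memory, and NICs. Multiplying by PUE adds cooling and
    power-distribution overhead.
    """
    it_load_kw = num_gpus * kw_per_gpu_all_in * utilization
    return it_load_kw * pue * HOURS_PER_YEAR


# Assumed inputs: 10 kW server / 8 GPUs = 1.25 kW per GPU, PUE 1.2, full load.
estimate = annual_energy_kwh(num_gpus=10_000, kw_per_gpu_all_in=1.25, pue=1.2)
print(f"{estimate / 1e6:.1f} million kWh per year")
```

With these inputs the result is roughly 131 million kWh per year, so the order of magnitude is 10^8 kWh. Higher per-accelerator power, a worse PUE, or a larger cluster pushes the figure further up.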

First, there is the challenge of PUE (Power Usage Effectiveness). Although liquid cooling can reduce PUE to as low as 1.1 or even 1.05, maintaining long-term stability in large-scale clusters remains difficult. Any fluctuations in the cooling system can lead to degraded energy efficiency and even jeopardize cluster operations.
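PUE is defined as total facility energy divided by the energy consumed by IT equipment, so a value of 1.1 means 10% overhead for cooling and power distribution. The sketch below shows what a drift in PUE costs in absolute terms; the annual IT energy of 100 million kWh and the drifted PUE of 1.3 are hypothetical values.

```python
def pue(total_facility_kwh: float, it_kwh: float) -> float:
    """Power Usage Effectiveness: total facility energy / IT equipment energy."""
    if it_kwh <= 0:
        raise ValueError("IT energy must be positive")
    return total_facility_kwh / it_kwh


def overhead_kwh(it_kwh: float, pue_value: float) -> float:
    """Energy spent on everything except IT equipment at a given PUE."""
    return it_kwh * (pue_value - 1.0)


IT_KWH_PER_YEAR = 100e6   # hypothetical annual IT load
design = overhead_kwh(IT_KWH_PER_YEAR, 1.1)    # design point from the text
drifted = overhead_kwh(IT_KWH_PER_YEAR, 1.3)   # hypothetical degraded cooling

print(f"overhead at PUE 1.1: {design / 1e6:.0f} million kWh")
print(f"overhead at PUE 1.3: {drifted / 1e6:.0f} million kWh")
print(f"cost of the drift:   {(drifted - design) / 1e6:.0f} million kWh")
```

For this hypothetical load, drifting from 1.1 to 1.3 triples the overhead energy, which is why sustained PUE matters more than the best value measured at commissioning.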

Second, the supply of green energy is insufficient. The “East Data, West Computing” initiative calls for building data centers in western regions to leverage clean energy. In practice, however, clean energy supply is inherently volatile and transmission distances are limited, making it difficult to fully align computing demand with energy supply.

Third, there is the challenge of balancing energy efficiency with performance. Computing resource scheduling often requires striking a trade-off between “full-load performance” and “energy-saving mode.” How to ensure the efficiency of AI training while avoiding unnecessary energy waste is a critical issue that must be addressed in large-scale operations.

Maturity of the Software Stack: The Gap from AI Frameworks to Compute Resource Scheduling

Hardware can be stacked through procurement, but the maturity of the software ecosystem is what determines whether an intelligent computing center is truly “easy to use.”

First, AI frameworks lack sufficient compatibility. Mainstream AI frameworks such as PyTorch and TensorFlow are highly optimized for NVIDIA GPUs, but offer only limited support for domestically produced chips. As a result, many domestic GPU vendors must independently adapt their deep-learning operator libraries, leading to high migration costs and a suboptimal user experience for developers.

Second, computing power scheduling and resource management systems are still immature. Traditional Kubernetes is not fully suited to large-scale AI clusters. Task scheduling involves multi-dimensional requirements (such as GPU memory size, interconnect topology, job priority, and energy consumption policies) that are far more complex than those in conventional cloud-native scheduling. AI-specific computing power scheduling systems are still under active development and have yet to mature.
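To illustrate why the scheduling problem is multi-dimensional, here is a minimal filter-then-score sketch. It covers only three of the dimensions named above (GPU memory, accelerator vendor, job priority) and leaves out interconnect topology and energy policy entirely. All class, field, and vendor names are hypothetical; this is not how Kubernetes or any production scheduler is implemented.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    name: str
    vendor: str        # e.g. "nvidia" or a hypothetical "vendor_b"
    free_gpus: int
    gpu_mem_gb: int


@dataclass
class Job:
    name: str
    gpus: int
    min_gpu_mem_gb: int
    priority: int                  # larger values are scheduled first
    vendor: Optional[str] = None   # None: any accelerator type is acceptable


def pick_node(job: Job, nodes: List[Node]) -> Optional[Node]:
    """Filter feasible nodes, then choose the tightest fit to limit fragmentation."""
    feasible = [
        n for n in nodes
        if n.free_gpus >= job.gpus
        and n.gpu_mem_gb >= job.min_gpu_mem_gb
        and (job.vendor is None or n.vendor == job.vendor)
    ]
    if not feasible:
        return None
    return min(feasible, key=lambda n: (n.free_gpus - job.gpus, n.name))


def schedule(jobs: List[Job], nodes: List[Node]) -> Dict[str, Optional[str]]:
    """Place jobs in priority order; unplaceable jobs map to None."""
    placement: Dict[str, Optional[str]] = {}
    for job in sorted(jobs, key=lambda j: -j.priority):
        node = pick_node(job, nodes)
        if node is not None:
            node.free_gpus -= job.gpus   # reserve the GPUs on the chosen node
        placement[job.name] = node.name if node is not None else None
    return placement


nodes = [
    Node("a", "nvidia", free_gpus=8, gpu_mem_gb=80),
    Node("b", "nvidia", free_gpus=4, gpu_mem_gb=80),
    Node("c", "vendor_b", free_gpus=8, gpu_mem_gb=64),
]
jobs = [
    Job("infer", gpus=2, min_gpu_mem_gb=40, priority=5),
    Job("train", gpus=8, min_gpu_mem_gb=80, priority=10, vendor="nvidia"),
    Job("big", gpus=4, min_gpu_mem_gb=80, priority=1, vendor="nvidia"),
]
print(schedule(jobs, nodes))
```

Even in this toy setup the low-priority job is left unplaced although ten GPUs remain free across the cluster, a small-scale picture of the fragmentation problem that real schedulers must handle alongside topology and energy constraints.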

Third, observability and operations tools are inadequate. As cluster scale expands to the tens of thousands of GPUs, even the smallest fault can result in substantial losses. Therefore, real-time monitoring and predictive analytics for GPU health, network topology, and job execution status represent critical gaps that urgently need to be addressed in the software stack of intelligent computing centers.
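The scale effect behind this point can be sketched with a simple reliability model. Assuming independent failures at a constant rate (an idealization), the mean time between failures of the whole cluster is the per-device figure divided by the device count. The 50,000-hour device MTBF below is a hypothetical value, not a measured one.

```python
import math


def cluster_mtbf_hours(device_mtbf_hours: float, num_devices: int) -> float:
    """Mean time between failures anywhere in the cluster (independent devices)."""
    return device_mtbf_hours / num_devices


def prob_no_failure(device_mtbf_hours: float, num_devices: int,
                    job_hours: float) -> float:
    """Probability that a job spanning all devices sees no failure (exponential model)."""
    return math.exp(-num_devices * job_hours / device_mtbf_hours)


DEVICE_MTBF_HOURS = 50_000   # hypothetical per-GPU figure
NUM_GPUS = 10_000

print(f"cluster MTBF: {cluster_mtbf_hours(DEVICE_MTBF_HOURS, NUM_GPUS):.1f} h")
print(f"P(no failure in 24 h): {prob_no_failure(DEVICE_MTBF_HOURS, NUM_GPUS, 24):.4f}")
```

Under these assumptions a device that fails once in several years still produces a cluster-level fault every 5 hours, and a 24-hour job has less than a 1% chance of running uninterrupted. That is the practical case for health monitoring, fault prediction, and checkpointing.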

Challenges in Industrial Collaboration: Government–Enterprise Relations and Upstream–Downstream Bargaining

The construction of intelligent computing centers is typically led by the government, implemented by enterprises, and supported by the entire industry chain. While this model offers advantages in accelerating the deployment of computing infrastructure, it also gives rise to coordination challenges.

First, there is a misalignment between policy and demand. In some cities, the AI application ecosystem remains immature, so computing power utilization is low and a substantial share of GPU resources sits idle, even as computing power remains in short supply elsewhere. This coexistence of “excess” and “shortage” in computing capacity has evolved into a structural contradiction.

Second, bargaining power is unevenly distributed across the upstream and downstream segments. Complex conflicts of interest exist among chip manufacturers, OEMs, data center operators, and AI companies. When chip supply is tight, chipmakers wield decisive influence, forcing operators and application providers to accept higher costs.

Exploring the Business Model: How Can Computing Power Be Monetized?

The ultimate goal of scaling up the delivery of intelligent computing centers is to establish a sustainable business model. However, monetizing computing power remains in the exploratory phase.

First, there is an imbalance between the training market and the inference market. While large-model training is concentrated among a handful of industry giants, inference represents a much broader market. However, inference workloads are far more sensitive to latency and cost; thus, striking a balance between maximizing compute-resource utilization and enabling flexible billing remains a significant challenge.

Second, Computing-as-a-Service (CaaS) remains hard to deliver. Many intelligent computing centers tout “Computing-as-a-Service,” but elastic scaling and task scheduling for AI workloads are far more complex than in general-purpose cloud computing, so true “convenience on par with utilities like water and electricity” is still a long way off.

Third, the investment payback period is excessively long. Large-scale intelligent computing centers typically require investments on the order of tens of billions, yet the monetization model for the computing power market remains unclear. Operators face substantial upfront capital expenditures coupled with uncertain long-term returns, which is a major impediment to large-scale deployment.
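How strongly the payback period depends on utilization can be seen from a simple-payback calculation. Every number below (capital cost, GPU-hour price, operating cost, utilization levels) is a hypothetical placeholder in unspecified currency units, chosen only to show the shape of the relationship.

```python
HOURS_PER_YEAR = 8760


def simple_payback_years(capex: float,
                         num_gpus: int,
                         price_per_gpu_hour: float,
                         utilization: float,
                         annual_opex: float) -> float:
    """Years to recover capex from net operating income (no discounting)."""
    annual_revenue = num_gpus * HOURS_PER_YEAR * utilization * price_per_gpu_hour
    net_income = annual_revenue - annual_opex
    if net_income <= 0:
        return float("inf")   # the center never pays back at this utilization
    return capex / net_income


# Hypothetical inputs, in arbitrary currency units.
for utilization in (0.2, 0.3, 0.7):
    years = simple_payback_years(capex=2e9, num_gpus=10_000,
                                 price_per_gpu_hour=10.0,
                                 utilization=utilization, annual_opex=2e8)
    print(f"utilization {utilization:.0%}: payback {years:.1f} years")
```

With these placeholder inputs, payback ranges from never (20% utilization) through about 32 years (30%) to under 5 years (70%). This links the idle-capacity problem described earlier directly to the financing problem.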

Concluding Remarks

The large-scale deployment of intelligent computing centers is not a task that any single enterprise can accomplish on its own; rather, it is a systemic undertaking that requires industry-chain collaboration and policy guidance. To overcome the challenges outlined above, coordinated innovation across the upstream and downstream sectors is essential. The focus must extend beyond merely “building” and “constructing” to emphasize the efficiency and value of “usage” and “operations,” so that intelligent computing centers can truly serve as a powerful engine for empowering countless industries.
