4 infrastructure mistakes that slow down AI rollouts - and how to avoid them.

Michael Wang, Apr 27, 2026

Most post-mortems on underperforming AI deployments focus on model selection, data quality, or tooling choices. Infrastructure is rarely the headline. It should be.

The physical network layer - cabling, connectivity, pathway design, fibre architecture - is now a direct determinant of GPU cluster performance. The density of the fibre connections, the quality of the physical connections: these are not background conditions. They are performance variables. And decisions made early in a data centre design, often before the AI workload requirements are fully evaluated, can constrain performance in ways that are expensive and disruptive to undo.

Here are four infrastructure mistakes that consistently surface in AI deployments - and what avoiding them requires.

  1. Designing for bandwidth without considering loss

Traditional IT network design centres on throughput: how much data can the fabric move per second. For AI training and inference workloads, that framing misses the critical variables.

GPU-to-GPU communication in distributed AI training relies on collective operations - AllReduce, AllGather, Broadcast - that require tight synchronisation across dozens or hundreds of accelerators simultaneously. The underlying transport, typically RDMA over Converged Ethernet (RoCEv2) or InfiniBand, is acutely sensitive to packet loss and latency variation. Even packet loss below 0.1% can trigger retransmission events that stall GPU synchronisation, and because every accelerator waits for the slowest participant, the idle time accumulates across the entire cluster. In a 512-GPU training cluster, a single congested link can degrade total cluster utilisation enough to make the training run economically nonviable.
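To see why small loss rates matter at cluster scale, here is a deliberately simple sketch. It assumes that packet losses are independent, that any single loss stalls the whole collective step, and that each stall costs a fixed penalty. None of these assumptions or numbers come from a real fabric; the function names and parameters are hypothetical and chosen only for illustration.

```python
def stall_probability(loss_rate: float, packets_per_link: int, links: int) -> float:
    """Probability that at least one packet is lost somewhere during one collective step.

    Toy assumption: losses are independent and identically distributed on every link.
    """
    total_packets = packets_per_link * links
    return 1.0 - (1.0 - loss_rate) ** total_packets


def expected_utilisation(loss_rate: float, packets_per_link: int, links: int,
                         step_ms: float, stall_ms: float) -> float:
    """Useful time per step divided by expected wall-clock time per step.

    Toy assumption: a step with any loss pays one fixed stall penalty (stall_ms).
    """
    p_stall = stall_probability(loss_rate, packets_per_link, links)
    return step_ms / (step_ms + p_stall * stall_ms)


if __name__ == "__main__":
    # Illustrative values only: 512 links, 1,000 packets per link per step.
    for loss in (0.0, 1e-6, 1e-4, 1e-3):
        u = expected_utilisation(loss, packets_per_link=1_000, links=512,
                                 step_ms=100.0, stall_ms=50.0)
        print(f"loss={loss:.0e}  utilisation={u:.2%}")
```

The point of the model is the exponent: the chance of a clean step falls with the total number of packets in flight across all links, so a loss rate that looks negligible on one link is close to a certainty across the cluster.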

Designing for AI means specifying a lossless, low-latency fabric from the outset: high-quality optical connections that minimise insertion loss variation, and fibre infrastructure that removes the physical causes of signal degradation. Bandwidth capacity is necessary but not sufficient.
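Insertion loss can be checked at design time with a simple channel loss budget: fibre attenuation plus the loss of every mated connector pair and splice, compared against what the transceiver allows. The sketch below assumes this additive model. The function names are hypothetical, and every number in the usage example is a placeholder to be replaced with values from the relevant transceiver datasheet and cabling standard.

```python
def channel_insertion_loss_db(length_km: float, fibre_db_per_km: float,
                              connector_pairs: int, db_per_connector_pair: float,
                              splices: int = 0, db_per_splice: float = 0.0) -> float:
    """Sum the loss contributions of fibre, mated connector pairs, and splices."""
    return (length_km * fibre_db_per_km
            + connector_pairs * db_per_connector_pair
            + splices * db_per_splice)


def loss_margin_db(budget_db: float, channel_loss_db: float) -> float:
    """Remaining headroom; a negative result means the channel exceeds its budget."""
    return budget_db - channel_loss_db


if __name__ == "__main__":
    # Placeholder values for illustration only, not taken from any standard.
    loss = channel_insertion_loss_db(length_km=0.1, fibre_db_per_km=3.0,
                                     connector_pairs=2, db_per_connector_pair=0.35)
    print(f"channel loss = {loss:.2f} dB, margin = {loss_margin_db(1.9, loss):.2f} dB")
```

Running the same check with worst-case rather than typical connector loss shows how much of the budget depends on connection quality, which is the variation the paragraph above refers to.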

  2. Locking into single-generation cabling

AI infrastructure refresh cycles are compressing. Traditional data centre deployments built for 100G two years ago are already under pressure from 400G GPU interconnects. Deployments being specified for 400G today need a credible path to 800G - and 1.6T is already on the roadmap.

The mistake is optimising the physical layer for today's speed tier without leaving room to move. This typically manifests as cabling choices that cannot support higher modulation formats, connector configurations that cannot accept next-generation transceiver form factors, or pathway fill rates that leave no room for additional cable as port densities increase.
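Pathway fill is one of the few items here that can be checked with simple arithmetic. The sketch below estimates the fill ratio of a rectangular tray from the cable count and diameter, and how many more cables fit under a chosen fill limit. The function names are hypothetical and the 40% default limit is an assumed placeholder; the applicable limit should come from the relevant standard or the pathway manufacturer.

```python
import math


def cable_area_mm2(diameter_mm: float) -> float:
    """Cross-sectional area of one round cable."""
    return math.pi * (diameter_mm / 2) ** 2


def fill_ratio(cable_count: int, diameter_mm: float,
               tray_width_mm: float, tray_depth_mm: float) -> float:
    """Fraction of the tray cross-section occupied by cable."""
    return cable_count * cable_area_mm2(diameter_mm) / (tray_width_mm * tray_depth_mm)


def spare_cable_capacity(cable_count: int, diameter_mm: float,
                         tray_width_mm: float, tray_depth_mm: float,
                         max_fill: float = 0.4) -> int:
    """How many more cables of the same diameter fit before reaching max_fill.

    max_fill=0.4 is an assumed placeholder, not a quoted requirement.
    """
    usable_area = max_fill * tray_width_mm * tray_depth_mm
    max_cables = math.floor(usable_area / cable_area_mm2(diameter_mm))
    return max(0, max_cables - cable_count)


if __name__ == "__main__":
    # Illustrative tray: 300 mm x 50 mm carrying 100 cables of 3 mm diameter.
    print(f"fill = {fill_ratio(100, 3.0, 300, 50):.1%}")
    print(f"spare capacity = {spare_cable_capacity(100, 3.0, 300, 50)} cables")
```

Running this for the day-one cable count and again for the count expected after the next speed upgrade gives a quick indication of whether the pathway leaves room to move.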

AI infrastructure also calls for different network architectures, such as mesh and spine/leaf. A future-proof cabling design accounts for where the technology is going, not just where it is. For fibre, that means specifying OM4 or OM5 multimode - or single mode - at densities that accommodate the next generation of transceivers without a full replant. For fibre patch panels, a modular design allows future expansion when transceiver form factors change or when a new generation of AI pod raises the required density. The cabling has a lifespan measured in decades; the active electronics above it do not. Treating them as having the same refresh horizon is one of the most common and costly mistakes in data centre infrastructure planning.

  3. Underestimating fibre count requirements

Nothing reveals the difference between traditional IT network design and AI fabric design faster than fibre count. Conventional data centre planning models - built around server-to-ToR connectivity and north-south traffic patterns - dramatically underestimate the fibre requirements of GPU clusters.

AI fabrics are east-west dominant. Every GPU needs a high-bandwidth, low-latency path to every other GPU it works with, directly or through the spine. High-radix spine switches, which are the backbone of any serious AI fabric, can present 64 or more 400G ports per device. A single pod of GPU servers, fully cabled to a non-blocking fabric, can require thousands of individual fibre connections; this connectivity within the pod is also called scale-up networking. The scale-out fabric that connects pods across a multi-pod deployment is another of the most significant physical changes driven by AI.

Together, these changes can mean several times more fibre has to be installed than in a conventional design before AI training can even begin.
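A rough count shows how quickly the numbers grow. The sketch below assumes a two-tier leaf/spine fabric in which every GPU-facing port has a matching uplink when the fabric is non-blocking, and in which each link uses a fixed number of fibres (for example 8 for a four-lane parallel optic, 2 for a duplex link). The function names and default values are illustrative assumptions, not a sizing method from the article.

```python
import math


def fabric_links(gpus: int, ports_per_gpu: int = 1,
                 oversubscription: float = 1.0) -> int:
    """Count server-to-leaf links plus leaf-to-spine links in a two-tier fabric.

    oversubscription=1.0 models a non-blocking fabric (one uplink per downlink).
    """
    downlinks = gpus * ports_per_gpu
    uplinks = math.ceil(downlinks / oversubscription)
    return downlinks + uplinks


def fabric_fibres(gpus: int, ports_per_gpu: int = 1, fibres_per_link: int = 8,
                  oversubscription: float = 1.0) -> int:
    """Total fibre count, assuming every link uses the same optic type."""
    return fabric_links(gpus, ports_per_gpu, oversubscription) * fibres_per_link


if __name__ == "__main__":
    # Illustrative pod: 512 GPUs, one fabric port each, 8 fibres per link.
    print(fabric_links(512), "links,", fabric_fibres(512), "fibres")
```

Under these assumptions a 512-GPU pod needs 1,024 links and 8,192 fibres, before any scale-out links between pods or spare capacity are counted.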

High-density pre-terminated MPO cabling systems - designed for rapid, error-free deployment at the fibre counts AI fabrics demand - are the practical answer to this problem, both for initial deployment and for the phased expansions that follow.

  4. Treating cable management as an afterthought

Cable management in GPU-dense environments is not a housekeeping matter. It is a reliability issue.

Cable management determines the maintainability of the infrastructure over its operational life. A congested cable tray in a 400G environment means that any change - a transceiver swap, a port reassignment, a fabric expansion - carries the risk of disturbing adjacent connections. In an environment where physical connector quality is a direct performance variable, that risk is not trivial. Fibre that is bent below minimum bend radius, stressed at connector interfaces, or subject to repeated handling without proper strain relief is fibre that will degrade signal integrity over time.

Routing, labelling, and pathway capacity need to be part of the infrastructure design process from day one - not resolved during commissioning when options are limited.

Infrastructure built for what AI actually demands

The common thread across all four mistakes is timeline: decisions made before the full demands of the AI workload are understood, optimised for conditions that will not persist. The remedy is designing the physical layer with the same forward-looking discipline applied to compute and software architecture - specifying for longevity, density, and headroom rather than the minimum viable configuration for today.

Aginode's data centre connectivity portfolio - including high-density fibre cabling, pre-terminated MPO solutions, and duplex LC connectivity supporting speeds up to 400G and beyond - is designed for exactly these environments.
 


About the author

Michael Wang

Michael Wang (王君原) is the APAC & MEA Product Director at Aginode. He is an expert in structured cabling systems, serving as a member of the Subcommittee on Interconnection of Information Technology Equipment (SAC/TC28/SC25) under China’s National Technical Committee for Information Technology Standardization. He is also an active expert in the ISO/IEC JTC1 SC25 WG3 working group, contributing to the development and revision of both national and international standards. Michael has co-authored several industry white papers and specializes in intelligent building cabling and data center infrastructure design and planning, with extensive hands-on project experience.