Musk’s Colossus In Memphis To Reach 3 Million GPUs, Prepares To Eat The World


From “technocracy.news”

Musk named Colossus after the 1970 science fiction horror flick Colossus: The Forbin Project, in which the US builds a military supercomputer that takes the US and the world hostage, threatening nuclear annihilation. Watch the movie. Musk is expanding Colossus with another 100-acre purchase in Memphis, TN, to scale Colossus up to 3 million state-of-the-art AI processors. Got power, Memphis?

⁃ Patrick Wood, Editor.

A step-by-step analysis and projection of xAI's Colossus supercomputer project. Memphis development officials and xAI have discussed a 1+ million chip supercluster at the site. The analysis covers AI rack density power, internal power allocation for 1 million chips, the potential for expansion to 2–3 million GPUs, and a likely project timeline for 2025–2027. Industry norms are used for the estimates.

This projects demand to scale AI training compute another 15–30X beyond 1 million B200s. That would mean roughly 3 million Rubin, Ultra Rubin, Dojo 3, or Dojo 4 chips in 2027: 3X more chips, each with perhaps 10X the compute.

Summary Timeline
Dec 31, 2025: 1 million GPUs, 1,641 MW, 15,625 racks. [Could slip by 3 months or more, since everything has to go right; with delays, roughly 800K chips within the 1.2 GW power limit.]

Dec 31, 2026: 2 million GPUs, 3,281 MW, 31,250 racks.

Dec 31, 2027: 3 million GPUs, 4,922 MW (or 3,150 MW with efficiency), 46,875 racks.

Acceleration Factors
GPU Pace: From 122 days for the first 100,000 chips, to roughly 90 days for the second 100,000, and a projected 60 days per 100,000 GPUs, reflecting Musk’s optimization playbook (e.g., Tesla Shanghai: 12 months to production).

Power Pace: 33 MW/month to 100 MW/month + bursts aligns with xAI’s rapid turbine deployment and TVA’s 1.2 GW pledge by 2027.

Constraints Mitigated: 750,000 sq ft and 13 MGD cooling suffice; chip supply (NVIDIA’s $1.08B deal, $5 billion Dell/NVIDIA deals) and permits (Memphis support) keep pace.

This timeline assumes xAI sustains its aggressive scaling, leveraging parallel workflows and regional cooperation.

1. AI Rack Density Power from $5 Billion in Dell B200-Enabled Servers
Given:

Colossus is a 750,000 square foot facility with 200,000 H100/H200 chips currently installed, drawing less than 250 MW of power.
Tesla Megapacks buffer the load so that the site’s power demand does not impact peak power availability for other residential and business customers.
A $5 billion investment in Dell B200-enabled servers is used for the expansion to 1 million chips.
Rack density power refers to the power consumed per rack of servers, typically measured in kilowatts (kW) per rack.

Current Rack Density Calculation:

The existing 200,000 NVIDIA H100/H200 GPUs use 250 MW (250,000 kW).
Each H100 GPU consumes 700 W (0.7 kW) at peak, and H200s are similar or slightly higher (800 W). Assuming an average of 750 W per GPU:
200,000 GPUs × 0.75 kW = 150,000 kW (150 MW) for GPUs alone.
The remaining 100 MW (250 MW total – 150 MW GPUs) accounts for cooling, networking, CPUs, and other infrastructure (~0.5 kW per GPU in overhead).

Industry-standard AI racks (e.g., NVIDIA HGX or Dell systems) typically house 64 GPUs per rack (8 servers × 8 GPUs each, per the Supermicro-style designs used at Colossus).
200,000 GPUs ÷ 64 GPUs/rack = ~3,125 racks.
250,000 kW ÷ 3,125 racks = ~80 kW per rack currently.
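A minimal back-of-envelope sketch of this split, using the figures above (250 MW, 200,000 GPUs, an assumed 750 W average, and 64 GPUs per rack are all estimates, not xAI-confirmed numbers):

```python
# Split the current ~250 MW between GPUs and overhead, then derive rack density.
# All inputs are the estimates stated above.
TOTAL_MW = 250          # current site draw
GPUS = 200_000          # installed H100/H200 count
AVG_GPU_KW = 0.75       # assumed average of 700 W (H100) and ~800 W (H200)
GPUS_PER_RACK = 64      # 8 servers x 8 GPUs per rack

gpu_mw = GPUS * AVG_GPU_KW / 1000          # 150 MW for the GPUs alone
overhead_mw = TOTAL_MW - gpu_mw            # ~100 MW cooling/network/CPUs
racks = GPUS / GPUS_PER_RACK               # ~3,125 racks
kw_per_rack = TOTAL_MW * 1000 / racks      # ~80 kW per rack today

print(f"GPU power: {gpu_mw:.0f} MW, overhead: {overhead_mw:.0f} MW")
print(f"Racks: {racks:,.0f}, density: {kw_per_rack:.0f} kW/rack")
print(f"Overhead per GPU: {overhead_mw * 1000 / GPUS:.2f} kW")
```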

Dell B200 Servers:

The NVIDIA B200 (Blackwell) GPU is significantly more power-hungry than the H100/H200, with estimates of 1,000–1,200 W per GPU given its larger HBM3e memory and roughly 20 PFLOPS of low-precision compute.

Assuming Dell B200 servers follow a similar 8-GPU-per-server, 64-GPU-per-rack design:
At 1,100 W per GPU (midpoint estimate): 64 GPUs × 1.1 kW = 70.4 kW for GPUs.
Adding ~50% overhead for cooling and infrastructure (industry norm for high-density AI, per McKinsey data): 70.4 kW × 1.5 = ~105 kW per rack.
The $5 billion investment likely funds part of the expansion to 1 million chips. If all were B200s:
1,000,000 GPUs × 1.1 kW = 1,100 MW for GPUs alone.
Total power with overhead: 1,100 MW × 1.5 = 1,650 MW.
Number of racks: 1,000,000 ÷ 64 = ~15,625 racks.
Rack density: 1,650,000 kW ÷ 15,625 racks = ~105 kW per rack.
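The same arithmetic projected forward to an all-B200 build-out of 1 million chips, keeping the assumed 1,100 W midpoint and 50% overhead multiplier (a sketch of the estimate above, not a vendor specification):

```python
# Project rack density and total site power for an all-B200 deployment.
# 1,100 W per GPU and the 1.5x overhead multiplier are assumptions.
B200_KW = 1.1
OVERHEAD = 1.5
GPUS_PER_RACK = 64
TARGET_GPUS = 1_000_000

rack_kw = GPUS_PER_RACK * B200_KW * OVERHEAD    # ~105 kW per rack
gpu_mw = TARGET_GPUS * B200_KW / 1000           # 1,100 MW for GPUs alone
total_mw = gpu_mw * OVERHEAD                    # 1,650 MW with overhead
racks = TARGET_GPUS // GPUS_PER_RACK            # 15,625 racks

print(f"Rack density: {rack_kw:.1f} kW/rack")
print(f"Total power:  {total_mw:,.0f} MW across {racks:,} racks")
```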

105 Kilowatts Per Rack

The AI rack density power for Dell B200-enabled servers is approximately 105 kW per rack, assuming a $5 billion investment equips a significant portion of the 1 million-chip goal with B200 GPUs.

2. Internal Power for 1 Million Chips

Given:

Current setup: 200,000 H100/H200 chips at 250 MW.
Planned: 1 million chips with 250 MW existing + 950 MW from STM-130 turbines = 1,200 MW total.

Power Estimate:

If all 1 million chips are B200s (1,100 W each):
1,000,000 × 1.1 kW = 1,100 MW for GPUs.
Total with 50% overhead: 1,100 MW × 1.5 = 1,650 MW.

Current plan provides 1,200 MW (250 MW + 950 MW), which falls short by ~450 MW for a full B200 deployment.

If a mix of H100/H200 (750 W) and B200 (1,100 W) is used, let’s assume 50% each:
500,000 H100/H200 × 0.75 kW = 375 MW.
500,000 B200 × 1.1 kW = 550 MW.
Total GPU power = 925 MW.
With 50% overhead: 925 MW × 1.5 = ~1,387 MW.

The 1,200 MW available covers ~86% of this mixed-chip scenario (1,200 ÷ 1,387), suggesting either:
A slightly reduced density or efficiency tweaks (e.g., lower clock speeds).
A phased rollout until more power is secured.
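A sketch of the mixed-fleet estimate; the 50/50 split, per-GPU wattages, and 50% overhead are assumptions, not a known deployment plan:

```python
# Estimate site power for a mixed H100/H200 + B200 fleet and compare it
# with the planned 1,200 MW supply. Fleet split and wattages are assumed.
OVERHEAD = 1.5
SUPPLY_MW = 1_200
fleet = {"H100/H200": (500_000, 0.75),   # (count, kW per GPU)
         "B200":      (500_000, 1.10)}

gpu_mw = sum(count * kw for count, kw in fleet.values()) / 1000   # 925 MW
total_mw = gpu_mw * OVERHEAD                                      # ~1,387 MW
coverage = SUPPLY_MW / total_mw                                   # ~86%

print(f"GPU power: {gpu_mw:.0f} MW; with overhead: {total_mw:.0f} MW")
print(f"Planned 1,200 MW covers {coverage:.0%} of this scenario")
```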

Water Cooling:

13 million gallons/day (MGD) is ~9,028 gallons per minute (GPM).
For 1,650 MW (full B200 scenario), direct liquid cooling typically circulates roughly 0.5–1 GPM per kW of loop flow (per data center norms).
1,650,000 kW × 0.75 GPM/kW ≈ 1.24 million GPM of recirculating coolant. Because that water circulates in closed loops, the 13 MGD supply (~9,028 GPM) only has to cover makeup water and evaporative losses, which keeps the full B200 scenario within capacity.
The 1,200 MW plan implies ~900,000 GPM of loop flow and is likewise covered.
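A small conversion sketch to make the GPM-versus-MGD relationship explicit, using the assumed 0.75 GPM/kW loop flow:

```python
# Convert the 13 MGD water allocation to GPM and compare it with the
# estimated recirculating coolant flow. 0.75 GPM/kW is an assumed norm.
MIN_PER_DAY = 24 * 60

def mgd_to_gpm(mgd: float) -> float:
    """Millions of gallons per day -> gallons per minute."""
    return mgd * 1_000_000 / MIN_PER_DAY

supply_gpm = mgd_to_gpm(13)        # ~9,028 GPM of external supply
loop_gpm = 1_650_000 * 0.75        # ~1.24M GPM circulating in closed loops

print(f"Supply: {supply_gpm:,.0f} GPM; coolant loop flow: {loop_gpm:,.0f} GPM")
print("Loop water recirculates; the supply covers makeup/evaporation only.")
```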

Answer:

Internal power for 1 million chips is ~1,387 MW with a mixed H100/H200-B200 setup, or 1,650 MW if all B200s. The planned 1,200 MW covers the former with adjustments or requires more power for the latter. Water supply (13 MGD) is sufficient for either.

3. Expansion Beyond 1 Million Chips

Power Options:

Mobile STM-130 Turbines: Each Solar Turbines STM-130 provides 16 MW. 950 MW requires ~59 units (950 ÷ 16). Adding another 450 MW (to reach 1,650 MW) needs ~28 more turbines, totaling 87. Space and permits are constraints, but xAI’s rapid deployment (17–18 turbines already) suggests feasibility.

TVA Grid Upgrades: TVA approved 150 MW initially, with plans for 1.2 GW by 2026–2027. Beyond 1,200 MW, additional substations (4–6) could add 600–900 MW, pushing total capacity to 1,800–2,100 MW.

Dojo 2/3 or Rubin Chips: Tesla’s Dojo chips (e.g., Dojo 2 in 2025, Dojo 3 in 2026) or NVIDIA Rubin (2026) may lower power per chip (e.g., 500–800 W via efficiency gains, per Etched/Taalas trends). For 1.5 million chips at 700 W each:
1,500,000 × 0.7 kW = 1,050 MW GPUs + 525 MW overhead = 1,575 MW, within reach with upgrades.
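A quick sketch of the turbine arithmetic, assuming 16 MW per mobile unit as stated above:

```python
# Turbine arithmetic: 16 MW per mobile STM-130 unit (the figure used above).
TURBINE_MW = 16

planned = int(950 / TURBINE_MW)     # ~59 units for the planned 950 MW
extra = int(450 / TURBINE_MW)       # ~28 more units to reach 1,650 MW
deployed_mw = 17 * TURBINE_MW       # 17 units already on site = 272 MW

print(f"950 MW -> ~{planned} turbines; +450 MW -> ~{extra} more "
      f"({planned + extra} total)")
print(f"17 deployed units supply {deployed_mw} MW")
```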

Possibility:

1.5–2 million chips are plausible by 2027 with 1,800–2,100 MW from turbines and TVA, especially if newer chips reduce power needs.

4. Possible Project Timeline (2025–2027)
Assumptions:

Current: 200,000 GPUs, 250 MW, 750,000 sq ft.
Goal: 1 million GPUs, 1,200 MW, 1 million sq ft by 2026 (per X post sentiment).
Pace, historical and projected:

GPU Installation Pace:
First 100,000 GPUs: 122 days (~820 GPUs/day).

Second 100,000 GPUs: 90 days (~1,111 GPUs/day).

Acceleration trend: Each subsequent 100,000 GPUs reduces installation time by ~10–15 days (e.g., 75 days, then 60 days), plateauing at ~60 days due to logistics and workforce limits.

Power Installation Pace:
Current: 250 MW installed; 950 MW (STM-130 turbines) planned. Roughly 100 MW has been added in ~3 months so far (33 MW/month).

Acceleration: Assume power scales with GPU installs, doubling to 66 MW/month by mid-2025 (e.g., 17 turbines in 90 days = 272 MW), then 100 MW/month by 2026 with TVA grid upgrades.

Power Needs:
1 million GPUs: 15,625 racks × 105 kW/rack = 1,641 MW.

2 million GPUs: 31,250 racks × 105 kW/rack = 3,281 MW.

3 million GPUs: 46,875 racks × 105 kW/rack = 4,922 MW.

Space: 750,000 sq ft fits 41,250–51,562 racks (2.6–3.3 million GPUs), so 3 million GPUs is feasible.
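A sketch that reproduces the power and space figures above from the assumed 105 kW/rack, 64 GPUs/rack, and 8–10 sq ft/rack footprints (planning estimates, not confirmed numbers):

```python
# Power and floor-space needs at 1, 2, and 3 million GPUs.
GPUS_PER_RACK = 64
RACK_KW = 105                  # assumed all-B200 rack density
USABLE_SQFT = 750_000 * 0.55   # ~412,500 sq ft available for racks

for gpus in (1_000_000, 2_000_000, 3_000_000):
    racks = gpus // GPUS_PER_RACK
    mw = racks * RACK_KW / 1000
    sqft_tight, sqft_std = racks * 8, racks * 10   # 8 vs 10 sq ft per rack
    print(f"{gpus:>9,} GPUs: {racks:>6,} racks, {mw:>5,.0f} MW, "
          f"{sqft_tight:,}-{sqft_std:,} sq ft "
          f"({sqft_tight / USABLE_SQFT:.0%}-{sqft_std / USABLE_SQFT:.0%} of usable)")
```

At 3 million GPUs the standard 10 sq ft/rack footprint overruns the 55%-usable estimate, which is why the tighter 8 sq ft/rack layout matters in the background section further below.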

2025: Reach 1 Million GPUs
Starting Point (March 08, 2025): 200,000 GPUs, 250 MW, 3,125 racks.

GPU Installation Rate:
3rd 100,000 (to 300,000): 75 days (~1,333 GPUs/day).

4th–10th 100,000 (to 1,000,000): Average 65 days each (1,538 GPUs/day), reflecting acceleration plateau.

Power Installation Rate:
Initial: 66 MW/month (doubled from 33 MW/month).

Mid-2025: 100 MW/month with TVA/grid scaling.

Q1–Q2 2025:
March 08–June 01 (85 days): +100,000 GPUs (75 days + 10-day buffer) = 300,000 GPUs.
Racks: 4,687. Power: 250 MW + 198 MW (3 months × 66 MW/month) = 448 MW.

June 02–Dec 01 (183 days): +700,000 GPUs (7 × 65 days = 455 days, but parallelized to 183 days with multiple crews).
Total GPUs: 1,000,000. Racks: 15,625.

Power: 448 MW + 600 MW (6 months × 100 MW/month) = 1,048 MW.

Additional: 17 STM-130 turbines (272 MW) in 90 days = 1,320 MW by Oct 01.

Q3–Q4 2025:
Oct 01–Dec 31 (92 days): Final 321 MW to 1,641 MW (e.g., 20 turbines + TVA boost).
GPUs: 1,000,000 stabilized. Power: 1,641 MW.

Facility: 156,250 sq ft (10 sq ft/rack), 38% of usable space.

Outcome by Dec 31, 2025:
1 million GPUs, 1,641 MW, 15,625 racks. Achieved ~9 months from March 08 with accelerated pace.
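A rough scheduling sketch of the 2025 ramp, showing how much crew parallelism the calendar windows above imply; the batch durations, monthly power adds, and turbine batch are all assumptions from this projection:

```python
# 2025 ramp: sequential install days per batch vs. the calendar window,
# plus the year-end power tally. All figures are the projection's assumptions.
import math

# (phase, GPUs added, sequential install days, calendar days available)
phases = [("3rd 100K (Mar 08 - Jun 01)", 100_000, 75, 85),
          ("4th-10th 100K (Jun 02 - Dec 01)", 700_000, 7 * 65, 183)]

for label, gpus, seq_days, window in phases:
    crews = math.ceil(seq_days / window)   # parallel install crews implied
    rate = gpus / window                   # effective site-wide GPUs/day
    print(f"{label}: {seq_days} sequential days -> {crews} crew(s) "
          f"over {window} days (~{rate:,.0f} GPUs/day)")

# Year-end power: 250 MW start + 198 MW and 600 MW monthly adds
# + 272 MW turbine batch + 321 MW final boost
power_steps_mw = [250, 198, 600, 272, 321]
print(f"Power by Dec 31, 2025: {sum(power_steps_mw):,} MW")
```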

2026: Scale to 2 Million GPUs
GPU Installation: 60 days per 100,000 GPUs (1,667 GPUs/day, max crew efficiency).

Power Installation: 100 MW/month sustained, plus 300 MW bursts from turbine batches (e.g., 19 turbines in 60 days).

Target: 2 million GPUs = 31,250 racks, 3,281 MW.

Q1–Q2 2026:
Jan 01–June 30 (181 days): +1,000,000 GPUs (10 × 60 days = 600 days, parallelized to 181 days).
GPUs: 2,000,000. Racks: 31,250.

Power: 1,641 MW + 600 MW (6 × 100 MW/month) + 572 MW (2 × 286 MW turbine batches) = 2,813 MW.

Q3–Q4 2026:
July 01–Dec 31 (184 days): Optimize and add 468 MW (e.g., TVA substation + 29 turbines).
Power: 3,281 MW by Nov 01.

Space: 312,500 sq ft (76% of 412,500 sq ft usable).

Outcome by Dec 31, 2026:
2 million GPUs, 3,281 MW, 31,250 racks. Achieved ~21 months from March 08.

2027: Scale to 3 Million GPUs
GPU Installation: 60 days per 100,000 GPUs.

Power Installation: 100 MW/month + 500 MW bursts (e.g., TVA upgrades, 31 turbines in 90 days).

Target: 3 million GPUs = 46,875 racks, 4,922 MW (or less with efficient chips).

Chip Efficiency: Assume Dojo 3/Rubin at 700 W/GPU by mid-2027:
3,000,000 × 0.7 kW = 2,100 MW GPUs + 50% overhead = 3,150 MW.
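A one-line comparison of the two 3-million-GPU power scenarios, using the assumed wattages and 50% overhead (the 4,922 MW figure above comes from the rounded 105 kW/rack density, so the per-GPU method lands slightly higher for B200s):

```python
# 3M-GPU site power: all-B200 vs. the assumed 700 W efficient chip.
GPUS = 3_000_000
OVERHEAD = 1.5

for chip, kw_per_gpu in [("B200 (1,100 W)", 1.1), ("Dojo 3/Rubin (700 W)", 0.7)]:
    total_mw = GPUS * kw_per_gpu * OVERHEAD / 1000
    print(f"{chip}: {total_mw:,.0f} MW")
```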

Q1–Q2 2027:
Jan 01–June 30 (181 days): +700,000 GPUs (7 × 60 days, parallelized to 180 days).
GPUs: 2,700,000. Racks: 42,187.

Power: 3,281 MW + 600 MW (6 × 100 MW/month) = 3,881 MW (B200) or 2,835 MW (mixed efficient chips).

Q3–Q4 2027:
July 01–Dec 31 (184 days): +300,000 GPUs (3 × 60 days, parallelized to 90 days).
GPUs: 3,000,000. Racks: 46,875.

Power: 3,881 MW + 1,041 MW (TVA + 65 turbines) = 4,922 MW (B200) or 3,150 MW (Dojo 3/Rubin).

Space: 468,750 sq ft at 10 sq ft/rack (375,000 sq ft at a tighter 8 sq ft/rack), which fits if usable rack area is pushed toward the upper end of the 412,500–515,620 sq ft range.

Outcome by Dec 31, 2027:
3 million GPUs, 4,922 MW (B200) or 3,150 MW (efficient chips), 46,875 racks. Achieved ~33 months from March 08.


BACKGROUND on Physical Space for Racks

Space Feasibility: Racks in 750,000 Square Feet
Rack Footprint: A standard AI server rack (e.g., NVIDIA HGX or Dell design) occupies ~10 square feet, including front/back clearance and aisles (per ASHRAE data center guidelines). High-density setups might squeeze this to 8–10 sq ft/rack.

Usable Space: Data centers typically allocate 50–60% of total floor space to IT equipment (racks), with the rest for cooling, power distribution, and support (per Uptime Institute norms).
Assume 55% usable: 750,000 sq ft × 0.55 = 412,500 sq ft for racks.

Racks Possible:
At 10 sq ft/rack: 412,500 sq ft ÷ 10 = 41,250 racks.

At 8 sq ft/rack (tight layout): 412,500 sq ft ÷ 8 = 51,562 racks.

GPUs Supported:
41,250 racks × 64 GPUs/rack = 2,640,000 GPUs.

51,562 racks × 64 GPUs/rack = 3,299,968 GPUs.

For 1 Million GPUs:
1,000,000 GPUs ÷ 64 GPUs/rack = 15,625 racks.

Space required: 15,625 × 10 sq ft = 156,250 sq ft (37.9% of 412,500 sq ft), or at 8 sq ft/rack = 125,000 sq ft (30.3%).

Conclusion: The 750,000 sq ft facility has ample space for 15,625 racks (1 million GPUs), leaving room for 25,625–35,937 additional racks (1.64–2.3 million more GPUs) depending on layout efficiency.
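A compact sketch of this space arithmetic, with the 55% usable fraction and the 8–10 sq ft rack footprints as the stated assumptions:

```python
# Rack and GPU capacity of the 750,000 sq ft building under the assumed
# 55% usable fraction and 8-10 sq ft per rack footprints.
TOTAL_SQFT = 750_000
USABLE = TOTAL_SQFT * 0.55          # 412,500 sq ft for IT equipment
GPUS_PER_RACK = 64

for footprint in (10, 8):           # sq ft per rack: standard vs. tight
    racks = int(USABLE / footprint)
    print(f"{footprint} sq ft/rack: {racks:,} racks, "
          f"{racks * GPUS_PER_RACK:,} GPUs")

racks_1m = 1_000_000 // GPUS_PER_RACK
print(f"1M GPUs: {racks_1m:,} racks, {racks_1m * 10:,} sq ft at 10 sq ft/rack "
      f"({racks_1m * 10 / USABLE:.1%} of usable)")
```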

