Equilibrium Infra Bulletin #41: Decentralized AI Training, Hardware-Accelerated SVM Chain, State of Interop, and more...
Equilibrium designs, builds, and invests in core infrastructure for the decentralized web. We are a global team of ~30 people who tackle challenges around security, privacy, and scaling.
🔍 Frontier Training: The past, present, and future of decentralized training
⚡️ Topic Summary
Training frontier AI models requires enormous data centers with tens of thousands of interconnected GPUs. The high cost of training (estimated to reach ~$10bn in the near future) acts as a strong centralizing force, consolidating power among the few companies that can afford it. Decentralized training aims to enable companies to distribute workloads across a heterogeneous network of GPUs that are not physically co-located in a single data center, reducing barriers to entry and shifting the power dynamics.
One important thing to note is that distributed training ≠ decentralized training: decentralized training is a subset of distributed training in which, in addition to not being physically co-located, the hardware is typically heterogeneous and untrusted.
While decentralized training is promising in theory, it faces three main hurdles:
Bandwidth and latency: Training requires high bandwidth, but decentralized networks operate over the public internet, where bandwidth is often limited to roughly 100 Mb/s to 1 Gb/s. Through optimizations such as DisTrO and SWARM, the bandwidth requirements can be reduced by several orders of magnitude (a simplified gradient-sparsification sketch follows this list).
Trustlessness and privacy: If participants are “untrusted,” how do we ensure they do not tamper with model parameters or leak data? We can either use cryptography and trusted hardware (ZKPs, FHE, or TEEs, of which TEEs are the most feasible from an efficiency standpoint) or rely on economic incentives (with dynamics similar to PoS networks).
Scale: The key question is whether decentralized networks can reach the same scale as centralized training runs, with billions or trillions of parameters. Interestingly, some research indicates that compute time grows faster than communication time as models get larger, which means communication becomes a relatively smaller bottleneck for larger models (a short back-of-envelope below illustrates why).
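To see why, here is a rough back-of-envelope. The setup is our own illustrative assumption, not a description of any specific decentralized-training system: a standard transformer layer with hidden size d, split across pipeline stages that communicate over a slow public-internet link.

```latex
% Illustrative assumption: one transformer layer with hidden size d,
% partitioned across pipeline stages connected over the public internet.
\[
\underbrace{F(d) = O(d^2)}_{\text{FLOPs per token (weight matmuls)}}
\qquad
\underbrace{C(d) = O(d)}_{\text{activation values sent per token}}
\qquad\Rightarrow\qquad
\frac{F(d)}{C(d)} = O(d)
\]
% As d grows, each byte crossing the network is backed by proportionally more
% local compute, so slow links hurt relatively less at larger model sizes.
```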
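And to make the bandwidth point concrete, below is a minimal NumPy sketch of one generic bandwidth-reduction idea: top-k gradient sparsification, where each worker transmits only the largest-magnitude fraction of its gradient. This is not the DisTrO or SWARM algorithm; the function names and the 1% keep-fraction are our own assumptions for the example.

```python
import numpy as np

def sparsify_topk(grad: np.ndarray, keep_fraction: float = 0.01):
    """Keep only the largest-magnitude fraction of gradient entries.

    Returns (indices, values) -- what a worker would actually transmit.
    Generic sparsification sketch, NOT the DisTrO or SWARM algorithm; it only
    illustrates how compressed updates shrink per-step communication volume.
    """
    flat = grad.ravel()
    k = max(1, int(keep_fraction * flat.size))
    idx = np.argpartition(np.abs(flat), -k)[-k:]  # indices of the top-k entries
    return idx, flat[idx]

def densify(indices: np.ndarray, values: np.ndarray, shape) -> np.ndarray:
    """Rebuild a dense (mostly zero) gradient on the receiving side."""
    flat = np.zeros(int(np.prod(shape)), dtype=values.dtype)
    flat[indices] = values
    return flat.reshape(shape)

# Toy example: a 10M-parameter gradient, transmit only 1% of its entries.
rng = np.random.default_rng(0)
grad = rng.standard_normal(10_000_000).astype(np.float32)

idx, vals = sparsify_topk(grad, keep_fraction=0.01)
restored = densify(idx, vals, grad.shape)

dense_bytes = grad.nbytes                # ~40 MB per step if sent uncompressed
sparse_bytes = idx.nbytes + vals.nbytes  # indices + values actually sent
print(f"dense update:  {dense_bytes / 1e6:.1f} MB")
print(f"sparse update: {sparse_bytes / 1e6:.1f} MB "
      f"(~{dense_bytes / sparse_bytes:.0f}x less traffic)")
```

Real low-communication training methods typically go further, combining compression of this kind with techniques such as quantization, error feedback, and less frequent synchronization.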
🤔 Our Thoughts
While the decentralized approach adds overhead (mainly from communication, fault tolerance, and potentially cryptographic proofs), it still has the potential to be cheaper overall than centralized training. This is due to lower cooling needs (devices are spread out), avoiding the upfront cost of building data centers, tapping into latent hardware, and, most importantly, leveraging crypto-economic incentives: instead of paying for compute upfront, network participants are given ownership in the trained model and a share of future revenues.
At this point, multiple research teams have shown that it’s technically possible to train substantial models in a distributed manner across networks of geographically diverse hardware, with communication reduced by factors of hundreds to thousands. However, truly trustless and secure decentralized training is still something we’re working towards as an industry.
💡 Research, Articles & Other Things of Interest
📚 Solayer Chain: Presenting InfiniSVM - a hardware-accelerated SVM blockchain.
📚 5 pieces on AI x crypto: What, where, how: Five highlights from the a16z crypto team on content at the intersection of crypto and AI.
📚 The State of Interop (2025): Interop between different chains remains a key issue, but there’s also been a lot of progress.
🤌 Personal Recommendations From Our Team
📚 Reading: Plastic List Report: Testing 300 Bay Area foods for plastic chemicals produced some interesting findings.
🎧 Listening: Avishai Cohen Trio - Festival de jazz de Leverkusen 2024: New Avishai Cohen Trio concert video.
💡 Other: Kakizome, the Japanese take on New Year’s resolutions: Instead of “I’m going to read at least one book per week,” a better approach may be “It’s the year of reading.”