Equilibrium Infra Bulletin #49: Research Deep Dive Edition - Decentralized Diffusion Networks
Jommi here! Enjoy this special deep dive edition of our newsletter :)
Equilibrium designs, builds, and invests in core infrastructure for the decentralized web. We are a global team of ~30 people who tackle challenges around security, privacy, and scaling.
🔍 🤿 Jommi’s Research Deep Dive - Decentralized Diffusion Networks
⚡️ Topic Summary
Hey all, this is Jommi from Equilibrium. I wanted to go in a slightly different direction than our regular newsletter and dive deeper into a single topic. The idea is to showcase some of the internal R&D work we do at Equilibrium as we explore new topics and seed new ideas.
An interesting research paper came out earlier this year called Decentralized Diffusion Models (link). It outlines a method for training SOTA AI image/video generators, known as diffusion models, across a distributed network of nodes, seemingly with some elements of decentralization. I wanted to understand this paper and its implications for decentralized AI. We've heard about attempts to train autoregressive models (think ChatGPT) over distributed networks, but could something about diffusion models make them a better fit for utilizing distributed and permissionless networks?
The basics: Diffusion models work by learning to reverse a noise-adding process; they start with random static and gradually refine it into a coherent output based on their training. The core idea examined is how to split this complex training process across many independent computers, potentially forming a distributed or even permissionless system.
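To make the noise-reversal idea a bit more concrete, here is a minimal sketch of a DDPM-style forward noising step and reverse denoising loop in plain NumPy. The linear schedule and the `predict_noise` placeholder are illustrative assumptions, not the setup used in the paper.

```python
import numpy as np

# Hypothetical stand-in for a trained noise-prediction network.
def predict_noise(x_t, t):
    return np.zeros_like(x_t)  # a real model would estimate the noise that was added

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear noise schedule (an illustrative choice)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)      # cumulative signal-retention factors

def forward_noise(x0, t, rng):
    """Forward process: blend clean data with Gaussian noise at step t."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return x_t, eps

def reverse_step(x_t, t, rng):
    """One reverse step: subtract the predicted noise, then re-add a little fresh noise."""
    eps_hat = predict_noise(x_t, t)
    mean = (x_t - (1.0 - alphas[t]) / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
    if t == 0:
        return mean
    return mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)

# Generation: start from pure static and iteratively denoise it into an output.
rng = np.random.default_rng(0)
x = rng.standard_normal((3, 64, 64))
for t in reversed(range(T)):
    x = reverse_step(x, t, rng)
```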
Decentralized Diffusion Models
As stated earlier, the primary focus of this deep dive was the "Decentralized Diffusion Models" (DDM) paper. Its central concept involves dividing the training data into distinct clusters based on content similarity using a technique called DINOv2 clustering. You can think of DINOv2 as an AI that understands the content of images and gives each one a numerical 'description' (embedding). k-means is then an algorithm that groups data points with similar 'descriptions' into a predefined number (K) of clusters. So basically it's a mathematical way of finding similarity in a dataset and then creating K clusters based on that similarity.
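For intuition, here is a rough sketch of what that partitioning step could look like in practice, assuming the publicly released DINOv2 ViT-S/14 backbone on torch.hub and scikit-learn's KMeans; the paper's exact model size, preprocessing, and clustering settings may well differ, and the dataset path is hypothetical.

```python
import glob
import torch
from PIL import Image
from torchvision import transforms
from sklearn.cluster import KMeans

# Assumption: the publicly released DINOv2 ViT-S/14 backbone from torch.hub;
# the paper's exact model size and preprocessing may differ.
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def embed(paths):
    """Turn each image into a single DINOv2 embedding (its numerical 'description')."""
    feats = []
    with torch.no_grad():
        for p in paths:
            img = preprocess(Image.open(p).convert("RGB")).unsqueeze(0)
            feats.append(model(img).squeeze(0))  # ViT-S/14 yields a 384-dim vector
    return torch.stack(feats).numpy()

# Partition the dataset into K clusters of similar images.
image_paths = sorted(glob.glob("data/images/*.jpg"))  # hypothetical dataset location
K = 8                                                 # number of clusters / experts
labels = KMeans(n_clusters=K, random_state=0).fit_predict(embed(image_paths))
shards = {k: [p for p, l in zip(image_paths, labels) if l == k] for k in range(K)}
```

Each resulting shard then becomes the training set for one isolated expert, which is exactly the next step in the DDM recipe.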
After this, separate "expert" diffusion models are trained in isolation on each data partition. Key aspects of the DDM approach include:
Isolated Training: Experts train independently without needing to communicate, removing the heavy inter-node bandwidth requirements that traditional distributed training relies on.
Specialization: Each expert becomes specialized in generating content similar to its assigned data partition.
Router Mechanism: At inference (generation) time, a separate, lightweight "router" model directs user requests to the most appropriate expert(s) (a minimal sketch of such a router follows this list).
Theoretical Soundness: The paper shows that this ensemble of experts collectively optimizes the same goal as a single, large, centrally trained model - and thus should achieve similar quality in results!
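Here is a minimal sketch of how such a router could work: a lightweight model scores each expert for a given request and blends the noise predictions of the top-scoring experts. The design below (a router conditioned only on a prompt embedding) is an illustrative assumption, not the paper's exact implementation.

```python
import numpy as np

class Expert:
    """Stand-in for a diffusion model trained in isolation on one data cluster."""
    def __init__(self, name):
        self.name = name
    def predict_noise(self, x_t, t, prompt_embedding):
        return np.zeros_like(x_t)  # a real expert would run its denoiser here

class Router:
    """Lightweight model that scores how relevant each expert is to a request.

    For illustration it is just a linear layer over a prompt embedding followed
    by a softmax; the paper's router may be conditioned differently."""
    def __init__(self, num_experts, embed_dim, rng):
        self.w = rng.standard_normal((num_experts, embed_dim)) * 0.01
    def weights(self, prompt_embedding):
        logits = self.w @ prompt_embedding
        exp = np.exp(logits - logits.max())
        return exp / exp.sum()

def routed_denoise_step(x_t, t, prompt_embedding, experts, router, top_k=2):
    """Blend the noise predictions of the top-k experts chosen by the router."""
    w = router.weights(prompt_embedding)
    top = np.argsort(w)[-top_k:]
    w_top = w[top] / w[top].sum()
    return sum(wi * experts[i].predict_noise(x_t, t, prompt_embedding)
               for wi, i in zip(w_top, top))

rng = np.random.default_rng(0)
experts = [Expert(f"cluster_{i}") for i in range(8)]
router = Router(num_experts=8, embed_dim=384, rng=rng)
eps_hat = routed_denoise_step(np.zeros((3, 64, 64)), t=500,
                              prompt_embedding=rng.standard_normal(384),
                              experts=experts, router=router)
```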
Compositional Diffusion Models
To add some more context, I took a look at another similar paper offering an alternative strategy called "Compositional Diffusion Models" (CDM). Instead of training full "expert" models on data shards, CDM focuses on training smaller, independent components (like LoRA adapters or specialized prompts) on wholly different data subsets. At inference time, these components are mathematically combined (composed) to generate the final output, reflecting the knowledge learned across all distributed participants (see the sketch after the list below). Advantages relevant to decentralization include:
Enhanced Data Privacy: Participants only need to share small trained components, not their raw data.
Modularity: Components can be easily added or removed as participants join or leave the network.
Potential for Attribution: The way components are combined might allow tracing contributions back to specific participants.
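To illustrate the composition step, here is a hedged sketch in which each participant's independently trained component contributes a correction on top of a shared base model's noise prediction, and the corrections are combined with per-participant weights. The specific composition rule and the names used are assumptions for illustration; CDM's actual mechanism may differ.

```python
import numpy as np

def base_predict_noise(x_t, t):
    """Shared base model every participant builds on (placeholder)."""
    return np.zeros_like(x_t)

def component_predict_noise(component_id, x_t, t):
    """Noise estimate from one participant's independently trained component (placeholder)."""
    return np.zeros_like(x_t)

def composed_noise(x_t, t, component_ids, weights):
    """Combine components: base prediction plus weighted corrections from each component.

    The per-participant weights are also a natural hook for the attribution
    idea above, since each contribution enters the output explicitly."""
    base = base_predict_noise(x_t, t)
    eps = base.copy()
    for cid, w in zip(component_ids, weights):
        eps = eps + w * (component_predict_noise(cid, x_t, t) - base)
    return eps

eps_hat = composed_noise(np.zeros((3, 64, 64)), t=500,
                         component_ids=["alice", "bob", "carol"],
                         weights=[0.5, 0.3, 0.2])
```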
🤔 Our Thoughts
From the perspective of building a truly decentralized and permissionless network, these papers offer valuable foundational explorations but are not complete solutions. The positive takeaways include:
Proof of Concept for Task Partitioning: Both DDM and CDM demonstrate that diffusion model training can be effectively broken down into smaller, isolated tasks.
Lowering Communication Barriers: The emphasis on isolated training (DDM) or sharing small components (CDM) drastically reduces the need for high-speed interconnects typical of centralized training clusters.
Decoupling Training and Coordination: The router (DDM) or composition mechanism (CDM) separates the heavy lifting of model training from the lighter task of coordinating or combining the results.
However, significant challenges remain when transitioning from the controlled environments described in the papers to a dynamic, trustless, permissionless setting. Key hurdles include:
Centralized Bottlenecks: Both DDM (the data clustering and router training) and CDM (optimal classifier training) rely on centralized steps that require access to information about all data partitions, which contradicts the permissionless ideal. Maybe there could be a way to design a consensus mechanism to solve this?
Need for Trust & Verification: The papers unfortunately assume honest participants. A permissionless network requires robust mechanisms to verify the quality and integrity of contributions and prevent malicious actors.
Dynamic Network Management: Real-world p2p networks see nodes join and leave (churn). The static expert assignments in DDM or component sets in CDM need adaptation for dynamic task allocation and network membership.
Incentives: Participation in a permissionless network requires motivation. Designing effective incentive structures (e.g., crypto rewards or reputation systems) that reward honest work and resource contribution remains hard, and the technical approaches in these papers do not account for it.
Practical considerations such as hardware also shape the realities of decentralized networks: training even a single DDM expert model requires substantial GPU resources (estimated 50-100GB of VRAM), typically found only in high-end datacenter GPUs (like NVIDIA A100s). This significantly limits the ability of average users with consumer-grade hardware (max ~24GB VRAM) to participate directly in training. So consumer participation in training under this paradigm still does not seem feasible.
However, using the network for inference (generating images) with existing models might be feasible on high-end consumer cards - but I have not researched how inference results could be used to improve a diffusion model specifically. Maybe we can take some learnings from the autoregressive side, where networks like Prime Intellect's INTELLECT-2 use crowd-sourced inference traces to improve their model via reinforcement learning - although that has its limitations too.
Hope you liked this deep dive! Please reach out to us if you have any further questions, and especially if you are an expert in this topic and want to correct some parts. Until next time! 👋👋
🔥 News From Our Friends
Applications for ZuBerlin, an immersive residency bridging humanity and technology (Berlin, June 14 - 22), are closing soon. This year the focus will be on L2 Interop, Cryptography, Protocol Architecture, Onchain Data, Future Society - and of course MEV and PBS topics. Apply at zuberlin.city before it's too late!
🤌 Personal Recommendations From Our Team
📚 Reading: Foucault's Pendulum - Umberto Eco: A classic book that intricately satirizes conspiracy theories by showing how imagination can dangerously blur reality and fiction - quite a timely book even today.
🎧 Listening: How MegaETH's Node Specialization Enables Real-Time Applications - a new podcast series by Austin from Doublezero, featuring a deep dive into the technical belly of the beast that is MegaETH.
💡 Other: In an interesting experiment, Finland passed a bill that bans smartphones from schools. The law passed just a day ago and will come into force after summer vacation. Will this be the first of many such laws?