Equilibrium Infra Bulletin #51: Opportunities In Decentralized AI Data Markets, Private Blockchain Infra For Institutions, and more...
Equilibrium designs, builds, and invests in core infrastructure for the decentralized web. We are a global team of ~30 people who tackle challenges around security, privacy, and scaling.
🔍 Beyond the Data Crawl: How the AI Data Landscape is Evolving
⚡️ Topic Summary
The prevailing meta in data and AI training has long been “more is better”. However, we are now seeing signs of diminishing marginal returns from additional data, and we are approaching the limit of available natural text data (excluding synthetic data).

Instead, there is a shift toward quality over quantity, as not all data is created equal. Factors such as consistency, completeness, and feature accuracy influence the usefulness of data. This trend is also evident in the AI data marketplace, which comprises a growing number of participants focusing on niche verticals, data curation, and specialized datasets. We’re also seeing the line between supplier and consumer blur, as with X and xAI.
In the quest to find more data, there has also been a growing number of IP infringements and copyright lawsuits. Examples include Meta using LibGen’s library of pirated books for training, allegedly following in the footsteps of OpenAI and Mistral. Meanwhile, Thomson Reuters (TR) recently won a case against Ross Intelligence, which had used TR’s data to train a legal research model. Multi-year licensing agreements are also common, but they are often out of reach for smaller players due to their high cost, which raises barriers to entry.
🤔 Our Thoughts
The crypto-counterparts to the above chart (AI Data Marketplace) have primarily focused on decentralized training, inference, and collecting raw data through scraping. Blockchains and token incentives are primarily leveraged to source cheaper resources, coordinate across participants, provide stronger guarantees, and pay contributors globally.
Highly curated and structured data represent an interesting next step - benefiting both open-source and corporate LLM applications, including pre-training, fine-tuning, and retrieval-augmented generation (RAG). Another interesting angle is collecting highly specialized datasets (e.g. for robotics) that require some human input or reinforcement.
One key challenge with data markets, particularly when implemented in a decentralized setting, is accurately pricing the data and sufficiently rewarding contributors. The price is partly based on the usefulness of the data, which can be difficult to determine beforehand. Some possible approaches include:
Bounty-based: Price and specifications of the data are set by the customer or a central operator, targeting a specific use case. However, this often leads to some redundancy (multiple operators providing similar datasets), which is ultimately reflected in the price that the user pays.
Market-based: A two-sided marketplace sets the price, but it is challenging to balance privacy (contributors don’t want to reveal data beforehand) against correct pricing with incomplete information.
Usage-based: Instead of an upfront payment, data contributors receive a share of the downstream value created based on their data. However, this requires being able to track what data was used by whom and for what purpose, in order to attribute usage back to the contributors.
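To make the usage-based approach more concrete, here is a minimal sketch of a pro-rata payout, assuming usage can already be tracked per contributor (which is the hard part in practice). The function name `usage_based_payouts`, the log format, and the 50% contributor share are all hypothetical choices for illustration, not a description of any existing protocol.

```python
from collections import Counter
from fractions import Fraction


def usage_based_payouts(usage_log, revenue_cents, contributor_share=Fraction(1, 2)):
    """Split a share of downstream revenue among contributors, pro rata by usage.

    usage_log: iterable of (contributor_id, records_used) tuples (hypothetical format).
    revenue_cents: downstream revenue attributed to the dataset, in integer cents.
    contributor_share: fraction of revenue paid to contributors (assumed 50%).
    """
    # Aggregate how many records from each contributor were actually used.
    usage = Counter()
    for contributor_id, records_used in usage_log:
        usage[contributor_id] += records_used

    total_used = sum(usage.values())
    if total_used == 0:
        return {}

    pool = revenue_cents * contributor_share
    # Round down to whole cents; any remainder stays with the marketplace.
    return {cid: int(pool * used / total_used) for cid, used in usage.items()}


log = [("alice", 600), ("bob", 300), ("alice", 100)]
payouts = usage_based_payouts(log, revenue_cents=10_000)
print(payouts)
```

The sketch weights every used record equally. A real system would likely need to weight records by their measured contribution to downstream value, which is a much harder attribution problem.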
Another challenge many data markets have faced in the past is that data is often most valuable if it’s novel and loses its “edge” as more parties get access to it. Maintaining exclusivity or scarcity of data in a decentralized setting requires new solutions to address wrongful use. Relying on traditional courts is unfeasible as these networks often span multiple jurisdictions and participants aren’t even known.
That said, when considering the evolution of AI training, it’s likely that more value can be added in curating and structuring datasets rather than collecting raw data through scraping. There is a potentially interesting opportunity to develop a blockchain-based solution that combines token-based incentives with data collection, curation, and structuring.
💡 Research, Articles & Other Things of Interest
📚 Sei Giga: The First Multi-Proposer EVM Layer 1 Blockchain: Sei’s new design aims to deliver 5 gigagas throughput and sub-400ms finality. Competition in the high-performance EVM landscape is heating up! 🔥
📚 ZKsync Prividium: Private Blockchain Infra Built for Institutions: Aims to provide enterprise-grade privacy (validium), built-in compliance, customizability, and seamless connection to Ethereum. The first in-production deployment of Prividium is Memento ZK Chain, which is built in collaboration with Deutsche Bank and designed to bring fund servicing fully onchain.
📚 Introducing Solana Attestation Service: An open, permissionless protocol for verifiable credentials, now live on Solana mainnet. Allows trusted issuers to associate off-chain information (such as KYC checks, geographic eligibility, or accreditation status) with a user’s wallet. These attestations are signed, verifiable, and reusable across applications without exposing sensitive data onchain or duplicating verification steps.
📚 Alpenglow: Solana's Great Consensus Rewrite: With the new consensus design, Solana is replacing TowerBFT and PoH with Votor (voting and block finalization logic) and Rotor (data dissemination protocol). The core aim is to reduce actual finality from 12.8s to 150ms (median) and lay the foundation for future protocol upgrades, such as multiple concurrent proposers (MCP).
📚 The Case for and Against MCP: Benefits include increased censorship resistance, scaling the base protocol instead of outsourcing, and dispersing MEV. The challenges and downsides include higher hardware requirements, data availability of invalid transactions, the need for a finality gadget, and additional competition in ordering.
🤌 Personal Recommendations From Our Team
📚 Reading: KYC Is The Crime: The recent Coinbase customer data leak incident serves as a timely reminder to rethink our approach to compliance.
🎧 Listening: Medieval Eminem - One Hour Mixtape
💡 Other: Unknown Species of Bacteria Discovered in China's Space Station: It’s almost like… we’ve seen this movie before 😰