Filecoin Spark: Common Critiques

Written by

Patrick Woodhead

Date published

December 13, 2024

In this post, we look at some common critiques of Filecoin Spark and give our best response to each. We begin by reintroducing the protocol. This post assumes a solid understanding of Filecoin.

What is the goal of Spark?

The goal of Spark is to measure the retrievability of Filecoin so that it can be improved. The meme that "you can't retrieve your data from Filecoin" is out there and we hope that Filecoin Spark data can be used to dismiss that narrative.

How does Spark work?

The best way for a restaurant critic to review a restaurant is to sample the menu. The best way to audit the retrievability of data from Filecoin is to sample some retrievals. Spark randomly samples retrievals on Filecoin.
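To make the sampling idea concrete, here is a minimal sketch of how a sampled retrieval success rate could be computed. It is not Spark's actual implementation: the `Deal` type, its fields, and the function names are all hypothetical, and each retrieval outcome is reduced to a simple success flag.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Deal:
    provider: str      # hypothetical SP identifier
    payload_cid: str   # hypothetical content identifier


def sample_deals(deals, k, seed=None):
    """Pick up to k deals uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(deals, min(k, len(deals)))


def retrieval_success_rate(results):
    """Fraction of attempted retrievals that succeeded (0.0 if none attempted)."""
    results = list(results)
    if not results:
        return 0.0
    return sum(1 for ok in results if ok) / len(results)


def rsr_by_provider(outcomes):
    """Group (deal, succeeded) pairs by provider and compute each provider's rate."""
    grouped = {}
    for deal, ok in outcomes:
        grouped.setdefault(deal.provider, []).append(ok)
    return {sp: retrieval_success_rate(oks) for sp, oks in grouped.items()}
```

In this sketch, a checker would call `sample_deals` to choose what to test, attempt each retrieval, and feed the `(deal, succeeded)` pairs into `rsr_by_provider`. Because the sample is random, a provider cannot know in advance which deals will be tested, much like a restaurant cannot know which dish the critic will order.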

Who is this data valuable to?

When you want to choose a local restaurant, you may look at the Google Maps star rating and some recent reviews. You essentially look at a sample of previous clients or even professional critics who write about their experiences. You are much less likely to trust a restaurant if it has a low rating or has no reviews at all. These reviews are valuable to prospective clients of that restaurant.

The endgame for Spark is for Spark data to be valuable to the (prospective) clients of Filecoin because it helps them to build trust in Filecoin as a network and to choose which Filecoin Storage Providers (SPs) to store their data with.

Pre-endgame, while Filecoin works towards the table-stakes functionality of a storage offering comparable to Web2 incumbents, Spark data is valuable to the FIL+ team, who are designing incentives to achieve these table stakes, and to the Filecoin Foundation, which also wants to make Filecoin great again.

The data is also currently valuable to SPs, who can use it to improve their retrieval setup. However, they are unlikely to be the long-term funders of Spark, just as restaurants don't fund Michelin critics or Google reviews, or at least shouldn't, given the risk of a conflict of interest.

Spark critiques and our responses

"Spark doesn't retrieve data the right way"

On the surface, Spark is simply trying to sample retrievals from Filecoin in the manner in which Filecoin wants clients to retrieve their data, using publicly available information. This is already a contentious point in the Filecoin community: some people think data should be retrievable by payload CID, others think only by piece CID, and some think Filecoin should just be archival storage. Spark is not opinionated about how you are supposed to retrieve from Filecoin, only that the data should be retrievable. Spark can adapt to the latest canonical way to retrieve, and it can have variants that sample retrievals from Filecoin in different ways.

"Spark data is used by FIL+ and we don't like FIL+"

Spark is independent of FIL+. However, Spark data is used by FIL+ allocators, and in particular the FIDL allocator, to decide which SPs to give datacap to. SPs are reluctant to serve retrievals, as opposed to just storing sealed copies of the data, because it essentially means paying for twice the storage space (an extra unsealed copy for each sealed copy) as well as more bandwidth and infrastructure to serve retrievals. If you can get a 10x datacap multiplier without doing any of the retrieval work, that is much more cost-efficient.

Despite this, the data shows how successful the FIL+ team has been in improving the Spark retrieval success rate (RSR) of Filecoin. Over the last six months, the overall Spark RSR of Filecoin has risen from 1% to over 18%. N.B. I believe this is as much a win for the FIDL team as for Spark, which is simply the data provider.

Also, if you look at the current top-performing SP, their Spark RSR has increased from around 20% to very close to 100% over the last six months. Our team actually worked directly with this SP to debug their retrieval-related issues. As a prospective client, I would feel much more confident storing with this SP having seen this data.

"Spark RSR does not equal Filecoin RSR"

If SPs can game their Spark score then it is not a good indicator of the actual Filecoin RSR. Our team is aware of this and we regularly discuss all known attack vectors and think up mitigations. We believe Spark scores are currently a good heuristic for the actual retrievability of data on Filecoin. However, as the stakes get higher for having a good Spark score, more sophisticated attacks will emerge.

"A proof of retrieval doesn't exist"

This paper gives an impossibility result for a crisp proof of retrieval protocol. If there were a way to do proof of retrieval, the world-class cryptographers at Protocol Labs would have shipped it already. A proof of retrieval is not possible. The Spark protocol starts from this fact and, rather than trying to solve the impossible, tries to build a protocol in which reputation and incentives push the actors to report honestly.

"You cannot prove that an SP is storing a hot copy, this needs something like PoRep - SPs can just proxy the Spark retrieval request onto someone who is storing the hot copy"

We don't view this as a problem. Spark is testing that the provider is able to serve the content from a deal on behalf of the network. IPFS and Filecoin are based on content addressing, which is about the network’s ability to serve content, not about the ability to fetch it from a specific location. However, clients do need to know at least which node to ask for the hot copy, and this is what we can get from IPNI. What's more, this leaves room for SPs to save costs on hot storage: they can cooperate with other SPs to guarantee that at least one hot copy is available nearby and can be served back to the client. The same goes for PDP.

"Some of the Spark Protocol is still run by the Space Meridian team"

Yes, some parts are still run by our team, but these parts are end-to-end (e2e) verifiable and repeatable, or at least have a path to being so. Eventually we plan to make the whole protocol decentralised, or at least federated, but our approach has been to ship pragmatically and iteratively. You can see details on e2e retrievability here.