
Modification: Updates for removing limit for DataCap requests on LDN applications #851

Closed
simonkim0515 opened this issue Mar 28, 2023 · 34 comments

@simonkim0515
Collaborator

simonkim0515 commented Mar 28, 2023

Issue Description

I want to resurface proposal #594 and discussion #227.

The current LDN process (refer to https://github.com/filecoin-project/filecoin-plus-large-datasets#current-scope) allows clients to request a maximum of 5 PiB of DataCap per application.

The Fil+ program is receiving an increasing number of projects that want storage capacity exceeding 5 PiB; as a result, I am proposing that the DataCap request limit on LDNs be removed. While individual DataCap requests will become larger, this will reduce redundant client applications and provide a more accurate picture of the total dataset size that clients are requesting. With the change to the LDN limit, the allocation calculation will also change to avoid clients receiving a disproportionately large amount of DataCap in a single allocation.

Proposed Solution(s)

Make sure the allocation rate stays the same

The current allocation process:

First allocation: lesser of 5% of total DataCap requested or 50% of weekly allocation rate
Second allocation: lesser of 10% of total DataCap requested or 100% of weekly allocation rate
Third allocation: lesser of 20% of total DataCap requested or 200% of weekly allocation rate
Fourth allocation: lesser of 40% of total DataCap requested or 400% of weekly allocation rate
Fifth allocation onwards: lesser of 80% of total DataCap requested or 800% of weekly allocation rate

Max request scenario:
Client requests 5 PiB of total DataCap at a 1 PiB per week onboarding rate
First allocation: 0.25 PiB
Second allocation: 0.5 PiB
Third allocation: 1 PiB
Fourth allocation: 2 PiB
Fifth allocation onwards: Remaining balance (1.25 PiB)
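
For reference, here is a minimal Python sketch of the current schedule (illustrative only, not the actual validation-bot or notary tooling code); it simply restates the percentages above and caps each tranche at the remaining balance, reproducing the scenario numbers:

```python
def current_tranches(total_pib: float, weekly_rate_pib: float) -> list[float]:
    """Tranche sizes under the current LDN rules, until the request is exhausted."""
    # (share of total request, multiple of weekly allocation rate) per tranche;
    # the last entry repeats for the fifth allocation onwards.
    schedule = [(0.05, 0.5), (0.10, 1.0), (0.20, 2.0), (0.40, 4.0), (0.80, 8.0)]
    tranches, allocated, i = [], 0.0, 0
    while allocated < total_pib:
        pct_total, x_weekly = schedule[min(i, len(schedule) - 1)]
        tranche = min(pct_total * total_pib, x_weekly * weekly_rate_pib)
        tranche = min(tranche, total_pib - allocated)  # never exceed the remaining balance
        tranches.append(tranche)
        allocated += tranche
        i += 1
    return tranches

print(current_tranches(5, 1))  # [0.25, 0.5, 1.0, 2.0, 1.25] -- matches the scenario above
```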

The new allocation process:

First allocation: lesser of 50% of weekly allocation rate or 0.25 PiB
Second allocation: lesser of 100% of weekly allocation rate or 0.5 PiB
Third allocation: lesser of 200% of weekly allocation rate or 1 PiB
Fourth allocation onwards: lesser of 400% of weekly allocation rate or 2 PiB

The tranche sizes will remain similar to those in the current program, and the speed at which clients receive DataCap should not change.
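
A matching sketch of the proposed schedule under the same assumptions (illustrative only); the difference is that tranches are now capped by absolute PiB amounts rather than by percentages of the total request:

```python
def proposed_tranches(total_pib: float, weekly_rate_pib: float) -> list[float]:
    """Tranche sizes under the proposed rules, until the request is exhausted."""
    # (multiple of weekly allocation rate, absolute cap in PiB) per tranche;
    # the last entry repeats for the fourth allocation onwards.
    schedule = [(0.5, 0.25), (1.0, 0.5), (2.0, 1.0), (4.0, 2.0)]
    tranches, allocated, i = [], 0.0, 0
    while allocated < total_pib:
        x_weekly, cap_pib = schedule[min(i, len(schedule) - 1)]
        tranche = min(x_weekly * weekly_rate_pib, cap_pib, total_pib - allocated)
        tranches.append(tranche)
        allocated += tranche
        i += 1
    return tranches

# A 15 PiB request at a 1 PiB/week onboarding rate receives the same tranche
# sizes as today, just spread over more rounds:
print(proposed_tranches(15, 1))  # [0.25, 0.5, 1.0, then 2.0 per round, ending with 1.25]
```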

As part of this change, the scope is explicitly adjusted to state that if clients would like to apply for more than 15 PiB, they should go through the E-Fil+ pathway. The E-Fil+ pathway serves as a testing ground for enterprise data and new use cases for client onboarding.

Timeline

The proposed solution will likely take at least 1-2 weeks to implement.

This will be a 6-week experiment, after which a decision will be made on whether or not to continue.

Technical dependencies

The validation bot has to be updated.

Create a new label, "Very Large Application", for applications that apply for more than 5 PiB
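
As a rough illustration of that dependency, a hypothetical labeling rule (the function name and any GitHub plumbing are assumptions; only the 5 PiB threshold and the "Very Large Application" label come from this proposal):

```python
VERY_LARGE_THRESHOLD_PIB = 5.0

def labels_for_application(requested_pib: float) -> list[str]:
    """Extra labels the validation bot could apply based on the requested DataCap."""
    labels = []
    if requested_pib > VERY_LARGE_THRESHOLD_PIB:
        labels.append("Very Large Application")
    return labels
```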

End of POC checkpoint (if applicable)

Recommending that we check in after 6 weeks and 12 weeks to look for potential abuse of this change.

@simonkim0515 simonkim0515 added the Proposal For Fil+ change proposals label Mar 28, 2023
@TrueBlood1

Filecoin, a currency that once shone,
Now abused, a victim of fraudsters alone,
Fake apps and their deceitful game,
Leading us astray, taking us to shame.

And yet we pretend not to see,
Letting them thrive and run free,
Watching as they steal and deceive,
At the expense of those who believe.

The worst part is we even enable,
By increasing the limit, we'll become more unstable,
Giving abusers more room to play,
Leaving honest users with a price to pay.

@Wengeding

Boosting network capacity and pulling up FIL market value is the direction everyone wants to see. But the brutal reality is that clients are applying for too many DCs with insufficient data, and plenty of clients can't provide their data proof.

I hope the governance team can focus on the proof of data size and miner ownership at this stage. I believe the overall network growth will improve once these are addressed. Then we can come back to this proposal.

@kevzak
Collaborator

kevzak commented Mar 29, 2023

@RawTechVentures do you have any specific ideas to enable clients to prove their dataset size? I agree, we should implement.

@kevzak
Collaborator

kevzak commented Mar 29, 2023

@simonkim0515 I just want to clarify one point. When you say to go through the "E-Fil Pathway" that should not mean they will be specifically applying for the E-Fil private data pilot.

Instead, we will look to create a different, similar pathway to E-Fil where applicants will need to complete a ‘Full check’ of upfront requirements including:

  • KYC/KYB
  • SP Plan verification
  • Proof of Dataset size

@Kevin-FF-USA
Collaborator

Discussed on the Mar 28 Governance Call: #843

@kevzak
Collaborator

kevzak commented Mar 29, 2023

Also, for any applicant applying for Datacap (0-15PiB), I think at a minimum they should go through KYC before creating an application: #833

@Wengeding

@RawTechVentures do you have any specific ideas to enable clients to prove their dataset size? I agree, we should implement.

For public datasets, I expect clients to share the dataset size or the sum of all buckets directly. Notaries do not have enough time to get into AWS to download, parse, etc. Clients should be well aware of what they are storing.

For corporate datasets, I expect clients to share screenshots of their database, such as cloud drives, local disks, etc. This could also be applied to E-Fil.

@kyokaki

kyokaki commented Mar 31, 2023

Agree with RawTechVentures, I think the client definitely knows their data better than the notary. If the notary misunderstands the client's information, it creates more time and unnecessary procedures that could lead to the loss of valuable clients.

@TrueBlood1

The integrity of the Filecoin+ program relies on the trust and honesty of its participants, which in many cases is missing. Increasing LDN application limits will result in massive DC fraud. @simonkim0515 how do you plan to avoid the threat that fake data will pose to the ecosystem?

@Wengeding

Agree with RawTechVentures, I think the client definitely knows their data better than the notary. If the notary misunderstands the client's information, it creates more time and unnecessary procedures that could lead to the loss of valuable clients.

Unfortunately, we are talking about two different things. What I’m asking for is proof of the dataset that notaries can cross-check. Notaries need to verify all the information provided by the client rather than taking it at face value. This is why notaries exist. We as notaries are supposed to review LDN requests following the due diligence steps outlined in our notary applications, which is necessary for the consensus mechanism.

I agree that inefficiency may cause client dissatisfaction. The PL team is actively working on various tooling to help notaries make more productive decisions, but that is another topic.

@herrehesse

I must agree with @TrueBlood1 - as a large LDN requestor myself I struggle with the tranches too, but looking at the massive abuse I would strongly disagree with making things easier.

Too many bad actors. Remove them first, then simplify things.

Good proposal nevertheless.

@simonkim0515
Collaborator Author

Hey everyone, really appreciate all of the comments and feedback. I’ll try my best to address everyone’s concerns, but it seems like most of them revolve around the abuse/trust issues within the community. First and foremost, the main point of this proposal is not to minimize abuse but to decrease duplicates. For clients who want to onboard massive amounts of data, this new proposal will help accurately represent datasets from clients before giving up a substantial amount of DataCap. Furthermore, those clients who apply for a high amount of DataCap will be held to higher expectations and fidelity over time. If we go down this path, the due diligence done by notaries may take longer for applications requesting more than 5 PiB, because more diligence will be needed, which could help decrease bad applications. Again, this proposal is meant to help the people in the community who want to bring good projects to the network, and the governance team is working on other initiatives to tackle the abuse/trust issues.

@BobbyChoii

Minimizing abuse and decreasing duplicates are not contradictory. But if reducing duplicate applications leads to a lot more DC abuse, I don't think this is worth trying.

@TrueBlood1

For clients who want to onboard massive amounts of data, this new proposal will help accurately represent datasets from clients before giving up a substantial amount of DataCap. Furthermore, those clients who apply for a high amount of DataCap will be held to higher expectations and fidelity over time. If we go down this path, the due diligence done by notaries may take longer for applications requesting more than 5 PiB, because more diligence will be needed, which could help decrease bad applications.

Ideally yes. But you're not considering (1) more participants will get DataCap by impersonating clients with massive amounts of data, (2) many notaries are working with each other - If they want to get any application moving forward, they can and they will. Especially with FIP 0056 being rejected, this modification will create more room for bad actors to play the system and harm the interests of decent community members.

@dkkapur
Collaborator

dkkapur commented Apr 11, 2023

I actually disagree with most of the pushback to this proposal. Specific points:

(1) more participants will get DataCap by impersonating clients with massive amounts of data

I don't see how this change makes any difference towards supporting more participants in the pipeline. If anything, this should help because existing applicants often get more DataCap by requesting DataCap from multiple applications at the same time, thereby getting more in initial tranches overall vs. had the request been consolidated in one place. This also significantly streamlines the experience for actors not abusing the system, ensuring their historicity is tracked and notaries have more confidence in supporting the application over time.

(2) many notaries are working with each other - If they want to get any application moving forward, they can and they will

Agree, but again, this change does not make this worse or better.

I think what is being missed here is that the tranche sizes are effectively left untouched even though the request max is going up from 5 to 15 PiB. This means that the amount of DataCap floated to a client stays the same. The only reasonable argument that this makes abuse easier is that a client does not need to spam apply... but that's not hard either when you can copy paste across applications.

One big benefit of this change is that by consolidating where clients are requesting DataCap and making it easier to track them over time in one place, it actually makes it easier for (1) automation to help with identification of potential abuse, i.e., the CID checker, and (2) notaries to more easily identify risk cases because all (or at least more) information is consolidated in one place. Additionally, several clients onboarding large datasets have run into issues with subsequent allocations etc. because of needing to use multiple applications at the same time.

FWIW the original use case for multiple applications was to ensure that progress could be tracked on a per SP ID basis, not just because more than 5 PiB of DataCap was needed. The precedent for the latter was to open up these applications in order. This option has been used liberally for fast onboarding, and this change represents an improvement to the client experience as well as abuse detection. If anything, we should be thinking about removing the cap entirely (i.e., why 15? why not 25, or 100?) and focusing more on tranche sizes, as they represent the actual DataCap that exists in the system and what can be "run away with" at a given point in time.

In summary, I'm strongly in favor of this change:

  • client UX for "good actors" is improved at practically no cost to abuse in the system
  • ability to catch abuse cases is improved, tracking is improved, notaries can more easily catch up with application context

@TrueBlood1

Before considering this change, we should all bear in mind that even if client abuse is found, the sealed DataCap will not be revoked.

I think what is being missed here is that the tranche sizes are effectively left untouched even though the request max is going up from 5 to 15 PiB. This means that the amount of DataCap floated to a client stays the same.

Completely disagree. The new process allocates up to 4 PiB, 2.2 times more than now, in round 5 and beyond. The current allocation gives community members more opportunities to put a stop to misuse requests. Worst scenario: 5 PiB, then the client needs to go through reviews again, which "good actors" can easily pass.

I must say that this modification comes from good intentions: to benefit more decent participants, improve data-onboarding efficiency, etc. Nevertheless, human nature is greedy; if fraud without cost can create more benefits, there will surely be more people taking this approach. Why would anyone stop at 5, 10, or 14 petabytes when they can do 15?

@Chris00618

I'd like to point out that the capacity of 15PiB greatly exceeds that of a normal dataset, and the harm of doing so goes beyond promoting abusive behavior, causing more people to no longer believe in filecoin.

@kevzak
Collaborator

kevzak commented Apr 11, 2023

@Chris00618 When you say "normal" dataset, are you talking only about open, public datasets?

Perhaps we need to differentiate here when talking about datasets, because there are different dataset types available to select on the LDN application now. Enterprise-level data can easily exceed 15 PiB, especially if you include 5x copies.
(screenshot: the data type field on the LDN application)

@Chris00618

Hi @kevzak, in fact, from a practical point of view, storing enterprise datasets on Filecoin is still at the pilot stage, and the need to store more than 5 copies on different nodes at large scale is clearly not mainstream. I think we can research this with people from AWS, GCP, or Azure. Thank you.

@simonkim0515
Collaborator Author

simonkim0515 commented Apr 14, 2023

Hey all, again always appreciate the feedback from everyone.

To prevent exceeding the current DataCap allocation, the proposal now suggests that tranches for the fourth allocation and onwards be set at a consistent 2 PiB. This would ensure that the DataCap flow remains the same as it is today.

As the program enters the Quality Phase, there will be increased attention on identifying and addressing any instances of program abuse.

cc @BobbyChoii @TrueBlood1

@cryptowhizzard

Very supportive of this request when added:

To prevent exceeding the current DataCap allocation, the proposal now suggests that tranches for the fourth allocation and onwards be set at a consistent 2 PiB. This would ensure that the DataCap flow remains the same as it is today.

@Chris00618

I still can't agree that the impact of this proposal will be benign. Applications beyond 5PiB are not mainstream. As the symbol of real data on Filecoin, DataCap holds its value.

Example:

Current LDN applications show relatively low demand for more than 5 PiB. If data is uploaded to the network excessively in the way described above, it will cause a lack of trust in Filecoin and even doubts about the value of the data.

@cryptowhizzard

cryptowhizzard commented Apr 24, 2023

In my opinion, the aggregation of data and excessive copies are problematic. I have already opened several modifications to address this issue.

Keeping every dataset within one LDN application actually makes it easier to enforce strict rules and regulations for proving dataset size and distribution. While it is important to address concerns about "fake" size and unnecessary copies, this can be accomplished through the use of checker tools, community oversight, and other measures, rather than limiting applications to 5 PiB. All of this is easier to enforce with a single LDN per dataset and fewer open LDN applications.

Limiting aggregation and excessive copies while allowing >5PiB applications to remain as one single LDN, would benefit the community in my opinion more than it favours the abusers.

@Dave-lgtm

Adding more room for fraud to the existing problems will not make Filplus better. Data aggregation, sp ownership, data size, excessive copies, DC fraud, etc. are all very serious problems in existing LDN applications. We should consider raising the limit after tooling reaches a more mature phase and the above problems have been significantly reduced.

@laurarenpanda

In my opinion, the aggregation of data and excessive copies are problematic. I have already opened several modifications to address this issue.

Keeping every dataset within one LDN application actually makes it easier to enforce strict rules and regulations for proving dataset size and distribution. While it is important to address concerns about "fake" size and unnecessary copies, this can be accomplished through the use of checker tools, community oversight, and other measures, rather than limiting applications to 5 PiB. All of this is easier to enforce with a single LDN per dataset and fewer open LDN applications.

Limiting aggregation and excessive copies while allowing >5PiB applications to remain as one single LDN, would benefit the community in my opinion more than it favours the abusers.

Totally agree with this point.
And we should also consider applications, both LDN and E-Fil+, with >500 TiB of raw data and >5 PiB DataCap requirements, and the user experience they create for both client and notary. Since we want more data owners to join this ecosystem and choose Filecoin to store data, a better data onboarding process with Fil+ and higher efficiency would greatly help.
Rules are undoubtedly important, as well as measures for avoiding abuse. But we shouldn't let these limit us from optimizing the application process and meeting existing needs.

@simonkim0515
Collaborator Author

Hopefully I can address a lot of the concerns here.

I always appreciate everyone providing their input on this proposal. However, I'd like to redirect our focus to the proposal's intended purpose, which is to address duplicate client applications and gain a more accurate understanding of the exact dataset that is intended to be onboarded. Although many concerns have been raised with this proposal, it can be run as a six-week trial to assess its feasibility. During this trial period, we could introduce a new label, "Very Large Application," for all applications exceeding 5 PiB. With this new label and the proposed time bound, the trial would provide valuable insight into whether the proposal aligns with the overall program objectives.

@TrueBlood1

However, I'd like to redirect our focus to the proposal's intended purpose, which is to address duplicate client applications and gain a more accurate understanding of the exact dataset that is intended to be onboarded.

What is the percentage of very large clients that exceed 5PB demand in open LDN applications?

@dkkapur
Collaborator

dkkapur commented Apr 27, 2023

@TrueBlood1 circling back to this: #851 (comment) - you're right, I had it in my head for some reason that the upper bound was the same. Thanks for pointing it out. FWIW the change since then proposed by @simonkim0515 addresses this, so I can conveniently pull the same argument now 😅.

Just looking at the new applications / recent apps in GitHub right now: 11 out of 25 have "n/m" or something similar listed, because they are looking at onboarding numbers >5 PiB.

IMO this change really boils down to the following for me:

  1. better experience for clients that don't need to artificially split applications and track them in different places/create new addresses
  2. better for tracking and tool building, context is in the same place
  3. better for due diligence, since notaries have fewer places to look for sharded information

Combining this with a limited DataCap deployment rate to me is very compelling, especially because of point (2). It is hard to track applications across issues and build comprehensive maps of what clients are doing - it makes building abuse mitigation tooling much harder.

FWIW, T&T WG is already checking anything that requests 5 PiBs since it raises eyebrows anyway as others have pointed out. I would imagine this will be the same for higher numbers as well, cc @raghavrmadya on this too / what T&T WG would do differently in a world with higher limits.

There have been relatively strong arguments for and against this, but if we can scope down the time/number of apps as @simonkim0515 is projecting, we should go forward with this at least as a test. This variable should eventually have less indexing on it, since it's much more about onboarding rate/throughput in the system today than being able to correctly estimate total DataCap needed.

@MegaFil

MegaFil commented May 4, 2023

@dkkapur Can't agree with your logic.

"Just looking at the new applications / recent apps in GitHub right now: 11 out of 25 have "n/m" or something similar listed, because they are looking at onboarding numbers >5 PiB."

This scenario is inherently unnatural: data is a very valuable resource, and public data even more so. In an objective sense this thread is reducing checkpoints for some doubtful applications and will magnify the coverage of abuse. It is absurd if we are pushing this proposal under the pretext of making it easier to go through the review.

Highly recommended to suspend the trial of this proposal.

@simonkim0515 simonkim0515 changed the title from "Modification: Updates for increasing limit for DataCap requests on LDN applications" to "Modification: Updates for removing limit for DataCap requests on LDN applications" on May 4, 2023
@dkkapur
Collaborator

dkkapur commented May 5, 2023

This scenario is inherently unnatural: data is a very valuable resource, and public data even more so. In an objective sense this thread is reducing checkpoints for some doubtful applications and will magnify the coverage of abuse. It is absurd if we are pushing this proposal under the pretext of making it easier to go through the review.

Hold on - I think you're agreeing with me?

  • Yes, this is unnatural - why are we making it harder for both owners of valuable data and observers of potentially abusive data to track what's happening by splitting it across more applications?
  • No - this does not reduce checkpoints, since the rate of DataCap flow is comparable, and if anything reduced, when you have fewer applications in parallel
  • So no, the goal is not to make it easier to go through review. The goal is to make it easier to review. Those are different. The goal is to make it so that notaries and community members can look in less places and see all the context more easily, and tools can more reliably provide insights into client applications so that we can catch potential abuse better and reduce false positives

FWIW - I haven't seen any other reasonable pushback at this point in time, and running this as a safe / gated / bounded experiment is a good thing for abuse mitigation. If we can agree that this is a change we should roll back if it goes poorly, I think we should proceed with implementation.

@cryptowhizzard

Greetings @dkkapur,

We wholeheartedly agree with your proposal and consider it viable to initiate and closely monitor. Would it be feasible for us to volunteer as a test subject? Our "NIH NCBI Sequence Read Archive" dataset currently still requires over another 20 applications of 5 PiB each, so we would welcome the opportunity to experiment with this methodology, consolidating the data in a clearly visible central location.

Kindly inform us of any additional assistance we can provide. Our most recent application bears the number #1932.

@TrueBlood1

Why is the quality focus phase nonetheless focused on quantity? Why are many proposals that could reduce DC fraud stalled? #832 (comment) #790

https://filecoinproject.slack.com/archives/C0405HANNBT/p1683060263696829
I'm not against this change but I don't think this is the right time to implement it. In the current bear market, participants need solid faith - real data, not more junk data to bulk the network.

@MegaFil

MegaFil commented May 5, 2023

Most people probably have the impression that abuse has become more common since LDN approvals were accelerated. Deliberately adding an accelerator for a need that is in itself unreasonable is not something I can understand or support in any case.

Also, I have a clear aversion to the same applicant applying for a huge DataCap (e.g., 1/n, 2/n, ...). This is contrary to the decentralized vision of Filecoin.

@simonkim0515
Collaborator Author

Thank you to all who have provided comments. The proposal has passed and is scheduled to be deployed in the near future. It is important to note, however, that this initiative will be conducted as a time-limited experiment. The final decision on whether to continue with this proposal will be based on the outcome.
