Modification: Updates for removing limit for DataCap requests on LDN applications #851
Comments
Filecoin, a currency that once shone,
And yet we pretend not to see,
The worst part is we even enable…
Boosting network capacity and pulling up FIL market value is the direction everyone wants to see. But the brutal reality is that clients are applying for too many DCs with insufficient data, and plenty of clients can't provide their data proof. I hope the governance team can focus on the proof of data size and miner ownership at this stage. I believe the overall network growth will improve once these are addressed. Then we can come back to this proposal.
@RawTechVentures do you have any specific ideas to enable clients to prove their dataset size? I agree; we should implement something along those lines.
@simonkim0515 I just want to clarify one point. When you say to go through the "E-Fil Pathway", that should not mean they will be specifically applying for the E-Fil private data pilot. Instead, we will look to create a different, similar pathway to E-Fil where applicants will need to complete a ‘Full check’ of upfront requirements, including:
Discussed on the Mar 28 Governance Call: #843
Also, for any applicant applying for DataCap (0-15 PiB), I think at a minimum they should go through KYC before creating an application: #833
For public datasets, I expect clients to share the dataset size or the sum of all buckets directly. Notaries do not have enough time to get into AWS to download, parse, etc. Clients should be well aware of what they are storing. For corporate datasets, I expect clients to share screenshots of their database, such as cloud drives, local disks, etc. This could also be applied to E-Fil.
Agree with RawTechVentures. I think the client definitely knows their data better than the notary. If the notary misunderstands the client's information, it creates more time and unnecessary procedures that could lead to the loss of valuable clients.
The integrity of the Filecoin+ program relies on the trust and honesty of its participants, which in many cases is missing. Increasing LDN application limits will result in massive DC fraud. @simonkim0515 how do you plan to avoid the threat that fake data will pose to the ecosystem?
Unfortunately, we are talking about two different things. What I’m asking for is proof of the dataset that a notary can cross-check. Notaries need to verify all the information provided by the client and not believe it outright. This is why notaries exist. We as notaries are supposed to review LDN requests following the due diligence steps outlined in our notary applications, which is necessary for the consensus mechanism. I agree that inefficiency may cause client dissatisfaction. The PL team is actively working on various tooling to help notaries make more productive decisions, but that is another topic.
I must agree with @TrueBlood1. As a large LDN requestor myself, I struggle in the trenches too, but looking at the massive abuse, I would strongly disagree with making things easier. Too many bad actors. Remove them first and then simplify things. Good proposal nevertheless.
Hey everyone, really appreciate all of the comments and feedback. I’ll try my best to address everyone’s concerns, but it seems like most of them revolve around the abuse/trust issues in the community. First and foremost, the main point of this proposal is not to minimize abuse but to decrease duplicates. For clients who want to onboard massive amounts of data, this new proposal will help accurately represent their datasets before a substantial amount of DataCap is given up. Furthermore, clients who apply for a high amount of DataCap will be held to higher expectations and fidelity over time. If we go down this path, the due diligence done by notaries may take longer for applications requesting more than 5 PiB, because more diligence will be needed, which could help decrease bad applications. Again, this proposal is meant to help the people in the community who want to bring good projects to the network, and the governance team is working on other initiatives to tackle the abuse/trust issues.
Minimizing abuse and decreasing duplicates are not contradictory. But if this leads to a lot more DC abuse just to reduce duplicate applications, I don't think it is worth trying.
Ideally yes. But you're not considering that (1) more participants will get DataCap by impersonating clients with massive amounts of data, and (2) many notaries are working with each other; if they want to get an application moving forward, they can and they will. Especially with FIP 0056 being rejected, this modification will create more room for bad actors to play the system and harm the interests of decent community members.
I actually disagree with most of the pushback to this proposal. Specific points:
I don't see how this change makes any difference towards supporting more participants in the pipeline. If anything, it should help, because existing applicants often get more DataCap by requesting it from multiple applications at the same time, thereby getting more in initial tranches overall than if the request had been consolidated in one place. This also significantly streamlines the experience for actors not abusing the system, ensuring their history is tracked and notaries have more confidence in supporting the application over time.
Agree, but again, this change does not make this worse or better. I think what is being missed here is that the tranche sizes are effectively left untouched even though the request max is going up from 5 to 15 PiB. This means that the amount of DataCap floated to a client stays the same. The only reasonable argument that this makes abuse easier is that a client does not need to spam apply... but that's not hard either when you can copy-paste across applications.

One big benefit of this change is that by consolidating where clients are requesting DataCap and making it easier to track them over time in one place, it actually becomes easier for (1) automation to help identify potential abuse, e.g., the CID checker, and (2) notaries to identify risk cases, because all (or at least more) information is consolidated in one place. Additionally, several clients onboarding large datasets have run into issues with subsequent allocations etc. because of needing to use multiple applications at the same time.

FWIW, the original use case for multiple applications was to ensure that progress could be tracked on a per-SP-ID basis, not just because more than 5 PiB of DataCap was needed. The precedent for the latter was to open these applications in order. This option has been used liberally for fast onboarding, and this change represents an improvement to the client experience as well as abuse detection. If anything, we should be thinking about removing the cap entirely (i.e., why 15? why not 25, or 100?) and focusing more on tranche sizes, as they represent the actual DataCap that exists in the system and what can be "run away with" at a given point in time.

In summary, I'm strongly in favor of this change:
Before considering this change, we should all bear in mind that even if client abuse is found, the sealed DataCap will not be revoked.
Completely disagree. The new process allocates up to 4 PiB, 2.2 times more than now, in round 5 and beyond. The current allocation gives community members more opportunities to put a stop to abusive requests. Worst case scenario: 5 PiB, then the client needs to go through review again, which "good actors" can easily pass. I must say that this modification comes from good intentions: to benefit more decent participants, improve data-onboarding efficiency, etc. Nevertheless, human nature is greedy; if fraud without cost can create more benefits, more people will surely take this approach. Why would anyone stop at 5, 10, or 14 petabytes when they can do 15?
I'd like to point out that a capacity of 15 PiB greatly exceeds that of a normal dataset, and the harm of allowing this goes beyond promoting abusive behavior; it causes more people to no longer believe in Filecoin.
@Chris00618 When you say "normal" dataset, are you talking only about open, public datasets? Perhaps we need to differentiate here when talking about datasets, because there are different dataset types available to select on the LDN application now. And enterprise-level data can easily exceed 15 PiB, especially if you include 5x copies.
Hi @kevzak, in practice, storing enterprise datasets on Filecoin is still at the pilot stage, and the need to store more than 5 copies on different nodes at large scale is clearly not mainstream. I think we can research some information about this with the people from AWS, GCP or Azure. Thank you.
Hey all, as always, I appreciate the feedback from everyone. To prevent exceeding the current DataCap allocation, the suggestion is that tranches for the fourth allocation and onwards be set at a consistent 2 PiB. This would ensure that the DataCap flow remains the same as it is today. As the program enters the Quality Phase, there will be increased attention on identifying and addressing any instances of program abuse.
Very supportive of this request when the following is added:
I still can't agree that the impact of this proposal will be benign. Applications beyond 5PiB are not mainstream. As the symbol of real data on Filecoin, DataCap holds its value. Example:
Current LDN applications have relatively low demand for more than 5 PiB. If data is uploaded to the network excessively in the way described above, it will cause a lack of trust in Filecoin and even doubts about the value of the data.
In my opinion, the aggregation of data and excessive copies are problematic. I have already opened several modifications to address this issue. Allowing every dataset to be part of one LDN application is actually important to enforce strict rules and regulations for proving the dataset size and distribution. While it is important to address concerns about "fake" size and unnecessary copies, this can be accomplished through the use of checker tools, community oversight, and other measures, rather than limiting applications to 5 PiB. All of this is easier to enforce with a single LDN per dataset and fewer open LDN applications. Limiting aggregation and excessive copies while allowing >5PiB applications to remain as one single LDN would, in my opinion, benefit the community more than it favours the abusers.
Adding more room for fraud to the existing problems will not make Filplus better. Data aggregation, SP ownership, data size, excessive copies, DC fraud, etc. are all very serious problems in existing LDN applications. We should consider raising the limit after tooling reaches a more mature phase and the above problems have been significantly reduced.
Totally agree with this point.
Hopefully I can address a lot of the concerns here. I always appreciate everyone providing their input on this proposal. However, I'd like to redirect our focus to the proposal's intended purpose, which is to address duplicate client applications and gain a more accurate understanding of the exact dataset that is intended to be onboarded. Although many negative concerns have been raised with this proposal, it can be run as a six-week trial to assess its feasibility. During this trial period, we could introduce a new label, "Very Large Application", for all applications exceeding 5 PiB. With this new label and the proposed time bound, the trial would provide valuable insight into whether the proposal aligns with the overall program objectives.
What percentage of open LDN applications are from very large clients whose demand exceeds 5 PiB?
@TrueBlood1 circling back to this: #851 (comment) - you're right, I had it in my head for some reason that the upper bound was the same. Thanks for pointing it out. FWIW the change since then proposed by @simonkim0515 addresses this, so I can conveniently pull the same argument now 😅. Just looking at the new applications / recent apps in GitHub right now: 11 out of 25 have "n/m" or something similar listed, because they are looking at onboarding numbers >5 PiB. IMO this change really boils down to the following for me:
Combining this with a limited DataCap deployment rate is very compelling to me, especially because of point (2). It is hard to track applications across issues and build comprehensive maps of what clients are doing; it makes building abuse mitigation tooling much harder. FWIW, the T&T WG is already checking anything that requests 5 PiB, since it raises eyebrows anyway, as others have pointed out. I would imagine this will be the same for higher numbers as well; cc @raghavrmadya on this too / what the T&T WG would do differently in a world with higher limits. There have been relatively strong arguments for and against this, but if we can scope down the time/number of apps as @simonkim0515 is projecting, we should go forward with this at least as a test. This variable should eventually be indexed on less, since it is much more about onboarding rate/throughput in the system today than about being able to correctly estimate total DataCap needed.
@dkkapur I can't agree with your logic. "Just looking at the new applications / recent apps in GitHub right now: 11 out of 25 have "n/m" or something similar listed, because they are looking at onboarding numbers >5 PiB." This scenario is inherently unnatural: data is a very valuable resource, and public data even more so. In an objective sense, this thread is reducing checkpoints for some doubtful applications and will magnify the coverage of abuse. It is absurd if we are pushing this proposal under the pretext of making it easier to get through review. I highly recommend suspending the trial of this proposal.
Hold on - I think you're agreeing with me?
FWIW, I haven't seen any other reasonable pushback at this point in time, and running this as a safe/gated/bounded experiment is a good thing for abuse mitigation. If we can agree that this is a change we should roll back if it goes poorly, I think we should proceed with implementation.
Greetings @dkkapur, We wholeheartedly agree with your proposal and consider it viable to initiate and closely monitor. Would it be feasible for us to volunteer as a test subject? As our "NIH NCBI Sequence Read Archive" application currently awaits over another 20 applications of 5 PiB each, we would welcome the opportunity to experiment with this methodology, consolidating data in a clearly visible central location. Kindly inform us of any additional assistance we can provide. Our most recent application bears the number #1932.
Why is the quality focus phase nonetheless focused on quantity? Why are many proposals that could reduce DC fraud stalled? #832 (comment) #790 https://filecoinproject.slack.com/archives/C0405HANNBT/p1683060263696829
Most people probably have the impression that abuse has become more common since LDN approvals were accelerated. Deliberately adding an acceleration switch for a need that is itself unreasonable is not something I can understand or support in any case. Also, I have a clear aversion to the same applicant applying for a huge DataCap (e.g. 1/n, 2/n, ...). This is contrary to the decentralized vision of Filecoin.
Thank you to all who have provided comments. The proposal will be passed and is scheduled to be deployed in the near future. It is important to note, however, that this initiative will be conducted as a time-limited experiment. The final decision on whether to continue with this proposal will be based on the outcome.
Issue Description
Want to resurface proposal #594 and discussion #227
The current LDN process (refer to https://github.com/filecoin-project/filecoin-plus-large-datasets#current-scope) allows clients to request a maximum of 5 PiB of DataCap per application.
The Fil+ program is receiving an increasing number of projects that want storage capacity exceeding 5 PiB; as a result, we propose that the DataCap request limit on LDNs be removed. While individual DataCap requests will become larger, this will reduce redundant client applications and provide a more accurate capture of the total dataset size that clients are requesting. With the change to the LDN limit, the allocation calculation will change as well to avoid clients receiving a disproportionately large amount of DataCap in a single allocation.
Proposed Solution(s)
Make sure the allocation rate stays the same
The current allocation process:
First allocation: lesser of 5% of total DataCap requested or 50% of weekly allocation rate
Second allocation: lesser of 10% of total DataCap requested or 100% of weekly allocation rate
Third allocation: lesser of 20% of total DataCap requested or 200% of weekly allocation rate
Fourth allocation: lesser of 40% of total DataCap requested or 400% of weekly allocation rate
Fifth allocation onwards: lesser of 80% of total DataCap requested or 800% of weekly allocation rate
Max request scenario:
Client requests 5 PiB of total DataCap, 1 PiB per week onboarding rate
First allocation: 0.25 PiB
Second allocation: 0.5 PiB
Third allocation: 1 PiB
Fourth allocation: 2 PiB
Fifth allocation onwards: Remaining balance (1.25 PiB)
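For concreteness, here is a minimal sketch of the current tranche calculation that reproduces the max request scenario above. The function and variable names are illustrative assumptions, not taken from the actual validation bot.

```python
# Illustrative sketch of the current LDN allocation schedule.
# Names are hypothetical; this is not the actual validation bot code.

def current_tranche(round_num, total_requested, weekly_rate, already_allocated):
    """DataCap granted in a given allocation round, in PiB."""
    # (share of total requested, multiple of weekly onboarding rate) per round;
    # the last entry repeats for the fifth allocation onwards.
    schedule = [(0.05, 0.5), (0.10, 1.0), (0.20, 2.0), (0.40, 4.0), (0.80, 8.0)]
    pct_total, rate_multiple = schedule[min(round_num - 1, len(schedule) - 1)]
    tranche = min(pct_total * total_requested, rate_multiple * weekly_rate)
    # Never allocate more than what is left of the original request.
    return min(tranche, total_requested - already_allocated)

# Max request scenario: 5 PiB requested, 1 PiB/week onboarding rate.
allocated = 0.0
for rnd in range(1, 6):
    t = current_tranche(rnd, 5.0, 1.0, allocated)
    allocated += t
    print(f"Allocation {rnd}: {t} PiB")
# Prints 0.25, 0.5, 1.0, 2.0, 1.25 PiB, matching the scenario above.
```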
The new allocation process:
First allocation: lesser of 50% of weekly allocation rate or 0.25 PiB
Second allocation: lesser of 100% of weekly allocation rate or 0.5 PiB
Third allocation: lesser of 200% of weekly allocation rate or 1 PiB
Fourth allocation onwards: lesser of 400% of weekly allocation rate or 2 PiB
The tranche sizes will remain similar to how the program currently operates, and the speed at which clients receive DataCap should not change.
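A comparable sketch of the proposed schedule, which caps each tranche with a fixed PiB amount instead of a percentage of the total request (again, names are illustrative assumptions, not the validation bot's actual code):

```python
# Illustrative sketch of the proposed allocation schedule.
# Names are hypothetical; this is not the actual validation bot code.

def proposed_tranche(round_num, total_requested, weekly_rate, already_allocated):
    """DataCap granted in a given round under the proposed rules, in PiB."""
    # (multiple of weekly onboarding rate, fixed cap in PiB) per round;
    # the last entry repeats for the fourth allocation onwards.
    schedule = [(0.5, 0.25), (1.0, 0.5), (2.0, 1.0), (4.0, 2.0)]
    rate_multiple, fixed_cap = schedule[min(round_num - 1, len(schedule) - 1)]
    tranche = min(rate_multiple * weekly_rate, fixed_cap)
    return min(tranche, total_requested - already_allocated)

# Example: a 15 PiB request at a 1 PiB/week onboarding rate still tops out at
# 2 PiB per tranche from the fourth allocation onwards, so the amount of
# DataCap floated to a client at any one time stays in line with today's process.
```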
As part of this change, the scope is explicitly adjusted to state that if clients would like to apply for more than 15 PiB, they should go through the E-Fil+ pathway. The E-Fil+ pathway serves as a testing ground for enterprise data and new use cases for client onboarding.
Timeline
The proposed solution will likely take at least 1-2 weeks to implement.
This will be a 6-week experiment, after which a decision will be made on whether or not to continue.
Technical dependencies
The validation bot has to be updated.
Create a new label, "Very Large Application", for applications that apply for more than 5 PiB
End of POC checkpoint (if applicable)
Recommend checking in after 6 weeks and 12 weeks to look for potential abuse of this change.