This repository has been archived by the owner on Jul 18, 2024. It is now read-only.

[DataCap Application] Commoncrawl(2/3) #2287

Closed
1 of 2 tasks
nicelove666 opened this issue Dec 1, 2023 · 107 comments

Comments

@nicelove666

nicelove666 commented Dec 1, 2023

Data Owner Name

Commoncrawl

What is your role related to the dataset

Data Preparer

Data Owner Country/Region

United States

Data Owner Industry

Life Science / Healthcare

Website

https://commoncrawl.org/

Social Media

https://commoncrawl.org/

Total amount of DataCap being requested

15PiB

Expected size of single dataset (one copy)

1.5P

Number of replicas to store

10

Weekly allocation of DataCap requested

1PiB

On-chain address for first allocation

f1ht5xh5qtccibzvozb5li43cdhaivheuhy2fje3i

Data Type of Application

Public, Open Dataset (Research/Non-Profit)

Custom multisig

  • Use Custom Multisig

Identifier

No response

Share a brief history of your project and organization

Common Crawl maintains a free, open repository of web crawl data that can be used by anyone.
Primary training corpus in every LLM. 82% of raw tokens used to train GPT-3. Free and open corpus since 2007. Cited in over 8000 research papers. 3–5 billion new pages added each month.

Is this project associated with other projects/ecosystem stakeholders?

Yes

If answered yes, what are the other projects/ecosystem stakeholders

https://github.com/filecoin-project/filecoin-plus-large-datasets/issues/2204
https://github.com/filecoin-project/filecoin-plus-large-datasets/issues/2045
https://github.com/filecoin-project/filecoin-plus-large-datasets/issues/1947
https://github.com/filecoin-project/filecoin-plus-large-datasets/issues/1946
https://github.com/filecoin-project/filecoin-plus-large-datasets/issues/1845
https://github.com/filecoin-project/filecoin-plus-large-datasets/issues/1846
https://github.com/filecoin-project/filecoin-plus-large-datasets/issues/1847
https://github.com/filecoin-project/filecoin-plus-large-datasets/issues/1848

Describe the data being stored onto Filecoin

Common Crawl maintains a free, open repository of web crawl data that can be used by anyone.
Primary training corpus in every LLM. 82% of raw tokens used to train GPT-3. Free and open corpus since 2007. Cited in over 8000 research papers. 3–5 billion new pages added each month.

Where was the data currently stored in this dataset sourced from

AWS Cloud

If you answered "Other" in the previous question, enter the details here

No response

If you are a data preparer. What is your location (Country/Region)

United States

If you are a data preparer, how will the data be prepared? Please include tooling used and technical details?

We use a script to package the files, originally stored on an nginx file server, into tar files. Each tar file is kept to around 17-30G. Finally, each tar file is converted into a CAR file. After the conversion is complete, a record of the CAR file and the metadata of the source files are stored in our local system for later queries.
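
The packaging step described above could be sketched roughly as follows. This is an illustration only, not the applicant's actual script: the function names `plan_tar_batches` and `build_manifest`, the 30 GiB ceiling, the greedy grouping strategy, and the `batch-NNNNN.tar` naming are all assumptions. The CAR conversion is left out because the application does not name the tool used for it.

```python
from typing import Iterable

GIB = 1024 ** 3


def plan_tar_batches(files: Iterable[tuple[str, int]],
                     max_bytes: int = 30 * GIB) -> list[list[str]]:
    """Greedily group (path, size_in_bytes) pairs into batches of at most max_bytes.

    A single file larger than max_bytes ends up in a batch of its own.
    """
    batches: list[list[str]] = []
    current: list[str] = []
    current_size = 0
    for path, size in files:
        # Close the current batch if adding this file would exceed the cap.
        if current and current_size + size > max_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)
    return batches


def build_manifest(batches: list[list[str]]) -> dict[str, list[str]]:
    """Map a hypothetical tar file name to the source files it would contain."""
    return {f"batch-{i:05d}.tar": paths for i, paths in enumerate(batches)}
```

A manifest like this is one way to keep the "record of the CAR file and source file metadata" the answer mentions, so that a given source file can later be traced to the archive that holds it.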

If you are not preparing the data, who will prepare the data? (Provide name and business)

No response

Has this dataset been stored on the Filecoin network before? If so, please explain and make the case why you would like to store this dataset again to the network. Provide details on preparation and/or SP distribution.

This website has a lot of data. As far as I know, no one has systematically stored all of it on the Filecoin network.

Please share a sample of the data

https://commoncrawl.org/

Confirm that this is a public dataset that can be retrieved by anyone on the Network

  • I confirm

If you chose not to confirm, what was the reason

No response

What is the expected retrieval frequency for this data

Yearly

For how long do you plan to keep this dataset stored on Filecoin

2 to 3 years

In which geographies do you plan on making storage deals

Greater China, Asia other than Greater China, Africa, North America, Europe

How will you be distributing your data to storage providers

Cloud storage (i.e. S3), Shipping hard drives, Lotus built-in data transfer

How do you plan to choose storage providers

Slack, Filmine, Partners

If you answered "Others" in the previous question, what is the tool or platform you plan to use

No response

If you already have a list of storage providers to work with, fill out their names and provider IDs below

f02824140 
f02824157
f02831201
f02831202
f02816081
f02816095
f02841613
f02199203  
f02760664
f02223170

How do you plan to make deals to your storage providers

Boost client, Lotus client

If you answered "Others/custom tool" in the previous question, enter the details here

No response

Can you confirm that you will follow the Fil+ guideline

Yes

Copy link

Thanks for your request!
Everything looks good. 👌

A Governance Team member will review the information provided and contact you back pretty soon.

@nicelove666
Author

nicelove666 commented Dec 1, 2023

The bot on #2204 seems to have a bug: it did not trigger the next round of signature requests. We submitted our cooperating SPs in detail on #2204. The following are the SPs that this LDN will cooperate with. We have listed the SPs, their locations, and their entities in detail. We look forward to your early approval and to your review of our online transactions. Thank you for your trust and hard work. We believe that, together, we will make Filecoin better and better.

Provider | Location | SP Entity or Personal
f02841613 | HK | Coffeecould
f02831201 | GuangDong | Juwu Mine
f02831202 | GuangDong | Juwu Mine
f02816081 | Singapore | KRAL
f02816095 | Singapore | KRAL
f02824157 | BeiJing | zhongchuangyun
f02824140 | BeiJing | zhongchuangyun
f02223170 | tianyou | avn
f02199203 | Inner Mongolia | Richard
f02760664 | Inner Mongolia | Richard

@cryptowhizzard

Give me proper index files for your deals, and give me two EU and USA miners with no VPN. Show me retrieval on those unsealed copies. If your data is correct, I would love to support.

But NOT in the current form.

@nicelove666
Author

First, we have met in a video conference. Perhaps you forgot that I am American. I have been in China recently, and the SPs I cooperate with are mainly in Asia, but this does not mean that we have no SPs elsewhere. We also have a team in Singapore, and I am contacting FF to arrange a meeting. During LabWeek, I also went to Türkiye.

Second, if you care about the SPs we work with, I am happy to tell you, but they won't start now:
f01422327 (Japan)
f02229545 (Los Angeles)
f02252024 (United States)

Third, next week or around December 15th, a European SP will start. It is a new SP, and I will not know the SP until it has started. Once the SP is established, we will disclose it in advance.

Fourth, now that the AC bot is online, let's leave everything to the intelligent AC bot; we will meet its requirements.
[Image: WX20231201-233004@2x]

@nicelove666
Author

I hope you can push it forward @kevzak @Filplus-govteam @Sunnyiscoming @Kevin-FF-USA @galen-mcandrew

@cryptowhizzard

@nicelove666 Business names and contact information please. I will check for VPN use.

@nicelove666
Author

All the information is public, and we have already submitted it. In addition, please use a professional, recognized website for the check, such as https://seon.io/.
There are many similar checking websites. I hope you will make your checking tool public so that it has credibility, and that it produces the same results as this kind of recognized website.

@Sunnyiscoming
Collaborator

Hello, per filecoin-project/notary-governance#922 for Open, Public Dataset applicants, please complete the following Fil+ registration form to identify yourself as the applicant, and please also add the contact information of the SP entities you are working with to store copies of the data.

This information will be reviewed by Fil+ Governance team to confirm validity and then the application will be allowed to move forward for additional notary review.

@herrehesse

herrehesse commented Dec 5, 2023

"All information is public and we have submitted it."

Can you show me where, with contact information? I would love to perform due diligence.

@nicelove666
Author

We have submitted it and hope to see progress on your side. @Sunnyiscoming

@nicelove666
Author

[Image: WX20231205-185036@2x]

@nicelove666
Author

Is there any update here? @Sunnyiscoming

@ghost

ghost commented Dec 7, 2023

@nicelove666 - where are the 10 SPs onboarding these 10 copies? We see 3 SPs listed on your registration form.

f02841613 | coffeecloud | HK | no | ted
f02831201 | Juwu Mine | GuangDong | no | Jon
f02831202 | Juwu Mine | GuangDong | no | Jon
f02824157 | zhongchuangyun | BeiJing | no | lisa
f02824140 | zhongchuangyun | BeiJing | no | lisa

@nicelove666
Author

nicelove666 commented Dec 8, 2023

Hello, @Filplus-govteam, we filled in the SPs according to the requirements of the registration form. However, the registration form only has room for 5 SPs.

To show all of our cooperating SPs, we have listed them in detail in the application form. We hope you can move us forward.

The bot on #2204 seems to have a bug: it did not trigger the next round of signature requests. We submitted our cooperating SPs in detail on #2204. The following are the SPs that this LDN will cooperate with. We have listed the SPs, their locations, and their entities in detail. We look forward to your early approval and to your review of our online transactions. Thank you for your trust and hard work. We believe that, together, we will make Filecoin better and better.

Provider | Location | SP Entity or Personal
f02841613 | HK | Coffeecould
f02831201 | GuangDong | Juwu Mine
f02831202 | GuangDong | Juwu Mine
f02816081 | Singapore | KRAL
f02816095 | Singapore | KRAL
f02824157 | BeiJing | zhongchuangyun
f02824140 | BeiJing | zhongchuangyun
f02223170 | tianyou | avn
f02199203 | Inner Mongolia | Richard
f02760664 | Inner Mongolia | Richard

@ghost

ghost commented Dec 8, 2023

This shows 6 SPs. You said 10 copies, who is storing all the copies?

Also can you show proof of 1.5PiB dataset from commoncrawl? Which dataset?

@nicelove666
Author

@Filplus-govteam Why are these counted as 6 SPs rather than 10? Is the counting unit for SPs the "company" or the "node"?
There are 10 nodes here, from 6 companies.
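
The difference between the two counts can be checked with a short sketch. The provider IDs and entity names are copied from the table posted earlier in this thread; the function name `count_nodes_and_entities` is hypothetical, and treating "avn" as the entity for f02223170 is an assumption about how that row splits into columns.

```python
# (provider ID, entity) pairs as listed in the thread's SP table.
PROVIDERS = [
    ("f02841613", "Coffeecould"),
    ("f02831201", "Juwu Mine"),
    ("f02831202", "Juwu Mine"),
    ("f02816081", "KRAL"),
    ("f02816095", "KRAL"),
    ("f02824157", "zhongchuangyun"),
    ("f02824140", "zhongchuangyun"),
    ("f02223170", "avn"),  # assumed split of the row "tianyou avn"
    ("f02199203", "Richard"),
    ("f02760664", "Richard"),
]


def count_nodes_and_entities(providers):
    """Return (distinct provider IDs, distinct entity names)."""
    nodes = {provider_id for provider_id, _ in providers}
    # Compare entity names case-insensitively so spelling variants in case do not inflate the count.
    entities = {entity.lower() for _, entity in providers}
    return len(nodes), len(entities)


print(count_nodes_and_entities(PROVIDERS))  # (10, 6)
```

Counted by provider ID the list has 10 entries; counted by operating entity it has 6, which is the figure the governance team cited.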

@nicelove666
Author

I hope we can set a clear rule on how SPs are counted, whether by company or by node. Then we can open an issue, and everyone will abide by the rule.

@nicelove666
Author

nicelove666 commented Dec 8, 2023

https://commoncrawl.org/
This dataset has at least 4P of data. With 10 copies, we could apply for a total of 40P. If I have calculated this wrong, please tell me. Thank you.
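
The arithmetic behind these figures is simply dataset size multiplied by replica count. The sketch below reproduces both the 40P figure from this comment and the 15PiB requested in the application (1.5P per copy, 10 replicas). The function names are hypothetical, and the weeks estimate assumes the full weekly allocation of 1PiB is granted and used every week, which the thread does not confirm.

```python
import math


def total_datacap(dataset_size_pib: float, replicas: int) -> float:
    """Total DataCap implied by storing `replicas` copies of one dataset."""
    return dataset_size_pib * replicas


def weeks_to_allocate(total_pib: float, weekly_pib: float) -> int:
    """Whole weeks needed if exactly `weekly_pib` is allocated each week (an assumption)."""
    return math.ceil(total_pib / weekly_pib)


print(total_datacap(4, 10))        # 40
print(total_datacap(1.5, 10))      # 15.0
print(weeks_to_allocate(15.0, 1))  # 15
```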

large-datacap-requests bot added the "very large application" (for LDN applications over 5+ PiB) and "ready to sign" labels on Feb 2, 2024

@Tom-OriginStorage

checker:manualTrigger

Copy link

DataCap and CID Checker Report Summary [1]

Storage Provider Distribution

⚠️ 2 storage providers have unknown IP location - f02831201, f02831202

Deal Data Replication

✔️ Data replication looks healthy.

Deal Data Shared with other Clients [2]

✔️ No CID sharing has been observed.

Full report

Click here to view the CID Checker report.
Click here to view the Retrieval Dashboard.

Footnotes

  1. To manually trigger this report, add a comment with text checker:manualTrigger

  2. To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Copy link

Request Proposed

Your Datacap Allocation Request has been proposed by the Notary

Message sent to Filecoin Network

bafy2bzaceaavg2vqv7hflzy3cjnyqzjezn4qd4tbod55kstofjyyw5hdhuqa6

Address

f1ht5xh5qtccibzvozb5li43cdhaivheuhy2fje3i

Datacap Allocated

512.00TiB

Signer Address

f1q6bpjlqia6iemqbrdaxr2uehrhpvoju3qh4lpga

Id

e04c36db-606a-4fdd-b503-46c6ff7188f9

You can check the status of the message here: https://filfox.info/en/message/bafy2bzaceaavg2vqv7hflzy3cjnyqzjezn4qd4tbod55kstofjyyw5hdhuqa6

Copy link

Request Approved

Your Datacap Allocation Request has been approved by the Notary

Message sent to Filecoin Network

bafy2bzacecqqujfdunzbvnggew65fvv5fmsf6jrtw4wfy7j5qel7pyt6bgpog

Address

f1ht5xh5qtccibzvozb5li43cdhaivheuhy2fje3i

Datacap Allocated

512.00TiB

Signer Address

f1jvvltduw35u6inn5tr4nfualyd42bh3vjtylgci

Id

e04c36db-606a-4fdd-b503-46c6ff7188f9

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacecqqujfdunzbvnggew65fvv5fmsf6jrtw4wfy7j5qel7pyt6bgpog

Copy link

The issue reached the total datacap requested. This should be closed

Copy link

This application has not seen any responses in the last 10 days. This issue will be marked with Stale label and will be closed in 4 days. Comment if you want to keep this application open.

--
Commented by Stale Bot.

Copy link

This application has not seen any responses in the last 14 days, so for now it is being closed. Please feel free to contact the Fil+ Gov team to re-open the application if it is still being processed. Thank you!

--
Commented by Stale Bot.
