Hi,
The filenames in the code-clippy dedup dataset are wrong. In a repo with multiple files, all of the files share a single, seemingly random filename, and that filename does not carry the correct file extension either. For the gpt-code-clippy training efforts this might not be an issue, since only the content of the files may matter, but it would be really great if this could be fixed, or otherwise clearly documented.
Sample code to reproduce the issue (prints the filenames in the first 100 rows of the JSONL file):
import json
import subprocess

import zstandard


def loadJsonL(fname):
    # Parse one JSON object per line.
    data = []
    with open(fname) as fp:
        for line in fp:
            data.append(json.loads(line))
    return data


def processZSTLink(url):
    zstfile = url.split('/')[-1]
    print(url)
    # Download the compressed shard into the current directory.
    subprocess.run(["wget", url], stdout=subprocess.DEVNULL, check=True)
    # Strip the ".zst" suffix to get the name of the decompressed file.
    jsonlfile = zstfile[:-4]
    with open(zstfile, 'rb') as compressed:
        decomp = zstandard.ZstdDecompressor()
        with open(jsonlfile, 'wb') as destination:
            decomp.copy_stream(compressed, destination)
    data = loadJsonL(jsonlfile)
    # Print "repo_name/file_name" for the first 100 rows.
    for row in data[:100]:
        file_name = row['meta']['file_name']
        repo_name = row['meta']['repo_name']
        print(f"{repo_name}/{file_name}")


processZSTLink('https://the-eye.eu/public/AI/training_data/code_clippy_data//code_clippy_dedup_data/test/data_2814_time1626332048_default.jsonl.zst')
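To put a number on the problem, the rows could also be grouped by repository and the distinct filenames counted. This is only a rough sketch: `summarizeFileNames` is a hypothetical helper, the sample rows are made up for illustration, and it assumes every row carries the `meta.repo_name` and `meta.file_name` fields used above.

```python
from collections import defaultdict


def summarizeFileNames(rows):
    """Map each repo_name to (row count, number of distinct file_name values)."""
    names = defaultdict(set)
    counts = defaultdict(int)
    for row in rows:
        meta = row['meta']
        names[meta['repo_name']].add(meta['file_name'])
        counts[meta['repo_name']] += 1
    return {repo: (counts[repo], len(names[repo])) for repo in counts}


# Made-up rows for illustration only; real rows would come from loadJsonL(jsonlfile).
rows = [
    {'meta': {'repo_name': 'user/repo', 'file_name': 'a1b2c3.py'}},
    {'meta': {'repo_name': 'user/repo', 'file_name': 'a1b2c3.py'}},
    {'meta': {'repo_name': 'other/repo', 'file_name': 'd4e5f6.js'}},
]
for repo, (n_rows, n_names) in summarizeFileNames(rows).items():
    print(f"{repo}: {n_rows} rows, {n_names} distinct filename(s)")
```

A repo with many rows but only one distinct filename would match the behaviour described in this issue.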
Hello, Naman. Thanks for your patience. We are working on it. In the meantime, one quick fix to at least get the file extension right would be to use lib-magic to infer the file type from each file's content.
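The snippet originally posted with this comment is not available here, so what follows is only a rough sketch of the idea, not the maintainers' code. It assumes "lib-magic" refers to libmagic as exposed by the third-party `python-magic` package (`magic.from_buffer`). The names `MIME_TO_EXT`, `detectMime`, and `fixExtension` are hypothetical, and the mapping table is illustrative, since the MIME strings libmagic reports differ between versions.

```python
import mimetypes
import os

# Illustrative mapping only; the MIME strings libmagic reports vary by version.
MIME_TO_EXT = {
    'text/x-python': '.py',
    'text/x-script.python': '.py',
    'text/x-c': '.c',
    'text/x-java': '.java',
}


def detectMime(content):
    """Return the MIME type libmagic reports for content (needs python-magic)."""
    import magic  # third-party: pip install python-magic
    return magic.from_buffer(content, mime=True)


def fixExtension(file_name, content, detect=detectMime):
    """Return file_name with its extension replaced by one guessed from content."""
    mime = detect(content)
    # Prefer the explicit table, then fall back to the standard library's guess.
    ext = MIME_TO_EXT.get(mime) or mimetypes.guess_extension(mime)
    if not ext:
        return file_name  # unknown type: leave the name untouched
    stem, _ = os.path.splitext(file_name)
    return stem + ext
```

It could then be applied per row, for example `row['meta']['file_name'] = fixExtension(row['meta']['file_name'], content)`, where `content` is the file's text taken from whichever field of the row holds it (that field name is not shown in this thread). Note that this only repairs the extension; files in the same repo would still share the same base name.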