Wrong filenames in dataset #71

Naman-ntc · 2021-10-22T06:54:52Z

Hi,
The filenames in the code-clippy dedup dataset are wrong. In a repo with multiple files, all of the files are present, but they share a single random filename, and that filename does not carry the correct file extension either. This might not matter for the gpt-code-clippy training efforts, since only the file contents are used there, but it would be really great if this could be fixed, or otherwise documented clearly.

Sample code to reproduce the issue (prints the filenames in the first 100 rows of the jsonl):

import json
import subprocess
import zstandard


def loadJsonL(fname):
    # Parse one JSON object per line.
    data = []
    with open(fname) as fp:
        for line in fp:
            data.append(json.loads(line))
    return data


def processZSTLink(url):
    zstfile = url.split('/')[-1]
    print(url)
    # Download the archive into the current directory (requires wget).
    subprocess.run(["wget", url], stdout=subprocess.DEVNULL, check=True)
    # Strip the trailing ".zst" to get the jsonl filename.
    jsonlfile = zstfile[:-4]
    with open(zstfile, 'rb') as compressed:
        decomp = zstandard.ZstdDecompressor()
        with open(jsonlfile, 'wb') as destination:
            decomp.copy_stream(compressed, destination)

    data = loadJsonL(jsonlfile)
    # Print "<repo_name>/<file_name>" for the first 100 rows.
    for row in data[:100]:
        file_name = row['meta']['file_name']
        repo_name = row['meta']['repo_name']
        print(f"{repo_name}/{file_name}")


processZSTLink('https://the-eye.eu/public/AI/training_data/code_clippy_data//code_clippy_dedup_data/test/data_2814_time1626332048_default.jsonl.zst')
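
Beyond printing the names, a rough way to list the affected repos is to group rows by repo and flag those where several rows share exactly one filename. This is only a minimal sketch: the helper name `repos_with_shared_filename` and the sample rows are made up for illustration, and it assumes each row has the same `meta.repo_name` / `meta.file_name` fields used above.

```python
from collections import defaultdict


def repos_with_shared_filename(rows):
    """Return {repo_name: file_name} for repos whose rows all share one file_name."""
    names_by_repo = defaultdict(list)
    for row in rows:
        meta = row["meta"]
        names_by_repo[meta["repo_name"]].append(meta["file_name"])
    # Flag a repo only if it has more than one row and a single distinct name.
    return {
        repo: names[0]
        for repo, names in names_by_repo.items()
        if len(names) > 1 and len(set(names)) == 1
    }


# Hypothetical rows for illustration only; not taken from the dataset.
sample_rows = [
    {"meta": {"repo_name": "user/repo-a", "file_name": "x1.txt"}, "text": "a"},
    {"meta": {"repo_name": "user/repo-a", "file_name": "x1.txt"}, "text": "b"},
    {"meta": {"repo_name": "user/repo-b", "file_name": "main.py"}, "text": "c"},
    {"meta": {"repo_name": "user/repo-b", "file_name": "util.py"}, "text": "d"},
]

print(repos_with_shared_filename(sample_rows))
```

On the real data, the rows returned by `loadJsonL` could be passed in directly in place of `sample_rows`.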


reshinthadithyan · 2021-10-22T07:06:40Z

Hello Naman, thanks. This looks important. I'll check and let you know.

reshinthadithyan · 2021-10-31T00:33:53Z

Hello, Naman. Thanks for your patience. We are trying to do something about it. In the meantime, one quick fix to at least get the file type right would be to use libmagic (here through the Python `magic` bindings). It would look like this:

import json
import subprocess
import magic
import zstandard


def loadJsonL(fname):
    # Parse one JSON object per line.
    data = []
    with open(fname) as fp:
        for line in fp:
            data.append(json.loads(line))
    return data


def processZSTLink(url):
    zstfile = url.split('/')[-1]
    print(url)
    # Download the archive into the current directory (requires wget).
    subprocess.run(["wget", url], stdout=subprocess.DEVNULL, check=True)
    # Strip the trailing ".zst" to get the jsonl filename.
    jsonlfile = zstfile[:-4]
    with open(zstfile, 'rb') as compressed:
        decomp = zstandard.ZstdDecompressor()
        with open(jsonlfile, 'wb') as destination:
            decomp.copy_stream(compressed, destination)

    data = loadJsonL(jsonlfile)
    for row in data[:100]:
        file_name = row['meta']['file_name']
        repo_name = row['meta']['repo_name']
        text = row['text']
        # from_buffer returns a textual description of the content type,
        # not a file extension; pass mime=True to get a MIME type instead.
        file_type = magic.from_buffer(text)
        print(f"{repo_name}/{file_name}: {file_type}")


processZSTLink('https://the-eye.eu/public/AI/training_data/code_clippy_data//code_clippy_dedup_data/test/data_2814_time1626332048_default.jsonl.zst')

I'm not sure how accurate this would be, but I hope it helps.
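
To go from a detected type to an actual extension, something along these lines might work. It is a sketch under assumptions: the MIME type is taken as already obtained (for example with `magic.from_buffer(text, mime=True)`), the lookup uses the standard-library `mimetypes` table rather than anything from libmagic, and both helper names are hypothetical. The MIME types reported for source code can be coarse, so many files may simply keep their old extension.

```python
import mimetypes
import os


def extension_for_mime(mime_type, fallback=""):
    """Look up an extension for a MIME type in the stdlib table."""
    ext = mimetypes.guess_extension(mime_type)
    return ext if ext is not None else fallback


def with_detected_extension(file_name, mime_type):
    """Swap the extension of file_name for the one implied by mime_type.

    Keeps the original extension when the MIME type is not in the table.
    """
    stem, old_ext = os.path.splitext(file_name)
    return stem + extension_for_mime(mime_type, fallback=old_ext)


# Hypothetical filename for illustration only.
print(with_detected_extension("a1b2c3.txt", "text/x-python"))
```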

Naman-ntc · 2021-11-01T06:03:45Z

Hi, thanks for suggesting libmagic.
I did something similar: a combination of acorn and tsc to detect whether a file is js/jsx/ts/tsx or not.

reshinthadithyan self-assigned this Oct 22, 2021

reshinthadithyan added the bug (Something isn't working) label Oct 22, 2021

reshinthadithyan added the enhancement (New feature or request) label Oct 31, 2021
