Misidentifies ASCII file with non-source code #16

Closed
gameengineer opened this issue Jan 6, 2020 · 2 comments


@gameengineer

I have been getting false positives for some ASCII text files that do not contain source code. Here are the contents of one such file; guesslang's Guess().language_name() reports these two lines as "C++":

'These are the contents of file1\nAnd this is the second line of file1'

Here is the snippet of code I use to open and read the contents of all ASCII-type files in Python:

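import logging

from guesslang import Guess

# Assumption: the original script defines a module-level logger like this one
logger = logging.getLogger(__name__)


def getSourceLanguage(filename):
    """
    Use guesslang to test whether a text/plain file is written
    in a programming language.
    """
    langName = "ASCII text"
    # 1. Open the file and read its contents into a string
    # 2. Run guesslang on the string
    try:
        with open(filename, 'r') as theFile:
            fileContents = theFile.read()
        langName = Guess().language_name(fileContents)
    except (OSError, UnicodeDecodeError):
        # Only file access and decoding errors are reported here;
        # anything else propagates instead of being silently swallowed
        print(f"Error: could not open {filename}")
        logger.error(f"Error: could not open {filename}")

    return langName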

Is there a way to tighten up the language type checking?

@gameengineer
Author

OK, I just read this statement in the "How does GuessLang guess" section of the docs:
"Other from that, very small files can be misclassified." Maybe it is misidentified as source code because my test ASCII file is so small.
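
If small files are the cause, one possible workaround is to skip the classifier for very short inputs and keep the default label. This is only a sketch: the helper name is_too_short_to_classify and both thresholds are made up for illustration and are not part of guesslang.

```python
def is_too_short_to_classify(text, min_lines=5, min_chars=100):
    """Return True when text is probably too small for a reliable guess.

    min_lines and min_chars are arbitrary illustrative values,
    not guesslang defaults.
    """
    non_empty_lines = [line for line in text.splitlines() if line.strip()]
    return len(non_empty_lines) < min_lines or len(text.strip()) < min_chars


sample = 'These are the contents of file1\nAnd this is the second line of file1'
if is_too_short_to_classify(sample):
    label = "ASCII text"  # skip the classifier and keep the default label
else:
    label = None  # here the caller would run Guess().language_name(sample)
print(label)
```

With the two-line test file above, the guard trips on the line count, so the file keeps the "ASCII text" label without ever reaching the model.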

@yoeo yoeo mentioned this issue Jun 13, 2020
@yoeo
Owner

yoeo commented Jun 14, 2020

Hi @gameengineer,

I added a check on the prediction probabilities to reduce the chance of classifying non-source-code text as source code.

However some text files may now be categorized as Markdown (as Markdown is a text formatting language).

In addition, Guesslang may still have trouble classifying some short code snippets, as stated in the documentation.
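
For callers who want a similar guard on their side, the general idea of a probability check can be sketched as below. This is not Guesslang's actual implementation: the function pick_language, the 0.5 threshold, the fallback label, and the scores are all assumptions for illustration. It only assumes the caller can obtain (language, probability) pairs from the classifier in some way.

```python
def pick_language(probabilities, min_probability=0.5, fallback="ASCII text"):
    """Return the top language only if its probability is high enough.

    probabilities: iterable of (language_name, probability) pairs.
    min_probability and fallback are illustrative assumptions.
    """
    ranked = sorted(probabilities, key=lambda pair: pair[1], reverse=True)
    if not ranked:
        return fallback
    top_language, top_probability = ranked[0]
    if top_probability < min_probability:
        return fallback
    return top_language


# Made-up scores, for illustration only
uncertain = [("C++", 0.21), ("Markdown", 0.19), ("Python", 0.15)]
confident = [("Python", 0.93), ("C++", 0.04), ("Markdown", 0.03)]
print(pick_language(uncertain))
print(pick_language(confident))
```

A flat distribution like the first one falls back to the plain-text label, while a clearly dominant score is accepted. The right threshold depends on the data and would need tuning.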

Thank you for the feedback.

@yoeo yoeo closed this as completed Jun 14, 2020