Update cli.py: encoding='utf-8' #696

Open: BaseMax wants to merge 3 commits into adrienverge:master from MaxFork:supports-utf8

@BaseMax

BaseMax commented Nov 3, 2024

The issue occurred in our project (SalamLang/Salam#265) during the pre-commit run that lints YAML files:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 445: character maps to <undefined>
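
For readers who want to reproduce this class of error, here is a minimal sketch. It assumes the platform default encoding was cp1252 (a common Windows default; the actual code page in the report is not stated) and uses a hypothetical sample file and helper name. The Persian letter U+0641 encodes to the bytes D9 81 in UTF-8, and byte 0x81 has no mapping in Python's cp1252 codec.

```python
import os
import tempfile


def read_with(path, encoding):
    """Read a whole file using an explicit encoding (hypothetical helper)."""
    with open(path, encoding=encoding, newline="") as f:
        return f.read()


# Persian word written with escapes; U+0641 becomes bytes D9 81 in UTF-8.
text = "name: \u0641\u0627\u0631\u0633\u06cc\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.yaml")  # hypothetical file name
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)

    # Assumption: cp1252 stands in for the legacy default encoding.
    try:
        read_with(path, "cp1252")
        error = None
    except UnicodeDecodeError as exc:
        error = exc

    # Passing the encoding explicitly decodes the same bytes correctly.
    decoded = read_with(path, "utf-8")

print(error)
print(decoded == text)
```

Passing `encoding="utf-8"` to `open()` is the kind of change the PR title describes; the sketch only illustrates the effect and is not the PR's actual diff.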

@coveralls

coveralls commented Nov 3, 2024

Coverage: 99.825% (remained the same) when pulling 546d683 on MaxFork:supports-utf8 into e118296 on adrienverge:master.

@BaseMax
Author

BaseMax commented Nov 3, 2024

cc @adrienverge @jbampton

@adrienverge
Owner

Hello and thanks for the proposal. Could you check out other pull requests related to character encoding? How does this one differ from them?

@BaseMax
Author

BaseMax commented Nov 3, 2024

Hi @adrienverge, happy connecting.

There are three pull requests related to encoding in total:
1- #630
2- #240
3- #696 (this pull request)

#630 (https://github.com/adrienverge/yamllint/pull/630/files#diff-2e0288fc9fc3cda09f90a25f76bedb9ce0cea019d01147b436e575c71a3e674eR222) looks fine, but it doesn't include the change I applied here.

My problem is that I have Persian UTF8 text in my YAML files and the problem was related to the 'cli.py' file.

As for my issue, #240 (https://github.com/adrienverge/yamllint/pull/240/files) looks like a good patch, since it automatically detects the encoding and then uses it when reading the file. However, I can see your comments there, and it seems you are not happy to add new dependencies. Quote: "I'm very against adding dependencies (like chardet)."

@adrienverge
Owner

Hello Max, thanks. It looks like #630 solves the same problem but is more complete and future-proof. Also, your PR doesn't fix encoding problems for other files such as configuration. What do you think?

> My problem is that I have Persian UTF8 text in my YAML files and the problem was related to the 'cli.py' file.

In the meantime, a solution is to tell Python to read files as UTF-8 by default:

export PYTHONUTF8=1
yamllint your-file.yaml
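
To confirm that the variable actually takes effect in a given environment, a small check like the following can help. This is a sketch, assuming Python 3.7 or later (where `sys.flags.utf8_mode` exists); the helper name `probe_utf8_mode` is hypothetical and not part of yamllint.

```python
import os
import subprocess
import sys


def probe_utf8_mode(enabled):
    """Start a child interpreter and report its sys.flags.utf8_mode value."""
    env = dict(os.environ)
    # PYTHONUTF8=1 turns UTF-8 mode on, PYTHONUTF8=0 turns it off explicitly.
    env["PYTHONUTF8"] = "1" if enabled else "0"
    probe = "import sys; print(sys.flags.utf8_mode)"
    result = subprocess.run(
        [sys.executable, "-c", probe],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


print(probe_utf8_mode(True))
print(probe_utf8_mode(False))
```

In UTF-8 mode, `open()` calls that do not pass an `encoding` argument decode files as UTF-8, which is why the workaround helps without any code change.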

@BaseMax
Author

BaseMax commented Nov 5, 2024

Thank you @adrienverge, I added the PYTHONUTF8 variable to our pre-commit environment config: SalamLang/Salam@db7e870

@jbampton and I will do more testing.

@asears

asears commented Dec 21, 2024

I started a similar PR before seeing this one. Having the option to set errors="ignore", along with other encoding and error-handling options, would be beneficial. The other in-flight PR looks really comprehensive, though it might take a while to release. This seems like a useful summary of the options for improvements in that other PR: https://llego.dev/posts/comprehensive-guide-opening-closing-files-python/

In our scenario, a few emoji characters in our YAML files cause the charmap error.

I was able to fix this issue simply by using: with open(file, newline='', errors="ignore") as f:
The fix in this PR, setting the encoding to UTF-8, also works for us on Windows, as does the environment variable fix.
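
The two approaches are not equivalent, which the following minimal sketch tries to show. It assumes cp1252 as the legacy default encoding (stated explicitly here so the result does not depend on the machine) and uses a hypothetical file name and helper. With errors="ignore" the read no longer fails, but undecodable bytes are dropped silently, so the emoji does not survive.

```python
import os
import tempfile


def read_both_ways(text):
    """Write text as UTF-8, then read it back with two different strategies."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "workflow.yaml")  # hypothetical file name
        with open(path, "w", encoding="utf-8", newline="") as f:
            f.write(text)

        # Strategy 1: legacy code page, undecodable bytes silently skipped.
        with open(path, encoding="cp1252", errors="ignore", newline="") as f:
            ignored = f.read()

        # Strategy 2: decode as UTF-8, as this PR proposes.
        with open(path, encoding="utf-8", newline="") as f:
            decoded = f.read()
    return ignored, decoded


# U+1F44D (thumbs up) is F0 9F 91 8D in UTF-8; byte 0x8D is undefined in cp1252.
text = "title: ship it \U0001F44D\n"
ignored, decoded = read_both_ways(text)

print(decoded == text)
print(ignored == text)
```

Since a linter reports line and column positions, silently altered content could shift those positions, which is one reason an explicit UTF-8 read may be the safer of the two.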

Could this PR be tested, merged, and released if it supports the solution, while that much larger PR goes through the process? It would help with GitHub workflows and other YAML files that may use emoji characters, and it might solve some long-standing issues on Windows.

If there are any concerns with the feature, could it be offered as an experimental or preview option behind a toggle argument?

Thanks for the helpful tool. Looking forward to a future --fix option.


4 participants