Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add encoding parameter to creating tar entry #364

Merged
merged 6 commits into from
Jun 19, 2020

Conversation

itn3000
Copy link
Contributor

@itn3000 itn3000 commented Jun 25, 2019

This PR is similar to #182, but this PR adds System.Text.Encoding to TarEntry related API.
default is same as current master behavior(omit upper byte) because more faster than using encoding if tar archive assured having ASCII filename only.

I certify that I own, and have sufficient rights to contribute, all source code and related material intended to be compiled or integrated with the source code for the SharpZipLib open source product (the "Contribution"). My Contribution is licensed under the MIT License.

default is same as current master behavior(omit upper byte)
@itn3000
Copy link
Contributor Author

itn3000 commented Aug 6, 2019

Codacy says "TarInputStream.CreateEntry(byte[] headerBuffer, Encoding enc) should be static method".
But other TarInputStream.CreateEntry functions are not static, should I CreateEntry(byte[] headerBuffer, Encoding enc) make static?

@piksel
Copy link
Member

piksel commented Aug 6, 2019

The problem right now with Codacy is that the current code is way beyond bad.
I think it's still better to be consistent here than to adhere to Codacy since it's better to do a full cleanup in another commit of all the functions.

@piksel
Copy link
Member

piksel commented Apr 12, 2020

As stated in #182:
The specification suggests using extended headers for encodings other than ASCII, but GNU Tar writes the UTF8-encoded bytes as the Name so I guess we should too.

This should be fine.

Copy link
Member

@piksel piksel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great implementation, my argument name change suggestion resulted in 40+ changes, hope that doesn't seem to daunting (was meant as a quick way to use the suggestions).

As for the defaults, I would mark the encoding-less constructors as

[Obsolete("No Encoding for Name field is specified, any non-ASCII bytes will be discarded")]

And on that note, the behaviour of passing null as the encoding is not specified. It should be documented.

And one last thing. The test does only test for UTF-8 in, UTF-8 out. It does not verify the actual bytes in the name. The reason for testing this is that Encoding.UTF8 emits BOM, which I don't think we want in the name bytes (but I can be wrong here, I will check what GNU Tar does).
If, instead, UTF8Encoding() is used, it will not emit any BOM. Maybe we should guide the consumer to making the right choice here as well?

Thanks for your contribution!

src/ICSharpCode.SharpZipLib/Tar/TarArchive.cs Outdated Show resolved Hide resolved
src/ICSharpCode.SharpZipLib/Tar/TarArchive.cs Outdated Show resolved Hide resolved
src/ICSharpCode.SharpZipLib/Tar/TarArchive.cs Outdated Show resolved Hide resolved
src/ICSharpCode.SharpZipLib/Tar/TarArchive.cs Outdated Show resolved Hide resolved
src/ICSharpCode.SharpZipLib/Tar/TarArchive.cs Outdated Show resolved Hide resolved
src/ICSharpCode.SharpZipLib/Tar/TarOutputStream.cs Outdated Show resolved Hide resolved
src/ICSharpCode.SharpZipLib/Tar/TarOutputStream.cs Outdated Show resolved Hide resolved
src/ICSharpCode.SharpZipLib/Tar/TarOutputStream.cs Outdated Show resolved Hide resolved
test/ICSharpCode.SharpZipLib.Tests/Tar/TarTests.cs Outdated Show resolved Hide resolved
@itn3000
Copy link
Contributor Author

itn3000 commented Apr 14, 2020

Thank you for reviewing, I'd like to reflect the review results.

And on that note, the behaviour of passing null as the encoding is not specified. It should be documented.

Should it be written on README or API comment?
It may be long description for API comment.

And one last thing. The test does only test for UTF-8 in, UTF-8 out. It does not verify the actual bytes in the name. The reason for testing this is that Encoding.UTF8 emits BOM, which I don't think we want in the name bytes (but I can be wrong here, I will check what GNU Tar does).
If, instead, UTF8Encoding() is used, it will not emit any BOM. Maybe we should guide the consumer to making the right choice here as well?

Asserting raw bytes of name is necessary, I will add.
This impl use Encoding.GetBytes/GetString for manupilating name bytes and does not use StreamReader/Writer.
Encoding.UTF8.GetBytes() does not write BOM(I did ensure with output tar, there is no 0xEF 0xBB 0xBF in binary), so there is no problem to specify Encoding.UTF8 as encoding parameter.

@piksel
Copy link
Member

piksel commented Apr 15, 2020

Regarding documentation:
Perhaps something short, like "or null for ASCII"?

Regarding Encoding.UTF8:
That's odd? The documentation for that property explicitly warns about how it emits them:
https://docs.microsoft.com/en-us/dotnet/api/system.text.encoding.utf8?view=netframework-4.8#remarks
Edit: Never mind. I found the reason here:
https://docs.microsoft.com/en-us/dotnet/api/system.text.utf8encoding.getpreamble?view=netframework-4.8#remarks (the notice in the bottom).

@itn3000
Copy link
Contributor Author

itn3000 commented Apr 27, 2020

@piksel I've almost done with requested changes, please review it.

@piksel piksel self-requested a review May 28, 2020 18:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants