Improve performance of Tar library #74281

stephentoub · 2022-08-19T22:50:48Z

Low-hanging fruit.

Method	Toolchain	format	Mean	Ratio	Code Size	Allocated	Alloc Ratio
Roundtrip	\main\corerun.exe	Ustar	12.630 us	1.00	11,006 B	17.1 KB	1.00
Roundtrip	\pr\corerun.exe	Ustar	8.266 us	0.65	7,526 B	5.95 KB	0.35

Roundtrip	\main\corerun.exe	Pax	51.859 us	1.00	15,875 B	83.99 KB	1.00
Roundtrip	\pr\corerun.exe	Pax	31.182 us	0.60	11,707 B	31.62 KB	0.38

Roundtrip	\main\corerun.exe	Gnu	14.264 us	1.00	13,310 B	19.76 KB	1.00
Roundtrip	\pr\corerun.exe	Gnu	8.183 us	0.57	8,389 B	5.95 KB	0.30

private MemoryStream _archive = new MemoryStream();
private string[] _names;
private MemoryStream[] _streams;

[GlobalSetup]
public void Setup()
{
    _names = Enumerable.Range(0, 10).Select(i => $"HelloWorld{i}.txt").ToArray();
    _streams = _names.Select(s => new MemoryStream(Encoding.UTF8.GetBytes(s))).ToArray();
}

[Benchmark]
[Arguments(TarEntryFormat.Pax)]
[Arguments(TarEntryFormat.Gnu)]
[Arguments(TarEntryFormat.Ustar)]
public void Roundtrip(TarEntryFormat format)
{
    using (TarWriter writer = new TarWriter(_archive, leaveOpen: true))
    {
        for (int i = 0; i < _names.Length; i++)
        {
            _streams[i].Position = 0;
            TarEntry entry = format switch
            {
                TarEntryFormat.Pax => new PaxTarEntry(TarEntryType.RegularFile, _names[i]) { DataStream = _streams[i] },
                TarEntryFormat.Gnu => new GnuTarEntry(TarEntryType.RegularFile, _names[i]) { DataStream = _streams[i] },
                _ => new UstarTarEntry(TarEntryType.RegularFile, _names[i]) { DataStream = _streams[i] },
            };
            writer.WriteEntry(entry);
        }
    }

    _archive.Position = 0;

    using (TarReader reader = new TarReader(_archive, leaveOpen: true))
    {
        TarEntry entry;
        while ((entry = reader.GetNextEntry()) != null)
        {
            entry.DataStream?.CopyTo(Stream.Null);
        }
    }
}

ghost · 2022-08-19T22:51:18Z

Tagging subscribers to this area: @dotnet/area-system-io
See info in area-owners.md if you want to be subscribed.

Issue Details

Low-hanging fruit.

Method	Toolchain	format	Mean	Ratio	Code Size	Allocated	Alloc Ratio
Roundtrip	\main\corerun.exe	Ustar	13.108 us	1.00	11,006 B	17.1 KB	1.00
Roundtrip	\pr\corerun.exe	Ustar	8.293 us	0.63	7,526 B	6.34 KB	0.37

Roundtrip	\main\corerun.exe	Pax	51.878 us	1.00	15,772 B	84.12 KB	1.00
Roundtrip	\pr\corerun.exe	Pax	37.418 us	0.72	11,692 B	45.12 KB	0.54

Roundtrip	\main\corerun.exe	Gnu	14.515 us	1.00	13,310 B	19.76 KB	1.00
Roundtrip	\pr\corerun.exe	Gnu	8.004 us	0.55	8,389 B	5.95 KB	0.30

Author:	stephentoub
Assignees:	stephentoub
Labels:	`area-System.IO`, `tenet-performance`
Milestone:	7.0.0

src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Write.cs

src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHelpers.cs

stephentoub · 2022-08-20T10:35:17Z

src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Write.cs

        // The checksum accumulator first adds up the byte values of eight space chars, then the final number
        // is written on top of those spaces on the specified span as ascii.
        // At the end, it's saved in the header field and the final value returned.
-        internal int WriteChecksum(int checksum, Span<byte> buffer)
+        internal static int WriteChecksum(int checksum, Span<byte> buffer)


@carlossanlop, even before my changes, is this method functionally correct? The input and output spans are the same length, but the output span has two characters at the end reserved, so are we frequently losing digits from the checksum if the checksum is large enough?

(I didn't want to mess with the logic if it was already buggy, but this should be changed to just span.CopyTo rather than an open-coded loop.)

I'd have to investigate this with a header whose fields have ascii characters with the highest possible values, so that when they get added up for the checksum, its final value would go beyond the checksum field length, excluding the two reserved characters at the end.
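A quick bound may help frame that investigation. The sketch below assumes a 512-byte header block and an 8-byte checksum field that is counted as eight spaces while summing; both are properties of the tar format rather than anything shown in this PR, and the class and method names are hypothetical. Under those assumptions the largest possible sum is 504 × 255 + 8 × 32 = 128,776, which is 373410 in octal, so six digits.

```csharp
using System;

static class ChecksumBound
{
    // Assumptions taken from the tar format, not from this PR.
    const int HeaderSize = 512;
    const int ChecksumFieldSize = 8;

    // Every non-checksum byte at 0xFF, plus eight spaces standing in for the checksum field.
    public static int MaxChecksum() =>
        (HeaderSize - ChecksumFieldSize) * byte.MaxValue + ChecksumFieldSize * (byte)' ';

    // The same value rendered in base 8, as it would be written into the header.
    public static string MaxChecksumOctal() => Convert.ToString(MaxChecksum(), 8);
}
```

If those assumptions hold, the worst case still fits in the six characters left after the two reserved ones. This says nothing about whether the method copies the digits correctly, only that the field is wide enough.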

stephentoub · 2022-08-20T10:39:33Z

src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Write.cs

-            return string.Format(PaxHeadersFormat, dirName, processId, fileName, trailingSeparator);
+            return _typeFlag is TarEntryType.Directory or TarEntryType.DirectoryList ?
+                $"{dirName}/PaxHeaders.{Environment.ProcessId}/{fileName}" :
+                $"{dirName}/PaxHeaders.{Environment.ProcessId}/{fileName}{Path.DirectorySeparatorChar}";


Oops, this condition is reversed... but no tests are failing as a result. I'll fix the inversion, but it seems like a test gap that should be addressed.

I can add a test for that.
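A test for this gap might look something like the sketch below. `BuildExtendedAttributeName` is a hypothetical stand-in for the real method, with the entry-type check reduced to a `bool`. It assumes, based on the comment above, that the trailing separator belongs on directory entries and not on other entries.

```csharp
using System;
using System.IO;

static class PaxNames
{
    // Hypothetical stand-in: directories get the trailing separator, other entries do not.
    public static string BuildExtendedAttributeName(string dirName, string fileName, bool isDirectory) =>
        isDirectory
            ? $"{dirName}/PaxHeaders.{Environment.ProcessId}/{fileName}{Path.DirectorySeparatorChar}"
            : $"{dirName}/PaxHeaders.{Environment.ProcessId}/{fileName}";
}
```

A test would then assert that the directory form ends with `Path.DirectorySeparatorChar` and the regular-file form does not, which is the property the inverted condition violated without any test noticing.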

carlossanlop

Thank you so much for your help, @stephentoub. No blocking comments from me. I'll address your question to me separately.

The CI passed so I'll merge it so I can submit this as an RC1 backport.

src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Write.cs

carlossanlop · 2022-08-21T05:10:59Z

I'd have to investigate this with a header whose fields have ascii characters with the highest possible values, so that when they get added up for the checksum, its final value would go beyond the checksum field length, excluding the two reserved characters at the end.

src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHelpers.cs

src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarReader.cs

carlossanlop · 2022-08-21T05:30:24Z

/backport to release/7.0-rc1

github-actions · 2022-08-21T05:30:38Z

Started backporting to release/7.0-rc1: https://github.com/dotnet/runtime/actions/runs/2897330001

stephentoub · 2022-08-21T11:02:47Z

src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Write.cs

+                    //     "XX attribute=value\n"
+                    // where "XX" is the number of characters in the entry, including those required for the count itself.
+                    int length = 3 + Encoding.UTF8.GetByteCount(attribute) + Encoding.UTF8.GetByteCount(value);
+                    length += CountDigits(length);


This was meant to model the logic previously there, but I think this logic needs to be tweaked, and it should be more like:

int digitCount = CountDigits(length);
length += digitCount;
length += CountDigits(length) - digitCount; // account for possible digit length increase

Do we have tests for this stuff? Do we validate that the archives we produce are readable by other tools? Our own code appears to ignore this length when reading archives (maybe it shouldn't?)

Are we even sure the length is supposed to include itself? Have you seen that being done in archives produced by other tools? That would be a very strange format design.

From man tar 5:

The extended attributes themselves are stored as a series of text-format lines encoded in the portable UTF-8 encoding. Each line consists of a decimal number, a space, a key string, an equals sign, a value string, and a new line. The decimal number indicates the length of the entire line, including the initial length field and the trailing newline. An example of such a field is:

25 ctime=1084839148.1212\n
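The man page example can be checked with a small sketch. `PaxRecord` and its members are hypothetical names, and `CountDigits` here is a local helper rather than the one used in the PR. The length is adjusted a second time in case adding the digit count pushes the total into one more digit.

```csharp
using System;
using System.Text;

static class PaxRecord
{
    // Number of decimal digits in a non-negative value.
    static int CountDigits(int value)
    {
        int digits = 1;
        while (value >= 10) { value /= 10; digits++; }
        return digits;
    }

    // Total byte length of "<length> <key>=<value>\n", where <length> counts itself.
    public static int ComputeLength(string key, string value)
    {
        // 3 = the space, the '=' and the trailing '\n'
        int length = 3 + Encoding.UTF8.GetByteCount(key) + Encoding.UTF8.GetByteCount(value);
        int digitCount = CountDigits(length);
        length += digitCount;
        length += CountDigits(length) - digitCount; // the prefix may have grown by one digit
        return length;
    }

    public static string Format(string key, string value) =>
        $"{ComputeLength(key, value)} {key}={value}\n";
}
```

For `ctime` and `1084839148.1212` this gives the 25 from the man page. A one-byte key with a 94-byte value gives 101, whereas adding the digit count only once would give 100.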

I agree it's strange design.

I did verify that the tar tool was able to read our archives containing extended attributes entries.

Since we know the length of the data section, I didn't feel it was too important to verify that the length of each extended attribute entry was correct, considering that we need to look for a mandatory newline char suffix. Maybe if we checked the length number and advanced the position instead of searching for the newline, we could improve perf a bit. I can investigate if this is true and determine if we need to submit a PR to verify the length number.
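A sketch of the length-driven reading floated above, assuming ASCII digits in the prefix. `TryReadRecord` is a hypothetical helper, not part of System.Formats.Tar; it uses the declared length to locate the mandatory trailing newline instead of searching for it, and rejects the record when the two disagree.

```csharp
using System;
using System.Text;

static class PaxRecordReader
{
    // Reads one "<length> <key>=<value>\n" record from the start of data.
    public static bool TryReadRecord(ReadOnlySpan<byte> data, out string key, out string value, out int consumed)
    {
        key = value = "";
        consumed = 0;

        int space = data.IndexOf((byte)' ');
        if (space <= 0) return false;

        // Parse the decimal length prefix, bailing out as soon as it exceeds the buffer.
        long length = 0;
        for (int i = 0; i < space; i++)
        {
            int digit = data[i] - '0';
            if (digit < 0 || digit > 9) return false;
            length = length * 10 + digit;
            if (length > data.Length) return false;
        }

        // The declared length covers the whole record and must land on the newline.
        if (length <= space + 1 || data[(int)length - 1] != (byte)'\n') return false;

        ReadOnlySpan<byte> body = data.Slice(space + 1, (int)length - space - 2);
        int equals = body.IndexOf((byte)'=');
        if (equals < 0) return false;

        key = Encoding.UTF8.GetString(body.Slice(0, equals));
        value = Encoding.UTF8.GetString(body.Slice(equals + 1));
        consumed = (int)length;
        return true;
    }
}
```

Whether this is actually faster than scanning for the newline would need measuring; the sketch only shows that the length check and the advance can be done in one step.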

iSazonov · 2022-08-22T04:56:52Z

src/libraries/Common/src/System/IO/Archiving.Utils.cs

+                for (int i = 0; i < dest.Length; i++)
+                {
+                    char ch = dest[i];
+                    if (ch == Path.DirectorySeparatorChar || ch == Path.AltDirectorySeparatorChar)


On Unix, both of these constants are /:

runtime/src/libraries/Common/src/System/IO/PathInternal.Unix.cs

Lines 13 to 14 in e71a958

internal const char DirectorySeparatorChar = '/';
internal const char AltDirectorySeparatorChar = '/';

So we could skip the loop entirely on Unix, and on Windows the loop would only need to check for \.

This code is the same on both platforms.

@EgorBo just out of curiosity, would the JIT in theory be able to legally collapse such a thing? ie, if (ch == '/' || ch == '/') ..

> This code is the same on both platforms.

It makes no sense to replace / with / on Unix.

I get that, but what are you proposing -- duplicate this method for Unix and Windows so they can be different? Is this code path that hot?

It's not super hot, but it can be improved. I need to push up a fix anyway, so I'll do so.
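One way to get both behaviors from a single method is sketched below, using the public `Path` fields rather than the internal constants quoted above. `NormalizeSeparators` is a hypothetical name, and normalizing to `/` is an assumption based on the remark about replacing `/` with `/`. The two separators are checked once up front, and the scan is skipped when there is nothing to rewrite.

```csharp
using System;
using System.IO;

static class EntryNames
{
    public static string NormalizeSeparators(string path)
    {
        // When both separators are already '/', as on Unix, there is nothing to rewrite.
        if (Path.DirectorySeparatorChar == '/' && Path.AltDirectorySeparatorChar == '/')
            return path;

        // Otherwise copy the path into a new string and rewrite separators in place.
        return string.Create(path.Length, path, static (dest, src) =>
        {
            src.AsSpan().CopyTo(dest);
            for (int i = 0; i < dest.Length; i++)
            {
                char ch = dest[i];
                if (ch == Path.DirectorySeparatorChar || ch == Path.AltDirectorySeparatorChar)
                    dest[i] = '/';
            }
        });
    }
}
```

This keeps one method for both platforms, so no per-platform duplication is needed, at the cost of one extra comparison per call.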

iSazonov · 2022-08-22T05:13:34Z

src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Write.cs

+            while (true)
+            {
+                digits[i] = (byte)('0' + (remaining % 8));
+                remaining /= 8;
+                if (remaining == 0) break;
+                i--;
+            }


~~SharpLab shows that there are no optimizations with shifts, as I'd expect. So why not unsafely convert the long to span[8]?~~
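For reference, the loop quoted above is an instance of right-to-left octal formatting into a fixed-width field. A self-contained sketch follows. `FormatOctal` is a hypothetical helper, and the zero padding on the left is an assumption about the field layout rather than something shown in the diff. The `& 7` and `>> 3` forms are equivalent to `% 8` and `/ 8` for non-negative values.

```csharp
using System;
using System.Text;

static class Octal
{
    // Writes value as ASCII octal digits, right-aligned in destination and zero-padded on the left.
    public static void FormatOctal(long value, Span<byte> destination)
    {
        if (value < 0) throw new ArgumentOutOfRangeException(nameof(value));

        long remaining = value;
        for (int i = destination.Length - 1; i >= 0; i--)
        {
            destination[i] = (byte)('0' + (remaining & 7)); // lowest three bits are one octal digit
            remaining >>= 3;
        }

        // Anything left over means the field was too narrow for the value.
        if (remaining != 0)
            throw new ArgumentException("Value does not fit in the field.", nameof(destination));
    }
}
```

For example, formatting 420 into a seven-byte field produces the ASCII bytes for `0000644`.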

* Avoid unnecessary byte[] allocations
* Remove unnecessary use of FileStreamOptions
* Clean up Dispose{Async} implementations
* Clean up unnecessary consts
  Not a perf thing, just readability.
* Remove MemoryStream/Encoding.UTF8.GetBytes allocations, unnecessary async variants, and overhaul GenerateExtendedAttributesDataStream
* Avoid string allocations in ReadMagicAttribute
* Avoid allocation in WriteAsOctal
* Improve handling of octal
* Avoid allocation for version string
* Removing boxing and char string allocation in GenerateExtendedAttributeName
* Fix a couple unnecessary dictionary lookups
* Replace Enum.HasFlag usage
* Remove allocations from Write{Posix}Name
* Replace ArrayPool use with string.Create
* Replace more superfluous ArrayPool usage
* Remove ArrayPool use from System.IO.Compression.ZipFile
* Fix inverted condition
* Use generic math to parse octal
* Remove allocations from StringReader and string.Split
* Remove magic string allocation for Ustar when not V7
* Remove file name and directory name allocation in GenerateExtendedAttributeName

@am11

* Improve performance of Tar library (#74281)
* Avoid unnecessary byte[] allocations
* Remove unnecessary use of FileStreamOptions
* Clean up Dispose{Async} implementations
* Clean up unnecessary consts
  Not a perf thing, just readability.
* Remove MemoryStream/Encoding.UTF8.GetBytes allocations, unnecessary async variants, and overhaul GenerateExtendedAttributesDataStream
* Avoid string allocations in ReadMagicAttribute
* Avoid allocation in WriteAsOctal
* Improve handling of octal
* Avoid allocation for version string
* Removing boxing and char string allocation in GenerateExtendedAttributeName
* Fix a couple unnecessary dictionary lookups
* Replace Enum.HasFlag usage
* Remove allocations from Write{Posix}Name
* Replace ArrayPool use with string.Create
* Replace more superfluous ArrayPool usage
* Remove ArrayPool use from System.IO.Compression.ZipFile
* Fix inverted condition
* Use generic math to parse octal
* Remove allocations from StringReader and string.Split
* Remove magic string allocation for Ustar when not V7
* Remove file name and directory name allocation in GenerateExtendedAttributeName
* fix tar strings (#74321)
* Fix some corner cases in TarReader (#74329)
* Fix a few Tar issues post perf improvements (#74338)
* Fix a few Tar issues post perf improvements
* Update src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Write.cs
* Skip directory symlink recursion on TarFile archive creation (#74376)
* Skip directory symlink recursion on TarFile archive creation
* Add symlink verification
* Address suggestions by danmoseley
  Co-authored-by: carlossanlop
* SkipBlockAlignmentPadding must be executed for all entries (#74396)
* Set modified timestamps on files being extracted from tar archives (#74400)
* Add tests for exotic external tar asset archives, fix some more corner case bugs (#74412)
* Remove unused _readFirstEntry. Remnant from before we created PaxGlobalExtendedAttributesEntry.
* Set the position of the freshly copied data stream to 0, so the first user access of the DataStream property gives them a stream ready to read from the beginning.
* Process a PAX actual entry's data block only after the extended attributes are analyzed, in case the size is found as an extended attribute and needs to be overriden.
* Add tests to verify the entries of the new external tar assets can be read. Verify their DataStream if available.
* Add copyData argument to recent alignment padding tests.
* Throw an exception sooner and with a clearer message when a data section is unexpected for the entry type.
* Allow trailing nulls and spaces in octal fields.
  Co-authored-by: @am11 Adeel Mujahid
* Throw a clearer exception if the unsupported sparse file entry type is encountered. These entries have additional data that indicates the locations of sparse bytes, which cannot be read with just the size field. So to avoid accidentally offseting the reader, we throw.
* Tests.
* Rename to TrimLeadingNullsAndSpaces
  Co-authored-by: carlossanlop
* Remove Compression changes, keep changes confined to Tar.
* Fix build failure due to missing using in TarHelpers.Windows

Co-authored-by: Stephen Toub
Co-authored-by: Dan Moseley
Co-authored-by: Adeel Mujahid
Co-authored-by: carlossanlop
Co-authored-by: David Cantú

stephentoub added 15 commits August 19, 2022 11:15

Avoid unnecessary byte[] allocations

d52d836

Remove unnecessary use of FileStreamOptions

1e5020e

Clean up Dispose{Async} implementations

6938c77

Clean up unnecessary consts

5fa03e9

Not a perf thing, just readability.

Remove MemoryStream/Encoding.UTF8.GetBytes allocations, unnecessary async variants, and overhaul GenerateExtendedAttributesDataStream

8dd0ac1

Avoid string allocations in ReadMagicAttribute

5be57ad

Avoid allocation in WriteAsOctal

ab71e6c

Improve handling of octal

df2d742

Avoid allocation for version string

c6058bd

Removing boxing and char string allocation in GenerateExtendedAttributeName

5756a8c

Fix a couple unnecessary dictionary lookups

9539a4a

Replace Enum.HasFlag usage

74bbc9c

Remove allocations from Write{Posix}Name

46e0855

Replace ArrayPool use with string.Create

02ca7da

Replace more superfluous ArrayPool usage

f9eb99f

stephentoub added the tenet-performance Performance related issue label Aug 19, 2022

stephentoub added this to the 7.0.0 milestone Aug 19, 2022

stephentoub requested review from jeffhandley, carlossanlop and adamsitnik August 19, 2022 22:50

ghost assigned stephentoub Aug 19, 2022

dotnet-issue-labeler bot added the area-System.IO label Aug 19, 2022

Remove ArrayPool use from System.IO.Compression.ZipFile

add6179

danmoseley reviewed Aug 20, 2022

src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Write.cs

danmoseley reviewed Aug 20, 2022

src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Write.cs

danmoseley reviewed Aug 20, 2022

src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Write.cs (outdated)

danmoseley reviewed Aug 20, 2022

src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHelpers.cs (outdated)

danmoseley approved these changes Aug 20, 2022

stephentoub commented Aug 20, 2022

stephentoub added 5 commits August 20, 2022 07:58

Fix inverted condition

6f8cb75

Use generic math to parse octal

827a588

Remove allocations from StringReader and string.Split

ae21478

Remove magic string allocation for Ustar when not V7

d6b6727

Remove file name and directory name allocation in GenerateExtendedAttributeName

480af5c

carlossanlop approved these changes Aug 21, 2022

carlossanlop merged commit ca6bbf1 into dotnet:main Aug 21, 2022

github-actions bot mentioned this pull request Aug 21, 2022

[release/7.0-rc1] Improve performance of Tar library #74305

Closed

stephentoub commented Aug 21, 2022

stephentoub deleted the tarperf branch August 21, 2022 11:25

am11 mentioned this pull request Aug 21, 2022

TarReader fails to read tar file with hardlinks; throws System.IO.EndOfStreamException #74309

Closed

iSazonov reviewed Aug 22, 2022

danmoseley mentioned this pull request Aug 22, 2022

TarReader throws on various archives that other tools accept #74316

Closed

stephentoub mentioned this pull request Aug 22, 2022

Fix a few Tar issues post perf improvements #74338

Merged

ghost locked as resolved and limited conversation to collaborators Sep 21, 2022

carlossanlop added area-System.Formats.Tar and removed area-System.IO labels Nov 22, 2022
