Add SplitOnChunkSize() to FileInfo #23

Open
adamfisher opened this issue Dec 27, 2018 · 4 comments
adamfisher commented Dec 27, 2018

This issue proposes adding SplitOnChunkSize() to FileInfo, which would split a file into multiple files and return an array of the newly created files. The main challenges are handling line breaks when breakOnNewlines is true, and buffering one chunk of data at a time so that large files do not overload system resources.

/// <summary>
/// Splits a file into multiple files based on the specified chunk size of each file.
/// </summary>
/// <param name="file">The file.</param>
/// <param name="chunkSize">The maximum number of bytes to store in each file.
/// If a chunk size is not provided, files will be split into 1 MB chunks by default.
/// The breakOnNewlines parameter can slightly affect the size of each file.</param>
/// <param name="targetPath">The destination where the split files will be saved.</param>
/// <param name="deleteAfterSplit">if set to <c>true</c>, the original file is deleted after creating the newly split files.</param>
/// <param name="breakOnNewlines">if set to <c>true</c> break the file on the next newline once the chunk size limit is reached.</param>
/// <returns>
/// An array of references to the split files.
/// </returns>
/// <exception cref="ArgumentNullException">file</exception>
/// <exception cref="ArgumentOutOfRangeException">chunkSize - The chunk size must be larger than 0 bytes.</exception>
public static FileInfo[] SplitOnChunkSize(
	this FileInfo file,
	int chunkSize = 1000000,
	DirectoryInfo targetPath = null,
	bool deleteAfterSplit = false,
	bool breakOnNewlines = true
	)
{
	if (file == null)
		throw new ArgumentNullException(nameof(file));

	if (chunkSize < 1)
		throw new ArgumentOutOfRangeException(nameof(chunkSize), chunkSize,
			"The chunk size must be larger than 0 bytes.");

	if (file.Length <= chunkSize)
		return new[] {file};

	var buffer = new byte[chunkSize];
	var extraBuffer = new List<byte>();
	targetPath = targetPath ?? file.Directory;
	var chunkedFiles = new List<FileInfo>((int)Math.Abs(file.Length / chunkSize) + 1);

	using (var input = file.OpenRead())
	{
		var index = 1;

		while (input.Position < input.Length)
		{
			var chunkFileName = new FileInfo(Path.Combine(targetPath.FullName, $"{file.Name}.CHUNK_{index++}"));
			chunkedFiles.Add(chunkFileName);
			using (var output = chunkFileName.Create())
			{
				var chunkBytesRead = 0;
				while (chunkBytesRead < chunkSize)
				{
					var bytesRead = input.Read(buffer,
						chunkBytesRead,
						chunkSize - chunkBytesRead);

					if (bytesRead == 0)
					{
						break;
					}

					chunkBytesRead += bytesRead;
				}

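				if (breakOnNewlines && chunkBytesRead > 0)
				{
					// Extend the chunk up to the next newline so a line is not split across files.
					// Index by chunkBytesRead: the final chunk may not fill the whole buffer.
					var extraByte = buffer[chunkBytesRead - 1];
					while (extraByte != '\n')
					{
						var flag = input.ReadByte();
						if (flag == -1)
							break;
						extraByte = (byte)flag;
						extraBuffer.Add(extraByte);
					}
				}

				// Write the chunk whether or not breakOnNewlines is set.
				output.Write(buffer, 0, chunkBytesRead);
				if (extraBuffer.Count > 0)
					output.Write(extraBuffer.ToArray(), 0, extraBuffer.Count);

				extraBuffer.Clear();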
			}
		}
	}

	if (deleteAfterSplit)
		file.Delete();

	return chunkedFiles.ToArray();
}
@adamfisher

Calling it Split() instead of SplitOnChunkSize() might also be fine, in case we want overloaded methods in the future that handle other scenarios, such as splitting on a number of lines per file.
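
To make the "number of lines per file" scenario concrete, here is a minimal sketch of what such a variant might look like. The name SplitOnLineCount and the linesPerFile parameter are hypothetical, chosen only to mirror the proposal above; nothing like this exists in the library yet. It assumes a text file readable by FileInfo.OpenText() (UTF-8), and it rewrites each line with the platform newline, so line endings in the output may differ from the input.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class FileInfoSplitSketch
{
    // Hypothetical variant: splits a text file into files of at most linesPerFile lines each.
    public static FileInfo[] SplitOnLineCount(
        this FileInfo file,
        int linesPerFile,
        DirectoryInfo targetPath = null)
    {
        if (file == null)
            throw new ArgumentNullException(nameof(file));

        if (linesPerFile < 1)
            throw new ArgumentOutOfRangeException(nameof(linesPerFile), linesPerFile,
                "The line count must be larger than 0.");

        targetPath = targetPath ?? file.Directory;
        var chunkedFiles = new List<FileInfo>();

        using (var reader = file.OpenText())
        {
            var index = 1;
            var line = reader.ReadLine();

            // Only one line is held in memory at a time, so large files stay cheap.
            while (line != null)
            {
                var chunk = new FileInfo(
                    Path.Combine(targetPath.FullName, $"{file.Name}.CHUNK_{index++}"));

                using (var writer = chunk.CreateText())
                {
                    for (var written = 0; written < linesPerFile && line != null; written++)
                    {
                        writer.WriteLine(line);
                        line = reader.ReadLine();
                    }
                }

                chunk.Refresh(); // pick up the size of the file that was just written
                chunkedFiles.Add(chunk);
            }
        }

        return chunkedFiles.ToArray();
    }
}
```

Something like `new FileInfo("data.csv").SplitOnLineCount(1000)` would then produce data.csv.CHUNK_1, data.csv.CHUNK_2, and so on, using the same naming scheme as the byte-based version. Unlike that version, this sketch always writes new files (an empty input yields an empty array) rather than returning the original file when it is already small enough.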

JonathanMagnan self-assigned this Dec 28, 2018
@JonathanMagnan

Thank you @adamfisher for all your new extensions,

My employee will review all of them when he is back from his vacation in one week.

Best Regards,

Jonathan

@adamfisher

No worries Jonathan. Very cool library you guys have. I would rather contribute to it instead of creating yet another one-off NuGet package 😃

@adamfisher

@JonathanMagnan Any movement on this?
