Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ToDo #1

Open
username1565 opened this issue Aug 3, 2022 · 2 comments
Open

ToDo #1

username1565 opened this issue Aug 3, 2022 · 2 comments

Comments

@username1565
Copy link
Owner

Make file storage for HashTable.

  1. Do not load data, because it can be a strings, with length about 1 GB.
  2. Make data-file with raw-data, and index-file with offset and length of values.
  3. Load index-file only, without values, and read values when this need.
  4. Think about parts of large data, like torrents parts.
  5. Because index will be rewrited after update, update this without rewrite - limit max possible key-length and make fixed index record, and update index-records by write updated info in index-file, by offsets.

Max file length is something about 2^31 = 2147483648 so it's limit for data-file.
Let's values to be 1 byte each, then 2147483648 index-records in index-file, so it's large number of records.
Max dictionary or Hashtable size is 2 billions of records:
https://www.c-sharpcorner.com/UploadFile/b17487/dictionary-overview-in-C-Sharp/
So it's limit.

Now, let's make one value of max possible value: 2147483648 bytes.
But string length, in CSharp, is something about 2^30 = 1073741824 bytes.
So 1073741824 can be max value of data-file.
1 value with length of 1073741824 bytes - one index-record,
or 1073741824 values with length 1 byte, and 1073741824 index-records.

But index-record can not be larger than 2147483648 bytes, so 2 bytes for each index-record is small space.

Because of this, let one index-record to be fixed, as 256 bytes, for example.
Index record may contains offset and length. If it will be 4 bytes value (int, Int32), then Int32.MaxValue = 2147483647. Not bad.
256 (index-record length) - 4 (offset) - 4 (length) = 248 bytes - is max length for key.

Because Csharp max-file length is 2^31 = 2147483648
and each record have length 256 bytes, then 2147483648 / 256 = 8388608 indexes can be saved in index-file.
It's in the case, when data-file will contains 8388608 values, with length 1 byte.
So limit of data-file can be 8388608 bytes.
One value with length of 8388608 bytes, and one index,
or 8388608 values with length 1 byte, and 8388608 indexes, with 2 GB index-file.

Now, let's try to save the some data with value length 2147483648 bytes.
Need to split this by 2147483648 / 8388608 = 256 parts, and save each in new data-file, as one value with length 8388608 bytes.
Because this is one value, splitted by parts, need to specify one key.
So key, and index-record can contains the number of parts too:
256 (index-record length) - 4 (offset) - 4 (length) - 4(number of part) = 244 bytes - is max length for key.
This allow to save 4294967296 parts by 8388608 bytes each or 36028797018963968 bytes, as a one value, splitted by parts.
Not bad.

Because max data-file with part of value have max length of 8388608 bytes,
4 bytes for offset, and data-length is large values:
4 bytes * 8 bit/byte = 32 bits; 2^32 = 4294967296 values, and it's large.
8388608 = 2^23, so 23 bits can be used to save offset and length.
if 24 bits will be used, then 24 [bits] / 8 [bits/byte] = 3 bytes, to encode 23 bits for offset and for length.
The rest one bit from can be used as a flag, means is value deleted (filled with null-bytes), or not.

So, meaning that things, above, index-record can have format:
256 (index-record length) - 3 (offset) - 3 (length) - 4(number of part) = 246 bytes - is max length for key.

Now, because number of part it's 4-bytes (32 bits) value:
2^32 = 4294967296 parts can be saved, with length of 8388608 bytes for each part, so 36028797018963968 per large file.
It's ~36 PetaBytes.
But if number of part will be encoded with 8 bytes, as Uint32, then UInt64.MaxValue = 18446744073709551615
so 18446744073709551615 parts can be saved, with length of 8388608 bytes for each part, so file with length 154742504910672534354001920 bytes can be saved, then.

In this case, one index-record, will have format:
256 (index-record length) - 3 (offset) - 3 (length) - 8(number of part) = 242 bytes - is max length for key.

@username1565
Copy link
Owner Author

Remark about file-string conversion.
System.Text.Encoding.GetEncoding("ISO-8859-1")
allow to save any byte as one char, in string.

byte[] bytes = System.Text.Encoding.GetEncoding("ISO-8859-1").GetBytes(string);
string str = System.Text.Encoding.GetEncoding("ISO-8859-1").GetBytes(bytes);
So binary data can be saved as strings, in key-value storage.

@username1565
Copy link
Owner Author

Instead of reduce data-file length (file with values only), and reduce this from 2147483647 bytes, to 8388608 bytes, there is possible to leave limit of this file 2147483647, but create new file, and new index, if number of index-file is over 2147483647, in the case, when number of index records is over 8388608.

In this case, limit for each part can be 1073741824 for each string-value (because it's CSharp limit for string).

Then, index-record, can have format:
256 (index-record length) - 4 (offset) - 4 (length) - 4(number of part, Int32) = 244 bytes
with 4294967296 parts by 1073741824 bytes each = 4611686018427387904 or 4.6 ExaBytes
or
256 (index-record length) - 4 (offset) - 4 (length) - 8(number of part, Uint32) = 240 bytes
with 2^64 = 18446744073709551616 parts by 1073741824 bytes each = 19807040628566084398385987584 bytes, which is more over than 1 yottabyte for each splited value of one key in hash-table.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant