Background
I have a CSV file that is 10,090,688 bytes, and the first column of the first row in a new chunk is read incorrectly. In my case this column is null (it is "" in the whole CSV), yet the is_null() function returns false for this one row, and reading the field returns a value from halfway through the line.
As a short explanation, this first column is the dataset name, and in theory there can be multiple datasets per file (but we never do that). If the name is null, the sensor serial number from column three is used as the name instead.
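The naming rule described above can be sketched as follows. This is only an illustration of the fallback logic, not the reporter's code: `dataset_name` is a hypothetical helper, the row is modelled as a plain vector of strings rather than a `CSVRow`, and an empty string stands in for a null field.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not part of the csv library: picks the dataset name
// for one already-parsed row. Column one (index 0) is the dataset name and
// column three (index 2) is the sensor serial number, as described above.
// An empty string is treated as "null" here (an assumption of this sketch).
std::string dataset_name(const std::vector<std::string>& row) {
    const std::size_t name_col = 0;
    const std::size_t serial_col = 2;

    // Use the explicit dataset name when one is present.
    if (!row.empty() && !row[name_col].empty()) {
        return row[name_col];
    }
    // Otherwise fall back to the serial number, if the row has that column.
    if (row.size() > serial_col) {
        return row[serial_col];
    }
    return {};
}
```

With the bug described here, the first field of the affected row is not seen as empty, so a rule like this takes the first branch and returns the stray substring as a new dataset name.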
This bug causes the null check to fail on that first row in a new chunk, and when I then try to read the value I get the bad result. All other values in the row seem to read fine, and all other rows in the chunk read correctly. The outcome is that all the data ends up in the one properly named dataset, plus one lone dataset with a single value and the strange name.
Investigation
Using the debugger, I have tracked it down to the CSVRow::get_field function on line 7683, in this section:
CSV_INLINE csv::string_view CSVRow::get_field(size_t index) const
{
    // lines omitted for brevity..
    const size_t field_index = this->fields_start + index;
    auto& field = this->data->fields[field_index];
    auto field_str = csv::string_view(this->data->data).substr(this->data_start + field.start);
For the offending row, when accessing the first column, the field struct retrieved has an incorrect value in its start member. In this case it is 138 when it should be zero, but the value is not consistent when I change the chunk size to make the problem appear more often. This results in the wrong substring in field_str. Everything else is working as expected as far as I can tell.
I have also found that if I change constexpr size_t ITERATION_CHUNK_SIZE = 10000000; to a smaller value, I get many more single-value datasets with random names. The number of them is proportional to the change in that value.
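To get a feel for why a smaller chunk size produces more bad rows, here is a rough diagnostic sketch. It does not call or reproduce the parser: `rows_split_by_chunks` is a hypothetical helper that assumes chunk boundaries fall at exact multiples of the chunk size in the input (which may not match how the library actually advances through a stream), that rows end in '\n', and that no quoted field contains a newline.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Illustrative diagnostic only. Returns the zero-based indices of rows whose
// bytes (including the terminating '\n') span a multiple of chunk_size, i.e.
// rows that a reader consuming fixed-size chunks would see split in two.
std::vector<std::size_t> rows_split_by_chunks(const std::string& csv_text,
                                              std::size_t chunk_size) {
    std::vector<std::size_t> split_rows;
    if (chunk_size == 0) {
        return split_rows;  // no meaningful chunking
    }

    std::size_t row_index = 0;
    std::size_t row_start = 0;
    while (row_start < csv_text.size()) {
        const std::size_t newline = csv_text.find('\n', row_start);
        // Offset of the last byte of this row (the newline, or end of text).
        const std::size_t row_last =
            (newline == std::string::npos) ? csv_text.size() - 1 : newline;

        // The row is split when its first and last bytes land in
        // different chunks.
        if (row_start / chunk_size != row_last / chunk_size) {
            split_rows.push_back(row_index);
        }
        row_start = row_last + 1;
        ++row_index;
    }
    return split_rows;
}
```

Under these assumptions the number of split rows is at most one per chunk boundary, so it grows roughly as file size divided by chunk size, which is consistent with seeing more stray datasets as the constant is reduced.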
Platforms:
MSVC 19.29.30132.0 on Windows 10
GCC 10.3.0 on Ubuntu 20.04
csv version
Started with 2.1.0 release, but have been working with the "single include" header from master since I found the bug.
After some more experiments, I have discovered that this only happens when constructing the reader with a std::stringstream, and not when I use the memory-mapped version by passing in a filename.
When I test with a small value for ITERATION_CHUNK_SIZE, though, the mio version also fails: it throws in CSVRow::get_field for index out of bounds (the index is correct, but the CSVRow data seems to start with a newline).
If you have any idea where I can poke around I am happy to take a stab at this.
+1. I am seeing this same bug under Visual Studio 2019 with a recent update. Same behavior - works with memory mapped files, breaks with stringstream. Also a large-ish file (at least several megabytes.)
+1. Same with 2.1.3, g++ 9.3.0, and parsing a file stream. The issue is not exposed if a new chunk starts immediately after a delimiter, but it is in all other cases.