Incorrect values read for first column in new chunk #180

CrustyAuklet · 2021-08-16T16:24:38Z

Background

I have a CSV file that is 10,090,688 bytes, and the first column of the first row in a new chunk is read incorrectly. In my case this column is null (it is "" in the whole CSV), but the is_null() function returns false for this one row, and reading the field returns a value that starts halfway through the line.

As a short explanation this first column is the dataset name, and there are multiple datasets per file in theory (but we never do that). If the name is null then the name should be the sensor serial number from column three.
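To make the naming rule concrete, here is a minimal sketch of the fallback, written against plain strings rather than the csv-parser types. The helper name dataset_name and the column constants are hypothetical and only illustrate the rule described above.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not part of csv-parser: choose the dataset name for one
// row whose fields have already been extracted as strings.
inline std::string dataset_name(const std::vector<std::string>& fields) {
    constexpr std::size_t kNameColumn = 0;    // first column: dataset name
    constexpr std::size_t kSerialColumn = 2;  // third column: sensor serial number

    if (fields.empty()) return std::string();

    // A non-empty name always wins.
    if (!fields[kNameColumn].empty()) return fields[kNameColumn];

    // Null name: fall back to the serial number when the row has one.
    if (fields.size() > kSerialColumn) return fields[kSerialColumn];
    return std::string();
}
```

With this rule, a row that wrongly reports a non-null first column ends up in its own dataset, which matches the symptom.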

This bug causes the null check to fail on the first row of a new chunk, and when I then read the value I get the bad result. All other values in the row seem to read fine, and all other rows in the chunk read correctly. The result is that all the data ends up in the one properly named dataset, plus one lone dataset with a single value and the strange name.

Investigation

Using the debugger, I have tracked it down to the CSVRow::get_field function on line 7683, in this section:

CSV_INLINE csv::string_view CSVRow::get_field(size_t index) const
{
        // lines omitted for brevity..

        const size_t field_index = this->fields_start + index;
        auto& field = this->data->fields[field_index];
        auto field_str = csv::string_view(this->data->data).substr(this->data_start + field.start);

For the offending row, when accessing the first column, the field struct retrieved from this->data->fields has an incorrect value in its start member. In this case it is 138 when it should be zero, but the value isn't consistent when I change the chunk size to make the problem appear more often. This results in the wrong substring in field_str. Everything else is working as expected as far as I can tell.

I have also found that if I change constexpr size_t ITERATION_CHUNK_SIZE = 10000000; to a smaller value, I get many more single-value datasets with random names. The number of them scales with how much I reduce that value.
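For anyone who wants to reason about the failure mode without the full library, below is a toy model of chunked parsing. It is not csv-parser's implementation: the types ToyField and ToyRow and the functions make_row, parse_in_chunks and field_text are all made up for illustration, quoting is ignored, and rows are assumed to end in '\n'. The point it shows is that a partial row left at the end of one chunk has to be carried into the next, so that each field's start offset is computed relative to the beginning of the complete row.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy model only, NOT csv-parser's code. Offsets are relative to the row start.
struct ToyField { std::size_t start; std::size_t length; };
struct ToyRow { std::string text; std::vector<ToyField> fields; };

// Split one complete line on ',' and record each field's offset and length.
inline ToyRow make_row(const std::string& line) {
    ToyRow row{line, {}};
    std::size_t start = 0;
    for (std::size_t i = 0; i <= line.size(); ++i) {
        if (i == line.size() || line[i] == ',') {
            row.fields.push_back({start, i - start});
            start = i + 1;
        }
    }
    return row;
}

// Feed `input` in pieces of `chunk_size` bytes. Any unfinished row is kept in
// `carry` and completed by the next chunk before offsets are computed.
inline std::vector<ToyRow> parse_in_chunks(const std::string& input,
                                           std::size_t chunk_size) {
    assert(chunk_size > 0);  // a zero-sized chunk would never advance
    std::vector<ToyRow> rows;
    std::string carry;
    for (std::size_t pos = 0; pos < input.size(); pos += chunk_size) {
        carry += input.substr(pos, chunk_size);
        std::size_t nl;
        while ((nl = carry.find('\n')) != std::string::npos) {
            rows.push_back(make_row(carry.substr(0, nl)));
            carry.erase(0, nl + 1);
        }
    }
    if (!carry.empty()) rows.push_back(make_row(carry));  // last row without '\n'
    return rows;
}

// Read a field back through its recorded offset, as get_field does in spirit.
inline std::string field_text(const ToyRow& row, std::size_t index) {
    const ToyField& f = row.fields.at(index);
    return row.text.substr(f.start, f.length);
}
```

In this model the parsed rows are the same for every chunk size, so sweeping the chunk size from 1 up to the input length is a cheap regression check. A nonzero start for the first field, like the 138 I see, would be what you get if the offset were measured from somewhere other than the start of the complete row.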

Platforms:

MSVC 19.29.30132.0 on Windows 10
GCC 10.3.0 on Ubuntu 20.04

csv version

Started with 2.1.0 release, but have been working with the "single include" header from master since I found the bug.


CrustyAuklet · 2021-08-17T06:40:00Z

After some more experiments, I have discovered this only happens when constructing the reader with a std::stringstream, and not if I use the memory-mapped version by passing in a filename.

If I test with a small value for ITERATION_CHUNK_SIZE, though, the mio version also fails: it throws an index-out-of-bounds error in CSVRow::get_field (the index is correct, but the CSVRow data seems to start with a newline).

If you have any idea where I can poke around I am happy to take a stab at this.

jimbeveridge · 2021-12-27T02:50:27Z

+1. I am seeing this same bug under Visual Studio 2019 with a recent update. Same behavior - works with memory mapped files, breaks with stringstream. Also a large-ish file (at least several megabytes.)

MichaelSteffens · 2022-01-31T13:28:25Z

+1. Same with 2.1.3, g++ 9.3.0, and parsing a file stream. The issue is not exposed if a new chunk starts immediately after a delimiter, but it is in all other cases.
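A rough way to check whether a given input and chunk size should hit the bad case is to look at the byte just before each chunk boundary. The function below is a hypothetical diagnostic, not something in csv-parser, and it assumes chunks are cut at fixed multiples of the chunk size. It treats only the delimiter as a "safe" preceding byte; whether a boundary right after a newline is also safe has not been established here.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical diagnostic, not part of csv-parser: return the offsets of chunk
// boundaries whose preceding byte is not the delimiter. Assumes chunks are cut
// at fixed multiples of chunk_size.
inline std::vector<std::size_t> mid_field_boundaries(const std::string& input,
                                                     std::size_t chunk_size,
                                                     char delimiter = ',') {
    std::vector<std::size_t> hits;
    if (chunk_size == 0) return hits;  // no boundaries to inspect
    for (std::size_t pos = chunk_size; pos < input.size(); pos += chunk_size) {
        if (input[pos - 1] != delimiter) hits.push_back(pos);
    }
    return hits;
}
```

An empty result means every chunk starts right after a delimiter, which is the one case where the problem does not show up.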

MichaelSteffens added a commit to MichaelSteffens/csv-cut that referenced this issue Jan 31, 2022

Disable parsing of stdin, due to vincentlaucsb/csv-parser#180.

d6bf5fe

vincentlaucsb added the bug label May 17, 2022

vincentlaucsb self-assigned this May 17, 2022

sjoubert mentioned this issue Sep 13, 2022

CSVReader fails with files bigger than 10 MB when constructed with std::ifstream #202

Closed
