
Incorrect values read for first column in new chunk #180

Open
CrustyAuklet opened this issue Aug 16, 2021 · 3 comments

CrustyAuklet commented Aug 16, 2021

Background

I have a CSV file that is 10,090,688 bytes, and the first column of the first row in a new chunk is read incorrectly. In my case this column is null (it is "" throughout the CSV), but is_null() returns false for this one row, and reading the field returns a value from halfway through the line.

As a short explanation: this first column is the dataset name, and in theory there can be multiple datasets per file (but we never do that). If the name is null, then the name should fall back to the sensor serial number from column three.

This bug causes the null check to fail on the first row in a new chunk, and when I then read the value I get the bad result. All other values in that row seem to read fine, and all other rows in the chunk read correctly. The end result is all the data in one properly named dataset, plus one lone dataset with a single row and a strange name.
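To make the fallback concrete, here is a minimal sketch of the intended logic, with a hypothetical column layout (column 0 = dataset name, column 2 = sensor serial number) and plain std::string fields standing in for the parsed row; this is illustration only, not the library's API:

```cpp
#include <string>
#include <vector>

// Hypothetical layout: row[0] = dataset name, row[2] = sensor serial number.
// If the name column is null/empty, the dataset should be named after the serial.
std::string dataset_name(const std::vector<std::string>& row) {
    const std::string& name = row[0];
    return name.empty() ? row[2] : name;
}
```

The bug described above makes the emptiness/null check spuriously fail on the first row of a new chunk, so such a row gets a garbage name instead of the serial-number fallback.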

Investigation

Using the debugger, I have tracked it down to the CSVRow::get_field function (line 7683 of the single-include header), in this section:
```cpp
CSV_INLINE csv::string_view CSVRow::get_field(size_t index) const
{
    // lines omitted for brevity...

    const size_t field_index = this->fields_start + index;
    auto& field = this->data->fields[field_index];
    auto field_str = csv::string_view(this->data->data).substr(this->data_start + field.start);
    // ...
}
```

For the offending row, when accessing the first column, the field struct retrieved from this->data->fields has an incorrect value in its start member. In this case it is 138 when it should be zero, though the exact value isn't consistent when I vary the chunk size to make the problem appear more often. This results in the wrong substring in field_str. Everything else works as expected as far as I can tell.
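A self-contained toy model of that offset arithmetic shows how a stale field.start produces the mid-line value; the buffer contents and offsets below are made up for illustration:

```cpp
#include <cstddef>
#include <string_view>

// Toy model of the arithmetic in CSVRow::get_field: a field is located at
// (data_start + field.start) inside the shared chunk buffer. If field.start
// carries a stale offset left over from the previous chunk, substr() lands
// in the middle of the line instead of at the field boundary.
std::string_view field_at(std::string_view chunk, std::size_t data_start,
                          std::size_t field_start, std::size_t length) {
    return chunk.substr(data_start + field_start, length);
}
```

With a row like ",col2,SN123,42\n", the correct start of 0 yields the empty (null) first field, while a stale start of 6 yields "SN123" from partway through the line.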

I have also found that if I change `constexpr size_t ITERATION_CHUNK_SIZE = 10000000;` to a smaller value, I get many more single-value datasets with random names. Their number is proportional to the change in that value.

Platforms:

MSVC 19.29.30132.0 on Windows 10
GCC 10.3.0 on Ubuntu 20.04

csv version

Started with the 2.1.0 release, but I have been working with the "single include" header from master since I found the bug.

CrustyAuklet commented Aug 17, 2021

After some more experiments, I have discovered that this only happens when constructing the reader with a std::stringstream, and not if I use the memory-mapped version by passing in a filename.

If I test with a small value for ITERATION_CHUNK_SIZE, though, the mio (memory-mapped) version also fails: it throws in CSVRow::get_field for an out-of-bounds index (the index is correct, but the CSVRow data seems to start with a newline).

If you have any idea where I can poke around I am happy to take a stab at this.

@jimbeveridge
+1. I am seeing this same bug under Visual Studio 2019 with a recent update. Same behavior: works with memory-mapped files, breaks with stringstream. Also a large-ish file (at least several megabytes).

MichaelSteffens commented Jan 31, 2022

+1. Same with 2.1.3, g++ 9.3.0, and parsing a file stream. The issue does not appear if a new chunk starts immediately after a delimiter, but it does in all other cases.
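That boundary condition is exactly what a chunked parser has to handle: when a chunk ends mid-field, the partial field must be carried into the next chunk (and any stored offsets rebased), or they point at stale data. A minimal sketch, not the library's code, of a chunked split that carries the partial field correctly:

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Sketch of why chunk boundaries matter: a chunk that ends mid-field leaves
// a partial field behind. Carrying it over in `carry` keeps fields intact
// regardless of where the boundary falls; dropping it (or keeping offsets
// into the old buffer) reproduces the bug described in this issue.
std::vector<std::string> parse_chunked(std::string_view text, std::size_t chunk_size) {
    std::vector<std::string> fields;
    std::string carry;  // partial field spanning a chunk boundary
    for (std::size_t pos = 0; pos < text.size(); pos += chunk_size) {
        std::string_view chunk = text.substr(pos, chunk_size);
        for (char c : chunk) {
            if (c == ',' || c == '\n') {
                fields.push_back(carry);
                carry.clear();
            } else {
                carry.push_back(c);
            }
        }
    }
    if (!carry.empty()) fields.push_back(carry);
    return fields;
}
```

With a chunk size of 4, "a,bb,ccc\n" splits as "a,bb" / ",ccc" / "\n", so the boundary falls right after "bb" and mid-way through nothing only by luck; the carry-over makes the result independent of chunk size, which is the property the observation above says breaks when the boundary is not immediately after a delimiter.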
