Conversation

@Alvaro-Kothe Alvaro-Kothe commented Oct 8, 2025


Benchmarks

asv continuous -f 1.1 -E virtualenv:3.13 "db31f6a38353a311cc471eb98506470b39c676d8~" HEAD -b io.csv

asv compare db31f6a38353a311cc471eb98506470b39c676d8~ HEAD

IO.csv benchmarks:

Change Before [d8b3ff3] <main~12> After [4c8d770] <perf/read-csv> Ratio Benchmark (Parameter)
18.2±0.06ms 17.7±0.8ms 0.97 io.csv.ParseDateComparison.time_read_csv_dayfirst(False)
3.50±0.04ms 3.30±0.09ms 0.94 io.csv.ParseDateComparison.time_read_csv_dayfirst(True)
19.6±0.2ms 19.5±0.1ms 1.00 io.csv.ParseDateComparison.time_to_datetime_dayfirst(False)
3.63±0.08ms 3.49±0.08ms 0.96 io.csv.ParseDateComparison.time_to_datetime_dayfirst(True)
19.2±0.2ms 19.2±0.1ms 1.00 io.csv.ParseDateComparison.time_to_datetime_format_DD_MM_YYYY(False)
3.50±0.09ms 3.36±0.07ms 0.96 io.csv.ParseDateComparison.time_to_datetime_format_DD_MM_YYYY(True)
6.1G 6.05G 0.99 io.csv.ReadCSVCParserLowMemory.peakmem_over_2gb_input
905±3μs 904±5μs 1.00 io.csv.ReadCSVCachedParseDates.time_read_csv_cached(False, 'c')
1.50±0ms 1.49±0.01ms 1.00 io.csv.ReadCSVCachedParseDates.time_read_csv_cached(False, 'python')
922±4μs 913±7μs 0.99 io.csv.ReadCSVCachedParseDates.time_read_csv_cached(True, 'c')
1.52±0.01ms 1.51±0.02ms 0.99 io.csv.ReadCSVCachedParseDates.time_read_csv_cached(True, 'python')
25.1±0.3ms 24.6±0.6ms 0.98 io.csv.ReadCSVCategorical.time_convert_direct('c')
231±7ms 222±2ms 0.96 io.csv.ReadCSVCategorical.time_convert_direct('python')
61.4±0.5ms 60.5±2ms 0.99 io.csv.ReadCSVCategorical.time_convert_post('c')
152±1ms 144±1ms 0.95 io.csv.ReadCSVCategorical.time_convert_post('python')
35.3±1ms 35.2±0.6ms 1.00 io.csv.ReadCSVComment.time_comment('c')
35.6±0.9ms 34.9±0.2ms 0.98 io.csv.ReadCSVComment.time_comment('python')
20.6±0.5ms 20.5±0.4ms 1.00 io.csv.ReadCSVConcatDatetime.time_read_csv
9.90±0.4ms 10.0±0.4ms 1.01 io.csv.ReadCSVConcatDatetimeBadDateValue.time_read_csv('')
7.00±0.5ms 6.72±0.2ms 0.96 io.csv.ReadCSVConcatDatetimeBadDateValue.time_read_csv('0')
11.6±0.4ms 11.1±0.3ms 0.96 io.csv.ReadCSVConcatDatetimeBadDateValue.time_read_csv('nan')
3.68±0.01ms 3.53±0.2ms 0.96 io.csv.ReadCSVDInferDatetimeFormat.time_read_csv('custom')
1.10±0ms 1.07±0.05ms 0.97 io.csv.ReadCSVDInferDatetimeFormat.time_read_csv('iso8601')
895±9μs 863±30μs 0.96 io.csv.ReadCSVDInferDatetimeFormat.time_read_csv('ymd')
938±40μs 940±30μs 1.00 io.csv.ReadCSVDInferDatetimeFormat.time_read_csv(None)
4.31±0.05ms 4.20±0.2ms 0.97 io.csv.ReadCSVDatePyarrowEngine.time_read_csv_index_col
47.3M 47.1M 1.00 io.csv.ReadCSVEngine.peakmem_read_csv('c')
63.8M 63.8M 1.00 io.csv.ReadCSVEngine.peakmem_read_csv('pyarrow')
217M 217M 1.00 io.csv.ReadCSVEngine.peakmem_read_csv('python')
10.0±0.5ms 9.39±0.2ms 0.94 io.csv.ReadCSVEngine.time_read_bytescsv('c')
6.83±0.3ms 6.92±0.4ms 1.01 io.csv.ReadCSVEngine.time_read_bytescsv('pyarrow')
279±3ms 278±20ms 1.00 io.csv.ReadCSVEngine.time_read_bytescsv('python')
10.0±0.4ms 9.63±0.1ms 0.96 io.csv.ReadCSVEngine.time_read_stringcsv('c')
7.61±0.2ms 7.60±0.2ms 1.00 io.csv.ReadCSVEngine.time_read_stringcsv('pyarrow')
275±5ms 276±4ms 1.00 io.csv.ReadCSVEngine.time_read_stringcsv('python')
788±7μs 771±30μs 0.98 io.csv.ReadCSVFloatPrecision.time_read_csv(',', '.', 'high')
1.83±0.02ms 1.79±0.04ms 0.98 io.csv.ReadCSVFloatPrecision.time_read_csv(',', '.', 'round_trip')
788±10μs 775±30μs 0.98 io.csv.ReadCSVFloatPrecision.time_read_csv(',', '.', None)
1.10±0.01ms 1.09±0.03ms 0.99 io.csv.ReadCSVFloatPrecision.time_read_csv(',', '_', 'high')
1.10±0.01ms 1.13±0.02ms 1.03 io.csv.ReadCSVFloatPrecision.time_read_csv(',', '_', 'round_trip')
1.09±0ms 1.08±0.03ms 0.99 io.csv.ReadCSVFloatPrecision.time_read_csv(',', '_', None)
787±10μs 772±30μs 0.98 io.csv.ReadCSVFloatPrecision.time_read_csv(';', '.', 'high')
1.83±0.02ms 1.80±0.04ms 0.98 io.csv.ReadCSVFloatPrecision.time_read_csv(';', '.', 'round_trip')
782±10μs 766±20μs 0.98 io.csv.ReadCSVFloatPrecision.time_read_csv(';', '.', None)
1.09±0.01ms 1.07±0.04ms 0.97 io.csv.ReadCSVFloatPrecision.time_read_csv(';', '_', 'high')
1.09±0ms 1.07±0.03ms 0.98 io.csv.ReadCSVFloatPrecision.time_read_csv(';', '_', 'round_trip')
1.09±0.01ms 1.08±0.03ms 0.99 io.csv.ReadCSVFloatPrecision.time_read_csv(';', '_', None)
2.52±0.1ms 2.61±0.02ms 1.03 io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(',', '.', 'high')
2.52±0.1ms 2.60±0.02ms 1.03 io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(',', '.', 'round_trip')
2.53±0.04ms 2.59±0.05ms 1.02 io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(',', '.', None)
2.14±0.03ms 2.13±0.01ms 0.99 io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(',', '_', 'high')
2.12±0.02ms 2.14±0ms 1.01 io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(',', '_', 'round_trip')
2.06±0.07ms 2.13±0.02ms 1.03 io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(',', '_', None)
2.61±0.02ms 2.60±0.02ms 0.99 io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(';', '.', 'high')
2.63±0.03ms 2.62±0.02ms 1.00 io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(';', '.', 'round_trip')
2.59±0.02ms 2.61±0.04ms 1.01 io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(';', '.', None)
2.13±0.02ms 2.14±0.01ms 1.00 io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(';', '_', 'high')
2.15±0.01ms 2.13±0.01ms 0.99 io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(';', '_', 'round_trip')
2.14±0.03ms 2.13±0.01ms 1.00 io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(';', '_', None)
5.28±0.2ms 4.95±0.1ms 0.94 io.csv.ReadCSVIndexCol.time_read_csv_index_col
73.6±1ms 68.1±1ms 0.92 io.csv.ReadCSVMemMapUTF8.time_read_memmapped_utf8
0 0 n/a io.csv.ReadCSVMemoryGrowth.mem_parser_chunks('c')
0 0 n/a io.csv.ReadCSVMemoryGrowth.mem_parser_chunks('python')
804±30μs 781±4μs 0.97 io.csv.ReadCSVParseDates.time_baseline('c')
956±30μs 922±3μs 0.96 io.csv.ReadCSVParseDates.time_baseline('python')
3.05±0.01ms 2.94±0.06ms 0.96 io.csv.ReadCSVParseSpecialDate.time_read_special_date('hm', 'c')
8.58±0.1ms 8.32±0.3ms 0.97 io.csv.ReadCSVParseSpecialDate.time_read_special_date('hm', 'python')
7.07±0.08ms 6.66±0.06ms 0.94 io.csv.ReadCSVParseSpecialDate.time_read_special_date('mY', 'c')
24.8±0.1ms 23.6±0.1ms 0.95 io.csv.ReadCSVParseSpecialDate.time_read_special_date('mY', 'python')
3.43±0.02ms 3.24±0.1ms 0.95 io.csv.ReadCSVParseSpecialDate.time_read_special_date('mdY', 'c')
9.56±0.1ms 9.27±0.3ms 0.97 io.csv.ReadCSVParseSpecialDate.time_read_special_date('mdY', 'python')
9.22±0.09ms 9.44±0.2ms 1.02 io.csv.ReadCSVSkipRows.time_skipprows(10000, 'c')
3.52±0.1ms 3.54±0.09ms 1.01 io.csv.ReadCSVSkipRows.time_skipprows(10000, 'pyarrow')
38.7±0.3ms 38.6±0.3ms 1.00 io.csv.ReadCSVSkipRows.time_skipprows(10000, 'python')
14.6±0.1ms 14.3±0.3ms 0.98 io.csv.ReadCSVSkipRows.time_skipprows(None, 'c')
3.54±0.2ms 3.53±0.06ms 1.00 io.csv.ReadCSVSkipRows.time_skipprows(None, 'pyarrow')
57.0±0.8ms 56.0±1ms 0.98 io.csv.ReadCSVSkipRows.time_skipprows(None, 'python')
10.1±0.6ms 11.1±0.9ms 1.10 io.csv.ReadCSVThousands.time_thousands(',', ',', 'c')
127±2ms 126±0.8ms 0.99 io.csv.ReadCSVThousands.time_thousands(',', ',', 'python')
9.56±0.1ms 10.0±0.1ms 1.05 io.csv.ReadCSVThousands.time_thousands(',', None, 'c')
53.8±1ms 53.8±0.3ms 1.00 io.csv.ReadCSVThousands.time_thousands(',', None, 'python')
10.2±0.5ms 10.9±0.9ms 1.07 io.csv.ReadCSVThousands.time_thousands('|', ',', 'c')
126±5ms 120±3ms 0.95 io.csv.ReadCSVThousands.time_thousands('|', ',', 'python')
9.63±0.4ms 10.0±0.1ms 1.04 io.csv.ReadCSVThousands.time_thousands('|', None, 'c')
53.3±2ms 51.8±0.9ms 0.97 io.csv.ReadCSVThousands.time_thousands('|', None, 'python')
1.30±0.05ms 1.32±0.06ms 1.02 io.csv.ReadUint64Integers.time_read_uint64
3.69±0.1ms 3.69±0.1ms 1.00 io.csv.ReadUint64Integers.time_read_uint64_na_values
3.43±0.09ms 3.45±0.08ms 1.00 io.csv.ReadUint64Integers.time_read_uint64_neg_values
127±0.4ms 126±0.8ms 0.99 io.csv.ToCSV.time_frame('long')
15.8±0.6ms 16.2±0.03ms 1.03 io.csv.ToCSV.time_frame('mixed')
117±0.4ms 116±3ms 0.99 io.csv.ToCSV.time_frame('wide')
9.25±0.3ms 9.51±0.07ms 1.03 io.csv.ToCSVDatetime.time_frame_date_formatting
3.63±0.1ms 3.74±0.02ms 1.03 io.csv.ToCSVDatetimeBig.time_frame(1000)
34.1±0.8ms 34.3±0.2ms 1.01 io.csv.ToCSVDatetimeBig.time_frame(10000)
344±4ms 338±10ms 0.98 io.csv.ToCSVDatetimeBig.time_frame(100000)
500±10ms 500±10ms 1.00 io.csv.ToCSVDatetimeIndex.time_frame_date_formatting_index
148±1ms 148±2ms 1.00 io.csv.ToCSVDatetimeIndex.time_frame_date_no_format_index
731±7ms 732±20ms 1.00 io.csv.ToCSVFloatFormatVariants.time_callable_format
808±20ms 794±3ms 0.98 io.csv.ToCSVFloatFormatVariants.time_new_style_brace_format
865±6ms 884±20ms 1.02 io.csv.ToCSVFloatFormatVariants.time_new_style_thousands_format
870±10ms 896±20ms 1.03 io.csv.ToCSVFloatFormatVariants.time_old_style_percent_format
659±9ms 664±10ms 1.01 io.csv.ToCSVIndexes.time_head_of_multiindex
658±10ms 664±20ms 1.01 io.csv.ToCSVIndexes.time_multiindex
670±9ms 656±2ms 0.98 io.csv.ToCSVIndexes.time_standard_index
191±5ms 192±1ms 1.00 io.csv.ToCSVMultiIndexUnusedLevels.time_full_frame
16.3±0.4ms 16.3±0.2ms 1.00 io.csv.ToCSVMultiIndexUnusedLevels.time_single_index_frame
18.7±0.5ms 18.9±0.1ms 1.01 io.csv.ToCSVMultiIndexUnusedLevels.time_sliced_frame
3.73±0.02ms 3.77±0.09ms 1.01 io.csv.ToCSVPeriod.time_frame_period_formatting(1000, 'D')
3.65±0.1ms 3.76±0.02ms 1.03 io.csv.ToCSVPeriod.time_frame_period_formatting(1000, 'h')
34.6±1ms 35.6±0.09ms 1.03 io.csv.ToCSVPeriod.time_frame_period_formatting(10000, 'D')
35.2±1ms 35.8±0.3ms 1.02 io.csv.ToCSVPeriod.time_frame_period_formatting(10000, 'h')
1.10±0.03ms 1.13±0ms 1.03 io.csv.ToCSVPeriod.time_frame_period_formatting_default(1000, 'D')
1.36±0.04ms 1.40±0.01ms 1.03 io.csv.ToCSVPeriod.time_frame_period_formatting_default(1000, 'h')
9.59±0.3ms 9.83±0.01ms 1.03 io.csv.ToCSVPeriod.time_frame_period_formatting_default(10000, 'D')
12.2±0.4ms 12.5±0.06ms 1.03 io.csv.ToCSVPeriod.time_frame_period_formatting_default(10000, 'h')
1.16±0.03ms 1.14±0ms 0.98 io.csv.ToCSVPeriod.time_frame_period_formatting_default_explicit(1000, 'D')
1.39±0.03ms 1.42±0.01ms 1.02 io.csv.ToCSVPeriod.time_frame_period_formatting_default_explicit(1000, 'h')
9.91±0.02ms 9.93±0.03ms 1.00 io.csv.ToCSVPeriod.time_frame_period_formatting_default_explicit(10000, 'D')
12.6±0.04ms 12.6±0.03ms 1.00 io.csv.ToCSVPeriod.time_frame_period_formatting_default_explicit(10000, 'h')
5.43±0.03ms 5.35±0.02ms 0.99 io.csv.ToCSVPeriodIndex.time_frame_period_formatting_index(1000, 'D')
5.46±0.04ms 5.25±0.1ms 0.96 io.csv.ToCSVPeriodIndex.time_frame_period_formatting_index(1000, 'h')
50.8±0.3ms 50.2±0.3ms 0.99 io.csv.ToCSVPeriodIndex.time_frame_period_formatting_index(10000, 'D')
50.9±0.5ms 50.3±0.2ms 0.99 io.csv.ToCSVPeriodIndex.time_frame_period_formatting_index(10000, 'h')
1.06±0ms 1.13±0ms 1.06 io.csv.ToCSVPeriodIndex.time_frame_period_formatting_index_default(1000, 'D')
1.36±0.06ms 1.40±0ms 1.03 io.csv.ToCSVPeriodIndex.time_frame_period_formatting_index_default(1000, 'h')
9.03±0.3ms 9.25±0.02ms 1.03 io.csv.ToCSVPeriodIndex.time_frame_period_formatting_index_default(10000, 'D')
12.0±0.1ms 12.0±0.04ms 0.99 io.csv.ToCSVPeriodIndex.time_frame_period_formatting_index_default(10000, 'h')
2.71±0.02ms 2.68±0.02ms 0.99 io.csv.ToCSVPeriodIndex.time_frame_period_formatting_index_default_explicit(1000, 'D')
3.01±0.02ms 2.97±0.02ms 0.99 io.csv.ToCSVPeriodIndex.time_frame_period_formatting_index_default_explicit(1000, 'h')
23.8±0.3ms 23.3±0.1ms 0.98 io.csv.ToCSVPeriodIndex.time_frame_period_formatting_index_default_explicit(10000, 'D')
26.7±0.3ms 26.2±0.06ms 0.98 io.csv.ToCSVPeriodIndex.time_frame_period_formatting_index_default_explicit(10000, 'h')

cc @WillAyd

@mroeschke mroeschke requested a review from WillAyd October 8, 2025 15:48
@WillAyd WillAyd left a comment

What makes this faster than the original code? It seems like we've only added instructions to the conversion function(s), so I'm worried we are overlooking something.

} else {
*error = ERROR_OVERFLOW;
return 0;
break;
Member

Can we try to keep these as immediate returns? This opens the door to ambiguous parser behavior in the case of "multiple" failures.

Member Author

The goal of the changes in #62542 was to address this branch, where big numbers that indicate a float were cast to string due to overflow.

Can I still check the subsequent characters to change the error code? If not, I don't think there is anything left to do in this PR, and it's best to revert the problematic commit.

Member Author

I put back the immediate return, but added an inline function before it to check whether the value is not an integer after the overflow.
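For context, the shape of that change looks roughly like the sketch below. This is a minimal standalone illustration, not the tokenizer.c code from the PR; the names remaining_is_integer and parse_int_demo and the simplified overflow check are hypothetical.

#include <limits.h>
#include <stdio.h>

/* Hypothetical helper: after an overflow, report whether the rest of the
   token still looks like an integer (digits only up to the terminator). */
static inline int remaining_is_integer(const char *p) {
  while (*p != '\0') {
    if (*p < '0' || *p > '9') {
      return 0; /* non-digit found: not a plain integer */
    }
    p++;
  }
  return 1;
}

/* Simplified parse loop: accumulate digits and, on overflow, consult the
   helper to pick an error code before taking the immediate-return path. */
static long parse_int_demo(const char *s, int *error) {
  long value = 0;
  *error = 0;
  for (const char *p = s; *p != '\0'; p++) {
    int d = *p - '0';
    if (d < 0 || d > 9) {
      *error = 2; /* invalid character */
      return 0;
    }
    if (value > (LONG_MAX - d) / 10) {
      /* overflow: decide which error to report, then return immediately */
      *error = remaining_is_integer(p) ? 1 /* overflow */ : 2 /* invalid */;
      return 0;
    }
    value = value * 10 + d;
  }
  return value;
}

int main(void) {
  int error;
  parse_int_demo("123456789123456789123456789", &error);
  printf("overflowing integer -> error %d\n", error); /* 1 */
  parse_int_demo("12345678912345678912345.67", &error);
  printf("overflowing float   -> error %d\n", error); /* 2 */
  return 0;
}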

Member

Ah OK thanks - that's helpful. I somewhat disagree with the premise of that change to cast to float even if it's a lossy operation. I understand that in some cases there is a desire for numeric operations on numbers like that, but it's unclear that it should take precedence over the string cast, which is in some sense more "value preserving".

The larger issue is that pandas does not have native support for Decimal precision types

Member Author

I somewhat disagree with the premise of that change to cast to float even if it's a lossy operation.

Even if it's a lossy operation, I don't think it's up to an integer parsing function to decide that the value should be a string. The way I set up this PR and #62542, the parser skips trying to treat the word as an integer once it verifies that it's not one. It still tries to parse it as the other datatypes in the order in which they are prioritized here:

for dt in self.dtype_cast_order:

@Alvaro-Kothe
Member Author

What makes this faster than the original code?

I don't put much weight on the performance increase that I reported in the first edit. I still need to update the description after all these changes; I'm just waiting until all the changes I made are considered correct.

Additionally, the changes should only affect integer parsing; everything else can be considered noise.

@Alvaro-Kothe Alvaro-Kothe requested a review from WillAyd October 8, 2025 19:20
@Alvaro-Kothe
Member Author

@WillAyd I updated the benchmark results in the description. There were no significant changes.

return self->seen_uint && (self->seen_sint || self->seen_null);
}

static inline void check_for_invalid_char(const char *p_item, int *error) {
Member

Can you document what this function does? The name check_for_invalid_char is a bit too vague - this is better described as something like cast_char_p_as_float, no?

Member

I'd also suggest that you either return int and drop the int * argument, or return something useful (e.g. the parsed float value) and then set the pointer value.
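For illustration only, the two shapes being suggested could look roughly like this (hypothetical names and placeholder bodies, not code from the PR):

#include <stdio.h>
#include <stdlib.h>

/* Option 1: return the status code directly and drop the out-parameter. */
static inline int check_chars_status(const char *p_item) {
  return (*p_item == '\0') ? 0 : -1; /* placeholder check */
}

/* Option 2: return the useful value (here a parsed double) and report the
   status through the pointer argument instead. */
static inline double parse_as_float(const char *p_item, int *status) {
  char *end = NULL;
  double val = strtod(p_item, &end);
  *status = (*end == '\0') ? 0 : -1;
  return val;
}

int main(void) {
  int status;
  double v = parse_as_float("12345678912345678912345.67", &status);
  printf("status=%d value=%g empty-string check=%d\n", status, v,
         check_chars_status(""));
  return 0;
}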

Member Author

I added a Doxygen comment. It's also returning the pointer to the last verified character.

Member Author

I kept the error pointer in the function to prevent code duplication, and also because the main purpose of the function is just to assign a value to it. Considering that it's now returning the position of the last verified character, it's possible to change the error value outside the function, but I think it's cleaner the way it is.

Member

Why is it returning a char *? It doesn't look like you are using the return value anywhere.

I'm leaning more towards the former approach unless there is a reason for it to return a value; it's really common practice to return an integral code to denote whether an error occurred (even in C++ you'll see Arrow do this all over the place). Tucking that return value away in a pointer is far less common.

It's also more performant to return the int value directly, although in this particular case that's probably too far down the stack to be noticeable.

Member Author

I changed it to return a status code.

p_item++;
}

while (*p_item != '\0' && isspace_ascii(*p_item)) {
Member

Can this be combined with the previous loop? Is there a reason for trailing whitespace to be handled specially, or is there a reason at all to allow whitespace?

Member Author

It needs a separate loop because a case like "7890123 1351713789" should be invalid.
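In other words, the trailing-whitespace loop may only run up to the terminator; any character after it makes the token invalid. A minimal standalone sketch of that shape (hypothetical names, not the tokenizer.c implementation):

#include <ctype.h>
#include <stdio.h>

/* Returns 1 if s is a run of digits optionally followed by trailing
   whitespace only, 0 otherwise ("7890123 1351713789" must be rejected). */
static int is_int_with_trailing_space(const char *s) {
  const char *p = s;
  if (*p == '\0') {
    return 0;
  }
  while (*p != '\0' && isdigit((unsigned char)*p)) {
    p++; /* first loop: consume the digits */
  }
  while (*p != '\0' && isspace((unsigned char)*p)) {
    p++; /* second loop: consume only trailing whitespace */
  }
  return *p == '\0'; /* anything left over makes the token invalid */
}

int main(void) {
  printf("%d\n", is_int_with_trailing_space("7890123  "));          /* 1 */
  printf("%d\n", is_int_with_trailing_space("7890123 1351713789")); /* 0 */
  return 0;
}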

Member

Why do we allow trailing whitespace, though?

Member Author

It is permitted below too if an overflow doesn't occur. I added it in this function to make it consistent.

Member

Why are we doing this at all though? Are we stripping trailing whitespace for any other case in the tokenizer?

@Alvaro-Kothe Alvaro-Kothe Oct 9, 2025

Why are we doing this at all though?

I couldn't find any particular reason in the code to do this.

Are we stripping trailing whitespace for any other case in the tokenizer?

Every function that parses a string in tokenizer.c ignores leading and trailing whitespace.

I changed the behavior to just check the remaining characters after consuming all the digits. The way the function was used before, calling it didn't change the pointer position anyway.

This function no longer permits trailing whitespace.

@Alvaro-Kothe Alvaro-Kothe requested a review from WillAyd October 9, 2025 17:45
d = *++p;
} else {
*error = ERROR_OVERFLOW;
int status = check_for_invalid_char(p);
Member

I'm not sure why tokenizer.h defines errors individually, but it would be much easier if you created an enum like:

enum TokenizerError {
    TOKENIZER_OK,
    ERROR_NO_DIGITS,
    ERROR_OVERFLOW,
    ERROR_INVALID_CHARS
};

Then you can just assign *error = check_for_invalid_char(p) and let the call stack naturally handle this (where 0 is no error).

Although it's still a bit strange to apply an error on the preceding line and then reassign it here. I wonder if there shouldn't be a generic function to check for errors here.

Member

I also think the name ERROR_INVALID_CHARS is a little vague - maybe ERROR_INVALID_FLOAT_STRING is better?

Member Author

I'm not sure why tokenizer.h defines errors individually, but it would be much easier if you created an enum

I was actually considering doing that, but decided against it, thinking that the refactor would pollute this PR.

I also think the name ERROR_INVALID_CHARS is a little vague - maybe ERROR_INVALID_FLOAT_STRING is better?

This function checks for any invalid character, not just characters specific to floats, so I think ERROR_INVALID_CHARS is preferable.

I wonder if there shouldn't be a generic function to check for errors here

Like a function that overwrites the current error according to some precedence?

For example: TOKENIZER_OK < ERROR_NO_DIGITS < ERROR_OVERFLOW < ERROR_INVALID_CHARS. If the error is currently TOKENIZER_OK it would always be overwritten, while ERROR_INVALID_CHARS would take precedence over everything: it would always replace the current error and never be overwritten.
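As a purely illustrative sketch of that precedence idea (the enum mirrors the values discussed above, but update_error and the driver are hypothetical):

#include <stdio.h>

enum TokenizerError {
  TOKENIZER_OK,
  ERROR_NO_DIGITS,
  ERROR_OVERFLOW,
  ERROR_INVALID_CHARS
};

/* Hypothetical helper: keep whichever error has the higher precedence,
   relying on the declaration order of the enum. */
static inline void update_error(enum TokenizerError *current,
                                enum TokenizerError candidate) {
  if (candidate > *current) {
    *current = candidate;
  }
}

int main(void) {
  enum TokenizerError err = TOKENIZER_OK;
  update_error(&err, ERROR_OVERFLOW);      /* overwrites TOKENIZER_OK */
  update_error(&err, ERROR_NO_DIGITS);     /* lower precedence: ignored */
  update_error(&err, ERROR_INVALID_CHARS); /* highest precedence: wins */
  printf("final error = %d\n", err);       /* prints 3 */
  return 0;
}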

Member Author

Although it's still a bit strange to apply an error on the preceding line and then reassign it here.

Another option would be an if-else statement. The overflow error should always be assigned if we don't find an invalid character.

Member Author

I refactored the code for int64 and uint64 to use the enum. Additionally, I preferred to use if-else to assign the error code.

Another PR would be required to create and handle errors in other tokenizers.

* @return Integer 0 if the remainder of the string contains only digits,
* otherwise returns the error code for [ERROR_INVALID_CHARS].
*/
static inline int check_for_invalid_char(const char *p_item) {
Member

Can you add the length of the string as an argument? I realize this is a static function, but it's still best to guard against buffer overruns in case of a future refactor.

Member Author

This information is not available in any of the parent functions, so I would have to call strlen to use it. I don't see much value in that.

Member

It's about minimizing the risk during a refactor. C is not an inherently safe language, so you need to be somewhat paranoid when writing functions.

You are correct that, at face value, calling strlen is pretty... well, dumb. But it's a sign that a refactor could happen in another PR to better keep track of the length of a string while processing it.
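As a hedged sketch of what a length-bounded variant could look like (illustrative only; the only_digits_n name and the simplified check are hypothetical, and the pandas code does not currently track the length, which is exactly the point being made):

#include <stddef.h>
#include <stdio.h>

/* Hypothetical length-aware variant: never reads past p_item + length,
   even if the buffer is not null-terminated. */
static inline int only_digits_n(const char *p_item, size_t length) {
  for (size_t i = 0; i < length && p_item[i] != '\0'; i++) {
    if (p_item[i] < '0' || p_item[i] > '9') {
      return 0;
    }
  }
  return 1;
}

int main(void) {
  char not_terminated[3] = {'1', '2', '3'}; /* deliberately no '\0' */
  printf("%d\n", only_digits_n(not_terminated, sizeof(not_terminated))); /* 1 */
  printf("%d\n", only_digits_n("12a45", 5));                             /* 0 */
  return 0;
}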

@Alvaro-Kothe Alvaro-Kothe requested a review from WillAyd October 9, 2025 20:23
data[i] = str_to_int64(word, INT64_MIN, INT64_MAX,
&error, parser.thousands)
if error != 0:
if error != TOKENIZER_OK:
Member

This pattern would definitely be cleaner with a macro that returns on a non-zero code (a follow-up PR is fine).
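One common shape for such a guard in C looks like the sketch below (a generic example with a hypothetical RETURN_IF_ERROR name, not taken from the pandas codebase; how it would be wired into the Cython call sites is left open):

#include <stdio.h>

/* Generic guard macro: evaluate an expression that yields an error code and
   bail out of the current function as soon as it is non-zero. */
#define RETURN_IF_ERROR(expr)                                                \
  do {                                                                       \
    int _status = (expr);                                                    \
    if (_status != 0) {                                                      \
      return _status;                                                        \
    }                                                                        \
  } while (0)

static int step_one(void) { return 0; }
static int step_two(void) { return 7; }

static int run_pipeline(void) {
  RETURN_IF_ERROR(step_one()); /* status 0: keeps going */
  RETURN_IF_ERROR(step_two()); /* status 7: returns immediately */
  return 0;
}

int main(void) {
  printf("pipeline status = %d\n", run_pipeline()); /* prints 7 */
  return 0;
}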

* @return TOKENIZER_OK if the remainder of the string contains only digits,
* otherwise returns ERROR_INVALID_CHARS.
*/
static inline TokenizerError check_for_invalid_char(const char *p_item) {
Member

I think my comment got lost, but you should add the length of the string to the arguments here, so that we can mitigate the risk of buffer overruns.

Member Author

I responded to it. The length of the string is not available in the parent functions, so I would have to call strlen to use this function, and strlen also relies on the null character.

} else {
*error = ERROR_OVERFLOW;
TokenizerError status;
if ((status = check_for_invalid_char(p)) > ERROR_OVERFLOW) {
Member

Adding ordering semantics to the enum doesn't make the intent very clear. It's probably best to make the function check for more than invalid chars and return the appropriate error code, rather than handling this logic multiple times in the caller.

Member Author

Probably best to make the function check for more than invalid chars and return the appropriate error code

Currently, this function is only called after an integer overflow, and it starts checking from the character that caused the overflow. Considering its usage for integer parsing, I don't see other error codes that it should return.

Adding ordering semantics to the enum doesn't make the intent very clear.

Should I just compare with TOKENIZER_OK?

@WillAyd WillAyd Oct 10, 2025

Interesting - I wasn't thinking of the full context here but you bring up some interesting points.

I guess I'm wondering why we even need this function at all. Long ago, pandas only supported C89, and even when C99 was around it took a long time for MSVC to support that. However, those days are long behind us.

C99 offers the strtoull function, which would seemingly replace everything this function is trying to do.

#include <errno.h>
#include <inttypes.h>
#include <stdlib.h>
#include <stdio.h>

int main() {
  char* endptr;
  errno = 0;

  uint64_t val = strtoull("123456789123456789123456789", &endptr, 0);
  if (*endptr != '\0') {
    printf("End of string not reached\n");
    return errno;
  }
  if (errno == EINVAL) {
    printf("Invalid value\n");
    return errno;
  }
  if (errno == ERANGE) {
    printf("Value is out of range\n");
    return errno;
  }

  printf("My value is: %lu\n", val);
  return errno;
}

So maybe we can use the standard function instead of rolling our own logic here?

Member Author

Nice! I'm always looking to simplify things. But I don't think it will be as straightforward as just calling strtoull, mainly because of the thousands separator, which would require some preprocessing before calling this function.

If there is no thousands separator, then I think it will be very simple.

Do you think that switching to strtoull should go into this PR or get a dedicated one?
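For what it's worth, that preprocessing step could be sketched roughly as below (assumptions: a small fixed-size scratch buffer is acceptable and the separator only ever appears between digits; parse_uint64_with_sep is a hypothetical name and none of this is in the PR):

#include <errno.h>
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical wrapper: strip the thousands separator into a scratch buffer,
   then defer to strtoull for the actual conversion. */
static uint64_t parse_uint64_with_sep(const char *s, char tsep, int *error) {
  char buf[64];
  size_t j = 0;

  for (size_t i = 0; s[i] != '\0'; i++) {
    if (s[i] == tsep) {
      continue; /* drop the separator before conversion */
    }
    if (j + 1 >= sizeof(buf)) {
      *error = 1; /* token too long for the scratch buffer */
      return 0;
    }
    buf[j++] = s[i];
  }
  buf[j] = '\0';

  errno = 0;
  char *end = NULL;
  uint64_t val = strtoull(buf, &end, 10);
  *error = (errno == ERANGE || end == buf || *end != '\0') ? 1 : 0;
  return val;
}

int main(void) {
  int error;
  uint64_t v = parse_uint64_with_sep("1,234,567", ',', &error);
  printf("value=%" PRIu64 " error=%d\n", v, error); /* value=1234567 error=0 */
  return 0;
}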

@Alvaro-Kothe Alvaro-Kothe Oct 11, 2025

I created PR #62658, which refactors the current integer parsing functions to use the ones from the standard library. I think that one should be prioritized over this one, as it will simplify this PR.
