[ESQL] Fix ORC type support gaps#145074
Conversation
Fix three ORC type mapping issues in the format reader: - DECIMAL columns (all precisions) crashed with QlIllegalArgumentException because DecimalColumnVector was not handled in the double block path. Add support for both DecimalColumnVector and Decimal64ColumnVector. - TIMESTAMP/TIMESTAMP_INSTANT now map to DATE_NANOS to preserve full nanosecond precision. The sub-millisecond nanos[] array was previously ignored, silently dropping up to 999,999 nanoseconds per timestamp. - BINARY columns now map to UNSUPPORTED instead of KEYWORD to avoid exposing raw bytes as garbled text.
|
Pinging @elastic/es-analytical-engine (Team:Analytics) |
|
Hi @costin, I've created a changelog YAML for you. |
Accept DATE_NANOS as compatible with DATETIME in test type assertions, matching the existing INTEGER/LONG coercion pattern. External source formats like ORC that preserve nanosecond precision map timestamps to DATE_NANOS, but the shared csv-spec files declare columns as :date (DATETIME).
| if (actualType == Type.INTEGER && expectedType == Type.LONG) { | ||
| actualType = Type.LONG; | ||
| } | ||
| if (actualType == Type.DATE_NANOS && expectedType == Type.DATETIME) { |
There was a problem hiding this comment.
Above we differentiate between millis and nanos, but not in the CSV tests. Should we?
There was a problem hiding this comment.
Good question. The csv-spec files are shared across all external source formats (CSV, NdJSON, ORC, Parquet) and declare timestamp columns as :date (DATETIME). Since ORC preserves nanos natively, its reader returns DATE_NANOS — a strictly higher-precision form of DATETIME.
The coercion follows the existing INTEGER→LONG pattern (line above): it only accepts DATE_NANOS where DATETIME is expected, never the reverse. A test that explicitly asserts :date_nanos in its header still works.
The alternative would be duplicating the entire spec file set per format, which doesn't seem worth it. Added a comment explaining the rationale.
There was a problem hiding this comment.
I feel like if a storage returns DATE_NANOS, we should assert that, not update the type. Otherwise this still has the potential to mask bugs.
There was a problem hiding this comment.
Agreed. Removed the coercion entirely — ORC now returns DATETIME like the other formats, so there's nothing to mask.
| long nanos; | ||
| if (child instanceof TimestampColumnVector ts) { | ||
| if (ts.noNulls == false && ts.isNull[idx]) { | ||
| nanos = 0L; |
There was a problem hiding this comment.
Is this right? We add a 0L for a null in the source? And same in the 2x else branches.
Should we add some tests for these?
There was a problem hiding this comment.
Good catch. Two things here:
The null→0L pattern: this is inherited from all other list block methods (createListIntBlock appends 0, createListLongBlock appends 0L, createListDatetimeBlock appends 0L). Fixing it properly requires a design decision about whether multi-value positions can contain a mix of null and non-null elements — that's a broader change I'd rather do separately across all list methods.
The catch-all else branch: you're right, this was dead code silently producing epoch-zero garbage. Changed both createListDateNanosBlock and createListDatetimeBlock to throw QlIllegalArgumentException instead, consistent with createDoubleBlock.
| nanos = lv.vector[idx] * NANOS_PER_DAY; | ||
| } | ||
| } else { | ||
| nanos = 0L; |
There was a problem hiding this comment.
Also here, what other type could it be?
There was a problem hiding this comment.
For DATE_NANOS, only TimestampColumnVector should appear (TIMESTAMP and TIMESTAMP_INSTANT). The LongColumnVector branch is defensive (DATE uses LongColumnVector but maps to DATETIME, not DATE_NANOS, so it shouldn't reach this method in practice).
Changed the else to throw QlIllegalArgumentException instead of silently producing 0L — reaching it would indicate a bug. Same change applied to createListDatetimeBlock for consistency.
- Add explanatory comment for DATE_NANOS/DATETIME coercion in CsvAssert, clarifying the rationale. - Replace silent else catch-alls with QlIllegalArgumentException in createListDateNanosBlock and createListDatetimeBlock, consistent with createDoubleBlock.
| case VARCHAR, CHAR -> DataType.KEYWORD; | ||
| case BINARY -> DataType.KEYWORD; | ||
| case TIMESTAMP, TIMESTAMP_INSTANT, DATE -> DataType.DATETIME; | ||
| case TIMESTAMP, TIMESTAMP_INSTANT -> DataType.DATE_NANOS; |
There was a problem hiding this comment.
Going back to this, not positive on the right solution here. Our DATE_NANOS have limited range (Epoch to MAX_NANOSECOND_INSTANT). Which means we won't be able to decode dates outside of that, should they appear in files.
- We should probably add unit tests with such dates/timestamps.
- I've extended
external-basic.csv-specwith a test that passes in, for instance Parquet, but not ORC after applying the PR:
readAllEmployees2
required_capability: external_command
EXTERNAL "{{employees}}"
| KEEP emp_no, first_name, last_name, birth_date, gender, hire_date, languages, height, salary, still_hired
| WHERE birth_date == TO_DATETIME("1953-09-02T00:00:00.000Z")
| SORT emp_no
| LIMIT 5;
emp_no:integer | first_name:keyword | last_name:keyword | birth_date:date | gender:keyword | hire_date:date | languages:integer | height:double | salary:integer | still_hired:boolean
10001 | "Georgi" | "Facello" | 1953-09-02T00:00:00.000Z | "M" | 1986-06-26T00:00:00.000Z | 2 | 2.03 | 57305 | true
;
with
[ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [test-cluster-0] fatal error in thread [elasticsearch[test-cluster-0][esql_worker][T#3]], exiting java.lang.AssertionError
at org.elasticsearch.server@9.4.0-SNAPSHOT/org.elasticsearch.common.time.DateUtils.compareNanosToMillis(DateUtils.java:339)
at org.elasticsearch.xpack.esql.expression.predicate.operator.comparison.Equals.processNanosMillis(Equals.java:222)
at org.elasticsearch.xpack.esql.expression.predicate.operator.comparison.EqualsNanosMillisEvaluator.eval(EqualsNanosMillisEvaluator.java:100)
at org.elasticsearch.xpack.esql.expression.predicate.operator.comparison.EqualsNanosMillisEvaluator.eval(EqualsNanosMillisEvaluator.java:54)
at org.elasticsearch.compute.operator.FilterOperator.process(FilterOperator.java:43)
at org.elasticsearch.compute.operator.AbstractPageMappingOperator.getOutput(AbstractPageMappingOperator.java:89)
at org.elasticsearch.compute.operator.Driver.runSingleLoopIteration(Driver.java:323)
at org.elasticsearch.compute.operator.Driver.run(Driver.java:194)
at org.elasticsearch.compute.operator.Driver$1.doRun(Driver.java:506)
at org.elasticsearch.server@9.4.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:27)
at org.elasticsearch.compute.operator.DriverScheduler$1.doRun(DriverScheduler.java:57)
at org.elasticsearch.server@9.4.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:27)
at org.elasticsearch.server@9.4.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:35)
at org.elasticsearch.server@9.4.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:1114)
at org.elasticsearch.server@9.4.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:27)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1575)
[2026-03-30T23:39:36,619][INFO ][o.e.x.e.q.o.OrcFormatSpecIT][test] [csv-spec:external-basic.readAllEmployees2 [LOCAL]] after test
There was a problem hiding this comment.
Reverted TIMESTAMP/TIMESTAMP_INSTANT back to DATETIME. DATE_NANOS can't represent pre-epoch dates so it's not viable here until that limitation is addressed separately.
| if (actualType == Type.INTEGER && expectedType == Type.LONG) { | ||
| actualType = Type.LONG; | ||
| } | ||
| if (actualType == Type.DATE_NANOS && expectedType == Type.DATETIME) { |
There was a problem hiding this comment.
I feel like if a storage returns DATE_NANOS, we should assert that, not update the type. Otherwise this still has the potential to mask bugs.
DATE_NANOS cannot represent pre-epoch dates (assert nanos >= 0 in DateUtils.compareNanosToMillis), which crashes on ORC files with historical timestamps. Revert to DATETIME (millis precision) until DATE_NANOS range is extended.
c3cb593 to
c188932
Compare
Adds filterByPreEpochBirthDate to the shared csv-spec that filters on a 1953 birth date, protecting against re-introducing DATE_NANOS without fixing its range.
|
Updated the PR description and added the |
🔍 Preview links for changed docs⏳ Building and deploying preview... View progress This comment will be updated with preview links when the build is complete. |
ℹ️ Important: Docs version tagging👋 Thanks for updating the docs! Just a friendly reminder that our docs are now cumulative. This means all 9.x versions are documented on the same page and published off of the main branch, instead of creating separate pages for each minor version. We use applies_to tags to mark version-specific features and changes. Expand for a quick overviewWhen to use applies_to tags:✅ At the page level to indicate which products/deployments the content applies to (mandatory) What NOT to do:❌ Don't remove or replace information that applies to an older version 🤔 Need help?
|
testPreEpochTimestamp: verifies 1953 date reads as negative millis, protecting the DATETIME path against DATE_NANOS regression at the reader level. testReadListTimestampColumn: exercises createListDatetimeBlock which had zero test coverage; includes a pre-epoch element.
Fix three ORC type mapping issues in the format reader: - DECIMAL columns (all precisions) crashed with QlIllegalArgumentException because DecimalColumnVector was not handled in the double block path. Add support for both DecimalColumnVector and Decimal64ColumnVector. - TIMESTAMP/TIMESTAMP_INSTANT remain mapped to DATETIME (millisecond precision). Mapping to DATE_NANOS was investigated but DATE_NANOS cannot represent pre-epoch dates (assert nanos >= 0 in DateUtils), which crashes on ORC files with historical timestamps. Nanosecond precision is deferred until the DATE_NANOS range limitation is addressed. - BINARY columns now map to UNSUPPORTED instead of KEYWORD to avoid exposing raw bytes as garbled text. Also adds a shared csv-spec regression test that filters on a pre-epoch birth date, protecting against re-introducing the DATE_NANOS mapping without fixing the range limitation. Developed with AI-assisted tooling
…rics * upstream/main: (428 commits) ESQL: DS: Add inference/RERANK tests (elastic#145229) Unmute MMR logical plan test (elastic#145311) Do not attempt marking store as corrupted if the check is rejected due to shutdown (elastic#145209) feat(tsdb): add pipeline runtime and rename stage interfaces (elastic#145175) Fix UnresolvedException on PromQL by(step) grouping (elastic#145307) ES|QL: Optimize MMR by reducing cache size and lookup (elastic#145014) Prometheus labels/series APIs: support multiple match[] selectors (elastic#145298) Move ClientScrollablePaginatedHitSource into Reindex Module (elastic#144100) mute test class for elastic#145277 CPS mode for ViewResolver (elastic#145219) [ESQL] Disables GroupedTopNBenchmark temporarily (elastic#145124) Make exponential_histogram the default histogram type for HTTP OTLP endpoint (elastic#145065) More tests requiring an explicit confidence interval (elastic#145232) ES|QL: Adding `USER_AGENT` command (elastic#144384) ESQL: enable Generative IT after more fixes (elastic#145112) Rework FieldMapper parameter tests to not use merge builders (elastic#145213) [ESQL] Fix ORC type support gaps (elastic#145074) [Test] Unmute FollowingEngineTests.testProcessOnceOnPrimary (elastic#145192) Add PrometheusSeriesRestAction for /_prometheus/api/v1/series endpoint (elastic#144494) Prometheus labels API: add rest action (elastic#144952) ...
Fix three ORC type mapping issues in the format reader: - DECIMAL columns (all precisions) crashed with QlIllegalArgumentException because DecimalColumnVector was not handled in the double block path. Add support for both DecimalColumnVector and Decimal64ColumnVector. - TIMESTAMP/TIMESTAMP_INSTANT remain mapped to DATETIME (millisecond precision). Mapping to DATE_NANOS was investigated but DATE_NANOS cannot represent pre-epoch dates (assert nanos >= 0 in DateUtils), which crashes on ORC files with historical timestamps. Nanosecond precision is deferred until the DATE_NANOS range limitation is addressed. - BINARY columns now map to UNSUPPORTED instead of KEYWORD to avoid exposing raw bytes as garbled text. Also adds a shared csv-spec regression test that filters on a pre-epoch birth date, protecting against re-introducing the DATE_NANOS mapping without fixing the range limitation. Developed with AI-assisted tooling
Fix three ORC type mapping issues in the format reader:
DECIMAL columns (all precisions) crashed with
QlIllegalArgumentException because DecimalColumnVector
was not handled in the double block path. Add support
for both DecimalColumnVector and Decimal64ColumnVector.
TIMESTAMP/TIMESTAMP_INSTANT remain mapped to DATETIME
(millisecond precision). Mapping to DATE_NANOS was
investigated but DATE_NANOS cannot represent pre-epoch
dates (assert nanos >= 0 in DateUtils), which crashes
on ORC files with historical timestamps. Nanosecond
precision is deferred until the DATE_NANOS range
limitation is addressed.
BINARY columns now map to UNSUPPORTED instead of KEYWORD
to avoid exposing raw bytes as garbled text.
Also adds a shared csv-spec regression test that filters on
a pre-epoch birth date, protecting against re-introducing the
DATE_NANOS mapping without fixing the range limitation.
Developed with AI-assisted tooling