Skip to content

[ESQL] Fix ORC type support gaps#145074

Merged
costin merged 8 commits intoelastic:mainfrom
costin:esql/orc-decimal-support
Mar 31, 2026
Merged

[ESQL] Fix ORC type support gaps#145074
costin merged 8 commits intoelastic:mainfrom
costin:esql/orc-decimal-support

Conversation

@costin
Copy link
Copy Markdown
Member

@costin costin commented Mar 27, 2026

Fix three ORC type mapping issues in the format reader:

  • DECIMAL columns (all precisions) crashed with
    QlIllegalArgumentException because DecimalColumnVector
    was not handled in the double block path. Add support
    for both DecimalColumnVector and Decimal64ColumnVector.

  • TIMESTAMP/TIMESTAMP_INSTANT remain mapped to DATETIME
    (millisecond precision). Mapping to DATE_NANOS was
    investigated but DATE_NANOS cannot represent pre-epoch
    dates (assert nanos >= 0 in DateUtils), which crashes
    on ORC files with historical timestamps. Nanosecond
    precision is deferred until the DATE_NANOS range
    limitation is addressed.

  • BINARY columns now map to UNSUPPORTED instead of KEYWORD
    to avoid exposing raw bytes as garbled text.

Also adds a shared csv-spec regression test that filters on
a pre-epoch birth date, protecting against re-introducing the
DATE_NANOS mapping without fixing the range limitation.

Developed with AI-assisted tooling

Fix three ORC type mapping issues in the format reader:

- DECIMAL columns (all precisions) crashed with
  QlIllegalArgumentException because DecimalColumnVector
  was not handled in the double block path. Add support
  for both DecimalColumnVector and Decimal64ColumnVector.

- TIMESTAMP/TIMESTAMP_INSTANT now map to DATE_NANOS to
  preserve full nanosecond precision. The sub-millisecond
  nanos[] array was previously ignored, silently dropping
  up to 999,999 nanoseconds per timestamp.

- BINARY columns now map to UNSUPPORTED instead of KEYWORD
  to avoid exposing raw bytes as garbled text.
@costin costin requested a review from bpintea March 27, 2026 09:29
@costin costin enabled auto-merge (squash) March 27, 2026 09:29
@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Mar 27, 2026
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Hi @costin, I've created a changelog YAML for you.

costin added 2 commits March 27, 2026 12:47
Accept DATE_NANOS as compatible with DATETIME in test
type assertions, matching the existing INTEGER/LONG
coercion pattern. External source formats like ORC
that preserve nanosecond precision map timestamps to
DATE_NANOS, but the shared csv-spec files declare
columns as :date (DATETIME).
@bpintea bpintea disabled auto-merge March 27, 2026 14:32
Copy link
Copy Markdown
Contributor

@bpintea bpintea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some questions.

if (actualType == Type.INTEGER && expectedType == Type.LONG) {
actualType = Type.LONG;
}
if (actualType == Type.DATE_NANOS && expectedType == Type.DATETIME) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Above we differentiate between millis and nanos, but not in the CSV tests. Should we?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. The csv-spec files are shared across all external source formats (CSV, NdJSON, ORC, Parquet) and declare timestamp columns as :date (DATETIME). Since ORC preserves nanos natively, its reader returns DATE_NANOS — a strictly higher-precision form of DATETIME.

The coercion follows the existing INTEGER→LONG pattern (line above): it only accepts DATE_NANOS where DATETIME is expected, never the reverse. A test that explicitly asserts :date_nanos in its header still works.

The alternative would be duplicating the entire spec file set per format, which doesn't seem worth it. Added a comment explaining the rationale.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like if a storage returns DATE_NANOS, we should assert that, not update the type. Otherwise this still has the potential to mask bugs.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Removed the coercion entirely — ORC now returns DATETIME like the other formats, so there's nothing to mask.

long nanos;
if (child instanceof TimestampColumnVector ts) {
if (ts.noNulls == false && ts.isNull[idx]) {
nanos = 0L;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this right? We add a 0L for a null in the source? And same in the 2x else branches.
Should we add some tests for these?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Two things here:

The null→0L pattern: this is inherited from all other list block methods (createListIntBlock appends 0, createListLongBlock appends 0L, createListDatetimeBlock appends 0L). Fixing it properly requires a design decision about whether multi-value positions can contain a mix of null and non-null elements — that's a broader change I'd rather do separately across all list methods.

The catch-all else branch: you're right, this was dead code silently producing epoch-zero garbage. Changed both createListDateNanosBlock and createListDatetimeBlock to throw QlIllegalArgumentException instead, consistent with createDoubleBlock.

nanos = lv.vector[idx] * NANOS_PER_DAY;
}
} else {
nanos = 0L;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also here, what other type could it be?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For DATE_NANOS, only TimestampColumnVector should appear (TIMESTAMP and TIMESTAMP_INSTANT). The LongColumnVector branch is defensive (DATE uses LongColumnVector but maps to DATETIME, not DATE_NANOS, so it shouldn't reach this method in practice).

Changed the else to throw QlIllegalArgumentException instead of silently producing 0L — reaching it would indicate a bug. Same change applied to createListDatetimeBlock for consistency.

- Add explanatory comment for DATE_NANOS/DATETIME coercion
  in CsvAssert, clarifying the rationale.
- Replace silent else catch-alls with QlIllegalArgumentException
  in createListDateNanosBlock and createListDatetimeBlock,
  consistent with createDoubleBlock.
@costin costin enabled auto-merge (squash) March 27, 2026 15:53
case VARCHAR, CHAR -> DataType.KEYWORD;
case BINARY -> DataType.KEYWORD;
case TIMESTAMP, TIMESTAMP_INSTANT, DATE -> DataType.DATETIME;
case TIMESTAMP, TIMESTAMP_INSTANT -> DataType.DATE_NANOS;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going back to this, not positive on the right solution here. Our DATE_NANOS have limited range (Epoch to MAX_NANOSECOND_INSTANT). Which means we won't be able to decode dates outside of that, should they appear in files.

  1. We should probably add unit tests with such dates/timestamps.
  2. I've extended external-basic.csv-spec with a test that passes in, for instance Parquet, but not ORC after applying the PR:
readAllEmployees2
required_capability: external_command

EXTERNAL "{{employees}}"
| KEEP emp_no, first_name, last_name, birth_date, gender, hire_date, languages, height, salary, still_hired
| WHERE birth_date == TO_DATETIME("1953-09-02T00:00:00.000Z")
| SORT emp_no
| LIMIT 5;

emp_no:integer | first_name:keyword | last_name:keyword | birth_date:date          | gender:keyword | hire_date:date           | languages:integer | height:double | salary:integer | still_hired:boolean
10001          | "Georgi"           | "Facello"         | 1953-09-02T00:00:00.000Z | "M"            | 1986-06-26T00:00:00.000Z | 2                 | 2.03          | 57305          | true
;

with

[ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [test-cluster-0] fatal error in thread [elasticsearch[test-cluster-0][esql_worker][T#3]], exiting java.lang.AssertionError
        at org.elasticsearch.server@9.4.0-SNAPSHOT/org.elasticsearch.common.time.DateUtils.compareNanosToMillis(DateUtils.java:339)
        at org.elasticsearch.xpack.esql.expression.predicate.operator.comparison.Equals.processNanosMillis(Equals.java:222)
        at org.elasticsearch.xpack.esql.expression.predicate.operator.comparison.EqualsNanosMillisEvaluator.eval(EqualsNanosMillisEvaluator.java:100)
        at org.elasticsearch.xpack.esql.expression.predicate.operator.comparison.EqualsNanosMillisEvaluator.eval(EqualsNanosMillisEvaluator.java:54)
        at org.elasticsearch.compute.operator.FilterOperator.process(FilterOperator.java:43)
        at org.elasticsearch.compute.operator.AbstractPageMappingOperator.getOutput(AbstractPageMappingOperator.java:89)
        at org.elasticsearch.compute.operator.Driver.runSingleLoopIteration(Driver.java:323)
        at org.elasticsearch.compute.operator.Driver.run(Driver.java:194)
        at org.elasticsearch.compute.operator.Driver$1.doRun(Driver.java:506)
        at org.elasticsearch.server@9.4.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:27)
        at org.elasticsearch.compute.operator.DriverScheduler$1.doRun(DriverScheduler.java:57)
        at org.elasticsearch.server@9.4.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:27)
        at org.elasticsearch.server@9.4.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:35)
        at org.elasticsearch.server@9.4.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:1114)
        at org.elasticsearch.server@9.4.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:27)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.base/java.lang.Thread.run(Thread.java:1575)

    [2026-03-30T23:39:36,619][INFO ][o.e.x.e.q.o.OrcFormatSpecIT][test] [csv-spec:external-basic.readAllEmployees2 [LOCAL]] after test

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted TIMESTAMP/TIMESTAMP_INSTANT back to DATETIME. DATE_NANOS can't represent pre-epoch dates so it's not viable here until that limitation is addressed separately.

if (actualType == Type.INTEGER && expectedType == Type.LONG) {
actualType = Type.LONG;
}
if (actualType == Type.DATE_NANOS && expectedType == Type.DATETIME) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like if a storage returns DATE_NANOS, we should assert that, not update the type. Otherwise this still has the potential to mask bugs.

DATE_NANOS cannot represent pre-epoch dates (assert nanos >= 0
in DateUtils.compareNanosToMillis), which crashes on ORC files
with historical timestamps. Revert to DATETIME (millis precision)
until DATE_NANOS range is extended.
@costin costin force-pushed the esql/orc-decimal-support branch from c3cb593 to c188932 Compare March 30, 2026 13:56
Copy link
Copy Markdown
Contributor

@bpintea bpintea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LBetterTM.
The PR description should be updated too -- "silently dropping up to 999,999 nanoseconds per timestamp" :))
Also, it'd be good, though not a showstopper to add the CSV spec shared, that'll protect in the future.

Adds filterByPreEpochBirthDate to the shared csv-spec
that filters on a 1953 birth date, protecting against
re-introducing DATE_NANOS without fixing its range.
@costin
Copy link
Copy Markdown
Member Author

costin commented Mar 30, 2026

Updated the PR description and added the filterByPreEpochBirthDate test to the shared csv-spec — filters on a 1953 birth date so any future DATE_NANOS reintroduction without fixing the range limitation will fail.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 30, 2026

🔍 Preview links for changed docs

⏳ Building and deploying preview... View progress

This comment will be updated with preview links when the build is complete.

@github-actions
Copy link
Copy Markdown
Contributor

ℹ️ Important: Docs version tagging

👋 Thanks for updating the docs! Just a friendly reminder that our docs are now cumulative. This means all 9.x versions are documented on the same page and published off of the main branch, instead of creating separate pages for each minor version.

We use applies_to tags to mark version-specific features and changes.

Expand for a quick overview

When to use applies_to tags:

✅ At the page level to indicate which products/deployments the content applies to (mandatory)
✅ When features change state (e.g. preview, ga) in a specific version
✅ When availability differs across deployments and environments

What NOT to do:

❌ Don't remove or replace information that applies to an older version
❌ Don't add new information that applies to a specific version without an applies_to tag
❌ Don't forget that applies_to tags can be used at the page, section, and inline level

🤔 Need help?

testPreEpochTimestamp: verifies 1953 date reads as negative
millis, protecting the DATETIME path against DATE_NANOS
regression at the reader level.

testReadListTimestampColumn: exercises createListDatetimeBlock
which had zero test coverage; includes a pre-epoch element.
@costin costin disabled auto-merge March 31, 2026 08:32
@costin costin merged commit b41b08a into elastic:main Mar 31, 2026
34 of 35 checks passed
@costin costin deleted the esql/orc-decimal-support branch March 31, 2026 08:32
pmpailis pushed a commit that referenced this pull request Mar 31, 2026
Fix three ORC type mapping issues in the format reader:

- DECIMAL columns (all precisions) crashed with
  QlIllegalArgumentException because DecimalColumnVector
  was not handled in the double block path. Add support
  for both DecimalColumnVector and Decimal64ColumnVector.

- TIMESTAMP/TIMESTAMP_INSTANT remain mapped to DATETIME
  (millisecond precision). Mapping to DATE_NANOS was
  investigated but DATE_NANOS cannot represent pre-epoch
  dates (assert nanos >= 0 in DateUtils), which crashes
  on ORC files with historical timestamps. Nanosecond
  precision is deferred until the DATE_NANOS range
  limitation is addressed.

- BINARY columns now map to UNSUPPORTED instead of KEYWORD
  to avoid exposing raw bytes as garbled text.

Also adds a shared csv-spec regression test that filters on
a pre-epoch birth date, protecting against re-introducing the
DATE_NANOS mapping without fixing the range limitation.

Developed with AI-assisted tooling
szybia added a commit to szybia/elasticsearch that referenced this pull request Mar 31, 2026
…rics

* upstream/main: (428 commits)
  ESQL: DS: Add inference/RERANK tests (elastic#145229)
  Unmute MMR logical plan test (elastic#145311)
  Do not attempt marking store as corrupted if the check is rejected due to shutdown (elastic#145209)
  feat(tsdb): add pipeline runtime and rename stage interfaces (elastic#145175)
  Fix UnresolvedException on PromQL by(step) grouping (elastic#145307)
  ES|QL: Optimize MMR by reducing cache size and lookup (elastic#145014)
  Prometheus labels/series APIs: support multiple match[] selectors (elastic#145298)
  Move ClientScrollablePaginatedHitSource into Reindex Module (elastic#144100)
  mute test class for elastic#145277
  CPS mode for ViewResolver (elastic#145219)
  [ESQL] Disables GroupedTopNBenchmark temporarily (elastic#145124)
  Make exponential_histogram the default histogram type for HTTP OTLP endpoint (elastic#145065)
  More tests requiring an explicit confidence interval (elastic#145232)
  ES|QL: Adding `USER_AGENT` command (elastic#144384)
  ESQL: enable Generative IT after more fixes (elastic#145112)
  Rework FieldMapper parameter tests to not use merge builders (elastic#145213)
  [ESQL] Fix ORC type support gaps (elastic#145074)
  [Test] Unmute FollowingEngineTests.testProcessOnceOnPrimary (elastic#145192)
  Add PrometheusSeriesRestAction for /_prometheus/api/v1/series endpoint (elastic#144494)
  Prometheus labels API: add rest action (elastic#144952)
  ...
ncordon pushed a commit to ncordon/elasticsearch that referenced this pull request Apr 1, 2026
Fix three ORC type mapping issues in the format reader:

- DECIMAL columns (all precisions) crashed with
  QlIllegalArgumentException because DecimalColumnVector
  was not handled in the double block path. Add support
  for both DecimalColumnVector and Decimal64ColumnVector.

- TIMESTAMP/TIMESTAMP_INSTANT remain mapped to DATETIME
  (millisecond precision). Mapping to DATE_NANOS was
  investigated but DATE_NANOS cannot represent pre-epoch
  dates (assert nanos >= 0 in DateUtils), which crashes
  on ORC files with historical timestamps. Nanosecond
  precision is deferred until the DATE_NANOS range
  limitation is addressed.

- BINARY columns now map to UNSUPPORTED instead of KEYWORD
  to avoid exposing raw bytes as garbled text.

Also adds a shared csv-spec regression test that filters on
a pre-epoch birth date, protecting against re-introducing the
DATE_NANOS mapping without fixing the range limitation.

Developed with AI-assisted tooling
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Analytics/ES|QL AKA ESQL >enhancement ES|QL|DS ES|QL datasources Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v9.4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants