ORC-1697: Fix IllegalArgumentException when reading json timestamp type in benchmark #1930

Closed
wants to merge 1 commit into from


cxzl25 (Contributor) commented May 10, 2024

What changes were proposed in this pull request?

This PR fixes the IllegalArgumentException thrown when the benchmark reads JSON timestamp values.

When writing and reading JSON, timestamp values are now converted to a long instead of a string.

Why are the changes needed?

ORC-1191 switched the taxi data set from CSV to Parquet, so the benchmark now reads Parquet timestamps. Those timestamps are stored in microseconds, whereas Java's java.sql.Timestamp expects milliseconds.

Parquet metadata of the taxi source:

  optional int64 tpep_pickup_datetime (TIMESTAMP(MICROS,false));
  optional int64 tpep_dropoff_datetime (TIMESTAMP(MICROS,false));
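
To make the unit mismatch concrete, here is a minimal sketch of how an epoch-microsecond value could be mapped onto java.sql.Timestamp. The class and the `fromMicros` helper are hypothetical and are not part of this PR, which instead changes the JSON representation to a long.

```java
import java.sql.Timestamp;

// Illustration only: MicrosToTimestamp and fromMicros are hypothetical names,
// not code from the ORC benchmark module.
final class MicrosToTimestamp {

    // Converts microseconds since the epoch into a java.sql.Timestamp.
    static Timestamp fromMicros(long micros) {
        // The Timestamp constructor expects milliseconds since the epoch.
        Timestamp ts = new Timestamp(Math.floorDiv(micros, 1000L));
        // setNanos replaces the whole fractional second, so rebuild it from
        // the sub-second microseconds to keep microsecond precision.
        ts.setNanos((int) (Math.floorMod(micros, 1_000_000L) * 1000L));
        return ts;
    }

    public static void main(String[] args) {
        long micros = 1446341079000000L;
        // Passing microseconds straight to the constructor lands far in the future.
        System.out.println(new Timestamp(micros));
        // Scaling first gives a date in 2015 (exact text depends on the JVM time zone).
        System.out.println(fromMicros(micros));
    }
}
```

The printed strings depend on the default time zone, so no expected output is shown here.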

Writing this data to JSON and then running the scan command produces the following error:

java -jar core/target/orc-benchmarks-core-*-uber.jar scan data -format json
Exception in thread "main" java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]
	at java.sql/java.sql.Timestamp.valueOf(Timestamp.java:224)
	at org.apache.orc.bench.core.convert.json.JsonReader$TimestampColumnConverter.convert(JsonReader.java:175)
	at org.apache.orc.bench.core.convert.json.JsonReader.nextBatch(JsonReader.java:86)
	at org.apache.orc.bench.core.convert.ScanVariants.run(ScanVariants.java:92)
	at org.apache.orc.bench.core.Driver.main(Driver.java:64)

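JSON timestamp values are written with java.sql.Timestamp#toString, but java.sql.Timestamp#valueOf cannot parse the result when the data is read back: the microsecond value, interpreted as milliseconds, yields a five-digit year that does not match the expected format.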

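    // 1446341079000000L is a microsecond value, but the constructor expects milliseconds.
    Timestamp ts = new Timestamp(1446341079000000L);
    System.out.println(ts);
    // valueOf requires a four-digit year, so it rejects the string printed above.
    System.out.println(Timestamp.valueOf(ts.toString()));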
47802-09-23 02:50:00.0
Exception in thread "main" java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]
	at java.sql.Timestamp.valueOf(Timestamp.java:237)
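
By contrast, a numeric representation does not depend on the string format at all. The following is a minimal sketch of that idea, assuming the value is carried as epoch milliseconds via getTime; the `encode` and `decode` helpers are hypothetical and do not reproduce the actual JsonReader or JsonWriter changes in this PR.

```java
import java.sql.Timestamp;

// Illustration only: TimestampAsLong, encode and decode are hypothetical names.
final class TimestampAsLong {

    // Writer side: emit the timestamp as a number instead of Timestamp#toString().
    static long encode(Timestamp ts) {
        return ts.getTime();
    }

    // Reader side: rebuild the timestamp from the number instead of Timestamp#valueOf().
    static Timestamp decode(long value) {
        return new Timestamp(value);
    }

    public static void main(String[] args) {
        // Same out-of-range value as in the reproduction above.
        Timestamp ts = new Timestamp(1446341079000000L);
        Timestamp back = decode(encode(ts));
        System.out.println(back.equals(ts)); // prints "true"
    }
}
```

Note that getTime only carries millisecond precision, so a sketch like this would drop any sub-millisecond part of a timestamp.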

How was this patch tested?

Tested locally:

java -jar core/target/orc-benchmarks-core-*-uber.jar generate data -format json -data taxi -compress snappy
java -jar core/target/orc-benchmarks-core-*-uber.jar scan data -format json -data taxi -compress snappy

Was this patch authored or co-authored using generative AI tooling?

No

Closes #1902

@dongjoon-hyun
Member

Thank you, @cxzl25 .
Sorry for the delay. I finally got a chance to verify this today as the release manager of v2.0.2.

+1, LGTM.

@dongjoon-hyun
Member

A GitHub outage is currently happening, so I failed to merge this PR. I'll retry the merge later.

dongjoon-hyun pushed a commit that referenced this pull request Aug 5, 2024
…pe in benchmark

Closes #1930 from cxzl25/ORC-1697_v2.

Authored-by: sychen <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit d09dbf3)
Signed-off-by: Dongjoon Hyun <[email protected]>
@dongjoon-hyun
Member

Merged to main/2.0.

dongjoon-hyun added a commit that referenced this pull request Aug 5, 2024
### What changes were proposed in this pull request?

This PR aims to use Apache Avro 1.12.0 in `bench` module.

### Why are the changes needed?

Apache Avro 1.12.0 is the latest feature release.

Since we have been fixing this area recently, we had better keep it up to date to avoid re-validation in the future.
- #1930
- #1995

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #1996 from dongjoon-hyun/ORC-1753.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun added a commit that referenced this pull request Aug 5, 2024
Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit e43ce79)
Signed-off-by: Dongjoon Hyun <[email protected]>