
bug: Regression on COPY INTO #17284

Closed · rad-pat opened this issue Jan 14, 2025 · 18 comments · Fixed by #17313
Labels: C-bug Category: something isn't working

Comments

rad-pat commented Jan 14, 2025

Search before asking

  • I had searched in the issues and found no similar issues.

Version

v1.2.688-nightly

What's Wrong?

Upgraded from v1.2.680-nightly to v1.2.688-nightly and one of our testing workflows began to fail. It loads data from a Parquet file via COPY INTO from GCS. The COPY reports that the schema is different, but inspection of the table and the Parquet file shows both to be as expected.

Please note, downgrading to v1.2.687 made the statements work again, although they seem very slow: 7 seconds on v1.2.680 vs 90 seconds on v1.2.687.

We are loading two tables in parallel, both via COPY INTO statements. It looks like the actual schema reported in the message below is the schema of the other table, shown in the screenshot here.
[screenshot: error message showing the other table's schema]

However, the load for that other table also reports an error, and the actual schema reported there belongs to a completely unrelated table.

The data is attached:
data_0.zip

Error message is as follows:

sqlalchemy.exc.DBAPIError: (databend_sqlalchemy.errors.Error) Code: None. APIError: QueryFailed: [1303]infer schema from 'data_0.parquet', but get diff schema in file 'data_0.parquet'. 

Expected schema: SchemaDescriptor { schema: GroupType { basic_info: BasicTypeInfo { name: "schema", repetition: None, converted_type: NONE, logical_type: None, id: None }, fields: [PrimitiveType { basic_info: BasicTypeInfo { name: "row_num", repetition: Some(OPTIONAL), converted_type: NONE, logical_type: None, id: None }, physical_type: INT64, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "iteration", repetition: Some(OPTIONAL), converted_type: NONE, logical_type: None, id: None }, physical_type: INT64, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "input_records", repetition: Some(OPTIONAL), converted_type: NONE, logical_type: None, id: None }, physical_type: INT64, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "matched_records", repetition: Some(OPTIONAL), converted_type: NONE, logical_type: None, id: None }, physical_type: INT64, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "display", repetition: Some(OPTIONAL), converted_type: UTF8, logical_type: Some(String), id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "rule", repetition: Some(OPTIONAL), converted_type: UTF8, logical_type: Some(String), id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "rule_id", repetition: Some(OPTIONAL), converted_type: UTF8, logical_type: Some(String), id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "include", repetition: Some(OPTIONAL), converted_type: NONE, logical_type: None, id: None }, physical_type: BOOLEAN, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "Driver", repetition: Some(OPTIONAL), converted_type: UTF8, logical_type: Some(String), id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "RC!!context!!", repetition: Some(OPTIONAL), converted_type: UTF8, logical_type: Some(String), id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "RC", repetition: Some(OPTIONAL), converted_type: UTF8, logical_type: Some(String), id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "Customer", repetition: Some(OPTIONAL), converted_type: UTF8, logical_type: Some(String), id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 }] } }, actual: SchemaDescriptor { schema: GroupType { basic_info: BasicTypeInfo { name: "schema", repetition: None, converted_type: NONE, logical_type: None, id: None }, fields: [PrimitiveType { basic_info: BasicTypeInfo { name: "TestID", repetition: Some(OPTIONAL), converted_type: UTF8, logical_type: Some(String), id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "Version", repetition: Some(OPTIONAL), converted_type: UTF8, logical_type: Some(String), id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "Period", repetition: Some(OPTIONAL), converted_type: UTF8, logical_type: 
Some(String), id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "RC", repetition: Some(OPTIONAL), converted_type: UTF8, logical_type: Some(String), id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "Account", repetition: Some(OPTIONAL), converted_type: UTF8, logical_type: Some(String), id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "Value", repetition: Some(OPTIONAL), converted_type: NONE, logical_type: None, id: None }, physical_type: DOUBLE, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "log", repetition: Some(OPTIONAL), converted_type: UTF8, logical_type: Some(String), id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "rule_number", repetition: Some(OPTIONAL), converted_type: UTF8, logical_type: Some(String), id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "rule", repetition: Some(OPTIONAL), converted_type: UTF8, logical_type: Some(String), id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "rule_id", repetition: Some(OPTIONAL), converted_type: UTF8, logical_type: Some(String), id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "Customer!!target!!", repetition: Some(OPTIONAL), converted_type: UTF8, logical_type: Some(String), id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "Driver!!driver!!", repetition: Some(OPTIONAL), converted_type: UTF8, logical_type: Some(String), id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "RC!!context!!", repetition: Some(OPTIONAL), converted_type: UTF8, logical_type: Some(String), id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "RC!!target!!", repetition: Some(OPTIONAL), converted_type: UTF8, logical_type: Some(String), id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "allocable", repetition: Some(OPTIONAL), converted_type: NONE, logical_type: None, id: None }, physical_type: INT64, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "path_id", repetition: Some(OPTIONAL), converted_type: NONE, logical_type: None, id: None }, physical_type: INT64, type_length: -1, scale: -1, precision: -1 }] } }

[SQL: 
COPY INTO "anlze059edc4-d519-48e5-908c-1070b8273f8c"."analyzetable_3cdd2094-7df0-4ee7-954f-b7621232ffb7" 
FROM 'gcs://rhynl-pctb-temp-bugfixes2/load_temp/bugfixes2/e059edc4-d519-48e5-908c-1070b8273f8c/analyzetable_3cdd2094-7df0-4ee7-954f-b7621232ffb7/dfadfc10-5ae8-403f-b642-9f299cbc079e/'
CONNECTION = (
  ENDPOINT_URL = 'https://storage.googleapis.com'
  CREDENTIAL = '<snip>'
) FILE_FORMAT = (TYPE = PARQUET)
 FORCE = TRUE
COLUMN_MATCH_MODE = CASE_SENSITIVE
]

How to Reproduce?

I'm not sure what steps to give here to reproduce. I will try to replicate exactly and update...

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
rad-pat added the C-bug label on Jan 14, 2025
youngsofun (Member)

Thank you for reporting this issue.

Here’s what I have so far:
1. The only PR that seems related to this is #17175, which was included in version v1.2.685.
2. I haven’t been able to reproduce the issue yet.

It looks like there are two separate concerns here:

Unexpected diff schema error (bug in v1.2.688)

  • Could you clarify if you’ve encountered this error multiple times with different destination tables?
  • Is it possible that the files in the source directory were modified during the COPY operation?

To gather more information, you could check the opendal logs, filtered by the query ID, in the file logs. This might provide additional context about what happened during the operation.
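If it helps, here is a minimal filtering sketch, assuming the query node writes plain-text log files under a local logs/ directory; the directory and query ID below are placeholders, not actual Databend paths:

from pathlib import Path

QUERY_ID = "<query-id>"  # placeholder; use the ID of the failing COPY INTO query

# Print every opendal-related line in the file logs that mentions the query ID.
for log_file in Path("logs").glob("**/*.log"):
    with log_file.open(errors="replace") as f:
        for line in f:
            if QUERY_ID in line and "opendal" in line.lower():
                print(f"{log_file}: {line.rstrip()}")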

Copy operation becoming slower (v1.2.688)

  • PR #17175 may be related, as it introduced schema comparison for each file during the COPY process. However, this shouldn’t cause a significant performance impact unless the files are very small and there are a large number of them.
  • Examining the logs filtered by the query ID might help us pinpoint the source of the slowdown.

I will continue working on this.
Let me know if you can share any additional details or logs to help us investigate further!

rad-pat (Author) commented Jan 15, 2025

Thanks, I will endeavour to get more information on this for you today 👍

rad-pat (Author) commented Jan 15, 2025

I have managed to replicate the same in a test environment.

Focusing initially on the diff schema problem, I tried the following:

  1. Run using v1.2.680-p2 - This worked fine
  2. Run using v1.2.688 - This failed with the diff schema - query id = d3489539-52b7-426d-a0a9-ad38ba3a823d

The query logs for this are attached. In this Kubernetes environment, I have 3 meta pods and 1 query pod.

In answer to your questions on the diff schema error:
There is definitely no modification of the files in the source directory. The code path from the Python side is to upload the Parquet file to its destination in GCS and, once the upload is complete, call the server to load it into Databend via COPY INTO. This is multi-threaded to upload and import 2 tables in parallel; each thread does its upload and import in series (a minimal sketch of the workflow is included below).
I am able to replicate this with different destination tables. What is interesting is that the other COPY INTO queries in the whole workflow are working. What sets these failures apart is that we are using Parquet format and that they run in parallel.
databend-query-logs-copy-into.zip
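For reference, a minimal sketch of the workflow described above, assuming a generic execute_sql callable for issuing statements against Databend; the upload helper, bucket, table names, and credential handling are placeholders rather than the actual client code:

import threading
import uuid

def upload_parquet_to_gcs(local_path: str, gcs_uri: str) -> None:
    # Placeholder for the real GCS upload (e.g. via google-cloud-storage).
    ...

def run_copy_into(execute_sql, table: str, gcs_uri: str, credential: str) -> None:
    # TRUNCATE, then COPY INTO, mirroring the statements quoted in this issue.
    execute_sql(f'TRUNCATE TABLE "{table}"')
    execute_sql(
        f'COPY INTO "{table}" FROM \'{gcs_uri}\'\n'
        f"CONNECTION = (ENDPOINT_URL = 'https://storage.googleapis.com' CREDENTIAL = '{credential}')\n"
        "FILE_FORMAT = (TYPE = PARQUET) FORCE = TRUE COLUMN_MATCH_MODE = CASE_SENSITIVE"
    )

def load_table(execute_sql, table: str, local_file: str, credential: str) -> None:
    # Each thread uploads its own data_0.parquet to a unique GCS prefix,
    # then immediately issues the COPY INTO for that prefix.
    prefix = f"gcs://example-temp-bucket/load_temp/{table}/{uuid.uuid4()}/"
    upload_parquet_to_gcs(local_file, prefix + "data_0.parquet")
    run_copy_into(execute_sql, table, prefix, credential)

def load_both_tables(execute_sql, credential: str) -> None:
    # Two tables are loaded in parallel, as in the failing workflow.
    jobs = [("analyzetable_a", "a/data_0.parquet"), ("analyzetable_b", "b/data_0.parquet")]
    threads = [
        threading.Thread(target=load_table, args=(execute_sql, table, path, credential))
        for table, path in jobs
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()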

rad-pat (Author) commented Jan 15, 2025

Please note, dropping the query pod back to v1.2.687-nightly worked in my test environment without any performance implication, but the meta version remained v1.2.688; otherwise a heartbeat issue was causing failures.

rad-pat (Author) commented Jan 15, 2025

Additional info: I have not been able to replicate the COPY INTO becoming slower; it can possibly be put down to a communication issue between meta and query at the point I was trying.
What is very reproducible for me is to downgrade to v1.2.687-nightly and see success, then upgrade to v1.2.688-nightly and see the failure. Other versions below v1.2.687 all seem to work fine too.
I hope this helps to track the bug down.

youngsofun (Member) commented Jan 16, 2025

I checked the change log between v1.2.687-nightly and v1.2.688-nightly, but still have no idea what caused this.

@rad-pat

Could you please double-check that the upload paths of the two upload-copy threads are correct and not the same?

The data_0.zip you uploaded here is 7 KB, while the file size in the log is about 5 KB, but I am not sure that file is the one expected to be loaded by the query corresponding to query ID d3489539-52b7-426d-a0a9-ad38ba3a823d.

rad-pat (Author) commented Jan 16, 2025

@youngsofun, the paths for the uploads are definitely different; each upload path in GCS includes the table name and then a unique load identifier, as shown below, extracted from the log file. I'm sure we would have seen issues loading earlier if the paths had been the same, because the table schemas are all different.

gcs://rheop-pctb-temp-bug-fixes/load_temp/c4a44436-4eba-024c-6764-d806864cbc2a/a79e711b-c6ac-4869-b7f6-10c45fe93e67/analyzetable_3cdd2094-7df0-4ee7-954f-b7621232ffb7/66258c46-77f2-45a7-b41e-51e22495b159
gcs://rheop-pctb-temp-bug-fixes/load_temp/c4a44436-4eba-024c-6764-d806864cbc2a/a79e711b-c6ac-4869-b7f6-10c45fe93e67/analyzetable_70d943d5-abba-4720-9169-b923322958fb/5bd7c955-e989-4430-8fd4-86d505ffd2ae/
gcs://rheop-pctb-temp-bug-fixes/load_temp/c4a44436-4eba-024c-6764-d806864cbc2a/a79e711b-c6ac-4869-b7f6-10c45fe93e67/analyzetable_189c26da-6703-4c0e-9bce-02a3a4139708/e483fdcd-94b3-4acc-aa53-5a3d641f98e1/

Possibly of relevance is that, prior to calling the COPY statement, a TRUNCATE is called on each of the tables.

rad-pat (Author) commented Jan 16, 2025

The data_0.zip you uploaded here is 7 KB, while the file size in the log is about 5 KB, but I am not sure that file is the one expected to be loaded by the query corresponding to query ID d3489539-52b7-426d-a0a9-ad38ba3a823d.

This was the parquet file for that query - it was not deleted from GCS yet
data_0 (1).zip

rad-pat (Author) commented Jan 16, 2025

I guess another thing of note is that, despite being loaded from different GCS locations, the name of the file in each of these uploads is data_0.parquet.
I have just tried a different, independent single file upload through our process and that is failing in the same way. The expected schema reported is correct; the actual schema reported in the error is that of the data for analyzetable_3cdd2094-7df0-4ee7-954f-b7621232ffb7, which was a previous upload made yesterday to generate logs for this issue.

rad-pat (Author) commented Jan 16, 2025

Further information: if I change the names of the files uploaded to GCS and give them a unique name like data_{unique_upload_id}_{index}.parquet, then the COPY INTO works fine on v1.2.688. It must be something to do with them being named data_{index}.parquet; because the data is small, there is only one file, data_0.parquet. The path in GCS is already unique, with a UUID for the upload making up the final directory of the path.
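For illustration, the workaround amounts to making the object name itself unique per upload; the naming below is a sketch of the scheme described above, not the actual client code:

import uuid

# Workaround: embed a per-upload identifier in the file name itself, so parallel
# (or repeated) loads never share the bare name data_0.parquet.
upload_id = uuid.uuid4().hex
index = 0
file_name = f"data_{upload_id}_{index}.parquet"  # instead of f"data_{index}.parquet"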

youngsofun (Member)

Can you try this:

Reproduce again with the trailing / removed from the source path when you COPY;
Databend will then use the full path (starting from /load_temp/).
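To illustrate the suggested change, a short sketch; the bucket, prefix, and table name are placeholders:

# Sketch of the suggested repro change: drop the trailing "/" from the source
# location so the full object path (from /load_temp/ onward) is used rather than
# just the bare file name.
source_with_slash = "gcs://example-temp-bucket/load_temp/<upload-uuid>/"    # original form
source_without_slash = "gcs://example-temp-bucket/load_temp/<upload-uuid>"  # suggested form

copy_sql = (
    f'COPY INTO "my_table" FROM \'{source_without_slash}\'\n'
    "CONNECTION = (ENDPOINT_URL = 'https://storage.googleapis.com' CREDENTIAL = '<snip>')\n"
    "FILE_FORMAT = (TYPE = PARQUET) FORCE = TRUE COLUMN_MATCH_MODE = CASE_SENSITIVE"
)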

youngsofun (Member) commented Jan 16, 2025

I guess another thing of note is that, despite being loaded from different GCS locations, the name of the file in each of these uploads is data_0.parquet. I have just tried a different, independent single file upload through our process and that is failing in the same way. The expected schema reported is correct; the actual schema reported in the error is that of the data for analyzetable_3cdd2094-7df0-4ee7-954f-b7621232ffb7, which was a previous upload made yesterday to generate logs for this issue.

just to make sure, by independent single file upload, do you mean you still get this error even without parallel upload or copy?

rad-pat (Author) commented Jan 16, 2025

just to make sure, by independent single file upload, do you mean you still get this error even without parallel upload or copy?

Yes, a singular load via COPY INTO.
The client saves to GCS then calls the database with a COPY INTO statement (the same operation that is run in parallel, but in this case just run once).

rad-pat (Author) commented Jan 16, 2025

Can you try this:

Reproduce again with the trailing / removed from the source path when you COPY; Databend will then use the full path (starting from /load_temp/).

I will try this now

youngsofun (Member)

It seems this can explain #17312.
I am not familiar with the cache-related code; I'll look into it further to confirm.

rad-pat (Author) commented Jan 16, 2025

@youngsofun - Removal of the trailing slash does seem to fix the issue. I guess we're not specifying a pattern; should that matter?

youngsofun (Member)

About the trailing /: that is an old bug I am working on too, #17115.
It is not a problem if you use FORCE = TRUE for the destination table.

Removing the trailing '/' (and using the full path) avoids the bug just found, #17312.

youngsofun (Member)

#17312 is easier to fix; I'll do it tomorrow.
