
bug: Regression on COPY INTO #17284

Closed · rad-pat opened this issue Jan 14, 2025 · 18 comments · Fixed by #17313
Labels: C-bug Category: something isn't working

Comments

rad-pat commented Jan 14, 2025

Search before asking

  • I had searched in the issues and found no similar issues.

Version

v1.2.688-nightly

What's Wrong?

Upgraded from v1.2.680-nightly to v1.2.688-nightly and one of our testing workflows began to fail. It loads data from a Parquet file via COPY INTO from GCS. The COPY reports that the schema is different, but inspection of the table and the Parquet file shows both to be as expected.

Please note, downgrading to v1.2.687 made the statements work again, although they seem very slow: 7 seconds on v1.2.680 vs 90 seconds on v1.2.687.

We are loading two tables in parallel, both via COPY INTO statements. It looks like the actual schema reported in the message below is the schema of the other table, shown in the screenshot here.
[screenshot: error message showing the other table's schema]

However, the load for that other table also reports an error, and the actual schema reported there belongs to a completely unrelated table.

The data is attached:
data_0.zip

Error message is as follows:

sqlalchemy.exc.DBAPIError: (databend_sqlalchemy.errors.Error) Code: None. APIError: QueryFailed: [1303]infer schema from 'data_0.parquet', but get diff schema in file 'data_0.parquet'. 

Expected schema: SchemaDescriptor { schema: GroupType { basic_info: BasicTypeInfo { name: "schema", repetition: None, converted_type: NONE, logical_type: None, id: None }, fields: [PrimitiveType { basic_info: BasicTypeInfo { name: "row_num", repetition: Some(OPTIONAL), converted_type: NONE, logical_type: None, id: None }, physical_type: INT64, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "iteration", repetition: Some(OPTIONAL), converted_type: NONE, logical_type: None, id: None }, physical_type: INT64, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "input_records", repetition: Some(OPTIONAL), converted_type: NONE, logical_type: None, id: None }, physical_type: INT64, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "matched_records", repetition: Some(OPTIONAL), converted_type: NONE, logical_type: None, id: None }, physical_type: INT64, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "display", repetition: Some(OPTIONAL), converted_type: UTF8, logical_type: Some(String), id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "rule", repetition: Some(OPTIONAL), converted_type: UTF8, logical_type: Some(String), id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "rule_id", repetition: Some(OPTIONAL), converted_type: UTF8, logical_type: Some(String), id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "include", repetition: Some(OPTIONAL), converted_type: NONE, logical_type: None, id: None }, physical_type: BOOLEAN, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "Driver", repetition: Some(OPTIONAL), converted_type: UTF8, logical_type: Some(String), id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "RC!!context!!", repetition: Some(OPTIONAL), converted_type: UTF8, logical_type: Some(String), id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "RC", repetition: Some(OPTIONAL), converted_type: UTF8, logical_type: Some(String), id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "Customer", repetition: Some(OPTIONAL), converted_type: UTF8, logical_type: Some(String), id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 }] } }, actual: SchemaDescriptor { schema: GroupType { basic_info: BasicTypeInfo { name: "schema", repetition: None, converted_type: NONE, logical_type: None, id: None }, fields: [PrimitiveType { basic_info: BasicTypeInfo { name: "TestID", repetition: Some(OPTIONAL), converted_type: UTF8, logical_type: Some(String), id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "Version", repetition: Some(OPTIONAL), converted_type: UTF8, logical_type: Some(String), id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "Period", repetition: Some(OPTIONAL), converted_type: UTF8, logical_type: 
Some(String), id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "RC", repetition: Some(OPTIONAL), converted_type: UTF8, logical_type: Some(String), id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "Account", repetition: Some(OPTIONAL), converted_type: UTF8, logical_type: Some(String), id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "Value", repetition: Some(OPTIONAL), converted_type: NONE, logical_type: None, id: None }, physical_type: DOUBLE, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "log", repetition: Some(OPTIONAL), converted_type: UTF8, logical_type: Some(String), id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "rule_number", repetition: Some(OPTIONAL), converted_type: UTF8, logical_type: Some(String), id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "rule", repetition: Some(OPTIONAL), converted_type: UTF8, logical_type: Some(String), id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "rule_id", repetition: Some(OPTIONAL), converted_type: UTF8, logical_type: Some(String), id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "Customer!!target!!", repetition: Some(OPTIONAL), converted_type: UTF8, logical_type: Some(String), id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "Driver!!driver!!", repetition: Some(OPTIONAL), converted_type: UTF8, logical_type: Some(String), id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "RC!!context!!", repetition: Some(OPTIONAL), converted_type: UTF8, logical_type: Some(String), id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "RC!!target!!", repetition: Some(OPTIONAL), converted_type: UTF8, logical_type: Some(String), id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "allocable", repetition: Some(OPTIONAL), converted_type: NONE, logical_type: None, id: None }, physical_type: INT64, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "path_id", repetition: Some(OPTIONAL), converted_type: NONE, logical_type: None, id: None }, physical_type: INT64, type_length: -1, scale: -1, precision: -1 }] } }

[SQL: 
COPY INTO "anlze059edc4-d519-48e5-908c-1070b8273f8c"."analyzetable_3cdd2094-7df0-4ee7-954f-b7621232ffb7" 
FROM 'gcs://rhynl-pctb-temp-bugfixes2/load_temp/bugfixes2/e059edc4-d519-48e5-908c-1070b8273f8c/analyzetable_3cdd2094-7df0-4ee7-954f-b7621232ffb7/dfadfc10-5ae8-403f-b642-9f299cbc079e/'
CONNECTION = (
  ENDPOINT_URL = 'https://storage.googleapis.com'
  CREDENTIAL = '<snip>'
) FILE_FORMAT = (TYPE = PARQUET)
 FORCE = TRUE
COLUMN_MATCH_MODE = CASE_SENSITIVE
]

How to Reproduce?

I'm not sure what steps to give here to reproduce. I will try to replicate exactly and update...

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
rad-pat added the C-bug label on Jan 14, 2025
youngsofun (Member)

Thank you for reporting this issue.

Here’s what I have so far:
1. The only PR that seems related to this is #17175, which was included in version v1.2.685.
2. I haven’t been able to reproduce the issue yet.

It looks like there are two separate concerns here:

Unexpected diff schema error (bug in v1.2.688)

  • Could you clarify if you’ve encountered this error multiple times with different destination tables?
  • Is it possible that the files in the source directory were modified during the COPY operation?

To gather more information, you could check the opendal logs, filtered by the query ID, in the file logs. This might provide additional context about what happened during the operation.
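If it helps, here is a minimal filtering sketch, assuming the query node writes plain-text log files under a local logs/ directory; the directory and query ID below are placeholders, not actual Databend paths:

from pathlib import Path

QUERY_ID = "<query-id>"  # placeholder; use the ID of the failing COPY INTO query

# Print every opendal-related line in the file logs that mentions the query ID.
for log_file in Path("logs").glob("**/*.log"):
    with log_file.open(errors="replace") as f:
        for line in f:
            if QUERY_ID in line and "opendal" in line.lower():
                print(f"{log_file}: {line.rstrip()}")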

Copy operation becoming slower (v1.2.688)

  • PR #17175 may be related, as it introduced schema comparison for each file during the COPY process. However, this shouldn’t cause a significant performance impact unless the files are very small and there are a large number of them.
  • Examining the logs filtered by the query ID might help us pinpoint the source of the slowdown.

I will continue working on this.
Let me know if you can share any additional details or logs to help us investigate further!

rad-pat (Author) commented Jan 15, 2025

Thanks, I will endeavour to get more information on this for you today 👍

rad-pat (Author) commented Jan 15, 2025

I have managed to replicate the same in a test environment.

Focusing initially on the diff schema problem, I tried the following:

  1. Run using v1.2.680-p2 - This worked fine
  2. Run using v1.2.688 - This failed with the diff schema - query id = d3489539-52b7-426d-a0a9-ad38ba3a823d

The query logs for this are attached. In this Kubernetes environment, I have 3 meta pods and 1 query pod.

In answer to your questions on the diff schema error:
There is definitely no modification of the files in the source directory. The code path from the Python side is to upload the Parquet file to its destination in GCS and, once the upload is complete, call the server to load it into Databend via COPY INTO. This is multi-threaded to upload and import 2 tables in parallel; each thread does its upload and import in series (a minimal sketch of the workflow is included below).
I am able to replicate this with different destination tables. What is interesting is that the other COPY INTO queries in the whole workflow are working. What sets these failures apart is that we are using Parquet format and that they run in parallel.
databend-query-logs-copy-into.zip
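For reference, a minimal sketch of the workflow described above, assuming a generic execute_sql callable for issuing statements against Databend; the upload helper, bucket, table names, and credential handling are placeholders rather than the actual client code:

import threading
import uuid

def upload_parquet_to_gcs(local_path: str, gcs_uri: str) -> None:
    # Placeholder for the real GCS upload (e.g. via google-cloud-storage).
    ...

def run_copy_into(execute_sql, table: str, gcs_uri: str, credential: str) -> None:
    # TRUNCATE, then COPY INTO, mirroring the statements quoted in this issue.
    execute_sql(f'TRUNCATE TABLE "{table}"')
    execute_sql(
        f'COPY INTO "{table}" FROM \'{gcs_uri}\'\n'
        f"CONNECTION = (ENDPOINT_URL = 'https://storage.googleapis.com' CREDENTIAL = '{credential}')\n"
        "FILE_FORMAT = (TYPE = PARQUET) FORCE = TRUE COLUMN_MATCH_MODE = CASE_SENSITIVE"
    )

def load_table(execute_sql, table: str, local_file: str, credential: str) -> None:
    # Each thread uploads its own data_0.parquet to a unique GCS prefix,
    # then immediately issues the COPY INTO for that prefix.
    prefix = f"gcs://example-temp-bucket/load_temp/{table}/{uuid.uuid4()}/"
    upload_parquet_to_gcs(local_file, prefix + "data_0.parquet")
    run_copy_into(execute_sql, table, prefix, credential)

def load_both_tables(execute_sql, credential: str) -> None:
    # Two tables are loaded in parallel, as in the failing workflow.
    jobs = [("analyzetable_a", "a/data_0.parquet"), ("analyzetable_b", "b/data_0.parquet")]
    threads = [
        threading.Thread(target=load_table, args=(execute_sql, table, path, credential))
        for table, path in jobs
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()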

rad-pat (Author) commented Jan 15, 2025

Please note, dropping the query pod back to v1.2.687-nightly worked in my test environment without any performance implication, but the meta version remained v1.2.688; otherwise a heartbeat issue was causing failures.

rad-pat (Author) commented Jan 15, 2025

Additional info: I have not been able to replicate the COPY INTO becoming slower; it can possibly be put down to a communication issue between meta and query at the point I was trying.
What is very reproducible for me is to downgrade to v1.2.687-nightly and see success, then upgrade to v1.2.688-nightly and see the failure. Other versions below v1.2.687 all seem to work fine too.
I hope this helps to track the bug down.

youngsofun (Member) commented Jan 16, 2025

I checked the change log between v1.2.687-nightly and v1.2.688-nightly, but still have no idea what caused this.

@rad-pat

Could you please double-check that the upload paths of the two upload-copy threads are correct and not the same?

The data_0.zip you uploaded here is 7 KB, while the file size in the log is about 5 KB, but I am not sure that file is the one expected to be loaded by the query corresponding to query ID d3489539-52b7-426d-a0a9-ad38ba3a823d.

rad-pat (Author) commented Jan 16, 2025

@youngsofun, the paths for the uploads are definitely different; each upload path in GCS includes the table name and then a unique load identifier, as shown below, extracted from the log file. I'm sure we would have seen issues loading earlier if the paths had been the same, because the table schemas are all different.

gcs://rheop-pctb-temp-bug-fixes/load_temp/c4a44436-4eba-024c-6764-d806864cbc2a/a79e711b-c6ac-4869-b7f6-10c45fe93e67/analyzetable_3cdd2094-7df0-4ee7-954f-b7621232ffb7/66258c46-77f2-45a7-b41e-51e22495b159
gcs://rheop-pctb-temp-bug-fixes/load_temp/c4a44436-4eba-024c-6764-d806864cbc2a/a79e711b-c6ac-4869-b7f6-10c45fe93e67/analyzetable_70d943d5-abba-4720-9169-b923322958fb/5bd7c955-e989-4430-8fd4-86d505ffd2ae/
gcs://rheop-pctb-temp-bug-fixes/load_temp/c4a44436-4eba-024c-6764-d806864cbc2a/a79e711b-c6ac-4869-b7f6-10c45fe93e67/analyzetable_189c26da-6703-4c0e-9bce-02a3a4139708/e483fdcd-94b3-4acc-aa53-5a3d641f98e1/

Possibly of relevance is that, prior to calling the COPY statement, a TRUNCATE is called on each of the tables.

rad-pat (Author) commented Jan 16, 2025

The data_0.zip you uploaded here is 7 KB, while the file size in the log is about 5 KB, but I am not sure that file is the one expected to be loaded by the query corresponding to query ID d3489539-52b7-426d-a0a9-ad38ba3a823d.

This was the parquet file for that query - it was not deleted from GCS yet
data_0 (1).zip

rad-pat (Author) commented Jan 16, 2025

I guess another thing of note is that, despite being loaded from different GCS locations, the name of the file in each of these uploads is data_0.parquet.
I have just tried a different, independent single file upload through our process and that is failing in the same way. The expected schema reported is correct; the actual schema reported in the error is that of the data for analyzetable_3cdd2094-7df0-4ee7-954f-b7621232ffb7, which was a previous upload made yesterday to generate logs for this issue.

rad-pat (Author) commented Jan 16, 2025

Further information: if I change the names of the files uploaded to GCS and give them a unique name like data_{unique_upload_id}_{index}.parquet, then the COPY INTO works fine on v1.2.688. It must be something to do with them being named data_{index}.parquet; because the data is small, there is only one file, data_0.parquet. The path in GCS is already unique, with a UUID for the upload making up the final directory of the path.
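For illustration, the workaround amounts to making the object name itself unique per upload; the naming below is a sketch of the scheme described above, not the actual client code:

import uuid

# Workaround: embed a per-upload identifier in the file name itself, so parallel
# (or repeated) loads never share the bare name data_0.parquet.
upload_id = uuid.uuid4().hex
index = 0
file_name = f"data_{upload_id}_{index}.parquet"  # instead of f"data_{index}.parquet"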

youngsofun (Member)

Can you try this:

Reproduce again with the trailing / removed from the source path when you COPY;
Databend will then use the full path (starting from /load_temp/).
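To illustrate the suggested change, a short sketch; the bucket, prefix, and table name are placeholders:

# Sketch of the suggested repro change: drop the trailing "/" from the source
# location so the full object path (from /load_temp/ onward) is used rather than
# just the bare file name.
source_with_slash = "gcs://example-temp-bucket/load_temp/<upload-uuid>/"    # original form
source_without_slash = "gcs://example-temp-bucket/load_temp/<upload-uuid>"  # suggested form

copy_sql = (
    f'COPY INTO "my_table" FROM \'{source_without_slash}\'\n'
    "CONNECTION = (ENDPOINT_URL = 'https://storage.googleapis.com' CREDENTIAL = '<snip>')\n"
    "FILE_FORMAT = (TYPE = PARQUET) FORCE = TRUE COLUMN_MATCH_MODE = CASE_SENSITIVE"
)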

youngsofun (Member) commented Jan 16, 2025

I guess another thing of note is that, despite being loaded from different GCS locations, the name of the file in each of these uploads is data_0.parquet. I have just tried a different, independent single file upload through our process and that is failing in the same way. The expected schema reported is correct; the actual schema reported in the error is that of the data for analyzetable_3cdd2094-7df0-4ee7-954f-b7621232ffb7, which was a previous upload made yesterday to generate logs for this issue.

just to make sure, by independent single file upload, do you mean you still get this error even without parallel upload or copy?

rad-pat (Author) commented Jan 16, 2025

just to make sure, by independent single file upload, do you mean you still get this error even without parallel upload or copy?

Yes, a singular load via COPY INTO.
The client saves to GCS then calls the database with a COPY INTO statement (the same operation that is run in parallel, but in this case just run once).

rad-pat (Author) commented Jan 16, 2025

Can you try this:

Reproduce again with the trailing / removed from the source path when you COPY; Databend will then use the full path (starting from /load_temp/).

I will try this now

youngsofun (Member)

It seems this can explain #17312.
I am not familiar with the cache-related code; I'll look into it further to confirm.

rad-pat (Author) commented Jan 16, 2025

@youngsofun - Removal of the trailing slash does seem to fix the issue. I guess we're not specifying a pattern; should that matter?

youngsofun (Member)

About the trailing /: that is an old bug I am working on too, #17115.
It is not a problem if you use FORCE = TRUE for the destination table.

Removing the trailing '/' (and using the full path) avoids the bug just found, #17312.

youngsofun (Member)

#17312 is easier to fix; I'll do it tomorrow.
