Fix encoding for Elasticsearch count pushdown #23425

Merged: hashhar merged 2 commits into trinodb:master from bvolpato:count-utf8-encoding on Sep 19, 2024

@bvolpato bvolpato commented Sep 14, 2024

Description

When a pushed-down COUNT(*) query contained special (non-ASCII) characters, the QueryBuilder string was encoded as ISO-8859-1 instead of UTF-8, so Elasticsearch failed to parse the request body.

Additional context and related issues

For example, this query:

SELECT COUNT(*) FROM catalog.default.users where country = 'Türkiye';

If the "country" field is a keyword, this would result in:

{"error":{"root_cause":[{"type":"parsing_exception","reason":"Failed to parse","line":1,"col":53}],"type":"parsing_exception","reason":"Failed to parse","line":1,"col":53,"caused_by":{"type":"x_content_parse_exception","reason":"[1:53] [bool] failed to parse field [filter]","caused_by":{"type":"json_parse_exception","reason":"Invalid UTF-8 middle byte 0x73\n at [Source: (org.elasticsearch.common.io.stream.ByteBufferStreamInput); line: 1, column: 64]"}}},"status":400}

The source of the problem was new StringEntity(sourceBuilder.toString()), which uses the default content type defined at https://github.com/apache/httpcomponents-core/blob/rel/v4.4.16/httpcore/src/main/java/org/apache/http/entity/ContentType.java#L106-L107 and therefore defaults to ISO-8859-1.
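
To illustrate the failure mode, here is a minimal sketch using only the JDK (no HttpCore or Elasticsearch classes). The class name `EncodingDemo`, the helper `isValidUtf8`, and the JSON body are hypothetical stand-ins for the real request built from `sourceBuilder.toString()`. It shows that the same string yields bytes that a strict UTF-8 decoder rejects when encoded as ISO-8859-1, and accepts when encoded as UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Returns true only if the bytes form a well-formed UTF-8 sequence.
    public static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical body; \u00FC is the "ü" in "Türkiye".
        String body = "{\"term\":{\"country\":\"T\u00FCrkiye\"}}";

        // What a Latin-1 default produces: "ü" becomes the single byte 0xFC.
        byte[] latin1 = body.getBytes(StandardCharsets.ISO_8859_1);
        // What a UTF-8 JSON parser expects: "ü" becomes 0xC3 0xBC.
        byte[] utf8 = body.getBytes(StandardCharsets.UTF_8);

        System.out.println("ISO-8859-1 bytes valid as UTF-8: " + isValidUtf8(latin1));
        System.out.println("UTF-8 bytes valid as UTF-8:      " + isValidUtf8(utf8));
    }
}
```

Running it prints `false` for the ISO-8859-1 bytes and `true` for the UTF-8 bytes, which matches the "Invalid UTF-8" class of error above. One way to avoid the Latin-1 default in HttpCore 4.x is to pass an explicit UTF-8 content type or charset when constructing the entity; this is an assumption about the shape of the fix, not a quote from the diff.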

Release notes

(x) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

# Elasticsearch, OpenSearch
* Fix query failure for some queries when a predicate contains Unicode text. ({issue}`issuenumber`)

cla-bot bot commented Sep 14, 2024

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

@martint martint left a comment

The OpenSearch connector probably has the same issue. Can you apply the fix there, too?

bvolpato commented Sep 15, 2024

> The OpenSearch connector probably has the same issue. Can you apply the fix there, too?

Good call, yes, just reproduced there with the same set of tests. Pushed the same fix there.

I also applied the suggested text block changes; I agree that it looks much cleaner. For context, I followed the structure of the surrounding tests, so they could likely get a similar refactoring.

bvolpato commented

Lastly, I've submitted the CLA, but I guess it might take a couple of days to hear back.

pettyjamesm commented

@martint - looks like this is just pending CLA processing before it can merge.

martint commented Sep 18, 2024

@cla-bot check

@cla-bot cla-bot bot added the cla-signed label Sep 18, 2024

cla-bot bot commented Sep 18, 2024

The cla-bot has been summoned, and re-checked this pull request!

@hashhar hashhar merged commit 3c1b11c into trinodb:master Sep 19, 2024
@github-actions github-actions bot added this to the 459 milestone Sep 19, 2024
@bvolpato bvolpato deleted the count-utf8-encoding branch September 20, 2024 12:55