aws-sdk-translate - translate_document returns text in wrong encoding #2897

Closed
dgm opened this issue Aug 12, 2023 · 6 comments · Fixed by #2900
Labels: feature-request, in-progress

dgm commented Aug 12, 2023

Describe the bug

When calling translate_document, the response content contains the correct UTF-8 bytes, as expected, but Ruby 3.0.6 tags the string as ASCII-8BIT. A Rails console excerpt is provided below.

Expected Behavior

Expect the content to be utf-8 encoded:

3.0.6 :025 > a.translated_document.content.encoding
 => #<Encoding:UTF-8>

Current Behavior

3.0.6 :020 > a.translated_document.content.encoding
 => #<Encoding:ASCII-8BIT>

Reproduction Steps

3.0.6 :018 > a = @client.translate_document({document: {content_type: 'text/html', content: 'There will be an increase in proficiency of 10 percentage points'}, source_language_code: 'en', target_language_code: 'es'})
 =>
#<struct Aws::Translate::Types::TranslateDocumentResponse
...
3.0.6 :019 > a.translated_document.content
 => "Habr\xC3\xA1 un aumento en la competencia de 10 puntos porcentuales"
3.0.6 :020 > a.translated_document.content.encoding
 => #<Encoding:ASCII-8BIT>
3.0.6 :021 > a.translated_document.content.force_encoding('utf-8')
 => "Habrá un aumento en la competencia de 10 puntos porcentuales"

Possible Solution

No response

Additional Information/Context

No response

Gem name ('aws-sdk', 'aws-sdk-resources' or service gems like 'aws-sdk-s3') and its version

aws-sdk-translate

Environment details (Version of Ruby, OS environment)

ruby 3.0.6, OS X 13.4.1

dgm added the bug and needs-triage labels on Aug 12, 2023
@dgm
Author

dgm commented Aug 12, 2023

The difference from translate_text is that the translate_document API returns a base64-encoded string, which, when processed by https://github.com/aws/aws-sdk-ruby/blob/9a4278dbe51fd1a7125973772c021dd02d328226/gems/aws-sdk-core/lib/aws-sdk-core/json/parser.rb#L69C11-L69C52, comes out tagged as ASCII-8BIT. I don't know whether the generic implementation in the AWS core library can assume that all Blob shapes are UTF-8, so it probably cannot be fixed there. I would prefer a method override in the Aws::Translate::Types::TranslatedDocument class that forces the encoding, but that class appears to be auto-generated from the API JSON definitions, so I'm at a loss as to how to fix it. Ideally, I think the API definitions should include some specification of, or assumptions about, the character encodings. Maybe it is assumed for string types, but Blobs could conceivably be strings or binary, so in addition to content-type, it would be nice if the API response also specified the character encoding. But I am not an expert in this matter. :)
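The effect described above can be reproduced without the SDK. The following is a minimal sketch that uses only core `Array#pack` and `String#unpack1` with the `m` (base64) directive as a stand-in for the parser's decode step; the sample string and variable names are illustrative assumptions, not taken from the SDK.

```ruby
# Illustrative only: a plain base64 round trip standing in for the SDK's blob parsing.
original = "Habrá un aumento"

# Encode to base64 without line breaks, then decode again.
encoded = [original].pack("m0")
decoded = encoded.unpack1("m")

# The decoded bytes are identical, but the string is tagged as binary.
puts decoded.encoding               # ASCII-8BIT
puts decoded.bytes == original.bytes

# Because the encodings differ and the content is not ASCII-only,
# the two strings do not compare equal until the tag is corrected.
puts decoded == original

# force_encoding only relabels the bytes; it does not transcode them.
fixed = decoded.dup.force_encoding(Encoding::UTF_8)
puts fixed == original
puts fixed.valid_encoding?
```

Since `force_encoding` changes only the label, it is safe here only because the bytes are already valid UTF-8.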

@mullermp
Contributor

I think this can possibly be fixed with a plugin/customization in aws-sdk-translate service for specifically this operation and api member. I can look into this on Monday.

alextwoods self-assigned this on Aug 14, 2023
alextwoods added the investigating label and removed the needs-triage label on Aug 14, 2023
@alextwoods
Contributor

The TranslateDocument API "supports text, HTML, or Word documents as the input document." The output is documented as "The document format matches the source document format." So I think in cases such as a Word doc we would not want to apply an encoding to this string (and instead your application would need to interpret it as binary data).

Possibly we could add a custom plugin that looks at the type and encoding of the input document and applies the same encoding to the response (e.g., if the input document is a String with UTF-8 encoding, then we can ensure the output document is also a String with UTF-8 encoding).
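One way to read that idea is sketched below. `match_input_encoding` is a hypothetical helper name, not part of the SDK, and the rule it applies (copy the input's encoding only when the input is a non-binary String and the output bytes are valid in that encoding) is an assumption about what such a plugin might do.

```ruby
# Hypothetical helper, not an SDK API: tag the output with the input's
# encoding only when that is clearly safe.
def match_input_encoding(input, output)
  # Non-string inputs (e.g. IO objects) give no encoding hint.
  return output unless input.is_a?(String) && output.is_a?(String)
  # Binary input (e.g. a Word document) should stay binary on the way out.
  return output if input.encoding == Encoding::BINARY

  candidate = output.dup.force_encoding(input.encoding)
  # Fall back to the untouched output if the bytes are not valid in that encoding.
  candidate.valid_encoding? ? candidate : output
end

html_in  = "There will be an increase"   # UTF-8 source string
html_out = "Habr\xC3\xA1 un aumento".b   # binary-tagged response bytes
puts match_input_encoding(html_in, html_out).encoding

docx_in  = "PK\x03\x04".b                # binary input stays binary
docx_out = "PK\x03\x04\xFF".b
puts match_input_encoding(docx_in, docx_out).encoding
```

The validity check is a conservative choice: it avoids handing the caller a String whose label does not match its bytes.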

@dgm
Author

dgm commented Aug 14, 2023

Is there a document that explains the high-level architecture of the aws-sdk-ruby build? I see code for plugins etc., but it all appears to be auto-generated, and I can't find any documentation on how to work within the system...

@alextwoods
Contributor

alextwoods commented Aug 14, 2023

We don't have good documentation on how to add plugins. But if you want to add a plugin in your own code, you can do something like:

class FixTranslateDocumentEncoding < Seahorse::Client::Plugin
  class Handler < Seahorse::Client::Handler
    def call(context)
      # detect encoding
      encoding = "UTF-8" # TODO: actually detect it and ensure it doesn't break for non-string inputs
      # call the rest of the stack, this will build the request, sign it, send it and parse the output
      resp = @handler.call(context)
      # modify the response before returning it upwards in the stack
      resp.translated_document.content = resp.translated_document.content.force_encoding(encoding)
      resp
    end
  end

  def add_handlers(handlers, _config)
    # Handler is early in the call stack
    handlers.add(Handler, step: :initialize, operations: [:translate_document])
  end
end

# Add the plugin to the client
Aws::Translate::Client.add_plugin(FixTranslateDocumentEncoding)

This would apply for all instances of the Translate::Client.
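To illustrate why the handler calls `@handler.call(context)` first and only then touches the response, here is a toy model of a wrapping handler chain. The classes `ToyResponse`, `InnermostHandler`, and `EncodingHandler` are hypothetical stand-ins written for this sketch; they are not the real Seahorse classes and skip everything the real stack does (building, signing, sending, parsing).

```ruby
# Toy stand-ins only; none of these are Seahorse or AWS SDK classes.
ToyResponse = Struct.new(:content)

# Plays the role of "the rest of the stack": returns binary-tagged bytes,
# much like a parsed blob member would.
class InnermostHandler
  def call(_context)
    ToyResponse.new("Habr\xC3\xA1 un aumento".b)
  end
end

# Wraps another handler, lets it run, then adjusts the response on the way back up.
class EncodingHandler
  def initialize(handler)
    @handler = handler
  end

  def call(context)
    resp = @handler.call(context)
    resp.content = resp.content.dup.force_encoding(Encoding::UTF_8)
    resp
  end
end

stack = EncodingHandler.new(InnermostHandler.new)
result = stack.call({})
puts result.content
puts result.content.encoding
```

The outermost handler sees the response last, which is why a handler added early in the stack is a natural place for this kind of post-processing.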

alextwoods added the feature-request label and removed the bug label on Aug 15, 2023
alextwoods added the in-progress label and removed the investigating label on Aug 16, 2023
