Add support to translate files #84

LukasWallrich · 2023-05-23T11:30:32Z

I am trying to translate some files with the Google Translate API. I don't think that is currently supported, but it would be a great option for gl_translate: I have not found any R code to do it and am a bit daunted by the API docs (https://cloud.google.com/translate/docs/advanced/translate-documents). Might that be possible?

MarkEdmondson1234 · 2023-05-23T18:34:45Z

Can you read the files into R at all? If so, you could chunk up the text and send it in.
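
For readers who only need the text, the chunking idea could look roughly like the sketch below. It is written in Python to match the code later in this thread. The function name `chunk_text` and the 5000-character default are illustrative assumptions, not a documented API limit or part of googleLanguageR.

```python
def chunk_text(text: str, max_chars: int = 5000) -> list[str]:
    """Split text into chunks of at most max_chars, preferring paragraph breaks.

    max_chars is a placeholder value (an assumption), not a documented limit.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")

    chunks: list[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        # A paragraph longer than max_chars is cut into fixed-size slices.
        pieces = [
            paragraph[i:i + max_chars]
            for i in range(0, len(paragraph), max_chars)
        ] or [""]
        for piece in pieces:
            candidate = piece if not current else current + "\n\n" + piece
            if len(candidate) <= max_chars:
                # Still fits: keep growing the current chunk.
                current = candidate
            else:
                # Would overflow: close the current chunk and start a new one.
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be sent as a separate translation request. Note that this loses tables and other formatting, which is the limitation raised later in the thread.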

MarkEdmondson1234 · 2023-05-23T18:36:24Z

But reading the docs you linked, it doesn't look like a big change, since there is already an option to upload to Cloud Storage.

LukasWallrich · 2023-05-23T19:58:40Z

If I just want the text, it's indeed not difficult, but I need to translate the full files so that the formatting remains (somewhat) intact, as they contain tables. I did not understand how to send the file with the request, so I ended up using reticulate and the google.cloud.translate package. With that, the Python function is straightforward and partly copied from the documentation. If this is not a common need, obviously feel free to close the issue; maybe this code helps others who face the same problem.

import google.cloud.translate as translate
import os

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "key.json"
os.environ["GOOGLE_PROJECT_ID"] = "XXXX"

def translate_pdf(file_path: str, destination: str, target_lang: str = 'en', source_lang: str = ''):

    if not os.path.isfile(file_path):
        raise ValueError("Error: The file does not exist or is not a regular file.")

    if not file_path.lower().endswith('.pdf'):
        raise ValueError("Error: The file is not a PDF file.")

    client = translate.TranslationServiceClient()

    location = "us-central1"

    parent = f"projects/{os.environ['GOOGLE_PROJECT_ID']}/locations/{location}"

    # Supported file types: https://cloud.google.com/translate/docs/supported-formats
    with open(file_path, "rb") as document:
        document_content = document.read()

    document_input_config = {
        "content": document_content,
        "mime_type": "application/pdf",
    }

    response = client.translate_document(
        request={
            "parent": parent,
            "target_language_code": target_lang,
            "source_language_code": source_lang,
            "document_input_config": document_input_config,
        }
    )

    # Write the translated document to the destination path.
    with open(destination, "wb") as f:
        f.write(response.document_translation.byte_stream_outputs[0])
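
The function above hard-codes "application/pdf". If other formats are needed, one possible generalization is to pick the MIME type from the file extension, as in this sketch. The helper name `build_document_input_config` and the two-entry mapping are assumptions made for illustration; the supported-formats page linked in the code above is the authoritative list.

```python
import os

# Illustrative mapping only (an assumption); extend it from
# https://cloud.google.com/translate/docs/supported-formats
MIME_TYPES = {
    ".pdf": "application/pdf",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
}


def build_document_input_config(file_path: str, content: bytes) -> dict:
    """Return a document_input_config dict based on the file extension."""
    extension = os.path.splitext(file_path)[1].lower()
    if extension not in MIME_TYPES:
        raise ValueError(f"Unsupported file type: {extension or '(none)'}")
    return {"content": content, "mime_type": MIME_TYPES[extension]}
```

The returned dict has the same shape as `document_input_config` in the function above, so it could be passed to `translate_document` in its place.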

MarkEdmondson1234 · 2023-05-24T05:24:06Z

Thanks! This is helpful

dietrichson · 2023-05-30T13:31:30Z

@MarkEdmondson1234 I was trying to browse through the documentation at this link:
https://code.markedmondson.me/googleLanguageR/
but get a 404. Have you moved it?

MarkEdmondson1234 · 2023-05-30T18:20:53Z

There were two mirrored websites, so I'm a bit confused about why that one is down, but this one is still live: https://docs.ropensci.org/googleLanguageR/

dietrichson · 2023-06-06T10:59:39Z

@MarkEdmondson1234 I tried the following:

my_file1 <- readBin(my_out, "raw", n=10000)
my_file2 <- readBin(system.file(package = "googleLanguageR","test-doc-no.pdf"), "raw", n=10000)
expect_equal(my_file1, my_file2)

This stops working pretty quickly. I suspect that the PDF produced gets time-related metadata added, so the files won't be 100% equivalent. Even the file size differs by a couple of bytes. I'll try to rework the test using pdftools, although this will add a package dependency.
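
If timestamps really are the cause (that is only a suspicion here), a dependency-free alternative is to blank out the date entries before comparing bytes. The sketch below is in Python, like the code earlier in the thread. It is not the pdftools approach mentioned above, and the helper names are hypothetical. It assumes the `/CreationDate` and `/ModDate` entries are stored uncompressed, and it ignores other volatile fields such as the trailer `/ID` or XMP metadata, so comparing extracted text is likely more robust.

```python
import re

# Date entries in the PDF document information dictionary,
# e.g. /CreationDate (D:20230606105939Z)
_DATE_ENTRY = re.compile(rb"/(CreationDate|ModDate)\s*\([^)]*\)")


def strip_pdf_dates(pdf_bytes: bytes) -> bytes:
    """Replace the value of each date entry with an empty string."""
    return _DATE_ENTRY.sub(rb"/\1 ()", pdf_bytes)


def same_pdf_ignoring_dates(a: bytes, b: bytes) -> bool:
    """Compare two PDFs byte-for-byte after blanking their date entries."""
    return strip_pdf_dates(a) == strip_pdf_dates(b)
```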

MarkEdmondson1234 mentioned this issue May 24, 2023

Call for co-maintainers :-) #79

Open

dietrichson added a commit to dietrichson/googleLanguageR that referenced this issue Jun 1, 2023

Initial try [ropensci#84]

904fbea

dietrichson mentioned this issue Jun 1, 2023

Support for translating PDF files #85

Merged
