This repository has been archived by the owner on Jan 8, 2024. It is now read-only.

Windows: "PDF error: Couldn't open file" with some unicode filenames #111

Open
jwilk opened this issue Sep 16, 2015 · 4 comments

@jwilk

jwilk commented Sep 16, 2015

Issue reported by 40a at Bitbucket:

I'm using pdf2djvu.exe on Windows 8.1.
I have noticed that for all PDF files that contain the "ی" character (U+06CC) in their names, I get the following error:

>>>F:\Software\Media\PDF\pdf2djvu-0.8.2/pdf2djvu.exe --output="E:\out.djvu" "E:\ی.pdf"
PDF error: Couldn't open file 'E:\ÙŠ.pdf': No such file or directory.
Unable to load document

>>>"E:\ی.pdf"

>>>

When running the filepath directly ("E:\ی.pdf") it works fine and causes the file to be opened in Adobe Reader. So I suspect that the issue is caused by the way pdf2djvu decodes its arguments.

I have already tried using the chcp 65001 command to change cmd's codepage to UTF-8, but I still get the same error; only the shape of the mojibake in the error message changes.

Currently I have found no way around this but to rename the file to something else and then do the conversion.

@jwilk

jwilk commented Sep 16, 2015

Thanks for the bug report.

pdf2djvu doesn't itself perform any conversions on the arguments.
The C runtime does convert from the Unicode command line to byte-based argv[], using the ANSI codepage as the encoding.
If it does it wrong, as seems to be the case here, there's not much we can do about it.

chcp doesn't help, because it only changes the console codepage, not the ANSI codepage.

Anyway, I wrote a small test program that should show exactly what's going on here. Could you run it with "E:\ی.pdf" as the argument, and paste the output?


Attachment: testencoding.zip

@jwilk

jwilk commented Sep 16, 2015

Source of the test program:

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <windows.h>

int main(int argc, char **argv)
{
    struct stat st;
    int rc;
    int i;
    printf("GetACP() = %u\n", GetACP());
    printf("GetConsoleOutputCP() = %u\n", GetConsoleOutputCP());
    for (i = 1; i < argc; i++) {
        printf("argv[%d] = \"", i);
        const char *p = argv[i];
        while (*p)
            printf("\\x%02X", (unsigned char)*p++);
        printf("\"\n");
        rc = stat(argv[i], &st);
        printf("stat(argv[%d]) = %d", i, rc);
        if (rc != 0)
            printf(" (%s)", strerror(errno));
        printf("\n");
    }
    wchar_t **argvw;
    int argcw;
    argvw = CommandLineToArgvW(GetCommandLineW(), &argcw);
    if (argvw == NULL) {
        fprintf(stderr, "CommandLineToArgvW() failed\n");
        return 1;
    }
    for (i = 1; i < argcw; i++) {
        printf("argvw[%d] = L\"", i);
        const wchar_t *p = argvw[i];
        while (*p)
            printf("\\u%04X", (unsigned int)*p++);
        printf("\"\n");
        rc = wstat(argvw[i], &st);
        printf("wstat(argvw[%d]) = %d", i, rc);
        if (rc != 0)
            printf(" (%s)", strerror(errno));
        printf("\n");
    }
    return 0;
}

/* vim:set ts=4 sts=4 sw=4 et:*/

@jwilk

jwilk commented Sep 16, 2015

Comment submitted by 40a at Bitbucket:

Thank you. I see. AFAIK none of the Microsoft-defined codepages contain the character "ی".

Here is the output:

F:\Downloads>testencoding.exe "E:\ی.pdf"
GetACP() = 1256
GetConsoleOutputCP() = 720
argv[1] = "\x45\x3A\x5C\xED\x2E\x70\x64\x66"
stat(argv[1]) = -1 (No such file or directory)
argvw[1] = L"\u0045\u003A\u005C\u06CC\u002E\u0070\u0064\u0066"
wstat(argvw[1]) = 0

@jwilk

jwilk commented Sep 17, 2015

U+06CC (ARABIC LETTER FARSI YEH) cannot be represented in CP1256, which is your ANSI codepage. Apparently the C runtime converts the character to 0xED, which is U+064A (ARABIC LETTER YEH).

That's going to be tough to fix. :-\

But I'll at least try to improve the error message.

@jwilk jwilk added the bug label Dec 6, 2016