Windows: "PDF error: Couldn't open file" with some unicode filenames #111

jwilk · 2015-09-16T08:43:47Z

Issue reported by 40a at Bitbucket:

I'm using pdf2djvu.exe on Windows 8.1.
I have noticed that for all PDF files that contain the "ی" character (U+06CC) in their names, I get the following error:

>>>F:\Software\Media\PDF\pdf2djvu-0.8.2/pdf2djvu.exe --output="E:\out.djvu" "E:\ی.pdf"
PDF error: Couldn't open file 'E:\ÙŠ.pdf': No such file or directory.
Unable to load document

>>>"E:\ی.pdf"

>>>

When running the filepath directly ("E:\ی.pdf") it works fine and causes the file to be opened in Adobe Reader. So I suspect that the issue is caused by the way pdf2djvu decodes its arguments.

I have already tried using the chcp 65001 command to change cmd's codepage to UTF-8, but I still get the same error; only the shape of the mojibake in the error message changes.

Currently I have found no way around this but to rename the file to something else and then do the conversion.

jwilk · 2015-09-16T21:02:28Z

Thanks for the bug report.

pdf2djvu doesn't itself perform any conversions on the arguments.
The C runtime converts the Unicode command line to the byte-based argv[], using the ANSI codepage as the encoding.
If it does that wrong, as seems to be the case here, there's not much we can do about it.

chcp doesn't help, because it only changes the console codepage, not the ANSI codepage.

Anyway, I wrote a small test program that should show exactly what's going on here. Could you run it with "E:\ی.pdf" as the argument, and paste the output?

Attachment: testencoding.zip

jwilk · 2015-09-16T21:03:12Z

Source of the test program:

#include <errno.h>   /* errno */
#include <stdio.h>
#include <string.h>  /* strerror() */
#include <sys/stat.h>
#include <windows.h>

int main(int argc, char **argv)
{
    struct stat st;
    int rc;
    int i;
    printf("GetACP() = %d\n", GetACP());
    printf("GetConsoleOutputCP() = %d\n", GetConsoleOutputCP());
    for (i = 1; i < argc; i++) {
        printf("argv[%d] = \"", i);
        const char *p = argv[i];
        while (*p)
            printf("\\x%02X", (unsigned char)*p++);
        printf("\"\n");
        rc = stat(argv[i], &st);
        printf("stat(argv[%d]) = %d", i, rc);
        if (rc != 0)
            printf(" (%s)", strerror(errno));
        printf("\n");
    }
    wchar_t **argvw;
    int argcw;
    argvw = CommandLineToArgvW(GetCommandLineW(), &argcw);
    if (argvw == NULL) {
        fprintf(stderr, "CommandLineToArgvW() failed\n");
        return 1;
    }
    for (i = 1; i < argcw; i++) {
        printf("argvw[%d] = L\"", i);
        const wchar_t *p = argvw[i];
        while (*p)
            printf("\\u%04X", *p++);
        printf("\"\n");
        rc = wstat(argvw[i], &st);
        printf("wstat(argvw[%d]) = %d", i, rc);
        if (rc != 0)
            printf(" (%s)", strerror(errno));
        printf("\n");
    }
    return 0;
}

/* vim:set ts=4 sts=4 sw=4 et:*/

jwilk · 2015-09-16T23:12:12Z

Comment submitted by 40a at Bitbucket:

Thank you. I see. AFAIK none of the Microsoft-defined codepages contains the character "ی".

Here is the output:

F:\Downloads>testencoding.exe "E:\ی.pdf"
GetACP() = 1256
GetConsoleOutputCP() = 720
argv[1] = "\x45\x3A\x5C\xED\x2E\x70\x64\x66"
stat(argv[1]) = -1 (No such file or directory)
argvw[1] = L"\u0045\u003A\u005C\u06CC\u002E\u0070\u0064\u0066"
wstat(argvw[1]) = 0

jwilk · 2015-09-17T10:50:17Z

U+06CC (ARABIC LETTER FARSI YEH) cannot be represented in CP1256, which is your ANSI codepage. Apparently the C runtime converts the character to 0xED, which is U+064A (ARABIC LETTER YEH).
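
The mismatch can be checked mechanically from the test program's output. The following is only a sketch: cp1256_decode_byte() and first_mismatch() are hypothetical helper names, and the decoder covers just the bytes that occur in the reported argv[1], not the whole code page.

```c
#include <stddef.h>
#include <stdint.h>

/* Partial CP1256 decoder: handles only the bytes seen in the reported
 * argv[1]. Illustrative excerpt, not a complete code page table. */
uint32_t cp1256_decode_byte(unsigned char b)
{
    if (b < 0x80)
        return b;        /* ASCII range maps to itself */
    if (b == 0xED)
        return 0x064A;   /* ARABIC LETTER YEH */
    return 0xFFFD;       /* byte not covered by this sketch */
}

/* Return the index of the first position where the narrow argument,
 * decoded as CP1256, differs from the wide argument.
 * Return n if all n positions agree. */
size_t first_mismatch(const unsigned char *narrow, const uint32_t *wide,
                      size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        if (cp1256_decode_byte(narrow[i]) != wide[i])
            return i;
    return n;
}
```

Fed the eight argv[1] bytes and the eight argvw[1] code units from the output above, first_mismatch() returns 3: the narrow name decodes to U+064A at the position where the wide name has U+06CC. stat() is therefore asked about a different file name than wstat().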

That's going to be tough to fix. :-\

But I'll at least try to improve the error message.
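
On the error message itself: the "ÙŠ" in the original report looks consistent with the substitution to U+064A. The bytes D9 8A are the UTF-8 encoding of U+064A, and they display as "ÙŠ" when read as Windows-1252. Where in the pipeline that re-encoding happens is not established in this thread. The sketch below only checks the byte arithmetic, and utf8_encode_bmp() is a hypothetical helper name.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode a code point below U+10000 as UTF-8.
 * Returns the number of bytes written (1 to 3), or 0 if cp is out of
 * range for this sketch. Surrogate code points are not rejected here. */
size_t utf8_encode_bmp(uint32_t cp, unsigned char out[3])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    return 0;
}
```

For U+064A this yields the two bytes D9 8A. For the original U+06CC it would yield DB 8C instead, so the message already reflects the substituted character rather than the one in the real file name.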

jwilk added the bug label Dec 6, 2016
