[gallery-dl + Exiftool] How to overcome special characters problems? #2734

Closed
KonoVitoDa opened this issue Jul 5, 2022 · 5 comments

KonoVitoDa commented Jul 5, 2022

TL;DR: I need an exec.command to write the downloaded file path and metadata values to a text file (just the chosen ones, one per line and with an Exiftool command appended to the beginning), so I can use this generated text file as an arg file in my Exiftool postprocessor command to overcome problems with special characters.

Whenever a special character such as ⧸ or any Japanese character is written to a file's EXIF metadata with Exiftool, it is replaced by a run of "???". When such characters are present in the filename, Exiftool can't even find the file. It looks like a problem on the Windows side, which can't handle these characters well. The Exiftool documentation says here and here that using an arg file is a good way to bypass this problem. So the solution I thought of is to use an exec.command to first write the metadata values and the file path to a text file, with each value preceded by the corresponding Exiftool option (1), and then run a second exec.command using this arg file (2), so that special characters are written correctly to their respective fields and special characters in filenames no longer cause problems.

1. WHAT THE TEXT FILE WOULD LOOK LIKE:

-title={content:?//}
-xpcomment={favorite_count} Likes
-keywords={hashtags!S}
-createdate={date}
{_path[:4]}

PS: Note that the values in { } must have already been converted to real values, so taking this tweet as an example, the arg file would become:

-title=#深夜の真剣お絵描き60分一本勝負うどんげ
-xpcomment=13119 Likes
-keywords=深夜の真剣お絵描き60分一本勝負
-createdate=2022-07-02 22:34:43
D:\Downloads\twitter\poccheinfinity 1543362559075463169 p1 [2022-07-02].jpg

2. USING THE GENERATED TEXT FILE AS AN ARG FILE FOR EXIFTOOL:

"postprocessor":
{
    "exiftool-twitter":
    {
        "name": "exec",
        "async": false,
        "command": ["exiftool", "-charset", "filename=utf8", "-@", "~/gallery-dl/ARG FILE.txt"],
        "event" : "after"
    }
}

So how could I generate this text file with one argument per line using an exec.command postprocessor? It would require just a PowerShell/CMD command, I think, but I don't know how to write one (I'm a newbie). It would also be good if the text file were overwritten for each newly downloaded file, so I don't end up with hundreds of txt files.

And if there's an easier way to correctly read/write these special characters, please let me know.

Windows 10 Home 21H2 19044.1766 | gallery-dl 1.22.3 | Exiftool 12.42 (Windows Executable)

Owner

mikf commented Jul 8, 2022

I think this would be easiest by using a metadata post processor to write the text file with. Only problem is that it does not have the value for {_path[:4]} available, so we'd have to add this later, somehow. Or I do some changes to gallery-dl itself and add _path etc to every metadata dict ...

Anyway, the metadata post processor would look something like this:

{
    "name": "metadata",
    "event": "after",
    "filename": "ARG FILE.txt",
    "directory": "~/gallery-dl",
    "mode": "custom",
    "format": [
        "-title={content:?//R\n/ /}",
        "-xpcomment={favorite_count} Likes",
        "-keywords={hashtags!S}",
        "-createdate={date}"
    ]
}

but, as I said, that's missing the {_path[:4]} at the end. It would be somewhat easier if we could spawn another shell and do it with that, but I have no idea how to accomplish that on Windows.
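
One way to picture the missing step is a small helper that writes the arg file itself, including the path as the last line. This is only a sketch in Python rather than a gallery-dl feature: the function name `write_arg_file` and its parameters are hypothetical, and it assumes the metadata values have already been rendered to plain strings without embedded newlines.

```python
from pathlib import Path


def write_arg_file(arg_file, metadata_args, media_path):
    """Write one ExifTool argument per line, with the target file last.

    Hypothetical helper: `metadata_args` is a list of already rendered
    strings such as "-title=...", and `media_path` is the downloaded file.
    """
    lines = list(metadata_args) + [str(media_path)]
    # Writing as UTF-8 keeps non-ASCII characters intact, and write_text
    # overwrites the file, so only one arg file exists at any time.
    Path(arg_file).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines
```

Because the file is overwritten on every call, this also covers the request to avoid accumulating one text file per download.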

@KonoVitoDa
Author

Thanks. This will be a good solution once _path in the metadata is added.

But for now, a (better?) solution I found on the Exiftool forum was this:
https://stackoverflow.com/questions/57131654/using-utf-8-encoding-chcp-65001-in-command-prompt-windows-powershell-window/57134096#57134096

By the way, what does :4 in _path[:4] do?

Contributor

Hrxn commented Jul 10, 2022

By the way, what does :4 in _path[:4] do?

This is Python's slicing syntax, used here to get a range (a slice) of a string.
The general form is varname[<StartIndex>:<EndIndex>]. In this case the start index is left out, so it defaults to the start of the string (index 0), and the end index is exclusive, so varname[:4] returns the characters at index positions 0 through 3, i.e. the first 4 characters of varname.
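
A short illustration of the slicing rules in plain Python (the string used here is just an arbitrary example, not a value gallery-dl produces):

```python
name = "gallery-dl"

first_four = name[:4]       # start omitted: same as name[0:4]
explicit = name[0:4]        # indices 0, 1, 2, 3 (index 4 is excluded)
remainder = name[4:]        # end omitted: everything from index 4 onward

print(first_four)   # gall
print(remainder)    # ery-dl
```

The two halves always join back into the original string, since `name[:4] + name[4:] == name`.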

[...] It would be somewhat easier if we could spawn another shell and do it with that, but I have no idea how to accomplish that on Windows.

I mean, that's what the exec postprocessor is for, right?
Why not simply wrap exiftool in a script file, PowerShell for example?

Should be easy enough on gallery-dl's side..

"postprocessor":
{
    "exiftool-twitter":
    {
        "name": "exec",
        "async": false,
        "command": ["powershell.exe", "-File", "D:/path/to/scriptfile.ps1", "~/gallery-dl/ARG FILE.txt"],
        "event" : "after"
    }
}

(Anything else with regard to metadata we need to pass on here?)

Depending on the length of the script, this could also be wrapped into a powershell.exe -Command {script block}, potentially..
Not sure if that would be really ergonomic, though..
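
For comparison, the same wrapper idea can be sketched in Python instead of PowerShell. This is not the script proposed above, only an illustration: the function names are hypothetical, the Exiftool options are the ones from the original post, and expanding `~` inside the script is an assumption in case the path arrives unexpanded.

```python
import os
import subprocess


def build_exiftool_command(arg_file, exiftool="exiftool"):
    """Return the argument list for running ExifTool with an arg file."""
    # Assumption: "~" may not have been expanded by the caller.
    arg_file = os.path.expanduser(str(arg_file))
    return [exiftool, "-charset", "filename=utf8", "-@", arg_file]


def run_exiftool(arg_file):
    """Run ExifTool and return its exit code (requires exiftool on PATH)."""
    # Passing a list avoids shell quoting issues with spaces in paths.
    return subprocess.run(build_exiftool_command(arg_file)).returncode
```

Passing the command as a list rather than a single string means the space in "ARG FILE.txt" needs no extra quoting.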

Anyway, I'm here and willing to help with testing 😄

mikf added a commit that referenced this issue Jul 31, 2022
Owner

mikf commented Aug 3, 2022

This will be a good solution once _path in the metadata is added.

This is now possible with the path-metadata option added in commit 7d1a95a

It works similarly to url-metadata in that you set this option to any name you want (_path) and can then use that name to access all path information associated with the last downloaded file ({_path.path}, {_path.directory}, {_path.filename}, etc.)

By the way, what does :4 in _path[:4] do?

I made a mistake up in #2734 (comment)

_path[:4] is equivalent to _path[0:4] and returns the first 4 characters from _path.

What I actually meant to write was _path[4:], which returns everything but the first 4 characters, which are \\?\ on Windows.

\\?\ makes an absolute path a raw path that is not limited to 260 characters, but a lot of other tools do not support such paths.
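
In plain Python, removing that prefix could look like the sketch below. The helper name `strip_extended_prefix` is hypothetical; unlike a bare `[4:]` slice it checks for the prefix first, so paths without it are returned unchanged. It does not handle the `\\?\UNC\` form used for network shares.

```python
EXTENDED_PREFIX = "\\\\?\\"  # the four characters \\?\


def strip_extended_prefix(path):
    """Return `path` without a leading \\\\?\\ prefix, if it has one."""
    if path.startswith(EXTENDED_PREFIX):
        return path[len(EXTENDED_PREFIX):]
    return path


print(strip_extended_prefix(r"\\?\D:\Downloads\twitter\example.jpg"))
# D:\Downloads\twitter\example.jpg
```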

@brokedarius

Try whitelisting characters with a regex, maybe combined with an f-string? What I do myself is allow only the characters a-z, A-Z and 0-9 and discard everything else. In the snippet below, runs of whitespace are also collapsed into a single space. The [:48] limits the title length, but you are free to adjust that.

{
    "extractor": {

        "extractor_name": {
            "filename": "\fE '{} - {}-{} - {}.{}'.format(user, gallery['id'], id, ' '.join(re.compile(r'[a-zA-Z0-9]+').findall(re.sub(r'\\s+', ' ', gallery['title'][:48]))).lower(), extension)"
        }
    }
}
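
The title-cleaning part of that expression can be read more easily as a standalone Python function. This is a restatement for illustration only; the name `sanitize_title` and the `max_length` parameter are hypothetical and not part of gallery-dl.

```python
import re

ALLOWED_RUN = re.compile(r"[a-zA-Z0-9]+")


def sanitize_title(title, max_length=48):
    """Keep only ASCII letters and digits, separated by single spaces."""
    # Truncate first, as in the expression above, then collapse whitespace.
    truncated = re.sub(r"\s+", " ", title[:max_length])
    # Every run of allowed characters becomes one word; the rest is dropped.
    return " ".join(ALLOWED_RUN.findall(truncated)).lower()


print(sanitize_title("Hello,   World! 深夜 123"))  # hello world 123
```

Note that this discards Japanese text entirely, so a title made up only of such characters becomes an empty string.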
