
Postprocessors – metadata – how to print to stdout? #2624

Open
aleksusklim opened this issue May 26, 2022 · 3 comments

Comments

@aleksusklim

Is it possible to set gallery-dl to print metadata to console stdout (or stderr), rather than printing it to a file?
I call gallery-dl from my own program; it would be really nice if I could just capture and reparse its output (grabbing lines custom-formatted via metadata.content-format and piping everything else through verbatim), instead of making exec postprocessor calls to a tiny utility that merely prints its arguments to stdout so that the metadata appears in my stream.

For example, I tried:

"postprocessors": [{
    "name": "metadata",
    "event": "post",
    "filename": "-"
}]

– It doesn't work: it creates files literally named "-" with JSON in the target folder, while I want it to print to stdout/stderr.
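To illustrate what I'm after, my wrapper-side loop would be roughly this (Python here only for brevity, my actual wrapper is NodeJS; the META> prefix is a convention I would set via content-format, not an existing gallery-dl feature):

import json
import subprocess

# Capture gallery-dl's output line by line; treat lines carrying the
# hypothetical "META> " prefix as metadata, pass the rest through.
proc = subprocess.Popen(
    ["gallery-dl", "https://twitter.com/USER/media"],
    stdout=subprocess.PIPE,
    text=True,
)
for line in proc.stdout:
    if line.startswith("META> "):
        metadata = json.loads(line[len("META> "):])
        # ...decide here whether to keep downloading or to stop
    else:
        print(line, end="")  # pipe everything else verbatim
proc.wait()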

Another possible solution: can gallery-dl optionally append to the metadata file, instead of rewriting it? Then I could specify just one constant file path (in metadata.directory?) to which each downloaded item's metadata would be appended. (In that case I would open it in shared mode and watch for changes, reading while gallery-dl writes to it.)

The reason for this is that I want to create a wrapper around gallery-dl that will know "where to stop downloads" using a more sophisticated approach than what is currently possible with --abort or --download-archive. So it needs to know what is being processed in real time, including posts without media (such as text-only tweets on Twitter).

The option -j prints all metadata, but doesn't download anything…

@mikf
Owner

mikf commented May 27, 2022

Is it possible to set gallery-dl to print metadata to console stdout (or stderr), rather than printing it to a file?

No, that's not possible at the moment.

Another possible solution: can gallery-dl optionally append to metadata file, instead of rewriting it?

Also not possible.
Not that it matters in your case, but that would produce invalid JSON documents.

The reason for this is that I want to create my wrapper around gallery-dl

I'm guessing this wrapper is not in Python, because if it were, you could access gallery-dl's internals directly and this would be a lot easier.

using a more sophisticated approach than what is currently possible with --abort or --download-archive

What do you have in mind? I could possibly implement this directly in gallery-dl itself.

@aleksusklim
Author

Is it possible to set gallery-dl to print metadata to console stdout (or stderr), rather than printing it to a file?

No, that's not possible at the moment.

I think that would be the simplest solution to implement. Oh, by the way, how do I print the whole JSON metadata on one line, but prefixed? Suppose I have (for Twitter):

"postprocessors": [{
  "name": "metadata",
  "event": "prepare",
  "filename": "{tweet_id}_{num}.json",
  "directory": "meta",
  "mode": "custom",
  "content-format": "{user!j}"
}]

This gives me the error [twitter][error] An unexpected error occurred: TypeError - Object of type datetime is not JSON serializable.
Which is strange, because removing the "mode" and "content-format" keys from the config gives correct pretty-printed JSON output, where each file is an object whose keys include tweet_id, num and user, the latter being an object with keys like date, name and profile_image.
I assume the error comes from the "date" field, which is rendered as, for example, "date": "2022-01-30 20:52:33" in the automatic JSON output, but for some reason throws with the explicit !j format.
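For reference, this matches plain Python json.dumps behavior, which raises the same TypeError unless a fallback such as default=str is supplied:

import json
from datetime import datetime

json.dumps({"date": datetime(2022, 1, 30, 20, 52, 33)})
# TypeError: Object of type datetime is not JSON serializable

json.dumps({"date": datetime(2022, 1, 30, 20, 52, 33)}, default=str)
# '{"date": "2022-01-30 20:52:33"}'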

Also, how do I print all of the metadata as a JSON object, rather than a restricted set of keys? Am I missing some special template like {_filename}?

Another possible solution: can gallery-dl optionally append to metadata file, instead of rewriting it?

Also not possible. Not that it matters in your case, but that would produce invalid JSON documents.

I think that's fine, as long as each one is on its own line…
Until this very moment, I was sure that !j (or something else) could print JSON without indentation on one line. (But now it isn't working for me at all, so I can't see the output.)
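Reading such a file while gallery-dl keeps appending to it would then be as simple as this sketch (the path is hypothetical, and a real reader would also need to handle partially written lines):

import json
import time

# Follow an append-only file of one-JSON-object-per-line records
with open("metadata.ndjson", encoding="utf-8") as f:
    while True:
        line = f.readline()
        if not line:        # nothing new has been appended yet
            time.sleep(0.5)
            continue
        record = json.loads(line)
        # ...react to the freshly written metadata record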

The reason for this is that I want to create my wrapper around gallery-dl

I'm guessing this wrapper is not in Python, because if it were, you could access gallery-dl's internals directly and this would be a lot easier.

Yup, I use NodeJS. I was already near the release state of my local Twitter viewer (the script recursively walks over \gallery-dl\twitter\, reads all the .json files, stores the names of all media, and then dumps an interactive HTML file for a browser, where any known Twitter user can be selected: displaying all their tweets, all downloaded media, reply threads, retweet references, etc.), but I ran into a fatal Twitter flaw with username handling:

  1. Sometimes the Twitter API (and thus the JSONs) contains properly capitalized usernames, but sometimes they are all-lowercase. This means it is unreliable to track things by the folder name that gallery-dl created, nor can the @-spelling be trusted. I need to route each internal user lookup through a lowercased version of their name.
An example right in Elon Musk's profile popup! In this tweet, note the link target at the bottom-left corner of the browser window:

[Screenshot: the link target reads elonmusk, while the displayed handle is ElonMusk.]

  2. Any user can change their short username. This means that previously generated JSONs will show outdated usernames, though the user.id field is persistent. Moreover, new media will be downloaded to a new location, so the assumption "everything from one user will be saved in one folder" is conceptually false. The Twitter profile page of the old username will be empty, without any redirection to the new one: StuckAshh vs Ashhfro.

  3. Worst of all: Twitter search will only find those tweets of a user which were sent while the user had the specified username! It's a disaster: the search results for (from:Ashhfro) or @Ashhfro show a few recent tweets with no sign that anything sent earlier exists, which is instead, surprisingly, shown by (from:StuckAshh) and @StuckAshh.

[Screenshots of those search results: the bottom of from:Ashhfro ends at May 10, while the top of from:StuckAshh starts at May 7.]

When I found this out, I decided to rework my script to anticipate this nasty Twitter behavior. The hardest and still unfinished part is the downloader script that will iteratively add new tweets to the media folder (with metadata) without re-requesting much of what is already stored.

sing more sophisticated approach than currently possible with --abort or --download-archive

What do you have in mind? I could possibly implement this directly in gallery-dl itself.

The problem is, I have more than one gallery-dl invocation for a user:

  • "https://twitter.com/USER/media" – This should download most of the posted images and videos
  • "https://twitter.com/USER" – This will get retweets (at least most of relatively recent of them), as my config has "retweets": "original",
  • "https://twitter.com/USER/with_replies" – So I will get replies in other people threads (the config also has conversations, quoted and replies to true)
  • "https://twitter.com/search?q=(from:USER)&src=typed_query&f=live" – All of the user tweets, overcoming possible restrictions of previous timeline requests
  • "https://twitter.com/search?q=(to:USER)&src=typed_query&f=live" – Optional one, to get other replies to this user's tweets, which is cool to have since my HTML can nicely display them.
  • "https://twitter.com/search?q=@USER&src=typed_query&f=live" – This acts as previous, but also covering plain mentions (in top-level tweets), not only direct replies. I'm not sure that it totally supersedes the from-variant, though.

Even though all of those URLs can be specified together in one call, I have to make a separate call anyway:

  • -o retweets=true --no-download "https://twitter.com/USER"
    – For now, this is the only way I can get retweet information for the current user (since otherwise I get only the retweeted content, without any sign of the retweets themselves in the target user's metadata). I submitted an issue about this before: {Twitter – emit current tweet metadata even with "retweets":"original"; also fix quoted retweets saving #2549}, which also covers the quoted-retweets ambiguity. Actually, I have already worked around the quoted-retweets bug locally: I drop quote_by, filter out retweets, and then use author instead of user where they differ; as for media duplicates, I hardlink the identical files to each other at the filesystem level (see the sketch below).
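The hardlink trick is just this (hypothetical paths; both must be on the same filesystem):

import os

# Replace a duplicate media file with a hardlink to the original
def hardlink_duplicate(original, duplicate):
    os.remove(duplicate)
    os.link(original, duplicate)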

I cannot use --abort simply because a user can retweet a dozen images from another, already-tracked user, and those would then count as already downloaded.

Worse, I cannot use it anyway, because I ask gallery-dl to download the same stuff several times! After /media it would already have something like 90% of the user's content, which would very likely abort the next iterations of /, /with_replies and from:.

Maybe I can set --download-archive to a distinct file for each URL? I don't think so:

  1. I don't want to download any media twice, but I still want to emit metadata for it if I need to.
  2. Even if the current request does not return any media (only text-tweets, as in {[Twitter] Download all tweets to .jsons, not tweets with media #2588}), I still want to terminate it at some point if I need to.
  3. I want to continue as long as I get at least some new tweets (a configurable count) compared to the previous run with the same parameters.
  4. If the process was aborted, hung, or otherwise crashed halfway, I don't want to save the "archive", because the next invocation should re-download everything (except already-stored media), regenerate the JSONs and continue further, up to the point where there is no new content compared to the previous successful run.
  5. However, a broken media file should not prevent the archive from being saved, because such a state could become unrecoverable. (For this case, I planned to check gallery-dl's exit codes.)

The main question is: when to stop? A simple solution would be a map (per URL/call) of already-downloaded tweet IDs (as "post" metadata, not per-media) to check against: after N consecutive hits against the map, stop. This is probably still susceptible to the retweets problem, at least under some conditions (and the practical workaround would be to just increase the threshold).
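As a minimal sketch (all names here are made up), the map-based check could look like:

# Abort after `threshold` consecutive already-known tweet IDs
class StopChecker:
    def __init__(self, known_ids, threshold):
        self.known = set(known_ids)
        self.threshold = threshold
        self.hits = 0

    def feed(self, tweet_id):
        """Return True once the download should be aborted."""
        if tweet_id in self.known:
            self.hits += 1
        else:
            self.hits = 0
            self.known.add(tweet_id)
        return self.hits >= self.threshold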

So, I want to try another approach: store the sequence of tweet IDs in the order they were downloaded, then compare not only the intersection but the order too (see the sketch after this list), optimistically assuming that the same invocation should return the same old tweets in the same sequential order. Yes, I saw your «but this is Twitter we are talking about», but:

  1. If the threshold is low enough, the chance of broken sequences drops considerably, while false positives stay improbable too (since it is very unlikely that, for example, a user retweets the same content in the same order).
  2. A deleted tweet will break the sequence, cancelling the intended stop. I think this is fine, since the process will stop anyway (on a newly found sequence, or after re-retrieving everything) and store the new total sequence for future use. All following invocations will use the continually updated sequence in the archive, which reflects the actual tweet state.
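The order-aware check could then be sketched as (again, names made up):

# Stop only when the last `threshold` IDs form a contiguous run
# of the sequence stored by the previous successful run.
def matches_stored_run(recent_ids, stored_sequence, threshold):
    if len(recent_ids) < threshold:
        return False
    tail = recent_ids[-threshold:]
    return any(
        stored_sequence[i:i + threshold] == tail
        for i in range(len(stored_sequence) - threshold + 1)
    )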

I think this approach is too wobbly to implement directly in gallery-dl. Maybe I will test it in my program first (at the least, I can exec a helper and write to a local socket of my parent script) and then, if it shows good results, you could adapt the same technique?

I planned to add three command-line parameters: the filename of the archive/list, the integer threshold, and the format string for post keys. For now I use {tweet_id}, but how do I get the default common name (the one gallery-dl itself uses for media) regardless of the extractor? Just {id} is not working.

@mikf
Owner

mikf commented Jun 7, 2022

Writing metadata to stdout is now possible by setting filename to "-". (v1.22.1, 5b43faf)
You might also want to set indent to null to get a JSON object in one line.
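For example, the config from above would become:

"postprocessors": [{
    "name": "metadata",
    "event": "post",
    "filename": "-",
    "indent": null
}]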

  • "https://twitter.com/USER" – This will get retweets (at least most of relatively recent of them), as my config has "retweets": "original",

This was changed in v1.22.0 (915dba8). You can use "https://twitter.com/USER/tweets" to get the old behavior.

This gives me error [twitter][error] An unexpected error occurred: TypeError - Object of type datetime is not JSON serializable.

The metadata post processor converts any un-serializable data structures to string, while {…!j} does not. I should probably fix this.

  3. Worst of all: Twitter search will only find those tweets of a user which were sent while the user had the specified username! It's a disaster: the search results for (from:Ashhfro) or @Ashhfro show a few recent tweets with no sign that anything sent earlier exists, which is instead, surprisingly, shown by (from:StuckAshh) and @StuckAshh.

This is terrible. I didn't realize Twitter search was that limited. It would be really nice if we could search by user ID, but I haven't found a way to do that with the public website API.

According to https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query#list, this works out-of-the-box with from:USERID, but that seems to only be the case for the official API.
