[Feature Request] Unshorten shortened URLs when using --write-metadata? #1532

Open
Scripter17 opened this issue May 6, 2021 · 8 comments

Comments

@Scripter17
Contributor

My main reason for wanting this is in case t.co links no longer work when the tweet that created them gets deleted. It'd be nice to preserve not just the images but also the links in the description.
I know this is possible for at least Twitter, since I made a Tampermonkey script to go through the <a> element's children to find the full URL (see bottom). This should also be trivial for websites that do xyz.com/redirect?url=abc.com. Whether or not it should support bit.ly and stuff is for someone else to decide
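
For the xyz.com/redirect?url=abc.com case, a minimal sketch of pulling the target out of the query string might look like this. The function name `unwrap_redirect` and the default parameter name `url` are assumptions for illustration; real sites use different parameter names, and this is not gallery-dl code.

```python
from urllib.parse import urlsplit, parse_qs


def unwrap_redirect(url, param="url"):
    """Return the target of a redirect-style URL, or the URL unchanged.

    `param` is the (assumed) query parameter that carries the target.
    """
    query = parse_qs(urlsplit(url).query)
    targets = query.get(param)
    # parse_qs already percent-decodes the value for us
    return targets[0] if targets else url


print(unwrap_redirect("https://xyz.com/redirect?url=https%3A%2F%2Fabc.com%2Fpage"))
# prints "https://abc.com/page"
```

URLs without the parameter are returned as-is, so the helper can be applied to every link without checking the host first.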

Ideally it'd add a new key like "raw_description" to the written JSON file to preserve backwards compatibility

Relevant part of the tampermonkey script I mentioned (pardon the jank):

// Replace each t.co href with the full URL that Twitter keeps in the link's title attribute.
document.querySelectorAll('div[dir="auto"] > a[href^="https://t.co"]').forEach(function (link) {
	const title = link.getAttribute("title");
	if (title !== null && title.startsWith("http")) {
		link.setAttribute("href", title);
	}
});
@Twi-Hard

Twi-Hard commented May 6, 2021

Something I noticed a while ago when looking for mediafire links on Twitter is that Twitter knows where the URLs redirect to (including bit.ly links).
For example if you search this, bit.ly links come up instead of mediafire links:
https://twitter.com/search?q=mediafire.com%20bit.ly
This even happens when there isn't an embed. Perhaps there's a way to get the URLs they redirect to since Twitter already knows?

@Scripter17
Contributor Author

In TwitterExtractor._transform_tweet at line 158 of gallery_dl/extractor/twitter.py, the entities variable has a "urls" key whose entries hold the shortened URL, the expanded URL, and the indices of the shortened URL within the tweet text

For example, this tweet has an entities["urls"] of [{'url': 'https://t.co/0M2GF71p8e', 'expanded_url': 'https://www.romhacking.net/hacks/5927/', 'display_url': 'romhacking.net/hacks/5927/', 'indices': [60, 83]}]
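
A minimal sketch of how that data could be used to expand links, assuming entries shaped like the one above. The function name `expand_urls` and the tweet text are made up for illustration, and "raw_description" is only the key name proposed earlier in this issue, not something gallery-dl is known to write.

```python
def expand_urls(text, url_entities):
    """Replace each shortened URL in `text` with its expanded form."""
    for entity in url_entities:
        expanded = entity.get("expanded_url")
        if expanded:
            # Plain string replacement avoids having to adjust the
            # "indices" offsets after every substitution.
            text = text.replace(entity["url"], expanded)
    return text


# The entity is copied from the comment above; the tweet text is hypothetical.
entities = {"urls": [{
    "url": "https://t.co/0M2GF71p8e",
    "expanded_url": "https://www.romhacking.net/hacks/5927/",
    "display_url": "romhacking.net/hacks/5927/",
    "indices": [60, 83],
}]}
text = "New hack released: https://t.co/0M2GF71p8e"

metadata = {
    "raw_description": text,  # hypothetical key, keeps the original text
    "description": expand_urls(text, entities["urls"]),
}
print(metadata["description"])
# prints "New hack released: https://www.romhacking.net/hacks/5927/"
```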

Though for some reason, the tweet I mentioned has a second URL in tweet["full_text"] whose data is located in tweet["extended_entities"]["media"]. The link doesn't show up in the tweet itself, so is this a bug/oversight?

@mikf
Owner

mikf commented May 15, 2021

Done (41457db), and thanks for pointing out that the data provided by Twitter already has the expanded URLs; otherwise I might have used HEAD requests for this.

Though for some reason, the tweet I mentioned has a second URL in tweet["full_text"] whose data is located in tweet["extended_entities"]["media"]. The link doesn't show up in the tweet itself, so is this a bug/oversight?

Each Tweet with images or videos has a t.co URL that links to itself for some reason.
For example https://twitter.com/pixloen/status/1392646554910085120 contains https://t.co/CFjE0XnTwN, which redirects to https://twitter.com/pixloen/status/1392646554910085120/photo/1.
Should these "useless" URLs be removed or are they fine as is?

@God-damnit-all
Contributor

Damn, this looks like a useful change. I wish there were a way to go back and resolve all the t.co urls in all the metadata files I currently have.
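
For metadata files that were written without the entity data, one option is the HEAD-request approach mentioned above. The following is only a sketch under assumptions: the names `TCO_PATTERN`, `resolve_with_head`, and `expand_tco_links` are hypothetical, nothing here is part of gallery-dl, and whether t.co still answers for links from deleted tweets is untested.

```python
import re
import urllib.request

# Assumed shape of a t.co link: scheme, host, and an alphanumeric ID.
TCO_PATTERN = re.compile(r"https?://t\.co/\w+")


def resolve_with_head(url, timeout=10):
    """Send a HEAD request and return the URL reached after redirects."""
    request = urllib.request.Request(url, method="HEAD")
    # urlopen follows redirects on its own; .url is the final location
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.url


def expand_tco_links(value, resolve=resolve_with_head):
    """Recursively expand t.co links in the strings of a decoded JSON value."""
    if isinstance(value, str):
        return TCO_PATTERN.sub(lambda m: resolve(m.group(0)), value)
    if isinstance(value, list):
        return [expand_tco_links(v, resolve) for v in value]
    if isinstance(value, dict):
        return {k: expand_tco_links(v, resolve) for k, v in value.items()}
    return value  # numbers, booleans, None


# Offline demonstration with a stand-in resolver and made-up data.
known = {"https://t.co/abc123": "https://example.org/page"}
sample = {"content": "see https://t.co/abc123", "tweet_id": 5}
print(expand_tco_links(sample, resolve=lambda u: known[u]))
# prints {'content': 'see https://example.org/page', 'tweet_id': 5}
```

Passing the resolver as a parameter keeps the traversal testable without network access and makes it easy to add caching or rate limiting later.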

And yeah, the useless URLs should definitely be removed.

@Scripter17
Contributor Author

Scripter17 commented May 15, 2021

Should these "useless" URLs be removed or are they fine as is?

They probably should be, since anyone going through and parsing it would find it a pain to both figure out where these URLs are coming from and reliably remove them

I wish there were a way to go back and resolve all the t.co urls in all the metadata files I currently have.

Is --rewrite-metadata a viable thing to add? Unless that's already possible with --no-download or something. It could also help if you want updated like/retweet statistics

mikf added a commit that referenced this issue May 17, 2021
The 'full_text' of Tweets with media content usually ends with a t.co
link to itself. This commit removes those.
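
As a rough illustration of what removing that trailing link involves (not the code from the commit itself), the sketch below assumes each entry in tweet["extended_entities"]["media"] carries its own t.co link under a "url" key. The function name `strip_media_link` and the sample tweet text are hypothetical.

```python
def strip_media_link(tweet):
    """Return full_text without a trailing t.co link to the tweet's own media."""
    text = tweet["full_text"]
    media = tweet.get("extended_entities", {}).get("media", [])
    for item in media:
        url = item.get("url")
        # Only strip the link when it is the very last thing in the text
        if url and text.endswith(url):
            text = text[:-len(url)].rstrip()
            break
    return text


# Made-up tweet text; the t.co link is the one quoted earlier in this thread.
tweet = {
    "full_text": "New drawing https://t.co/CFjE0XnTwN",
    "extended_entities": {"media": [{"url": "https://t.co/CFjE0XnTwN"}]},
}
print(strip_media_link(tweet))
# prints "New drawing"
```
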
@Scripter17
Contributor Author

So it turns out that t.co links aren't expanded in, say, author descriptions

I'm not sure if it's as simple to fix those as it was for tweet contents, but I imagine it would be

@Scripter17
Contributor Author

Yep, in _transform_user at line 210 of twitter.py, user has link replacements located at user["entities"]["url"] for the "my site" thing and user["entities"]["description"] for the user description
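
A sketch of what expanding those could look like, assuming both user["entities"]["url"] and user["entities"]["description"] contain a "urls" list with the same entry shape as tweet entities. The function name `expand_user_urls` and the sample user are invented for illustration; this is not the eventual patch.

```python
def expand_user_urls(user):
    """Return (description, url) for a user with t.co links expanded."""
    entities = user.get("entities", {})

    description = user.get("description") or ""
    for entity in entities.get("description", {}).get("urls", []):
        if entity.get("expanded_url"):
            description = description.replace(entity["url"], entity["expanded_url"])

    url = user.get("url")
    for entity in entities.get("url", {}).get("urls", []):
        if entity.get("url") == url and entity.get("expanded_url"):
            url = entity["expanded_url"]
            break

    return description, url


# Hypothetical user object; all values are made up.
user = {
    "description": "Commissions: https://t.co/aaa111",
    "url": "https://t.co/bbb222",
    "entities": {
        "description": {"urls": [
            {"url": "https://t.co/aaa111", "expanded_url": "https://example.org/commissions"},
        ]},
        "url": {"urls": [
            {"url": "https://t.co/bbb222", "expanded_url": "https://example.org/"},
        ]},
    },
}
print(expand_user_urls(user))
# prints ('Commissions: https://example.org/commissions', 'https://example.org/')
```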

I'll work on the patch sometime tomorrow if you don't get to it first

(Sidenote: Should there be a --write-verbose-metadata for all metadata gallery-dl encounters? If someone deletes their account before the patch is added, that metadata is gone. Side-sidenote: Does -o skip=true rewrite metadata files? Either way I think there should be a toggle for it like --overwrite-metadata or something.)

@mikf
Owner

mikf commented Aug 21, 2021

Should there be a --write-verbose-metadata for all metadata gallery-dl encounters?

You mean one that triggers for all events and not just files? There needs to be a default filename format string for things that aren't files first, e.g. one for a Tweet's metadata independent of any actual files.

Does -o skip=true rewrite metadata files?

No, unless the metadata post processor is configured to trigger on event skip.

Either way I think there should be a toggle for it like --overwrite-metadata or something

Just use --no-download --no-skip (or -o download=false -o skip=false)
