Proposal: Identify core/essential metadata and add upload safeties for missing MD #279

Open
vxbinaca opened this issue Feb 16, 2023 · 9 comments
Labels
comment-request Proposal from devs/community

Comments

@vxbinaca
Collaborator

See title.

The Twitch extractor currently does not add channel metadata; the TikTok extractor, though broken, did the same. This proposal is aimed at YouTube.

@vxbinaca vxbinaca added the bug label Feb 16, 2023
@vxbinaca vxbinaca changed the title Proposal: Identify core/essential metadata and add upload safties for missing MD Proposal: Identify core/essential metadata and add upload safeties for missing MD Feb 16, 2023
@vxbinaca vxbinaca added feature request comment-request Proposal from devs/community and removed bug feature request labels Feb 16, 2023
@brandongalbraith
Collaborator

@vxbinaca What YouTube metadata are we currently not uploading to IA that yt-dlp is able to extract?

@vxbinaca
Collaborator Author

> @vxbinaca What YouTube metadata are we currently not uploading to IA that yt-dlp is able to extract?

If it can be extracted, it's in the JSON. I'm talking about the minimum metadata for items. Harking back to the recent uploader_ID deficiency in yt-dlp, there was a similar one five or so years ago where the extractor briefly failed to set the creator value.

What I'm saying is, we'd need the creator, the video URL, and the time to be present to make a valid item where the creator isn't tubeup.py. There should be a safety to prevent item creation if the creator metadata isn't present, and likewise for the other core fields.

Not every missing field needs to halt item creation, but it would be helpful to identify what counts as core metadata and insert safeties that prevent uploading if it's missing.
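
To make the proposed safety concrete, here is a minimal sketch in Python. It is not tubeup's code: `CORE_FIELDS`, `MissingCoreMetadata`, `missing_core_fields`, and `ensure_uploadable` are hypothetical names, and mapping "creator, video URL, and time" onto the info-dict keys `uploader`, `webpage_url`, and `upload_date` is an assumption about what the core set might be.

```python
# Hypothetical core set: creator, video URL, and time, expressed as
# keys that commonly appear in an extractor's info dict (assumption).
CORE_FIELDS = ("uploader", "webpage_url", "upload_date")


class MissingCoreMetadata(Exception):
    """Raised when an item lacks metadata treated as essential (hypothetical)."""


def missing_core_fields(info, required=CORE_FIELDS):
    """Return the required keys that are absent, None, or blank in `info`."""
    missing = []
    for key in required:
        value = info.get(key)
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(key)
    return missing


def ensure_uploadable(info, required=CORE_FIELDS):
    """Raise before any item is created if core metadata is missing."""
    missing = missing_core_fields(info, required)
    if missing:
        raise MissingCoreMetadata(
            "refusing to create item; missing: " + ", ".join(missing)
        )
```

Calling `ensure_uploadable(info)` before item creation would stop the upload early, leaving the files on disk rather than creating an item with a placeholder creator.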

@brandongalbraith
Collaborator

After thinking about this for a bit, I'm leaning toward erring on the side of being conservative from a cultural preservation perspective. As long as enough metadata exists to upload an item (i.e., a unique identifier from the service the content is being retrieved from), tubeup should continue on so the artifacts are preserved. If metadata can be derived in the future from other data sources (or programmatic analysis of the uploaded artifacts), so be it.
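
The preservation-first policy described above could be sketched like this in Python. `plan_upload` and the `backfill_later` key are hypothetical names, and the `wanted` field list is only an assumption; the one hard requirement is the service's own identifier, read here from the `id` key.

```python
def plan_upload(info, wanted=("uploader", "webpage_url", "upload_date")):
    """Proceed whenever the source's unique identifier exists.

    Fields that are absent do not block the upload; they are only
    recorded so they can be derived or filled in later.
    """
    has_id = bool(info.get("id"))
    absent = [key for key in wanted if not info.get(key)]
    return {"proceed": has_id, "backfill_later": absent}
```

The design choice is that a missing creator or date lowers the item's quality but never stops the artifact from being preserved.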

@vxbinaca
Collaborator Author

Given the growing issues with live chat extraction (broken on YouTube live videos IIRC, and definitely on Twitch), should live chat be considered core metadata, like manual subtitles (auto subs suck)?

Are channel URLs core metadata? Twitch doesn't give them; it gives the video URL instead.

@mrpapersonic
Collaborator

mrpapersonic commented Aug 16, 2024

> should live chat be considered core metadata, like manual subtitles (auto subs suck)?

If we're considering live chat to be metadata that is in scope for tubeup to handle, then yes. Though I would argue that live chat belongs in the same realm as comments, as in it's not really our problem to deal with, since realistically we should only be handling the video itself and its surrounding metadata.

> Are channel URLs core metadata?

Only on platforms where that URL cannot be changed at will by the user (see: YouTube and channel IDs). In other cases, where the user is able to change the URL/ID at will, it's not very useful at all. In fact, tubeup now uses the stupid channel handles YouTube added, which is fairly annoying in itself.
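
As an illustration of preferring an identifier the user cannot change, a small Python sketch follows. `stable_channel_url` is a hypothetical helper, and it assumes the info dict carries a `channel_id` key and an `extractor_key` of `"Youtube"` for YouTube videos; for every other site it deliberately returns nothing rather than guess.

```python
def stable_channel_url(info):
    """Build a channel URL from the permanent channel ID when one is known.

    Only YouTube is handled, since its channel IDs cannot be changed by
    the user; handle-based URLs are ignored on purpose.
    """
    if info.get("extractor_key") == "Youtube" and info.get("channel_id"):
        return "https://www.youtube.com/channel/" + info["channel_id"]
    return None
```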

p.s. sorry for being like, a year late x)

@vxbinaca
Collaborator Author

vxbinaca commented Aug 16, 2024

No, it's fine. What about extractors, like Bilibili, that didn't provide channel URLs (but I think do now)?

Edit: With YouTube live chat, I believe that's extracted into JSON, but Twitch's is broken with yt-dlp. Do all sites have live chat? No. Do all of them have creator metadata? Yes - mostly.

A current example of this is the OnlyFans TV (OFTV) extractor, which is all kinds of messed up right now and not routing metadata properly. Take any OFTV video and try to rip it.

@mrpapersonic
Collaborator

> What about extractors, like Bilibili, that didn't provide channel URLs (but I think do now)?

I'm not sure, really. It's likely best not to consider channel URLs particularly important. We should warn if a URL could not be found, though, so the user can manually fix it and send a report to yt-dlp to fix/expand the extractors.
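
A warn-but-continue check along those lines might look like the Python sketch below. `warn_on_missing` and the logger name are hypothetical, and treating `channel_url` as the only optional field is an assumption made for the example.

```python
import logging

logger = logging.getLogger("metadata_check")  # hypothetical logger name


def warn_on_missing(info, optional=("channel_url",)):
    """Log a warning for each optional field the extractor did not supply.

    Nothing is raised: the upload can continue, and the returned list
    tells the caller which fields the user may want to fix by hand.
    """
    missing = [key for key in optional if not info.get(key)]
    for key in missing:
        logger.warning(
            "%s gave no %r for %s; fix it manually and consider reporting it to yt-dlp",
            info.get("extractor", "unknown extractor"),
            key,
            info.get("id", "unknown id"),
        )
    return missing
```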

> Take any OFTV video and try to rip it.

no offense but I wouldn't touch that website with a ten foot pole lmao

@vxbinaca
Collaborator Author

OFTV is all non-nude content. It's public-facing and exclusive: lots of cooking classes or game stream clips.

@vxbinaca
Collaborator Author

vxbinaca commented Aug 26, 2024

@mrpapersonic so the issue is that we put a fail-safe in place years ago to prevent blank-creator items from being rejected by IA. The creator tag ends up blank because some extractors (LBRY/Odysee, Instagram posts containing multiple videos, many others) misdirect the Creator tag or don't have it at all. The fail-safe was putting "tubeup.py" into the creator tag. The alternative was item creation failing and the video and metadata being stuck on the user's disk.

What I want to do is find the permutations of the creator tag and account for them, using a few other extractors as test cases to work out what's breaking, because I've been (rightly) telling people for years that the problem isn't us; it's the extractor not naming or accounting for metadata properly.

Anyway last bump on this for a while:

I want to define the minimum metadata elements we need to upload an item, so I can go to the yt-dlp people and ask them to modify their standard for accepting an extractor, and then I or a few of us can go and find which extractors are broken and flag them to meet those minimum standards. Either that, or we use else if to try another metadata element we think will correctly tag an item. This way, there are fewer problems in the future like OFTV or the minor problems with the Odysee or Instagram extractors.
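
The "else if" idea could be sketched in Python as an ordered fallback chain that ends in the existing "tubeup.py" fail-safe. This is not tubeup's current code: `resolve_creator` and `CREATOR_KEYS` are hypothetical names, and the candidate keys and their order are assumptions about which info-dict fields are most likely to hold a usable creator.

```python
# Assumed order of preference; real extractors may need a different one.
CREATOR_KEYS = ("uploader", "channel", "creator", "uploader_id", "channel_id")
FALLBACK_CREATOR = "tubeup.py"  # the fail-safe value described in this thread


def resolve_creator(info, keys=CREATOR_KEYS):
    """Return (creator, source_key), trying each candidate key in order.

    If no candidate holds a non-blank string, fall back to the fail-safe
    value and report None as the source so the caller can flag the item.
    """
    for key in keys:
        value = info.get(key)
        if isinstance(value, str) and value.strip():
            return value.strip(), key
    return FALLBACK_CREATOR, None
```

Returning the key that matched makes it easy to log which extractors only succeed through a later fallback, which is the list one would take to yt-dlp.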


3 participants