Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace IP addresses with 0.0.0.0 #146

Closed
wants to merge 7 commits into from
Closed

Replace IP addresses with 0.0.0.0 #146

wants to merge 7 commits into from

Conversation

ghost
Copy link

@ghost ghost commented Dec 22, 2020

IP addresses in the info.json is now replaced with 0.0.0.0. Tested with IPv4 and IPv6.

Example
image

Test uploads
https://archive.org/download/youtube-hh82cgqftgM
https://archive.org/details/youtube-Ti-DCmmS07Q
https://archive.org/details/youtube-sKoUtJpubqc

@vxbinaca
Copy link
Collaborator

@antonizoon @brandongalbraith If neither of you have qualms about this I'm going to merge, I'm a little confused about how it finds the actual value but it does replace it.

@brandongalbraith
Copy link
Collaborator

@imabritishcow Would you mind adding a corresponding test for this change? You might also consider replacing the IP address with REDACTED instead of 0.0.0.0 so that the intent is clear.

With regards to the regex pattern, it's a bit unwieldy, but I'm happy to ship a meh solution now and a more refined solution later so that IP addresses of tubeup users isn't being committed permanently to the Internet Archive.

@ghost
Copy link
Author

ghost commented Dec 23, 2020

I have changed it to replace with REDACTED but i'm not really sure how to create a test.

@vxbinaca
Copy link
Collaborator

you need to edit either the Travis.cl code or if we switch to Github tests (ahem anyone else is wanna look over that PR) it should test each PR.

@brandongalbraith
Copy link
Collaborator

@imabritishcow Regarding a test for this enhancement, we want to ensure we're confident we're only replacing the user's IP address by using the regex provided. Did you test the provided regex pattern on common IP address formats?

@ghost
Copy link
Author

ghost commented Dec 23, 2020

Yes, I have tested the regex on both IPv4 and IPv6 addresses

@ghost
Copy link
Author

ghost commented Dec 28, 2020

This isn't a great solution, i'll work on something better.

@brandongalbraith
Copy link
Collaborator

Appreciate it @imabritishcow 🙏.

TubeUp is executed not just from the command line, but is also baked into pipelines and other automated systems folks are running for archival purposes. Definitely want to ensure we're not munging data or erroring out unexpectedly.

@ghost
Copy link
Author

ghost commented Dec 28, 2020

This would have removed any IP addresses in the description or title. I'm working on a solution that will edit only the formats section.

@brandongalbraith
Copy link
Collaborator

Good catch! We'll keep an eye out for an updated review when your new commits land.

@vxbinaca
Copy link
Collaborator

Yeah please be really careful with metadata. Half of the uploads from this script are automated (mostly me). I have limited mental bandwidth to vet that a change doesn't have knockon effects down the line.

I'm open to a solution but not a board one. Maybe have it look for a value in JSON and snip only that value.

@ghost
Copy link
Author

ghost commented Dec 28, 2020

Improved IP removal and created a new test for it.

@ghost
Copy link
Author

ghost commented Dec 28, 2020

The IPv6 addresses in the file are URL encoded, the regex won't match regular IPv6 addresses but it works on URL encoded IPv6 like FE80%3A0000%3A0000%3A0000%3A0202%3AB3FF%3AFE1E%3A8329

@ghost
Copy link
Author

ghost commented Jan 10, 2021

@vxbinaca @brandongalbraith What do you think? Should I fix the old solution and use that or use the new solution?

@puigru
Copy link
Contributor

puigru commented Jan 20, 2021

I think a better way to implement this would be to redact based on a match of the public IP address of the network. A regular expression could end up overwriting something else, and introduces a lot of complexity, especially if we want to extend it to IPv6.
We could introduce a switch like --redact=<string> that allows redacting of a specific substring.
In addition, we could make a request to a known service like http://ipinfo.io/json to retrieve our public IP address and redact that, I would also make this a switch.

@vxbinaca
Copy link
Collaborator

Why can't you just go in and post-process the JSON, strip out that or whatever other element you want?

Or, maybe use tubeup on a VPS like the README says. Or take the privacy L and accept it and run it on your desktop.

These seem to be easier solutions than this. I'll keep the issue open but this seems to be a lot of work and a lot of risk of destroying metadata than just dropping elemnets within the JSON entirely, changing behavoir/mindset or just accepting the privacy loss.

@ghost
Copy link

ghost commented Jan 20, 2021

@vxbinaca

If you do not feel comfortable with this, send a pull request that reliably removes both IPv6/6 addresses and the filepath, or do not use Tubeup.

The README doesn't actually tell you to use a VPS. I've personally been using a VPS but it's not straightforward because you have to mess around with the source code if the VPS doesn't have storage for the videos. I rented 100GB from digital ocean but it was mounted elsewhere so I had to change the source code to save the files in the hard drive.

I think youtube is the only offender here. Personally I just deleted those fields that contain the ip address and moved on with my life.

@XXLuigiMario

I don't think that's a bad idea. Though I remember in the past ipinfo restricted the amount of requests you could make. Not sure if they still do that, but something to keep in mind.

@vxbinaca
Copy link
Collaborator

I rented 100GB from digital ocean but it was mounted elsewhere so I had to change the source code to save the files in the hard drive.

Digital Oceans problem. Tubeup expects ~/.tubeup/downloads will be on the drive with actual storage space.

@ghost ghost closed this Jun 1, 2021
@bibanon bibanon deleted a comment from toscompliantname Sep 10, 2021
@bibanon bibanon locked as off-topic and limited conversation to collaborators Sep 10, 2021
This pull request was closed.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants