
Wayback Machine Integration #59

Closed
FIGBERT opened this issue Jan 8, 2021 · 7 comments · Fixed by #150
Labels
enhancement New feature or request

Comments


FIGBERT commented Jan 8, 2021

I think a really interesting feature for linkding to adopt would be an integration with the Internet Archive (IA). I've looked into this myself and got an almost-working prototype, but ran into issues with the response time and rate limiting of the IA API.

In short, I added an archive_url field to each Bookmark and queried the https://web.archive.org/save/{bookmark.url} endpoint on creation. This creates a backup and returns a link to it in the response's Location header. However, as mentioned above, there were two issues:

  1. The response time of the archive API is quite slow and blocks the main thread until completion. This could potentially be solved with some sort of asynchronous request, but I'm not familiar enough with Django to know how to implement this.
  2. Though I was successful in making several requests to /save, there appears to be some sort of rate limiting, with very little information on how it works (that I could find).
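The prototype flow described above could be sketched roughly like this (function names are hypothetical, and the exact redirect behaviour of the Save Page Now endpoint is an assumption; as noted, the call is slow and rate-limited, so this shouldn't run on the request thread):

```python
from urllib.request import Request, urlopen


def snapshot_url_from_location(location):
    """Normalize a Save Page Now Location header into an absolute snapshot URL."""
    if not location:
        return None
    if location.startswith("/"):
        # Relative Location, e.g. /web/20210108000000/https://example.com
        return "https://web.archive.org" + location
    return location


def create_snapshot(url, timeout=60):
    """Ask the Wayback Machine to archive `url`; returns the snapshot URL or None.

    Sketch only: this request can take a long time and may be rejected
    due to rate limiting, so in practice it belongs in a background job.
    """
    request = Request("https://web.archive.org/save/" + url)
    with urlopen(request, timeout=timeout) as response:
        # urllib follows redirects, so the final URL is usually the snapshot;
        # prefer an explicit Location header if one is still present
        return snapshot_url_from_location(
            response.headers.get("Location") or response.url
        )
```

The snapshot URL returned here is what would be stored in the hypothetical archive_url field on the Bookmark.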

Regardless, I think this is a really cool feature that I would definitely be interested in (implementing and using).

sissbruecker (Owner) commented

I'm not sure I understand 100% what this feature is supposed to do. Here's what I understood:

  • user creates a bookmark
  • ask IA to create a snapshot of the bookmarked URL; IA returns a URL to the snapshot
  • store the snapshot URL as part of the bookmark
  • in addition to the bookmark URL, display a link to the snapshot URL in the bookmark view, so if the original URL is no longer available the user can still click the snapshot URL?

If so, then I think it might be a cool idea, though I'm very particular about keeping the UI simple and we would have to see how we can introduce another link to display per bookmark.

As for the IA API - if there are restrictions around request duration and rate limiting, then yes, creating the snapshot should happen asynchronously in the background, and it should probably be retried until it succeeds. That basically calls for a job/task queue. I was recently looking into something like this for #43, which would also need to run async. The simplest solution I could find was django-background-tasks, which uses the Django ORM to manage the task queue instead of requiring a separate application such as Redis as a message queue. Using the library and writing a job to create the snapshot is probably not too difficult. The queue then keeps rescheduling a job until it succeeds, which handles possible failures due to rate limiting.
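The retry-until-success behaviour the task queue would provide can be illustrated in plain Python (django-background-tasks handles this via its own scheduler; `run_with_retries` and the backoff numbers here are hypothetical):

```python
import time


def run_with_retries(task, max_attempts=5, base_delay=1.0):
    """Run `task`, retrying with exponential backoff on failure.

    Sketch of what a task queue does when it keeps rescheduling a failed
    job, e.g. a snapshot request rejected by IA rate limiting.
    """
    for attempt in range(max_attempts):
        try:
            return task()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            time.sleep(base_delay * 2 ** attempt)  # back off before retrying
```

With django-background-tasks, the equivalent would be a task function the queue reschedules on failure; the backoff parameters above are arbitrary.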

The bigger issue with a job queue is actually that the queue processor needs to be run separately from the Django app which means two processes would need to be started. Here I don't have much experience in how to best apply this to the Docker image. Recommendation is to run a single service per container, which means the ideal solution would be to run two containers. But that is out of the question for linkding, since the basic idea is to keep it simple and easy to use. Other than that it's possible to start multiple processes in a Docker container, for example using:

# Start background task processor
python manage.py process_tasks &
# Start uwsgi server
uwsgi uwsgi.ini

(notice the &)

That works, but I have to admit I don't really know what this command does 😅 - what happens if only one of the processes stops? Does it stop the container, or does it keep running? Someone would need to dig into this and figure out how to make it work safely.
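For what it's worth, with a bare `&` the container keeps running even if the background process dies. One possible answer (a sketch, assuming bash 4.3+ for `wait -n`, with sleep commands standing in for the real processes) is an entrypoint that exits as soon as either process stops, so Docker's restart policy can take over:

```shell
#!/usr/bin/env bash
# Sketch: exit as soon as either background process dies, instead of
# leaving the container half-alive. The stand-in commands below replace
# "python manage.py process_tasks" and "uwsgi uwsgi.ini".
(sleep 0.2; exit 7) &          # stand-in for the task processor, which crashes
sleep 10 &                     # stand-in for the uwsgi server
SURVIVOR=$!

wait -n                        # returns as soon as the first background job exits
STATUS=$?
kill "$SURVIVOR" 2>/dev/null   # stop the surviving process
echo "a background process exited with status $STATUS"
```

The entrypoint would then exit with a non-zero status so the container stops visibly rather than limping along with only one process.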

Anyway sorry for dumping this here, just clearing my thoughts around this 🙂.


FIGBERT commented Jan 9, 2021

Definitely hit the nail on the head with the feature breakdown. In terms of UI, I had placed an "archive" button before the "edit" button that links directly to the URL returned by IA. We may have to think more about the exact terminology, though, given #46.

In terms of the technical implementation, I definitely think that django-background-tasks is the way to go. Looking into the "multiple processes in one container" problem a little more, it seems like the best way to go about this, imo, is to use a cron job inside the container. That way, instead of running two commands, we could do something like this:

# Shell form, since the JSON exec form doesn't invoke a shell and would
# pass "&&" to cron as a literal argument
CMD cron && ./bootstrap.sh

A cron job that runs at the same interval as the duration of process_tasks is also listed in their documentation as a simple way to control the tasks.
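Concretely, that cron-based setup might look something like this (hypothetical paths and schedule; `--duration` limits how long each `process_tasks` run lives so the next cron invocation can take over):

```
# Hypothetical crontab entry: run the task processor for ~10 minutes,
# re-launched by cron every 10 minutes
*/10 * * * * python /app/manage.py process_tasks --duration 600
```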


bachya commented Feb 19, 2021

The bigger issue with a job queue is actually that the queue processor needs to be run separately from the Django app which means two processes would need to be started. Here I don't have much experience in how to best apply this to the Docker image. Recommendation is to run a single service per container, which means the ideal solution would be to run two containers. But that is out of the question for linkding, since the basic idea is to keep it simple and easy to use.

What about using a process control system like supervisord? This type of thing is tailor-made for running multiple related processes from a single entry point. It benefits both Docker and non-Docker use cases.

You could modify bootstrap.sh to launch supervisord:

#!/usr/bin/env bash
# Bootstrap script that gets executed in new Docker containers

# Create data folder if it does not exist
mkdir -p data

# Run database migration
python manage.py migrate
# Generate secret key file if it does not exist
python manage.py generate_secret_key

# Ensure the DB folder is owned by the right user
chown -R www-data: /etc/linkding/data

# Start supervisord
/usr/bin/supervisord -c /etc/supervisord.conf

...and supervisord could launch both the uwsgi server and the task queue:

[supervisord]
nodaemon=true
loglevel=info
user=root

[program:app]
command=uwsgi uwsgi.ini
stdout_logfile=/dev/stdout
stdout_logfile_maxbytes=0
redirect_stderr=true

[program:jobs]
command=python manage.py process_tasks
stdout_logfile=/dev/stdout
stdout_logfile_maxbytes=0
redirect_stderr=true

(actual content will likely vary, but you get the idea)

supervisord would handle process priority, restarting processes that die, log aggregation for the Docker image, etc.

EDIT: interestingly, the django-background-tasks docs mention supervisord explicitly:

The alternative is to use a grown-up program like supervisord to handle this for you.

@sissbruecker If you'd like, I'd be happy to put together a small set of PRs centered on introducing supervisord in the existing stack, just to see if it works well; if so, they can be the basis upon which this issue (and others, like #68) are built. Let me know!

sissbruecker (Owner) commented

@bachya Thanks for the valuable input, this looks promising! I'm not sure I'll be able to work on this in the foreseeable future, and I wouldn't want to waste anyone's time by having them start working on this and then not being able to provide feedback. I might get back to you when I find the time; plus, with your input, I might have a good starting point to put this together myself.

sissbruecker (Owner) commented

Opened a draft PR for this: #150

Still some details left that need to be looked at.


FIGBERT commented Sep 5, 2021

Unbelievable – thank you so much!


xuhcc commented Sep 5, 2021

So with this addition, linkding effectively leaks every URL you save to web.archive.org? I think it should be disabled by default.
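Making it opt-in could be as simple as gating the snapshot task behind a setting that defaults to off; everything here is hypothetical naming, not linkding's actual model:

```python
class UserProfile:
    """Stand-in for a per-user settings model (hypothetical names)."""

    def __init__(self, enable_web_archive=False):
        # Off by default: no URL leaves the server unless the user opts in
        self.enable_web_archive = enable_web_archive


def should_create_snapshot(profile):
    """Only allow a Wayback Machine request when the user has opted in."""
    return profile.enable_web_archive


# Hypothetical usage at bookmark-creation time:
# if should_create_snapshot(request.user.profile):
#     create_web_archive_snapshot(bookmark.id)  # hypothetical background task
```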
