
Wayback Machine Integration #59

Closed
FIGBERT opened this issue Jan 8, 2021 · 7 comments · Fixed by #150
Labels
enhancement New feature or request

Comments


FIGBERT commented Jan 8, 2021

I think a really interesting feature for linkding to adopt would be an integration with the Internet Archive (IA). I've looked into this myself and got an almost-working prototype, but ran into issues with the response time and rate limiting of the IA API.

In short, I added an archive_url field to each Bookmark and queried the https://web.archive.org/save/{bookmark.url} endpoint on creation. This creates a backup and returns a link to it in the response's Location header. However, as mentioned above, there were two issues:

  1. The response time of the archive API is quite slow and blocks the main thread until completion. This could potentially be solved with some sort of asynchronous request, but I'm not familiar enough with Django to know how to implement this.
  2. Though I was successful in making several requests to /save, there appears to be some sort of rate limiting, with very little information on how it works (that I could find).
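The prototype flow described above could be sketched roughly like this (function names are hypothetical, and the exact redirect behaviour of the Save Page Now endpoint is an assumption; as noted, the call is slow and rate-limited, so this shouldn't run on the request thread):

```python
from urllib.request import Request, urlopen


def snapshot_url_from_location(location):
    """Normalize a Save Page Now Location header into an absolute snapshot URL."""
    if not location:
        return None
    if location.startswith("/"):
        # Relative Location, e.g. /web/20210108000000/https://example.com
        return "https://web.archive.org" + location
    return location


def create_snapshot(url, timeout=60):
    """Ask the Wayback Machine to archive `url`; returns the snapshot URL or None.

    Sketch only: this request can take a long time and may be rejected
    due to rate limiting, so in practice it belongs in a background job.
    """
    request = Request("https://web.archive.org/save/" + url)
    with urlopen(request, timeout=timeout) as response:
        # urllib follows redirects, so the final URL is usually the snapshot;
        # prefer an explicit Location header if one is still present
        return snapshot_url_from_location(
            response.headers.get("Location") or response.url
        )
```

The snapshot URL returned here is what would be stored in the hypothetical archive_url field on the Bookmark.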

Regardless, I think this is a really cool feature that I would definitely be interested in (implementing and using).

sissbruecker (Owner) commented

I'm not sure I understand 100% what this feature is supposed to do. Here's what I understood:

  • user creates a bookmark
  • ask IA to create a snapshot of the bookmarked URL; IA returns a URL to the snapshot
  • store the snapshot URL as part of the bookmark
  • in addition to the bookmark URL, display a link to the snapshot URL in the bookmark view, so if the original URL is no longer available the user can still click the snapshot URL?

If so, then I think it might be a cool idea, though I'm very particular about keeping the UI simple and we would have to see how we can introduce another link to display per bookmark.

As for the IA API - if there are restrictions around request duration and rate limiting, then yes, creating the snapshot should happen asynchronously in the background, and it should probably be retried until it succeeds. That basically calls for a job/task queue. I was recently looking into something like this for #43, which would also need to run async. The simplest solution I could find was django-background-tasks, which uses the Django ORM to manage the task queue instead of requiring a separate application such as Redis as a message queue. Using the library and writing a job to create the snapshot is probably not too difficult. The queue then keeps rescheduling a job until it succeeds, which handles possible failures due to rate limiting.
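The retry-until-success behaviour the task queue would provide can be illustrated in plain Python (django-background-tasks handles this via its own scheduler; `run_with_retries` and the backoff numbers here are hypothetical):

```python
import time


def run_with_retries(task, max_attempts=5, base_delay=1.0):
    """Run `task`, retrying with exponential backoff on failure.

    Sketch of what a task queue does when it keeps rescheduling a failed
    job, e.g. a snapshot request rejected by IA rate limiting.
    """
    for attempt in range(max_attempts):
        try:
            return task()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            time.sleep(base_delay * 2 ** attempt)  # back off before retrying
```

With django-background-tasks, the equivalent would be a task function the queue reschedules on failure; the backoff parameters above are arbitrary.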

The bigger issue with a job queue is actually that the queue processor needs to be run separately from the Django app which means two processes would need to be started. Here I don't have much experience in how to best apply this to the Docker image. Recommendation is to run a single service per container, which means the ideal solution would be to run two containers. But that is out of the question for linkding, since the basic idea is to keep it simple and easy to use. Other than that it's possible to start multiple processes in a Docker container, for example using:

# Start background task processor
python manage.py process_tasks &
# Start uwsgi server
uwsgi uwsgi.ini

(notice the &)

That works, but I have to admit I don't really know what this command does 😅 - what happens if only one of the processes stops? Does it stop the container, or does it keep running? Someone would need to dig into this and figure out how to make it work safely.
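For what it's worth, with a bare `&` the container keeps running even if the background process dies. One possible answer (a sketch, assuming bash 4.3+ for `wait -n`, with sleep commands standing in for the real processes) is an entrypoint that exits as soon as either process stops, so Docker's restart policy can take over:

```shell
#!/usr/bin/env bash
# Sketch: exit as soon as either background process dies, instead of
# leaving the container half-alive. The stand-in commands below replace
# "python manage.py process_tasks" and "uwsgi uwsgi.ini".
(sleep 0.2; exit 7) &          # stand-in for the task processor, which crashes
sleep 10 &                     # stand-in for the uwsgi server
SURVIVOR=$!

wait -n                        # returns as soon as the first background job exits
STATUS=$?
kill "$SURVIVOR" 2>/dev/null   # stop the surviving process
echo "a background process exited with status $STATUS"
```

The entrypoint would then exit with a non-zero status so the container stops visibly rather than limping along with only one process.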

Anyway sorry for dumping this here, just clearing my thoughts around this 🙂.


FIGBERT commented Jan 9, 2021

Definitely hit the nail on the head with the feature breakdown. In terms of UI, I had placed an "archive" button before the "edit" button that links directly to the URL returned by IA. We may have to think more about the exact terminology, though, given #46.

In terms of the technical implementation, I definitely think that django-background-tasks is the way to go. Looking into the "multiple processes in one container" problem a little more, it seems like the best way to go about this, imo, is to use a cron job inside the container. That way, instead of running two commands, we could do something like this:

# Shell form, since the JSON exec form doesn't invoke a shell and would
# pass "&&" to cron as a literal argument
CMD cron && ./bootstrap.sh

A cron job that runs at the same interval as the duration of process_tasks is also listed in their documentation as a simple way to control the tasks.
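Concretely, that cron-based setup might look something like this (hypothetical paths and schedule; `--duration` limits how long each `process_tasks` run lives so the next cron invocation can take over):

```
# Hypothetical crontab entry: run the task processor for ~10 minutes,
# re-launched by cron every 10 minutes
*/10 * * * * python /app/manage.py process_tasks --duration 600
```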


bachya commented Feb 19, 2021

The bigger issue with a job queue is actually that the queue processor needs to be run separately from the Django app which means two processes would need to be started. Here I don't have much experience in how to best apply this to the Docker image. Recommendation is to run a single service per container, which means the ideal solution would be to run two containers. But that is out of the question for linkding, since the basic idea is to keep it simple and easy to use.

What about using a process control system like supervisord? This type of thing is tailor-made for running multiple related processes from a single entry point. It benefits both Docker and non-Docker use cases.

You could modify bootstrap.sh to launch supervisord:

#!/usr/bin/env bash
# Bootstrap script that gets executed in new Docker containers

# Create data folder if it does not exist
mkdir -p data

# Run database migration
python manage.py migrate
# Generate secret key file if it does not exist
python manage.py generate_secret_key

# Ensure the DB folder is owned by the right user
chown -R www-data: /etc/linkding/data

# Start supervisord
/usr/bin/supervisord -c /etc/supervisord.conf

...and supervisord could launch both the uwsgi server and the task queue:

[supervisord]
nodaemon=true
loglevel=info
user=root

[program:app]
command=uwsgi uwsgi.ini
stdout_logfile=/dev/stdout
stdout_logfile_maxbytes=0
redirect_stderr=true

[program:jobs]
command=python manage.py process_tasks
stdout_logfile=/dev/stdout
stdout_logfile_maxbytes=0
redirect_stderr=true

(actual content will likely vary, but you get the idea)

supervisord would handle process priority, restarting processes that die, log aggregation for the Docker image, etc.

EDIT: interestingly, the django-background-tasks docs mention supervisord explicitly:

The alternative is to use a grown-up program like supervisord to handle this for you.

@sissbruecker If you'd like, I'd be happy to put together a small set of PRs centered on introducing supervisord in the existing stack, just to see if it works well; if so, they can be the basis upon which this issue (and others, like #68) are built. Let me know!

sissbruecker (Owner) commented

@bachya Thanks for the valuable input, this looks promising! I'm not sure I'll be able to work on this in the foreseeable future, and I wouldn't want to waste anyone's time by having them start working on this and then not being able to provide feedback. I might get back to you when I find the time; plus, with your input, I might have a good starting point to put this together myself.

sissbruecker (Owner) commented

Opened a draft PR for this: #150

Still some details left that need to be looked at.


FIGBERT commented Sep 5, 2021

Unbelievable – thank you so much!


xuhcc commented Sep 5, 2021

So with this addition, linkding effectively leaks every URL you save to web.archive.org? I think it should be disabled by default.
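Making it opt-in could be as simple as gating the snapshot task behind a setting that defaults to off; everything here is hypothetical naming, not linkding's actual model:

```python
class UserProfile:
    """Stand-in for a per-user settings model (hypothetical names)."""

    def __init__(self, enable_web_archive=False):
        # Off by default: no URL leaves the server unless the user opts in
        self.enable_web_archive = enable_web_archive


def should_create_snapshot(profile):
    """Only allow a Wayback Machine request when the user has opted in."""
    return profile.enable_web_archive


# Hypothetical usage at bookmark-creation time:
# if should_create_snapshot(request.user.profile):
#     create_web_archive_snapshot(bookmark.id)  # hypothetical background task
```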
