How to use Eliot with scrapy? #439
Unfortunately you can't add action context from inside individual callbacks... You need to start actions outside, but then, as you say, `DeferredContext` is needed to track things. Once Twisted has native support for contextvars (twisted/twisted#1192), the use of `DeferredContext` won't be necessary. As a stopgap measure there are a few things you can do. Not ideal, but something.
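For reference, the `DeferredContext` pattern looks roughly like this (`fetch_page` here is just a placeholder for a real Deferred-returning call):

```python
from eliot import start_action
from eliot.twisted import DeferredContext
from twisted.internet.defer import succeed

def fetch_page(url):
    # Placeholder for a real Twisted/scrapy fetch that returns a Deferred.
    return succeed("<html>...</html>")

def fetch_with_logging(url):
    action = start_action(action_type="fetch", url=url)
    with action.context():
        # Callbacks added through the DeferredContext run inside the action.
        d = DeferredContext(fetch_page(url))
        d.addCallback(len)
        # Finish the action when the Deferred fires, success or failure.
        d.addActionFinish()
        return d.result
```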
I guess, looking at what you're doing in more detail, that might also be a reasonable stopgap, just: you can only call `continue_task` once per serialized task ID.
Thanks for the fast reply! This was super helpful. I've tried a few different options, and so far, my understanding is:
So it's not perfect: I still get weird non-chronological output in some places, and it requires thinking carefully about where the context-manager calls and serialization calls happen, but the result is already way more readable than what I had only 3 hours ago 🎉
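Concretely, the arrangement that seems to behave best so far is to serialize a fresh task ID at each stage before handing the item on, instead of continuing the same ID twice (the `eliot_task_id` field name is my own):

```python
from eliot import Action, start_action

def pipeline_stage(item):
    # Continue from wherever the previous stage left off...
    with Action.continue_task(task_id=item["eliot_task_id"]) as action:
        with start_action(action_type="process_stage"):
            pass  # the actual work for this stage
        # ...then serialize a *fresh* ID so the next stage continues
        # from here, rather than re-continuing the original ID.
        item["eliot_task_id"] = action.serialize_task_id()
    return item
```

As far as I can tell, serializing again inside the continued action gives the next stage its own task level, so no ID ever gets continued twice.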
First of all, thanks for the great library. I haven't gotten it to work yet, but I'm already impressed :)
I am writing a fairly complex scraper with scrapy that fetches files in a tree-like way: a first index file yields a number of other files to fetch, each of which has several sections to process independently. The processing is fairly complex, so I am struggling to track errors, and eliot looks like a super promising solution (my first attempt was very exciting to look at, although it doesn't quite work yet).
In short, scrapy is built on top of Twisted, but I obviously don't want to modify scrapy's code as described in the docs. To make things worse, the initial scraping process uses generators everywhere, so keeping track of the current action is tricky.
`scrapy` uses the concept of an `item` that it passes around between callbacks to transfer data. There is an initial scraping process which yields requests to which this `item` is attached. Then `scrapy` passes this `item` to a series of `pipelines`, each of which gets the `item`, modifies it, and returns it.

To keep the context in Eliot, I tried serializing the action with `the_id = action.serialize_task_id()` and then picking up the context with `with Action.continue_task(task_id=the_id):`. It works partially: the first time I `continue_task`, the logs look OK, but if I try to do it more than once, the output comes out scrambled. The code looks like this (these are the standard scrapy callbacks):
Is this kind of pipeline logic supported by `continue_task`, or am I trying to use the wrong solution here?

To be clear, each `pipeline.process_item()` is called once per item, and I then run a loop over each subitem inside each pipeline. Ideally, I want the logs to reflect that tree structure, to make tracing errors easier.
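For illustration, the shape I'm after in each pipeline is something like this (the class and field names are made up):

```python
from eliot import Action, start_action

class SectionPipeline:
    """Hypothetical pipeline: one continued task per item, one nested
    action per subitem, so the logs mirror the tree structure."""

    def process_item(self, item, spider):
        # This is the second (or third...) continue_task on the same ID,
        # which is where the output stops looking right.
        with Action.continue_task(task_id=item["eliot_task_id"]):
            for i, subitem in enumerate(item.get("subitems", [])):
                with start_action(action_type="process_subitem", index=i):
                    pass  # the actual per-subitem processing
        return item
```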
Any ideas would be great!