
Support for custom header and cookies for the initial request from kafka_monitor.py feed #182

Open
knirbhay opened this issue May 18, 2018 · 5 comments


@knirbhay
Contributor

knirbhay commented May 18, 2018

I needed to request a URL with custom headers and preset cookies, e.g.:

There is an API at https://xyz.com/test_api/_id which returns JSON, and it must be called via POST with an API key in a custom header and a few preset cookies on the request.

How do I get it working with scrapy-cluster?

With plain Scrapy I used to override the start_requests method and apply the custom headers and cookies there.
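
Roughly like this (a sketch of what I did in plain Scrapy; the header names, API key, cookie, and payload below are placeholders):

```python
import json

import scrapy


class TestApiSpider(scrapy.Spider):
    name = "test_api"

    def start_requests(self):
        # POST to the API with a custom header and preset cookies.
        yield scrapy.Request(
            url="https://xyz.com/test_api/_id",
            method="POST",
            headers={"X-Api-Key": "my-api-key",           # placeholder header/key
                     "Content-Type": "application/json"},
            cookies={"session_pref": "value"},             # placeholder preset cookie
            body=json.dumps({"query": "example"}),
            callback=self.parse_api,
        )

    def parse_api(self, response):
        data = json.loads(response.text)
        self.logger.info("API returned: %s", data)
```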

Another problem looks like a cookie jar issue: cookies are stored on one node and cannot be passed to another node. This shows up when the server uses the Set-Cookie header to store session details.

@madisonb
Collaborator

Cookie support is already provided via the cookie field in the Kafka Monitor API. It is a cookie string that is then deserialized into a Scrapy cookie object.
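
For example, a feed request can pass a raw cookie string along with the crawl (a sketch; the url, appid, crawlid, and cookie values below are placeholders):

```
$ python kafka_monitor.py feed '{
    "url": "https://xyz.com/test_api/_id",
    "appid": "testapp",
    "crawlid": "abc123",
    "cookie": "session=abcd1234; theme=dark"
}'
```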

As for custom request methods, the custom scheduler is where you want to look, as it translates the incoming objects into Scrapy Requests. I think the scheduler should be able to handle POST requests yielded from the spider, thanks to Scrapy's request dict methods, but for the initial request that is something that could be improved.

Scrapy Cluster purposefully does not store cookie information in each spider, because any single chain of requests might go to multiple spiders or machines. You would need to customize the setup a bit to pass those cookies through your calls so they are used in subsequent requests.

Scrapy Cluster is most suited for large scale on demand crawling, and in its current form (because it is distributed) has some of the limitations or assumptions I noted above. I am always happy to look at or review a PR if you think it would be worthwhile to add to the project!

@knirbhay
Contributor Author

knirbhay commented Jun 6, 2018

Working towards it. I got the custom request working with headers and cookies. Now working on shared cookie instances, shared via Redis and separated by crawl/spider IDs.

@knirbhay
Contributor Author

knirbhay commented Jun 21, 2018

The custom cookie middleware below worked for me. I'm not sure if this is the right place to initialize redis_conn; I could not find a way to share the DistributedScheduler's redis_conn.

```python
import pickle

import redis
from scrapy.downloadermiddlewares.cookies import CookiesMiddleware


class SharedCookiesMiddleware(CookiesMiddleware):
    """Cookie middleware that keeps the cookie jar in Redis so every
    node in the cluster sees the same cookies for a given crawl."""

    def __init__(self, debug=True, server=None):
        CookiesMiddleware.__init__(self, debug)
        self.redis_conn = server
        self.debug = debug

    @classmethod
    def from_crawler(cls, crawler):
        server = redis.Redis(host=crawler.settings.get('REDIS_HOST'),
                             port=crawler.settings.get('REDIS_PORT'),
                             db=crawler.settings.get('REDIS_DB'))
        return cls(crawler.settings.getbool('COOKIES_DEBUG'), server)

    def process_request(self, request, spider):
        if 'dont_merge_cookies' in request.meta:
            return
        # One shared jar per spiderid/crawlid pair.
        cookiejarkey = "{spiderid}:sharedcookies:{crawlid}".format(
            spiderid=request.meta.get("spiderid"),
            crawlid=request.meta.get("crawlid"))

        # Start from the jar stored in Redis rather than the per-process one.
        jar = self.jars[cookiejarkey]
        jar.clear()
        if self.redis_conn.exists(cookiejarkey):
            data = self.redis_conn.get(cookiejarkey)
            jar = pickle.loads(data)

        cookies = self._get_request_cookies(jar, request)
        for cookie in cookies:
            jar.set_cookie_if_ok(cookie, request)

        # Set the Cookie header from the shared jar.
        request.headers.pop('Cookie', None)
        jar.add_cookie_header(request)
        self._debug_cookie(request, spider)

        # Persist the jar so other nodes pick up the same cookies.
        self.redis_conn.set(cookiejarkey, pickle.dumps(jar))

    def process_response(self, request, response, spider):
        if request.meta.get('dont_merge_cookies', False):
            return response
        cookiejarkey = "{spiderid}:sharedcookies:{crawlid}".format(
            spiderid=request.meta.get("spiderid"),
            crawlid=request.meta.get("crawlid"))

        jar = self.jars[cookiejarkey]
        jar.clear()
        if self.redis_conn.exists(cookiejarkey):
            data = self.redis_conn.get(cookiejarkey)
            jar = pickle.loads(data)

        # Extract cookies from Set-Cookie and drop invalid/expired cookies.
        jar.extract_cookies(response, request)
        self._debug_set_cookie(response, spider)

        self.redis_conn.set(cookiejarkey, pickle.dumps(jar))
        return response
```
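
To enable it I swap out the stock CookiesMiddleware in the crawler's settings, something like this (a sketch; the module path for SharedCookiesMiddleware and the Redis values are placeholders for your own setup):

```python
# settings.py (sketch): replace the stock CookiesMiddleware with the shared one.
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': None,
    'crawling.shared_cookies.SharedCookiesMiddleware': 700,  # placeholder module path
}

# Redis connection read in from_crawler (point at the cluster's Redis instance).
REDIS_HOST = 'localhost'
REDIS_PORT = 6379
REDIS_DB = 0
```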

@madisonb
Collaborator

Thanks @knirbhay! This is a great start; I can try to incorporate it into the project or just leave it as a standalone file. If you make a PR I can review it and get it merged in. There are just a couple of things I would like changed, but otherwise it is great work.

@NirbhayK
Contributor

NirbhayK commented Jun 21, 2018

Sure. I will also include the custom request support for headers and cookies. I have enhanced the Kafka feed API, but I still need to check that it works with the Scrapy Cluster REST API.
