
Support for custom header and cookies for the initial request from kafka_monitor.py feed #182

Open
knirbhay opened this issue May 18, 2018 · 5 comments


@knirbhay
Contributor

knirbhay commented May 18, 2018

I needed to request a URL with custom headers and preset cookies, e.g.:

There is an API at https://xyz.com/test_api/_id which returns JSON, and it must be called via POST with an API key in a custom header and a few preset cookies on the request.

How do I get it working with scrapy-cluster?

With plain Scrapy I used to override the start_requests method and apply the custom headers and cookies there.
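
Roughly like this (a sketch of what I did in plain Scrapy; the header names, API key, cookie, and payload below are placeholders):

```python
import json

import scrapy


class TestApiSpider(scrapy.Spider):
    name = "test_api"

    def start_requests(self):
        # POST to the API with a custom header and preset cookies.
        yield scrapy.Request(
            url="https://xyz.com/test_api/_id",
            method="POST",
            headers={"X-Api-Key": "my-api-key",           # placeholder header/key
                     "Content-Type": "application/json"},
            cookies={"session_pref": "value"},             # placeholder preset cookie
            body=json.dumps({"query": "example"}),
            callback=self.parse_api,
        )

    def parse_api(self, response):
        data = json.loads(response.text)
        self.logger.info("API returned: %s", data)
```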

Another problem looks like a cookie jar issue: cookies are stored on one node and cannot be passed to another node. This shows up when the server uses the Set-Cookie header to store session details.

@madisonb
Collaborator

Cookie support is already provided via the cookie field in the Kafka Monitor API. It is a cookie string that is then deserialized into a Scrapy cookie object.
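
For example, a feed request can pass a raw cookie string along with the crawl (a sketch; the url, appid, crawlid, and cookie values below are placeholders):

```
$ python kafka_monitor.py feed '{
    "url": "https://xyz.com/test_api/_id",
    "appid": "testapp",
    "crawlid": "abc123",
    "cookie": "session=abcd1234; theme=dark"
}'
```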

As for custom request methods, the custom scheduler is where you want to look, as it translates the incoming objects into Scrapy Requests. I think the scheduler should be able to handle POST requests yielded from the spider, thanks to Scrapy's request dict methods, but for the initial request that is something that could be improved.

Scrapy Cluster purposefully does not store cookie information in each spider, because any single chain of requests might go to multiple spiders or machines. You would need to customize the setup a bit to pass those cookies through your calls so they are used in subsequent requests.

Scrapy Cluster is most suited for large scale on demand crawling, and in its current form (because it is distributed) has some of the limitations or assumptions I noted above. I am always happy to look at or review a PR if you think it would be worthwhile to add to the project!

@knirbhay
Contributor Author

knirbhay commented Jun 6, 2018

Working towards it. I got the custom request working with headers and cookies. Now working on shared cookie instances, shared via Redis and separated by crawl/spider IDs.

@knirbhay
Contributor Author

knirbhay commented Jun 21, 2018

The custom cookie middleware below worked for me. I'm not sure if this is the right place to initialize redis_conn; I could not find a way to share the DistributedScheduler's redis_conn.

```python
import pickle

import redis
from scrapy.downloadermiddlewares.cookies import CookiesMiddleware


class SharedCookiesMiddleware(CookiesMiddleware):
    """Cookie middleware that keeps the cookie jar in Redis so every
    node in the cluster sees the same cookies for a given crawl."""

    def __init__(self, debug=True, server=None):
        CookiesMiddleware.__init__(self, debug)
        self.redis_conn = server
        self.debug = debug

    @classmethod
    def from_crawler(cls, crawler):
        server = redis.Redis(host=crawler.settings.get('REDIS_HOST'),
                             port=crawler.settings.get('REDIS_PORT'),
                             db=crawler.settings.get('REDIS_DB'))
        return cls(crawler.settings.getbool('COOKIES_DEBUG'), server)

    def process_request(self, request, spider):
        if 'dont_merge_cookies' in request.meta:
            return
        # One shared jar per spiderid/crawlid pair.
        cookiejarkey = "{spiderid}:sharedcookies:{crawlid}".format(
            spiderid=request.meta.get("spiderid"),
            crawlid=request.meta.get("crawlid"))

        # Start from the jar stored in Redis rather than the per-process one.
        jar = self.jars[cookiejarkey]
        jar.clear()
        if self.redis_conn.exists(cookiejarkey):
            data = self.redis_conn.get(cookiejarkey)
            jar = pickle.loads(data)

        cookies = self._get_request_cookies(jar, request)
        for cookie in cookies:
            jar.set_cookie_if_ok(cookie, request)

        # Set the Cookie header from the shared jar.
        request.headers.pop('Cookie', None)
        jar.add_cookie_header(request)
        self._debug_cookie(request, spider)

        # Persist the jar so other nodes pick up the same cookies.
        self.redis_conn.set(cookiejarkey, pickle.dumps(jar))

    def process_response(self, request, response, spider):
        if request.meta.get('dont_merge_cookies', False):
            return response
        cookiejarkey = "{spiderid}:sharedcookies:{crawlid}".format(
            spiderid=request.meta.get("spiderid"),
            crawlid=request.meta.get("crawlid"))

        jar = self.jars[cookiejarkey]
        jar.clear()
        if self.redis_conn.exists(cookiejarkey):
            data = self.redis_conn.get(cookiejarkey)
            jar = pickle.loads(data)

        # Extract cookies from Set-Cookie and drop invalid/expired cookies.
        jar.extract_cookies(response, request)
        self._debug_set_cookie(response, spider)

        self.redis_conn.set(cookiejarkey, pickle.dumps(jar))
        return response
```
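
To enable it I swap out the stock CookiesMiddleware in the crawler's settings, something like this (a sketch; the module path for SharedCookiesMiddleware and the Redis values are placeholders for your own setup):

```python
# settings.py (sketch): replace the stock CookiesMiddleware with the shared one.
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': None,
    'crawling.shared_cookies.SharedCookiesMiddleware': 700,  # placeholder module path
}

# Redis connection read in from_crawler (point at the cluster's Redis instance).
REDIS_HOST = 'localhost'
REDIS_PORT = 6379
REDIS_DB = 0
```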

@madisonb
Collaborator

Thanks @knirbhay! This is a great start; I can try to incorporate it into the project or just leave it as a standalone file. If you make a PR I can review it and get it merged in. There are just a couple of things I would like changed, but otherwise it is great work.

@NirbhayK
Contributor

NirbhayK commented Jun 21, 2018

Sure. I will also include the custom request support for headers and cookies. I have enhanced the Kafka feed API, but I still need to check that it works with the Scrapy Cluster REST API.
