Skip to content
This repository was archived by the owner on May 9, 2020. It is now read-only.

Optimize header checks #247

Merged
merged 3 commits into from Aug 24, 2019
Merged

Optimize header checks #247

merged 3 commits into from Aug 24, 2019

Conversation

ghost
Copy link

@ghost ghost commented Aug 20, 2019

Perks:

  • Remove the indirect dependency on caseless.
  • Iterate over the header keys only once.
  • Call toLowerCase for each key only once.
Rationale

Cloudscraper currently checks a few different headers using caseless.

cloudscraper/index.js

Lines 157 to 158 in 0d86d22

response.isCloudflare = /^(cloudflare|sucuri)/i.test('' + response.caseless.get('server'));
response.isHTML = /text\/html/i.test('' + response.caseless.get('content-type'));

if (/\bbr\b/i.test('' + response.caseless.get('content-encoding'))) {

The following code is used for each response.caseless.get method call:
https://github.com/request/caseless/blob/98079105e15e04308d1f3becc3e8efab8ddd7f0e/index.js#L27-L36
github.com_request_caseless_blob_98079105e15e04308d1f3becc3e8efab8ddd7f0e_index js (1)

It's iterating over all the keys, calling toLowerCase and then comparing the result each time we check a header. This is inefficient, especially when the headers that we're checking for tend to be the last in the list.

i.e.

  HTTP/1.1 503 Service Temporarily Unavailable
  Date: Tue, 20 Aug 2019 20:06:29 GMT
  Content-Type: text/html; charset=UTF-8
  Connection: close
  X-Frame-Options: SAMEORIGIN
  Set-Cookie: __cfduid=d0ae8363c368eb644584c1fb3ca1424f51566331589; expires=Wed, 19-Aug-20 20:06:29 GMT; path=/; domain=redacted; HttpOnly
  Cache-Control: no-cache
  Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
  Vary: Accept-Encoding
  Server: cloudflare
  CF-RAY: 5096fcf2be1ed31c-ATL

There are other optimizations we can make to Cloudscraper that involve checking additional headers and would cause even a greater performance penalty if left unchanged.

@ghost ghost added the enhancement label Aug 20, 2019
@ghost ghost requested a review from codemanki August 20, 2019 20:18
@ghost ghost mentioned this pull request Aug 22, 2019
@ghost
Copy link
Author

ghost commented Aug 23, 2019

@codemanki Since we're only targeting Cloudflare and/or Sucuri, maybe it's okay to perform case-sensitive header checks?

@codemanki
Copy link
Owner

@pro-src probably yes, frankly speaking I don't believe that we are gaining too much performance improvement in this case, as it is only 1 call to toLowerCase per header which basically doesn't really matter. But we can merge this if you want :)

@ghost
Copy link
Author

ghost commented Aug 23, 2019

If we're going to switch over to case-sensitive headers checks then this is redundant and can be closed.

Otherwise, we make 3 header checks for every single request. If a request has 10 headers, that puts the number of calls to toLowerCase at 3 * 10 for a single request, not counting the comparison against every header until a match is found.

1 call to toLowerCase per header

That's what this PR does.

@ghost
Copy link
Author

ghost commented Aug 23, 2019

You know, I agree, the overall performance gain will be negligible in the face of an initial 5 second wait.

Being simple, this gets called 3 times and iterates over every header if a match isn't found:
github.com_request_caseless_blob_98079105e15e04308d1f3becc3e8efab8ddd7f0e_index js (1)

This PR changes essentially reduces this to 1 call, regardless of whether a header is present and never increases. Meaning, we can check additional headers without the performance penalty of calling the above method.

It's more than 3 times as fast without introducing code complexity.

I'm leaning toward the case-sensitive checks now though.

@ghost
Copy link
Author

ghost commented Aug 24, 2019

@codemanki Check this out: https://github.com/pro-src/cloudflare-domains

It should help in making informed decisions and performing more comprehensive tests. :)

@codemanki
Copy link
Owner

codemanki commented Aug 24, 2019

@pro-src Well if you scrape 2.5m domains then impact will be significant :) Okay, merging it then. And thank you :)

@codemanki codemanki merged commit 459c558 into codemanki:master Aug 24, 2019
@ghost ghost deleted the optimize_headers branch August 24, 2019 20:00
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant