PCrawl is a PHP library for crawling and scraping web pages.
It supports multiple HTTP clients (curl and Guzzle) and offers options to debug, modify, and parse responses.
- Rapidly create custom clients, and fluently change clients and client options (such as the user agent) via method chaining.
- Modify responses using reusable callback functions.
- Debug responses against different criteria: HTTP status codes, regular expressions, and more.
- Parse responses using the querypath library; several convenience functions are provided.
- Fluent API: debuggers, clients, and response-modification objects can be swapped on the fly (see the sketch below).
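For a quick taste of the fluent style, here is a minimal sketch. It assumes the option setters return `$this` (as the method-chaining claim above suggests); `setUserAgent()` is a hypothetical setter, while `setRedirects()`, `setClient()`, and `get()` all appear in the walkthrough below.

```php
// Assumed fluent setters: each returns $this so calls can be chained.
$client = new CurlClient();
$client->setRedirects(1)
    ->setUserAgent('Mozilla/5.0 (compatible; PCrawlBot/1.0)'); // hypothetical setter

$req = new Request();
$req->setClient($client);
$res = $req->get('https://example.com');
```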
Below, we fetch a page with a deliberately misconfigured client, detect the problem with a debugger, and finally change the client options to fetch the page correctly.
- Set up some clients
```php
// A simple Guzzle-based client.
$gu = new GuzzleClient();

// A custom client that does not allow redirects.
$uptightNoRedirectClient = new CurlClient();
$uptightNoRedirectClient->setRedirects(0); // disable redirects

// A custom client: a thin wrapper around curl that upgrades URLs to HTTPS.
class ConvertToHttpsClient extends CurlClient
{
    public function get(string $url, array $options = []): PResponse
    {
        $url = str_replace('http://', 'https://', $url);
        return parent::get($url, $options);
    }
}
```
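A quick usage sketch for the wrapper above (the URL is illustrative):

```php
// Every request made through this client is upgraded to HTTPS first.
$httpsClient = new ConvertToHttpsClient();
$res = $httpsClient->get('http://example.com'); // fetched as https://example.com
```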
- Let's create some debugger objects
```php
$redirectDetector = new ResponseDebug();
$redirectDetector->setMustNotExistHttpCodes([301, 302, 303, 307, 308]);

$fullPageDetector = new ResponseDebug();
$fullPageDetector->setMustExistRegex(['#</html>#']);
```
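Criteria can also be combined on a single debugger. A minimal sketch, assuming multiple criteria may be set on one instance (only the two setters shown above are used):

```php
// Fails if the response is a redirect OR the closing </html> tag is missing.
$strictDetector = new ResponseDebug();
$strictDetector->setMustNotExistHttpCodes([301, 302, 303, 307, 308]);
$strictDetector->setMustExistRegex(['#</html>#']);
```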
For testing, we fetch the page with a client that does not follow redirects, then use the redirectDetector to catch the 301. If one is detected, we change the client options to follow redirects and fetch again.
```php
$req = new Request();
$url = "http://www.whatsmyua.info";
$req->setClient($uptightNoRedirectClient);

$count = 0;
do {
    $res = $req->get($url);
    $redirectDetector->setResponse($res);
    if ($redirectDetector->isFail()) {
        var_dump($redirectDetector->getFailDetail());
        // A redirect was detected: allow the client to follow it,
        // then let the loop retry the request.
        $uptightNoRedirectClient->setRedirects(1);
    }
} while ($redirectDetector->isFail() && $count++ < 1);
```
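The counter caps the loop at a single retry, so a URL that keeps failing cannot spin forever, and the retried response is fed back through the detector before the loop decides whether to continue.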
Use the fullPageDetector to verify that the page is complete, then parse the response body using the parser.
```php
if ($fullPageDetector->setResponse($res)->isFail()) {
    var_dump($fullPageDetector->getFailDetail());
} else {
    $parser = new ParserCommon($res->getBody());
    $h1 = $parser->find('h1')->text();
    $htmlClass = $parser->find('html')->attr('class');
}
```
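A couple more extractions using the same convenience calls (the selectors are illustrative):

```php
// Reuse the parser instance for further lookups on the same body.
$title    = $parser->find('title')->text();
$firstImg = $parser->find('img')->attr('src');
```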
Note: debuggers, clients, and parsers can all be reused across requests.
PCrawl can be installed in two ways:
- Composer:

```sh
composer init                          # for new projects
composer config minimum-stability dev  # will be removed once the library is stable
composer require gyaaniguy/pcrawl
composer update
```

Then, in PHP:

```php
require __DIR__ . '/vendor/autoload.php';
```
- GitHub:

```sh
git clone git@github.com:gyaaniguy/PCrawl.git  # clone the repo
cd PCrawl
composer update                                # install dependencies
mv ../PCrawl /desired/location                 # move the directory to the desired location
```

Then, in PHP:

```php
require __DIR__ . '/../PCrawl/vendor/autoload.php'; // adjust the path to where PCrawl lives
```
- Leverage guzzlehttp's asynchronous support.
- PSR-12 code style.
- PHPUnit tests.