PCrawl is a PHP library for crawling and scraping web pages.
It supports multiple HTTP clients (curl and Guzzle) and offers options to debug, modify, and parse responses.
- Rapidly create custom clients, and fluently change clients and client options (such as the user agent) via method chaining.
- Modify responses using reusable callback functions.
- Debug responses against different criteria: HTTP status codes, regular expressions, and more.
- Parse responses using the querypath library; several convenience functions are provided.
- Fluent API: debuggers, clients, and response-modification objects can be swapped on the fly (see the sketch below).
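For a quick taste of the fluent style, here is a minimal sketch. It assumes the option setters return `$this` (as the method-chaining claim above suggests); `setUserAgent()` is a hypothetical setter, while `setRedirects()`, `setClient()`, and `get()` all appear in the walkthrough below.

```php
// Assumed fluent setters: each returns $this so calls can be chained.
$client = new CurlClient();
$client->setRedirects(1)
    ->setUserAgent('Mozilla/5.0 (compatible; PCrawlBot/1.0)'); // hypothetical setter

$req = new Request();
$req->setClient($client);
$res = $req->get('https://example.com');
```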
Below, we fetch a page with a deliberately misconfigured client, detect the problem with a debugger, and finally change the client options to fetch the page correctly.
- Set up some clients
```php
// A simple Guzzle-based client.
$gu = new GuzzleClient();

// A custom client that does not allow redirects.
$uptightNoRedirectClient = new CurlClient();
$uptightNoRedirectClient->setRedirects(0); // disable redirects

// A custom client: a thin wrapper around curl that upgrades URLs to HTTPS.
class ConvertToHttpsClient extends CurlClient
{
    public function get(string $url, array $options = []): PResponse
    {
        $url = str_replace('http://', 'https://', $url);
        return parent::get($url, $options);
    }
}
```
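A quick usage sketch for the wrapper above (the URL is illustrative):

```php
// Every request made through this client is upgraded to HTTPS first.
$httpsClient = new ConvertToHttpsClient();
$res = $httpsClient->get('http://example.com'); // fetched as https://example.com
```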
- Let's create some debugger objects
```php
$redirectDetector = new ResponseDebug();
$redirectDetector->setMustNotExistHttpCodes([301, 302, 303, 307, 308]);

$fullPageDetector = new ResponseDebug();
$fullPageDetector->setMustExistRegex(['#</html>#']);
```
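Criteria can also be combined on a single debugger. A minimal sketch, assuming multiple criteria may be set on one instance (only the two setters shown above are used):

```php
// Fails if the response is a redirect OR the closing </html> tag is missing.
$strictDetector = new ResponseDebug();
$strictDetector->setMustNotExistHttpCodes([301, 302, 303, 307, 308]);
$strictDetector->setMustExistRegex(['#</html>#']);
```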
For testing, we fetch the page with a client that does not follow redirects, then use the redirectDetector to catch the 301. If one is detected, we change the client options to follow redirects and fetch again.
```php
$req = new Request();
$url = "http://www.whatsmyua.info";
$req->setClient($uptightNoRedirectClient);

$count = 0;
do {
    $res = $req->get($url);
    $redirectDetector->setResponse($res);
    if ($redirectDetector->isFail()) {
        var_dump($redirectDetector->getFailDetail());
        // A redirect was detected: allow the client to follow it,
        // then let the loop retry the request.
        $uptightNoRedirectClient->setRedirects(1);
    }
} while ($redirectDetector->isFail() && $count++ < 1);
```
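The counter caps the loop at a single retry, so a URL that keeps failing cannot spin forever, and the retried response is fed back through the detector before the loop decides whether to continue.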
Use the fullPageDetector to verify that the page is complete, then parse the response body using the parser.
```php
if ($fullPageDetector->setResponse($res)->isFail()) {
    var_dump($fullPageDetector->getFailDetail());
} else {
    $parser = new ParserCommon($res->getBody());
    $h1 = $parser->find('h1')->text();
    $htmlClass = $parser->find('html')->attr('class');
}
```
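A couple more extractions using the same convenience calls (the selectors are illustrative):

```php
// Reuse the parser instance for further lookups on the same body.
$title    = $parser->find('title')->text();
$firstImg = $parser->find('img')->attr('src');
```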
Note: debuggers, clients, and parsers can all be reused across requests.
PCrawl can be installed in two ways:
- Composer:

```sh
composer init                          # for new projects
composer config minimum-stability dev  # will be removed once the library is stable
composer require gyaaniguy/pcrawl
composer update
```

Then, in PHP:

```php
require __DIR__ . '/vendor/autoload.php';
```
- GitHub:

```sh
git clone git@github.com:gyaaniguy/PCrawl.git  # clone the repo
cd PCrawl
composer update                                # install dependencies
mv ../PCrawl /desired/location                 # move the directory to the desired location
```

Then, in PHP:

```php
require __DIR__ . '/../PCrawl/vendor/autoload.php'; // adjust the path to where PCrawl lives
```
- Leverage guzzlehttp's asynchronous support.
- PSR-12 code style.
- PHPUnit tests.