RefineDom - Simple and fast Html parser refined.
- Installation
- Quick start
- Creating new document
- Search for elements
- Verify if element exists
- Supported selectors
- Output
- Creating a new element
- Getting parent element
- Getting sibling elements
- Getting the child elements
- Getting owner document
- Working with element attributes
- Comparing elements
- Adding a child element
- Replacing an element
- Removing element
- Working with cache
- Comparison with other parsers
At the time of writing there is no public Packagist package. Therefore, a custom vcs repository has to be defined in your composer.json:
...
"repositories": [
{
"type": "vcs",
"url": "[email protected]:GameplayJDK/RefineDom.git"
}
],
...
After that you can install RefineDom using the following command:
composer require gameplayjdk/refinedom
In the future you may be able to install without defining a custom vcs repository.
use RefineDom\Document;
$document = new Document('news.html', true);
$posts = $document->find('.post');
foreach ($posts as $post)
{
echo $post->text(), "\n";
}
RefineDom currently supports loading HTML in three ways:
// the first parameter is a string with html
$document = new Document($html);
// file path
$document = new Document('page.html', true);
// or DOMDocument
$document = new Document($doc);
The second parameter specifies whether the first argument is a file path to load. The default is false.
$document = new Document();
$document->load($html);
$document->loadFile('page.html');
$document->loadDocument($doc);
The load method can also load XML, but this requires $isHtml to be set to false, using either the third constructor argument or $document->setIsHtml(false).
The load method also accepts additional options:
$document->load($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
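To show what these two flags change, here is a minimal sketch that uses PHP's built-in DOMDocument directly rather than RefineDom. It assumes RefineDom passes the options through to libxml unchanged, which this README does not confirm. The variable names are illustrative only.

```php
<?php
// Assumption: plain ext-dom is used here, not RefineDom itself.
$fragment = '<p>Hello</p>';

// Default parsing: libxml adds a doctype plus <html> and <body> wrappers.
$plain = new DOMDocument();
$plain->loadHTML($fragment);

// With both flags: no implied wrappers and no default doctype.
$bare = new DOMDocument();
$bare->loadHTML($fragment, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$plainHtml = $plain->saveHTML();
$bareHtml = $bare->saveHTML();

echo $plainHtml, "\n";
echo $bareHtml, "\n";
```

The flags are useful when you parse a fragment and want to print it back without the wrapper elements.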
RefineDom supports searching with CSS selectors or XPath expressions. Pass the expression as the first parameter and its type as the second one (the default type is Query::TYPE_CSS):
use RefineDom\Document;
use RefineDom\Query;
...
// with CSS selector
$posts = $document->find('.post');
// or XPath
$posts = $document->find("//div[contains(@class, 'post')]", Query::TYPE_XPATH);
$posts = $document('.post');
$posts = $document->xPath("//*[contains(concat(' ', normalize-space(@class), ' '), ' post ')]");
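The two XPath expressions used above are not equivalent. The sketch below runs both against a small fragment with PHP's built-in DOMXPath (not RefineDom) to show the difference. The sample markup and variable names are made up for illustration.

```php
<?php
// Assumption: plain ext-dom is used here so the example runs without RefineDom.
$html = '<div class="post">A</div>'
      . '<div class="postscript">B</div>'
      . '<div class="featured post">C</div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Substring test: also matches class="postscript".
$loose = $xpath->query("//div[contains(@class, 'post')]");

// Whole-token test: pads the class list with spaces and looks for " post ".
$strict = $xpath->query(
    "//*[contains(concat(' ', normalize-space(@class), ' '), ' post ')]"
);

echo $loose->length, "\n";  // 3
echo $strict->length, "\n"; // 2
```

The second form behaves like the CSS selector .post, because it only matches a complete class name.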
You can search inside an element:
echo $document->find('.post')[0]->find('h2')[0]->text();
If elements matching the expression are found, the methods return an array of RefineDom\Element instances; otherwise they return an empty array. To get an array of DOMElement objects instead, pass false as the third parameter.
To avoid the array square brackets (these: []), use the findIndex and xPathIndex methods:
$title = $document->findIndex('header', 1)->xPathIndex("//h1")->text();
The index argument (the second parameter) defaults to zero; the remaining arguments are the same as for the regular search methods.
To verify that an element exists, use the has method:
if ($document->has('.post'))
{
// code
}
If you need to check whether an element exists and then get it:
if ($document->has('.post'))
{
$elements = $document->find('.post');
// code
}
but it would be faster like this:
if (count($elements = $document->find('.post')) > 0)
{
// code
}
because the first version runs the query twice.
Note that expressions are cached, though. See the Working with cache section for details.
This check also works with findIndex (and xPathIndex):
if (($element = $document->findIndex('.post', 3)) !== null)
{
// code
}
Both methods return null when no element exists at the given index.
RefineDom supports search by:
- tag
- class, ID, name and value of an attribute
- pseudo-classes:
  - first-, last-, nth-child
  - empty and not-empty
  - contains
  - has
// all links
$document->find('a');
// any element with id = "foo" and a "bar" class
$document->find('#foo.bar');
// any element with attribute "name"
$document->find('[name]');
// which is the same as
$document->find('*[name]');
// input field with the name "foo"
$document->find('input[name=foo]');
$document->find('input[name=\'bar\']');
$document->find('input[name="baz"]');
// any element that has an attribute starting with "data-" and the value "foo"
$document->find('*[^data-=foo]');
// all links starting with https
$document->find('a[href^=https]');
// all images whose src attribute ends with "png"
$document->find('img[src$=png]');
// all links containing the string "example.com"
$document->find('a[href*=example.com]');
// text of the links with "foo" class
$document->find('a.foo::text');
// address and title of all the links with "bar" class
$document->find('a.bar::attr(href|title)');
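For readers who prefer XPath, the three attribute operators above (^=, $= and *=) can be written by hand in XPath 1.0. The sketch below uses PHP's built-in DOMXPath and is not necessarily the XPath that RefineDom generates internally. The sample markup is made up.

```php
<?php
// Assumption: hand-written XPath 1.0 equivalents, run with plain ext-dom.
$html = '<a href="https://example.com/a">A</a>'
      . '<a href="http://example.org/b">B</a>'
      . '<img src="logo.png"><img src="photo.jpg">';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// a[href^=https]: the value starts with "https".
$httpsLinks = $xpath->query("//a[starts-with(@href, 'https')]");

// img[src$=png]: XPath 1.0 has no ends-with(), so compare the tail substring.
$pngImages = $xpath->query(
    "//img[substring(@src, string-length(@src) - string-length('png') + 1) = 'png']"
);

// a[href*=example.com]: the value contains the string.
$exampleLinks = $xpath->query("//a[contains(@href, 'example.com')]");

echo $httpsLinks->length, ' ', $pngImages->length, ' ', $exampleLinks->length, "\n"; // 1 1 1
```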
$post = $document->findIndex('.post');
echo $post->html();
$html = (string) $posts[0];
$html = $document->format()->html();
An element does not have the format() method, so if you need to output the formatted HTML of an element, first convert it to a document like this:
$html = $element->toDocument()->format()->html();
Additionally, you can pass extra options to xml as well as to html:
$html = $document->format()->xml(LIBXML_NOEMPTYTAG);
To output unformatted HTML or XML, pass a boolean argument to format (or setFormat):
$unformatted = $document->format(false)->html();
$innerHtml = $element->innerHtml();
Document does not have the innerHtml() method, so if you need the inner HTML of a document, convert it into an element first:
$innerHtml = $document->toElement()->innerHtml();
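To make the difference between html() and innerHtml() concrete, here is a minimal sketch of how an inner-HTML helper can be built on PHP's DOM extension. The helper name innerHtmlOf is hypothetical, and this is not necessarily how RefineDom implements it.

```php
<?php
// Hypothetical helper: serializes only the children of a node.
function innerHtmlOf(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // saveHTML() with a node argument serializes just that node.
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

$doc = new DOMDocument();
$doc->loadHTML('<div><b>Hi</b> there</div>');
$div = $doc->getElementsByTagName('div')->item(0);

$outer = $doc->saveHTML($div); // includes the <div> wrapper
$inner = innerHtmlOf($div);    // children only

echo $outer, "\n";
echo $inner, "\n";
```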
$posts = $document->find('.post');
echo $posts[0]->text();
use RefineDom\Element;
$element = new Element('span', 'Hello');
// outputs "<span>Hello</span>"
echo $element->html();
The first parameter is the name of the element, the second is its text value (optional), and the third is an array of element attributes (also optional).
An example of creating an element with attributes:
$attributes = ['name' => 'description', 'placeholder' => 'Enter description of item'];
$element = new Element('textarea', 'Text', $attributes);
An element can also be created from an instance of the DOMElement class:
use RefineDom\Element;
use DOMElement;
$domElement = new DOMElement('span', 'Hello');
$element = new Element($domElement);
$document = new Document($html);
$element = $document->createElement('span', 'Hello');
$document = new Document($html);
$input = $document->findIndex('input[name=email]');
var_dump($input->parent());
$document = new Document($html);
$item = $document->findIndex('ul.menu > li', 1);
var_dump($item->previousSibling());
var_dump($item->nextSibling());
$html = '
<ul>
<li>Foo</li>
<li>Bar</li>
<li>Baz</li>
</ul>
';
$document = new Document($html);
$list = $document->findIndex('ul');
// string(3) "Baz"
var_dump($list->child(2)->text());
// string(3) "Foo"
var_dump($list->firstChild()->text());
// string(3) "Baz"
var_dump($list->lastChild()->text());
// array(3) { ... }
var_dump($list->children());
$document = new Document($html);
$element = $document->findIndex('input[name=email]', 0);
$otherDocument = $element->getDocument();
// bool(true)
var_dump($document->is($otherDocument));
$name = $element->tag;
$element->setAttribute('name', 'username');
$element->attr('name', 'username');
$element->name = 'username';
$username = $element->getAttribute('value');
$username = $element->attr('value');
$username = $element->name;
These return null if the attribute is not found.
if ($element->hasAttribute('name'))
{
// code
}
if (isset($element->name))
{
// code
}
$element->removeAttribute('name');
unset($element->name);
$element = new Element('span', 'hello');
$otherElement = new Element('span', 'hello');
// bool(true)
var_dump($element->is($element));
// bool(false)
var_dump($element->is($otherElement));
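The result above suggests that is compares node identity, not content. The sketch below shows the same distinction with DOMNode::isSameNode from PHP's DOM extension. Whether RefineDom's is method delegates to it is an assumption.

```php
<?php
// Assumption: plain ext-dom is used to illustrate identity comparison.
$doc = new DOMDocument();
$a = $doc->createElement('span', 'hello');
$b = $doc->createElement('span', 'hello');

// Same node object: identical.
$sameAsSelf = $a->isSameNode($a);

// Equal tag and text, but a different node: not identical.
$sameAsOther = $a->isSameNode($b);

var_dump($sameAsSelf);  // bool(true)
var_dump($sameAsOther); // bool(false)
```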
$list = new Element('ul');
$item = new Element('li', 'Item 1');
$list->appendChild($item);
$items = [
new Element('li', 'Item 2'),
new Element('li', 'Item 3'),
];
$list->appendChild($items);
$list = new Element('ul');
$item = new Element('li', 'Item 1');
$items = [
new Element('li', 'Item 2'),
new Element('li', 'Item 3'),
];
$list->appendChild($item);
$list->appendChild($items);
$element = new Element('span', 'hello');
$document->find('.post')[0]->replace($element);
$document->findIndex('.post')->remove();
The cache is an associative array of XPath expressions that were converted from CSS selectors. Each CSS selector is the key for its compiled XPath expression.
use RefineDom\Query;
...
$xPath = Query::compile('h2');
$compiled = Query::getCompiled();
// array('h2' => '//h2')
var_dump($compiled);
Using a predefined cache can help to improve the speed as there is no need to recompile a selector.
Query::setCompiled(['h2' => '//h2']);
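The caching idea described in this section can be sketched in a few lines of plain PHP. The class SelectorCache and its counter are hypothetical and only mirror the method names shown above. The toy conversion handles bare tag names only and is not RefineDom's real compiler.

```php
<?php
// Hypothetical stand-in for a selector cache; not RefineDom's Query class.
final class SelectorCache
{
    /** @var array<string, string> CSS selector => XPath expression */
    private static $compiled = [];

    /** @var int Number of real conversions performed (for illustration) */
    public static $compilations = 0;

    public static function compile(string $selector): string
    {
        if (!isset(self::$compiled[$selector])) {
            self::$compilations++;
            // Toy conversion: only bare tag names such as "h2" are supported.
            self::$compiled[$selector] = '//' . $selector;
        }
        return self::$compiled[$selector];
    }

    public static function getCompiled(): array
    {
        return self::$compiled;
    }

    public static function setCompiled(array $compiled): void
    {
        self::$compiled = $compiled;
    }
}

SelectorCache::compile('h2');
SelectorCache::compile('h2'); // served from the cache, no second conversion
$afterTwoCalls = SelectorCache::$compilations;

// A predefined cache skips the conversion entirely.
SelectorCache::setCompiled(['h1' => '//h1']);
$fromPreset = SelectorCache::compile('h1');
$afterPreset = SelectorCache::$compilations;
```

Because the selector string is the key, two selectors that differ only in whitespace would be cached separately in this sketch.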
This comparison refers to DiDom, not RefineDom. The numbers for RefineDom should not differ by much, though.