Gammo - A pure-Ruby HTML5 parser

Gammo provides a pure-Ruby, HTML5-compliant parser, plus CSS selector and XPath support for traversing the DOM tree it builds. Its implementation of the HTML5 parsing algorithm conforms to the WHATWG specification. Given an HTML string, Gammo parses it and builds a DOM tree using the tokenization and tree-construction algorithms defined by the WHATWG parsing specification. All of this is provided without any external dependencies.

The name Gammo is inspired by Gumbo. Gammo itself, however, is a fried tofu fritter made with vegetables.

require 'gammo'
require 'open-uri'

parser = URI.open('https://google.com') { |f| Gammo.new(f.read) }
document = parser.parse #=> #<Gammo::Node::Document>

puts document.css('title').first.inner_text #=> 'Google'

Overview
- Features
Tokenization
- Token types
Parsing
- Notes
Node
DOM Tree Traversal
- XPath 1.0 (experimental)
- CSS Selector (experimental)
Performance
References
License
Release History

Overview

Features

Tokenization: Gammo has a tokenizer that implements the tokenization algorithm.
Parsing: Gammo provides a parser that implements the parsing algorithm on top of the tokenizer above and the tree-construction algorithm.
Node: Gammo provides nodes that partially implement the WHATWG DOM specification.
DOM Tree Traversal: Gammo provides DOM tree traversal via CSS selectors and XPath.
Performance: Gammo does not prioritize performance, and there are a few performance caveats to be aware of.

Tokenization

Gammo::Tokenizer implements the tokenization algorithm defined by WHATWG. You can get tokens in order by calling Gammo::Tokenizer#next_token.

Here is a simple example that runs only the tokenizer.

def dump_for(token)
  puts "data: #{token.data}, class: #{token.class}"
end

tokenizer = Gammo::Tokenizer.new('<!doctype html><input type="button"><frameset>')
dump_for tokenizer.next_token #=> data: html, class: Gammo::Tokenizer::DoctypeToken
dump_for tokenizer.next_token #=> data: input, class: Gammo::Tokenizer::StartTagToken
dump_for tokenizer.next_token #=> data: frameset, class: Gammo::Tokenizer::StartTagToken
dump_for tokenizer.next_token #=> data: end of string, class: Gammo::Tokenizer::ErrorToken

The parser described below depends on this tokenizer: it applies the WHATWG parsing algorithm, in order, to the tokens that the tokenizer extracts.
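
In the example above, the end of the input shows up as an ErrorToken, so a loop that collects every token needs an explicit stop condition. The following is a minimal sketch of such a loop. drain_tokens is a hypothetical helper name, and StubTokenizer and its token classes are stand-ins defined here only so that the snippet runs without Gammo installed.

```ruby
# Collects tokens from any object that responds to #next_token, stopping at
# the first token that is an instance of error_class.
def drain_tokens(tokenizer, error_class)
  tokens = []
  loop do
    token = tokenizer.next_token
    break if token.is_a?(error_class)
    tokens << token
  end
  tokens
end

# Stand-ins (NOT part of Gammo) so the sketch runs on its own.
StubToken = Struct.new(:data)
StubErrorToken = Class.new(StubToken)

class StubTokenizer
  def initialize(tokens)
    @tokens = tokens.dup
  end

  # Returns the queued tokens in order, then an error token from then on.
  def next_token
    @tokens.shift || StubErrorToken.new('end of string')
  end
end

stub = StubTokenizer.new([StubToken.new('html'), StubToken.new('input')])
puts drain_tokens(stub, StubErrorToken).map(&:data).inspect

# With Gammo, the assumed equivalent (based on the example above) would be:
#   drain_tokens(Gammo::Tokenizer.new(html), Gammo::Tokenizer::ErrorToken)
```

Passing the error class as an argument keeps the helper independent of any particular tokenizer, which is why the stand-in can exercise it.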

Token types

Each token generated by the tokenizer is one of the following types:

Token type	Description
`Gammo::Tokenizer::ErrorToken`	Represents an error token; it usually means end-of-string.
`Gammo::Tokenizer::TextToken`	Represents a text token like "foo", which is the inner text of an element.
`Gammo::Tokenizer::StartTagToken`	Represents a start tag token like `<a>`.
`Gammo::Tokenizer::EndTagToken`	Represents an end tag token like `</a>`.
`Gammo::Tokenizer::SelfClosingTagToken`	Represents a self-closing tag token like `<img />`.
`Gammo::Tokenizer::CommentToken`	Represents a comment token like `<!-- comment -->`.
`Gammo::Tokenizer::DoctypeToken`	Represents a doctype token like `<!doctype html>`.

Parsing

Gammo::Parser implements the tree-construction stage on top of the tokenization described above.

After a successful parse, the parser exposes the root document through its document accessor (this is the same object that Gammo::Parser#parse returns). From that document, you can traverse the DOM tree constructed by the parser.

require 'gammo'
require 'pp'

document = Gammo.new('<!doctype html><input type="button">').parse

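# Appends the hash form of node to strm, then recursively nests the hash
# form of each child under the :children key of that entry.
def dump_for(node, strm)
  strm << node.to_h
  child = node.first_child
  while child
    dump_for(child, (strm.last[:children] ||= []))
    child = child.next_sibling
  end
  strm
end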

pp dump_for(document, [])

Notes

Earlier releases could not traverse the DOM tree with CSS selectors or XPath the way Nokogiri does. Both are now available as experimental features (XPath 1.0 since v0.2.0, CSS selectors since v0.3.0); see DOM Tree Traversal below.

Node

Each node generated by the parser is one of the following types:

Node type	Description
`Gammo::Node::Error`	Represents an error node; it usually means end-of-string.
`Gammo::Node::Text`	Represents a text node like "foo", which is the inner text of an element.
`Gammo::Node::Document`	Represents the root document type. It's always returned by `Gammo::Parser#document`.
`Gammo::Node::Element`	Represents any HTML element, like `<p>`.
`Gammo::Node::Comment`	Represents a comment like `<!-- foo -->`.
`Gammo::Node::Doctype`	Represents a doctype like `<!doctype html>`.

Some nodes, such as Gammo::Node::Element and Gammo::Node::Document, hold pointers to related nodes, which are exposed through accessors such as Gammo::Node#next_sibling and Gammo::Node#first_child. In addition, APIs such as Gammo::Node#append_child and Gammo::Node#remove_child, which perform operations defined in the DOM Living Standard, are also provided.
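
The first-child and next-sibling pointers are enough to visit every descendant of a node. The following is a minimal sketch of that walk. SketchNode is a stand-in defined here, not Gammo's node class, and each_descendant is a hypothetical helper name; it only assumes that a node responds to first_child and next_sibling as described above.

```ruby
# Stand-in node with the two pointers described above (NOT Gammo's class).
SketchNode = Struct.new(:name, :first_child, :next_sibling)

# Depth-first, pre-order walk using only #first_child and #next_sibling.
def each_descendant(node, &block)
  child = node.first_child
  while child
    block.call(child)
    each_descendant(child, &block)
    child = child.next_sibling
  end
end

# Build body > (ul > (li-1, li-2), p) by hand.
li2  = SketchNode.new('li-2')
li1  = SketchNode.new('li-1', nil, li2)
para = SketchNode.new('p')
ul   = SketchNode.new('ul', li1, para)
body = SketchNode.new('body', ul)

names = []
each_descendant(body) { |node| names << node.name }
puts names.join(' ') # prints "ul li-1 li-2 p"
```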

DOM Tree Traversal

CSS selectors and XPath 1.0 are the two ways to traverse the DOM tree built by Gammo.

XPath 1.0 (experimental)

Gammo has its own lexer/parser for XPath 1.0, provided as a helper on the DOM tree built by Gammo. Here is a simple example:

document = Gammo.new('<!doctype html><input type="button">').parse
node_set = document.xpath('//input[@type="button"]') #=> "<Gammo::XPath::NodeSet>"

node_set.length #=> 1
node_set.first #=> "<Gammo::Node::Element>"

Since this is implemented from scratch, Gammo provides this support as a highly experimental feature. Please file an issue if you find bugs.

Example

Before going into the details of XPath support, let's look at a few simple examples. Given a sample HTML text and its DOM tree:

document = Gammo.new(<<-EOS).parse
<!DOCTYPE html>
<html>
<head>
</head>
<body>
  <h1>namusyaka.com</h1>
  <p class="description">Here is a sample web site.</p>
  <ul>
    <li>hello</li>
    <li>world</li>
  </ul>
  <ul id="links">
    <li>Google <a href="https://google.com/">google.com</a></li>
    <li>GitHub <a href="https://github.com/namusyaka">github.com/namusyaka</a></li>
  </ul>
</body>
</html>
EOS

The following XPath expression gets all li elements and prints their text contents:

document.xpath('//li').each do |elm|
  puts elm.inner_text
end

The following XPath expression gets all li elements under the ul element having the id=links attribute:

document.xpath('//ul[@id="links"]/li').each do |elm|
  puts elm.inner_text
end

The following XPath expression gets each text node for each li element under the ul element having the id=links attribute:

document.xpath('//ul[@id="links"]/li/text()').each do |elm|
  puts elm.data
end

Axis Specifiers

In Gammo, an axis specifier indicates the navigation direction within the DOM tree. Here is the list of axes. As you can see, Gammo supports all of them.

Full Syntax	Abbreviated Syntax	Supported	Notes
`ancestor`		yes
`ancestor-or-self`		yes
`attribute`	`@`	yes	`@abc` is the alias for `attribute::abc`
`child`		yes	`abc` is short for `child::abc`
`descendant`		yes
`descendant-or-self`	`//`	yes	`//` is the alias for `/descendant-or-self::node()/`
`following`		yes
`following-sibling`		yes
`namespace`		yes
`parent`	`..`	yes	`..` is the alias for `parent::node()`
`preceding`		yes
`preceding-sibling`		yes
`self`	`.`	yes	`.` is the alias for `self::node()`
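
To make the navigation directions concrete, the following sketch computes a few of these axes on a small hand-built tree in plain Ruby. AxisNode and the *_axis helpers are hypothetical stand-ins written for this illustration; they are not Gammo classes and say nothing about how Gammo implements axes internally.

```ruby
# Stand-in tree node (NOT Gammo's class) with a parent pointer and child list.
class AxisNode
  attr_reader :name, :parent, :children

  def initialize(name, parent = nil)
    @name = name
    @parent = parent
    @children = []
    parent.children << self if parent
  end
end

# ancestor axis: parent, grandparent, and so on up to the root.
def ancestor_axis(node)
  result = []
  current = node.parent
  while current
    result << current
    current = current.parent
  end
  result
end

# Position of node among its parent's children, compared by identity.
def sibling_index(node)
  node.parent.children.index { |sibling| sibling.equal?(node) }
end

# following-sibling axis: later siblings in document order.
def following_sibling_axis(node)
  return [] unless node.parent
  node.parent.children.drop(sibling_index(node) + 1)
end

# preceding-sibling axis: earlier siblings, nearest first (a reverse axis).
def preceding_sibling_axis(node)
  return [] unless node.parent
  node.parent.children.take(sibling_index(node)).reverse
end

# descendant axis: children, grandchildren, and so on in document order.
def descendant_axis(node)
  node.children.flat_map { |child| [child, *descendant_axis(child)] }
end

html = AxisNode.new('html')
body = AxisNode.new('body', html)
h1   = AxisNode.new('h1', body)
AxisNode.new('p', body)
ul = AxisNode.new('ul', body)
AxisNode.new('li', ul)
AxisNode.new('li', ul)

puts ancestor_axis(ul).map(&:name).inspect          # ["body", "html"]
puts following_sibling_axis(h1).map(&:name).inspect # ["p", "ul"]
puts preceding_sibling_axis(ul).map(&:name).inspect # ["p", "h1"]
puts descendant_axis(body).map(&:name).inspect      # ["h1", "p", "ul", "li", "li"]
```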

Node Test

Node tests consist of specific node names or more general expressions. Although XPath allows a namespace prefix to be specified with syntax such as prefix:name, Gammo does not support it yet, as it's not a core feature in HTML5.

Full Syntax	Supported	Notes
`text()`	yes	Finds a node of type text, e.g. `hello` in `<p>hello <a href="https://hello">world</a></p>`
`comment()`	yes	Finds a node of type comment, e.g. `<!-- comment -->`
`node()`	yes	Finds any node at all.

Also note that the processing-instruction() node test is not supported, and there is no plan to support it.

Operators

The /, // and [] are used in the path expression.
The union operator | forms the union of two node sets.
The boolean operators: and, or
The arithmetic operators: +, -, *, div and mod
Comparison operators: =, !=, <, >, <=, >=

Functions

XPath 1.0 defines four data types (node-set, string, number, boolean), and there are various functions based on these types. Gammo supports these functions only partially, so please check that a function is supported before using it.

Node set functions

Function Name	Supported	Specification
`last()`	yes	https://www.w3.org/TR/1999/REC-xpath-19991116/#function-last
`position()`	yes	https://www.w3.org/TR/1999/REC-xpath-19991116/#function-position
`count(node-set)`	yes	https://www.w3.org/TR/1999/REC-xpath-19991116/#function-count

String Functions

Function Name	Supported	Specification
`string(object?)`	yes	https://www.w3.org/TR/1999/REC-xpath-19991116/#function-string
`concat(string, string, string*)`	yes	https://www.w3.org/TR/1999/REC-xpath-19991116/#function-concat
`starts-with(string, string)`	yes	https://www.w3.org/TR/1999/REC-xpath-19991116/#function-starts-with
`contains(string, string)`	yes	https://www.w3.org/TR/1999/REC-xpath-19991116/#function-contains
`substring-before(string, string)`	yes	https://www.w3.org/TR/1999/REC-xpath-19991116/#function-substring-before
`substring-after(string, string)`	yes	https://www.w3.org/TR/1999/REC-xpath-19991116/#function-substring-after
`substring(string, number, number?)`	no	https://www.w3.org/TR/1999/REC-xpath-19991116/#function-substring
`string-length(string?)`	no	https://www.w3.org/TR/1999/REC-xpath-19991116/#function-string-length
`normalize-space(string?)`	no	https://www.w3.org/TR/1999/REC-xpath-19991116/#function-string-normalize-space
`translate(string, string, string)`	no	https://www.w3.org/TR/1999/REC-xpath-19991116/#function-string-translate
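
For the functions marked no above, one possible workaround is to select nodes with a supported expression and post-process their text in Ruby. The helpers below are a sketch of the XPath 1.0 definitions in plain Ruby. The xpath_* names are hypothetical and not part of Gammo, and the sketch deliberately ignores NaN and infinite arguments to substring.

```ruby
# XPath 1.0 round(): nearest integer, with ties going toward positive infinity.
def xpath_round(number)
  (number + 0.5).floor
end

# substring(string, start, length?): positions are 1-based and both numeric
# arguments are rounded before use.
def xpath_substring(str, start, length = nil)
  first = xpath_round(start)
  last = length ? first + xpath_round(length) : Float::INFINITY
  str.each_char.with_index(1)
     .select { |_char, pos| pos >= first && pos < last }
     .map(&:first)
     .join
end

# string-length(string): number of characters.
def xpath_string_length(str)
  str.length
end

# normalize-space(string): strip leading/trailing whitespace and collapse
# inner runs of space, tab, CR and LF into a single space.
def xpath_normalize_space(str)
  str.split(/[ \t\r\n]+/).reject(&:empty?).join(' ')
end

# translate(string, from, to): map each character found in `from` to the
# character at the same index in `to`, or drop it when `to` is too short.
def xpath_translate(str, from, to)
  str.each_char.map do |char|
    index = from.index(char)
    index.nil? ? char : to[index].to_s
  end.join
end

puts xpath_substring('12345', 1.5, 2.6)         # 234
puts xpath_normalize_space("  hello \n world ") # hello world
puts xpath_translate('--aaa--', 'abc-', 'ABC')  # AAA
```

In practice these would be applied to a string such as elm.inner_text for each node returned by document.xpath.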

Boolean Functions

Function Name	Supported	Specification
`boolean(object)`	yes	https://www.w3.org/TR/1999/REC-xpath-19991116/#function-boolean
`not(object)`	yes	https://www.w3.org/TR/1999/REC-xpath-19991116/#function-not
`true()`	yes	https://www.w3.org/TR/1999/REC-xpath-19991116/#function-true
`false()`	yes	https://www.w3.org/TR/1999/REC-xpath-19991116/#function-false
`lang()`	no	https://www.w3.org/TR/1999/REC-xpath-19991116/#function-lang

Number Functions

Function Name	Supported	Specification
`number(object?)`	no	https://www.w3.org/TR/1999/REC-xpath-19991116/#function-number
`sum(node-set)`	no	https://www.w3.org/TR/1999/REC-xpath-19991116/#function-sum
`floor(number)`	no	https://www.w3.org/TR/1999/REC-xpath-19991116/#function-floor
`ceiling(number)`	yes	https://www.w3.org/TR/1999/REC-xpath-19991116/#function-ceiling
`round(number)`	no	https://www.w3.org/TR/1999/REC-xpath-19991116/#function-round
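
Similarly, the number functions marked no above can be approximated in Ruby after the nodes have been selected. This is a sketch of the XPath 1.0 rules under stated assumptions: the xpath_* helper names are hypothetical, inputs are strings such as a node's text, and results are Ruby numerics rather than XPath number objects.

```ruby
# XPath 1.0 number syntax: optional whitespace, optional minus sign, digits
# with an optional fraction (or a bare fraction), optional whitespace.
# No exponent and no leading plus sign are allowed.
XPATH_NUMBER_PATTERN =
  /\A[ \t\r\n]*-?(?:[0-9]+(?:\.[0-9]*)?|\.[0-9]+)[ \t\r\n]*\z/

# number(string): the parsed value, or NaN when the string is not a number.
def xpath_number(text)
  XPATH_NUMBER_PATTERN.match?(text) ? text.to_f : Float::NAN
end

# sum(node-set): sum of number() applied to each node's string value.
def xpath_sum(texts)
  texts.sum(0.0) { |text| xpath_number(text) }
end

# floor(number): largest integer not greater than the argument.
# NaN and infinities are returned unchanged.
def xpath_floor(number)
  value = number.to_f
  value.finite? ? value.floor : value
end

# round(number): nearest integer, with ties going toward positive infinity,
# which differs from Ruby's Float#round for negative halves.
def xpath_round(number)
  value = number.to_f
  value.finite? ? (value + 0.5).floor : value
end

puts xpath_sum(%w[1 2.5 3])   # 6.5
puts xpath_round(2.5)         # 3
puts xpath_round(-2.5)        # -2
puts xpath_number('1e3').nan? # true
```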

CSS Selector (experimental)

Gammo has its own lexer/parser for CSS selectors, provided as a helper on the DOM tree built by Gammo. Here is a simple example:

document = Gammo.new('<!doctype html><input type="button">').parse
node_set = document.css('input[type="button"]') #=> "<Gammo::CSSSelector::NodeSet>"

node_set.length #=> 1
node_set.first #=> "<Gammo::Node::Element>"

Since this is implemented from scratch, Gammo provides this support as a highly experimental feature. Please file an issue if you find bugs.

Example

Before going into the details of CSS selector support, let's look at a few simple examples. Given a sample HTML text and its DOM tree:

document = Gammo.new(<<-EOS).parse
<!DOCTYPE html>
<html>
<head>
</head>
<body>
  <h1>namusyaka.com</h1>
  <p class="description">Here is a sample web site.</p>
  <ul>
    <li>hello</li>
    <li>world</li>
  </ul>
  <ul id="links">
    <li>Google <a href="https://google.com/">google.com</a></li>
    <li>GitHub <a href="https://github.com/namusyaka">github.com/namusyaka</a></li>
  </ul>
</body>
</html>
EOS

The following CSS selector gets all li elements and prints their text contents:

document.css('li').each do |elm|
  puts elm.inner_text
end

The following CSS selector gets all li elements under the ul element having the id=links attribute:

document.css('ul#links li').each do |elm|
  puts elm.inner_text
end

Groups of selectors

Gammo supports groups of selectors: you can use a comma (,) to traverse the DOM tree with multiple selectors at once.

require 'gammo'

@doc = Gammo.new(<<-EOS).parse
<!DOCTYPE html>
<html>
<head>
<title>hello</title>
<meta charset="utf8">
</head>
<body>
<p id="hello">hello</p>
<p id="world">world</p>
EOS

@doc.css('#hello, #world').map(&:inner_text).join(' ') #=> 'hello world'

Simple selectors

Type selector & Universal selector

Gammo supports the basic grammar of type selector and universal selector, but not namespaces.

Attribute selectors

See more details: 6.3. Attribute selectors (this and the section numbers below refer to the W3C Selectors Level 3 specification).

Syntax	Supported
`[att]`	yes
`[att=val]`	yes
`[att~=val]`	yes
`[att|=val]`	yes
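
The four forms differ only in how the attribute value is compared. The following plain-Ruby sketch spells out those matching rules as the Selectors specification defines them; attribute_match? is a hypothetical helper operating on a Hash of attributes, not Gammo's implementation.

```ruby
# Whitespace as defined for selectors: space, tab, LF, FF and CR.
CSS_WHITESPACE = /[ \t\n\f\r]+/

# Returns true when the attribute `name` in the `attributes` Hash satisfies
# the given operator and value. A nil operator models the bare [att] form.
def attribute_match?(attributes, name, operator = nil, value = nil)
  actual = attributes[name]
  return false if actual.nil?

  case operator
  when nil
    true
  when '='
    actual == value
  when '~='
    # One of the whitespace-separated words must equal value exactly; an
    # empty value or a value containing whitespace never matches.
    !value.empty? && !value.match?(CSS_WHITESPACE) &&
      actual.split(CSS_WHITESPACE).include?(value)
  when '|='
    # Exactly value, or value immediately followed by a hyphen.
    actual == value || actual.start_with?("#{value}-")
  else
    raise ArgumentError, "unsupported operator: #{operator}"
  end
end

attrs = { 'type' => 'button', 'class' => 'btn primary', 'lang' => 'en-US' }
puts attribute_match?(attrs, 'type')                   # true
puts attribute_match?(attrs, 'class', '~=', 'primary') # true
puts attribute_match?(attrs, 'class', '~=', 'prim')    # false
puts attribute_match?(attrs, 'lang', '|=', 'en')       # true
```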

Class selectors

Supported. See more details: 6.4. Class selectors

ID selectors

Supported. See more details: 6.5. ID selectors

Pseudo-classes

Partially supported. See the table below.

Class name	Supported	Can support?
`:link`	no	no
`:visited`	no	no
`:hover`	no	no
`:active`	no	no
`:focus`	no	no
`:target`	no	no
`:lang`	no	yes
`:enabled`	yes	yes
`:disabled`	yes	yes
`:checked`	yes	yes
`:root`	yes	yes
`:nth-child`	yes	yes
`:nth-last-child`	no	yes
`:nth-of-type`	no	yes
`:nth-last-of-type`	no	yes
`:first-child`	no	yes
`:last-child`	no	yes
`:first-of-type`	no	yes
`:last-of-type`	no	yes
`:only-child`	no	yes
`:only-of-type`	no	yes
`:empty`	no	yes
`:not`	yes	yes
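
The argument of :nth-child has the form an+b: an element matches when its 1-based position among its siblings equals a*n + b for some integer n >= 0. The following plain-Ruby sketch shows that arithmetic; nth_match? is a hypothetical helper written for illustration and is not taken from Gammo.

```ruby
# Returns true when the 1-based `index` equals a * n + b for some n >= 0.
def nth_match?(a, b, index)
  diff = index - b
  return diff.zero? if a.zero?
  (diff % a).zero? && diff / a >= 0
end

items = %w[a b c d e f]

# :nth-child(2n+1), i.e. odd positions
odd = items.select.with_index(1) { |_item, i| nth_match?(2, 1, i) }
puts odd.inspect         # ["a", "c", "e"]

# :nth-child(-n+3), i.e. the first three positions
first_three = items.select.with_index(1) { |_item, i| nth_match?(-1, 3, i) }
puts first_three.inspect # ["a", "b", "c"]
```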

Combinators

See more details: 8. Combinators

Syntax	Supported	Description
`h1 em`	yes	Descendant combinator
`h1 > em`	yes	Child combinator
`math + p`	yes	Next-sibling combinator
`h1 ~ pre`	yes	Subsequent-sibling combinator
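
Each combinator corresponds to an XPath axis, which can help when moving between the two traversal styles. The sketch below builds the standard XPath 1.0 equivalent for two bare type selectors joined by a combinator. combinator_xpath is a hypothetical helper, and whether Gammo's experimental XPath engine evaluates every generated expression identically to its CSS counterpart has not been verified here.

```ruby
# Maps each combinator to a lambda producing the equivalent XPath 1.0
# expression for two element names.
COMBINATOR_TO_XPATH = {
  ' ' => ->(left, right) { "//#{left}//#{right}" },
  '>' => ->(left, right) { "//#{left}/#{right}" },
  '+' => ->(left, right) { "//#{left}/following-sibling::*[1][self::#{right}]" },
  '~' => ->(left, right) { "//#{left}/following-sibling::#{right}" }
}.freeze

# Only handles plain element names on both sides, not full selectors.
def combinator_xpath(left, combinator, right)
  COMBINATOR_TO_XPATH.fetch(combinator).call(left, right)
end

puts combinator_xpath('h1', ' ', 'em')  # //h1//em
puts combinator_xpath('h1', '>', 'em')  # //h1/em
puts combinator_xpath('math', '+', 'p') # //math/following-sibling::*[1][self::p]
puts combinator_xpath('h1', '~', 'pre') # //h1/following-sibling::pre
```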

Performance

As mentioned in the features at the beginning, Gammo does not prioritize performance. It is therefore not suitable for very performance-sensitive applications (e.g. parsing HTML synchronously while handling an incoming request from an end user). Instead, the goal is to work well in batch processing such as crawlers. Gammo places the highest priority on making HTML easy to parse without depending on native extensions or external gems.

References

This was developed with reference to the following software.

x/net/html: I've been working on this package, and it gave me a strong reason to make this happen.
Blink: Blink gave me a great impression of tree construction.
html5lib-tests: Gammo relies on this test suite.

License

The gem is available as open source under the terms of the MIT License.

Release History

v0.3.0
- CSS selector support #11
v0.2.0
- XPath 1.0 support #4
v0.1.0
- Initial Release
