Navigating by sections

zverok edited this page Aug 7, 2015 · 4 revisions

The "flat" document model (a heading, then a paragraph, then another heading) corresponds to Wikipedia's structure (and to typical HTML), but it's not semantic enough!

So, here are sections:

page.intro # paragraphs before first heading
page.sections # top-level document sections, made of heading and nodes
              # before next heading of same level

sec = page.sections.first
sec.heading # => <Heading(level=2)....>
sec.children # => all nodes in section

# Further navigation inside section:
sec.paragraphs.first.images

# Next level sections:
sec.intro
sec.sections

# Concrete sections:
page.sections('Culture')
# or even
page.sections(/Season|Episodes/)

# Sections inside sections:
page.sections('Culture').sections('Visual arts')
# or just:
page.sections('Culture', 'Visual arts')
# or sugar for second-level sections:
page.sections('Culture' => 'Visual arts')

# multiple sections:
page.sections('Culture' => /.*/)
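
To make the "heading plus nodes before the next heading of the same level" rule concrete, here is a minimal, self-contained sketch of how a flat node list could be grouped into nested sections. It is only a toy model, not Infoboxer's implementation: `Heading`, `Paragraph`, `Section` and `build_sections` are hypothetical names invented for this illustration.

```ruby
# Toy model, NOT Infoboxer's real classes: every name here is hypothetical.
Heading   = Struct.new(:level, :text)
Paragraph = Struct.new(:text)

Section = Struct.new(:heading, :children) do
  # Nodes before the first nested heading.
  def intro
    children.take_while { |node| !node.is_a?(Heading) }
  end

  # Nested sections, built from this section's own children.
  def sections
    build_sections(children)
  end
end

# Split a flat node list into sections: each section starts at a heading of
# the shallowest level present and runs until the next heading of that level.
def build_sections(nodes)
  top = nodes.grep(Heading).map(&:level).min
  return [] unless top

  result = []
  nodes.each do |node|
    if node.is_a?(Heading) && node.level == top
      result << Section.new(node, [])
    elsif result.any?
      result.last.children << node
    end
  end
  result
end

flat = [
  Paragraph.new('Lead text'),
  Heading.new(2, 'Culture'),
  Paragraph.new('About culture'),
  Heading.new(3, 'Visual arts'),
  Paragraph.new('About painting'),
  Heading.new(2, 'History'),
  Paragraph.new('About history')
]

sections = build_sections(flat)
sections.map { |s| s.heading.text }                # => ["Culture", "History"]
sections.first.intro.map(&:text)                   # => ["About culture"]
sections.first.sections.map { |s| s.heading.text } # => ["Visual arts"]
```

Note that the "Visual arts" heading stays among the children of "Culture" and only becomes a section of its own when `sections` is called on the parent, which mirrors the `sec.intro` / `sec.sections` navigation above.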

Gotcha: sections are "virtual" nodes; they are NOT in the tree. So you may be surprised by:

page.lookup(:Section)
# => []

section = page.sections.first
# => <Section...>

section.paragraphs.first.lookup_parents(:Section)
# => []
section.paragraphs.first.parent
# => <Page....>

# but there IS Node#in_sections for each node:
section.paragraphs.first.in_sections
# => [<Section...>, <Section...>...]
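
One way to picture this behavior is a tree whose parent links never point at sections, with sections kept as separate objects that merely reference nodes. The sketch below illustrates that idea under stated assumptions; `Node`, `VirtualSection` and `sections_containing` are hypothetical names for this example, not Infoboxer's API, and the real `Node#in_sections` may work differently.

```ruby
# Toy model, NOT Infoboxer's real classes: every name here is hypothetical.
class Node
  attr_reader :name, :children
  attr_accessor :parent

  def initialize(name, children = [])
    @name = name
    @children = children
    children.each { |child| child.parent = self } # the only real tree link
  end
end

# A section only references nodes; it never becomes anybody's parent.
VirtualSection = Struct.new(:title, :nodes)

# Hypothetical stand-in for Node#in_sections: search the known sections
# for the ones that reference this exact node object.
def sections_containing(node, sections)
  sections.select { |s| s.nodes.any? { |n| n.equal?(node) } }
end

page = Node.new('page', [Node.new('heading'), Node.new('paragraph')])
paragraph = page.children.last
section = VirtualSection.new('Culture', [paragraph])

paragraph.parent.name                                  # => "page"
sections_containing(paragraph, [section]).map(&:title) # => ["Culture"]
```

Walking up through `parent` never reaches the section, which is why a parent lookup comes back empty while a dedicated "which sections contain me" query still works.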

See also API docs.

Next topics:

  • On Templates -- the real strength (and sometimes the real pain) of Wikipedia content is in templates.
  • Tips and tricks -- the full process of information extraction is described here, along with some useful tips and gotchas.