Navigating by sections
zverok edited this page Aug 7, 2015 · 4 revisions
The "flat" document model (heading, then paragraph, then another heading) corresponds to Wikipedia's structure (and to typical HTML), but it's not semantic enough!
So, here are sections:
page.intro # paragraphs before first heading
page.sections # top-level document sections, made of heading and nodes
# before next heading of same level
sec = page.sections.first
sec.heading # => <Heading(level=2)....>
sec.children # => all nodes in section
# Further navigation inside section:
sec.paragraphs.first.images
# Next level sections:
sec.intro
sec.sections
# Concrete sections:
page.sections('Culture')
# or even
page.sections(/Season|Episodes/)
# Sections inside sections:
page.sections('Culture').sections('Visual arts')
# or just:
page.sections('Culture', 'Visual arts')
# or sugar for second-level sections:
page.sections('Culture' => 'Visual arts')
# multiple sections:
page.sections('Culture' => /.*/)
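To make the idea of "heading plus the nodes before the next heading of the same level" concrete, here is a toy sketch of that grouping over a flat node list. It is not Infoboxer's implementation: `Heading`, `Paragraph`, `Section` and `split_sections` are hypothetical names invented for this illustration, and the sketch assumes heading levels never skip.

```ruby
# Toy model only: these structs are hypothetical, not Infoboxer classes.
Heading   = Struct.new(:level, :text)
Paragraph = Struct.new(:text)
Section   = Struct.new(:heading, :children)

# Group a flat node list into sections opened by headings of exactly `level`.
# Returns [intro, sections]; intro holds the nodes before the first such heading.
def split_sections(nodes, level)
  intro = []
  sections = []
  nodes.each do |node|
    if node.is_a?(Heading) && node.level == level
      sections << Section.new(node, [])   # a heading of this level opens a new section
    elsif sections.empty?
      intro << node                       # nothing opened yet: node belongs to the intro
    else
      sections.last.children << node      # deeper headings and other nodes stay inside
    end
  end
  [intro, sections]
end

flat = [
  Paragraph.new('Lead text'),
  Heading.new(2, 'History'),
  Paragraph.new('Early years'),
  Heading.new(2, 'Culture'),
  Paragraph.new('Overview'),
  Heading.new(3, 'Visual arts'),
  Paragraph.new('Painting'),
]

# Top-level sections, then the next level inside "Culture"
intro, sections = split_sections(flat, 2)
culture = sections.find { |s| s.heading.text == 'Culture' }
sub_intro, sub_sections = split_sections(culture.children, 3)
```

In this toy, `intro` plays the role of `page.intro`, `sections` of `page.sections`, and re-splitting `culture.children` one level deeper mirrors `sec.intro` and `sec.sections`.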
Gotcha: sections are "virtual" nodes; they are NOT in the tree. So you may be surprised by:
page.lookup(:Section)
# => []
section = page.sections.first
# => <Section...>
section.paragraphs.first.lookup_parents(:Section)
# => []
section.paragraphs.first.parent
# => <Page....>
# but there IS Node#in_sections for each node:
section.paragraphs.first.in_sections
# => [<Section...>, <Section...>...]
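One way to picture this behaviour is a toy model where the tree holds only real nodes, and sections are separate objects that merely reference those nodes. This is a sketch under that assumption, not Infoboxer's code: `ToyPage`, `Node` and `VirtualSection` are hypothetical, and the innermost-first ordering of `in_sections` here is a choice of the sketch, not a statement about the library.

```ruby
# Toy model only: hypothetical classes, not Infoboxer's.
Node           = Struct.new(:name, :parent)
VirtualSection = Struct.new(:title, :children, :sections)

class ToyPage
  attr_reader :children, :sections

  def initialize
    @children = []  # the real tree: only actual nodes live here
    @sections = []  # "virtual" groupings that reference the same nodes
  end

  # Add a node to the tree; its parent is always the page itself
  def add(name)
    node = Node.new(name, self)
    @children << node
    node
  end

  # All sections that contain the node, innermost first (toy ordering)
  def in_sections(node, candidates = sections)
    candidates.each do |sec|
      next unless sec.children.any? { |child| child.equal?(node) }
      return in_sections(node, sec.sections) + [sec]
    end
    []
  end
end

page     = ToyPage.new
lead     = page.add('lead paragraph')
overview = page.add('culture overview')
painting = page.add('painting paragraph')

visual  = VirtualSection.new('Visual arts', [painting], [])
culture = VirtualSection.new('Culture', [overview, painting], [visual])
page.sections << culture

in_tree  = page.children.grep(VirtualSection)     # sections never appear among tree nodes
found_in = page.in_sections(painting).map(&:title) # computed by lookup, not by walking parents
```

Because sections only reference nodes, searching the tree for them finds nothing and `painting.parent` is still the page, while a dedicated lookup such as `in_sections` can still answer "which sections is this node in?".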
See also API docs.
Next topics:
- On Templates -- the real strength (and sometimes the real pain) of Wikipedia content is in templates.
- Tips and tricks -- the full process of information extraction is described here, along with some useful tips and gotchas.