elijah_dump

Dump and parse existing pages for Madrid.rb from Jottit

Usage

Simply run elijah_dump.rb.

The script gets the list of pages and tries to convert them. If successful, it writes the results in JSON and YAML formats, as out/meetings.json and out/meetings.yml.

WARNING: It overwrites any existing versions of these files!

It will create the out directory if it doesn't exist yet.
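The output step described above can be sketched roughly as follows. This is a minimal illustration, not the code in elijah_dump.rb: the helper name write_results and its out_dir parameter are assumptions made for the example.

```ruby
require 'fileutils'
require 'json'
require 'yaml'

# Hypothetical helper (not taken from elijah_dump.rb): writes the meetings
# array as meetings.json and meetings.yml inside out_dir.
# Existing files with those names are overwritten.
def write_results(meetings, out_dir = 'out')
  FileUtils.mkdir_p(out_dir) # does nothing if the directory already exists
  File.write(File.join(out_dir, 'meetings.json'), JSON.pretty_generate(meetings))
  File.write(File.join(out_dir, 'meetings.yml'), meetings.to_yaml)
end
```

Because File.write truncates the target file, running it twice replaces the earlier results, which matches the warning above.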

Output data

Each output file contains an array of meetings. Each meeting can include the following fields (a missing field means the page had no value for it or it could not be parsed correctly).

title
details
details_md
meeting_date
meeting_time
offered_by
offered_by_html
attendees
venue
map_url
original_url
topics

attendees contains a list of (usually) Twitter handles or (sometimes) plain names.

offered_by contains a list of the sponsors' URLs. offered_by_html contains the raw HTML for that info (normally with images).

topics contains an array of topics (talks) that took place during the meeting. Each topic can include:

title
details
details_md
video_url
slides_url
speakers

Again, speakers is an array that contains the list of speakers for a given topic. Each can include:

speaker_name
speaker_handle
speaker_bio
speaker_bio_md

These fields contain raw HTML: details, speaker_bio, offered_by_html.

These fields contain Markdown: details_md (both in a meeting and in each topic) and speaker_bio_md.

The rest contain plain text.
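As an illustration of how the nested structure can be consumed, here is a small sketch that walks meetings, topics and speakers. It assumes the JSON keys are exactly the field names listed above and that dates are serialized as strings; the function name talk_summaries is made up for this example.

```ruby
require 'json'

# Hypothetical helper: returns one line per talk, in the form
# "date | meeting title | talk title | speakers".
# Any field may be missing, so every lookup falls back to a default.
def talk_summaries(meetings)
  meetings.flat_map do |meeting|
    (meeting['topics'] || []).map do |topic|
      speakers = (topic['speakers'] || []).map do |s|
        s['speaker_name'] || s['speaker_handle'] || '?'
      end
      [meeting['meeting_date'] || '?',
       meeting['title'] || '?',
       topic['title'] || '?',
       speakers.join(', ')].join(' | ')
    end
  end
end

# Usage, assuming out/meetings.json exists:
#   meetings = JSON.parse(File.read('out/meetings.json'))
#   puts talk_summaries(meetings)
```

Meetings without a topics field simply produce no lines.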

Markdown conversion

Fields with HTML are converted back to Markdown thanks to reverse_markdown.

Original pages were written in Markdown but the parsing uses Nokogiri to navigate through the raw HTML, so the results are in HTML too. However, Markdown source is preferred to repopulate the new site.

Fortunately, reverse_markdown seems to do a great job of reversing the process.
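The relationship between the HTML fields and their _md counterparts might be sketched as below. The helper add_markdown_fields is hypothetical and is not taken from the repository; the converter is passed in as a parameter so the sketch runs without any gem installed.

```ruby
# Hypothetical helper: returns a copy of record with a "<field>_md" entry
# added for each listed HTML field that has a non-empty value.
# converter is anything that responds to #call and maps HTML to Markdown.
def add_markdown_fields(record, html_fields, converter)
  html_fields.each_with_object(record.dup) do |field, result|
    html = record[field]
    next if html.nil? || html.empty?

    result["#{field}_md"] = converter.call(html)
  end
end

# With the reverse_markdown gem installed, the converter could be:
#   require 'reverse_markdown'
#   converter = ReverseMarkdown.method(:convert)
#   topic = add_markdown_fields(topic, ['details'], converter)
```

Keeping both the HTML and the Markdown versions in the output lets a consumer pick whichever suits the new site.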

Page caching

To speed up processing (especially during development), pages are downloaded only once and stored under the page_cache directory.
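A download-once cache of this kind can be sketched as follows. This is not the code in page_cache.rb or page_fetcher.rb: the function name cached_page and the use of a SHA-256 digest of the URL as the file name are assumptions for the example.

```ruby
require 'digest'
require 'fileutils'
require 'net/http'
require 'uri'

# Hypothetical helper: returns the body for url, reading it from cache_dir
# when a cached copy exists. The optional block replaces the default HTTP
# download, which is handy for testing.
def cached_page(url, cache_dir = 'page_cache')
  FileUtils.mkdir_p(cache_dir)
  # Assumed naming scheme: one file per URL, named after its digest.
  path = File.join(cache_dir, Digest::SHA256.hexdigest(url))
  return File.read(path) if File.exist?(path)

  body = block_given? ? yield(url) : Net::HTTP.get(URI(url))
  File.write(path, body)
  body
end
```

Deleting a file from the cache directory forces that page to be downloaded again on the next run.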

Results included!!

If you are interested in this, it's most probably because you just want the results. To make your life easier, they are included in the repository. Just get them from the out directory and be done.

For the same price, the cached pages are included too!!

Current issues

Sections not identified as speaker, attendees, etc. are just appended to details.
~~Assumes one talk per meeting. In meetings with more than one talk (or more than one speaker), speaker data is not accurate.~~ It now supports meetings with multiple topics (talks) and multiple speakers.
~~Right now, it only gets pages from Jottit (no GitHub pages yet!)~~ Pages from GitHub now work!

Author

Josep Egea. March 2015

