Ruby is a time-tested, open-source programming language. Its first stable version was released in 1996, and the latest major release, Ruby 3, arrived in December 2020. This article covers tools and techniques for web scraping that work with Ruby 3.
We’ll begin with a step-by-step overview of scraping static public web pages, then shift our focus to scraping dynamic pages. The first approach works with most websites, but it fails on pages that use JavaScript to render their content. To handle those sites, we’ll look at headless browsers.
For a detailed explanation, see our blog post.
To install Ruby on Windows, use the Chocolatey package manager and run the following:
choco install ruby
To install Ruby on macOS, use a package manager such as Homebrew. Enter the following in the terminal:
brew install ruby
For Linux, use the package manager for your distro. For example, run the following for Ubuntu:
sudo apt install ruby-full
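Whichever platform you’re on, you can confirm that the installed interpreter is on the 3.x line from Ruby itself:

```ruby
# Print the version of the running Ruby interpreter, e.g. "3.2.2"
puts RUBY_VERSION
```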
In this section, we’ll write a web scraper that collects data from [https://sandbox.oxylabs.io/products](https://sandbox.oxylabs.io/products), a dummy video game store for practicing web scraping with static websites.
gem install httparty
gem install nokogiri
gem install csv
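Instead of installing the gems one by one, you could also declare them in a Gemfile and run `bundle install`; a minimal sketch:

```ruby
# Gemfile
source 'https://rubygems.org'

gem 'httparty'   # HTTP client for fetching pages
gem 'nokogiri'   # HTML parser with CSS selector support
gem 'csv'        # CSV output (a bundled gem as of Ruby 3.4)
```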
require 'httparty'

response = HTTParty.get('https://sandbox.oxylabs.io/products')
if response.code == 200
  puts response.body
else
  puts "Error: #{response.code}"
  exit
end
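Note the `#{...}` in the error message: Ruby interpolates expressions only inside double-quoted strings, a detail that matters again when building paginated URLs. A quick stdlib illustration:

```ruby
code = 404
puts "Error: #{code}"  # double quotes interpolate, printing: Error: 404
puts 'Error: #{code}'  # single quotes do not, printing the literal characters
```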
require 'nokogiri'
document = Nokogiri::HTML4(response.body)
games = []
50.times do |i|
  url = "https://sandbox.oxylabs.io/products?page=#{i + 1}"
  response = HTTParty.get(url)
  document = Nokogiri::HTML4(response.body)

  all_game_containers = document.css('.product-card')
  all_game_containers.each do |container|
    title = container.css('h4').text.strip
    price = container.css('.price-wrapper').text.delete('^0-9.')
    category_elements = container.css('.category span')
    categories = category_elements.map { |elem| elem.text.strip }.join(', ')
    games << [title, price, categories]
  end
end
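The `delete('^0-9.')` call above is plain `String#delete` with a negated character set: everything except digits and the dot is removed, turning a label like the hypothetical `"From $29.99"` into a bare number:

```ruby
price_text = 'From $29.99'       # hypothetical raw text from a price element
puts price_text.delete('^0-9.')  # => 29.99
```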
require 'csv'
CSV.open(
  'games.csv',
  'w+',
  write_headers: true,
  headers: %w[Title Price Categories]
) do |csv|
  50.times do |i|
    response = HTTParty.get("https://sandbox.oxylabs.io/products?page=#{i + 1}")
    document = Nokogiri::HTML4(response.body)

    all_game_containers = document.css('.product-card')
    all_game_containers.each do |container|
      title = container.css('h4').text.strip
      price = container.css('.price-wrapper').text.delete('^0-9.')
      category_elements = container.css('.category span')
      categories = category_elements.map { |elem| elem.text.strip }.join(', ')
      csv << [title, price, categories]
    end
  end
end
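The same header options work with `CSV.generate`, which is handy for checking the output shape in memory before writing a file; a small sketch with made-up rows:

```ruby
require 'csv'

output = CSV.generate(write_headers: true, headers: %w[Title Price Categories]) do |csv|
  csv << ['Sample Game', '29.99', 'Action, Adventure']
end
puts output
# Title,Price,Categories
# Sample Game,29.99,"Action, Adventure"
```

Note that the field containing a comma is quoted automatically, and that `%w[]` splits on whitespace, so no commas are needed between the header names.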
gem install selenium-webdriver
gem install csv
require 'selenium-webdriver'

driver = Selenium::WebDriver.for(:chrome)
# Load the JavaScript-rendered page before querying elements
driver.get('https://quotes.toscrape.com/js/')
quotes = []
quote_elements = driver.find_elements(css: '.quote')
quote_elements.each do |quote_el|
  quote_text = quote_el.find_element(css: '.text').attribute('textContent')
  author = quote_el.find_element(css: '.author').attribute('textContent')
  quotes << [quote_text, author]
end
quotes = []
loop do
  quote_elements = driver.find_elements(css: '.quote')
  quote_elements.each do |quote_el|
    quote_text = quote_el.find_element(css: '.text').attribute('textContent')
    author = quote_el.find_element(css: '.author').attribute('textContent')
    quotes << [quote_text, author]
  end

  begin
    driver.find_element(css: '.next > a').click
  rescue Selenium::WebDriver::Error::NoSuchElementError
    break # Next button not found; we are on the last page
  end
end
driver.quit
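The begin/rescue around the click is what ends the crawl: when no Next button exists, `find_element` raises an error, and the rescue turns that into a `break`. The same pattern with a plain stdlib exception, assuming nothing about Selenium:

```ruby
pages = ['page 1', 'page 2', 'page 3']  # hypothetical stand-in for paginated results
index = 0
visited = []

loop do
  visited << pages.fetch(index)  # Array#fetch raises IndexError past the end
  index += 1
rescue IndexError
  break                          # no more pages, mirroring the missing Next button
end

puts visited.length  # => 3
```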
require 'csv'
CSV.open('quotes.csv', 'w+', write_headers: true,
         headers: %w[Quote Author]) do |csv|
  quotes.each do |quote|
    csv << quote
  end
end
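To sanity-check the result, the file can be read back with the same stdlib class; a sketch that parses an in-memory string with made-up rows instead of the quotes.csv file:

```ruby
require 'csv'

data = "Quote,Author\n\"Quality is not an act.\",Aristotle\n"  # sample data
rows = CSV.parse(data, headers: true)
rows.each do |row|
  puts "#{row['Author']}: #{row['Quote']}"
end
# Aristotle: Quality is not an act.
```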
If you wish to find out more about web scraping with Ruby, see our blog post.