Ruby is a time-tested, open-source programming language. Its first stable version was released in 1996, and the latest major release, Ruby 3, arrived in December 2020. This article covers tools and techniques for web scraping that work with Ruby 3.
We’ll begin with a step-by-step overview of scraping static public web pages, then shift our focus to scraping dynamic pages. The first approach works with most websites, but it fails on pages that use JavaScript to render their content. To handle those sites, we’ll look at headless browsers.
For a detailed explanation, see our blog post.
To install Ruby on Windows, use the Chocolatey package manager and run the following:
choco install ruby
To install Ruby on macOS, use a package manager such as Homebrew. Enter the following in the terminal:
brew install ruby
For Linux, use the package manager for your distro. For example, run the following for Ubuntu:
sudo apt install ruby-full
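Whichever platform you’re on, you can confirm that the installed interpreter is on the 3.x line from Ruby itself:

```ruby
# Print the version of the running Ruby interpreter, e.g. "3.2.2"
puts RUBY_VERSION
```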
In this section, we’ll write a web scraper that collects data from [https://sandbox.oxylabs.io/products](https://sandbox.oxylabs.io/products), a dummy video game store for practicing web scraping with static websites.
gem install httparty
gem install nokogiri
gem install csv
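Instead of installing the gems one by one, you could also declare them in a Gemfile and run `bundle install`; a minimal sketch:

```ruby
# Gemfile
source 'https://rubygems.org'

gem 'httparty'   # HTTP client for fetching pages
gem 'nokogiri'   # HTML parser with CSS selector support
gem 'csv'        # CSV output (a bundled gem as of Ruby 3.4)
```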
require 'httparty'

response = HTTParty.get('https://sandbox.oxylabs.io/products')
if response.code == 200
  puts response.body
else
  puts "Error: #{response.code}"
  exit
end
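Note the `#{...}` in the error message: Ruby interpolates expressions only inside double-quoted strings, a detail that matters again when building paginated URLs. A quick stdlib illustration:

```ruby
code = 404
puts "Error: #{code}"  # double quotes interpolate, printing: Error: 404
puts 'Error: #{code}'  # single quotes do not, printing the literal characters
```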
require 'nokogiri'
document = Nokogiri::HTML4(response.body)
games = []
50.times do |i|
  url = "https://sandbox.oxylabs.io/products?page=#{i + 1}"
  response = HTTParty.get(url)
  document = Nokogiri::HTML4(response.body)

  all_game_containers = document.css('.product-card')
  all_game_containers.each do |container|
    title = container.css('h4').text.strip
    price = container.css('.price-wrapper').text.delete('^0-9.')
    category_elements = container.css('.category span')
    categories = category_elements.map { |elem| elem.text.strip }.join(', ')
    games << [title, price, categories]
  end
end
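The `delete('^0-9.')` call above is plain `String#delete` with a negated character set: everything except digits and the dot is removed, turning a label like the hypothetical `"From $29.99"` into a bare number:

```ruby
price_text = 'From $29.99'       # hypothetical raw text from a price element
puts price_text.delete('^0-9.')  # => 29.99
```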
require 'csv'
CSV.open(
  'games.csv',
  'w+',
  write_headers: true,
  headers: %w[Title Price Categories]
) do |csv|
  50.times do |i|
    response = HTTParty.get("https://sandbox.oxylabs.io/products?page=#{i + 1}")
    document = Nokogiri::HTML4(response.body)

    all_game_containers = document.css('.product-card')
    all_game_containers.each do |container|
      title = container.css('h4').text.strip
      price = container.css('.price-wrapper').text.delete('^0-9.')
      category_elements = container.css('.category span')
      categories = category_elements.map { |elem| elem.text.strip }.join(', ')
      csv << [title, price, categories]
    end
  end
end
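The same header options work with `CSV.generate`, which is handy for checking the output shape in memory before writing a file; a small sketch with made-up rows:

```ruby
require 'csv'

output = CSV.generate(write_headers: true, headers: %w[Title Price Categories]) do |csv|
  csv << ['Sample Game', '29.99', 'Action, Adventure']
end
puts output
# Title,Price,Categories
# Sample Game,29.99,"Action, Adventure"
```

Note that the field containing a comma is quoted automatically, and that `%w[]` splits on whitespace, so no commas are needed between the header names.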
gem install selenium-webdriver
gem install csv
require 'selenium-webdriver'

driver = Selenium::WebDriver.for(:chrome)
# Load the JavaScript-rendered page before querying elements
driver.get('https://quotes.toscrape.com/js/')
quotes = []
quote_elements = driver.find_elements(css: '.quote')
quote_elements.each do |quote_el|
  quote_text = quote_el.find_element(css: '.text').attribute('textContent')
  author = quote_el.find_element(css: '.author').attribute('textContent')
  quotes << [quote_text, author]
end
quotes = []
loop do
  quote_elements = driver.find_elements(css: '.quote')
  quote_elements.each do |quote_el|
    quote_text = quote_el.find_element(css: '.text').attribute('textContent')
    author = quote_el.find_element(css: '.author').attribute('textContent')
    quotes << [quote_text, author]
  end

  begin
    driver.find_element(css: '.next > a').click
  rescue Selenium::WebDriver::Error::NoSuchElementError
    break # Next button not found; we are on the last page
  end
end
driver.quit
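The begin/rescue around the click is what ends the crawl: when no Next button exists, `find_element` raises an error, and the rescue turns that into a `break`. The same pattern with a plain stdlib exception, assuming nothing about Selenium:

```ruby
pages = ['page 1', 'page 2', 'page 3']  # hypothetical stand-in for paginated results
index = 0
visited = []

loop do
  visited << pages.fetch(index)  # Array#fetch raises IndexError past the end
  index += 1
rescue IndexError
  break                          # no more pages, mirroring the missing Next button
end

puts visited.length  # => 3
```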
require 'csv'
CSV.open('quotes.csv', 'w+', write_headers: true,
         headers: %w[Quote Author]) do |csv|
  quotes.each do |quote|
    csv << quote
  end
end
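To sanity-check the result, the file can be read back with the same stdlib class; a sketch that parses an in-memory string with made-up rows instead of the quotes.csv file:

```ruby
require 'csv'

data = "Quote,Author\n\"Quality is not an act.\",Aristotle\n"  # sample data
rows = CSV.parse(data, headers: true)
rows.each do |row|
  puts "#{row['Author']}: #{row['Quote']}"
end
# Aristotle: Quality is not an act.
```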
If you wish to find out more about web scraping with Ruby, see our blog post.