oxylabs/webscraping-with-ruby


Web Scraping With Ruby


Ruby is a time-tested, open-source programming language. Its first version was released in 1996, and the latest major version, Ruby 3.0, arrived in December 2020. This article covers tools and techniques for web scraping with Ruby that work with version 3.

We’ll begin with a step-by-step overview of scraping static public web pages and then shift our focus to scraping dynamic pages. While the first approach works with most websites, it will not work on dynamic pages that use JavaScript to render their content. To handle those sites, we’ll use a headless browser.

For a detailed explanation, see our blog post.

Installing Ruby

To install Ruby on Windows, run the following:

choco install ruby

To install Ruby on macOS, use a package manager such as Homebrew. Enter the following in the terminal:

brew install ruby

For Linux, use the package manager for your distro. For example, run the following for Ubuntu:

sudo apt install ruby-full

Scraping static pages

In this section, we’ll write a web scraper that can scrape data from https://sandbox.oxylabs.io/products, a dummy video game store for practicing web scraping with static websites.

Installing required gems

gem install httparty
gem install nokogiri
gem install csv

Making an HTTP request

require 'httparty'

# Fetch the page; bail out early on any non-200 status code.
response = HTTParty.get('https://sandbox.oxylabs.io/products')
if response.code == 200
  puts response.body
else
  puts "Error: #{response.code}"
  exit
end
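The product catalog is spread across multiple pages, addressed with a ?page= query parameter. The page URLs can be built with Ruby string interpolation; note the #{...} syntax, since a bare {...} inside a string is emitted literally rather than evaluated:

```ruby
base = 'https://sandbox.oxylabs.io/products'

# Build the URLs for the first three pages; pages are 1-indexed.
urls = 3.times.map { |i| "#{base}?page=#{i + 1}" }

puts urls
# https://sandbox.oxylabs.io/products?page=1
# https://sandbox.oxylabs.io/products?page=2
# https://sandbox.oxylabs.io/products?page=3
```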

Parsing HTML with Nokogiri

require 'nokogiri'
document = Nokogiri::HTML4(response.body)

games = []
50.times do |i|
  # Interpolate the page number into the URL (pages are 1-indexed).
  url = "https://sandbox.oxylabs.io/products?page=#{i + 1}"
  response = HTTParty.get(url)
  document = Nokogiri::HTML4(response.body)
  all_game_containers = document.css('.product-card')

  all_game_containers.each do |container|
    title = container.css('h4').text.strip
    # Keep only digits and the decimal point, e.g. "$29.99" -> "29.99".
    price = container.css('.price-wrapper').text.delete('^0-9.')
    category_elements = container.css('.category span')
    categories = category_elements.map { |elem| elem.text.strip }.join(', ')
    games << [title, price, categories]
  end
end

Writing scraped data to a CSV file

require 'csv'
CSV.open(
  'games.csv',
  'w+',
  write_headers: true,
  headers: %w[Title Price Categories]
) do |csv|
  50.times do |i|
    response = HTTParty.get("https://sandbox.oxylabs.io/products?page=#{i + 1}")
    document = Nokogiri::HTML4(response.body)
    all_game_containers = document.css('.product-card')
    all_game_containers.each do |container|
      title = container.css('h4').text.strip
      price = container.css('.price-wrapper').text.delete('^0-9.')
      category_elements = container.css('.category span')
      categories = category_elements.map { |elem| elem.text.strip }.join(', ')
      csv << [title, price, categories]
    end
  end
end
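CSV.open with write_headers: true emits the header row before any data rows, and fields containing commas are quoted automatically. A minimal stdlib-only round trip (with hypothetical rows in place of scraped data) shows the resulting file layout:

```ruby
require 'csv'

# Hypothetical rows standing in for scraped data.
rows = [
  ['Sample Game A', '29.99', 'RPG'],
  ['Sample Game B', '9.99', 'Puzzle, Indie']
]

CSV.open('demo.csv', 'w+', write_headers: true,
         headers: %w[Title Price Categories]) do |csv|
  rows.each { |row| csv << row }
end

puts File.read('demo.csv')
# Title,Price,Categories
# Sample Game A,29.99,RPG
# Sample Game B,9.99,"Puzzle, Indie"
```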

Scraping dynamic pages

Required installation

gem install selenium-webdriver
gem install csv

Loading a dynamic website

require 'selenium-webdriver'

driver = Selenium::WebDriver.for(:chrome)
# Load a JavaScript-rendered page; quotes.toscrape.com/js renders its
# quotes client-side, so a plain HTTP fetch would return no quote markup.
driver.navigate.to('https://quotes.toscrape.com/js/')

Locating HTML elements via CSS selectors

quotes = []
quote_elements = driver.find_elements(css: '.quote')
quote_elements.each do |quote_el|
  quote_text = quote_el.find_element(css: '.text').attribute('textContent')
  author = quote_el.find_element(css: '.author').attribute('textContent')
  quotes << [quote_text, author]
end

Handling pagination

quotes = []
loop do
  quote_elements = driver.find_elements(css: '.quote')
  quote_elements.each do |quote_el|
    quote_text = quote_el.find_element(css: '.text').attribute('textContent')
    author = quote_el.find_element(css: '.author').attribute('textContent')
    quotes << [quote_text, author]
  end
  begin
    driver.find_element(css: '.next > a').click
  rescue Selenium::WebDriver::Error::NoSuchElementError
    break # No "Next" button on the last page
  end
end

Creating a CSV file

require 'csv'

CSV.open('quotes.csv', 'w+', write_headers: true,
         headers: %w[Quote Author]) do |csv|
  quotes.each do |quote|
    csv << quote
  end
end

If you wish to find out more about web scraping with Ruby, see our blog post.

About

A tutorial for web scraping with Ruby
