Validate ota crawler (#1187)
* starts implementation of engine

* working implementation of request to node server

* includes net::http implementation if preferred

* implementation of node server in app - not working

* disable Rack Attack

* seed DocumentTypes

* schema changes in dev DO NOT MERGE

* Document how to see document_types

* Add ota server

* image: pondersource/tosdr-ota:1.0

* ota image v1.1

* Use tosdr-ota:1.2

* Ref https://pptr.dev/guides/docker#usage, fix #1175

* Use pondersource/tosdr-ota:1.3

* manages ota env variables

* makes flash errors more robust

* implements ota crawler functionality, error handling for relevant methods

* updates views to account for new ota crawler

* removes outdated references to old crawlers

* removes ota engine dependency -- no longer needed!

* removes useless code

* removes yarn dependencies for ota engine

* clarifies standard error

* resolves conflict

* updates postgres to 14 to bring in line with production

* upgrades postgres in dev

* configures database for proper simulation of production env

* implements way to restore elasticsearch if documents missing for annotations

* adds last crawl date to documents

* adds es volumes

* refactor

* handles missing points after new crawl

* validates and converts selector

* implements way to restore annotations in es

* comments out useless code

---------

Co-authored-by: Michiel de Jong <[email protected]>
madoleary and michielbdejong authored Nov 5, 2024
1 parent 86b9f45 commit 0d43307
Showing 30 changed files with 667 additions and 395 deletions.
4 changes: 3 additions & 1 deletion Dockerfile
@@ -13,12 +13,14 @@ COPY . .
RUN apt-get update -qq && apt-get install -y build-essential libpq-dev postgresql postgresql-contrib openssl sudo && \
curl -sS http://dl.yarnpkg.com/debian/pubkey.gpg | apt-key add - && \
echo "deb http://dl.yarnpkg.com/debian/ stable main" | tee /etc/apt/sources.list.d/yarn.list && \
curl -sL https://deb.nodesource.com/setup_12.x | bash - && \
curl -sL https://deb.nodesource.com/setup_16.x | bash - && \
apt-get update -qq && apt-get install -y yarn nodejs && \
apt clean && \
rm -rf /var/lib/apt/lists/* && \
yarn

RUN apt-get update && apt-get install -y libnss3 libnspr4 libatk1.0-0 libatk-bridge2.0-0 libcups2 libdrm2 libxkbcommon0 libxcomposite1 libxdamage1 libxfixes3 libxrandr2 libgbm1 libasound2

RUN gem install bundler -v 2.4.14
COPY Gemfile Gemfile.lock ./

3 changes: 1 addition & 2 deletions Gemfile
@@ -31,6 +31,7 @@ gem 'jquery-rails'
gem 'kaminari'
gem 'kramdown'
gem 'mini_racer'
gem 'node-runner', '~> 1.1'
gem 'paper_trail'
gem 'pg', '~> 0.21'
gem 'puma', '>= 3.12.4'
@@ -82,5 +83,3 @@ group :development, :test do
gem 'stackprof'
gem 'webmock'
end

gem "node-runner", "~> 1.1"
4 changes: 4 additions & 0 deletions README.md
@@ -93,6 +93,10 @@ So,

5. Sign-in

6. To debug the db, try `docker exec -it db psql -U postgres`. Due to a bug in the seeds you will currently need to:
```
insert into document_types values (0, 'terms of service', now(), now(), null, 1, 'approved');
```
To **annotate** a service, navigate to the services page from the top-right menu, choose a service, and click `View Documents`. Begin by highlighting a piece of text from this page. **H and the Hypothesis client must be running.**

For a demonstration of how annotations work, feel free to [inspect the video attached to this PR](https://github.com/tosdr/edit.tosdr.org/pull/1116).
8 changes: 0 additions & 8 deletions app/assets/stylesheets/components/_alert.scss
@@ -4,12 +4,4 @@

.alert {
margin: -10px 0 10px;
text-align: center;
color: white;
}
.alert-info {
background: $green;
}
.alert-warning {
background: $red;
}
2 changes: 2 additions & 0 deletions app/controllers/application_controller.rb
@@ -9,6 +9,8 @@ class ApplicationController < ActionController::Base
before_action :configure_permitted_parameters, if: :devise_controller?
before_action :set_paper_trail_whodunnit

add_flash_types :info, :error, :warning

def configure_permitted_parameters
# For additional in app/views/devise/registrations/edit.html.erb
devise_parameter_sanitizer.permit(:account_update, keys: [:username])
223 changes: 102 additions & 121 deletions app/controllers/documents_controller.rb
@@ -10,26 +10,11 @@
class DocumentsController < ApplicationController
include Pundit::Authorization

PROD_CRAWLERS = {
"https://api.tosdr.org/crawl/v1": 'Random',
"https://api.tosdr.org/crawl/v1/eu": 'Europe (Recommended)',
"https://api.tosdr.org/crawl/v1/us": 'United States (Recommended)',
"https://api.tosdr.org/crawl/v1/eu-central": 'Europe (Central)',
"https://api.tosdr.org/crawl/v1/eu-west": 'Europe (West)',
"https://api.tosdr.org/crawl/v1/us-east": 'United States (East)',
"https://api.tosdr.org/crawl/v1/us-west": 'United States (West)'
}.freeze

DEV_CRAWLERS = {
"http://localhost:5000": 'Standalone (localhost:5000)',
"http://crawler:5000": 'Docker-Compose (crawler:5000)'
}.freeze

before_action :authenticate_user!, except: %i[index show]
before_action :set_document, only: %i[show edit update crawl restore_points]
before_action :set_services, only: %i[new edit create update]
before_action :set_document_names, only: %i[new edit create update]
before_action :set_crawlers, only: %i[new edit create update]
before_action :set_uri, only: %i[new edit create update crawl]

rescue_from Pundit::NotAuthorizedError, with: :user_not_authorized

@@ -55,18 +40,20 @@ def create
@document.user = current_user
@document.name = @document.document_type.name if @document.document_type

document_url = document_params[:url]
selector = document_params[:selector]

request = build_request(document_url, @uri, selector)
results = fetch_text(request, @uri, @document)

@document = results[:document]
message = results[:message]

if @document.save
crawl_result = perform_crawl

unless crawl_result.nil?
if crawl_result['error']
flash[:alert] = crawler_error_message(crawl_result)
else
flash[:notice] = 'The crawler has updated the document'
end
end
flash[:notice] = message
redirect_to document_path(@document)
else
flash.now[:warning] = message.html_safe if message
render :new
end
end
@@ -82,25 +69,27 @@ def update
@document.name = document_type.name unless @document.name == document_type.name
end

# we should probably only be running the crawler if the URL or XPath have changed
run_crawler = @document.saved_changes.keys.any? { |attribute| %w[url xpath crawler_server].include? attribute }
crawl_result = perform_crawl if run_crawler
# we should probably only run the crawler if the URL or CSS selector has changed
run_crawler = @document.saved_changes.keys.any? { |attribute| %w[url selector].include? attribute }
#### need to crawl regardless once we deploy
if run_crawler
request = build_request(@document.url, @uri, @document.selector)
results = fetch_text(request, @uri, @document)

@document = results[:document]
@document.last_crawl_date = Time.now.getutc
message = results[:message]
crawl_sucessful = results[:crawl_sucessful]
end

if @document.save
# only want to do this if XPath or URL have changed
## text is returned blank when there's a defunct URL or XPath
### avoids server error upon 404 error in the crawler
# need to alert people if the crawler wasn't able to retrieve any text...
unless crawl_result.nil?
if crawl_result['error']
flash[:alert] = crawler_error_message(crawl_result)
else
flash[:notice] = 'The crawler has updated the document'
end
end
if (crawl_sucessful || !run_crawler) && @document.save
message ||= 'Document updated'
flash[:notice] = message
redirect_to document_path(@document)
else
render 'edit', locals: { crawlers: PROD_CRAWLERS }
message ||= 'Document failed to update'
flash.now[:warning] = message
render 'edit'
end
end

@@ -122,17 +111,38 @@ def destroy
def show
authorize @document

@points = @document.points
@missing_points = @points.where(status: 'approved-not-found')
@last_crawled_at = @document.formatted_last_crawl_date
@name = @document.document_type ? @document.document_type.name : @document.name
end

def crawl
authorize @document
crawl_result = perform_crawl
if crawl_result['error']
flash[:alert] = crawler_error_message(crawl_result)

old_text = @document.text
request = build_request(@document.url, @uri, @document.selector)
results = fetch_text(request, @uri, @document)

@document = results[:document]
@document.last_crawl_date = Time.now.getutc
message = results[:message]
crawl_sucessful = results[:crawl_sucessful]

text_changed = old_text != @document.text

if crawl_sucessful && text_changed && @document.save
missing_points = analyze_points
missing_points_count = missing_points.length.to_s
message = "Crawl successful. Document text updated. There are #{missing_points_count} points missing from the new text."
flash[:notice] = message
elsif crawl_sucessful && !text_changed && @document.save
flash[:notice] = 'Crawl successful. Document text unchanged.'
else
flash[:notice] = 'The crawler has updated the document'
message ||= 'Crawl failed!'
flash[:warning] = message
end

redirect_to document_path(@document)
end

@@ -168,92 +178,63 @@ def set_document_names
@document_names = DocumentType.where(status: 'approved').order('name ASC')
end

def set_crawlers
@crawlers = Rails.env.development? ? DEV_CRAWLERS : PROD_CRAWLERS
end

def document_params
params.require(:document).permit(:service, :service_id, :user_id, :document_type_id, :name, :url, :xpath, :crawler_server)
params.require(:document).permit(:service, :service_id, :user_id, :document_type_id, :name, :url, :selector)
end

def crawler_error_message(result)
message = result['message']['name'].to_s
region = result['message']['crawler'].to_s
stacktrace = CGI::escapeHTML(result['message']['remoteStacktrace'].to_s)

`It seems that our crawler wasn't able to retrieve any text. <br><br>Reason: #{message} <br>Region: #{region} <br>Stacktrace: #{stacktrace}`
def set_uri
url = ENV['OTA_URL']
@uri = URI(url)
end

# to-do: refactor out comment assembly
def perform_crawl
authorize @document
@tbdoc = TOSBackDoc.new({
url: @document.url,
xpath: @document.xpath,
server: @document.crawler_server
})

@tbdoc.scrape
@document_comment = DocumentComment.new

error = @tbdoc.apiresponse['error']
if error
message_name = @tbdoc.apiresponse['message']['name'] || ''
crawler = @tbdoc.apiresponse['message']['crawler'] || ''
stacktrace = @tbdoc.apiresponse['message']['remoteStacktrace'] || ''
@document_comment.summary = '<span class="label label-danger">Attempted to Crawl Document</span><br>Error Message: <kbd>' + message_name + '</kbd><br>Crawler: <kbd>' + crawler + '</kbd><br>Stacktrace: <kbd>' + stacktrace + '</kbd>'
@document_comment.user_id = current_user.id
@document_comment.document_id = @document.id
end

document_blank = !@document.text.blank?
old_length = document_blank ? @document.text.length : 0
old_crc = document_blank ? Zlib.crc32(@document.text) : 0
new_crc = Zlib.crc32(@tbdoc.newdata)
changes_made = old_crc != new_crc

if changes_made
@document.update(text: @tbdoc.newdata)
new_length = @document.text ? @document.text.length : 'no text retrieved by crawler'

# There is a cron job in the crontab of the 'tosdr' user on the forum.tosdr.org
# server which runs once a day and before it deploys the site from edit.tosdr.org
# to tosdr.org, it will run the check_quotes script from
# https://github.com/tosdr/tosback-crawler/blob/225a74b/src/eto-admin.js#L121-L123
# So that if text has moved without changing, points are updated to the corrected
# quote_start, quote_end, and quote_text values where possible, and/or their status is
# switched between:
# pending <-> pending-not-found
# approved <-> approved-not-found
crawler = @tbdoc.apiresponse['message']['crawler'] || ''
@document_comment.summary = '<span class="label label-info">Document has been crawled</span><br><b>Old length:</b> <kbd>' + old_length.to_s + ' CRC ' + old_crc.to_s + '</kbd><br><b>New length:</b> <kbd>' + new_length.to_s + ' CRC ' + new_crc.to_s + '</kbd><br> Crawler: <kbd>' + crawler + '</kbd>'
@document_comment.user_id = current_user.id
@document_comment.document_id = @document.id
end
def build_request(document_url, uri, selector)
request = Net::HTTP::Post.new(uri)
params = { fetch: document_url, select: selector }.to_json # escapes any quotes in the URL or selector
request.body = params
request.content_type = 'application/json'
token = ENV['OTA_API_SECRET']
request['Authorization'] = "Bearer #{token}"

unless changes_made
@tbdoc.apiresponse['error'] = true
@tbdoc.apiresponse['message'] = {
'name' => 'The source document has not been updated. No changes made.',
'remoteStacktrace' => 'SourceDocument'
}
end
request
end

message_name = @tbdoc.apiresponse['message']['name'] || ''
crawler = @tbdoc.apiresponse['message']['crawler'] || ''
stacktrace = @tbdoc.apiresponse['message']['remoteStacktrace'] || ''
def analyze_points
@document.handle_missing_points
end

@document_comment.summary = '<span class="label label-danger">Attempted to Crawl Document</span><br>Error Message: <kbd>' + message_name + '</kbd><br>Crawler: <kbd>' + crawler + '</kbd><br>Stacktrace: <kbd>' + stacktrace + '</kbd>'
@document_comment.user_id = current_user.id
@document_comment.document_id = @document.id
def fetch_text(request, uri, document)
crawl_sucessful = false
begin
response_text = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
http.request(request)
end

if @document_comment.save
puts 'Comment added!'
else
puts 'Error adding comment!'
puts @document_comment.errors.full_messages
case response_text
when Net::HTTPSuccess
puts 'HTTP Success'
response_body = response_text.body
parsed_response_body = JSON.parse(response_body)
document.text = parsed_response_body
crawl_sucessful = true
message = 'Document created!'
else
Rails.logger.error("HTTP Error: #{response_text.code} - #{response_text.message}")
message = "HTTP Error: Could not retrieve document text. Contact <a href='mailto:[email protected]'>[email protected]</a>. Details: #{response_text.code} - #{response_text.message}"
end
rescue SocketError => e
# Handle network-related errors
Rails.logger.error("Network Error: #{e.message}")
message = "Network Error: Crawler unreachable. Could not retrieve document text. Contact <a href='mailto:[email protected]'>[email protected]</a>. Details: #{e.message}"
rescue Timeout::Error => e
# Handle timeout errors
Rails.logger.error("Timeout Error: #{e.message}")
message = "Timeout Error: Could not retrieve document text. Contact <a href='mailto:[email protected]'>[email protected]</a>. Details: #{e.message}"
rescue StandardError => e
# Handle any other standard errors
Rails.logger.error("Standard Error: #{e.message}")
message = "Standard Error: Could not retrieve document text. Is the crawler running? Contact <a href='mailto:[email protected]'>[email protected]</a>. Details: #{e.message}"
end

@tbdoc.apiresponse
{ document: document, message: message, crawl_sucessful: crawl_sucessful }
end
end
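For reviewers who want to poke at the new request flow outside Rails, here is a minimal standalone sketch of what `build_request` assembles. The endpoint URL, port, selector, and token below are invented placeholders (in the app they come from `ENV['OTA_URL']` and `ENV['OTA_API_SECRET']`), and the body is built with `to_json` so quotes in user-supplied values are escaped:

```ruby
require 'json'
require 'net/http'
require 'uri'

# Stand-in for DocumentsController#build_request; all values are placeholders.
def build_ota_request(uri, document_url, selector, token)
  request = Net::HTTP::Post.new(uri)
  # Serializing a Hash escapes any quotes inside the URL or selector.
  request.body = { fetch: document_url, select: selector }.to_json
  request.content_type = 'application/json'
  request['Authorization'] = "Bearer #{token}"
  request
end

uri = URI('http://localhost:7011/extract') # assumed local OTA server address
req = build_ota_request(uri, 'https://example.com/terms', 'main', 'secret-token')
puts req.body             # → {"fetch":"https://example.com/terms","select":"main"}
puts req['Authorization'] # → Bearer secret-token
```

Sending it would mirror `fetch_text`: `Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }`, with the rescue ladder ordered so `StandardError` comes last.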
11 changes: 8 additions & 3 deletions app/controllers/services_controller.rb
@@ -49,9 +49,13 @@ def create
def annotate
authorize Service

@service = Service.includes(documents: [:points, :user, :document_type]).find(params[:id] || params[:service_id])
@service = Service.includes(documents: %i[points user document_type]).find(params[:id] || params[:service_id])
@documents = @service.documents
@sourced_from_ota = @documents.where(ota_sourced: true).any?
@missing_points = @service.points.where(status: 'approved-not-found')
@missing_points_cases = []
@missing_points.each { |point| @missing_points_cases << point.case.title } if @missing_points.any?
@missing_points_cases = @missing_points_cases.length > 1 ? @missing_points_cases.join(', ') : @missing_points_cases.join('')
if params[:point_id] && current_user
@point = Point.find_by id: params[:point_id]
else
@@ -91,7 +95,7 @@ def quote
end

def show
@service = Service.includes(points: [:case, :user]).find(params[:id] || params[:service_id])
@service = Service.includes(points: %i[case user]).find(params[:id] || params[:service_id])

authorize @service

@@ -173,7 +177,8 @@ def build_quote(point)

def build_point(case_obj, service, current_user)
point = Point.new(
params.permit(:title, :source, :status, :analysis, :service_id, :query, :point_change, :case_id, :document, :quote_start, :quote_end, :quote_text)
params.permit(:title, :source, :status, :analysis, :service_id, :query, :point_change, :case_id, :document,
:quote_start, :quote_end, :quote_text)
)
point.user = current_user
point.case = case_obj
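One small note on the `annotate` change above: the length check before joining is harmless but redundant, since `Array#join` already returns the bare element for a one-item array and an empty string for an empty one. A quick illustration with invented case titles:

```ruby
# Invented case titles, purely illustrative.
titles = ['Tracks you across the web', 'Sells your personal data']

puts titles.join(', ')          # → Tracks you across the web, Sells your personal data
puts titles.first(1).join(', ') # → Tracks you across the web
puts [].join(', ').empty?       # → true
```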
8 changes: 4 additions & 4 deletions app/mailers/user_mailer.rb
@@ -4,11 +4,11 @@ class UserMailer < ApplicationMailer
#
# en.user_mailer.welcome.subject

def status_update(reason)
@user = reason.point.user
# def status_update(reason)
# @user = reason.point.user

mail(to: @user.email, subject: 'Status update from ToS;DR')
end
# mail(to: @user.email, subject: 'Status update from ToS;DR')
# end

def commented(author, point, commenter, commentText)
@authorName = author.username || 'user ' + author.id.to_s

