Validate ota crawler (#1187)
* starts implementation of engine

* working implementation of request to node server

* includes net::http implementation if preferred

* implementation of node server in app - not working

* disable Rack Attack

* seed DocumentTypes

* schema changes in dev DO NOT MERGE

* Document how to see document_types

* Add ota server

* image: pondersource/tosdr-ota:1.0

* ota image v1.1

* Use tosdr-ota:1.2

* Ref https://pptr.dev/guides/docker#usage, fix #1175

* Use pondersource/tosdr-ota:1.3

* manages ota env variables

* makes flash errors more robust

* implements ota crawler functionality, error handling for relevant methods

* updates views to account for new ota crawler

* removes outdated references to old crawlers

* removes ota engine dependency -- no longer needed!

* removes useless code

* removes yarn dependencies for ota engine

* clarifies standard error

* resolves conflict

* updates postgres to 14 to bring in line with production

* upgrades postgres in dev

* configures database for proper simulation of production env

* implements way to restore elasticsearch if documents missing for annotations

* adds last crawl date to documents

* adds es volumes

* refactor

* handles missing points after new crawl

* validates and converts selector

* implements way to restore annotations in es

* comments out useless code

---------

Co-authored-by: Michiel de Jong <[email protected]>
madoleary and michielbdejong authored Nov 5, 2024
1 parent 86b9f45 commit 0d43307
Showing 30 changed files with 667 additions and 395 deletions.
4 changes: 3 additions & 1 deletion Dockerfile
@@ -13,12 +13,14 @@ COPY . .
RUN apt-get update -qq && apt-get install -y build-essential libpq-dev postgresql postgresql-contrib openssl sudo && \
curl -sS http://dl.yarnpkg.com/debian/pubkey.gpg | apt-key add - && \
echo "deb http://dl.yarnpkg.com/debian/ stable main" | tee /etc/apt/sources.list.d/yarn.list && \
curl -sL https://deb.nodesource.com/setup_12.x | bash - && \
curl -sL https://deb.nodesource.com/setup_16.x | bash - && \
apt-get update -qq && apt-get install -y yarn nodejs && \
apt clean && \
rm -rf /var/lib/apt/lists/* && \
yarn

RUN apt-get update && apt-get install -y libnss3 libnspr4 libatk1.0-0 libatk-bridge2.0-0 libcups2 libdrm2 libxkbcommon0 libxcomposite1 libxdamage1 libxfixes3 libxrandr2 libgbm1 libasound2

RUN gem install bundler -v 2.4.14
COPY Gemfile Gemfile.lock ./

3 changes: 1 addition & 2 deletions Gemfile
@@ -31,6 +31,7 @@ gem 'jquery-rails'
gem 'kaminari'
gem 'kramdown'
gem 'mini_racer'
gem 'node-runner', '~> 1.1'
gem 'paper_trail'
gem 'pg', '~> 0.21'
gem 'puma', '>= 3.12.4'
@@ -82,5 +83,3 @@ group :development, :test do
gem 'stackprof'
gem 'webmock'
end

gem "node-runner", "~> 1.1"
4 changes: 4 additions & 0 deletions README.md
@@ -93,6 +93,10 @@ So,

5. Sign-in

6. To debug the db, try `docker exec -it db psql -U postgres`. Due to a bug in the seeds you will currently need to:
```
insert into document_types values (0, 'terms of service', now(), now(), null, 1, 'approved');
```
To **annotate** a service, navigate to the services page from the top-right menu, choose a service, and click `View Documents`. Begin by highlighting a piece of text from this page. **H and the Hypothesis client must be running.**

For a demonstration of how annotations work, feel free to [inspect the video attached to this PR](https://github.com/tosdr/edit.tosdr.org/pull/1116).
8 changes: 0 additions & 8 deletions app/assets/stylesheets/components/_alert.scss
@@ -4,12 +4,4 @@

.alert {
margin: -10px 0 10px;
text-align: center;
color: white;
}
.alert-info {
background: $green;
}
.alert-warning {
background: $red;
}
2 changes: 2 additions & 0 deletions app/controllers/application_controller.rb
@@ -9,6 +9,8 @@ class ApplicationController < ActionController::Base
before_action :configure_permitted_parameters, if: :devise_controller?
before_action :set_paper_trail_whodunnit

add_flash_types :info, :error, :warning

def configure_permitted_parameters
# For additional in app/views/devise/registrations/edit.html.erb
devise_parameter_sanitizer.permit(:account_update, keys: [:username])
223 changes: 102 additions & 121 deletions app/controllers/documents_controller.rb
@@ -10,26 +10,11 @@
class DocumentsController < ApplicationController
include Pundit::Authorization

PROD_CRAWLERS = {
"https://api.tosdr.org/crawl/v1": 'Random',
"https://api.tosdr.org/crawl/v1/eu": 'Europe (Recommended)',
"https://api.tosdr.org/crawl/v1/us": 'United States (Recommended)',
"https://api.tosdr.org/crawl/v1/eu-central": 'Europe (Central)',
"https://api.tosdr.org/crawl/v1/eu-west": 'Europe (West)',
"https://api.tosdr.org/crawl/v1/us-east": 'United States (East)',
"https://api.tosdr.org/crawl/v1/us-west": 'United States (West)'
}.freeze

DEV_CRAWLERS = {
"http://localhost:5000": 'Standalone (localhost:5000)',
"http://crawler:5000": 'Docker-Compose (crawler:5000)'
}.freeze

before_action :authenticate_user!, except: %i[index show]
before_action :set_document, only: %i[show edit update crawl restore_points]
before_action :set_services, only: %i[new edit create update]
before_action :set_document_names, only: %i[new edit create update]
before_action :set_crawlers, only: %i[new edit create update]
before_action :set_uri, only: %i[new edit create update crawl]

rescue_from Pundit::NotAuthorizedError, with: :user_not_authorized

@@ -55,18 +40,20 @@ def create
@document.user = current_user
@document.name = @document.document_type.name if @document.document_type

document_url = document_params[:url]
selector = document_params[:selector]

request = build_request(document_url, @uri, selector)
results = fetch_text(request, @uri, @document)

@document = results[:document]
message = results[:message]

if @document.save
crawl_result = perform_crawl

unless crawl_result.nil?
if crawl_result['error']
flash[:alert] = crawler_error_message(crawl_result)
else
flash[:notice] = 'The crawler has updated the document'
end
end
flash[:notice] = message
redirect_to document_path(@document)
else
flash.now[:warning] = message.html_safe if message
render :new
end
end
@@ -82,25 +69,27 @@ def update
@document.name = document_type.name unless @document.name == document_type.name
end

# we should probably only be running the crawler if the URL or XPath have changed
run_crawler = @document.saved_changes.keys.any? { |attribute| %w[url xpath crawler_server].include? attribute }
crawl_result = perform_crawl if run_crawler
# we should probably only run the crawler if the URL or CSS selector has changed
run_crawler = @document.saved_changes.keys.any? { |attribute| %w[url selector].include? attribute }
#### need to crawl regardless once we deploy
if run_crawler
request = build_request(@document.url, @uri, @document.selector)
results = fetch_text(request, @uri, @document)

@document = results[:document]
@document.last_crawl_date = Time.now.getutc
message = results[:message]
crawl_sucessful = results[:crawl_sucessful]
end

if @document.save
# only want to do this if XPath or URL have changed
## text is returned blank when there's a defunct URL or XPath
### avoids server error upon 404 error in the crawler
# need to alert people if the crawler wasn't able to retrieve any text...
unless crawl_result.nil?
if crawl_result['error']
flash[:alert] = crawler_error_message(crawl_result)
else
flash[:notice] = 'The crawler has updated the document'
end
end
if (crawl_sucessful || !run_crawler) && @document.save
message ||= 'Document updated'
flash[:notice] = message
redirect_to document_path(@document)
else
render 'edit', locals: { crawlers: PROD_CRAWLERS }
message ||= 'Document failed to update'
flash.now[:warning] = message
render 'edit'
end
end

@@ -122,17 +111,38 @@ def destroy
def show
authorize @document

@points = @document.points
@missing_points = @points.where(status: 'approved-not-found')
@last_crawled_at = @document.formatted_last_crawl_date
@name = @document.document_type ? @document.document_type.name : @document.name
end

def crawl
authorize @document
crawl_result = perform_crawl
if crawl_result['error']
flash[:alert] = crawler_error_message(crawl_result)

old_text = @document.text
request = build_request(@document.url, @uri, @document.selector)
results = fetch_text(request, @uri, @document)

@document = results[:document]
@document.last_crawl_date = Time.now.getutc
message = results[:message]
crawl_sucessful = results[:crawl_sucessful]

text_changed = old_text != @document.text

if crawl_sucessful && text_changed && @document.save
missing_points = analyze_points
missing_points_count = missing_points.length.to_s
message = "Crawl successful. Document text updated. There are #{missing_points_count} points missing from the new text."
flash[:notice] = message
elsif crawl_sucessful && !text_changed && @document.save
flash[:notice] = 'Crawl successful. Document text unchanged.'
else
flash[:notice] = 'The crawler has updated the document'
message ||= 'Crawl failed!'
flash[:warning] = message
end

redirect_to document_path(@document)
end

@@ -168,92 +178,63 @@ def set_document_names
@document_names = DocumentType.where(status: 'approved').order('name ASC')
end

def set_crawlers
@crawlers = Rails.env.development? ? DEV_CRAWLERS : PROD_CRAWLERS
end

def document_params
params.require(:document).permit(:service, :service_id, :user_id, :document_type_id, :name, :url, :xpath, :crawler_server)
params.require(:document).permit(:service, :service_id, :user_id, :document_type_id, :name, :url, :selector)
end

def crawler_error_message(result)
message = result['message']['name'].to_s
region = result['message']['crawler'].to_s
stacktrace = CGI::escapeHTML(result['message']['remoteStacktrace'].to_s)

`It seems that our crawler wasn't able to retrieve any text. <br><br>Reason: #{message} <br>Region: #{region} <br>Stacktrace: #{stacktrace}`
def set_uri
url = ENV['OTA_URL']
@uri = URI(url)
end

# to-do: refactor out comment assembly
def perform_crawl
authorize @document
@tbdoc = TOSBackDoc.new({
url: @document.url,
xpath: @document.xpath,
server: @document.crawler_server
})

@tbdoc.scrape
@document_comment = DocumentComment.new

error = @tbdoc.apiresponse['error']
if error
message_name = @tbdoc.apiresponse['message']['name'] || ''
crawler = @tbdoc.apiresponse['message']['crawler'] || ''
stacktrace = @tbdoc.apiresponse['message']['remoteStacktrace'] || ''
@document_comment.summary = '<span class="label label-danger">Attempted to Crawl Document</span><br>Error Message: <kbd>' + message_name + '</kbd><br>Crawler: <kbd>' + crawler + '</kbd><br>Stacktrace: <kbd>' + stacktrace + '</kbd>'
@document_comment.user_id = current_user.id
@document_comment.document_id = @document.id
end

document_blank = !@document.text.blank?
old_length = document_blank ? @document.text.length : 0
old_crc = document_blank ? Zlib.crc32(@document.text) : 0
new_crc = Zlib.crc32(@tbdoc.newdata)
changes_made = old_crc != new_crc

if changes_made
@document.update(text: @tbdoc.newdata)
new_length = @document.text ? @document.text.length : 'no text retrieved by crawler'

# There is a cron job in the crontab of the 'tosdr' user on the forum.tosdr.org
# server which runs once a day and before it deploys the site from edit.tosdr.org
# to tosdr.org, it will run the check_quotes script from
# https://github.com/tosdr/tosback-crawler/blob/225a74b/src/eto-admin.js#L121-L123
# So that if text has moved without changing, points are updated to the corrected
# quote_start, quote_end, and quote_text values where possible, and/or their status is
# switched between:
# pending <-> pending-not-found
# approved <-> approved-not-found
crawler = @tbdoc.apiresponse['message']['crawler'] || ''
@document_comment.summary = '<span class="label label-info">Document has been crawled</span><br><b>Old length:</b> <kbd>' + old_length.to_s + ' CRC ' + old_crc.to_s + '</kbd><br><b>New length:</b> <kbd>' + new_length.to_s + ' CRC ' + new_crc.to_s + '</kbd><br> Crawler: <kbd>' + crawler + '</kbd>'
@document_comment.user_id = current_user.id
@document_comment.document_id = @document.id
end
def build_request(document_url, uri, selector)
request = Net::HTTP::Post.new(uri)
params = { fetch: document_url, select: selector }.to_json # escapes any quotes in the URL or selector
request.body = params
request.content_type = 'application/json'
token = ENV['OTA_API_SECRET']
request['Authorization'] = "Bearer #{token}"

unless changes_made
@tbdoc.apiresponse['error'] = true
@tbdoc.apiresponse['message'] = {
'name' => 'The source document has not been updated. No changes made.',
'remoteStacktrace' => 'SourceDocument'
}
end
request
end

message_name = @tbdoc.apiresponse['message']['name'] || ''
crawler = @tbdoc.apiresponse['message']['crawler'] || ''
stacktrace = @tbdoc.apiresponse['message']['remoteStacktrace'] || ''
def analyze_points
@document.handle_missing_points
end

@document_comment.summary = '<span class="label label-danger">Attempted to Crawl Document</span><br>Error Message: <kbd>' + message_name + '</kbd><br>Crawler: <kbd>' + crawler + '</kbd><br>Stacktrace: <kbd>' + stacktrace + '</kbd>'
@document_comment.user_id = current_user.id
@document_comment.document_id = @document.id
def fetch_text(request, uri, document)
crawl_sucessful = false
begin
response_text = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
http.request(request)
end

if @document_comment.save
puts 'Comment added!'
else
puts 'Error adding comment!'
puts @document_comment.errors.full_messages
case response_text
when Net::HTTPSuccess
puts 'HTTP Success'
response_body = response_text.body
parsed_response_body = JSON.parse(response_body)
document.text = parsed_response_body
crawl_sucessful = true
message = 'Document created!'
else
Rails.logger.error("HTTP Error: #{response_text.code} - #{response_text.message}")
message = "HTTP Error: Could not retrieve document text. Contact <a href='mailto:[email protected]'>[email protected]</a>. Details: #{response_text.code} - #{response_text.message}"
end
rescue SocketError => e
# Handle network-related errors
Rails.logger.error("Network Error: #{e.message}")
message = "Network Error: Crawler unreachable. Could not retrieve document text. Contact <a href='mailto:[email protected]'>[email protected]</a>. Details: #{e.message}"
rescue Timeout::Error => e
# Handle timeout errors
Rails.logger.error("Timeout Error: #{e.message}")
message = "Timeout Error: Could not retrieve document text. Contact <a href='mailto:[email protected]'>[email protected]</a>. Details: #{e.message}"
rescue StandardError => e
# Handle any other standard errors
Rails.logger.error("Standard Error: #{e.message}")
message = "Standard Error: Could not retrieve document text. Is the crawler running? Contact <a href='mailto:[email protected]'>[email protected]</a>. Details: #{e.message}"
end

@tbdoc.apiresponse
{ document: document, message: message, crawl_sucessful: crawl_sucessful }
end
end
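For reviewers who want to poke at the new request flow outside Rails, here is a minimal standalone sketch of what `build_request` assembles. The endpoint URL, port, selector, and token below are invented placeholders (in the app they come from `ENV['OTA_URL']` and `ENV['OTA_API_SECRET']`), and the body is built with `to_json` so quotes in user-supplied values are escaped:

```ruby
require 'json'
require 'net/http'
require 'uri'

# Stand-in for DocumentsController#build_request; all values are placeholders.
def build_ota_request(uri, document_url, selector, token)
  request = Net::HTTP::Post.new(uri)
  # Serializing a Hash escapes any quotes inside the URL or selector.
  request.body = { fetch: document_url, select: selector }.to_json
  request.content_type = 'application/json'
  request['Authorization'] = "Bearer #{token}"
  request
end

uri = URI('http://localhost:7011/extract') # assumed local OTA server address
req = build_ota_request(uri, 'https://example.com/terms', 'main', 'secret-token')
puts req.body             # → {"fetch":"https://example.com/terms","select":"main"}
puts req['Authorization'] # → Bearer secret-token
```

Sending it would mirror `fetch_text`: `Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }`, with the rescue ladder ordered so `StandardError` comes last.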
11 changes: 8 additions & 3 deletions app/controllers/services_controller.rb
@@ -49,9 +49,13 @@ def create
def annotate
authorize Service

@service = Service.includes(documents: [:points, :user, :document_type]).find(params[:id] || params[:service_id])
@service = Service.includes(documents: %i[points user document_type]).find(params[:id] || params[:service_id])
@documents = @service.documents
@sourced_from_ota = @documents.where(ota_sourced: true).any?
@missing_points = @service.points.where(status: 'approved-not-found')
@missing_points_cases = []
@missing_points.each { |point| @missing_points_cases << point.case.title } if @missing_points.any?
@missing_points_cases = @missing_points_cases.length > 1 ? @missing_points_cases.join(', ') : @missing_points_cases.join('')
if params[:point_id] && current_user
@point = Point.find_by id: params[:point_id]
else
@@ -91,7 +95,7 @@ def quote
end

def show
@service = Service.includes(points: [:case, :user]).find(params[:id] || params[:service_id])
@service = Service.includes(points: %i[case user]).find(params[:id] || params[:service_id])

authorize @service

@@ -173,7 +177,8 @@ def build_quote(point)

def build_point(case_obj, service, current_user)
point = Point.new(
params.permit(:title, :source, :status, :analysis, :service_id, :query, :point_change, :case_id, :document, :quote_start, :quote_end, :quote_text)
params.permit(:title, :source, :status, :analysis, :service_id, :query, :point_change, :case_id, :document,
:quote_start, :quote_end, :quote_text)
)
point.user = current_user
point.case = case_obj
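One small note on the `annotate` change above: the length check before joining is harmless but redundant, since `Array#join` already returns the bare element for a one-item array and an empty string for an empty one. A quick illustration with invented case titles:

```ruby
# Invented case titles, purely illustrative.
titles = ['Tracks you across the web', 'Sells your personal data']

puts titles.join(', ')          # → Tracks you across the web, Sells your personal data
puts titles.first(1).join(', ') # → Tracks you across the web
puts [].join(', ').empty?       # → true
```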
8 changes: 4 additions & 4 deletions app/mailers/user_mailer.rb
@@ -4,11 +4,11 @@ class UserMailer < ApplicationMailer
#
# en.user_mailer.welcome.subject

def status_update(reason)
@user = reason.point.user
# def status_update(reason)
# @user = reason.point.user

mail(to: @user.email, subject: 'Status update from ToS;DR')
end
# mail(to: @user.email, subject: 'Status update from ToS;DR')
# end

def commented(author, point, commenter, commentText)
@authorName = author.username || 'user ' + author.id.to_s

