* starts implementation of engine
* working implementation of request to node server
* includes net::http implementation if preferred
* implementation of node server in app - not working
* disable Rack Attack
* seed DocumentTypes
* schema changes in dev DO NOT MERGE
* Document how to see document_types
* Add ota server
* image: pondersource/tosdr-ota:1.0
* ota image v1.1
* Use tosdr-ota:1.2
* Ref https://pptr.dev/guides/docker#usage, fix #1175
* Use pondersource/tosdr-ota:1.3
* manages ota env variables
* makes flash errors more robust
* implements ota crawler functionality, error handling for relevant methods
* updates views to account for new ota crawler
* removes outdated references to old crawlers
* removes ota engine dependency -- no longer needed!
* removes useless code
* removes yarn dependencies for ota engine
* clarifies standard error
* resolves conflict
* updates postgres to 14 to bring in line with production
* upgrades postgres in dev
* configures database for proper simulation of production env
* implements way to restore elasticsearch if documents missing for annotations
* adds last crawl date to documents
* adds es volumes
* refactor
* handles missing points after new crawl
* validates and converts selector
* implements way to restore annotations in es
* comments out useless code

Co-authored-by: Michiel de Jong <[email protected]>
Commit 0d43307 (1 parent: 86b9f45). Showing 30 changed files with 667 additions and 395 deletions.
@@ -10,26 +10,11 @@
 class DocumentsController < ApplicationController
   include Pundit::Authorization

-  PROD_CRAWLERS = {
-    "https://api.tosdr.org/crawl/v1": 'Random',
-    "https://api.tosdr.org/crawl/v1/eu": 'Europe (Recommended)',
-    "https://api.tosdr.org/crawl/v1/us": 'United States (Recommended)',
-    "https://api.tosdr.org/crawl/v1/eu-central": 'Europe (Central)',
-    "https://api.tosdr.org/crawl/v1/eu-west": 'Europe (West)',
-    "https://api.tosdr.org/crawl/v1/us-east": 'United States (East)',
-    "https://api.tosdr.org/crawl/v1/us-west": 'United States (West)'
-  }.freeze
-
-  DEV_CRAWLERS = {
-    "http://localhost:5000": 'Standalone (localhost:5000)',
-    "http://crawler:5000": 'Docker-Compose (crawler:5000)'
-  }.freeze
-
   before_action :authenticate_user!, except: %i[index show]
   before_action :set_document, only: %i[show edit update crawl restore_points]
   before_action :set_services, only: %i[new edit create update]
   before_action :set_document_names, only: %i[new edit create update]
-  before_action :set_crawlers, only: %i[new edit create update]
+  before_action :set_uri, only: %i[new edit create update crawl]

   rescue_from Pundit::NotAuthorizedError, with: :user_not_authorized
@@ -55,18 +40,20 @@ def create
     @document.user = current_user
     @document.name = @document.document_type.name if @document.document_type

+    document_url = document_params[:url]
+    selector = document_params[:selector]
+
+    request = build_request(document_url, @uri, selector)
+    results = fetch_text(request, @uri, @document)
+
+    @document = results[:document]
+    message = results[:message]
+
     if @document.save
-      crawl_result = perform_crawl
-
-      unless crawl_result.nil?
-        if crawl_result['error']
-          flash[:alert] = crawler_error_message(crawl_result)
-        else
-          flash[:notice] = 'The crawler has updated the document'
-        end
-      end
+      flash[:notice] = message
       redirect_to document_path(@document)
     else
+      flash.now[:warning] = message.html_safe if message
       render :new
     end
   end
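The new `create` flow posts `{"fetch": url, "select": selector}` with a bearer token to the OTA server via `build_request`. A self-contained sketch of that request; the `build_ota_request` name, example URL, and secret are made up, and unlike the diff's hand-concatenated JSON string this uses `JSON.generate`, which escapes any quotes inside the URL or selector:

```ruby
require 'net/http'
require 'json'
require 'uri'

# Build the POST the controller sends to the OTA crawler:
# JSON body with the page to fetch and the CSS selector, plus
# a bearer token taken from configuration.
def build_ota_request(uri, document_url, selector, token)
  request = Net::HTTP::Post.new(uri)
  request.body = JSON.generate(fetch: document_url, select: selector)
  request.content_type = 'application/json'
  request['Authorization'] = "Bearer #{token}"
  request
end
```

The request object can be inspected without any network access, which is also how it can be unit-tested.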
@@ -82,25 +69,27 @@ def update
       @document.name = document_type.name unless @document.name == document_type.name
     end

-    # we should probably only be running the crawler if the URL or XPath have changed
-    run_crawler = @document.saved_changes.keys.any? { |attribute| %w[url xpath crawler_server].include? attribute }
-    crawl_result = perform_crawl if run_crawler
+    # we should probably only be running the crawler if the URL or css selector have changed
+    run_crawler = @document.saved_changes.keys.any? { |attribute| %w[url selector].include? attribute }
+    #### need to crawl regardless once we deploy
+    if run_crawler
+      request = build_request(@document.url, @uri, @document.selector)
+      results = fetch_text(request, @uri, @document)
+
+      @document = results[:document]
+      @document.last_crawl_date = Time.now.getutc
+      message = results[:message]
+      crawl_sucessful = results[:crawl_sucessful]
+    end

-    if @document.save
-      # only want to do this if XPath or URL have changed
-      ## text is returned blank when there's a defunct URL or XPath
-      ### avoids server error upon 404 error in the crawler
-      # need to alert people if the crawler wasn't able to retrieve any text...
-      unless crawl_result.nil?
-        if crawl_result['error']
-          flash[:alert] = crawler_error_message(crawl_result)
-        else
-          flash[:notice] = 'The crawler has updated the document'
-        end
-      end
+    if (crawl_sucessful || !run_crawler) && @document.save
+      message ||= 'Document updated'
+      flash[:notice] = message
       redirect_to document_path(@document)
     else
-      render 'edit', locals: { crawlers: PROD_CRAWLERS }
+      message ||= 'Document failed to update'
+      flash.now[:warning] = message
+      render 'edit'
     end
   end
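The `update` action's re-crawl guard hinges on ActiveRecord's `saved_changes`, which maps each changed attribute to its `[old, new]` pair. The guard can be exercised outside Rails with a plain hash of that shape; the `run_crawler?` helper name is mine, and only `url` and `selector` as trigger attributes come from the diff:

```ruby
# Attributes whose change should trigger a fresh crawl.
CRAWL_TRIGGERS = %w[url selector].freeze

# Mirrors the diff's guard: re-crawl only when a trigger attribute
# appears among the saved changes ({ attribute => [old, new] }).
def run_crawler?(saved_changes)
  saved_changes.keys.any? { |attribute| CRAWL_TRIGGERS.include?(attribute) }
end
```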
@@ -122,17 +111,38 @@ def destroy
   def show
     authorize @document

+    @points = @document.points
+    @missing_points = @points.where(status: 'approved-not-found')
+    @last_crawled_at = @document.formatted_last_crawl_date
     @name = @document.document_type ? @document.document_type.name : @document.name
   end

   def crawl
     authorize @document
-    crawl_result = perform_crawl
-    if crawl_result['error']
-      flash[:alert] = crawler_error_message(crawl_result)
+
+    old_text = @document.text
+    request = build_request(@document.url, @uri, @document.selector)
+    results = fetch_text(request, @uri, @document)
+
+    @document = results[:document]
+    @document.last_crawl_date = Time.now.getutc
+    message = results[:message]
+    crawl_sucessful = results[:crawl_sucessful]
+
+    text_changed = old_text != @document.text
+
+    if crawl_sucessful && text_changed && @document.save
+      missing_points = analyze_points
+      missing_points_count = missing_points.length.to_s
+      message = "Crawl successful. Document text updated. There are #{missing_points_count} points missing from the new text."
+      flash[:notice] = message
+    elsif crawl_sucessful && !text_changed && @document.save
+      flash[:notice] = 'Crawl successful. Document text unchanged.'
     else
-      flash[:notice] = 'The crawler has updated the document'
+      message ||= 'Crawl failed!'
+      flash.now[:warning] = message
     end
+
     redirect_to document_path(@document)
   end
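After a successful crawl with changed text, `analyze_points` delegates to `handle_missing_points`, whose implementation is not shown in this diff. A plausible minimal version of the underlying check (an assumption, not the project's code): a point is "missing" when its quoted text no longer occurs in the newly crawled document text.

```ruby
# Return the points whose quote_text is absent from the new text.
# Points are modeled as plain hashes here for illustration.
def missing_points(points, new_text)
  points.reject { |point| new_text.include?(point[:quote_text].to_s) }
end
```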
@@ -168,92 +178,63 @@ def set_document_names
     @document_names = DocumentType.where(status: 'approved').order('name ASC')
   end

-  def set_crawlers
-    @crawlers = Rails.env.development? ? DEV_CRAWLERS : PROD_CRAWLERS
-  end
-
   def document_params
-    params.require(:document).permit(:service, :service_id, :user_id, :document_type_id, :name, :url, :xpath, :crawler_server)
+    params.require(:document).permit(:service, :service_id, :user_id, :document_type_id, :name, :url, :selector)
   end

-  def crawler_error_message(result)
-    message = result['message']['name'].to_s
-    region = result['message']['crawler'].to_s
-    stacktrace = CGI::escapeHTML(result['message']['remoteStacktrace'].to_s)
-
-    "It seems that our crawler wasn't able to retrieve any text. <br><br>Reason: #{message} <br>Region: #{region} <br>Stacktrace: #{stacktrace}"
-  end
+  def set_uri
+    url = ENV['OTA_URL']
+    @uri = URI(url)
+  end

-  # to-do: refactor out comment assembly
-  def perform_crawl
-    authorize @document
-    @tbdoc = TOSBackDoc.new({
-      url: @document.url,
-      xpath: @document.xpath,
-      server: @document.crawler_server
-    })
-
-    @tbdoc.scrape
-    @document_comment = DocumentComment.new
-
-    error = @tbdoc.apiresponse['error']
-    if error
-      message_name = @tbdoc.apiresponse['message']['name'] || ''
-      crawler = @tbdoc.apiresponse['message']['crawler'] || ''
-      stacktrace = @tbdoc.apiresponse['message']['remoteStacktrace'] || ''
-      @document_comment.summary = '<span class="label label-danger">Attempted to Crawl Document</span><br>Error Message: <kbd>' + message_name + '</kbd><br>Crawler: <kbd>' + crawler + '</kbd><br>Stacktrace: <kbd>' + stacktrace + '</kbd>'
-      @document_comment.user_id = current_user.id
-      @document_comment.document_id = @document.id
-    end
-
-    document_blank = !@document.text.blank?
-    old_length = document_blank ? @document.text.length : 0
-    old_crc = document_blank ? Zlib.crc32(@document.text) : 0
-    new_crc = Zlib.crc32(@tbdoc.newdata)
-    changes_made = old_crc != new_crc
-
-    if changes_made
-      @document.update(text: @tbdoc.newdata)
-      new_length = @document.text ? @document.text.length : 'no text retrieved by crawler'
-
-      # There is a cron job in the crontab of the 'tosdr' user on the forum.tosdr.org
-      # server which runs once a day and before it deploys the site from edit.tosdr.org
-      # to tosdr.org, it will run the check_quotes script from
-      # https://github.com/tosdr/tosback-crawler/blob/225a74b/src/eto-admin.js#L121-L123
-      # So that if text has moved without changing, points are updated to the corrected
-      # quote_start, quote_end, and quote_text values where possible, and/or their status is
-      # switched between:
-      #   pending <-> pending-not-found
-      #   approved <-> approved-not-found
-      crawler = @tbdoc.apiresponse['message']['crawler'] || ''
-      @document_comment.summary = '<span class="label label-info">Document has been crawled</span><br><b>Old length:</b> <kbd>' + old_length.to_s + ' CRC ' + old_crc.to_s + '</kbd><br><b>New length:</b> <kbd>' + new_length.to_s + ' CRC ' + new_crc.to_s + '</kbd><br> Crawler: <kbd>' + crawler + '</kbd>'
-      @document_comment.user_id = current_user.id
-      @document_comment.document_id = @document.id
-    end
-
-    unless changes_made
-      @tbdoc.apiresponse['error'] = true
-      @tbdoc.apiresponse['message'] = {
-        'name' => 'The source document has not been updated. No changes made.',
-        'remoteStacktrace' => 'SourceDocument'
-      }
-    end
-
-    message_name = @tbdoc.apiresponse['message']['name'] || ''
-    crawler = @tbdoc.apiresponse['message']['crawler'] || ''
-    stacktrace = @tbdoc.apiresponse['message']['remoteStacktrace'] || ''
-
-    @document_comment.summary = '<span class="label label-danger">Attempted to Crawl Document</span><br>Error Message: <kbd>' + message_name + '</kbd><br>Crawler: <kbd>' + crawler + '</kbd><br>Stacktrace: <kbd>' + stacktrace + '</kbd>'
-    @document_comment.user_id = current_user.id
-    @document_comment.document_id = @document.id
-
-    if @document_comment.save
-      puts 'Comment added!'
-    else
-      puts 'Error adding comment!'
-      puts @document_comment.errors.full_messages
-    end
-
-    @tbdoc.apiresponse
-  end
+  def build_request(document_url, uri, selector)
+    request = Net::HTTP::Post.new(uri)
+    params = '{"fetch": "' + document_url + '","select": "' + selector + '"}'
+    request.body = params
+    request.content_type = 'application/json'
+    token = ENV['OTA_API_SECRET']
+    request['Authorization'] = "Bearer #{token}"
+
+    request
+  end
+
+  def analyze_points
+    @document.handle_missing_points
+  end
+
+  def fetch_text(request, uri, document)
+    crawl_sucessful = false
+    begin
+      response_text = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
+        http.request(request)
+      end
+
+      case response_text
+      when Net::HTTPSuccess
+        puts 'HTTP Success'
+        response_body = response_text.body
+        parsed_response_body = JSON.parse(response_body)
+        document.text = parsed_response_body
+        crawl_sucessful = true
+        message = 'Document created!'
+      else
+        Rails.logger.error("HTTP Error: #{response_text.code} - #{response_text.message}")
+        message = "HTTP Error: Could not retrieve document text. Contact <a href='mailto:[email protected]'>[email protected]</a>. Details: #{response_text.code} - #{response_text.message}"
+      end
+    rescue SocketError => e
+      # Handle network-related errors
+      Rails.logger.error("Network Error: #{e.message}")
+      message = "Network Error: Crawler unreachable. Could not retrieve document text. Contact <a href='mailto:[email protected]'>[email protected]</a>. Details: #{e.message}"
+    rescue Timeout::Error => e
+      # Handle timeout errors
+      Rails.logger.error("Timeout Error: #{e.message}")
+      message = "Timeout Error: Could not retrieve document text. Contact <a href='mailto:[email protected]'>[email protected]</a>. Details: #{e.message}"
+    rescue StandardError => e
+      # Handle any other standard errors
+      Rails.logger.error("Standard Error: #{e.message}")
+      message = "Standard Error: Could not retrieve document text. Is the crawler running? Contact <a href='mailto:[email protected]'>[email protected]</a>. Details: #{e.message}"
+    end
+
+    { document: document, message: message, crawl_sucessful: crawl_sucessful }
+  end
 end
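`fetch_text` classifies failures with a rescue ladder: `SocketError`, then `Timeout::Error`, then `StandardError`. The same classification can be written as a pure function, which makes the ordering testable; the `crawl_error_label` name is mine, and only the three error categories come from the diff. Note that `SocketError` is itself a `StandardError`, so the more specific branches must come first, exactly as the rescue clauses are ordered:

```ruby
require 'socket'   # defines SocketError
require 'timeout'  # defines Timeout::Error

# Map an exception to the log label used by fetch_text's rescue clauses.
# Order matters: SocketError and Timeout::Error are StandardErrors,
# so the catch-all branch must come last.
def crawl_error_label(error)
  case error
  when SocketError    then 'Network Error'
  when Timeout::Error then 'Timeout Error'
  when StandardError  then 'Standard Error'
  end
end
```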