Skip to content

Commit

Permalink
Basic CSV validation (#401)
Browse files Browse the repository at this point in the history
* Basic CSV validation

* Check for required headers
* Check for missing required fields
* Ensure resource_type conforms to expected controlled vocabulary
* Generate tests and fixtures for various kinds of malformed CSVs

* Treat resource type field as required

* Remove unused file

* Remove unused files
  • Loading branch information
bess authored and little9 committed Aug 6, 2019
1 parent 2d836c3 commit 0d65e93
Show file tree
Hide file tree
Showing 17 changed files with 404 additions and 28 deletions.
4 changes: 4 additions & 0 deletions .rubocop.yml
Original file line number Diff line number Diff line change
Expand Up @@ -31,7 +31,11 @@ Metrics/ClassLength:
- 'app/controllers/catalog_controller.rb'
- 'app/actors/hyrax/actors/file_set_actor.rb'
- 'app/forms/hyrax/forms/collection_form.rb'
- app/uploaders/zizia/csv_manifest_validator.rb

Metrics/MethodLength:
Exclude:
- app/uploaders/zizia/csv_manifest_validator.rb

RSpec/ExampleLength:
Exclude:
Expand Down
7 changes: 1 addition & 6 deletions app/importers/curate_mapper.rb
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@ class CurateMapper < Zizia::HashMapper
attr_reader :row_number

CURATE_TERMS_MAP = {
title: "Title"
title: "Desc - Title"
}.freeze

DELIMITER = '|~|'
Expand All @@ -20,11 +20,6 @@ def self.allowed_headers
['Visibility']
end

# What columns must exist in the CSV
def self.required_headers
['Title']
end

def fields
# The fields common to all object types
common_fields = CURATE_TERMS_MAP.keys + [:visibility]
Expand Down
199 changes: 199 additions & 0 deletions app/uploaders/zizia/csv_manifest_validator.rb
Original file line number Diff line number Diff line change
@@ -0,0 +1,199 @@
# frozen_string_literal: true

# Validate a CSV file.
#
# Don't put expensive validations in this class.
# This is meant to be used for running a few quick
# validations before starting a CSV-based import.
# It will be called during the HTTP request/response,
# so long-running validations will make the page load
# slowly for the user. Any validations that are slow
# should be run in background jobs during the import
# instead of here.
module Zizia
class CsvManifestValidator
# @param manifest_uploader [CsvManifestUploader] The manifest that's mounted to a CsvImport record. See carrierwave gem documentation. This is basically a wrapper for the CSV file.
def initialize(manifest_uploader)
@csv_file = manifest_uploader.file
@errors = []
@warnings = []
end

# Errors and warnings for the CSV file.
attr_reader :errors, :warnings
attr_reader :csv_file

def validate
parse_csv
return unless @rows

missing_headers
duplicate_headers
unrecognized_headers
missing_values
# invalid_license # Not yet implemented for Emory
invalid_resource_type
# invalid_rights_statement # Not yet implemented for Emory
end

# One record per row
def record_count
return nil unless @rows
@rows.size - 1 # Don't include the header row
end

def delimiter
@delimiter ||= default_delimiter
end
attr_writer :delimiter

private

def default_delimiter
Zizia::HyraxBasicMetadataMapper.new.delimiter
end

def valid_headers
[
'Item ID', 'Desc - Abstract',
'Desc - Administrative Unit',
'Desc - Call Number/MSS Number',
'Desc - Collection Name',
'Desc - Contact Information',
'Desc - Creator',
'Desc - Date Created',
'Desc - Date Published',
'Desc - Genre - AAT',
'Desc - Genre - Legacy Data',
'Desc - Holding Repository',
'Desc - Institution',
'Desc - Language - Primary',
'Desc - Language - Secondary',
'Desc - Legacy Identifier',
'Desc - Notes',
'Desc - PID',
'Desc - Place of Publication',
'Desc - Publisher',
'Desc - Rights Statement',
'Desc - RightsStatement.org Designation (Label)',
'Desc - RightsStatement.org Designation (URI)',
'Desc - Subject - Corporate Name - LCNAF',
'Desc - Subject - Geographic - LCSH',
'Desc - Subject - Keywords',
'Desc - Subject - Meeting Name - LCNAF',
'Desc - Subject - Personal Name - LCNAF',
'Desc - Subject - Topic - LCSH',
'Desc - Subject - Uniform Title - LCNAF',
'Desc - Table of Contents (Books)',
'Desc - Title',
'Desc - Type of Resource',
'Digital Object - Data Classification',
'Digital Object - Structure',
'Digital Object - Viewer Settings',
'Digital Object - Visibility',
'Filename'
]
end

def parse_csv
@rows = CSV.read(csv_file.path)
@headers = @rows.first || []
@transformed_headers = @headers.map { |header| header.downcase.strip }
rescue
@errors << 'We are unable to read this CSV file.'
end

def missing_headers
required_headers.each do |header|
next if @transformed_headers.include?(header.downcase.strip)
@errors << "Missing required column: #{header.titleize.squeeze(' ')}. Your spreadsheet must have this column. If you already have this column, please check the spelling and capitalization."
end
end

def required_headers
['Desc - Title', 'Filename', "Desc - Type of Resource"]
end

def duplicate_headers
duplicates = []
sorted_headers = @transformed_headers.sort
sorted_headers.each_with_index do |x, i|
duplicates << x if x == sorted_headers[i + 1]
end
duplicates.uniq.each do |header|
@errors << "Duplicate column names: You can have only one \"#{header.titleize}\" column."
end
end

# Warn the user if we find any unexpected headers.
def unrecognized_headers
normalized_valid_headers = valid_headers.map { |a| a.downcase.strip }
extra_headers = @transformed_headers - normalized_valid_headers
extra_headers.each do |header|
@warnings << "The field name \"#{header}\" is not supported. This field will be ignored, and the metadata for this field will not be imported."
end
end

def transformed_required_headers
required_headers.map { |a| a.downcase.strip.squeeze(' ') }
end

def missing_values
column_numbers = transformed_required_headers.map { |header| @transformed_headers.find_index(header) }.compact
@rows.each_with_index do |row, i|
column_numbers.each_with_index do |column_number, j|
next unless row[column_number].blank?
@errors << "Missing required metadata in row #{i + 1}: \"#{required_headers[j].titleize}\" field cannot be blank"
end
end
end

# Only allow valid license values expected by Hyrax.
# Otherwise the app throws an error when it displays the work.
def invalid_license
validate_values('license', :valid_licenses)
end

def invalid_resource_type
validate_values('Desc - Type of Resource', :valid_resource_types)
end

def invalid_rights_statement
validate_values('rights statement', :valid_rights_statements)
end

def valid_licenses
@valid_license_ids ||= Hyrax::LicenseService.new.authority.all.select { |license| license[:active] }.map { |license| license[:id] }
end

# @return Array containing all valid URIs and all valid labels
def valid_resource_types
@valid_resource_type_ids ||= Qa::Authorities::Local.subauthority_for('resource_types').all.select { |term| term[:active] }.map { |term| term[:id] }
@valid_resource_type_labels ||= Qa::Authorities::Local.subauthority_for('resource_types').all.select { |term| term[:active] }.map { |term| term[:label] }
@valid_resource_type_ids + @valid_resource_type_labels + @valid_resource_type_labels.map { |a| a.downcase.strip.squeeze(' ') }
end

def valid_rights_statements
@valid_rights_statement_ids ||= Qa::Authorities::Local.subauthority_for('rights_statements').all.select { |term| term[:active] }.map { |term| term[:id] }
end

# Make sure this column contains only valid values
def validate_values(header_name, valid_values_method)
column_number = @transformed_headers.find_index(header_name.downcase.strip.squeeze(' '))
return unless column_number

@rows.each_with_index do |row, i|
next if i.zero? # Skip the header row
next unless row[column_number]

values = row[column_number].split(delimiter)
valid_values = method(valid_values_method).call
invalid_values = values.select { |value| !valid_values.include?(value) }

invalid_values.each do |value|
@errors << "Invalid #{header_name.titleize} in row #{i + 1}: #{value}"
end
end
end
end
end
2 changes: 0 additions & 2 deletions config/initializers/prepends.rb
Original file line number Diff line number Diff line change
@@ -1,5 +1,3 @@
require_relative '../prepends/hyrax_record_importer_prepends'
require_relative '../prepends/csv_manifest_validator_prepends'

Zizia::HyraxRecordImporter.prepend(HyraxRecordImporterPrepends)
Zizia::CsvManifestValidator.prepend(CsvManifestValidatorPrepends)
9 changes: 0 additions & 9 deletions config/prepends/csv_manifest_validator_prepends.rb

This file was deleted.

Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
Item ID,Desc - Abstract,Desc - Administrative Unit,Desc - Call Number/MSS Number,Desc - Call Number/MSS Number,Desc - Collection Name,Desc - Contact Information,Desc - Creator,Desc - Date Created,Desc - Date Published,Desc - Genre - AAT,Desc - Genre - Legacy Data,Desc - Holding Repository,Desc - Institution,Desc - Language - Primary,Desc - Language - Secondary,Desc - Legacy Identifier,Desc - Notes,Desc - PID,Desc - Place of Publication,Desc - Publisher,Desc - Rights Statement,Desc - RightsStatement.org Designation (Label),Desc - RightsStatement.org Designation (URI),Desc - Subject - Corporate Name - LCNAF,Desc - Subject - Geographic - LCSH,Desc - Subject - Keywords,Desc - Subject - Meeting Name - LCNAF,Desc - Subject - Personal Name - LCNAF,Desc - Subject - Topic - LCSH,Desc - Subject - Uniform Title - LCNAF,Desc - Table of Contents (Books),Desc - Title,Desc - Type of Resource,Digital Object - Data Classification,Digital Object - Structure,Digital Object - Viewer Settings,Digital Object - Visibility,Filename
156758,"Recto: Faculty and Grauates, University of West Tennessee, 1921, Memphis, Tenn., Hooks Bros. Photo., Nahm Studio, Pueblo, Medicine, E.C. Outten, Panama, P.A. Tirador, P.I., E.P. Henry, Okla., D.W. Briggs, A.B., Ark., J.S. Cobb, M.D., Fla., Nurse Training, E. Cheatham, R.N., C.K. Cribb, R.N., G.M. Moore, Tenn., B.T. Anderson, Miss., L.V. Albudy, Tex., F.L. Avery, Okla., Dentistry, M.R. Ransom, Mo., S.P. Robertson, M.D., S.C., E.L. Hairston, Va., W.H. Young, Tex., B.F. McCleave, M.D., S.C., R.L. Flagg, A.B., M.D., Miss., B.F. McCleave, M.D., T.E. Cox, M.D., F.A. Moore, M.D., B.D. Harrell, M.D., R.L. Flagg, M.D., J.W. Beckette, M.D., N.M. Watson, M.D., O.W. Hooge, M.D., F.W. Thurman, D.D.S., W. Waters, D.D.S., W.E. Cloud, D.D.S., U.S. Walton, D.D.S., O.B. Braithwhite, Dean Dental College, J.C. Hairston, Dean - M.D., C.A. Terrell, M.D., Dean Surgery, B.S. Lynk, Ph.C, Dean - Ph.C., M.V. Lynk, M.S., M.D., LL.G., President","Stuart A. Rose Manuscript, Archives and Rare Book Library",MSS 1218,MSS 1218,Robert Langmuir African American photograph collection,,Hooks Brothers.,1921,,card photographs (photographs),,"Stuart A. Rose Manuscript, Archives and Rare Book Library",Emory University,,,MSS1218_B011_I026,,,,,"Emory University does not control copyright for this image.¬† This image is made available for individual viewing and reference for educational purposes only such as personal study, preparation for teaching, and research.¬† Your reproduction, distribution, public display or other re-use of any content beyond a fair use as codified in section 107 of US Copyright Law is at your own risk.¬† We are always interested in learning more about our collections.¬† If you have information regarding this photograph, please contact [email protected].",No Copyright - United States,http://rightsstatements.org/vocab/NoC-US/1.0/,"University of Tennessee, Memphis.",Memphis (Tenn.),"Education: Colleges, universities, and technical colleges",,"Robertson, S. P., M.D.|Harrell, B. D., M.D.|Cox, T. E., M.D.|Cheatham, E.|Young, W. H.|Terrell, C. A., M.D., Dean of Surgery.|Cloud, W. E., D.D.S.|Braithwhite, O. B., Dean of Dental College.|Cobb, J.S., M.D.|Waters, W., D.D.S.|Hooge, O. W., M.D.|Watson, N. M., M.D.|Walton, U. S., D.D.S.|Thurman, F. W., D.D.S.|Briggs, D. W.|Albudy, L. V.|McCleave, B. F., M.D.|Moore, G. M.|Henry, E. P.|Hairston, E. L.|Ransom, M. R.|Lynk, B. S., Ph.C, Dean of Ph.C.|Avery, F. L.|Flagg, R. L., M.D.|Moore, F. A., M.D.|""Outten, E. C.|Hairston, J. C., Dean of M.D.|Tirador, P. A.|Lynk, M. V., M.S., M.D., LL.G., President.""|Anderson, B. T.|Beckette, J. W., M.D.|Cribb, C. K.",College graduates.|African American photographers.|Nursing.|Medicine.|Dentistry.|Universities and colleges--Faculty.,,,"Faculty and graduates of University of West Tennessee, Memphis, Tenn. in medicine, dentistry and nurse training in 1921",still image,Confidential,Standalone,,Public,MSS1218_B011_I026_P0001_ARCH.tif
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
Item ID,another_header_1,another_header_2,Desc - Abstract,Desc - Administrative Unit,Desc - Call Number/MSS Number,Desc - Collection Name,Desc - Contact Information,Desc - Creator,Desc - Date Created,Desc - Date Published,Desc - Genre - AAT,Desc - Genre - Legacy Data,Desc - Holding Repository,Desc - Institution,Desc - Language - Primary,Desc - Language - Secondary,Desc - Legacy Identifier,Desc - Notes,Desc - PID,Desc - Place of Publication,Desc - Publisher,Desc - Rights Statement,Desc - RightsStatement.org Designation (Label),Desc - RightsStatement.org Designation (URI),Desc - Subject - Corporate Name - LCNAF,Desc - Subject - Geographic - LCSH,Desc - Subject - Keywords,Desc - Subject - Meeting Name - LCNAF,Desc - Subject - Personal Name - LCNAF,Desc - Subject - Topic - LCSH,Desc - Subject - Uniform Title - LCNAF,Desc - Table of Contents (Books),Desc - Title,Desc - Type of Resource,Digital Object - Data Classification,Digital Object - Structure,Digital Object - Viewer Settings,Digital Object - Visibility,Filename
156758,,,"Recto: Faculty and Grauates, University of West Tennessee, 1921, Memphis, Tenn., Hooks Bros. Photo., Nahm Studio, Pueblo, Medicine, E.C. Outten, Panama, P.A. Tirador, P.I., E.P. Henry, Okla., D.W. Briggs, A.B., Ark., J.S. Cobb, M.D., Fla., Nurse Training, E. Cheatham, R.N., C.K. Cribb, R.N., G.M. Moore, Tenn., B.T. Anderson, Miss., L.V. Albudy, Tex., F.L. Avery, Okla., Dentistry, M.R. Ransom, Mo., S.P. Robertson, M.D., S.C., E.L. Hairston, Va., W.H. Young, Tex., B.F. McCleave, M.D., S.C., R.L. Flagg, A.B., M.D., Miss., B.F. McCleave, M.D., T.E. Cox, M.D., F.A. Moore, M.D., B.D. Harrell, M.D., R.L. Flagg, M.D., J.W. Beckette, M.D., N.M. Watson, M.D., O.W. Hooge, M.D., F.W. Thurman, D.D.S., W. Waters, D.D.S., W.E. Cloud, D.D.S., U.S. Walton, D.D.S., O.B. Braithwhite, Dean Dental College, J.C. Hairston, Dean - M.D., C.A. Terrell, M.D., Dean Surgery, B.S. Lynk, Ph.C, Dean - Ph.C., M.V. Lynk, M.S., M.D., LL.G., President","Stuart A. Rose Manuscript, Archives and Rare Book Library",MSS 1218,Robert Langmuir African American photograph collection,,Hooks Brothers.,1921,,card photographs (photographs),,"Stuart A. Rose Manuscript, Archives and Rare Book Library",Emory University,,,MSS1218_B011_I026,,,,,"Emory University does not control copyright for this image.¬† This image is made available for individual viewing and reference for educational purposes only such as personal study, preparation for teaching, and research.¬† Your reproduction, distribution, public display or other re-use of any content beyond a fair use as codified in section 107 of US Copyright Law is at your own risk.¬† We are always interested in learning more about our collections.¬† If you have information regarding this photograph, please contact [email protected].",No Copyright - United States,http://rightsstatements.org/vocab/NoC-US/1.0/,"University of Tennessee, Memphis.",Memphis (Tenn.),"Education: Colleges, universities, and technical colleges",,"Robertson, S. P., M.D.|Harrell, B. D., M.D.|Cox, T. E., M.D.|Cheatham, E.|Young, W. H.|Terrell, C. A., M.D., Dean of Surgery.|Cloud, W. E., D.D.S.|Braithwhite, O. B., Dean of Dental College.|Cobb, J.S., M.D.|Waters, W., D.D.S.|Hooge, O. W., M.D.|Watson, N. M., M.D.|Walton, U. S., D.D.S.|Thurman, F. W., D.D.S.|Briggs, D. W.|Albudy, L. V.|McCleave, B. F., M.D.|Moore, G. M.|Henry, E. P.|Hairston, E. L.|Ransom, M. R.|Lynk, B. S., Ph.C, Dean of Ph.C.|Avery, F. L.|Flagg, R. L., M.D.|Moore, F. A., M.D.|""Outten, E. C.|Hairston, J. C., Dean of M.D.|Tirador, P. A.|Lynk, M. V., M.S., M.D., LL.G., President.""|Anderson, B. T.|Beckette, J. W., M.D.|Cribb, C. K.",College graduates.|African American photographers.|Nursing.|Medicine.|Dentistry.|Universities and colleges--Faculty.,,,"Faculty and graduates of University of West Tennessee, Memphis, Tenn. in medicine, dentistry and nurse training in 1921",still image,Confidential,Standalone,,Public,MSS1218_B011_I026_P0001_ARCH.tif
Loading

0 comments on commit 0d65e93

Please sign in to comment.