-
Notifications
You must be signed in to change notification settings - Fork 4
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
* Basic CSV validation * Check for required headers * Check for missing required fields * Ensure resource_type conforms to expected controlled vocabulary * Generate tests and fixtures for various kinds of malformed CSVs * Treat resource type field as required * Remove unused file * Remove unused files
- Loading branch information
Showing
17 changed files
with
404 additions
and
28 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,199 @@ | ||
# frozen_string_literal: true | ||
|
||
# Validate a CSV file. | ||
# | ||
# Don't put expensive validations in this class. | ||
# This is meant to be used for running a few quick | ||
# validations before starting a CSV-based import. | ||
# It will be called during the HTTP request/response, | ||
# so long-running validations will make the page load | ||
# slowly for the user. Any validations that are slow | ||
# should be run in background jobs during the import | ||
# instead of here. | ||
module Zizia | ||
class CsvManifestValidator | ||
# @param manifest_uploader [CsvManifestUploader] The manifest that's mounted to a CsvImport record. See carrierwave gem documentation. This is basically a wrapper for the CSV file. | ||
def initialize(manifest_uploader) | ||
@csv_file = manifest_uploader.file | ||
@errors = [] | ||
@warnings = [] | ||
end | ||
|
||
# Errors and warnings for the CSV file. | ||
attr_reader :errors, :warnings | ||
attr_reader :csv_file | ||
|
||
def validate | ||
parse_csv | ||
return unless @rows | ||
|
||
missing_headers | ||
duplicate_headers | ||
unrecognized_headers | ||
missing_values | ||
# invalid_license # Not yet implemented for Emory | ||
invalid_resource_type | ||
# invalid_rights_statement # Not yet implemented for Emory | ||
end | ||
|
||
# One record per row | ||
def record_count | ||
return nil unless @rows | ||
@rows.size - 1 # Don't include the header row | ||
end | ||
|
||
def delimiter | ||
@delimiter ||= default_delimiter | ||
end | ||
attr_writer :delimiter | ||
|
||
private | ||
|
||
def default_delimiter | ||
Zizia::HyraxBasicMetadataMapper.new.delimiter | ||
end | ||
|
||
def valid_headers | ||
[ | ||
'Item ID', 'Desc - Abstract', | ||
'Desc - Administrative Unit', | ||
'Desc - Call Number/MSS Number', | ||
'Desc - Collection Name', | ||
'Desc - Contact Information', | ||
'Desc - Creator', | ||
'Desc - Date Created', | ||
'Desc - Date Published', | ||
'Desc - Genre - AAT', | ||
'Desc - Genre - Legacy Data', | ||
'Desc - Holding Repository', | ||
'Desc - Institution', | ||
'Desc - Language - Primary', | ||
'Desc - Language - Secondary', | ||
'Desc - Legacy Identifier', | ||
'Desc - Notes', | ||
'Desc - PID', | ||
'Desc - Place of Publication', | ||
'Desc - Publisher', | ||
'Desc - Rights Statement', | ||
'Desc - RightsStatement.org Designation (Label)', | ||
'Desc - RightsStatement.org Designation (URI)', | ||
'Desc - Subject - Corporate Name - LCNAF', | ||
'Desc - Subject - Geographic - LCSH', | ||
'Desc - Subject - Keywords', | ||
'Desc - Subject - Meeting Name - LCNAF', | ||
'Desc - Subject - Personal Name - LCNAF', | ||
'Desc - Subject - Topic - LCSH', | ||
'Desc - Subject - Uniform Title - LCNAF', | ||
'Desc - Table of Contents (Books)', | ||
'Desc - Title', | ||
'Desc - Type of Resource', | ||
'Digital Object - Data Classification', | ||
'Digital Object - Structure', | ||
'Digital Object - Viewer Settings', | ||
'Digital Object - Visibility', | ||
'Filename' | ||
] | ||
end | ||
|
||
def parse_csv | ||
@rows = CSV.read(csv_file.path) | ||
@headers = @rows.first || [] | ||
@transformed_headers = @headers.map { |header| header.downcase.strip } | ||
rescue | ||
@errors << 'We are unable to read this CSV file.' | ||
end | ||
|
||
def missing_headers | ||
required_headers.each do |header| | ||
next if @transformed_headers.include?(header.downcase.strip) | ||
@errors << "Missing required column: #{header.titleize.squeeze(' ')}. Your spreadsheet must have this column. If you already have this column, please check the spelling and capitalization." | ||
end | ||
end | ||
|
||
def required_headers | ||
['Desc - Title', 'Filename', "Desc - Type of Resource"] | ||
end | ||
|
||
def duplicate_headers | ||
duplicates = [] | ||
sorted_headers = @transformed_headers.sort | ||
sorted_headers.each_with_index do |x, i| | ||
duplicates << x if x == sorted_headers[i + 1] | ||
end | ||
duplicates.uniq.each do |header| | ||
@errors << "Duplicate column names: You can have only one \"#{header.titleize}\" column." | ||
end | ||
end | ||
|
||
# Warn the user if we find any unexpected headers. | ||
def unrecognized_headers | ||
normalized_valid_headers = valid_headers.map { |a| a.downcase.strip } | ||
extra_headers = @transformed_headers - normalized_valid_headers | ||
extra_headers.each do |header| | ||
@warnings << "The field name \"#{header}\" is not supported. This field will be ignored, and the metadata for this field will not be imported." | ||
end | ||
end | ||
|
||
def transformed_required_headers | ||
required_headers.map { |a| a.downcase.strip.squeeze(' ') } | ||
end | ||
|
||
def missing_values | ||
column_numbers = transformed_required_headers.map { |header| @transformed_headers.find_index(header) }.compact | ||
@rows.each_with_index do |row, i| | ||
column_numbers.each_with_index do |column_number, j| | ||
next unless row[column_number].blank? | ||
@errors << "Missing required metadata in row #{i + 1}: \"#{required_headers[j].titleize}\" field cannot be blank" | ||
end | ||
end | ||
end | ||
|
||
# Only allow valid license values expected by Hyrax. | ||
# Otherwise the app throws an error when it displays the work. | ||
def invalid_license | ||
validate_values('license', :valid_licenses) | ||
end | ||
|
||
def invalid_resource_type | ||
validate_values('Desc - Type of Resource', :valid_resource_types) | ||
end | ||
|
||
def invalid_rights_statement | ||
validate_values('rights statement', :valid_rights_statements) | ||
end | ||
|
||
def valid_licenses | ||
@valid_license_ids ||= Hyrax::LicenseService.new.authority.all.select { |license| license[:active] }.map { |license| license[:id] } | ||
end | ||
|
||
# @return Array containing all valid URIs and all valid labels | ||
def valid_resource_types | ||
@valid_resource_type_ids ||= Qa::Authorities::Local.subauthority_for('resource_types').all.select { |term| term[:active] }.map { |term| term[:id] } | ||
@valid_resource_type_labels ||= Qa::Authorities::Local.subauthority_for('resource_types').all.select { |term| term[:active] }.map { |term| term[:label] } | ||
@valid_resource_type_ids + @valid_resource_type_labels + @valid_resource_type_labels.map { |a| a.downcase.strip.squeeze(' ') } | ||
end | ||
|
||
def valid_rights_statements | ||
@valid_rights_statement_ids ||= Qa::Authorities::Local.subauthority_for('rights_statements').all.select { |term| term[:active] }.map { |term| term[:id] } | ||
end | ||
|
||
# Make sure this column contains only valid values | ||
def validate_values(header_name, valid_values_method) | ||
column_number = @transformed_headers.find_index(header_name.downcase.strip.squeeze(' ')) | ||
return unless column_number | ||
|
||
@rows.each_with_index do |row, i| | ||
next if i.zero? # Skip the header row | ||
next unless row[column_number] | ||
|
||
values = row[column_number].split(delimiter) | ||
valid_values = method(valid_values_method).call | ||
invalid_values = values.select { |value| !valid_values.include?(value) } | ||
|
||
invalid_values.each do |value| | ||
@errors << "Invalid #{header_name.titleize} in row #{i + 1}: #{value}" | ||
end | ||
end | ||
end | ||
end | ||
end |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,5 +1,3 @@ | ||
require_relative '../prepends/hyrax_record_importer_prepends' | ||
require_relative '../prepends/csv_manifest_validator_prepends' | ||
|
||
Zizia::HyraxRecordImporter.prepend(HyraxRecordImporterPrepends) | ||
Zizia::CsvManifestValidator.prepend(CsvManifestValidatorPrepends) |
This file was deleted.
Oops, something went wrong.
2 changes: 2 additions & 0 deletions
2
spec/fixtures/csv_import/csv_files_with_problems/duplicate_headers.csv
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
Item ID,Desc - Abstract,Desc - Administrative Unit,Desc - Call Number/MSS Number,Desc - Call Number/MSS Number,Desc - Collection Name,Desc - Contact Information,Desc - Creator,Desc - Date Created,Desc - Date Published,Desc - Genre - AAT,Desc - Genre - Legacy Data,Desc - Holding Repository,Desc - Institution,Desc - Language - Primary,Desc - Language - Secondary,Desc - Legacy Identifier,Desc - Notes,Desc - PID,Desc - Place of Publication,Desc - Publisher,Desc - Rights Statement,Desc - RightsStatement.org Designation (Label),Desc - RightsStatement.org Designation (URI),Desc - Subject - Corporate Name - LCNAF,Desc - Subject - Geographic - LCSH,Desc - Subject - Keywords,Desc - Subject - Meeting Name - LCNAF,Desc - Subject - Personal Name - LCNAF,Desc - Subject - Topic - LCSH,Desc - Subject - Uniform Title - LCNAF,Desc - Table of Contents (Books),Desc - Title,Desc - Type of Resource,Digital Object - Data Classification,Digital Object - Structure,Digital Object - Viewer Settings,Digital Object - Visibility,Filename | ||
156758,"Recto: Faculty and Grauates, University of West Tennessee, 1921, Memphis, Tenn., Hooks Bros. Photo., Nahm Studio, Pueblo, Medicine, E.C. Outten, Panama, P.A. Tirador, P.I., E.P. Henry, Okla., D.W. Briggs, A.B., Ark., J.S. Cobb, M.D., Fla., Nurse Training, E. Cheatham, R.N., C.K. Cribb, R.N., G.M. Moore, Tenn., B.T. Anderson, Miss., L.V. Albudy, Tex., F.L. Avery, Okla., Dentistry, M.R. Ransom, Mo., S.P. Robertson, M.D., S.C., E.L. Hairston, Va., W.H. Young, Tex., B.F. McCleave, M.D., S.C., R.L. Flagg, A.B., M.D., Miss., B.F. McCleave, M.D., T.E. Cox, M.D., F.A. Moore, M.D., B.D. Harrell, M.D., R.L. Flagg, M.D., J.W. Beckette, M.D., N.M. Watson, M.D., O.W. Hooge, M.D., F.W. Thurman, D.D.S., W. Waters, D.D.S., W.E. Cloud, D.D.S., U.S. Walton, D.D.S., O.B. Braithwhite, Dean Dental College, J.C. Hairston, Dean - M.D., C.A. Terrell, M.D., Dean Surgery, B.S. Lynk, Ph.C, Dean - Ph.C., M.V. Lynk, M.S., M.D., LL.G., President","Stuart A. Rose Manuscript, Archives and Rare Book Library",MSS 1218,MSS 1218,Robert Langmuir African American photograph collection,,Hooks Brothers.,1921,,card photographs (photographs),,"Stuart A. Rose Manuscript, Archives and Rare Book Library",Emory University,,,MSS1218_B011_I026,,,,,"Emory University does not control copyright for this image.  This image is made available for individual viewing and reference for educational purposes only such as personal study, preparation for teaching, and research.  Your reproduction, distribution, public display or other re-use of any content beyond a fair use as codified in section 107 of US Copyright Law is at your own risk.  We are always interested in learning more about our collections.  If you have information regarding this photograph, please contact [email protected].",No Copyright - United States,http://rightsstatements.org/vocab/NoC-US/1.0/,"University of Tennessee, Memphis.",Memphis (Tenn.),"Education: Colleges, universities, and technical colleges",,"Robertson, S. P., M.D.|Harrell, B. D., M.D.|Cox, T. E., M.D.|Cheatham, E.|Young, W. H.|Terrell, C. A., M.D., Dean of Surgery.|Cloud, W. E., D.D.S.|Braithwhite, O. B., Dean of Dental College.|Cobb, J.S., M.D.|Waters, W., D.D.S.|Hooge, O. W., M.D.|Watson, N. M., M.D.|Walton, U. S., D.D.S.|Thurman, F. W., D.D.S.|Briggs, D. W.|Albudy, L. V.|McCleave, B. F., M.D.|Moore, G. M.|Henry, E. P.|Hairston, E. L.|Ransom, M. R.|Lynk, B. S., Ph.C, Dean of Ph.C.|Avery, F. L.|Flagg, R. L., M.D.|Moore, F. A., M.D.|""Outten, E. C.|Hairston, J. C., Dean of M.D.|Tirador, P. A.|Lynk, M. V., M.S., M.D., LL.G., President.""|Anderson, B. T.|Beckette, J. W., M.D.|Cribb, C. K.",College graduates.|African American photographers.|Nursing.|Medicine.|Dentistry.|Universities and colleges--Faculty.,,,"Faculty and graduates of University of West Tennessee, Memphis, Tenn. in medicine, dentistry and nurse training in 1921",still image,Confidential,Standalone,,Public,MSS1218_B011_I026_P0001_ARCH.tif |
2 changes: 2 additions & 0 deletions
2
spec/fixtures/csv_import/csv_files_with_problems/extra_headers.csv
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
Item ID,another_header_1,another_header_2,Desc - Abstract,Desc - Administrative Unit,Desc - Call Number/MSS Number,Desc - Collection Name,Desc - Contact Information,Desc - Creator,Desc - Date Created,Desc - Date Published,Desc - Genre - AAT,Desc - Genre - Legacy Data,Desc - Holding Repository,Desc - Institution,Desc - Language - Primary,Desc - Language - Secondary,Desc - Legacy Identifier,Desc - Notes,Desc - PID,Desc - Place of Publication,Desc - Publisher,Desc - Rights Statement,Desc - RightsStatement.org Designation (Label),Desc - RightsStatement.org Designation (URI),Desc - Subject - Corporate Name - LCNAF,Desc - Subject - Geographic - LCSH,Desc - Subject - Keywords,Desc - Subject - Meeting Name - LCNAF,Desc - Subject - Personal Name - LCNAF,Desc - Subject - Topic - LCSH,Desc - Subject - Uniform Title - LCNAF,Desc - Table of Contents (Books),Desc - Title,Desc - Type of Resource,Digital Object - Data Classification,Digital Object - Structure,Digital Object - Viewer Settings,Digital Object - Visibility,Filename | ||
156758,,,"Recto: Faculty and Grauates, University of West Tennessee, 1921, Memphis, Tenn., Hooks Bros. Photo., Nahm Studio, Pueblo, Medicine, E.C. Outten, Panama, P.A. Tirador, P.I., E.P. Henry, Okla., D.W. Briggs, A.B., Ark., J.S. Cobb, M.D., Fla., Nurse Training, E. Cheatham, R.N., C.K. Cribb, R.N., G.M. Moore, Tenn., B.T. Anderson, Miss., L.V. Albudy, Tex., F.L. Avery, Okla., Dentistry, M.R. Ransom, Mo., S.P. Robertson, M.D., S.C., E.L. Hairston, Va., W.H. Young, Tex., B.F. McCleave, M.D., S.C., R.L. Flagg, A.B., M.D., Miss., B.F. McCleave, M.D., T.E. Cox, M.D., F.A. Moore, M.D., B.D. Harrell, M.D., R.L. Flagg, M.D., J.W. Beckette, M.D., N.M. Watson, M.D., O.W. Hooge, M.D., F.W. Thurman, D.D.S., W. Waters, D.D.S., W.E. Cloud, D.D.S., U.S. Walton, D.D.S., O.B. Braithwhite, Dean Dental College, J.C. Hairston, Dean - M.D., C.A. Terrell, M.D., Dean Surgery, B.S. Lynk, Ph.C, Dean - Ph.C., M.V. Lynk, M.S., M.D., LL.G., President","Stuart A. Rose Manuscript, Archives and Rare Book Library",MSS 1218,Robert Langmuir African American photograph collection,,Hooks Brothers.,1921,,card photographs (photographs),,"Stuart A. Rose Manuscript, Archives and Rare Book Library",Emory University,,,MSS1218_B011_I026,,,,,"Emory University does not control copyright for this image.  This image is made available for individual viewing and reference for educational purposes only such as personal study, preparation for teaching, and research.  Your reproduction, distribution, public display or other re-use of any content beyond a fair use as codified in section 107 of US Copyright Law is at your own risk.  We are always interested in learning more about our collections.  If you have information regarding this photograph, please contact [email protected].",No Copyright - United States,http://rightsstatements.org/vocab/NoC-US/1.0/,"University of Tennessee, Memphis.",Memphis (Tenn.),"Education: Colleges, universities, and technical colleges",,"Robertson, S. P., M.D.|Harrell, B. D., M.D.|Cox, T. E., M.D.|Cheatham, E.|Young, W. H.|Terrell, C. A., M.D., Dean of Surgery.|Cloud, W. E., D.D.S.|Braithwhite, O. B., Dean of Dental College.|Cobb, J.S., M.D.|Waters, W., D.D.S.|Hooge, O. W., M.D.|Watson, N. M., M.D.|Walton, U. S., D.D.S.|Thurman, F. W., D.D.S.|Briggs, D. W.|Albudy, L. V.|McCleave, B. F., M.D.|Moore, G. M.|Henry, E. P.|Hairston, E. L.|Ransom, M. R.|Lynk, B. S., Ph.C, Dean of Ph.C.|Avery, F. L.|Flagg, R. L., M.D.|Moore, F. A., M.D.|""Outten, E. C.|Hairston, J. C., Dean of M.D.|Tirador, P. A.|Lynk, M. V., M.S., M.D., LL.G., President.""|Anderson, B. T.|Beckette, J. W., M.D.|Cribb, C. K.",College graduates.|African American photographers.|Nursing.|Medicine.|Dentistry.|Universities and colleges--Faculty.,,,"Faculty and graduates of University of West Tennessee, Memphis, Tenn. in medicine, dentistry and nurse training in 1921",still image,Confidential,Standalone,,Public,MSS1218_B011_I026_P0001_ARCH.tif |
Oops, something went wrong.