-
Notifications
You must be signed in to change notification settings - Fork 0
/
spelling_correct.py
211 lines (159 loc) · 8.79 KB
/
spelling_correct.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""This script automates word spelling correction over one or more
text files. The algorithm that identifies and corrects a misspelled
word is quite computationally expensive, therefore runtime can be
slow. Expect 3 to 5 minutes per 1000 text files. Run the following
at the Terminal prompt to get started for Python 2.7:
$ python2 spelling_correct.py --input test_batch_input --logging True
or (for Python 3.x)
$ python3 spelling_correct.py --input test_batch_input --logging True
Although this script does a generally good job, given the complexity of
language, no spell corrector can obtain perfect accuracy for all texts.
To compensate for this to some degree, there is a --logging option
included that can be passed at script call. After a successful run, the
user can eyeball the logs that are generated in correct_log.txt to get a
transparent understanding of where misspelled words were found and how they
were corrected. Particulary bad mistakes may need to be hand-corrected.
This log file will never be overwritten, so if you want to start fresh, delete
it manually. A new one will automatically be created on the next run.
-------------------------------
The core spell correcting function requires the Python autocorrect package.
In order to install this in your current environment, input the following at
your Terminal prompt:
$ pip install autocorrect
For more details on this package see:
> https://pypi.python.org/pypi/autocorrect/0.2.0 --- (Package host)
> https://github.com/phatpiglet/autocorrect/ --- (Source repository)
> http://norvig.com/spell-correct.html --- (Theoretical basis)
-------------------------------
Current issues:
1.) Capitalized words are ignored entirely. This was done so that
that names of people, places, events, etc., don't get attempt
getting respelled.
*** Trying to fix this issue would likely cause even worse issues
2.) Misspelled words can be handled, however grammar errors and/or
incorrectly used words cannot be. "Loss" in the sentence -
"I always loss my keys." would not be respelled as "lose",
although that is the correct word to use in that context.
*** Fixing this issue is out of the scope of this task.
3.) Context around words is not considered, so all the words in "John
wanted to buy some peanut" would be considered correctly spelled,
even though a person would want to change "peanut" to "peanuts".
*** Fixing this issue is out of the scope of this task.
4.) Two (otherwise correctly spelled) words that are not separated by
spaces cause major issues. If the corrector encounters "toprepare"
for example, it will respell it as "prepare", although one can clearly
see it should be "to prepare". "perperson" becoming "perversion" is a
particularly bad mistake.
*** Working on handling this using regex.
5.) URLs and email addresses are troublesome because they are typically
presented entirely in lowercase and involve names of people, places
events, companies, etc.
*** Working on handling this using regex.
"""
# Allow compatibility between Python 2.7 and 3.5
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
from io import open
#####
import glob # Allows batch input file fetching
import os # Allows convenient file handling
import time # Calculates script runtime
import spelling_utils # Collection of helper functions in the utils file
# Allows parsing of user-specified options at script call
from optparse import OptionParser
# Initialize parser, then adds arguments. Input is required,
PARSER = OptionParser()
PARSER.add_option('-i', '--input', dest='input_directory',
help='input directory with files that need spelling correction')
PARSER.add_option('-l', '--logging', dest='logging', default="False",
help='log words that were caught and corrected')
PARSER.add_option('-v', '--verbose', dest='verbose', default="True",
help='toggle console output')
OPTIONS, _ = PARSER.parse_args() # Convert options into usable strings
# Load user options into global constants for use in the script
INPUT_DIR = OPTIONS.input_directory # Directory containing input files (required)
LOGGING = OPTIONS.logging # Logs original and correct words in corrected_log.txt
VERBOSE = OPTIONS.verbose # Display console output (optional)
# Ensure input directory exists, otherwise the script exits
assert os.path.isdir(INPUT_DIR), \
"The input directory doesn't exist in the working directory"
# Default name for output directory, change if needed
OUTPUT_DIR = 'corrected_output'
# Check if output directory exists, otherwise create one
if not os.path.isdir(OUTPUT_DIR):
os.mkdir(OUTPUT_DIR)
##### Main Function #####
def spelling_corrector(input_dir, logging='False', verbose='True'):
"""Takes an input directory and processes all .txt files within, attempting
to replace all misspelled words with their correctly spelled counterparts.
Outputs to a separate directory new files where structure and content of
original text is preserved, albiet with as few spelling errors as possible.
`input_dir` = name of directory that contains input .txt files
`logging` = logs all instances of misspelled words with correct versions
`verbose` = toggles console output (default True)
"""
start_time = time.time() # Start runtime
# Set verbose and logging flags for use throughout
logging = spelling_utils.str_to_bool(logging)
verbose = spelling_utils.str_to_bool(verbose)
# Initialize generator that fetches input files one by one
input_fetcher = glob.iglob('{}/*.txt'.format(input_dir))
# Script begins
if verbose:
print("\n------------------------------\n"
"Spelling correction starting\n"
"------------------------------\n")
attempted_files = 0 # Track how many files attempted conversion
completed_files = 0 # Track how many files converted successfully
# Main loop that batch processes input files
while input_fetcher:
try:
current_input = next(input_fetcher) # Fetch next input file
except StopIteration:
break # Execute when there are no more files to fetch
# Create a path for the output file in the output directory
# Flag each output file with '--CORRECTED', indicating that any
# spelling errors have been found and corrected
current_output = os.path.join(OUTPUT_DIR, '{0}--CORRECTED{1}'.format(
*os.path.splitext(os.path.basename(current_input))))
attempted_files += 1 # Increment prior to processing
# Split text into list of complete words, whitespace, and punctuation
split_passage = spelling_utils.read_input(current_input)
# Check input to ensure that each file contains content before sending
# to parser. Skip and log files that don't meet this criteria
if not split_passage:
# Display bad input file
if verbose:
print("* ".rjust(2) + "No text was found in {}. "
"Nothing to convert.".format(current_input))
# Rebuild text with any spelling errors corrected
rebuilt_passage = spelling_utils.correct_spelling(
split_passage, logging, current_input)
# Write new passage to file in output directory at the path
# specified in current_output variable
with open(current_output, 'w', encoding='utf-8') as file_handle:
file_handle.write(rebuilt_passage)
# Display where current output file is located on the local file system
if verbose:
print("*".rjust(5), "\t", current_input, ">>>", current_output)
# Increment upon successful processing of current input file
completed_files += 1
# Main loop ends
end_time = time.time() # Runtime ends
# Output Report
if verbose:
print("\n--------------------------------------------\n"
"Conversion complete\n"
"{0} of {1} files converted in {2} minutes\n"
"--------------------------------------------\n".format(
completed_files, attempted_files, round((
end_time - start_time) / 60, 3)))
# Exit
return
# Standard code that allows the file to be run as a script from the terminal
if __name__ == '__main__':
spelling_corrector(INPUT_DIR, logging=LOGGING, verbose=VERBOSE)