add experimental import of JSONL files, with optional multiprocessing #4855 #4858 #4918
# during an import that uses multiprocessing.
# see https://stackoverflow.com/a/49461944/3873885
django.setup()
from django.db import connection, connections, transaction |
E402 module level import not at top of file
In fact, this import must come after the django.setup() call, which has to run before the models are imported; see the code comment and the Stack Overflow link.
@@ -828,7 +855,9 @@ def import_business_data(self, data_source, config_file=None, overwrite=None, bu
path = utils.get_valid_path(source)
if path is not None:
print 'Importing {0}. . .'.format(path)
BusinessDataImporter(path, config_file).import_business_data(overwrite=overwrite, bulk=bulk_load, create_concepts=create_concepts, create_collections=create_collections)
BusinessDataImporter(path, config_file).import_business_data(overwrite=overwrite,
bulk=bulk_load, create_concepts=create_concepts, |
E128 continuation line under-indented for visual indent
I would prefer to ignore this rule in this context. The linter would have me indent the continuation of the arguments very far to the right, which in my mind is more difficult to read.
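One possible middle ground, offered only as a sketch: E128 is reported for visual indentation, where the first argument sits on the same line as the opening parenthesis. Breaking the line immediately after the parenthesis (a hanging indent) keeps the arguments near the left margin and should not trigger it. The function below is a hypothetical stub used to make the snippet runnable; it is not the real importer method.

```python
def import_business_data(overwrite=None, bulk=False, create_concepts=None,
                         create_collections=None):
    """Hypothetical stub standing in for the importer method in this PR."""
    return {
        "overwrite": overwrite,
        "bulk": bulk,
        "create_concepts": create_concepts,
        "create_collections": create_collections,
    }


# Hanging indent: nothing follows the opening parenthesis on the first line,
# so the arguments are indented one level instead of being aligned with it.
result = import_business_data(
    overwrite="overwrite",
    bulk=False,
    create_concepts="create",
    create_collections=False,
)
```

The trade-off is a few extra lines in exchange for a call that stays within the line length limit without pushing arguments to the far right.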
BusinessDataImporter(path, config_file).import_business_data(overwrite=overwrite, bulk=bulk_load, create_concepts=create_concepts, create_collections=create_collections)
BusinessDataImporter(path, config_file).import_business_data(overwrite=overwrite,
bulk=bulk_load, create_concepts=create_concepts,
create_collections=create_collections, use_multiprocessing=use_multiprocessing) |
E128 continuation line under-indented for visual indent
I would prefer to ignore this rule in this context. The linter would have me indent the continuation of the arguments very far to the right, which in my mind is more difficult to read.
…into 4855_4858_jsonl_multiprocessing_4_4_x
@@ -503,6 +509,7 @@ def load_business_data(package_dir):
business_data.append(os.path.join(package_dir, 'business_data', f))
else:
business_data += glob.glob(os.path.join(package_dir, 'business_data','*.json'))
business_data += glob.glob(os.path.join(package_dir, 'business_data','*.jsonl')) |
E231 missing whitespace after ','
This can now be tested by loading the demo HPLA package, using the branch I set up that has JSONL files in the business data directory:
To test multiprocessing you'll need to import those business data files individually and add the --use_multiprocessing flag, as described in the main body of this pull request.
Types of changes
Description of Change
This pull request adds support for importing JSONL (JSON Lines) files through the existing import commands.
python manage.py packages -o import_business_data -s resources.jsonl
It also adds support for the use of multiprocessing during JSONL import.
python manage.py packages -o import_business_data -s resources.jsonl --use_multiprocessing
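As a rough illustration of the idea behind these two commands (not the code in this pull request), the sketch below reads JSONL input one record per line and optionally hands the lines to a pool of worker processes. The names `import_resource`, `handle_line`, `import_jsonl`, the `processes` parameter, and the `"id"` field are all hypothetical.

```python
import json
from multiprocessing import Pool


def import_resource(record):
    """Hypothetical stand-in for saving one resource; returns its id."""
    return record["id"]


def handle_line(line):
    """Parse a single JSONL line (one JSON document) and import it."""
    return import_resource(json.loads(line))


def import_jsonl(lines, use_multiprocessing=False, processes=2):
    """Import every non-blank line, sequentially or with a process pool."""
    lines = [ln for ln in lines if ln.strip()]
    if not use_multiprocessing:
        return [handle_line(ln) for ln in lines]
    # Each line is independent, so lines can be distributed across workers.
    # Pool.map returns results in input order.
    with Pool(processes=processes) as pool:
        return pool.map(handle_line, lines)
```

Because each line of a JSONL file is a complete JSON document, the file can be split across workers without parsing it as a whole first. When trying the multiprocessing path, call `import_jsonl(..., use_multiprocessing=True)` from under an `if __name__ == "__main__":` guard so that worker processes can re-import the module safely.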
Both of these new features should be considered "experimental". One of the main reasons for this is that the baked-in print statements that are within the actual import code (as well as the error reporting) have not yet been updated to work with these two new features. So, the console printouts during JSONL import look like this:
and while using multiprocessing they get all jumbled up:
Another important thing to note is that while multiple processes are running, you can't just press Ctrl+C to stop the operation, as the processes just keep spawning.
To address these shortcomings I have added some warnings.
For JSONL:
For multiprocessing:
and that warning is followed by a confirmation prompt (which you can bypass by adding -y/--yes to the initial command).
Issues Solved
#4858
#4855
Checklist
Ticket Background
Further comments