-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
EM-DAT Extraction Transform #128
Conversation
2551d53
to
08bbaa4
Compare
f99d8cc
to
8341b0e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Few comments
# Get latest emdat extraction object so that we do not need to fetch historical data | ||
latest_extraction = ( | ||
ExtractionData.objects.filter( | ||
source=ExtractionData.Source.EMDAT, status=ExtractionData.Status.SUCCESS, resp_data__isnull=False | ||
) | ||
.exclude(source_validation_status=ExtractionData.ValidationStatus.NO_DATA) | ||
.order_by("-created_at") | ||
.first() | ||
) | ||
if latest_extraction: | ||
with latest_extraction.resp_data.open() as data_file: | ||
data = data_file.read() | ||
|
||
data_json = json.loads(data) | ||
if data_json["data"]["public_emdat"]: | ||
total_hazard_objects = data_json["data"]["public_emdat"]["total_available"] | ||
# total_hazard_objects is passed as offset not to fetch historical data | ||
variables = {"offset": total_hazard_objects, "include_hist": False, "classif": classification_keys} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's move this outside try/catch as we aren't handling this there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@thenav56 we need to use this inside loop, so we cannot move this out side try/catch. I am using this variable to fetch data for the latest year.
with latest_extraction.resp_data.open() as data_file: | ||
data = data_file.read() | ||
|
||
data_json = json.loads(data) | ||
if data_json["data"]["public_emdat"]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should maybe add another JSON field in ExtractionData to store this kind of information (maybe metadata?). Then define dataclasse/schema for that field for each data source if required. Then get this information directly instead of loading raw dataset each time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
data_json = json.loads(data) | ||
if data_json["data"]["public_emdat"]: | ||
total_hazard_objects = data_json["data"]["public_emdat"]["total_available"] | ||
# total_hazard_objects is passed as offset not to fetch historical data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's append with XXX: here as this is a hack
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@thenav56 this is removed in next commit, as we are using year filter to fetch data.
extraction_id = import_hazard_data() | ||
|
||
# Transform the data from emdat | ||
transform_emdat_data(extraction_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason we aren't running this in separate celery tasks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@thenav56 I did this so that transformation starts only after the extraction ends. The whole extraction object is saved in a row. Thus pass extraction_id into transformation_emdat() and do the required transformation.
apps/etl/transform/sources/emdat.py
Outdated
""" | ||
ext_instance = ExtractionData.objects.filter(id=extraction_id).first() | ||
if ext_instance and ext_instance.source_validation_status == ExtractionData.ValidationStatus.NO_DATA: | ||
logger.error( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logger.error(
Let's using warning... using logger.error will send alert to sentry.
apps/etl/transform/sources/emdat.py
Outdated
logger.error( | ||
"No data available", | ||
exe_info=True, | ||
extra={"source": ExtractionData.Source.EMDAT, "extraction_id": ext_instance.id}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extra={"source": ExtractionData.Source.EMDAT, "extraction_id": ext_instance.id},
This is not required here
d8883ad
to
f99d8cc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@thenav56 I have replied to your comments. Please have a see.
# Get latest emdat extraction object so that we do not need to fetch historical data | ||
latest_extraction = ( | ||
ExtractionData.objects.filter( | ||
source=ExtractionData.Source.EMDAT, status=ExtractionData.Status.SUCCESS, resp_data__isnull=False | ||
) | ||
.exclude(source_validation_status=ExtractionData.ValidationStatus.NO_DATA) | ||
.order_by("-created_at") | ||
.first() | ||
) | ||
if latest_extraction: | ||
with latest_extraction.resp_data.open() as data_file: | ||
data = data_file.read() | ||
|
||
data_json = json.loads(data) | ||
if data_json["data"]["public_emdat"]: | ||
total_hazard_objects = data_json["data"]["public_emdat"]["total_available"] | ||
# total_hazard_objects is passed as offset not to fetch historical data | ||
variables = {"offset": total_hazard_objects, "include_hist": False, "classif": classification_keys} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@thenav56 we need to use this inside loop, so we cannot move this out side try/catch. I am using this variable to fetch data for the latest year.
5826b39
to
cae3c0c
Compare
Changes
This PR doesn't introduce any:
print
This PR contains valid: