EM-DAT Extraction Transform #128

Merged
merged 4 commits into from
Feb 4, 2025

Conversation

Rup-Narayan-Rajbanshi
Contributor

@Rup-Narayan-Rajbanshi Rup-Narayan-Rajbanshi commented Jan 23, 2025

Changes

  • Add Extraction and Transformation for EMDAT

This PR doesn't introduce any:

  • temporary files, auto-generated files or secret keys
  • n+1 queries
  • flake8 issues
  • print
  • typos
  • unwanted comments

This PR contains valid:

  • tests
  • permission checks (tests here too)
  • translations

Member

@thenav56 thenav56 left a comment

Looks good. A few comments.

Comment on lines 144 to 161
# Get latest emdat extraction object so that we do not need to fetch historical data
latest_extraction = (
    ExtractionData.objects.filter(
        source=ExtractionData.Source.EMDAT, status=ExtractionData.Status.SUCCESS, resp_data__isnull=False
    )
    .exclude(source_validation_status=ExtractionData.ValidationStatus.NO_DATA)
    .order_by("-created_at")
    .first()
)
if latest_extraction:
    with latest_extraction.resp_data.open() as data_file:
        data = data_file.read()

    data_json = json.loads(data)
    if data_json["data"]["public_emdat"]:
        total_hazard_objects = data_json["data"]["public_emdat"]["total_available"]
        # total_hazard_objects is passed as offset not to fetch historical data
        variables = {"offset": total_hazard_objects, "include_hist": False, "classif": classification_keys}
Member

Let's move this outside try/catch as we aren't handling this there.

Contributor Author

@thenav56 we need to use this inside the loop, so we cannot move it outside the try/catch. I am using this variable to fetch data for the latest year.
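
One way to keep the try/catch body small while still computing the value inside the loop would be to pull the offset logic into a helper. The following is only a sketch under stated assumptions: the name `build_query_variables` is hypothetical, the raw response is passed in as a string rather than read from `resp_data`, and falling back to an offset of 0 when there is no usable previous extraction is an assumption, not something the PR states.

```python
import json
from typing import Any, Dict, List, Optional


def build_query_variables(raw_response: Optional[str], classification_keys: List[str]) -> Dict[str, Any]:
    """Build the query variables, skipping records a previous run already saw."""
    # Assumption: with no usable previous extraction, start from offset 0.
    variables: Dict[str, Any] = {
        "offset": 0,
        "include_hist": False,
        "classif": classification_keys,
    }
    if not raw_response:
        return variables

    # Same path the reviewed snippet reads: data -> public_emdat -> total_available.
    public_emdat = (json.loads(raw_response).get("data") or {}).get("public_emdat")
    if public_emdat:
        # Reuse the previous total as the offset, as in the reviewed snippet.
        variables["offset"] = public_emdat["total_available"]
    return variables
```

The helper touches no database or network state, so it can be called from inside the loop and tested on its own.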

Comment on lines +154 to +126
with latest_extraction.resp_data.open() as data_file:
    data = data_file.read()

data_json = json.loads(data)
if data_json["data"]["public_emdat"]:
Member

We should maybe add another JSON field in ExtractionData to store this kind of information (maybe metadata?). Then define a dataclass/schema for that field for each data source if required. Then get this information directly instead of loading the raw dataset each time.
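
A minimal sketch of what such a per-source schema might look like, assuming a dataclass that is serialized into a JSON column. The names `EmdatMetadata`, `from_response`, `to_json`, and `from_json` are hypothetical and do not come from this PR; only the `data.public_emdat.total_available` path is taken from the snippet under review.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class EmdatMetadata:
    """Hypothetical schema for a per-extraction metadata JSON field."""

    total_available: int

    @classmethod
    def from_response(cls, response: Dict[str, Any]) -> Optional["EmdatMetadata"]:
        # Same path the reviewed code reads from the raw response file.
        public_emdat = (response.get("data") or {}).get("public_emdat")
        if not public_emdat:
            return None
        return cls(total_available=int(public_emdat["total_available"]))

    def to_json(self) -> Dict[str, Any]:
        # Plain dict, suitable for storing in a JSON column.
        return asdict(self)

    @classmethod
    def from_json(cls, raw: Dict[str, Any]) -> "EmdatMetadata":
        return cls(total_available=int(raw["total_available"]))
```

With something like this stored at extraction time, a later run could read `total_available` from the metadata field instead of opening and parsing the raw response file.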

Contributor Author

@thenav56 I added a JSON field in PR #129 for this functionality.

data_json = json.loads(data)
if data_json["data"]["public_emdat"]:
    total_hazard_objects = data_json["data"]["public_emdat"]["total_available"]
    # total_hazard_objects is passed as offset not to fetch historical data
Member

Let's prefix this comment with XXX: as this is a hack.

Contributor Author

@thenav56 this is removed in the next commit, as we are now using a year filter to fetch data.

Comment on lines +10 to +13
extraction_id = import_hazard_data()

# Transform the data from emdat
transform_emdat_data(extraction_id)
Member

Any reason we aren't running these in separate Celery tasks?

Contributor Author

@thenav56 I did this so that the transformation starts only after the extraction ends. The whole extraction object is saved in a single row, so I pass extraction_id into transform_emdat_data() and do the required transformation there.

"""
ext_instance = ExtractionData.objects.filter(id=extraction_id).first()
if ext_instance and ext_instance.source_validation_status == ExtractionData.ValidationStatus.NO_DATA:
    logger.error(
Member

logger.error(

Let's use logger.warning here; logger.error will send an alert to Sentry.

logger.error(
    "No data available",
    exe_info=True,
    extra={"source": ExtractionData.Source.EMDAT, "extraction_id": ext_instance.id},
Member

extra={"source": ExtractionData.Source.EMDAT, "extraction_id": ext_instance.id},

This is not required here
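
A minimal sketch of how the two logging suggestions above could combine, using only the standard `logging` module. The logger name and the `report_no_data` helper are hypothetical. The Sentry behaviour is an assumption based on its default logging integration, where ERROR-level records are reported as events and lower levels are not. Separately, the standard keyword for attaching a traceback is `exc_info`; `logging` rejects unknown keywords such as `exe_info` with a `TypeError`.

```python
import logging

# Hypothetical logger name; the project's actual logger is not shown in the PR.
logger = logging.getLogger("emdat.transform")


def report_no_data(extraction_id: int) -> None:
    """Record an empty extraction without raising an error-level alert."""
    # WARNING keeps the message in the logs; an empty result is an expected
    # outcome here rather than a failure, so no traceback or extra is attached.
    logger.warning("No data available for EM-DAT extraction %s", extraction_id)
```

Putting the extraction id into the message keeps it searchable in plain log output without the `extra` dictionary.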

@Rup-Narayan-Rajbanshi force-pushed the feature/extract-emdat branch 2 times, most recently from d8883ad to f99d8cc, on January 27, 2025 09:52
Contributor Author

@Rup-Narayan-Rajbanshi Rup-Narayan-Rajbanshi left a comment

@thenav56 I have replied to your comments. Please have a look.

@Rup-Narayan-Rajbanshi changed the title from "Feature/extract emdat" to "EM-DAT Extraction Transform" on Jan 28, 2025
@frozenhelium frozenhelium requested a review from thenav56 February 4, 2025 04:10
@frozenhelium frozenhelium merged commit 1ce874b into develop Feb 4, 2025
@frozenhelium frozenhelium deleted the feature/extract-emdat branch February 4, 2025 04:14
3 participants