Update Structured File Ingestor #180

smgallo · 2017-07-07T16:31:56Z

Update the StructuredFileIngestor to utilize the StructuredFile data endpoint.

This PR is a prerequisite for the following PRs:

XDMoD-VA: Updates to support new StructuredFileIngestor xdmod-value-analytics#18
XSEDE: ubccr/xdmod-xsede#48

Description

This work integrates the new StructuredFile data endpoint into the StructuredFileIngestor.

StructuredFileIngestor

Since the StructuredFile endpoint implements Iterator, we can now simply iterate over the endpoint to access the data records in the file (or in multiple files if using the DirectoryScanner endpoint). Common functionality was consolidated in parent classes to make it more broadly useful to all ingestors. For example, the code generating the destination_record_map was moved from pdoIngestor to aRdbmsDestinationAction so it can be used by StructuredFileIngestor. This greatly simplifies the ingestor and makes the specification of actions using that ingestor more concise.
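
As a rough illustration of the iteration pattern described above, here is a minimal sketch written in Python rather than the project's PHP. The class name StructuredFileSketch, its constructor, and the ingest helper are hypothetical and are not part of XDMoD; the sketch only assumes that the endpoint yields one record per iteration step.

```python
import io
import json


class StructuredFileSketch:
    """Hypothetical stand-in for a file endpoint that yields one record per step."""

    def __init__(self, stream):
        # Assumption: the stream holds a JSON array of objects. The real
        # endpoint is described as handling more formats than this sketch.
        self._records = json.load(stream)

    def __iter__(self):
        return iter(self._records)


def ingest(endpoint):
    """Consume records with a plain loop, as one would a database result set."""
    loaded = []
    for record in endpoint:
        loaded.append(record)
    return loaded


source = io.StringIO('[{"id": 1, "name": "core"}, {"id": 2, "name": "gpu"}]')
rows = ingest(StructuredFileSketch(source))
print(rows)
```

The point of the design is that the ingestor loop does not need to know which file format, or how many files, sit behind the endpoint.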

We have deprecated the use of the source_data and destination_field directives in the action configuration for structured files. Data that was previously contained directly in the action definition has been moved into separate data files in the etl_data.d directory. Since the data was entirely made up of JSON arrays, a header record was added containing the field names (this is now supported by the StructuredFile endpoint as described below). The destination_field directives have also been removed and replaced by the destination_record_map directive (now automatically generated in many cases). This simplifies configuration and allows us to separate action configuration and data.

StructuredFile Data Endpoint

The StructuredFile endpoint now supports auto-discovery of the fields in records and also supports returning a subset or superset of the fields using the following directives:

header_record A boolean where true (the default value) indicates that the first record is a header record containing field names. This may be ignored by the implementation if a specific file format (such as a JSON object) does not support this behavior.
field_names An array of field names to return. Different file formats may interpret field names differently, but all implementations are expected to return data for all fields specified here. If a field does not exist in the data, its value is expected to be null. If this option is not present, all existing fields in the record are returned. Note that the existing record fields are typically determined by examining the first record present.

CSV/TSV files (to be supported in the future) may contain an optional header row specifying field names. If a header row exists it contains the field names and is skipped for data import. If the header row does not exist, the field names must be specified in the endpoint options. There is no support for optional fields.

header_record If true, skip the first record when reading data and use its values as the field names.
field_names: [...] If header_record is false, field_names is required; if the file contains more fields than field names, the additional data fields are ignored. If header_record is true, field_names specifies the fields to return, with null values provided for any requested field that does not exist in the data (e.g., when more fields are requested than exist in the data).

JSON arrays are treated similarly to CSV/TSV files.

header_record If true, skip the first record and use its values as the field names.
field_names: [...] If header_record is false, field_names is required; if the file contains more fields than field names, the additional data fields are ignored. If header_record is true, field_names specifies the fields to return, with null values provided for any requested field that does not exist in the data.
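
The rules for JSON arrays can be sketched as follows. This is an illustrative Python model of the behavior described above, not the PHP implementation in this PR; the function name read_array_records and its signature are assumptions made for the example.

```python
import json


def read_array_records(text, header_record=True, field_names=None):
    """Turn a JSON array of arrays into a list of name -> value records."""
    rows = json.loads(text)
    if header_record:
        if not rows:
            return []
        # The first record names the fields and is not returned as data.
        present, data = rows[0], rows[1:]
    else:
        if field_names is None:
            raise ValueError("field_names is required when header_record is false")
        present, data = field_names, rows
    requested = field_names if field_names is not None else present
    records = []
    for row in data:
        # zip stops at the shorter sequence, so surplus values in a row are dropped.
        by_name = dict(zip(present, row))
        # Requested fields that are absent from the data come back as None (null).
        records.append({name: by_name.get(name) for name in requested})
    return records


with_header = '[["id", "name"], [1, "a"], [2, "b"]]'
print(read_array_records(with_header))
print(read_array_records(with_header, field_names=["name", "missing"]))
print(read_array_records('[[1, "a", "x"]]', header_record=False, field_names=["id", "name"]))
```

The three calls cover auto-discovery from the header, requesting a field that is not in the data, and ignoring extra data fields when no header is present.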

JSON objects specify field names as keys in the object. If field_names is not specified, the keys of the first object will be examined and used as the field names. Support for optional fields is handled by specifying the complete set of possible fields using field_names and returning null values for fields not present in the record.

header_record Ignored for this data type.
field_names: [...] If specified, return these record fields with null values for any fields that are not found in the record. If the data contains optional fields, then field_names must be specified and must contain the full list of possible field names. If field_names is not specified and subsequent data contains more fields than were found in the first record, those fields will be ignored and only fields found in the first record will be returned.
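
For JSON objects, the effect of field_names on optional fields can be modeled with a short sketch. As before, this is Python used for illustration only; read_object_records is a hypothetical name. Returning null for a first-record field that is missing from a later record is an assumption of the sketch, not something the description above states.

```python
def read_object_records(objects, field_names=None):
    """Return each object restricted to a fixed set of fields."""
    if not objects:
        return []
    # Without field_names, the keys of the first object define the fields.
    names = field_names if field_names is not None else list(objects[0].keys())
    # Fields missing from a record are returned as None (null).
    return [{name: obj.get(name) for name in names} for obj in objects]


objects = [{"id": 1}, {"id": 2, "queue": "debug"}]
print(read_object_records(objects))
print(read_object_records(objects, field_names=["id", "queue"]))
```

In the first call the optional queue field is lost because it does not appear in the first record; listing it in field_names, as in the second call, keeps it and fills in null where it is absent.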

Motivation and Context

Preparatory work for ingesting Eucalyptus data. The end goal is to be able to use the StructuredFile data endpoint to parse JSON, CSV/TSV, and resource manager log files. This will allow us to consolidate ingestion code in one place and simply iterate over parsed results the same way we do for database result sets.

Tests performed

All existing tests pass. In addition, ingestion was performed on all of the data for XDMoD-VA (JSON files), Resource Allocations (XDCDB query, JSON file, Update ingestor), job dimension tables (JSON files), and XDCDB jobs (XDCDB query, MySQL query, ingestion, aggregation).

XDMoD-VA

./etl_overseer.php -c ../../../etc/etl/etl.json -v debug -p value-analytics -o "experimental_enable_batch_aggregation=true"

./verify_table_data.php -s modw_value_analytics_baseline -d modw_value_analytics_etltest -n 1 -v info \
  -t people -x jobs_person_id
./verify_table_data.php -s modw_value_analytics_baseline -d modw_value_analytics_etltest -n 1 -v info \
  -t people_organizations -x division -x appointment_type --ignore-column-type
./verify_table_data.php -s modw_value_analytics_baseline -d modw_value_analytics_etltest -n 1 -v info \
  -t people_groups \
  -t people_identifiers \
  -t organizations \
  -t identity_providers \
  -t groups \
  -t grant_types \
  -t grants_people \
  -t grants \
  -t grant_roles \
  -t funding_agencies
./verify_table_data.php -s modw_value_analytics_baseline -d modw_value_analytics_etltest -n 1 -v info \
  -t va_grant_fact_by_year -x pi_jobs_person_id -w 'src.year_id < 201700000'
./verify_table_data.php -s modw_value_analytics_baseline -d modw_value_analytics_etltest -n 1 -v info \
  -t va_grant_fact_by_quarter -x pi_jobs_person_id -w 'src.quarter_id < 201700002'
./verify_table_data.php -s modw_value_analytics_baseline -d modw_value_analytics_etltest -n 1 -v info \
  -t va_grant_fact_by_month -x pi_jobs_person_id -w 'src.month_id < 201700005'
./verify_table_data.php -s modw_value_analytics_baseline -d modw_value_analytics_etltest -n 1 -v info \
  -t va_grant_fact_by_day -x pi_jobs_person_id -w 'src.day_id < 201700145'

Resource Allocations

./etl_overseer.php -c ../../../etc/etl/etl.json -v debug -p jobs-common

./verify_table_data.php -s modw_cloud_baseline -d modw_cloud_etltest -n 1 -v info \
  -t countable_type \
  -t job_record_type \
  -t job_task_type \
  -t unit \
  -t submission_venue

Common Job Tables

./etl_overseer.php -c ../../../etc/etl/etl.json -v debug -p jobs-common

./verify_table_data.php -s modw_cloud_baseline -d modw_cloud_etltest -n 1 -v info \
  -t countable_type \
  -t job_record_type \
  -t job_task_type \
  -t unit \
  -t submission_venue

XDCDB Jobs

./etl_overseer.php -c ../../../etc/etl/etl.json -s "2017-01-01 00:00:00" -e "2017-01-31 23:59:59" -v debug -p jobs-xdcdb

./verify_table_data.php -s modw_cloud_baseline -d modw_cloud_etltest -n 1 -v info \
  -x last_modified \
  -t job_records \
  -t job_tasks
./verify_table_data.php -s modw_cloud_baseline -d modw_cloud_etltest -n 1 -v info \
  -x last_modified \
  -t jobfact_by_year \
  -t jobfact_by_month \
  -t jobfact_by_quarter \
  -t jobfact_by_year

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

My code follows the code style of this project as found in the CONTRIBUTING document.
I have added tests to cover my changes.
All new and existing tests passed.

plessbd

Remove commented out code then LGTM

plessbd · 2017-07-17T14:25:23Z

classes/ETL/Aggregator/pdoAggregator.php

@@ -770,11 +773,13 @@ protected function _execute($aggregationUnit)
            $sourceJoins = $this->etlSourceQuery->joins;
            $firstJoin = array_shift($sourceJoins);
            $newFirstJoin = clone $firstJoin;
-            $newFirstJoin->setName($tmpTableName);
+            // $newFirstJoin->setName($tmpTableName);


remove commented out code

plessbd · 2017-07-17T14:25:34Z

classes/ETL/Aggregator/pdoAggregator.php

            $newFirstJoin->schema = $this->sourceEndpoint->getSchema();

-            $this->etlSourceQuery->deleteJoins();
-            $this->etlSourceQuery->addJoin($newFirstJoin);
+            // $this->etlSourceQuery->deleteJoins();


remove commented out code

plessbd · 2017-07-17T14:25:41Z

classes/ETL/Aggregator/pdoAggregator.php

@@ -788,8 +793,9 @@ protected function _execute($aggregationUnit)



remove commented out code

plessbd · 2017-07-17T14:25:47Z

classes/ETL/Aggregator/pdoAggregator.php

-            $this->etlSourceQuery->addJoin($this->etlSourceQueryOrigFromTable);
+            // $this->etlSourceQuery->deleteJoins();
+            $this->etlSourceQuery->joins = array($this->etlSourceQueryOrigFromTable);
+            // $this->etlSourceQuery->addJoin($this->etlSourceQueryOrigFromTable);


remove commented out code

* Add support for iterating over a StructuredFile data endpoint
* Improve comments, logging, debugging output, and phpcs style fixes
* Move handling of destination_record_map config directive from pdoIngestor to aRdbmsDestinationAction
* Consolidate common functionality. Move pre- and post-execute tasks into aAction and aRdbmsDestinationAction
* Add handling of requested and expected record fields to StructuredFile endpoint plus associated tests
* Apply command-line where clause to table row count calculation in table verifier
* Fix support for passing overrides on the ETL command line
* Allow public properties of any object type to be verified
* Fix bugs in experimental aggregation from update to DbModel
* Add handling of record fields to StructuredFile endpoint
* Update ingestors to use updated StructuredFile endpoint
* Update sample ETL files
* Move source data out of ETL table definitions into data directory. Update ETL config to use new endpoint

smgallo added 20 commits June 19, 2017 09:50

7764fc0 Add support for iterating over a StructuredFile data endpoint
93ee7f9 Improve comments, logging, and phpcs style fixes
765f18c Merge branch 'xdmod7.0' of github.com:ubccr/xdmod into etl/structured-file-ingestor
d7e2402 Move handling of destination_record_map config directive from pdoIngestor to aRdbmsDestinationAction
536ee64 Consolidate common functionality. Move pre- and post-execute tasks into aAction and aRdbmsDestinationAction
59f5049 Merge branch 'xdmod7.0' of github.com:ubccr/xdmod into etl/structured-file-ingestor
c0af908 Add handling of requested and expected record fields to StructuredFile endpoint plus associated tests
a967a25 Merge branch 'xdmod7.0' of github.com:ubccr/xdmod into etl/structured-file-ingestor
8381b8e Apply command-line where clause to table row count calculation
efe7468 Fix support for passing overrides on the ETL command line
6f7a502 Improve debugging output
5a5f352 Allow public properties of any object type to be verified
79a8f27 Fix bugs in experimental aggregation from update to DbModel
7c84f41 Improve debugging
9d148d4 PHPCS fixes
0cbbe57 PHPCS fixes
4c0c698 Add handling of record fields to StructuredFile endpoint
635d774 Update ingestors to use updated StructuredFile endpoint
d6863bd Update sample ETL files
ab66e93 Move source data out of ETL table definitions into data directory. Update ETL config to use new endpoint

smgallo added this to the v7.0.0 milestone Jul 7, 2017

smgallo added the bug, enhancement, and Category:ETL labels Jul 7, 2017

smgallo requested review from tyearke, plessbd, ryanrath and jpwhite4 July 7, 2017 16:42

Fix PHPCS style error

786fc57

plessbd suggested changes Jul 17, 2017

Remove commented out code

33d7bce

plessbd approved these changes Jul 18, 2017

smgallo merged commit 7649467 into ubccr:xdmod7.0 Jul 19, 2017

smgallo deleted the etl/structured-file-ingestor branch September 22, 2017 17:39

jtpalmer mentioned this pull request Nov 9, 2017

Update storage ingestors for StructuredFileIngestor #330

Merged

6 tasks
