Update Structured File Ingestor #180
…stor to aRdbmsDestinationAction
…to aAction and aRdbmsDestinationAction
…e endpoint plus associated tests
…date ETL config to use new endpoint
Remove commented out code then LGTM
@@ -770,11 +773,13 @@ protected function _execute($aggregationUnit)
$sourceJoins = $this->etlSourceQuery->joins;
$firstJoin = array_shift($sourceJoins);
$newFirstJoin = clone $firstJoin;
$newFirstJoin->setName($tmpTableName);
// $newFirstJoin->setName($tmpTableName);
remove commented out code
$newFirstJoin->schema = $this->sourceEndpoint->getSchema();

$this->etlSourceQuery->deleteJoins();
$this->etlSourceQuery->addJoin($newFirstJoin);
// $this->etlSourceQuery->deleteJoins();
remove commented out code
@@ -788,8 +793,9 @@ protected function _execute($aggregationUnit)
remove commented out code
$this->etlSourceQuery->addJoin($this->etlSourceQueryOrigFromTable);
// $this->etlSourceQuery->deleteJoins();
$this->etlSourceQuery->joins = array($this->etlSourceQueryOrigFromTable);
// $this->etlSourceQuery->addJoin($this->etlSourceQueryOrigFromTable);
remove commented out code
* Add support for iterating over a StructuredFile data endpoint
* Improve comments, logging, debugging output, and phpcs style fixes
* Move handling of destination_record_map config directive from pdoIngestor to aRdbmsDestinationAction
* Consolidate common functionality. Move pre- and post-execute tasks into aAction and aRdbmsDestinationAction
* Add handling of requested and expected record fields to StructuredFile endpoint plus associated tests
* Apply command-line where clause to table row count calculation in table verifier
* Fix support for passing overrides on the ETL command line
* Allow public properties of any object type to be verified
* Fix bugs in experimental aggregation from update to DbModel
* Add handling of record fields to StructuredFile endpoint
* Update ingestors to use updated StructuredFile endpoint
* Update sample ETL files
* Move source data out of ETL table definitions into data directory. Update ETL config to use new endpoint
Update the StructuredFileIngestor to utilize the StructuredFile data endpoint.
This PR is a prerequisite for the following PRs:
Description
This work integrates the new `StructuredFile` data endpoint into the `StructuredFileIngestor`.
StructuredFileIngestor
Since the `StructuredFile` endpoint implements `Iterator`, we can now simply iterate over the endpoint to access the data records in the file (or multiple files if using the `DirectoryScanner` endpoint). Common functionality was consolidated in parent classes to make it more broadly useful to all ingestors. For example, the code generating the `destination_record_map` was moved from `pdoIngestor` to `aRdbmsDestinationAction` so it can be used by `StructuredFileIngestor`, allowing the ingestor to be greatly simplified and the specification of actions using that ingestor to be more concise.
We have deprecated the use of the `source_data` and `destination_field` directives in the action configuration for structured files. Data that was previously contained directly in the action definition has been moved to its own data files in the `etl_data.d` directory. Since the data was entirely made up of JSON arrays, a header record was added containing the field names (this is now supported by the `StructuredFile` endpoint as described below). The `destination_field` directives have also been removed and replaced by the `destination_record_map` directive (now automatically generated in many cases). This simplifies configuration and allows us to separate action configuration and data.
StructuredFile Data Endpoint
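As a rough illustration of the `header_record` and `field_names` directives covered in this section, the sketch below applies them to array-style records (JSON arrays or CSV/TSV rows). The function `selectRecordFields()` is a hypothetical helper written for this example, not part of the endpoint's API, and it assumes PHP 7.1 or later.

```php
<?php
/**
 * Hypothetical helper (not the actual endpoint code): turn array-style
 * records into associative records using the header_record and
 * field_names rules.
 */
function selectRecordFields(array $rows, bool $headerRecord = true, ?array $fieldNames = null): array
{
    if ($headerRecord) {
        // The first record holds the field names and is not returned as data.
        $discovered = array_shift($rows) ?? [];
        $requested = $fieldNames ?? $discovered;
    } else {
        if ($fieldNames === null) {
            throw new InvalidArgumentException('field_names is required when header_record is false');
        }
        // Names are assigned by position, so extra data fields are ignored.
        $discovered = $fieldNames;
        $requested = $fieldNames;
    }

    $records = [];
    foreach ($rows as $row) {
        // Pair each discovered name with the value at the same position.
        $named = [];
        foreach ($discovered as $index => $name) {
            $named[$name] = $row[$index] ?? null;
        }
        // Return only the requested fields, using null for missing ones.
        $record = [];
        foreach ($requested as $name) {
            $record[$name] = $named[$name] ?? null;
        }
        $records[] = $record;
    }
    return $records;
}

// Header record present: request one existing field and one missing field.
$withHeader = selectRecordFields(
    [['id', 'name'], [1, 'alpha'], [2, 'beta']],
    true,
    ['name', 'queue']
);

// No header record: names are positional and the third value is dropped.
$withoutHeader = selectRecordFields([[1, 'alpha', 'extra']], false, ['id', 'name']);
```

In the first call every returned record carries a `queue` key set to `null`, since that field is requested but absent from the data.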
The `StructuredFile` endpoint now supports auto-discovery of the fields in records and also supports returning a subset or superset of the fields using the following directives:
* `header_record`: A boolean where `true` (the default value) indicates that the first record is a header record containing field names. This may be ignored by the implementation if a specific file format (such as a JSON object) does not support this behavior.
* `field_names`: An array of field names to return. Different file formats may interpret field names differently, but all implementations are expected to return data for all fields specified here. If a field does not exist in the data, its value is expected to be `null`. If this option is not present, all existing fields in the record are returned. Note that the existing record fields are typically determined by examining the first record present.
CSV/TSV files (to be supported in the future) may contain an optional header row specifying field names. If a header row exists, it contains the field names and is skipped for data import. If the header row does not exist, the field names must be specified in the endpoint options. There is no support for optional fields.
* `header_record`: If `true`, skip the first record for data and use its values as the field names.
* `field_names: [...]`: If `header_record` is `false`, `field_names` is required and, if there are more fields present in the file than field names, the additional data fields are ignored. If `header_record` is `true`, then `field_names` specifies the fields to return, providing `null` values if a requested field does not exist in the data (e.g., there are more fields requested than exist in the data).
JSON arrays are treated similarly to CSV/TSV files.
* `header_record`: If `true`, skip the first record and use its values as the field names.
* `field_names: [...]`: If `header_record` is `false`, `field_names` is required and, if there are more fields present in the file than field names, the additional data fields are ignored. If `header_record` is `true`, then `field_names` specifies the fields to return, providing `null` values if a requested field does not exist in the data.
JSON objects specify field names as keys in the object. If `field_names` is not specified, the keys of the first object will be examined and used as the field names. Support for optional fields is handled by specifying the complete set of possible fields using `field_names` and returning `null` values for fields not present in the record.
* `header_record`: Ignored for this data type.
* `field_names: [...]`: If specified, return these record fields with `null` values for any fields that are not found in the record. If the data contains optional fields, then `field_names` must be specified and must contain the full list of possible field names. If `field_names` is not specified and subsequent data contains more fields than were found in the first record, those fields will be ignored and only fields found in the first record will be returned.
Motivation and Context
Preparatory work for ingesting Eucalyptus data. The end goal is to be able to use the StructuredFile data endpoint to parse JSON, CSV/TSV, and resource manager log files. This will allow us to consolidate ingestion code in one place and simply iterate over parsed results the same way we do for database result sets.
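To make the "iterate over parsed results the same way we do for database result sets" idea concrete, here is a minimal sketch. `JsonRecordEndpoint` and `collectFieldValues()` are hypothetical names invented for this example and are not the actual endpoint or ingestor code; the sketch assumes PHP 7.1 or later and that every record is a JSON object.

```php
<?php
/**
 * Hypothetical stand-in for a file-backed endpoint (not the real class):
 * it decodes a JSON array of objects and exposes the records via Iterator.
 */
class JsonRecordEndpoint implements Iterator
{
    private $records;
    private $position = 0;

    public function __construct(string $json)
    {
        $decoded = json_decode($json, true);
        if (!is_array($decoded)) {
            throw new InvalidArgumentException('Expected a JSON array of records');
        }
        $this->records = array_values($decoded);
    }

    public function current(): ?array
    {
        return $this->records[$this->position] ?? null;
    }

    public function key(): int
    {
        return $this->position;
    }

    public function next(): void
    {
        $this->position++;
    }

    public function rewind(): void
    {
        $this->position = 0;
    }

    public function valid(): bool
    {
        return isset($this->records[$this->position]);
    }
}

/**
 * The consumer depends only on iterable, so a database result set
 * (PDOStatement is Traversable) or a file endpoint can be passed in.
 */
function collectFieldValues(iterable $source, string $field): array
{
    $values = [];
    foreach ($source as $record) {
        $values[] = $record[$field] ?? null;
    }
    return $values;
}

$endpoint = new JsonRecordEndpoint('[{"id": 1, "name": "alpha"}, {"id": 2, "name": "beta"}]');
$names = collectFieldValues($endpoint, 'name');
```

Because the consumer is typed against `iterable` rather than a concrete class, swapping the data source does not require changing the loop.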
Tests performed
All existing tests pass. In addition, ingestion was performed on all of the data for XDMoD-VA (JSON files), Resource Allocations (XDCDB query, JSON file, Update ingestor), job dimension tables (JSON files), and XDCDB jobs (XDCDB query, MySQL query, ingestion, aggregation).
XDMoD-VA
Resource Allocations
Common Job Tables
XDCDB Jobs
Types of changes
Checklist: