-
Notifications
You must be signed in to change notification settings - Fork 68
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rewrite of StructuredFile data endpoint to support Euca ingest #145
Conversation
…ata filtering by external processes
… reduce chance of false positives
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, it looks good. The only overarching concern I had was about how to specify the order filters are applied to a file if there's more than one filter. It looks like the order is currently determined by the order the key-value pairs are given in a config file. This implicitly works given the current way PHP's built-in JSON decoder converts the data to PHP objects, but this behavior is not guaranteed by the JSON specification.
An object is an unordered set of name/value pairs.
I suggest instead using an array for the filters to explicitly encode the order they are applied. If the keys that would be lost are useful for debugging, perhaps the filter objects could take on a new attribute like name
in place of the key (or perhaps the index is good enough).
libraries/utilities.php
Outdated
* @param array $propertyList The list of required properties | ||
* @param array $missing Optional reference to an array that will contain a list of the | ||
* missing properties. | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment block needs an @return
.
libraries/utilities.php
Outdated
* the values are property types. | ||
* @param array $messages Optional reference to an array that will contain a list of | ||
* messages regarding the property types. | ||
* @param int $skipMissingProperties If set to FALSE, properties that are not present in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This int
should be a boolean
.
libraries/utilities.php
Outdated
* Verify the types of object properties, optionally skipping properties that are not | ||
* present in the object. Property types must match the PHP is_*() methods (e.g., | ||
* is_int(), is_object(), is_string()) and will silently be skipped if a function | ||
* corresponding to the specified type does not exist. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If a type that can't be checked by a built-in PHP function is given (whether purposefully or by accident), could there be some kind of explicit fallback behavior instead of silently ignoring the property? Perhaps an instanceof
test or just adding a message if an untestable type is given.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes - I'll add a message to the returned message list.
|
||
fclose($this->pipes[1]); | ||
fclose($this->pipes[2]); | ||
fclose($this->pipes[3]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment for $this->pipes
specifies indices 0 through 2, but these lines close entries 1 through 3.
const DEFAULT_READ_BYTES = 4096; | ||
|
||
/** | ||
* The path to a JSON Schema describing each record the JSON file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The structured file might not be JSON or use JSON schema for validation.
tools/dev/verify_table_data.php
Outdated
|
||
if ( null === $comparisonColumn ) { | ||
$comparisonColumn = $firstCol; | ||
print "WARNING: No non-nlullable columns, potential for false positives." . PHP_EOL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a minor typo in "non-nullable".
@tyearke the array vs object order is a good point - I'll change it to be an array of objects so we can better specify order. The keys are used for debugging but I was updating my design documentation and came up with a better way to handle the logging and identification so losing them won't matter in the long run. |
@tyearke Requested changes have been made (as well as corresponding changes to ubccr/xdmod-value-analytics#14) |
if ( ! is_object($config) ) { | ||
$this->logger->warning(sprintf( | ||
"Filter config must be an object, '%s' given. Skipping.", | ||
$filterKey, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a stray reference to a no-longer-existent variable here, but other than that, looks good to go.
…#145) * Update commit reference for xdmod-test-artifacts * Add functions for verifying types of stdClass properties * Rename specs_dir -> schema_dir * Rewrite StructuredFile data endpoint to allow record separators and data filtering by external processes * Add tests for File and StructuredFile data endpoints * When verifying tables select a non-nullable column for comparing LEFT OUTER JOIN results to reduce chance of false positives * Return distinct negative values from compare() to facilitate debugging * Change StructuredFile filter definition to an array. Update tests accordingly.
…#145) * Update commit reference for xdmod-test-artifacts * Add functions for verifying types of stdClass properties * Rename specs_dir -> schema_dir * Rewrite StructuredFile data endpoint to allow record separators and data filtering by external processes * Add tests for File and StructuredFile data endpoints * When verifying tables select a non-nullable column for comparing LEFT OUTER JOIN results to reduce chance of false positives * Return distinct negative values from compare() to facilitate debugging * Change StructuredFile filter definition to an array. Update tests accordingly.
This PR is a rewrite of the StructuredFile data endpoint to support ingesting Eucalyptus accounting records. A number of small debugging enhancements implemented during testing are also included.
This PR is a prerequisite for ubccr/xdmod-value-analytics#14.
Description
Structured files contain data in a known (structured) format including CSV, tab-delimited, and JSON. Records are separated by an optional Record Separator (RS) and individual fields may be separated by a Field Separator (FS) if the format allows or requires it (e.g., CSV and TSV). Typically, we will read a structured file by iterating over the records and operating on the fields within the record.
This PR implements a
StructuredFile
data endpoint that implementsIterator
, allowing ETL actions to parse the file and then simply aggregate over the endpoint to access the parsed records. While JSON documents are the only structured file we currently use, CSV/TSV files will be supported in the future and functionality to support them has been included (e.g., field separator and record names). Much of the workflow for handling structured files is contained inaStructuredFile
, allowing the classes that support particular formats such as JSON to implement only the portions that are required for their particular format such as parsing the file format and decoding the individual records. The StructuredFile ingestor has been updated accordingly.In addition, a flexible method for filtering input files through one or more external processes has been implemented.
ETL\DataEndpoint\Filter\ExternalProcess
allows us to specify a pipeline of process that data will be sent through before it is seen by the data endpoint itself. This is implemented using PHP stream filters that are attached directly to a file handle onfopen()
inaStructuredFile::attachFilters()
and process the input stream before it is accessed byfread()
. TheExternalProcess
class opens input and output pipes to the external process and sends data through this process allowing any program that accepts input instdin
and produces output onstdout
to act as a filter. This includesjq
,sed
,awk
, and evengzip
. An example of a filter specification on a data endpoint is:Examples
CSV
\n
(UNIX),\r\n
(Windows),\r
(OSX),
(CSV),\t
JSON
\n
(UNIX),\r\n
(Windows),\r
(OSX)JSON documents can represent arrays and objects and therefore do not require a RS and do not support a FS. For documents that do not specify a RS it is assumed that the document will be either a single record or an array of records.
A single file containing a 2-dimensional array with no RS. This data is parsed all at once and the individual arrays are available through the iterator.
An single file containing an array of objects with no RS where the record names are the object keys. This data is parsed all at once and the individual objects are available through the iterator.
A single file containing one object per line (RS = EOL). The data may be streamed or parsed all at once.
Motivation and Context
Needed a more flexible method for ingesting JSON data in a variety of configurations.
Tests performed
Unit tests were added to cover the following cases:
jq
)jq
)Testing of parsing JSON files was performed by running the XDMoD-VA ETL and comparing the data to a dump of the
modw_value_analytics
database on the "VA Demo" site (see PR ubccr/xdmod-value-analytics#14). In addition, all JSON configuration files that are parsed go through this data endpoint.The
CloudAssetTypeIngestor
action uses an in-line JSON document (which will be depricated at a later date) and was used to verify this functionality.Types of changes
Checklist: