Rewrite of StructuredFile data endpoint to support Euca ingest #145

smgallo · 2017-05-19T20:24:00Z

This PR is a rewrite of the StructuredFile data endpoint to support ingesting Eucalyptus accounting records. A number of small debugging enhancements implemented during testing are also included.

This PR is a prerequisite for ubccr/xdmod-value-analytics#14.

Description

Structured files contain data in a known (structured) format including CSV, tab-delimited, and JSON. Records are separated by an optional Record Separator (RS) and individual fields may be separated by a Field Separator (FS) if the format allows or requires it (e.g., CSV and TSV). Typically, we will read a structured file by iterating over the records and operating on the fields within the record.

This PR implements a StructuredFile data endpoint that implements Iterator, allowing ETL actions to parse the file and then simply aggregate over the endpoint to access the parsed records. While JSON documents are the only structured file we currently use, CSV/TSV files will be supported in the future and functionality to support them has been included (e.g., field separator and record names). Much of the workflow for handling structured files is contained in aStructuredFile, allowing the classes that support particular formats such as JSON to implement only the portions that are required for their particular format such as parsing the file format and decoding the individual records. The StructuredFile ingestor has been updated accordingly.

In addition, a flexible method for filtering input files through one or more external processes has been implemented. ETL\DataEndpoint\Filter\ExternalProcess allows us to specify a pipeline of process that data will be sent through before it is seen by the data endpoint itself. This is implemented using PHP stream filters that are attached directly to a file handle on fopen() in aStructuredFile::attachFilters() and process the input stream before it is accessed by fread(). The ExternalProcess class opens input and output pipes to the external process and sends data through this process allowing any program that accepts input in stdin and produces output on stdout to act as a filter. This includes jq, sed, awk, and even gzip. An example of a filter specification on a data endpoint is:

"endpoints": {
    "source": {
        "type": "jsonfile",
        "name": "Value Analytics People Input File (Filtered to Organizations)",
        "path": "value_analytics/people.json",
        "filters": {
            "jq": {
                "type": "external",
                "path": "jq",
                "arguments": "'map({name: .organizations[].name}) | unique'"
            }
        }
    }
}

Examples

CSV

RS: \n (UNIX), \r\n (Windows), \r (OSX)
FS: , (CSV), \t

one, two, "three blind mice"

JSON

Optional RS: \n (UNIX), \r\n (Windows), \r (OSX)

JSON documents can represent arrays and objects and therefore do not require a RS and do not support a FS. For documents that do not specify a RS it is assumed that the document will be either a single record or an array of records.

A single file containing a 2-dimensional array with no RS. This data is parsed all at once and the individual arrays are available through the iterator.

[
    [1, 2, 3, 4],
    [2, 3, 5, 6]
]

An single file containing an array of objects with no RS where the record names are the object keys. This data is parsed all at once and the individual objects are available through the iterator.

[
    {
        "first": "Steve",
        "last": "Gallo",
        "middle": "M."
    },{
        "first": "Ben",
        "last": "Plessenger"
    }
]

A single file containing one object per line (RS = EOL). The data may be streamed or parsed all at once.

{ "first": "Steve", "last": "Gallo", "middle": "M." }
{ "first": "Ben", "last": "Plessenger" }

Motivation and Context

Needed a more flexible method for ingesting JSON data in a variety of configurations.

Tests performed

Unit tests were added to cover the following cases:

File data endpoint
- Test trying to read a directory instead of a file
- Test trying to open a file with an invalid mode
- Test reading a simple string from a file
StructuredFile data endpoint (extends File)
- Test parsing a simple JSON file containing an array of objects (using sample XDMoD-VA data)
- Test parsing a simple JSON file containing multiple objects, each on a single line, separated by a newline (using sample Euca accounting records)
- Test error reporting when config is not valid
- Test error reporting when a filter type is not provided
- Test filter syntax error
- Test parsing a simple JSON file containing an array of objects filtered through an external process (jq)
- Test parsing a simple JSON file containing multiple records separated by a newline and filtered through an external process (jq)
- Test successful JSON schema validation
- Test failed JSON schema validation

Testing of parsing JSON files was performed by running the XDMoD-VA ETL and comparing the data to a dump of the modw_value_analytics database on the "VA Demo" site (see PR ubccr/xdmod-value-analytics#14). In addition, all JSON configuration files that are parsed go through this data endpoint.

The CloudAssetTypeIngestor action uses an in-line JSON document (which will be depricated at a later date) and was used to verify this functionality.

$ ./etl_overseer.php -c ../../../etc/etl/etl.json -a CloudAssetTypeIngestor -v info
2017-05-19 15:04:25 [info] Command: './etl_overseer.php' '-c' '../../../etc/etl/etl.json' '-a' 'CloudAssetTypeIngestor' '-v' 'info'
2017-05-19 15:04:25 [notice] dw_extract_transform_load start (process_start_time: 2017-05-19 15:04:25)
2017-05-19 15:04:25 [info] Loading configuration file /home/smgallo/xdmod-structured-file/etc/etl/etl.json
2017-05-19 15:04:26 [info] Parsing /home/smgallo/xdmod-structured-file/etc/etl/etl.json
...
2017-05-19 15:04:26 [info] Verifying endpoint: ('Utility DB', class=ETL\DataEndpoint\Mysql, config=datawarehouse, schema=modw, host=tas-db1-nofw.ccr.buffalo.edu:3306, user=xdmod)
2017-05-19 15:04:26 [info] Verifying endpoint: ('Cloud DB', class=ETL\DataEndpoint\Mysql, config=datawarehouse, schema=modw_cloud, host=tas-db1-nofw.ccr.buffalo.edu:3306, user=xdmod)
2017-05-19 15:04:26 [info] Verifying endpoint: ('Cloud DB', class=ETL\DataEndpoint\Mysql, config=datawarehouse, schema=modw_cloud, host=tas-db1-nofw.ccr.buffalo.edu:3306, user=xdmod)
2017-05-19 15:04:26 [info] Create action CloudAssetTypeIngestor (ETL\Ingestor\StructuredFileIngestor)
2017-05-19 15:04:26 [info] Parse definition file: '/home/smgallo/xdmod-structured-file/etc/etl/etl_tables.d/cloud/asset_type.json'
2017-05-19 15:04:26 [info] Loading configuration file /home/smgallo/xdmod-structured-file/etc/etl/etl_tables.d/cloud/asset_type.json
2017-05-19 15:04:26 [info] Parsing /home/smgallo/xdmod-structured-file/etc/etl/etl_tables.d/cloud/asset_type.json
2017-05-19 15:04:26 [info] Verifying action: CloudAssetTypeIngestor (ETL\Ingestor\StructuredFileIngestor)
2017-05-19 15:04:26 [info] Obtaining lock file '/var/lock/xdmod/etlv2_25727'
2017-05-19 15:04:26 [info] start (action_name: CloudAssetTypeIngestor, action: CloudAssetTypeIngestor (ETL\Ingestor\StructuredFileIngestor), start_date: , end_date: )
2017-05-19 15:04:26 [info] Truncate destination table: `modw_cloud`.`asset_type`
2017-05-19 15:04:26 [info] Table does not exist: '`modw_cloud`.`asset_type`', skipping.
2017-05-19 15:04:26 [notice] Table `modw_cloud`.`asset_type` does not exist, creating.
2017-05-19 15:04:26 [info] Truncate destination table: `modw_cloud`.`asset_type`
2017-05-19 15:04:26 [info] Process date interval (1/1) (start: none, end: none)
2017-05-19 15:04:26 [notice] (action: CloudAssetTypeIngestor (ETL\Ingestor\StructuredFileIngestor), start_time: 1495220666.0755, end_time: 1495220666.0853, elapsed_time: 0.00976, records_loaded: 5)
2017-05-19 15:04:26 [info] ETL\Ingestor\StructuredFileIngestor: Rows Processed: 5 records (Time Taken: 0.01 s)
2017-05-19 15:04:26 [notice] (action: CloudAssetTypeIngestor (ETL\Ingestor\StructuredFileIngestor), start_time: 1495220666.0739, end_time: 1495220666.0858, elapsed_time: 0.01188, records_examined: 5, records_loaded: 5)
2017-05-19 15:04:26 [info] end (action_name: CloudAssetTypeIngestor, action: CloudAssetTypeIngestor (ETL\Ingestor\StructuredFileIngestor))
2017-05-19 15:04:26 [info] Releasing lock '/var/lock/xdmod/etlv2_25727'
2017-05-19 15:04:26 [notice] dw_extract_transform_load end (process_end_time: 2017-05-19 15:04:26)

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

My code follows the code style of this project as found in the CONTRIBUTING document.
I have added tests to cover my changes.
All new and existing tests passed.

…ata filtering by external processes

… reduce chance of false positives

tyearke

Overall, it looks good. The only overarching concern I had was about how to specify the order filters are applied to a file if there's more than one filter. It looks like the order is currently determined by the order the key-value pairs are given in a config file. This implicitly works given the current way PHP's built-in JSON decoder converts the data to PHP objects, but this behavior is not guaranteed by the JSON specification.

An object is an unordered set of name/value pairs.

I suggest instead using an array for the filters to explicitly encode the order they are applied. If the keys that would be lost are useful for debugging, perhaps the filter objects could take on a new attribute like name in place of the key (or perhaps the index is good enough).

tyearke · 2017-05-22T19:54:13Z

libraries/utilities.php

+ * @param array $propertyList The list of required properties
+ * @param array $missing Optional reference to an array that will contain a list of the
+ *   missing properties.
+ */


This comment block needs an @return.

tyearke · 2017-05-22T19:57:58Z

libraries/utilities.php

+ *   the values are property types.
+ * @param array $messages Optional reference to an array that will contain a list of
+ *   messages regarding the property types.
+ * @param int $skipMissingProperties If set to FALSE, properties that are not present in


This int should be a boolean.

tyearke · 2017-05-22T20:09:26Z

libraries/utilities.php

+ * Verify the types of object properties, optionally skipping properties that are not
+ * present in the object.  Property types must match the PHP is_*() methods (e.g.,
+ * is_int(), is_object(), is_string()) and will silently be skipped if a function
+ * corresponding to the specified type does not exist.


If a type that can't be checked by a built-in PHP function is given (whether purposefully or by accident), could there be some kind of explicit fallback behavior instead of silently ignoring the property? Perhaps an instanceof test or just adding a message if an untestable type is given.

Yes - I'll add a message to the returned message list.

tyearke · 2017-05-24T17:44:38Z

classes/ETL/DataEndpoint/Filter/ExternalProcess.php

+
+        fclose($this->pipes[1]);
+        fclose($this->pipes[2]);
+        fclose($this->pipes[3]);


The comment for $this->pipes specifies indices 0 through 2, but these lines close entries 1 through 3.

tyearke · 2017-05-24T17:56:24Z

classes/ETL/DataEndpoint/aStructuredFile.php

+    const DEFAULT_READ_BYTES = 4096;
+
+    /**
+     * The path to a JSON Schema describing each record the JSON file.


The structured file might not be JSON or use JSON schema for validation.

tyearke · 2017-05-24T19:43:16Z

tools/dev/verify_table_data.php

+
+    if ( null === $comparisonColumn ) {
+        $comparisonColumn = $firstCol;
+        print "WARNING: No non-nlullable columns, potential for false positives." . PHP_EOL;


There's a minor typo in "non-nullable".

smgallo · 2017-05-24T21:39:43Z

@tyearke the array vs object order is a good point - I'll change it to be an array of objects so we can better specify order. The keys are used for debugging but I was updating my design documentation and came up with a better way to handle the logging and identification so losing them won't matter in the long run.

…error reporting

…ordingly.

smgallo · 2017-05-25T14:36:08Z

@tyearke Requested changes have been made (as well as corresponding changes to ubccr/xdmod-value-analytics#14)

tyearke · 2017-05-25T14:45:18Z

classes/ETL/DataEndpoint/aStructuredFile.php

+            if ( ! is_object($config) ) {
+                $this->logger->warning(sprintf(
+                    "Filter config must be an object, '%s' given. Skipping.",
+                    $filterKey,


There's a stray reference to a no-longer-existent variable here, but other than that, looks good to go.

…#145) * Update commit reference for xdmod-test-artifacts * Add functions for verifying types of stdClass properties * Rename specs_dir -> schema_dir * Rewrite StructuredFile data endpoint to allow record separators and data filtering by external processes * Add tests for File and StructuredFile data endpoints * When verifying tables select a non-nullable column for comparing LEFT OUTER JOIN results to reduce chance of false positives * Return distinct negative values from compare() to facilitate debugging * Change StructuredFile filter definition to an array. Update tests accordingly.

smgallo added 8 commits May 19, 2017 15:35

Update commit reference for xdmod-test-artifacts

4d441e8

Add functions for verifying types of stdClass properties

2e2aeb3

Rename specs_dir -> schema_dir

23d4df9

Rewrite StructuredFile data endpoint to allow record separators and d…

0277b6f

…ata filtering by external processes

Tests for File and StructuredFile data endpoints

1c53165

Select a non-nullable column for comparing LEFT OUTER JOIN results to…

1fdaf8b

… reduce chance of false positives

Return distinct negative values from compare() to facilitate debugging

887396c

Remove commented out code

fd1cdb0

smgallo added enhancement Enhancement of the functionality of an existing feature Category:ETL Extract Transform Load labels May 19, 2017

smgallo added this to the v6.7.0 milestone May 19, 2017

smgallo requested review from plessbd, jpwhite4 and tyearke May 19, 2017 20:24

smgallo mentioned this pull request May 19, 2017

Modifications to support new StructuredFile ingestor ubccr/xdmod-value-analytics#14

Merged

Passify the almighty linter

d54385c

tyearke suggested changes May 24, 2017

View reviewed changes

smgallo added 2 commits May 25, 2017 09:21

Documentation changes, minor updates to utility functions for better …

4a9549c

…error reporting

Change StructuredFile filter definition to an array. Update tests acc…

cc12009

…ordingly.

tyearke suggested changes May 25, 2017

View reviewed changes

Remove stray reference to $filterKey

0f772ab

tyearke approved these changes May 25, 2017

View reviewed changes

smgallo merged commit 66ba63a into ubccr:xdmod6.7 May 25, 2017

smgallo deleted the etl/structured-file branch May 25, 2017 15:17

tyearke mentioned this pull request Jun 6, 2017

Merge Open XDMoD 6.7 work into 7.0 #152

Merged

tyearke modified the milestones: v7.0.0, v6.7.0 Jun 6, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rewrite of StructuredFile data endpoint to support Euca ingest #145

Rewrite of StructuredFile data endpoint to support Euca ingest #145

smgallo commented May 19, 2017

tyearke left a comment

tyearke May 22, 2017

tyearke May 22, 2017

tyearke May 22, 2017

smgallo May 25, 2017

tyearke May 24, 2017

tyearke May 24, 2017

tyearke May 24, 2017

smgallo commented May 24, 2017

smgallo commented May 25, 2017

tyearke May 25, 2017

Rewrite of StructuredFile data endpoint to support Euca ingest #145

Rewrite of StructuredFile data endpoint to support Euca ingest #145

Conversation

smgallo commented May 19, 2017

Description

Examples

CSV

JSON

Motivation and Context

Tests performed

Types of changes

Checklist:

tyearke left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

smgallo commented May 24, 2017

smgallo commented May 25, 2017

Choose a reason for hiding this comment