
DirectoryScanner data endpoint #154

Merged: smgallo merged 9 commits from etl/directory-scanner into ubccr:xdmod7.0 on Jun 12, 2017
Conversation

@smgallo (Contributor) commented Jun 7, 2017

This is the initial implementation of the DirectoryScanner data endpoint to support ingestion of cloud data.

Description

The Directory Scanner data endpoint is designed to be a drop-in replacement for a Structured File
endpoint. It is a wrapper that instantiates a Structured File endpoint for each file in a directory
(or subdirectory) matching specified criteria. It supports filtering on directory and file name
patterns as well as on last modified dates. The Directory Scanner implements the Iterator
interface, and the iteration spans the union of the records in all files matching the criteria. For
example, if 2 files are found in a directory, the Directory Scanner iterator will span all of the
records in both files.

This is designed so that an action can simply iterate over the Directory Scanner to access all
records in all matching files contained in the directory, independent of their format. For example:

$config = array(
    'name' => 'Euca files',
    'type' => 'directoryscanner',
    'path' => '/some/directory',
    // Each handler is a template for the StructuredFile endpoint that will
    // parse files with the matching extension.
    'handlers' => array(
        (object) array(
            'extension' => '.json',
            'type' => 'jsonfile',
            'record_separator' => "\n"
        )
    )
);
$options = new DataEndpointOptions($config);
$scanner = DataEndpoint::factory($options, $this->logger);
$scanner->verify();
$scanner->connect();

// Iterate over all records in all matching files, regardless of format
foreach ( $scanner as $key => $record ) {
    // Do something
}

Note that the handler object is used as a template for the StructuredFile configuration; the
required name and path are injected on instantiation. Multiple handlers may be specified for
different file extensions, but the records in each file must contain the same set of data fields in
order to be ingested properly. Support for the ETL Overseer macros ${LAST_MODIFIED_START_DATE} and
${LAST_MODIFIED_END_DATE} will be added along with StructuredFile action enhancements.

"endpoints": {
    "source": {
        "name": "My Name",
        "path": "/path/to/directory",
        "file_pattern": "/regex/",
        "directory_pattern": "/regex/",
        "recursion_depth": 1,
        "last_modified_start": "${LAST_MODIFIED_START_DATE}",
        "last_modified_end": "${LAST_MODIFIED_END_DATE}",
        "type": "directoryscanner",
        "handlers": [
            {
                "extension": ".json",
                "type": "jsonfile",
                "record_schema_path": "value_analytics/user.json",
                "record_separator": "\n"
            },
            {
                "extension": ".csv",
                "type": "csvfile",
                "record_separator": "\n",
                "field_separator": ","
            }
        ]
    }
}
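
To make the templating concrete, here is a minimal sketch of how a handler template could be
resolved for a given file. The helper function is hypothetical and not part of this PR, but the
name/path injection mirrors the behavior described above.

// Hypothetical sketch: pick the handler whose "extension" matches the file,
// then inject the required name and path before the StructuredFile endpoint
// is instantiated from it.
function resolveHandlerConfig(array $handlers, $filePath)
{
    foreach ( $handlers as $handler ) {
        $extension = $handler->extension;
        // Match on the trailing extension, e.g., ".json" or ".csv"
        if ( substr($filePath, -strlen($extension)) === $extension ) {
            $config = clone $handler;             // Handler object is a template
            $config->name = basename($filePath);  // Injected on instantiation
            $config->path = $filePath;            // Injected on instantiation
            return $config;
        }
    }
    return null;  // No handler for this extension; skip the file
}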

Filtering

Filtering is available using the following criteria. Files will only be examined if they meet all
specified criteria. Note that these criteria are used only by the Directory Scanner endpoint for
selecting files to examine, not for the actual parsing of those files or the ingestion of the data.

  1. directory_pattern Only directories matching the specified regex will be examined. If the
    directory portion of a file's path does not match this criterion, the file will be skipped.
  2. file_pattern Only files matching the specified regex will be processed.
  3. recursion_depth Only descend this deep into the directory hierarchy when searching for files.
  4. last_modified_start and last_modified_end Only files modified on or after
    last_modified_start and on or before last_modified_end will be processed. If the variable
    ${LAST_MODIFIED_START_DATE} or ${LAST_MODIFIED_END_DATE} is specified as the value, the last
    modified value passed on the ETL Overseer command line will be used. In the future, a value of
    "auto" may be available that will evaluate to the time of the last execution of the directory
    scanner or the max(last_modified) of the table. Note that this only applies to the data
    endpoint's selection of files; the action that processes the data may use the last_modified
    value in a different way. Note also that the ETL Overseer will need to be able to specify a
    last modified time of "none" so that all files can be examined during a full ingestion.

A sketch of this selection logic is shown below.
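
The following is a minimal sketch of that selection logic, assuming the option names shown in the
configuration above; the helper function is illustrative, not the PR's actual implementation.

// Illustrative sketch of the file selection criteria. recursion_depth is not
// checked here because it is enforced during traversal (e.g., via
// RecursiveIteratorIterator::setMaxDepth()) rather than per file.
function fileMatchesCriteria($filePath, stdClass $options)
{
    // 1. directory_pattern: the directory portion of the path must match
    if ( isset($options->directory_pattern)
         && 1 !== preg_match($options->directory_pattern, dirname($filePath)) ) {
        return false;
    }
    // 2. file_pattern: the file name portion must match
    if ( isset($options->file_pattern)
         && 1 !== preg_match($options->file_pattern, basename($filePath)) ) {
        return false;
    }
    // 4. last modified window: compare the file's mtime to the window bounds
    $mtime = filemtime($filePath);
    if ( isset($options->last_modified_start) && $mtime < $options->last_modified_start ) {
        return false;
    }
    if ( isset($options->last_modified_end) && $mtime > $options->last_modified_end ) {
        return false;
    }
    return true;
}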

Filtering Use Cases

  1. directory_pattern Directory names may be in a specific format, and we may want to process
    only files in directories conforming to that format. For example, the Euca accounting
    directories are named using the date (e.g., 2017-05-10), contain an accounting file, and may
    contain a subdirectory called raw. We do not want to descend into this subdirectory, so we
    restrict the search using "directory_pattern": "/^(?!.*raw)/".

  2. file_pattern File names may be in a specific format, or we may want to find only files with a
    particular extension. For example, process only resource manager log files of the form
    20120101 using "file_pattern": "/^\d{4}\d{2}\d{2}$/". Another example is finding only files
    with a ".json" extension using "file_pattern": "/\.json$/".

  3. last_modified_start When shredding or ingesting resource manager or cloud log files, hundreds
    or even thousands of files may exist in the log directory. Rather than process all files, only
    new files or those that contain updated data should be examined. By specifying a value for
    last_modified_start, we can examine only files that were modified on or after the specified
    time. For example, "last_modified_start": "${LAST_MODIFIED_START_DATE}" will use the value of
    the last modified start date as passed to the ETL Overseer, allowing us to specify values such
    as --last-modified-start-date "now - 5 days".

The patterns above are exercised in the short example below.
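
To illustrate, the patterns behave as follows when applied directly with PCRE (the file and
directory names here are hypothetical):

// preg_match() returns 1 on a match (process the file) and 0 otherwise (skip it)
preg_match('/^(?!.*raw)/', '2017-05-10');          // 1: date-named directory, processed
preg_match('/^(?!.*raw)/', '2017-05-10/raw');      // 0: "raw" subdirectory, skipped
preg_match('/^\d{4}\d{2}\d{2}$/', '20120101');     // 1: resource manager log, processed
preg_match('/\.json$/', 'instance.json');          // 1: ".json" extension, processed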

Motivation and Context

Scanning a directory and parsing/ingesting all or a subset of the files in that directory will be needed for cloud data as well as for migrating ETLv1 to ETLv2 (e.g., resource manager log files).

Tests performed

Component tests have been created for the following cases (a schematic example follows the list):

  • StructuredFile
    • Unknown/invalid filter executable
    • Parsing of an empty file (i.e., no records)
  • DirectoryScanner
    • Invalid endpoint options
    • Path is not a directory
    • Scan all files in a directory including an empty file (verify number of files and total number of records found)
    • Apply directory and file regex patterns (verify number of files and total number of records found)
    • Apply last modified start and end filters
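
As a rough illustration, a DirectoryScanner component test might look like the following schematic;
the artifact path and expected record count are placeholders, not the actual fixtures from this PR.

// Schematic component test; the artifact directory and expected count are
// hypothetical, for illustration only.
public function testScanAllFilesIncludingEmpty()
{
    $options = new DataEndpointOptions(array(
        'name' => 'scanner test',
        'type' => 'directoryscanner',
        'path' => __DIR__ . '/artifacts/input_dir',
        'handlers' => array(
            (object) array(
                'extension' => '.json',
                'type' => 'jsonfile',
                'record_separator' => "\n"
            )
        )
    ));
    $scanner = DataEndpoint::factory($options, $this->logger);
    $scanner->verify();
    $scanner->connect();

    // Verify the total number of records found across all matching files,
    // including the empty file that contributes zero records.
    $this->assertEquals(10, iterator_count($scanner));
}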

Types of changes

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [x] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project as found in the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@smgallo smgallo added the Category:ETL Extract Transform Load label Jun 7, 2017
@smgallo smgallo added this to the v7.0.0 milestone Jun 7, 2017
// We do not want to return directories as part of the traversal so we need to
// apply directory/file patterns and other checks AFTER traversing the
// directory tree. Otherwise, directories may be inadvertently filtered and
// the files missed.
@jpwhite4 (Member) commented Jun 9, 2017
It is probably not a problem for the typical use cases for this ingestor code, but in the PCP archive parsing world we:

  • explicitly do not descend into directories that do not match;
  • have the ability to use date-encoded directory names and to filter on them;
  • have the ability to use date-encoded file names to infer the file create time rather than calling stat() on the file.

The main reason for this is that when you have lots of files stored on a network filesystem it can take a very, very long time to scan the full directory tree and call stat() on all of the files. (Before these changes were implemented it was taking over 24 hours to scan the tree each day!)

@smgallo (Contributor, Author) replied:

@jpwhite4 thanks for the input. I'll add this to my notes for round 2 so that we can support as many use cases as possible. Filtering to not descend into non-matching directories is a fairly simple thing to do (I did this when starting out but found that, for my use cases, the parent directory didn't match and the scan never made it into a subdirectory). Inferring the time using a regex on the filename should also be easy; a possible approach is sketched below.
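
For reference, inferring a timestamp from a date-encoded file name (rather than calling stat())
could look something like this sketch; it is a possible approach, not code from this PR.

// Possible approach: derive a timestamp from a date-encoded name such as
// "20170510" instead of calling stat()/filemtime(), which can be slow on
// network filesystems.
function inferTimestampFromName($fileName)
{
    if ( 1 === preg_match('/(\d{4})(\d{2})(\d{2})/', $fileName, $matches) ) {
        return mktime(0, 0, 0, (int) $matches[2], (int) $matches[3], (int) $matches[1]);
    }
    return false;  // Fall back to filemtime() when the name is not date-encoded
}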

@smgallo smgallo merged commit bfc2eb1 into ubccr:xdmod7.0 Jun 12, 2017
@smgallo smgallo deleted the etl/directory-scanner branch June 12, 2017 20:15
ryanrath pushed a commit to ryanrath/xdmod that referenced this pull request Jul 24, 2017
* Add Iterator functionality
* Improve argument validation
* Streamline Iterator implementation
* Add tests for invalid filter exe and parsing empty file
* Update tests to use new artifacts directory
* Initial DirectoryScanner data endpoint and tests
chakrabortyr pushed a commit to chakrabortyr/xdmod that referenced this pull request Oct 17, 2017
* Add Iterator functionality
* Improve argument validation
* Streamline Iterator implementation
* Add tests for invalid filter exe and parsing empty file
* Update tests to use new artifacts directory
* Initial DirectoryScanner data endpoint and tests