
DirectoryScanner data endpoint #154

Merged: smgallo merged 9 commits from etl/directory-scanner into ubccr:xdmod7.0 on Jun 12, 2017
Conversation

@smgallo (Contributor) commented Jun 7, 2017

This is the initial implementation of the DirectoryScanner data endpoint to support ingestion of cloud data.

Description

The Directory Scanner data endpoint is designed to be a drop-in replacement for a Structured File
endpoint. It is a wrapper that instantiates a Structured File endpoint for each file in a directory
(or subdirectory) matching specified criteria. It supports filtering on directory and file name
patterns as well as on last modified dates. The Directory Scanner implements the Iterator
interface, and the iteration spans the union of the records in all files matching the criteria. For
example, if 2 files are found in a directory, the Directory Scanner iterator will span all of the
records in both files.

This is designed so that an action can simply iterate over the Directory Scanner to access all
records in all matching files contained in the directory, independent of their format. For example:

$config = array(
    'name' => 'Euca files',
    'type' => 'directoryscanner',
    'path' => '/some/directory',
    // Each handler is a template for the StructuredFile endpoint that will
    // parse files with the matching extension.
    'handlers' => array(
        (object) array(
            'extension' => '.json',
            'type' => 'jsonfile',
            'record_separator' => "\n"
        )
    )
);
$options = new DataEndpointOptions($config);
$scanner = DataEndpoint::factory($options, $this->logger);
$scanner->verify();
$scanner->connect();

// Iterate over all records in all matching files, regardless of format
foreach ( $scanner as $key => $record ) {
    // Do something
}

Note that the handler object is used as a template for the StructuredFile configuration; the
required name and path are injected on instantiation. Multiple handlers may be specified for
different file extensions, but the records in each file must contain the same set of data fields in
order to be ingested properly. Support for the ETL Overseer macros ${LAST_MODIFIED_START_DATE} and
${LAST_MODIFIED_END_DATE} will be added along with StructuredFile action enhancements.

"endpoints": {
    "source": {
        "name": "My Name",
        "path": "/path/to/directory",
        "file_pattern": "/regex/",
        "directory_pattern": "/regex/",
        "recursion_depth": 1,
        "last_modified_start": "${LAST_MODIFIED_START_DATE}",
        "last_modified_end": "${LAST_MODIFIED_END_DATE}",
        "type": "directoryscanner",
        "handlers": [
            {
                "extension": ".json",
                "type": "jsonfile",
                "record_schema_path": "value_analytics/user.json",
                "record_separator": "\n"
            },
            {
                "extension": ".csv",
                "type": "csvfile",
                "record_separator": "\n",
                "field_separator": ","
            }
        ]
    }
}
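
To make the templating concrete, here is a minimal sketch of how a handler template could be
resolved for a given file. The helper function is hypothetical and not part of this PR, but the
name/path injection mirrors the behavior described above.

// Hypothetical sketch: pick the handler whose "extension" matches the file,
// then inject the required name and path before the StructuredFile endpoint
// is instantiated from it.
function resolveHandlerConfig(array $handlers, $filePath)
{
    foreach ( $handlers as $handler ) {
        $extension = $handler->extension;
        // Match on the trailing extension, e.g., ".json" or ".csv"
        if ( substr($filePath, -strlen($extension)) === $extension ) {
            $config = clone $handler;             // Handler object is a template
            $config->name = basename($filePath);  // Injected on instantiation
            $config->path = $filePath;            // Injected on instantiation
            return $config;
        }
    }
    return null;  // No handler for this extension; skip the file
}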

Filtering

Filtering is available using the following criteria. Files will only be examined if they meet all
specified criteria. Note that these criteria are used only by the Directory Scanner endpoint for
selecting files to examine, not for the actual parsing of those files or the ingestion of the data.

  1. directory_pattern Only directories matching the specified regex will be examined. If the
    directory portion of a file's path does not match this criterion, the file will be skipped.
  2. file_pattern Only files matching the specified regex will be processed.
  3. recursion_depth Only descend this deep into the directory hierarchy when searching for files.
  4. last_modified_start and last_modified_end Only files modified on or after
    last_modified_start and on or before last_modified_end will be processed. If the variable
    ${LAST_MODIFIED_START_DATE} or ${LAST_MODIFIED_END_DATE} is specified as the value, the last
    modified value passed on the ETL Overseer command line will be used. In the future, a value of
    "auto" may be available that will evaluate to the time of the last execution of the directory
    scanner or the max(last_modified) of the table. Note that this only applies to the data
    endpoint's selection of files; the action that processes the data may use the last_modified
    value in a different way. Note also that the ETL Overseer will need to be able to specify a
    last modified time of "none" so that all files can be examined during a full ingestion.

A sketch of this selection logic is shown below.
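
The following is a minimal sketch of that selection logic, assuming the option names shown in the
configuration above; the helper function is illustrative, not the PR's actual implementation.

// Illustrative sketch of the file selection criteria. recursion_depth is not
// checked here because it is enforced during traversal (e.g., via
// RecursiveIteratorIterator::setMaxDepth()) rather than per file.
function fileMatchesCriteria($filePath, stdClass $options)
{
    // 1. directory_pattern: the directory portion of the path must match
    if ( isset($options->directory_pattern)
         && 1 !== preg_match($options->directory_pattern, dirname($filePath)) ) {
        return false;
    }
    // 2. file_pattern: the file name portion must match
    if ( isset($options->file_pattern)
         && 1 !== preg_match($options->file_pattern, basename($filePath)) ) {
        return false;
    }
    // 4. last modified window: compare the file's mtime to the window bounds
    $mtime = filemtime($filePath);
    if ( isset($options->last_modified_start) && $mtime < $options->last_modified_start ) {
        return false;
    }
    if ( isset($options->last_modified_end) && $mtime > $options->last_modified_end ) {
        return false;
    }
    return true;
}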

Filtering Use Cases

  1. directory_pattern Directory names may be in a specific format, and we may want to process
    only files in directories conforming to that format. For example, the Euca accounting
    directories are named using the date (e.g., 2017-05-10), contain an accounting file, and may
    contain a subdirectory called raw. We do not want to descend into this subdirectory, so we
    restrict the search using "directory_pattern": "/^(?!.*raw)/".

  2. file_pattern File names may be in a specific format, or we may want to find only files with a
    particular extension. For example, process only resource manager log files of the form
    20120101 using "file_pattern": "/^\d{4}\d{2}\d{2}$/". Another example is finding only files
    with a ".json" extension using "file_pattern": "/\.json$/".

  3. last_modified_start When shredding or ingesting resource manager or cloud log files, hundreds
    or even thousands of files may exist in the log directory. Rather than process all files, only
    new files or those that contain updated data should be examined. By specifying a value for
    last_modified_start, we can examine only files that were modified on or after the specified
    time. For example, "last_modified_start": "${LAST_MODIFIED_START_DATE}" will use the value of
    the last modified start date as passed to the ETL Overseer, allowing us to specify values such
    as --last-modified-start-date "now - 5 days".

The patterns above are exercised in the short example below.
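
To illustrate, the patterns behave as follows when applied directly with PCRE (the file and
directory names here are hypothetical):

// preg_match() returns 1 on a match (process the file) and 0 otherwise (skip it)
preg_match('/^(?!.*raw)/', '2017-05-10');          // 1: date-named directory, processed
preg_match('/^(?!.*raw)/', '2017-05-10/raw');      // 0: "raw" subdirectory, skipped
preg_match('/^\d{4}\d{2}\d{2}$/', '20120101');     // 1: resource manager log, processed
preg_match('/\.json$/', 'instance.json');          // 1: ".json" extension, processed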

Motivation and Context

Scanning a directory and parsing/ingesting all or a subset of the files in that directory will be needed for cloud data as well as for migrating ETLv1 to ETLv2 (e.g., resource manager log files).

Tests performed

Component tests have been created for the following cases (a schematic example follows the list):

  • StructuredFile
    • Unknown/invalid filter executable
    • Parsing of an empty file (i.e., no records)
  • DirectoryScanner
    • Invalid endpoint options
    • Path is not a directory
    • Scan all files in a directory including an empty file (verify number of files and total number of records found)
    • Apply directory and file regex patterns (verify number of files and total number of records found)
    • Apply last modified start and end filters
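
As a rough illustration, a DirectoryScanner component test might look like the following schematic;
the artifact path and expected record count are placeholders, not the actual fixtures from this PR.

// Schematic component test; the artifact directory and expected count are
// hypothetical, for illustration only.
public function testScanAllFilesIncludingEmpty()
{
    $options = new DataEndpointOptions(array(
        'name' => 'scanner test',
        'type' => 'directoryscanner',
        'path' => __DIR__ . '/artifacts/input_dir',
        'handlers' => array(
            (object) array(
                'extension' => '.json',
                'type' => 'jsonfile',
                'record_separator' => "\n"
            )
        )
    ));
    $scanner = DataEndpoint::factory($options, $this->logger);
    $scanner->verify();
    $scanner->connect();

    // Verify the total number of records found across all matching files,
    // including the empty file that contributes zero records.
    $this->assertEquals(10, iterator_count($scanner));
}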

Types of changes

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [x] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project as found in the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@smgallo smgallo added the Category:ETL Extract Transform Load label Jun 7, 2017
@smgallo smgallo added this to the v7.0.0 milestone Jun 7, 2017
// We do not want to return directories as part of the traversal so we need to
// apply directory/file patterns and other checks AFTER traversing the
// directory tree. Otherwise, directories may be inadvertently filtered and
// the files missed.
@jpwhite4 (Member) commented Jun 9, 2017
It is probably not a problem for the typical use cases for this ingestor code, but in the PCP archive parsing world we:

  • explicitly do not descend into directories that do not match;
  • have the ability to use date-encoded directory names and to filter on them;
  • have the ability to use date-encoded file names to infer the file create time rather than calling stat() on the file.

The main reason for this is that when you have lots of files stored on a network filesystem it can take a very, very long time to scan the full directory tree and call stat() on all of the files. (Before these changes were implemented it was taking over 24 hours to scan the tree each day!)

@smgallo (Contributor, Author) replied:

@jpwhite4 thanks for the input. I'll add this to my notes for round 2 so that we can support as many use cases as possible. Filtering to not descend into non-matching directories is a fairly simple thing to do (I did this when starting out but found that, for my use cases, the parent directory didn't match and the scan never made it into a subdirectory). Inferring the time using a regex on the filename should also be easy; a possible approach is sketched below.
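
For reference, inferring a timestamp from a date-encoded file name (rather than calling stat())
could look something like this sketch; it is a possible approach, not code from this PR.

// Possible approach: derive a timestamp from a date-encoded name such as
// "20170510" instead of calling stat()/filemtime(), which can be slow on
// network filesystems.
function inferTimestampFromName($fileName)
{
    if ( 1 === preg_match('/(\d{4})(\d{2})(\d{2})/', $fileName, $matches) ) {
        return mktime(0, 0, 0, (int) $matches[2], (int) $matches[3], (int) $matches[1]);
    }
    return false;  // Fall back to filemtime() when the name is not date-encoded
}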

@smgallo smgallo merged commit bfc2eb1 into ubccr:xdmod7.0 Jun 12, 2017
@smgallo smgallo deleted the etl/directory-scanner branch June 12, 2017 20:15
ryanrath pushed a commit to ryanrath/xdmod that referenced this pull request Jul 24, 2017
* Add Iterator functionality
* Improve argument validation
* Streamline Iterator implementation
* Add tests for invalid filter exe and parsing empty file
* Update tests to use new artifacts directory
* Initial DirectoryScanner data endpoint and tests
chakrabortyr pushed a commit to chakrabortyr/xdmod that referenced this pull request Oct 17, 2017
* Add Iterator functionality
* Improve argument validation
* Streamline Iterator implementation
* Add tests for invalid filter exe and parsing empty file
* Update tests to use new artifacts directory
* Initial DirectoryScanner data endpoint and tests