DirectoryScanner data endpoint #154
Conversation
// We do not want to return directories as part of the traversal so we need to
// apply directory/file patterns and other checks AFTER traversing the
// directory tree. Otherwise, directories may be inadvertently filtered and
// the files missed.
It is probably not a problem for the typical use cases for this ingestor code, but in the PCP archive parsing world we:
- explicitly do not descend into directories that do not match
- have the ability to use date-encoded directory names and to filter on them
- have the ability to use date-encoded file names to infer the file create time rather than calling stat() on the file

The main reason for this is that when you have lots of files stored on a network filesystem it can take a very long time to scan the full directory tree and call stat() on all of the files (before these changes were implemented it was taking over 24 hours to scan the tree each day!).
@jpwhite4 thanks for the input. I'll add this to my notes for round 2 so that we can support as many use cases as possible. Filtering to not descend into directories is a fairly simple thing to do (I did this when starting out but found that for my use cases the parent directory didn't match and the scanner never made it into a subdirectory). Inferring the time using a regex on the filename should also be easy.
* Add Iterator functionality
* Improve argument validation
* Streamline Iterator implementation
* Add tests for invalid filter exe and parsing empty file
* Update tests to use new artifacts directory
* Initial DirectoryScanner data endpoint and tests
This is the initial implementation of the DirectoryScanner data endpoint to support ingestion of cloud data.
Description
The Directory Scanner Data Endpoint is designed to be a drop-in replacement for a Structured File
endpoint. It is a wrapper around the Structured File endpoint that allows a Structured File
endpoint to be instantiated for each file in a directory (or subdirectory) matching specified
criteria. It supports directory and file name pattern filtering as well as filtering on last
modified dates. The Directory Scanner implements the Iterator interface, and the iteration spans
the union of the records in each of the files matching the criteria. For example, if 2 files are
found in a directory then the Directory Scanner iterator will span all of the records in both files.
This is designed so that an action can simply iterate over the Directory Scanner to access all
records in all matching files contained in the directory, independent of their format. For example:
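The original example does not survive in this scrape. A minimal Python sketch of the idea (the real endpoint is part of the PHP ETL framework; all names here are illustrative, not the project's API): iterating the scanner yields every record from every matching file as one continuous stream.

```python
# Hypothetical sketch: a directory scan whose iteration spans the union
# of the records in every matching file.
import itertools
import json
import os
import re


def scan_records(root, file_pattern):
    """Yield each record from each file under root whose name matches
    file_pattern, as one continuous stream."""
    regex = re.compile(file_pattern)
    paths = sorted(
        os.path.join(dirpath, name)
        for dirpath, _, names in os.walk(root)
        for name in names
        if regex.search(name)
    )

    def load(path):
        with open(path) as fh:
            return json.load(fh)  # one file -> a list of records

    # chain.from_iterable makes the per-file record lists look like a
    # single iterator spanning all records in all matching files.
    return itertools.chain.from_iterable(load(p) for p in paths)
```

With two matching files in the directory, iterating `scan_records(...)` visits all records of both files back to back, which mirrors the behavior described above.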
Note that the handler object is used as a template for the StructuredFile configuration. The
required name and path will be injected on instantiation. Multiple handlers may be specified for
different file extensions, but records in each file must contain the same set of data fields in
order to be ingested properly. Support for the ETL Overseer macros ${LAST_MODIFIED_START_DATE}
and ${LAST_MODIFIED_END_DATE} will be added along with StructuredFile action enhancements.
Filtering
Filtering is available using the following criteria. Files will only be examined if they meet all
specified criteria. Note that these criteria are only used by the Directory Scanner endpoint for
selecting files to examine, and not for the actual parsing of those files or the ingestion of the data.
directory_pattern
Only directories matching the specified regex will be examined. If the directory portion of a
file's path does not match this criterion the file will be skipped.
file_pattern
Only files matching the specified regex will be processed.
recursion_depth
Only descend this deep into the directory hierarchy when searching for files.
last_modified_start and last_modified_end
Only files modified on or after/before these times, respectively, will be processed. If the
variables ${LAST_MODIFIED_START_DATE} or ${LAST_MODIFIED_END_DATE} are specified as the value then
the last modified value passed on the ETL Overseer command line will be used. In the future, a
value of "auto" may be available that will evaluate to the time of the last execution of the
directory scanner or the max(last_modified) of the table. Note that this only applies to the data
endpoint processing the files; the action that processes the data may use the last_modified value
in a different way. Note that the ETL Overseer will need to be able to specify a last modified
time of "none" so all files can be examined during a full ingestion.
Filtering Use Cases
directory_pattern
Directory names may be in a specific format and we may want to only process files in directories
conforming to this format. For example, the Euca accounting directories are named using the date
(e.g., 2017-05-10) that contains an accounting file and may contain a subdirectory called raw. We
do not want to descend into this subdirectory so we would restrict the search using
"directory_pattern": "/(?!.*raw)/".
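The pattern above uses PHP-style "/" delimiters. Stripping them, the negative lookahead can be exercised with Python's re module; note this sketch assumes the scanner anchors the match at the start of the path (as re.match does), which is what gives the lookahead its "skip anything containing raw" effect.

```python
# Assumption: the scanner anchors the pattern at the start of the path,
# as re.match does here. Delimiters stripped from /(?!.*raw)/.
import re

no_raw = re.compile(r"(?!.*raw)")

assert no_raw.match("2017-05-10")          # date directory: examined
assert not no_raw.match("2017-05-10/raw")  # raw subdirectory: skipped
```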
file_pattern
File names may be in a specific format or we may want to only find files with a particular
extension. For example, only process resource manager log files of the form 20120101 using
"file_pattern": "/^\d{4}\d{2}\d{2}$/". Another example is finding only files with a ".json"
extension using "file_pattern": "/\.json$/".
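Both example patterns can be checked directly (again with the PHP "/" delimiters stripped for Python's re module; the dot in the extension pattern is escaped so it matches a literal "."):

```python
import re

date_named = re.compile(r"^\d{4}\d{2}\d{2}$")  # e.g. 20120101
json_files = re.compile(r"\.json$")            # ".json" extension

assert date_named.search("20120101")
assert not date_named.search("2012-01-01")          # dashes don't match \d
assert json_files.search("instances.json")
assert not json_files.search("instances.json.bak")  # extension not last
```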
last_modified_start
When shredding or ingesting resource manager or cloud log files, it is possible that hundreds or
even thousands of files may exist in the log directory. Rather than process all files, only new
files or those that contain updated data should be examined. By specifying a value for
last_modified_start we can examine only files that were modified on or after the specified time.
For example, "last_modified_start": "${LAST_MODIFIED_START_DATE}" will use the value of the last
modified start date as passed to the ETL Overseer, allowing us to specify values such as
--last-modified-start-date "now - 5 days".
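A sketch of that filter (assumed behavior, not the endpoint's actual implementation): compare each file's stat() mtime against the cutoff and keep files modified on or after it.

```python
# Hypothetical sketch of last_modified_start filtering via stat() mtimes.
import os
import time


def modified_on_or_after(paths, last_modified_start):
    """Keep paths whose mtime is on or after last_modified_start
    (a YYYY-MM-DD string, interpreted in local time)."""
    cutoff = time.mktime(time.strptime(last_modified_start, "%Y-%m-%d"))
    return [p for p in paths if os.stat(p).st_mtime >= cutoff]
```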
Motivation and Context
Scanning a directory and parsing/ingesting all or a subset of files in that directory will be needed for the cloud as well as migrating ETLv1 to ETLv2 (e.g., resource manager log files).
Tests performed
Component tests have been created for the following cases:
Types of changes
Checklist: