- 
                Notifications
    You must be signed in to change notification settings 
- Fork 89
Fix: load training data from all sub dirs #396
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| Hmm my only concern here is that it changes behavior for existing customers. Can you also take a look at the PR that added the function in question and see if there's anything we need to account for here? https://github.com/aws/sagemaker-xgboost-container/pull/158/files | 
| Thanks for this PR, apart from above just curious to know, as discussed in our sync were we actually able to find similar logic in 1P closed source containers(AIAlgorithmsSDK) where this is already being supported? And whether this fix is based on that to maintain consistency(although that's not needed but just to get the understanding about it) | 
| 
 Thanks for the headsup. I think theoretically this change won't affect the previous workflow with cross validation after revisited https://github.com/aws/sagemaker-xgboost-container/pull/158/files. There's a local integ test for cross validation here https://github.com/aws/sagemaker-xgboost-container/blob/master/test/integration/local/test_kfold.py, also I will sync with AP oncall to see if they could perform their integ tests on some test images. | 
| 
 Thanks for the comment! I haven't looked into that closely but let's keep this discussion through slack as we don't want expose much SDK stuff here. | 
|  | ||
| def _make_symlinks_for_files_under_a_folder(dest_path: str, data_path: str): | ||
| if (not os.path.exists(dest_path)) or (not os.path.exists(data_path)): | ||
| raise exc.AlgorithmError("Unable to create symlinks as {data_path} or {dest_path} doesn't exist ") | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this meant to be an f-string?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch, yeah, I will update
| if (not os.path.exists(dest_path)) or (not os.path.exists(data_path)): | ||
| raise exc.AlgorithmError("Unable to create symlinks as {data_path} or {dest_path} doesn't exist ") | ||
|  | ||
| logging.info("Making smlinks from folder {} to folder {}".format(data_path, dest_path)) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Can use f-strings here instead of format for better readability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, Thanks
| if item.is_file(): | ||
| _make_symlink(item.path, dest_path, item.name) | ||
| elif item.is_dir(): | ||
| _make_symlinks_for_files_under_a_folder(dest_path, item.path) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to worry about recursion depth here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
emmm, I was thinking about this too. I will add a depth limit of 5 here.
| elif os.path.isdir(data_path): | ||
| # traverse all sub-dirs to load all training data | ||
| _make_symlinks_for_files_under_a_folder(files_path, data_path) | ||
| elif os.path.isfile(data_path): | ||
| files_path = data_path | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we just call _make_symlinks_for_files_under_a_folder  here? Isn't it able to handle singular files:?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, it 's able to do that..Let me refactor the info message in next commit to prevent confusion.
| os.symlink(path, base_name) | ||
| def _make_symlink(path, source_path, name): | ||
| base_name = os.path.join(source_path, name) | ||
| file_name = base_name + str(hash(path)) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens in case of hash collisions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An error will be thrown out saying there is an existing symlink between A and B. The reason for hash(path) is that no matter how complicated the file structure is, the path for a single file is normally unique and it won't be too long of a string when we limit the depth for recursion.
| return files_path | ||
|  | ||
|  | ||
| def _make_symlinks_from_a_folder(dest_path: str, data_path: str, depth: int): | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if we made this an iterative function we wouldn't need to worry about this depth at all (though it would probably still make sense to have some cap). Feel free to leave as is, but maybe 5 is a more reasonable depth.
| Do we need to fix 1.3 and 1.5 as well? | 
| 
 Yes. This seems a general issue for all versions. | 
* Fix: load training data from all sub dirs * fix flake8 formatting * fix local integ test * resolve some comments
* Fix: load training data from all sub dirs * fix flake8 formatting * fix local integ test * resolve some comments
Issue #, if available:
https://answers.amazon.com/posts/362803
Previously if customer set the input files channel to be
s3://bucket/sagemaker/train/(file mode)it will only load the files directly under this path or a random sub dir based on workflow here
https://github.com/aws/sagemaker-xgboost-container/blob/master/src/sagemaker_xgboost_container/data_utils.py#L557
https://github.com/aws/sagemaker-xgboost-container/blob/master/src/sagemaker_xgboost_container/data_utils.py#L638-L647
With this change, the behavior of loading files will be consistent with pipe mode, which is that all the sub folders under the root dir customer specified will be traversed and all the files found there will be loaded.
Description of changes:
Testing:
unit tests,
Tested with Notebook : https://haixiw-test.notebook.us-west-2.sagemaker.aws/notebooks/XGBoost_abalone/xgboost_abalone.ipynb
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.