non-trivial directory layout #22

Open
fwang2 opened this issue Aug 28, 2018 · 4 comments

Comments

@fwang2
Contributor

fwang2 commented Aug 28, 2018

As far as directory layout goes, LCIO currently just lays files out on a per-process basis: all files are created under p_rankid. This is overly naive and won't scale when we have billions of files, as it literally puts hundreds of millions of files under a SINGLE directory. In short, we need a non-trivial directory layout scheme, maybe one that is also drawn from a distribution.

@mjbach
Collaborator

mjbach commented Aug 28, 2018

How about a max_files_per_dir setting? Each process gets ceil(num_files / max_files_per_dir) directories, then randomly assigns each file to one of its directories.
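
For illustration, the proposal above could be sketched roughly as follows. This is not LCIO code; the function name `assign_files_uniform` and the `seed` parameter are hypothetical, and the sketch assumes "randomly assigns" means an independent uniform pick per file.

```python
import math
import random
from collections import Counter


def assign_files_uniform(num_files, max_files_per_dir, seed=0):
    """Hypothetical sketch: pick a directory uniformly at random for each file."""
    # Number of directories per process, as proposed in the comment.
    num_dirs = math.ceil(num_files / max_files_per_dir)
    rng = random.Random(seed)
    # counts[d] is the number of files that landed in directory index d.
    counts = Counter(rng.randrange(num_dirs) for _ in range(num_files))
    return num_dirs, counts


num_dirs, counts = assign_files_uniform(10_000, 100)
print(num_dirs, max(counts.values()))
```

Note that independent uniform picks do not enforce the cap by themselves: some directories will generally end up holding more than max_files_per_dir files unless full directories are rejected.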

@fwang2
Contributor Author

fwang2 commented Aug 28, 2018

Then random is not really random, right? Say each process is responsible for 1M files and max_files_per_dir is 1000. Then, by calculation, you will create 1000 dirs and put 1000 files in each.
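
The arithmetic behind this objection can be checked in a few lines. The variable names are only illustrative, and the sketch assumes the cap is enforced strictly.

```python
import math

num_files = 1_000_000        # files owned by one process
max_files_per_dir = 1_000    # strict per-directory cap (assumption)

num_dirs = math.ceil(num_files / max_files_per_dir)
# Spare capacity left once every file has been placed.
slack = num_dirs * max_files_per_dir - num_files
print(num_dirs, slack)  # 1000 0
```

With zero slack, the only layout that respects the cap is exactly 1000 files in every directory, so there is no room left for randomness.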

@fwang2
Contributor Author

fwang2 commented Aug 28, 2018

How about avg_files_per_dir = n, which will define how many sub-dirs you create?
However, the number of files you create under each sub-dir will be drawn from a normal distribution with n as its mean (mu).
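
A minimal sketch of this idea, under several assumptions that are not in the thread: the standard deviation `sigma` is a free parameter, draws are rounded and clamped at zero, and the total is nudged back to num_files one file at a time. The name `draw_dir_sizes` is hypothetical.

```python
import math
import random


def draw_dir_sizes(num_files, avg_files_per_dir, sigma, seed=0):
    """Hypothetical sketch: per-directory file counts that sum to num_files."""
    rng = random.Random(seed)
    # avg_files_per_dir (n) fixes the number of sub-dirs.
    num_dirs = math.ceil(num_files / avg_files_per_dir)
    # Draw each count from N(mu=n, sigma), round to an integer, clamp at zero.
    sizes = [max(0, round(rng.gauss(avg_files_per_dir, sigma)))
             for _ in range(num_dirs)]
    # Assumption: rounding and clamping make the total drift, so adjust
    # randomly chosen directories by one file until the total matches.
    diff = num_files - sum(sizes)
    while diff != 0:
        i = rng.randrange(num_dirs)
        if diff > 0:
            sizes[i] += 1
            diff -= 1
        elif sizes[i] > 0:
            sizes[i] -= 1
            diff += 1
    return sizes


sizes = draw_dir_sizes(num_files=100_000, avg_files_per_dir=1_000, sigma=100)
print(len(sizes), sum(sizes), min(sizes), max(sizes))
```

How sigma should be chosen, and whether the total needs to be corrected at all, are open design questions that the comment leaves unspecified.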

@mjbach
Collaborator

mjbach commented Aug 28, 2018

That could work. It would also make analysis easier, since it adds another normalization factor.


2 participants