-
Notifications
You must be signed in to change notification settings - Fork 228
extend preprocess_data_dist to handle jsonl files #60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
269af4e
9ba081b
d29a702
ed49713
e94f2a0
687ff32
af59545
4f648a0
9f2ba6a
72d6c9c
8b67bec
3eca1f3
a691b48
e4a34e2
8b168ca
7a02693
eca2940
b14491d
ec11281
354d13b
2dc3f7a
980e904
ebd20a6
69b2f49
50de06a
af290ad
4b58c74
71a2fdc
73d3a24
b9e69be
da615c6
c42f41f
a3a7d53
163310a
01b2be0
4b6e8ff
ea08555
2524fce
ca14d48
d482f36
28d76f5
f706108
f122883
57c012e
eed8327
a75cfc2
afcfcf9
74b733a
dadb51b
a2f8fa0
39e6cd7
2a29d99
d6fa895
1216c0a
a64d3da
fde439e
fb274bf
d428c02
ba14351
852fdd0
61f4b46
18881ae
bd6f41f
3488d0b
1305fe9
6bcac1f
813d068
0510081
1fea302
d360313
20a43af
7b08347
8d448bc
6f7519f
3f9078d
b9aa845
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,4 +1,5 @@ | ||
| import os | ||
| import stat | ||
| import numpy as np | ||
|
|
||
| import torch | ||
|
|
@@ -142,7 +143,8 @@ def all_sum_(self, vals: np.array): | |
| dist.all_reduce(tensor, op=dist.ReduceOp.SUM) | ||
|
|
||
| def open(self, filename, truncate=None): | ||
| """Create, truncate, and open a file shared by all ranks.""" | ||
| """Create, truncate, and open a file for writing shared by all ranks.""" | ||
| f = None | ||
|
|
||
| # Don't truncate existing file until all ranks reach this point | ||
| self.barrier() | ||
|
|
@@ -162,6 +164,8 @@ def open(self, filename, truncate=None): | |
|
|
||
| except Exception as e: | ||
| err = e | ||
| if f is not None: | ||
| f.close() | ||
|
|
||
| # Verify that rank 0 created the file | ||
| self.allraise_if(err) | ||
|
|
@@ -175,6 +179,40 @@ def open(self, filename, truncate=None): | |
| err = e | ||
|
|
||
| # Verify that all ranks successfully opened the file | ||
| if not self.alltrue(err is None): | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I overall curious why you need to close the file? Raise should destroy everything no?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, good point. I think the raise should do the trick, since the file handle will not be returned and go out of scope in that case. I'll simplify that code. |
||
| # Someone failed to open the file. | ||
| # If we succeeded, close our file. | ||
| if f is not None: | ||
| f.close() | ||
|
|
||
| # All raise an exception if anyone did | ||
| self.allraise_if(err) | ||
|
|
||
| return f | ||
|
|
||
| def openread(self, filename): | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Isn't that just
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In the open-for-write function, I have that written as a two phase process, where rank 0 creates and truncates the file, then other ranks open the file afterwards. In the open-for-read, all procs open the file simultaneously. I think it's useful to keep the two-phase step for creating the file, because create/truncate can be expensive on some file systems. However, I suspect this can be refactored to have a single open call so that |
||
| """Open a shared file for reading by all ranks.""" | ||
| f = None | ||
|
|
||
| # Don't attempt to open until all ranks are ready. | ||
| self.barrier() | ||
|
|
||
| # Open the file for reading on all ranks. | ||
| # Catch exception if the rank fails. | ||
| err = None | ||
| try: | ||
| f = open(filename, 'rb') | ||
| except Exception as e: | ||
| err = e | ||
|
|
||
| # Verify that all ranks successfully opened the file | ||
| if not self.alltrue(err is None): | ||
| # Someone failed to open the file. | ||
| # If we succeeded, close our file. | ||
| if f is not None: | ||
| f.close() | ||
|
|
||
| # All raise an exception if anyone did | ||
| self.allraise_if(err) | ||
|
|
||
| return f | ||
|
|
@@ -218,3 +256,51 @@ def rename(self, srcfile, destfile): | |
|
|
||
| # Verify that the rename succeeded | ||
| self.allraise_if(err) | ||
|
|
||
| def exists(self, filename): | ||
| """Test whether file exists and broadcast result to all ranks.""" | ||
| # We'll capture any exception in this variable | ||
| err = None | ||
|
|
||
| # Rank 0 executes the existence check | ||
| exists = False | ||
| if self.rank == 0: | ||
| try: | ||
| exists = os.path.exists(filename) | ||
| except Exception as e: | ||
| err = e | ||
|
|
||
| # Verify that the check succeeded | ||
| self.allraise_if(err) | ||
|
|
||
| # Get value from rank 0 | ||
| exists = self.bcast(exists, root=0) | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. technically you can have all ranks to run Also sidenote, we're having a lot of err = None
# Rank 0 executes the existence check
exists = False
if self.rank == 0:
try:
do_something
except Exception as e:
err = e
# Verify that the check succeeded
self.allraise_if(err)can we make a helper and factorise that code somewhere? You could pass a method as an argument. (there might not even be a need for
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, I'll see if I can create a helper routine to account for that pattern. I tend to use this pattern for a couple of reasons. One is that having rank 0 do the check and bcast the result tends to be more scalable than having all ranks do the check directly. For example, a A second benefit is that it guarantees that all procs see a consistent result. As an example, imagine that someone deletes the file while the
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Okay this makes sense to me. Thanks for the great explanation! |
||
| return exists | ||
|
|
||
| def stat(self, filename, field): | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not sure why we need this helper, can't all ranks run |
||
| """Lookup field from stat on file and broadcast to all ranks.""" | ||
| # We'll capture any exception in this variable | ||
| err = None | ||
|
|
||
| # Rank 0 does the actual stat call | ||
| val = None | ||
| if self.rank == 0: | ||
| try: | ||
| val = os.stat(filename)[field] | ||
| except Exception as e: | ||
| err = e | ||
|
|
||
| # Verify that the stat succeeded | ||
| self.allraise_if(err) | ||
|
|
||
| # Get value from rank 0 | ||
| val = self.bcast(val, root=0) | ||
| return val | ||
|
|
||
| def filesize(self, filename): | ||
| """Lookup filesize and broadcast to all ranks.""" | ||
| return self.stat(filename, stat.ST_SIZE) | ||
|
|
||
| def mtime(self, filename): | ||
| """Lookup file mtime and broadcast to all ranks.""" | ||
| return self.stat(filename, stat.ST_MTIME) | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we instead scope everything? It would allow exception handling to be specific? Like if truncate fails then we need to close the file.