Fix isolate options #913
Codecov Report
@@ Coverage Diff @@
## master #913 +/- ##
==========================================
+ Coverage 61.08% 61.08% +<.01%
==========================================
Files 203 203
Lines 16938 16940 +2
==========================================
+ Hits 10346 10348 +2
Misses 6592 6592
|
Thanks for looking at the options! As you said, use_cgroup is always enabled, so the second commit doesn't change the behaviour (still appreciated!). I'm curious about the first, since we have a test that should pick up if timing is not checked appropriately. Is it only wrong when using multiple processes/threads? In which direction? |
> As you said, use_cgroup is always enabled so the second commit doesn't change the behaviour (still appreciated!).
Actually, it is enabled by default and it is possible to switch it off
in CMS configuration.
> I'm curious about the first, since we have test that should pick up if
> timing is not checked appropriately. Is it only wrong when using multiple
> processes/threads? In which direction?
Yes, it is off only with multiple processes (maybe also threads).
It measures only run time of the master process, not of all processes
in the cgroup.
|
I cannot reproduce, can you help me understand? (I am trying to figure out whether this justifies backporting and releasing a new version for 1.3.)

With python & threading:

```python
import sys
import threading
import time


def w():
    # Busy-loop for one second of wall-clock time.
    t = time.time()
    i = 0
    while time.time() - t < 1:
        i += 1


n = int(sys.stdin.readline().strip())
t = threading.Thread(target=w)
t.start()
t.join()
sys.stdout.write("correct %d\n" % n)
```

```
$ echo 1 | time python /tmp/batchstdio.py
correct 1
1.04user 0.01system 0:01.06elapsed 99%CPU (0avgtext+0avgdata 9884maxresident)k
```

CMS:

With C & fork:

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>

int main() {
    int n;
    pid_t ret = fork();
    if (ret == 0) {
        /* Child: busy-loop for one second of wall-clock time. */
        int i = 0;
        struct timespec start, end;
        clock_gettime(CLOCK_MONOTONIC_RAW, &start);
        while (1) {
            i++;
            clock_gettime(CLOCK_MONOTONIC_RAW, &end);
            long long delta_us = (end.tv_sec - start.tv_sec) * 1000000 + (end.tv_nsec - start.tv_nsec) / 1000;
            if (delta_us > 1000000) {
                break;
            }
        }
        if (i < 10000) {
            printf("slow\n");
        }
    } else {
        if (ret != -1) {
            /* Parent: wait for the child, then answer. */
            waitpid(ret, NULL, 0);
            scanf("%d", &n);
            printf("correct %d\n", n);
        }
    }
    return 0;
}
```

```
$ echo 1 | time /tmp/fork
correct 1
0.11user 0.88system 0:01.00elapsed 99%CPU (0avgtext+0avgdata 2672maxresident)k
0inputs+0outputs (0major+168minor)pagefaults 0swaps
```

CMS:

(Note that these are not hitting the wall time limit, which is 2.) |
Your Python test with threads times out properly (Linux kernel adds execution
times of all threads), but the C test does not:
***@***.***:~/a/tree/isolate$ isolate -v -p10 --cg --run -t 0.5 -M- -- ./fork
Using control group box-0 under parent .
Entering control group box-0
Started proxy_pid=13261 box_pid=13262 box_pid_inside_ns=2
Binding ./box on box (flags 1006)
Binding /bin on bin (flags 1007)
Binding /dev on dev (flags 1003)
Binding /lib on lib (flags 1007)
Binding /lib64 on lib64 (flags 1007)
Mounting proc on proc (flags 5)
Binding /usr on usr (flags 1007)
Binding //etc on etc (flags 1007)
[timer][timer][timer][timer][timer][timer][timer][timer][timer][timer]correct 1
time:1.004
time-wall:1.052
max-rss:1376
csw-voluntary:7
csw-forced:2
cg-mem:5028
status:TO
message:Time limit exceeded
Time limit exceeded
Please observe that the time limit was reported only after the program finished.
|
So it's more a matter of detecting the TO early, rather than detecting it at all? |
> So it's more a matter of detecting the TO early, rather than detecting it at all?
It is a little bit unexpected, but yes :-)
It turns out that the wait4() syscall is not reporting the resource
usage of the finished process only, but it also adds times of all its
finished children. This does not seem to be documented anywhere, but
the kernel source is clear.
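
The accumulation described above can be observed from Python with a minimal sketch like the following. It assumes Linux and uses only `os.fork`, `os.waitpid` and `os.wait4`; the helper names `burn_cpu` and `cpu_time_reported_for_child` are made up for this illustration and are not part of isolate or CMS. A child forks a busy grandchild, reaps it, and exits without doing any real work itself, and the caller then looks at what `wait4()` reports for the child.

```python
import os
import time


def burn_cpu(seconds):
    """Spin until the calling process has consumed `seconds` of CPU time."""
    start = time.process_time()
    while time.process_time() - start < seconds:
        pass


def cpu_time_reported_for_child(grandchild_cpu=0.3):
    """Return the CPU time (user + system) that wait4() reports for a child
    whose only work was done by a grandchild it waited for."""
    pid = os.fork()
    if pid == 0:
        # Child: does almost nothing itself.
        try:
            gpid = os.fork()
            if gpid == 0:
                # Grandchild: the only process that burns CPU.
                burn_cpu(grandchild_cpu)
                os._exit(0)
            # Reaping the grandchild folds its times into the child's totals.
            os.waitpid(gpid, 0)
        finally:
            os._exit(0)
    _, _, usage = os.wait4(pid, 0)
    return usage.ru_utime + usage.ru_stime


if __name__ == "__main__":
    print("reported for child: %.3f s" % cpu_time_reported_for_child())
```

If the reported value is close to the grandchild's CPU time even though the child itself was idle, that matches the behaviour described above: the usage of waited-for descendants is included.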
However, it can still be misused: Suppose you have a task with time
limit 1s. You fork two processes: process A calculates for 900 ms,
process B calculates for 900 ms and then it loops forever. The main
process waits for A, and then it exits, leaving B running. Process B
is killed by the kernel when tearing down the PID namespace, but it
is too late: only the time spent in A was included in the total time
of the parent process. Hence you spent 1800 ms by computing, but
Isolate sees only 900 ms.
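
A scaled-down sketch of this scenario, again assuming Linux and using hypothetical helper names: process B burns a bounded amount of CPU instead of looping forever, so that the demo terminates on its own without a PID namespace to tear down. The pipe only lets the caller notice when B has finished.

```python
import os
import time


def burn_cpu(seconds):
    """Spin until the calling process has consumed `seconds` of CPU time."""
    start = time.process_time()
    while time.process_time() - start < seconds:
        pass


def reported_vs_actual(a_cpu=0.2, b_cpu=0.6):
    """Return (CPU time wait4() reports for the main process,
    CPU time actually burnt by A and B together)."""
    read_fd, write_fd = os.pipe()
    main_pid = os.fork()
    if main_pid == 0:
        # "Main" process: forks A and B, waits only for A, then exits.
        try:
            os.close(read_fd)
            pid_a = os.fork()
            if pid_a == 0:
                burn_cpu(a_cpu)
                os._exit(0)
            pid_b = os.fork()
            if pid_b == 0:
                # Stand-in for "loops forever": bounded so the demo ends.
                burn_cpu(b_cpu)
                os._exit(0)
            os.waitpid(pid_a, 0)  # B is never reaped by the main process
        finally:
            os._exit(0)
    os.close(write_fd)
    _, _, usage = os.wait4(main_pid, 0)
    reported = usage.ru_utime + usage.ru_stime
    # B still holds the write end; read() returns EOF once B has exited.
    os.read(read_fd, 1)
    os.close(read_fd)
    return reported, a_cpu + b_cpu


if __name__ == "__main__":
    reported, actual = reported_vs_actual()
    print("reported %.3f s, actually burnt about %.3f s" % (reported, actual))
```

The reported figure should stay near A's share only, because B was never waited for by the main process.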
I would really like to have this fix backported to the 1.3 branch.
|
Thank you, merged in both and we'll release a 1.3.1 in O(days), possibly with the quota fix too. Speaking of quota. We use -f to limit the size of each file written (default to 1GB), and we have strict control on what files are writable (at most 1 for Batch, 0 for Communication and TwoSteps). Isn't that enough? |
> Thank you, merged in both and we'll release a 1.3.1 in O(days), possibly with the quota fix too.
Thanks!
Since this has shown that isolate's parsing of options is not robust
enough, I plan to change it a bit:
* --cg-mem and similar CG-related options will be refused
in non-CG mode.
* --cg-timing will be the default in --cg mode. It is dubious
if it is ever useful to use non-CG timing with CGs, but if
somebody still wants to do it, there will be --no-cg-timing
available.
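
The two rules above could be sketched roughly as follows. This is an illustration in Python, not isolate's actual option parser (which is written in C); the function name `normalize_options`, the dictionary keys, and the choice to also refuse an explicit timing flag in non-CG mode are assumptions made for the example.

```python
def normalize_options(opts):
    """Apply the planned rules to a dict of parsed options.

    Hypothetical keys: 'cg' (bool), 'cg_mem' (int or None) and
    'cg_timing' (True, False, or None when not given on the command line).
    """
    cg = bool(opts.get("cg", False))
    cg_only = [key for key in ("cg_mem", "cg_timing")
               if opts.get(key) is not None]

    # Rule 1: CG-related options are refused in non-CG mode.
    if not cg and cg_only:
        raise ValueError("options %s require --cg" % ", ".join(cg_only))

    result = dict(opts)
    # Rule 2: CG timing is the default in CG mode unless switched off explicitly.
    if cg and result.get("cg_timing") is None:
        result["cg_timing"] = True
    return result
```

With these rules, a caller that forgets the timing flag in CG mode still gets whole-group timing, and a caller that passes a CG memory limit without CG mode gets an error instead of a limit that is silently ignored.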
> Speaking of quota. We use -f to limit the size of each file written (default
> to 1GB), and we have strict control on what files are writable (at most
> 1 for Batch, 0 for Communication and TwoSteps). Isn't that enough?
I looked at the IsolateSandbox class once again and I see no way how to
circumvent the file size limit, so no urgent fix is needed.
In the long term, I do not feel very comfortable with the current mechanism.
At first sight, it seems to fit the needs, but it has many sharp corners:
for example, the tested program cannot write something to the output file,
then delete it, and start writing again. Of course, this is not a common
access pattern, but it is obviously not forbidden by the usual contest rules,
so it should work. Also, it is bold to assume that no library function will
ever need to create a temporary file.
For these reasons, I would prefer to use the directory created by Isolate
itself when initializing the sandbox. You can create the input files and
the executable there and have "isolate --run" change the owner of all files
to the UID of the sandboxed program. Hence, the program will have a fully
writable directory, limited only by the filesystem quota. Finally, Isolate
will change the UID back to the caller's one, so CMS will have full access
to the files.
Does this make sense?
|
Good idea on avoiding incompatible combinations of options. For quota: in that case, I think we want a subdirectory of isolate's dir to be fully writable by the program (we write other stuff in the main dir, such as run.log and commands.log). I agree with the idea; the quota limit seems fairer and a tiny bit safer than the current approach (but again, I'm far from being an expert). In that case, I think we can have a reasonable default stored in an undocumented option similar to
As you said, it seems less important for 1.3 though. |
I'm going to close this and create an issue to track quota. |
Using cgroups without --cg-timing makes little sense, and it is too easy to forget to ask for --cg-timing explicitly. Whoever wishes to use the single-process time limit mechanism with CG can make that wish explicit with --no-cg-timing. For more background, see cms-dev/cms#913.
I reviewed how current CMS calls isolate and there are several problems with it:
(1) When using cgroups (which is currently the CMS default), --cg-timing is not specified, so the time limit does not work properly.
(2) When not using cgroups, memory limit is still passed as --cg-mem, which does not work.
(3) We do not set --quota, so unless root set up FS quotas for all UIDs used by isolate, the disk space consumed by the solution is not limited at all.
This pull request solves (1) and (2). I would like to know your opinion on how to specify the quota: should I add a setting in the config file for that? Or a more general config setting for adding arbitrary isolate options (--extra-time could be handy, too)?
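
As a rough illustration of what fixing (1) and (2) means for the caller, here is a minimal sketch of building the isolate command line. The function `isolate_run_args` and its parameters are hypothetical and do not reproduce the CMS sandbox code; only the isolate options themselves (`--box-id`, `--cg`, `--cg-timing`, `--cg-mem`, `--mem`, `--time`, `--run`) are real.

```python
def isolate_run_args(command, box_id=0, use_cgroups=True,
                     time_limit=None, memory_kib=None):
    """Build an argv list for `isolate --run` (illustrative only)."""
    args = ["isolate", "--box-id=%d" % box_id]
    if use_cgroups:
        # (1) With cgroups, ask for timing of the whole control group.
        args += ["--cg", "--cg-timing"]
    if time_limit is not None:
        args.append("--time=%g" % time_limit)
    if memory_kib is not None:
        # (2) --cg-mem only makes sense with cgroups; fall back to --mem otherwise.
        flag = "--cg-mem" if use_cgroups else "--mem"
        args.append("%s=%d" % (flag, memory_kib))
    args += ["--run", "--"]
    args += list(command)
    return args
```

For example, `isolate_run_args(["./fork"], time_limit=0.5, memory_kib=65536)` yields a command that enables both `--cg` and `--cg-timing`, while passing `use_cgroups=False` switches the memory limit to `--mem` and drops the CG options.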