Installation at NERSC #184
Hi Giuseppe, have you tried to install it using PyPI instead of downloading the source code from Git? It's not a solution, but it might help to understand what the culprit is: …

Hi, …
Hmmm, it seems the same. Does it work when you install something more trivial, like "tqdm"? (I believe it should, as the problem seems to be due to …)

I tried with …

I cannot do the test, because when I try to install something on …
Bingo! So it's a problem with NumPy. I would ask the NERSC staff for assistance, as I have no clue about what's causing it.

If I may add a hint, it seems that the problem is NumPy's attempt to install an executable (…)
I wrote to the NERSC staff about the problem. Regarding your second comment, you cannot combine the two flags: …
I had some interactions with the NERSC staff. I put here a summary of them: … I then created a new environment where I installed lbs with …

At this point lbs is in /global/common/software/..., but packages like numpy were ALSO in /global/homes/b/bortolam/.local/cori/3.9-anaconda-2021.11, because I installed them there when my environments were in $HOME. So the packages in $HOME were appearing in the Python search path BEFORE my litebird conda packages; in fact, … Then I ran again … HOWEVER, it seems that the scripts are still slow: for example, e2e-simulation.py for all the detectors of L2-050 takes …

We thought this was related to the filesystem, because we did not expect such an increase in running time when using more MPI ranks, but maybe this is incorrect…

P.S.: This last point about running times may be off-topic for "Installation at NERSC", which should now be solved.
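The shadowing described above can be confirmed from Python itself. The following is a minimal diagnostic sketch (not code from this thread): it lists the import search path in order and shows which NumPy installation actually gets loaded.

```python
import sys

# Entries under $HOME/.local that appear before the conda environment
# explain why the "wrong" packages win the import race.
for i, path in enumerate(sys.path):
    print(f"{i:2d}  {path}")

import numpy

# Show which NumPy installation was actually imported
print("numpy loaded from:", numpy.__file__)
```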
Interesting analysis! It really seems that the filesystem plays some part in this, but it's hard to tell why. Might you please create a new script that is exactly the same as the e2e script, but it just stops after having queried the data from the IMO? If you run it with an increasing number of processes (1, 2, 3, 4… up to ~10), we can measure how much the IMO is effectively slowed down by the NERSC filesystem. In the meantime, I have done a few benchmarks of …
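The requested measurement could look something like the harness below. This is a sketch under assumptions not stated in the thread: mpi4py is available on the system, and the placeholder comment stands in for the E2E script's actual IMO access, which the thread does not show.

```python
# Run as, e.g.:  srun -n <N> python imo_timing.py
# and compare the reported times as N grows.
import time

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

comm.Barrier()  # start all ranks together
start = time.perf_counter()

# --- placeholder: query the IMO here exactly as the E2E script does, then stop ---

elapsed = time.perf_counter() - start
all_times = comm.gather(elapsed, root=0)
if rank == 0:
    print(
        f"nproc={comm.Get_size()}  "
        f"max={max(all_times):.3f} s  min={min(all_times):.3f} s"
    )
```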
I did the test that you were suggesting. I attach a file showing the running times of the instructions vs the number of processors nproc. The file named running_times_no_quat.pdf is the same but without the quaternions curve. The labels show the instructions, and the xticks show the numbers of processors that I tried. I did one run per number of processors, so the curves could be subject to scatter. Up to 30 nproc only one node is used, while for nproc = 40, 50, 60 two nodes are used (that's presumably the reason for the drop from 30 to 40, since the processes are split between the nodes).
Thanks @marcobortolami for doing this test! It seems to me the issue is in the plot running_times.pdf: the execution of sim.generate_spin2ecl_quaternions increases with nproc. What's going on there?
I don't know the code well enough to be sure what exactly is happening inside … That could lead to a quadratic increase in the number of running threads, which will most likely kill performance.
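If that hypothesis is right, a standard mitigation is to cap the thread pool of each MPI rank before the threaded libraries initialize, so N ranks don't each spawn one thread per core (N × cores in total). A minimal sketch, assuming Numba and an OpenMP/MKL-backed NumPy are the thread sources; the truncated comment above does not confirm which library was actually involved:

```python
# Cap per-process threading *before* numpy/numba set up their pools.
import os

os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["NUMBA_NUM_THREADS"] = "1"

import numba  # imported only now, after the environment is set

print("Numba threads in this process:", numba.get_num_threads())
```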
Looking a bit closer at the code, there doesn't seem to be any call to …
@nraffuzz and I had a call with @paganol and @sgiardie about the problem. We post here a recap of the discussion.

Brief recap: the run time was scaling badly with the number of processors, as …

Test 1: after having a deeper look at the lbs code, @paganol suggested to use a …

Test 2: we set the parameter …

Note: we noticed that here the code is not parallelized. Do you think that this implementation could help with the running-time problem?

Solution: looking at the slurm file, @paganol hinted that the binding of CPUs to tasks might be sub-optimal. Thus we added the flag …
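One quick way to verify that the CPU binding actually did what was intended is to print, from each MPI rank, the set of CPUs it is allowed to run on; with correct binding the sets should be small and non-overlapping. This diagnostic sketch is not from the thread, and it is Linux-only, since it relies on os.sched_getaffinity:

```python
import os

from mpi4py import MPI

rank = MPI.COMM_WORLD.Get_rank()
# The affinity mask of the current process (PID 0 = "self")
print(f"rank {rank}: bound to CPUs {sorted(os.sched_getaffinity(0))}")
```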
Excellent news, good job!
Ah, yes, I remember that TODO. If I recall correctly, this will help especially in those cases where you use a …
Am I interpreting this correctly that, even in the fast version, a CPU takes around 24 s to compute the quaternions for a single day of the mission? Assuming that this is a simple rotation operation, this feels horribly slow to me. I would have expected that a single CPU can produce several million quaternions per second, but this value seems to be quite at odds with your result. Please let me know if I misunderstood something! If not, it might be worthwhile to tweak not only the parallelization but also the basic algorithm itself.
In the case we called Test 1, we used 60 processors, not a single CPU; that is to say, the simulation was 60 days long instead of a single day. The 60-day simulation then took ~24 s. Then, with the same configuration (60 proc = 60 days), we added …
OK, if 60 CPUs together compute 60 days in 8 s, then one can estimate that in this setup a single CPU takes 8 s to compute a single day, or 1440 quaternions in total (one for every minute of the day).
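That works out to only about 180 quaternions per second per CPU, far below the millions per second expected above. To put that expectation in numbers, here is a minimal standalone benchmark of vectorized quaternion products in plain NumPy. This is illustrative code, not litebird_sim's implementation; the Hamilton product with (x, y, z, w) component ordering is an assumption.

```python
import time

import numpy as np


def quat_multiply(a, b):
    """Hamilton product of quaternion arrays stored in (x, y, z, w) order."""
    ax, ay, az, aw = a[..., 0], a[..., 1], a[..., 2], a[..., 3]
    bx, by, bz, bw = b[..., 0], b[..., 1], b[..., 2], b[..., 3]
    return np.stack(
        (
            aw * bx + ax * bw + ay * bz - az * by,
            aw * by - ax * bz + ay * bw + az * bx,
            aw * bz + ax * by - ay * bx + az * bw,
            aw * bw - ax * bx - ay * by - az * bz,
        ),
        axis=-1,
    )


n = 10_000_000
q = np.random.standard_normal((n, 4))
rot = np.array([0.0, 0.0, np.sin(0.1), np.cos(0.1)])  # rotation about the z axis

start = time.perf_counter()
out = quat_multiply(q, rot)
elapsed = time.perf_counter() - start
print(f"{n / elapsed:.2e} quaternion products/s")
```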
I did a few benchmarks using the following script, which simulates one detector with a 50 Hz sampling rate (very high!) for one day, computing one quaternion per second:

```python
import litebird_sim as lbs
import numpy as np
import time


def create_fake_detector(sampling_rate_hz, quat=np.array([0.0, 0.0, 0.0, 1.0])):
    return lbs.DetectorInfo(name="dummy", sampling_rate_hz=sampling_rate_hz, quat=quat)


sim = lbs.Simulation(
    base_path="/home/tomasi/tmp/litebird_sim_benckmark",
    start_time=0.0,
    duration_s=86400.0,
)
det = create_fake_detector(sampling_rate_hz=50.0)
sim.create_observations(
    detectors=[det], num_of_obs_per_detector=1, split_list_over_processes=False
)
assert len(sim.observations) == 1
obs = sim.observations[0]

scanning_strategy = lbs.SpinningScanningStrategy(
    spin_sun_angle_rad=0.0, precession_rate_hz=1.0, spin_rate_hz=1.0
)

# Generate the quaternions (one per second)
start = time.perf_counter_ns()
sim.generate_spin2ecl_quaternions(
    scanning_strategy=scanning_strategy, delta_time_s=1.0, append_to_report=False
)
stop = time.perf_counter_ns()
elapsed_time = (stop - start) * 1.0e-9

print("Elapsed time for generate_spin2ecl_quaternions: {} s".format(elapsed_time))
print("Shape of the quaternions: ", sim.spin2ecliptic_quats.quats.shape)
print(
    "Speed: {:.1e} quat/s".format(sim.spin2ecliptic_quats.quats.shape[0] / elapsed_time)
)

instr = lbs.InstrumentInfo(spin_boresight_angle_rad=np.deg2rad(15.0))

# Compute the pointings by running a "slerp" operation
start = time.perf_counter_ns()
pointings_and_polangle = lbs.get_pointings(
    obs,
    spin2ecliptic_quats=sim.spin2ecliptic_quats,
    detector_quats=np.array([[0.0, 0.0, 0.0, 1.0]]),
    bore2spin_quat=instr.bore2spin_quat,
)
stop = time.perf_counter_ns()
elapsed_time = (stop - start) * 1.0e-9

print("Elapsed time for get_pointings: {} s".format(elapsed_time))
print("Shape of the pointings: ", pointings_and_polangle.shape)
print(
    "Speed: {:.1e} pointings/s".format(pointings_and_polangle.shape[1] / elapsed_time)
)
```

I ran the code using IPython: …
However, since this is the first run, the timings include the JIT compilation phase. Running the script once more at the same prompt provides something in line with what @mreineck expects: …
Now there are roughly 8 million quaternions/s and ~3 million pointings/s. Could it be that the slow times we are seeing are due to Numba's JIT? To check if this is true, it might help to structure the code so that the main is wrapped in a function that takes the number of elements, and then call it twice:

```python
def main(num_of_samples=None):
    # Here comes the script. If `num_of_samples` is None, just use the
    # start and end times specified in the TOML file.
    …


main(50)  # Warm up Numba by running a simulation with very few samples
main()    # Now run the full simulation
```

What do you think? Does it make sense?
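As a self-contained illustration of the effect under discussion, the sketch below uses a made-up Numba kernel (it is not litebird_sim code): the first call pays the compilation cost, while the second call runs the already-compiled machine code.

```python
import time

import numpy as np
from numba import njit


@njit
def normalize_quats(v):
    # Normalize each quaternion in an (N, 4) array
    out = np.empty_like(v)
    for i in range(v.shape[0]):
        norm = np.sqrt(v[i, 0] ** 2 + v[i, 1] ** 2 + v[i, 2] ** 2 + v[i, 3] ** 2)
        for j in range(4):
            out[i, j] = v[i, j] / norm
    return out


v = np.random.standard_normal((1_000_000, 4))

for label in ("first call (includes JIT compilation)", "second call (compiled)"):
    start = time.perf_counter()
    normalize_quats(v)
    print(f"{label}: {time.perf_counter() - start:.3f} s")
```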
Thank you very much for the test, @ziotom78! The warmed-up numbers indeed look much more like what I expected. I also agree that the testing strategy makes perfect sense.
Hopefully, I found a smarter way to do the test… In #190 I simply defined a …

The same PR adds the benchmark code to the directory …

@nraffuzz, @marcobortolami, when you have time, could you please test the E2E scripts with the code in #190, to see whether the execution speed has really improved? I do not think this is urgent, so take your time!
Thank you very much!
We do not think that the …
Thanks @marcobortolami, very nice test! In some sense it does not look so weird that there is no significant performance difference, because I have always seen that Numba is very fast in compiling stuff… The weird result is the one we got a few weeks ago, where …

Anyway, since it doesn't hurt, I would prefer to leave …
Hi. Do you think we can close the issue now? Or is there anything else to discuss about it?

Yes, I believe it can be closed. Thank you!
@marcobortolami and I are unable to install litebird_sim at NERSC in a directory on SCRATCH. As you know, the filesystem is faster there than in $HOME, and for production runs this could be really cumbersome. The installation procedure is the following: …

The error we get is: …

Running with: …

Anybody having similar issues?