Installation at NERSC #184

Closed
giuspugl opened this issue Jul 29, 2022 · 27 comments

Comments

@giuspugl
Collaborator

@marcobortolami and I are unable to install litebird_sim at NERSC in a directory on SCRATCH. As you know, the filesystem is faster there than in $HOME, and having to install in $HOME could be really cumbersome for production runs.
The installation procedure is the following:

module load python 
git clone https://github.com/litebird/litebird_sim.git 
cd litebird_sim 
export PREFIX=${SCRATCH}/lbs_software 
mkdir $PREFIX 
pip install --prefix $PREFIX . 

and the error we get is:

ERROR: Could not install packages due to an OSError: [Errno 13] Permission denied: '/global/common/software/nersc/cori-2022q1/sw/python/3.9-anaconda-2021.11/bin/f2py'
Consider using the `--user` option or check the permissions.

Running with :

pip 21.2.4 
python  3.9.7

Is anybody having similar issues?

@ziotom78
Member

Hi Giuseppe,

Have you tried to install it using PyPI instead of downloading the source code from Git? It's not a solution, but it might help to understand what the culprit is:

pip install --prefix $PREFIX litebird_sim

@marcobortolami
Contributor

marcobortolami commented Jul 29, 2022

Hi,
By running pip install --prefix $PREFIX litebird_sim I get the following error:
ERROR: Could not install packages due to an OSError: [Errno 13] Permission denied: '/global/common/software/nersc/cori-2022q1/sw/python/3.9-anaconda-2021.11/bin/f2py' Consider using the '--user' option or check the permissions.

@ziotom78
Member

Hmmm, it seems the same. Does it work when you install something more trivial, like "tqdm"?

pip install --prefix $PREFIX tqdm

(I believe it should, as the problem seems to be due to f2py, but it's worth checking it…)

@marcobortolami
Contributor

I tried with pip install --prefix $PREFIX emcee and it works

@ziotom78
Member

f2py is installed by NumPy. Could you please try to install the latest NumPy version to check if it works?

pip install --prefix $PREFIX numpy==1.21.5

I cannot do the test, because when I try to install something on cori, I get this message:

Could not fetch URL https://pypi.org/simple/pip/: There was a problem confirming the ssl certificate: HTTPSConnectionPool(host='pypi.org', port=443): Max retries exceeded with url: /simple/pip/ (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)'))) - skipping

@marcobortolami
Contributor

pip install --prefix $PREFIX numpy==1.21.5 gives:
Collecting numpy==1.21.5
Downloading numpy-1.21.5-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.7 MB)
|████████████████████████████████| 15.7 MB 13.0 MB/s
Installing collected packages: numpy
Attempting uninstall: numpy
Found existing installation: numpy 1.20.3
Uninstalling numpy-1.20.3:
ERROR: Could not install packages due to an OSError: [Errno 13] Permission denied: '/global/common/software/nersc/cori-2022q1/sw/python/3.9-anaconda-2021.11/bin/f2py'
Consider using the '--user' option or check the permissions.

@ziotom78
Member

Bingo! So it's a problem with NumPy. I would ask the NERSC staff for assistance, as I have no clue about what's causing it.

@ziotom78
Member

If I may add a hint, it seems that the problem is NumPy's attempt to install an executable (f2py) instead of just a Python library. Does the flag --user solve the problem?

pip install --user --prefix $PREFIX numpy==1.21.5

@marcobortolami
Contributor

I wrote to NERSC staff about the problem. Regarding your second comment, you cannot combine the two flags:
ERROR: Can not combine '--user' and '--prefix' as they imply different installation locations

@marcobortolami
Contributor

I had some interactions with the NERSC staff. Here is a summary:

  • it's usually not a good idea to put a Python installation on $SCRATCH, since files on $SCRATCH are purged after 12 weeks
  • for a higher-performance location for a Python stack, they suggest using the /global/common/software filesystem
  • thus, I removed the conda environments I was using before and added these lines to $HOME/.condarc:
envs_dirs:
- /global/common/software/mp107/<your username>/conda/conda_envs

pkgs_dirs:
- /global/common/software/mp107/<your username>/conda/conda_pkgs

channels:
- defaults

and I created a new environment where I installed lbs with

module load python
conda create -n litebird python=3.9 -y
conda activate litebird
git clone https://github.com/litebird/litebird_sim.git
cd litebird_sim
pip install .

At this point lbs is in /global/common/software/..., but packages like numpy were ALSO in /global/homes/b/bortolam/.local/cori/3.9-anaconda-2021.11, because I had installed them there when my environments were in $HOME. So the packages in $HOME were appearing in the Python search path BEFORE my litebird conda packages; in fact
python -c "import numpy;print(numpy.__file__)"
had the following output
/global/homes/b/bortolam/.local/cori/3.9-anaconda-2021.11/lib/python3.9/site-packages/numpy/__init__.py
They suggested deleting (which is what I did) or renaming the 3.9-anaconda-2021.11 folder, or unsetting PYTHONUSERBASE.

Then I ran pip install . again in the litebird_sim folder, because I had some package-related errors, and now the e2e-simulation.py script runs correctly and print(package.__file__) gives a path in /global/common/software/.
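A quick way to double-check from inside the litebird conda environment that the packages now resolve to the new location (a minimal sketch; numpy and litebird_sim are just examples):

import numpy
import litebird_sim

# Both paths should point inside /global/common/software/...,
# not inside $HOME/.local/cori/...
print(numpy.__file__)
print(litebird_sim.__file__)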


HOWEVER it seems that the scripts are still slow... For example, the e2e-simulation.py for all the detectors of L2-050 takes

  • 3m10s for 1 day of simulation (and 1 MPI rank, as each rank handles 1 simulation day)
  • 6m26s for 64 days of simulation (with 64 MPI ranks)
  • >30 min for 365 days of simulation (with 365 MPI ranks). A lot of time (~18m before the filesystem change) is taken just for these instructions
lbs.Imo()
lbs.Simulation(...)
sim.imo.query(...)
lbs.InstrumentInfo(...)
lbs.DetectorInfo.from_imo(...)

We thought this was related to the filesystem because we did not expect such an increase in running time when using more MPI ranks, but maybe this is incorrect...

P.S.: this last point about running times may be off topic for "Installation at NERSC", which itself should now be solved.

@ziotom78
Member

ziotom78 commented Aug 5, 2022

HOWEVER it seems that the scripts are still slow... For example, the e2e-simulation.py for all the detectors of L2-050 takes

* 3m10s for 1 day of simulation (and 1 MPI rank, as each rank handles 1 simulation day)

* 6m26s for 64 days of simulation (with 64 MPI ranks)

* >30 min for 365 days of simulation (with 365 MPI ranks). A lot of time (~18m before the filesystem change) is taken just for these instructions
lbs.Imo()
lbs.Simulation(...)
sim.imo.query(...)
lbs.InstrumentInfo(...)
lbs.DetectorInfo.from_imo(...)

We thought this was related to the filesystem because we did not expect such an increase in running time when using more MPI ranks, but maybe this is incorrect...

Interesting analysis! It really seems that the filesystem plays some part in this, but it's hard to tell why. Could you please create a new script that is exactly the same as the e2e script but stops right after having queried the data from the IMO? If you run it with an increasing number of processes (1, 2, 3, 4… up to ~10), we can measure how much the IMO is effectively slowed down by the NERSC filesystem.
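Something along these lines should be enough (a minimal sketch, assuming a local IMO is configured; base_path is just a placeholder, and the real test would also repeat the sim.imo.query(...) / lbs.DetectorInfo.from_imo(...) calls used by the e2e script):

import time

import litebird_sim as lbs

start = time.perf_counter()
imo = lbs.Imo()
sim = lbs.Simulation(
    base_path="./imo_timing_test", start_time=0.0, duration_s=86400.0
)
# ...the real test would repeat here the IMO queries done by the e2e script...
stop = time.perf_counter()

# Launch with srun/mpirun using 1, 2, 3, ... ranks and compare the timings
print("IMO + Simulation setup: {:.2f} s".format(stop - start))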

In the meantime, I have done a few benchmarks of litebird_sim.imo on my own workstation but haven't found any behavior similar to yours. I do see some slowdown as the number of processes increases, but it is minimal and far less severe than what you're seeing. Note that some slowdown is expected, however: my script just parses the IMO and extracts information about a few tens of detectors, so the bottleneck is JSON parsing (each process spends ~50% of its time parsing the IMO JSON file), and once several processes read the same file concurrently, this introduces a measurable delay.

@marcobortolami
Contributor

I did the test you suggested. I attach a file showing the running times of the instructions vs the number of processors nproc. The file named running_times_no_quat.pdf is the same but without the quaternions curve. The labels show the instructions and the x ticks show the numbers of processors I tried. I did 1 run per number of processors, so the curves may be subject to some scatter. Up to 30 nproc only 1 node is used, while for nproc=40,50,60 two nodes are used (that's the reason for the decrease from 30 to 40, I suppose, because the tasks are split between the nodes).
What do you think?
running_times.pdf
running_times_no_quat.pdf

@giuspugl
Collaborator Author

giuspugl commented Aug 8, 2022

Thanks @marcobortolami for doing this test! It seems to me the issue is visible in the plot running_times.pdf: the execution time of sim.generate_spin2ecl_quaternions increases with Nproc. What's going on there?
I also hadn't thought about the JSON parsing issue mentioned by @ziotom78...

@mreineck
Collaborator

mreineck commented Aug 8, 2022

I don't know the code well enough to be sure what exactly is happening inside generate_spin2ecl_quaternions, but here is a hypothesis: could it be that this function

  • is called several times simultaneously on one node, and
  • internally uses ducc0.pointingprovider with an nthreads argument that implies further parallelization inside?

That could lead to a quadratic increase in the number of running threads, which would most likely kill performance.

@mreineck
Collaborator

mreineck commented Aug 8, 2022

Looking a bit closer at the code, there doesn't seem to be any call to ducc0.pointingprovider in the code in question ... sorry for the false alarm!

@marcobortolami
Contributor

@nraffuzz and I had a call with @paganol and @sgiardie about the problem. We post here a recap of the discussion.

Brief recap: the run time was scaling badly with the number of processors for sim.generate_spin2ecl_quaternions: the higher the number of processors, the longer the run time (see running_times.pdf). We installed everything on /global/common/software/ but the problem did not seem to vanish. With 60 processors, i.e. 60 simulation days, sim.generate_spin2ecl_quaternions takes t_quat ~ 200s.

Test 1: after a deeper look at the lbs code, @paganol suggested using a double (for testing the run time) instead of an astropy.time.Time instance as start_time. This is because here, the instructions at lines 391-393 seem to be the most time-consuming ones. This reduces t_quat to ~ 24s (60 proc). Thus, we found the bottleneck. However, we might need to use an instance of astropy.time.

Test 2: we set the parameter delta_time_s=300 in sim.generate_spin2ecl_quaternions to see the scaling (default is 60). This reduces t_quat to ~ 58s (60 proc). Here we are using an instance of astropy.time. The scaling is more or less as expected.

Note: we noticed that here the code is not parallelized. Do you think that implementing this parallelization could help with the running time problem?

Solution: by looking at the Slurm file, @paganol hinted that the binding of CPUs to tasks might be sub-optimal. Thus we added the flag --cpu-bind=cores to srun, forcing a binding between tasks and cores; we are thus forcing the number of processors (cores) to match the number of simulation days (tasks). This reduces t_quat to ~ 8s (60 proc), using delta_time_s=60 and an instance of astropy.time. Now with 365 proc, t_quat is ~ 44s. Finally, the time to run the complete e2e simulation for all the detectors of L2-050 with 365 days of simulation is now 10m 32s instead of > 30m. For example, the time for
lbs.Imo()
lbs.Simulation(...)
sim.imo.query(...)
lbs.InstrumentInfo(...)
lbs.DetectorInfo.from_imo(...)
reduces from ~18m to ~50s.
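As a quick sanity check that the binding is really in effect, each MPI rank can print the cores it is pinned to (a minimal sketch, assuming mpi4py is available and the job runs on Linux):

import os

from mpi4py import MPI

# With --cpu-bind=cores every rank should report a small, disjoint set of cores;
# without binding, all ranks typically see the whole node.
rank = MPI.COMM_WORLD.rank
print("Rank {}: cores {}".format(rank, sorted(os.sched_getaffinity(0))))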

@ziotom78
Member

Excellent news, good job!

Note: we noticed that here the code is not parallelized. Do you think that implementing this parallelization could help with the running time problem?

Ah, yes, I remember that TODO. If I recall correctly, this will help especially in those cases where you use an astropy.time.Time object to track time, but it should not affect the running time significantly if you just use a float. I opened a new issue (#189) to track this.

@mreineck
Collaborator

mreineck commented Aug 29, 2022

Brief recap: the run time was scaling badly with the number of processors for sim.generate_spin2ecl_quaternions: the higher the number of processors, the longer the run time (see running_times.pdf). We installed everything on /global/common/software/ but the problem did not seem to vanish. With 60 processors, i.e. 60 simulation days, sim.generate_spin2ecl_quaternions takes t_quat ~ 200s.

Test 1: after a deeper look at the lbs code, @paganol suggested using a double (for testing the run time) instead of an astropy.time.Time instance as start_time. This is because here, the instructions at lines 391-393 seem to be the most time-consuming ones. This reduces t_quat to ~ 24s (60 proc). Thus, we found the bottleneck. However, we might need to use an instance of astropy.time.

Am I interpreting this correctly: even in the fast version, a CPU takes around 24s to compute the quaternions for a single day of the mission? Assuming this is a simple rotation operation, that feels horribly slow to me. I would have expected a single CPU to be able to produce several million quaternions per second, but this value seems to be quite at odds with your result.

Please let me know if I misunderstood something! But if not, it might be worthwhile to tweak not only the parallelization, but also the basic algorithm itself.

@nraffuzz
Contributor

In the case we called Test 1, we used 60 processors, not a single CPU; that is to say, the simulation was 60 days long instead of a single day. The 60-day simulation then took ~24s. Then, with the same configuration (60 proc = 60 days), we added --cpu-bind=cores to srun, and the 60-day simulation took only ~8s.
Also, regarding the number of rotations a single CPU can make in a second: that number is limited by delta_time_s in sim.generate_spin2ecl_quaternions. The parameter delta_time_s specifies how often quaternions should be computed, and it is 60s by default.
Please let me know if something is not clear.

@mreineck
Collaborator

In the case we called Test 1, we used 60 processors, not a single CPU; that is to say, the simulation was 60 days long instead of a single day. The 60-day simulation then took ~24s. Then, with the same configuration (60 proc = 60 days), we added --cpu-bind=cores to srun, and the 60-day simulation took only ~8s.

OK, if 60 CPUs together compute 60 days in 8s, then one can estimate that in this setup a single CPU takes 8s to compute a single day, or 1440 quaternions in total (one for every minute of the day).
I'm completely at a loss as to how this operation can be so slow; my expectation would be that it should take a few milliseconds instead of 8 seconds.

@ziotom78
Member

ziotom78 commented Aug 29, 2022

I did a few benchmarks using the following script, which simulates one detector with 50 Hz sampling rate (very high!) for one day, computing one quaternion/s:

import litebird_sim as lbs
import numpy as np
import time


def create_fake_detector(sampling_rate_hz, quat=np.array([0.0, 0.0, 0.0, 1.0])):
    return lbs.DetectorInfo(name="dummy", sampling_rate_hz=sampling_rate_hz, quat=quat)


sim = lbs.Simulation(
    base_path="/home/tomasi/tmp/litebird_sim_benckmark",
    start_time=0.0,
    duration_s=86400.0,
)
det = create_fake_detector(sampling_rate_hz=50.0)

sim.create_observations(
    detectors=[det], num_of_obs_per_detector=1, split_list_over_processes=False
)
assert len(sim.observations) == 1
obs = sim.observations[0]

scanning_strategy = lbs.SpinningScanningStrategy(
    spin_sun_angle_rad=0.0, precession_rate_hz=1.0, spin_rate_hz=1.0
)

# Generate the quaternions (one per each second)
start = time.perf_counter_ns()
sim.generate_spin2ecl_quaternions(
    scanning_strategy=scanning_strategy, delta_time_s=1.0, append_to_report=False
)
stop = time.perf_counter_ns()
elapsed_time = (stop - start) * 1.0e-9

print("Elapsed time for generate_spin2ecl_quaternions: {} s".format(elapsed_time))
print("Shape of the quaternions: ", sim.spin2ecliptic_quats.quats.shape)
print(
    "Speed: {:.1e} quat/s".format(sim.spin2ecliptic_quats.quats.shape[0] / elapsed_time)
)

instr = lbs.InstrumentInfo(spin_boresight_angle_rad=np.deg2rad(15.0))

# Compute the pointings by running a "slerp" operation
start = time.perf_counter_ns()
pointings_and_polangle = lbs.get_pointings(
    obs,
    spin2ecliptic_quats=sim.spin2ecliptic_quats,
    detector_quats=np.array([[0.0, 0.0, 0.0, 1.0]]),
    bore2spin_quat=instr.bore2spin_quat,
)
stop = time.perf_counter_ns()
elapsed_time = (stop - start) * 1.0e-9

print("Elapsed time for get_pointings: {} s".format((stop - start) * 1e-9))
print("Shape of the pointings: ", pointings_and_polangle.shape)
print(
    "Speed: {:.1e} pointings/s".format(pointings_and_polangle.shape[1] / elapsed_time)
)

I ran the code using IPython:

$ ipython
Python 3.9.5 (default, Jun  4 2021, 12:28:51) 
Type 'copyright', 'credits' or 'license' for more information
IPython 8.0.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: %run benchmark.py
Elapsed time for generate_spin2ecl_quaternions: 0.848702189 s
Shape of the quaternions:  (86401, 4)
Speed: 1.0e+05 quat/s
Elapsed time for get_pointings: 1.822759492 s
Shape of the pointings:  (1, 4320000, 3)
Speed: 2.4e+06 pointings/s

However, since this is the first run, the timings include the JIT compilation phase. Running the script once more on the same prompt provides something in line with what @mreineck expects:

In [2]: %run benchmark.py
Elapsed time for generate_spin2ecl_quaternions: 0.010928339 s
Shape of the quaternions:  (86401, 4)
Speed: 7.9e+06 quat/s
Elapsed time for get_pointings: 1.492646147 s
Shape of the pointings:  (1, 4320000, 3)
Speed: 2.9e+06 pointings/s

Now there are roughly 8 million quaternions/s and ~3 million pointings/s.

Could it be that the slow times we are seeing are due to Numba's JIT? To check whether this is true, it might help to structure the code so that the main part is wrapped in a function that takes the number of samples, and then call it twice:

def main(num_of_samples=None):
    # Here comes the script. If `num_of_samples` is None, just use the start
    # and end times specified in the TOML file.
    ...


main(50)  # Warm up Numba by running a simulation with very few samples
main()    # Now run the full simulation

What do you think? Does it make sense?

@mreineck
Collaborator

mreineck commented Aug 29, 2022

Thank you very much for the test, @ziotom78!

The warmed-up numbers indeed look much more like what I expected. I also agree that the testing strategy makes perfect sense.
Still I wonder if something might be broken on the NERSC machine where the original run was done ... numba should not spend several seconds in non-jitted code, or am I mistaken?
I'm wondering if numba was simply inactive during that run for some reason and pure Python was executed instead.

@ziotom78
Member

Hopefully, I found a smarter way to do the test… In #190 I simply defined a _precompile function in scanning.py that is always called when the module is imported. This makes the benchmark code I posted above run at maximum speed even during its first call.
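The idea is roughly the following (a minimal sketch; _example_kernel is a hypothetical stand-in for the real Numba-jitted functions in scanning.py):

import numpy as np
from numba import njit


@njit
def _example_kernel(quats):
    # Hypothetical stand-in for the real jitted kernels
    return quats.sum()


def _precompile():
    # Call each jitted kernel once on a tiny input at import time, so that
    # the JIT compilation cost is not paid during the first real run
    dummy_quats = np.zeros((2, 4))
    dummy_quats[:, 3] = 1.0
    _example_kernel(dummy_quats)


_precompile()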

The same PR adds the benchmark code to the directory benchmarks; now that we've started producing simulations, I would like to populate that folder with other scripts as well.

@nraffuzz , @marcobortolami , when you have time, could you please test the E2E scripts with the code in #190, to see if the execution speed has really improved? I do not think this is urgent, so take your time!

@marcobortolami
Contributor

Thank you very much!
@nraffuzz and I ran e2e_simulation.py on Cineca and it took 7m35s instead of the 8m37s on NERSC.
We then tested the _precompile() function in the following way: we ran e2e_simulation.py with the _precompile() call in litebird_sim commented out and then uncommented, under the same simulation conditions.
Here are the results:

_precompile() not commented:
simulation time:  45.193973541259766
time for filling 1/f noise timeline:  4.553291320800781
time for filling white noise timeline:  0.10382962226867676
time for obs initialization:  0.048781633377075195
time for pointings:  2.1843249797821045
time for reading and scanning map for TODs:  4.006235122680664
time for dipole construction:  2.5596871376037598
time for saving tods:  0.6159660816192627
time for saving tods:  116.41815876960754
time for report:  4.310925245285034
_precompile() commented:
simulation time:  46.06910419464111
time for filling 1/f noise timeline:  4.037186145782471
time for filling white noise timeline:  0.10503125190734863
time for obs initialization:  0.04358625411987305
time for pointings:  2.195441961288452
time for reading and scanning map for TODs:  4.122196912765503
time for dipole construction:  2.6025073528289795
time for saving tods:  0.824676513671875
time for saving tods:  117.44954562187195
time for report:  4.448533058166504

We do not think that the _precompile() function has a great impact on the running times, but it is still a smart improvement to the code.

@ziotom78
Member

ziotom78 commented Oct 5, 2022

Thanks @marcobortolami , very nice test! In some sense it does not look so weird that there is no significant performance difference, because I have always seen that Numba is very fast at compiling stuff… The weird result is the one we got a few weeks ago, where _precompile() made a huge difference. But it's quite reassuring that on Cineca the time required to generate the pointings is ~2 s!

Anyway, since it doesn't hurt, I would prefer to leave _precompile() where it is.

@marcobortolami
Contributor

Hi.

  • For the last test we used only 2 detectors (sorry for not mentioning this in the comment).
  • I also think that it is better to leave the _precompile() function.

Do you think we can close the issue now? Or is there anything else to discuss about this issue?

@ziotom78
Member

Yes, I believe it can be closed. Thank you!
