Starting with the cluster and Python/Jupyter

Hi everyone,

I would like to start using the cluster for Python / IPython notebook simulations; however, my understanding is that some things need to be set up first (I have no prior experience with a cluster).

I have already created an hpc05 account and have gone through the very first steps of logging in and transferring files (PuTTY and WinSCP installed). I mainly used the hpcwiki, but I could not find any information on what needs to be set up in order to run .ipynb notebooks on the cluster.

If anyone has experience with this, I would be grateful for some help!

1 Like

Hi @mkounalakis!

Can you describe your planned computation in more detail? Usually one uses the cluster in order to benefit from running everything in parallel. A single Jupyter notebook, however, is nearly the opposite: it is interactive, so it sits idle a lot of the time (all the time you spend implementing new code, viewing results, etc.).

Typically we use a separate package like ipyparallel or dask to distribute tasks over the cluster. We do launch these packages from a notebook, though.

1 Like

Yes certainly!

I numerically solve a 2D Fokker-Planck equation, i.e. a partial differential equation containing a partial derivative with respect to time and partial derivatives with respect to x and y. I discretise the equation in the spatial dimensions (an N by M grid), so I get a system of N by M ordinary differential equations, which I solve with a Python ODE solver. The result is the probability density P(t,x,y) for all t, x and y, which I can use to calculate, for example, <x(t)>.
(This runs successfully on my laptop, although I have to severely constrain the grid size due to memory issues.)
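
Schematically, the serial version looks something like the sketch below (heavily simplified, with made-up drift and diffusion coefficients and periodic boundaries, not my actual code):

import numpy as np
from scipy.integrate import solve_ivp

N, M = 60, 60                                  # modest grid so it fits in memory
x = np.linspace(-5, 5, N)
y = np.linspace(-5, 5, M)
dx, dy = x[1] - x[0], y[1] - y[0]
X, Y = np.meshgrid(x, y, indexing='ij')
D = 0.5                                        # diffusion constant (placeholder)
Ax, Ay = -X, -Y                                # drift field (placeholder)

def rhs(t, p_flat):
    """Method-of-lines right-hand side: dP/dt on the flattened N*M grid."""
    P = p_flat.reshape(N, M)
    # central differences; np.roll gives periodic boundaries, purely for brevity
    dJx = (np.roll(Ax * P, -1, 0) - np.roll(Ax * P, 1, 0)) / (2 * dx)
    dJy = (np.roll(Ay * P, -1, 1) - np.roll(Ay * P, 1, 1)) / (2 * dy)
    lap = ((np.roll(P, -1, 0) - 2 * P + np.roll(P, 1, 0)) / dx**2
           + (np.roll(P, -1, 1) - 2 * P + np.roll(P, 1, 1)) / dy**2)
    return (-dJx - dJy + D * lap).ravel()

P0 = np.exp(-(X**2 + Y**2))                    # initial density
P0 /= P0.sum() * dx * dy
t_eval = np.linspace(0, 5, 51)
sol = solve_ivp(rhs, (0, 5), P0.ravel(), t_eval=t_eval)

Pt = sol.y.T.reshape(len(t_eval), N, M)        # P(t, x, y)
x_mean = (Pt * X).sum(axis=(1, 2)) * dx * dy   # <x(t)>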

However, the quantity I am really interested in is the correlation function <x(t)x(t’)>, which requires knowledge of the conditional probability density P(t,x,y|t’,x’,y’). In practice this means I have to run the simulation above N by M times (once for each initial condition P(t’,x’,y’)=1), so this is where the cluster could help speed things up.

I would also like to compute P(t,x,y) while sweeping one of the parameters in the equation, which, as far as I understand, is exactly what the cluster is helpful for.

If running Jupyter notebooks is much more complicated than running Python scripts, I could also go for that option. But I am not sure how to set it up, so any help from someone with experience would be greatly appreciated!

1 Like

An easy way to start getting used to the cluster is interactive mode. To use it, execute qsub -I -l nodes=2 on the cluster, which will start a job and put you on one of the nodes. The cores allocated to you are listed in the machinefile, which you can view on hpc05 with cat $PBS_NODEFILE. You should be able to ssh into any of the listed nodes. The next step is to make ssh passwordless between the allocated nodes, for which I used a neat bash one-liner that I unfortunately can’t recall.

When that is done, you can feed the machinefile to most systems. (Unfortunately, I don’t understand why the first Google results for machinefile and Python lead to MPI.) For Julia, I can use the machinefile with julia --machine-file=$PBS_NODEFILE, which starts workers and connects them to the master instance.

Note that this setup makes it easy to hold resources longer than necessary, since you have to exit the job manually. That is probably also why interactive jobs are limited to something like 47 cores. Nevertheless, I find this setup quite useful while debugging code and parameters before an ordinary job submission. Perhaps it could be a useful workflow for you as well.

1 Like

Thanks a lot for your replies @janiserdmanis and @anton-akhmerov!

I am trying to set things up (running Jupyter + ipyparallel) with help from a colleague. Once I manage, I will post the process I followed here (or come back with more questions!).

3 Likes

Also, as a quick check: the memory utilization of a 2D Fokker-Planck solver shouldn’t be exorbitant. If you share a minimal code example, I’d be happy to take a look and see if something stands out.

OK, I will post the process I followed here for future users who want to set this up (if you have corrections or alternatives, feel free to reply to this post).

Anaconda
First I installed Anaconda in my home directory so that I can install any packages I want (Anaconda also comes with IPython and Jupyter). I did this using the following commands (in my PuTTY terminal):

wget -P /tmp https://repo.anaconda.com/archive/Anaconda3-2020.02-Linux-x86_64.sh
bash /tmp/Anaconda3-2020.02-Linux-x86_64.sh

Then I added anaconda3 to my PATH by modifying the .bashrc file (see also: https://problemsolvingwithpython.com/01-Orientation/01.05-Installing-Anaconda-on-Linux/).

Jupyter notebook
You can start a Jupyter notebook server (for this it’s good to open a new screen session in your terminal) with the following command:

jupyter notebook --no-browser --port=nnnn

where nnnn is a four-digit port number of your choice. A URL containing a token will then appear: http://localhost:nnnn/?token=......

You can then access the notebook from your own computer by typing the following in a terminal on your local machine:

ssh -N -f -L localhost:xxxx:localhost:nnnn yournetid@hpc05.tudelft.net

(xxxx is a different four-digit port number of your choice) and entering your NetID password. Then open a browser on your computer and go to the same URL with the token:

http://localhost:xxxx/?token=......

Now you should be able to run Jupyter notebooks on hpc05 from your own computer.

Reserving nodes
For reserving nodes on the cluster I use ipyparallel, which has good documentation on how to set it up (https://ipyparallel.readthedocs.io/en/latest/process.html). The command for starting N engines is

ipcluster start --profile=pbs --n=N

However, if you also want the engines to run using your conda environment, you need to use this command:

anaconda3/bin/python anaconda3/bin/ipcluster start --profile=pbs --n=N
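
To check that the engines have started and to send work to them from the notebook, you can then do something like the following (a minimal sketch; run_sim and param_values are placeholders for your own function and parameter values):

import ipyparallel as ipp

rc = ipp.Client(profile='pbs')        # connects to the engines started by ipcluster
print(len(rc))                        # number of engines that have registered so far

def run_sim(value):
    return value ** 2                 # placeholder for the actual simulation

param_values = range(100)
view = rc.load_balanced_view()        # hands tasks to whichever engine is free
results = view.map_sync(run_sim, param_values)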

3 Likes

Sorry about the memory issue; it turned out to be due to a GIF-generating script that kept running in my notebook…

1 Like

Hi guys, thanks for the information. I’ve been trying to set this up to run some 2D kwant simulations on 24 cores after moving here from the MS cluster. Following your steps gets the Jupyter notebook running, and indeed it can reserve cores with

ipcluster start --profile=pbs --n=N

However, these don’t show up in qstat; does that mean I’m running this locally on the master node?
As far as I understand, it would be better to reserve one node with N cores, run the Jupyter notebook there, and close it when done. I tried this by directly submitting a job with qsub and then starting a Jupyter notebook inside that job:

#!/bin/bash
#
# Torque directives (#PBS) must always be at the start of a job script!
#
# Request nodes and processors per node
#
#PBS -l nodes=1:ppn=1
#
#
# Set the name of the job
#
#PBS -N jupyter-node
#
jupyter notebook --no-browser --port=8888

followed by running the following on the hpc05 login node:

ssh -N -f -L localhost:9999:localhost:8888 netid@[node]

to forward the Jupyter notebook to the hpc05 localhost so that I can then access it in a browser. But checking localhost:9999 from hpc05 doesn’t return any data.

I imagine there are better ways to do this, or is running

ipcluster start --profile=pbs --n=N

simply enough and correct, even though it doesn’t show up in qstat?

Perhaps an even better way would be to reserve cores only while running a cell in Jupyter and to release them afterwards using a context manager, which, if I understand correctly, should be available in ipyparallel.
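
Something like the sketch below is what I have in mind (untested; it assumes a recent ipyparallel where Cluster works as a context manager, and sim_func stands in for the real kwant simulation):

import ipyparallel as ipp

def sim_func(val):
    return val ** 2                   # placeholder for the real simulation

sweep_vals = range(100)

# engines would be submitted when the block starts and shut down when it exits,
# so the cores are only reserved for the duration of the sweep
with ipp.Cluster(n=24, profile='pbs') as rc:
    view = rc.load_balanced_view()
    data = list(view.map_sync(sim_func, sweep_vals))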

Any thoughts on this? Thanks in advance!

Hey Jaap,

How did you create the pbs profile and what’s in the ipcluster_config.py? Note that you need to first generate the ipython profile and then specify the launcher class in it:

https://ipyparallel.readthedocs.io/en/latest/reference/launchers.html#launchers

Hi Anton, thanks for your reply. I’ve looked into the profile, and indeed I hadn’t created one before, so I was (stupidly) using the local default settings for the controller and launcher. I now changed it by setting the engine line to

c.IPClusterEngines.engine_launcher_class = 'PBSEngineSetLauncher'

And now, indeed, when I run the cluster via

import ipyparallel as ipp
rc = ipp.Cluster(n=4, profile='pbs', controller_ip='*').start_and_connect_sync()

I get a success message and have access to the engines that run on a node visible in qstat.

However, now I run into the problem that when running a Jupyter notebook the situation is quite different from what I had before (running the notebook on a cluster node and using the local cores and shared memory).
I used to run code as follows:

from concurrent.futures import ProcessPoolExecutor

with ProcessPoolExecutor() as ex:
    ret = ex.map(sim_func, sweep_vals)
data = list(ret)

But now, if I do the ipyparallel version

dview = rc[:]
ret = dview.map_sync(sim_func, sweep_vals)

I get errors related to the fact that the engines don’t share the imports and the memory of the main Jupyter notebook. I can try to fix this by pushing the required variables to the engines, but it would be nicer if it were possible to just run a notebook server on one node and then use several cores with shared resources. Do you think that is possible? Or is there a simple way of syncing the memory (e.g. the kwant.system and all parameters)?

Hi Jaap,

Have you tried using Client? For me the following works fine:

from ipyparallel import Client
rc = Client(profile='pbs')

(if you print len(rc) you can also see how many of your engines have started)

Then you can use

rc[:].map_sync(sim_func, sweep_vals)

Hi Marios,

Thanks for your answer. That works, but I don’t think it removes the problem I have with ipyparallel not sharing memory. E.g. if I imported numpy in the Jupyter notebook, the parallel engines didn’t get it when running sim_func, and equally they didn’t have access to variables defined earlier in the notebook.

I think I now have a working version based on my original plan:

First run
qsub jupyter-node.sh
which contains:

#!/bin/bash
#
# Torque directives (#PBS) must always be at the start of a job script!
#
# Request nodes and processors per node
#
#PBS -l nodes=1:ppn=24
#
#
# Set the name of the job
#
#PBS -N jupyter-node-24c
#
jupyter notebook --no-browser --port=8889

This makes a 24-CPU node available and runs a Jupyter notebook server on port 8889.

This outputs a job number. Then open the job’s output file (e.g. with nano) to find the token.

Also, do qstat -u [user] -n

to see which node the job landed on. Then set up port forwarding from hpc05 to that node:

ssh -N -f -L localhost:9998:localhost:8889 [netid]@[node]

And finally, set up port forwarding from your local machine to hpc05:

ssh -N -f -L localhost:9995:localhost:9998 netid@hpc05.tudelft.net

Then go to

http://localhost:9995/tree?token=[token]

This works, and now I can use the ProcessPoolExecutor again (with max_workers=24), which shares the memory of the notebook, and the notebook runs on the node reserved by qsub rather than on the main node.
The only thing is that you have to manually stop the job (and release the node) when you are done with Jupyter.
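
Concretely, inside that notebook the sweep looks roughly like this (a sketch; sim_func again stands in for the real kwant simulation):

from concurrent.futures import ProcessPoolExecutor

def sim_func(val):
    # placeholder for the real simulation; on Linux the workers are forked,
    # so they inherit the notebook's imports and module-level variables
    return val ** 2

sweep_vals = range(100)

with ProcessPoolExecutor(max_workers=24) as ex:    # one worker per reserved core
    data = list(ex.map(sim_func, sweep_vals))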

Hey Jaap,

ipyparallel does not automatically share memory or imports across nodes, so you will need to explicitly run all initialization code that imports the necessary modules and defines the relevant functions (unless they are passed to the engines by e.g. load_balanced_view.map). A common way to do this, if I remember my ipyparallel correctly, is the %%px cell magic. So after you have created a cluster client and all engines have connected, you’d do

%%px --local
import numpy as np

That imports numpy both locally and on each ipyparallel worker.
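
For data that already exists in the notebook (the kwant system, parameter dicts, and so on) you can also push it to the engines explicitly; roughly (the variable names here are just placeholders):

import ipyparallel as ipp

rc = ipp.Client(profile='pbs')
dview = rc[:]                                  # direct view on all engines

params = dict(gamma=0.3, temperature=1.0)      # placeholder notebook variables
dview.push(dict(params=params))                # copy them into each engine's namespace
# equivalently: dview['params'] = params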

Hi Marios,

I’m trying to translate this thread into a “quickstart guide” for people in my group that want to run parallel stuff on the cluster.

I found your trick of running a Jupyter notebook server on the head node and then tunneling the connection to your local browser a great idea, and I have written that into my guide.

However, as a total parallel-noob, I got a bit stuck on how to extend this to parallelized code. Some questions:

  • What are the steps I need to do this?
  • Do I need to separately run the ipcluster command to create the node reservation? Can I do that at any time or do I have to do this before I start the jupyter server? And what does this do? Does it already block the nodes? Or does that only happen when I use a command like map_sync? Do I need to “release” them later?
  • I guess I need to do something to specify that my code should run on the python I installed in my home directory?
  • Do I need to “release” the nodes?
  • Why do you use only 4 nodes? Can you pick more later?

Maybe you could share your code (or, if you have time, a minimal example of runnable code)? That would help a lot.

My aim is to eventually finish the second half of my “quickstart guide” to minimize the barrier for people in my group to start using the cluster.

Thanks!
Gary