Managing Memory

Why do I need to specify memory?

Specifying the correct amount of memory is an important balancing act. If you don't request enough, your job will crash when it runs out. If you request too much, your jobs will wait a long time to start due to contention for RAM with other jobs in the cluster, and you won't be able to run as many concurrent jobs.

How do I tell the system how much memory a job needs?

You must tell the (x86) systems if you need more than 2GB of memory per process.

The way the memory request is made depends on the type of job you need to run. It is important not to mix up the two methods as this will mean that the job will not get the resources you intended.

For single cpu and multi-cpu MPI jobs, each cpu has its own private memory allocation, requested by the option: --mem-per-cpu=

For SMP or multithreaded jobs, all available memory needs to be accessible to all cpus (or threads). To access memory in this way, the memory needs to be physically on one node (motherboard). This type of memory is requested by the option: --mem=

For MPI parallel and single cpu jobs, you tell the system how much memory you need per cpu (i.e. divide the total memory needed by the number of cpus requested).

MPI example:

#!/bin/bash
# A parallel job using 32 cores and a total of 64GBytes RAM (= 2GBytes per core)
#SBATCH --ntasks=32
#SBATCH --mem-per-cpu=2048
#SBATCH --time=10:00:00
srun my_mpi_job
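
The division described above can be sketched in shell. This is only an illustration: mem_per_cpu_mb is a hypothetical helper, not a Slurm command, and the figures are the ones from the MPI example (64GB shared across 32 tasks, expressed in MB).

```shell
# Sketch: derive a --mem-per-cpu value (in MB) from a job's total memory need.
# mem_per_cpu_mb is a hypothetical helper name, not part of Slurm.
mem_per_cpu_mb() {
    total_mb="$1"
    ncpus="$2"
    # Round up so the per-cpu shares never add up to less than the total.
    echo $(( (total_mb + ncpus - 1) / ncpus ))
}

# Illustrative figures matching the MPI example: 64GB (65536MB) over 32 tasks.
ntasks=32
per_cpu=$(mem_per_cpu_mb 65536 "$ntasks")
echo "#SBATCH --ntasks=${ntasks}"
echo "#SBATCH --mem-per-cpu=${per_cpu}"
```

Rounding up is a design choice in this sketch: when the total does not divide evenly, the job gets slightly more than the total rather than slightly less.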

Single cpu example:

#!/bin/bash
# A serial or single cpu job using 4GBytes RAM
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=4096
#SBATCH --time=10:00:00
myjob

For SMP or multithreaded applications, you request the total memory required and specify that it is to be shared by all cores of your job.

SMP example:

#!/bin/bash
# A multithreaded application that shares 102400MB of memory on 8 cores of a single node.
#SBATCH --mem=102400
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
myjob -n$SLURM_CPUS_ON_NODE     # if your program needs to know the number of cpu, use $SLURM_CPUS_ON_NODE
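
If the same script is sometimes run outside of Slurm (for example, for a quick test on a login node), $SLURM_CPUS_ON_NODE will not be set. A minimal sketch of a fallback follows. It assumes a default of one cpu is acceptable, and the OMP_NUM_THREADS line only applies if the program happens to use OpenMP, which is an assumption and not something stated for myjob.

```shell
# Use the cpu count Slurm provides inside a job; fall back to 1 elsewhere.
ncpus="${SLURM_CPUS_ON_NODE:-1}"

# Assumption: only relevant for OpenMP programs, which read this variable.
export OMP_NUM_THREADS="$ncpus"

echo "Running with ${ncpus} cpu(s)"
```

The program would then be started as myjob -n"$ncpus", in the same way as in the SMP example.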

How much memory do I need?

Unfortunately, determining the maximum amount of memory that a job needs requires experimentation. One approach is to run a small version of the job (or a short version of the real thing) and see how much memory it uses.

If your job uses more memory than you requested, it will be killed. The error message is unlikely to say that it was killed due to memory (this depends on the program). Most likely there will be a message similar to: signal 9 (Killed)

How much memory have I used?

To see how much memory a job used, use the sacct command and view the maximum resident set size (the maxrss field). For example, for a job number of JOBID, use:

sacct -o jobid,jobname,reqmem,maxrss,state -j JOBID

For all of your jobs from a given date, and given machine, e.g. 1 Jan 2015 on merri:

sacct -o jobid,jobname,reqmem,maxrss,state -S 2015-01-01 -M merri

This displays the maximum resident set size (maxrss) used by the program, in KB. This is the peak physical memory in use, not the virtual memory size. This command can only be used after the job has finished.
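
Because the memory options in the examples above are given in MB while maxrss is reported in KB, a small conversion can make the two easier to compare. The following is a sketch only: to_mb is a hypothetical helper, and it assumes the value carries a K, M or G suffix (a bare number is treated as bytes).

```shell
# Sketch: convert a size such as "2097152K" to whole MB (rounded down).
# to_mb is a hypothetical helper name; the suffix handling is an assumption
# about how the values are printed on your system.
to_mb() {
    value="$1"
    number="${value%[KMGkmg]}"   # strip a trailing unit letter, if any
    case "$value" in
        *[Kk]) awk -v n="$number" 'BEGIN { printf "%d\n", n / 1024 }' ;;
        *[Mm]) awk -v n="$number" 'BEGIN { printf "%d\n", n }' ;;
        *[Gg]) awk -v n="$number" 'BEGIN { printf "%d\n", n * 1024 }' ;;
        *)     awk -v n="$number" 'BEGIN { printf "%d\n", n / 1048576 }' ;;
    esac
}

# Example with a made-up value: a job that peaked at 2097152K.
to_mb 2097152K
```

The result can be compared directly with the number given to --mem or --mem-per-cpu when deciding how much to request for the next run.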