Working with GNU Parallel on Blue Crab Using a Single Node

Downloads

example_lapack.py

example_lapack.job

If you are on MARCC Blue Crab, right-click to copy the link location and run wget <paste link> to obtain each file.

Introduction

This is a small, self-contained document to help users distribute their ‘independent’ tasks on MARCC systems by submitting a single-node job that runs multiple multithreaded tasks.  This document is especially useful for users who need to run many small job steps and manage the complexity of parameterized variable studies.

While job arrays can also work, this method makes the best use of your allocation, since all studies are performed on the same node, improving the likelihood that all jobs complete successfully.  With job arrays, some jobs often need re-submits due to compute node failures or file I/O problems at the time of submission.  Job arrays can also spread across a wider span of time, since you are asking the scheduler to manage when they start, which depends on the availability of resources on Blue Crab.

GNU Parallel has a powerful feature to ‘resume’ jobs that have failed by using the --joblog and --resume options together. This means that jobs that have not completed can be rerun simply by resubmitting your script! That is convenient. When re-testing jobs, do not forget to delete your completed logs/runtask.log, because otherwise you have indicated to GNU Parallel that your jobs are all complete.
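
As a minimal sketch (using the same command this tutorial builds up to below), the first pass records every finished task in the joblog, and a rerun skips the tasks already recorded:

# first pass: each completed task is recorded in logs/runtask.log
parallel --joblog logs/runtask.log --resume "python example_lapack.py {1}000 6" ::: {1..10}
# rerun after an interruption: tasks already in the joblog are skipped
parallel --joblog logs/runtask.log --resume "python example_lapack.py {1}000 6" ::: {1..10}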

Note to external readers: on our SLURM configuration we use TaskPlugin = task/affinity. This tutorial has not been tested with TaskPlugin = task/cgroup.

A Sample Code

Here’s an example Python script, example_lapack.py, that solves a linear system with a randomly generated matrix ‘a’ and vector ‘y’ using LU factorization with the MKL libraries.  Using a square matrix of size 10000, it typically takes about 4.3 seconds on a Broadwell compute node on Blue Crab using 24 MKL threads.  Let’s use this script to perform a study of varying matrix size, from 1K to 10K in steps of 1K, for a total of 10 steps, using 6 MKL threads for each case.

After loading a suitable Anaconda Python (one that provides the ‘MKL’ library and support), the program would then be called as:

python example_lapack.py <size> <threads>
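
For a quick interactive test, you might run the following (assuming the Anaconda Python module loaded in the job script further below; the size and thread count match the sample output):

ml anaconda-python/2.7
python example_lapack.py 10000 24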

If everything looks good, you will see output like:

Solved a matrix of size: 10000 using 24 threads.
Relative Error: 1.00001308209e-10
--- Random Matrix Generation Time: 1.58166694641 seconds ---
---                 Solution Time: 4.37026000023 seconds ---
"""
Running LAPACK with Anaconda Python
Copyright 2016 Ohio Supercomputer Center (OSC)
Copyright 2017 Maryland Advanced Research Computing Center (MARCC)

Attributions:
This example was inspired by
    "Numerical Computing with Modern Fortran" (SIAM, 2013) by Hanson & Hopkins

"""
#!/cm/shared/apps/anaconda2/4.4.0/bin/python
# global imports for example_lapack.py
from __future__ import print_function
import argparse
import sys
import time
import logging
import numpy as np

try:
    import mkl
except ImportError:
    raise ImportError('This version of Python does not have "mkl", load with ' +
                      '"module load python/2.7-anaconda"')

try:
    from scipy.linalg.lapack import dgetrf, dgetrs
    from scipy.linalg.blas import dnrm2, dgemv
except ImportError:
    raise ImportError('This version of Python does not have access to a ' +
                      'lower-level lapack/blas routine.')

def main():
    """
    This example computes the following:
        1. Random number generation to fill a matrix 'a' of dimension nxn and
        a vector 'y' of dimension n
        2. Pre-solve a*y = b so that we
        have 'b'.  This uses dgemv.
        3. Perform LU factorization (dgetrf) on dense matrix 'a' and store
        matrix and pivots in 'lu' and 'piv'
        4. Solve for x given 'lu' and 'piv' arrays (dgetrs)
        5. Compute the L2 norm of the difference between the solution and known
        vectors, divided by the L2 norm of the known 'y'.  This provides a
        single-point measure of the relative error.

    Inputs: dimension of n

    Error checks: NONE currently
    """

    # log and send it to stderr.
    logging.basicConfig(level=logging.INFO)

    parser = argparse.ArgumentParser()
    parser.add_argument("dimension", type=int, default=5, nargs='?',
                        help="The dimension of square matrix A")
    parser.add_argument("threads", type=int, default=20, nargs='?',
                        help="The number of threads")
    # grab the options here from the command line
    args = parser.parse_args()
    n = args.dimension
    mkl.set_num_threads(args.threads)

    # begin timing random number matrix generation
    time_1 = time.time()

    logging.debug('Dimension of square n by n matrix is:' + str(n) + '\n')
    a = np.random.rand(n, n)
    y = np.random.rand(n)

    logging.debug('a:' + np.array_str(a) + '\n')
    logging.debug('y:' + np.array_str(y) + '\n')

    # begin timing LAPACK
    time_2 = time.time()

    try:
        b = dgemv(1, a, y)
    except AttributeError:
        # catch when python scipy blas does not have dgemv
        print('This version of Python does not have access to the lower-level ' +
              'dgemv routine.')
        sys.exit(1)

    logging.debug('b:' + np.array_str(b) + '\n')
    lu, piv, _ = dgetrf(a)  # lu factorization
    x, _ = dgetrs(lu, piv, b)  # solve for x

    logging.debug('x:' + np.array_str(x) + '\n')
    relerr = dnrm2(x-y) / dnrm2(y)

    # end timing LAPACK
    time_3 = time.time()

    print("Solved a matrix of size:", n, "using", mkl.get_max_threads(), "threads.")
    print('Relative Error:', relerr)
    print("--- Random Matrix Generation Time: %s seconds ---" % (time_2 - time_1))
    print("---                 Solution Time: %s seconds ---" % (time_3 - time_2))

# main script begin
if __name__ == "__main__":
    main()

Submitting with SLURM and GNU Parallel Using 6 cores per task

Now that we have a code that solves something, we can use it inside the next script, example_lapack.job.

While we are using a multithreaded Python application with the MKL libraries, the same script can be adapted to single-core applications (see the single-core sketch after the job script).

#!/bin/bash -l
#SBATCH --job-name=test_with_gnu_parallel
#SBATCH --partition=parallel
#SBATCH --time=1:0:0
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=6

# in this example we are using multithreading via MKL threads
# if you have single core tasks set ntasks=24

# You can optionally use these
###SBATCH --output test_with_gnu_parallel_%j.out
###SBATCH --error test_with_gnu_parallel_%j.err
###SBATCH --mail-user=username@jhu.edu
###SBATCH --mail-type=ALL

# See this resource for 'inspiration' in using GNU parallel
# https://rcc.uchicago.edu/docs/tutorials/kicp-tutorials/running-jobs.html

ml anaconda-python/2.7
ml parallel
ml

mkdir -p logs

# start timestamp
date

# this call is good for single-core tasks, but --exclusive may degrade performance for multithreaded tasks
srun="srun --exclusive -N1 -n1 -c6"

# --delay .2 prevents overloading the controlling node
# -j is the number of tasks parallel runs concurrently, so we set it to 4 (matching --ntasks above)
# --joblog makes parallel create a log of tasks that it has already run
# --resume makes parallel use the joblog to resume from where it has left off
# the combination of --joblog and --resume allow jobs to be resubmitted if
# necessary and continue from where they left off
parallel="parallel --delay .2 -j 4 --joblog logs/runtask.log --resume"

# gnu parallel: this runs the parallel command we want
# in this case, we are running example_lapack.py for each matrix size
# parallel uses ::: to separate options. Here {1..10} is a shell expansion
# so parallel will run the command passing the numbers 1 through 10
# via argument {1}
# srun arguments:
# (single core) the --exclusive flag makes srun use distinct CPUs for each job step
# -N1 -n1 -c6 allocates one task with six cores to each job step
echo $parallel "$srun python example_lapack.py {1}000 6" ::: {1..10}
$parallel "$srun python example_lapack.py {1}000 6" ::: {1..10}

# end timestamp
date
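
For single-core applications (as mentioned above), a hedged sketch of the changes would look like the following; the task script name my_serial_task is a hypothetical placeholder:

#SBATCH --ntasks=24
#SBATCH --cpus-per-task=1

# one core per job step, 24 concurrent tasks to fill the node
srun="srun --exclusive -N1 -n1 -c1"
parallel="parallel --delay .2 -j 24 --joblog logs/runtask.log --resume"
$parallel "$srun ./my_serial_task {1}" ::: {1..100}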

Confirmation

Here’s a dump of our log file so you know what to expect for timings.  Note that the program results are not shown, only the timing of each task.

logs/runtask.log

Seq	Host	Starttime	JobRuntime	Send	Receive	Exitval	Signal	Command
1	:	1504282157.159	     1.867	0	207	0	0	srun -N1 -n1 -c6 time python example_lapack.py 1000 6
2	:	1504282157.376	     1.735	0	205	0	0	srun -N1 -n1 -c6 time python example_lapack.py 2000 6
3	:	1504282157.607	     1.803	0	205	0	0	srun -N1 -n1 -c6 time python example_lapack.py 3000 6
4	:	1504282157.837	     2.083	0	204	0	0	srun -N1 -n1 -c6 time python example_lapack.py 4000 6
5	:	1504282158.065	     2.621	0	204	0	0	srun -N1 -n1 -c6 time python example_lapack.py 5000 6
6	:	1504282158.268	     3.294	0	203	0	0	srun -N1 -n1 -c6 time python example_lapack.py 6000 6
7	:	1504282158.507	     4.117	0	203	0	0	srun -N1 -n1 -c6 time python example_lapack.py 7000 6
8	:	1504282158.720	     5.243	0	201	0	0	srun -N1 -n1 -c6 time python example_lapack.py 8000 6
9	:	1504282158.928	     6.760	0	203	0	0	srun -N1 -n1 -c6 time python example_lapack.py 9000 6
10	:	1504282159.140	     8.870	0	204	0	0	srun -N1 -n1 -c6 time python example_lapack.py 10000 6

Post Analysis Commands

If you want to re-run all cases, delete the logs/runtask.log file:
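
rm logs/runtask.log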

If you want to sort the Seq values in order:

sort -k1 -n logs/runtask.log

If you want a human readable form of the epoch times in the log file:

for t in $(cat logs/runtask.log | cut -f3 | cut -d'.' -f1 | tail -n +2); do date -d @$t; done
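
If you want the total compute time across all tasks (summing the JobRuntime column of the joblog shown above):

awk 'NR>1 {sum += $4} END {print sum, "seconds"}' logs/runtask.log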

Comments

What jobs make the best use of GNU Parallel?  Good candidates are SLURM jobs with 100s to 1000s of cases that use a single node and leverage task parallelism across independent sets of submissions (though you can enforce that one set completes as a dependency before running another set; see the sketch below).
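
As a hedged sketch of that dependency pattern (the job script names are hypothetical), one set can be chained after another with SLURM dependencies:

# submit the first set and capture its job ID
jobid=$(sbatch --parsable set_one.job)
# run the second set only after the first completes successfully
sbatch --dependency=afterok:${jobid} set_two.job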

Suppose if you work with varying cores in your study, instead of varying matrix sizing, that is not ideal because SLURM does not deal with core variation, as --exclusive mode appears to work on a ‘task’ basis.   In addition, this script is better suited for throughput when not focusing on performance timing. GNU parallel is more optimally used when you have tasks greater than the total number of tasks available on a node because in that case, you can just work with individual srun submissions as shown here: NERSC – Running Multiple Parallel Jobs Simultaneously