Python is one of the most popular and versatile programming languages available to scientists. It supports multiple programming styles and emphasizes the readability of your code. The straightforward syntax, combined with an extensive standard library and a native interface to high-performance, low-level compiled languages (namely C), has led to the widespread use of Python in the scientific computing world. Python’s nearest neighbors in the space of scientific programming include R and Matlab. As all three languages have matured, they have converged in functionality, and each now has a large user base. We encourage all users to explore Python and its associated libraries when planning their calculations.
More than just Python
This article reviews the recommended methods for using Python at MARCC. Both Python and R have a large library of extra codes, many of which draw on programs compiled in other languages and distributed by package managers (e.g. apt-get). The guide below provides some options for installing or accessing these external codes.
If your software requires many operating system-level packages, that is, those that require yum install or compiling from source, or you require a very large set of R or Python libraries, please skip below to the custom conda environments instructions. We use Anaconda as a general package manager to support large sets of software dependencies which cannot be found on our software tree.
Options for controlling Python environments
We encourage all users to carefully select the best environment option for their software targets.
- Use the software modules. If your desired code is available in our software modules (module avail), this is the best option because it provides the code you need with very little extra configuration. Visit our software modules guide to see if we already offer the software you want.
- Install a virtual environment (case A). For codes which are not available in the default modules, the quickest solution is to use a Python virtual environment.
- Build a conda environment (case B). Anaconda Cloud provides a larger set of Python packages, as well as executables which you might otherwise install with your operating system package manager. This option provides the most customization.
- Build a conda environment in sequence (alternative C).
- Use Singularity (alternative D).
Beware: a word of caution when managing user installed software
For example, if you use pip install --user to install software, it often preempts the methods described below. This method installs packages to ~/.local; however, the precise path depends on which version of Python you are using when you install the code. Because this method is opaque, we recommend using virtual environments instead. You can check whether user-installed packages are in use by running python -m site, which summarizes the underlying package locations.
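The same check can be made from inside Python. The following is a minimal sketch that uses only the standard library site and sys modules; the helper name user_site_report is hypothetical and is not part of any MARCC tooling.

```python
import site
import sys


def user_site_report():
    """Summarize whether per-user packages (pip install --user) are visible."""
    user_site = site.getusersitepackages()  # typically a folder under ~/.local
    return {
        "user_site": user_site,
        # False when user site-packages are disabled (for example, python -s)
        "enabled": bool(site.ENABLE_USER_SITE),
        # True when the folder is actually on the import path
        "on_sys_path": user_site in sys.path,
    }


report = user_site_report()
for key, value in report.items():
    print(f"{key}: {value}")
```

If on_sys_path is True, packages installed with pip install --user can be imported by this interpreter and may take the place of versions you expected from a module or environment.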
Set TMPDIR to a location on ~/data before running any pip install commands. This relieves pressure on our operating system image by using our filesystem for temporary space.
mkdir ~/data/john_doe_tmp
export TMPDIR=~/data/john_doe_tmp
You can clean up these files after the installations are complete.
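To confirm that the TMPDIR setting is visible to Python, you can ask the standard library where it will place temporary files. This is a minimal sketch, assuming that pip follows the same TMPDIR convention as the tempfile module; the helper name temp_location is hypothetical.

```python
import os
import tempfile


def temp_location():
    """Return the directory Python will use for temporary files."""
    # tempfile caches its answer, so clear the cache to re-read TMPDIR.
    tempfile.tempdir = None
    return tempfile.gettempdir()


print("TMPDIR =", os.environ.get("TMPDIR", "(not set)"))
print("temporary files go to:", temp_location())
```

If the second line does not point at your folder on ~/data, check that the directory exists and is writable, since Python falls back to other locations when it cannot write to TMPDIR.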
Case A. Python virtual environments
- Select a Python version using our modules system.
- Select a location to install your environment on the ~/data mount. Do not use the Lustre filesystem (~/work) to hold your environment.
- Build the environment with python -m venv ./path/to/env.
- Activate the environment with source ./path/to/env/bin/activate. You will have to use this command whenever you wish to use the environment.
- Inside the environment you can install packages normally with pip, without the --user flag. For example, you can install numpy with pip install numpy.
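The build step can also be scripted with the standard library venv module, which is what python -m venv runs. The sketch below assumes the target directory sits on a suitable filesystem; the function names build_env and in_virtual_env are hypothetical.

```python
import sys
import venv
from pathlib import Path


def build_env(env_dir, with_pip=True):
    """Create a virtual environment, like `python -m venv env_dir`."""
    env_dir = Path(env_dir)
    venv.EnvBuilder(with_pip=with_pip).create(str(env_dir))
    return env_dir


def in_virtual_env():
    """True when the running interpreter belongs to a virtual environment."""
    return sys.prefix != sys.base_prefix


print("inside a virtual environment:", in_virtual_env())
```

Calling build_env("./path/to/env") only creates the environment. You still need to source its activate script in your shell before installing packages into it.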
Virtual environments can be a useful tool for reproducing your workflow on other machines using the pip freeze command, which lists all of the packages you have already installed.
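As an illustration of how a frozen package list supports reproducibility, the sketch below parses pip freeze style output and compares two lists. It assumes the simple name==version lines that pip freeze prints for packages installed from an index; other line forms (editable installs, direct URLs) are skipped. The function names and the sample package versions are hypothetical.

```python
def parse_freeze(text):
    """Map package name -> pinned version from `pip freeze` style text."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and lines that are not simple pins.
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.lower()] = version
    return pins


def diff_freeze(old, new):
    """Return packages whose pinned version differs between two lists."""
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(set(old) | set(new))
        if old.get(name) != new.get(name)
    }


# Hypothetical package lists from two machines.
laptop = parse_freeze("numpy==1.17.0\nscipy==1.3.1\n")
cluster = parse_freeze("numpy==1.17.0\nscipy==1.4.0\nh5py==2.10.0\n")
print(diff_freeze(laptop, cluster))
```

A package that is missing on one side appears with None in place of its version, which makes it easy to spot what still needs to be installed.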
Case B. Custom conda environments
Note that this is the best solution for users who need to control their Python (or R) version, install packages with pip, use interactive portal tools, or install conda packages from Anaconda Cloud. The use of conda env ensures that you can generate your environment from an easily portable text file. This also resolves version dependencies all at once, so that you don’t paint yourself into a corner. (Users who experience Perl version issues on Blue Crab should consult this guide.)
Note that whenever possible, it is best to load the software module (ml anaconda) to get the conda executable. Our module will allow you to use conda without adding it to your ~/.bashrc file. This prevents later confusion and ensures that you have a clean and straightforward environment.
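If you are unsure whether an earlier installation already modified your shell startup file, you can look for the marker block that conda init writes. This is a minimal sketch, assuming the marker text ">>> conda initialize >>>" used by recent conda releases; the helper name has_conda_init is hypothetical.

```python
from pathlib import Path

# Assumed marker line written by `conda init` in recent conda releases.
CONDA_MARKER = ">>> conda initialize >>>"


def has_conda_init(text):
    """True if shell startup text contains a conda initialization block."""
    return any(CONDA_MARKER in line for line in text.splitlines())


bashrc = Path.home() / ".bashrc"
if bashrc.is_file():
    found = has_conda_init(bashrc.read_text(errors="replace"))
    print(f"{bashrc}: conda init block {'found' if found else 'not found'}")
else:
    print(f"{bashrc} does not exist")
```

The script only reads the file. Whether to remove a block that it finds is a decision to make by hand.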
B1. Find a useful location to make an environment
By default, the conda create command will make an environment in a hidden folder in your home directory, at ~/.conda. These so-called “named” environments are suboptimal because you might forget about them and fill up your quota. It’s much better to specify an absolute path to the environment so you always know where it is.
On Blue Crab we recommend that all users install to ~/ (your home directory; be mindful of the quota) or your shared group directory on our ZFS filesystem at ~/data. The use of our Lustre system at ~/scratch is discouraged because these installations create many small files and this filesystem is not optimal for repeatedly executing binaries.
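To see how many small files an environment contains, you can measure an existing one. The sketch below walks a directory tree and reports the file count and total size; the function name tree_stats is hypothetical, and the demonstration uses a throwaway temporary directory rather than a real environment.

```python
import os
import tempfile


def tree_stats(root):
    """Return (file_count, total_bytes) for regular files under root."""
    count = 0
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # do not count symbolic links
            try:
                total += os.path.getsize(path)
            except OSError:
                continue  # file vanished or is unreadable
            count += 1
    return count, total


# Demonstration on a temporary directory holding three 10-byte files.
with tempfile.TemporaryDirectory() as demo:
    for i in range(3):
        with open(os.path.join(demo, f"file{i}.txt"), "w") as handle:
            handle.write("x" * 10)
    demo_stats = tree_stats(demo)

print("files, bytes:", demo_stats)
```

Pointing tree_stats at the folder of an installed environment gives a quick sense of how much of your quota it uses and how many files it adds to the filesystem.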
Once you find a spot to install the environment, you can continue.
B2. Prepare a requirements file
The requirements file consists of a dependencies list which enumerates the conda packages that you might typically install with conda install. You can specify a channel with the special syntax below (::) when packages are not available on the primary channel. We have also included pip, so that conda can also manage packages from PyPI.
dependencies:
  - python=3.7
  - matplotlib
  - scipy
  - numpy
  - nb_conda_kernels
  - au-eoed::gnu-parallel
  - h5py
  - pip
  - pip:
    - sphinx
Save this as a text file called reqs.yaml. The file follows the YAML format. Note that nb_conda_kernels is necessary to use this environment in Jupyter on Blue Crab. Users who wish to later install packages with pip should include it on the list. This will allow you to run pip install to add a package to this environment and avoid polluting your ~/.local folder, which is what happens with the typical alternative, pip install --user.
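If you assemble the package list programmatically, the requirements file can be written without a YAML library. This is a minimal sketch using only string formatting; render_reqs is a hypothetical helper, and the package list repeats the example requirements file from this section.

```python
def render_reqs(conda_packages, pip_packages=()):
    """Build the text of a conda requirements file."""
    lines = ["dependencies:"]
    lines += [f"  - {pkg}" for pkg in conda_packages]
    if pip_packages:
        # pip itself must be listed as a conda dependency before the pip sub-list.
        if "pip" not in conda_packages:
            lines.append("  - pip")
        lines.append("  - pip:")
        lines += [f"    - {pkg}" for pkg in pip_packages]
    return "\n".join(lines) + "\n"


text = render_reqs(
    ["python=3.7", "matplotlib", "scipy", "numpy", "nb_conda_kernels",
     "au-eoed::gnu-parallel", "h5py", "pip"],
    pip_packages=["sphinx"],
)
print(text, end="")
```

Writing the returned string to reqs.yaml gives a file with the same dependencies layout as the hand-written example.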
B3. Install the environment
We recommend choosing a useful name for your environment. This will help distinguish your environment from others, in case you later use portal tools like Jupyter. Install Anaconda, or if you are using Blue Crab, load the anaconda module. Recall that ml is short for module load when using Lmod.
After selecting a name (my_plot_env), install the environment. If the environment installation is complex, you may wish to reserve a compute node.
conda env update --file reqs.yaml -p ./my_plot_env
The command above can be executed over and over again as you add new packages to your environment.
After the environment is installed, you can use it in future terminal sessions or scripts using:
ml anaconda
conda activate ./path/to/my_plot_env
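For scripted or repeated updates, the conda env update command from this section can also be launched from Python. The sketch below assumes that the conda executable is already on your PATH (for example, after ml anaconda); the function names are hypothetical, and the default file name reqs.yaml follows the example in this guide.

```python
import shutil
import subprocess


def conda_update_command(env_path, reqs_file="reqs.yaml"):
    """Build the argument list for the environment update command."""
    return ["conda", "env", "update", "--file", reqs_file, "-p", env_path]


def update_env(env_path, reqs_file="reqs.yaml"):
    """Run the update if conda is on PATH; return True on success."""
    if shutil.which("conda") is None:
        print("conda not found; load the anaconda module first")
        return False
    result = subprocess.run(conda_update_command(env_path, reqs_file))
    return result.returncode == 0


# Show the command without running it.
print(" ".join(conda_update_command("./my_plot_env")))
```

Passing the arguments as a list avoids shell quoting problems when the environment path contains spaces.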
The method above offers the following benefits:
- Reproducibility. You can repeatedly update reqs.yaml to add new packages, and then use the conda env update command above to add them to the environment. This helps to prevent version conflicts and makes your work more reproducible.
- A large library of packages is available on both Anaconda Cloud and PyPI. As long as you include pip in your requirements list, you can add packages to the corresponding list, for example sphinx in the example requirements file above. The conda program can manage packages from its own repositories alongside those delivered from the Python Package Index (PyPI).
- Interactive development. The use of nb_conda_kernels lets Jupyter detect this environment so you can use it inside a notebook.
- Performance. Oftentimes conda packages are just as fast as our standard software modules. You may notice that numpy uses the Intel MKL library, for example, which provides a large speedup for linear algebra calculations on some platforms. While we also offer MKL on our software modules system, it is the default for many of these packages.
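To check which linear algebra library your own numpy was built against, you can inspect its build configuration. This is a minimal sketch, assuming numpy is installed in the active environment; searching the text for "mkl" is only a heuristic, and the helper name numpy_build_summary is hypothetical.

```python
import contextlib
import io


def numpy_build_summary():
    """Return numpy's build configuration text, or None if numpy is missing."""
    try:
        import numpy
    except ImportError:
        return None
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        numpy.show_config()  # prints BLAS/LAPACK build information
    return buffer.getvalue()


summary = numpy_build_summary()
if summary is None:
    print("numpy is not installed in this environment")
else:
    print("MKL mentioned in build config:", "mkl" in summary.lower())
```

The layout of the configuration text differs between numpy releases, so read the full output if the heuristic gives a surprising answer.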
Alternative C: installing a conda environment sequentially
While we strongly recommend that users maintain a requirements file (reqs.yaml) and make use of the conda env update command described above, it is also possible to install individual packages into your environment. We recommend choosing a location on ~/data (using the path flag, -p) to store your environment.
ml anaconda
# cd to a location on ~/data
conda create -p ./myenv
conda activate ./myenv
# install pip
conda install pip
# after you install pip, you should use pip directly (without --user)
pip install matplotlib
# install a conda package from a specific channel
conda install -c au-eoed gnu-parallel
Note that this method can often cause package upgrades and downgrades in order to resolve a set of mutual dependencies. For this reason, we recommend using a requirements file instead, so that your environment can be reproducibly generated from a single set of requested packages.
Alternative D: Singularity
If you are unable to compile your software from source or install it using the guide to Anaconda above, you may wish to run your code in a Singularity container. More documentation is coming soon.