How to install IMP3

To install IMP3, you need conda.

  1. Clone this repository to your disk:
git clone --recurse-submodules --single-branch --branch master https://git-r3lab.uni.lu/IMP/imp3.git ./IMP3

Change into the IMP3 directory:

cd IMP3

At this point, you have all the scripts you need to run the workflow. You still need to download a number of databases, though (see point 3).

If you want to use the comfortable IMP3 wrapper, follow points 4-8. If you don’t want to use the wrapper, point 9 has a hint on initializing the conda environments.

  2. Create the main conda environment:

If you don’t have conda, follow these instructions first.

Create the main environment so that you can execute snakemake workflows.

conda env create -f requirements.yaml --prefix ./conda/snakemake_env

To run any snakemake workflow, you have to activate the created environment:

conda activate ./conda/snakemake_env
  3. Download databases:

Note: if you already have some of these databases, you don’t need to download them again. However, you do need to link them into the directory that holds all of the databases (see the example below). You may also not need all of these databases if you don’t run the steps that use them.
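If a database already exists elsewhere on your system, a symbolic link is enough. A minimal sketch, with both paths as placeholders to adapt to your setup:

# make an existing database visible in the central database directory
ln -s /path/to/existing/database /path/to/imp3_databases/database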

There is an installation workflow (see directory install/) which will set up the requirements defined in a config file. A default configuration is provided in the file config/config.install.yaml.

Here is an example of how to execute the setup:

# activate the snakemake environment
conda activate ./conda/snakemake_env
# dry-run
snakemake -s install/Snakefile --configfile config/config.install.yaml --cores 1 -rpn
# execute (use conda, more cores)
snakemake -s install/Snakefile --configfile config/config.install.yaml --cores 8 --use-conda --conda-prefix ./conda -rp
conda deactivate

Please note that the core, memory and runtime requirements depend on the config file used.

For reference, the points below give more information on how the requirements have to be set up for the different workflow steps.

Step preprocessing:

  • Trimmomatic adapters: adapters should be saved in a subdirectory of your database directory called adapters/. You can find the latest version of the Trimmomatic adapters here. We also supply such a directory (from Trimmomatic v0.32) that you just need to unzip. You may also want to use different adapters, which you can supply in FASTA format; make sure to use the format expected by Trimmomatic. You specify the adapter set in the config file.
  • Genomes to filter your reads against: reference FASTA files (with the extension *.fa) should be saved in a subdirectory of your database directory called filtering/. Usually, you want to remove phiX174 and host genome sequences.
  • SortMeRNA databases: when processing metatranscriptomic reads, the rRNA reads have to be removed. They are filtered by SortMeRNA against the provided reference FASTA files. We supply these here. You need to unzip this archive into your database directory, or manually download the sequences from https://github.com/biocore/sortmerna/tree/master/data/rRNA_databases and put them into a subdirectory of your database directory called sortmerna/. See the example layout after this list.
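Putting these together, the preprocessing part of a database directory could look like the sketch below. The top-level path, the adapter file name and the host genome are illustrative, not fixed names:

# create the expected subdirectories (the top-level path is a placeholder)
mkdir -p /path/to/imp3_databases/adapters /path/to/imp3_databases/filtering /path/to/imp3_databases/sortmerna
# Trimmomatic adapters (FASTA, in Trimmomatic's format), e.g. TruSeq3-PE.fa, go into adapters/
# genomes to filter against (*.fa), e.g. phiX174 and the host genome, go into filtering/
cp phiX174.fa host.fa /path/to/imp3_databases/filtering/
# the SortMeRNA rRNA reference FASTA files go into sortmerna/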

Step analysis:

TODO: Mantis

Step taxonomy:

  • Kraken2: to run taxonomic classification with Kraken2, a directory containing the required database files has to be put (or linked) in your database directory. See here for a collection of prebuilt Kraken2 databases; you can of course also build or use a custom database. You supply the name of the Kraken2 database in the config file later (see the sketch after this list).
  • GTDB-Tk: if you have performed binning, you need the GTDB database (file gtdbtk_data.tar.gz). Untar it into a subdirectory called GTDB_tk within your database directory.
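As a sketch, placing both taxonomy databases could look like this; the top-level database path and the Kraken2 database name (k2_standard) are placeholders:

# untar the GTDB data into the expected subdirectory
mkdir -p /path/to/imp3_databases/GTDB_tk
tar -xzf gtdbtk_data.tar.gz -C /path/to/imp3_databases/GTDB_tk
# put (or link) a Kraken2 database; its directory name is what
# you reference in the config file later
ln -s /path/to/k2_standard /path/to/imp3_databases/k2_standard
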
  4. Adjust the file VARIABLE_CONFIG to your requirements (use a tab between the variable name and your setting; see the example after this list):
  • SNAKEMAKE_VIA_CONDA - set this to true if you don’t have snakemake in your path and want to install it via conda (recommended, because that way you have a current version). Leave empty if you don’t need an additional snakemake.
  • LOADING_MODULES - insert a bash command to load modules, if you need them to run conda. Leave empty if you don’t need to load a module.
  • SUBMIT_COMMAND - insert the bash command you usually use to submit a job to your cluster to run on a single CPU for the entirety of the run (a few hours to days). You can also include extra arguments, in particular if you need to bind jobs to one node (e.g. sbatch --nodelist=). You only need this setting if you want the top snakemake instance to run in a submitted job. Alternatively, you can run it on the frontend via tmux; leave this empty if you want to use that option and have tmux installed.
  • SCHEDULER - insert the name of the scheduler you want to use (currently slurm or sge). This determines the cluster config given to snakemake, e.g. the cluster config file for slurm is config/slurm.config.yaml or config/slurm_simple.config.yaml. Also check that the settings in this file are correct. If you have a different system, contact us and feel free to submit new scheduler files.
  • MAX_THREADS - set this to the maximum number of cores you want to be using in a run. If you don’t set this, the default will be 50. Users can override this setting at runtime.
  • BIGMEM_CORES - set this to the maximum number of bigmem cores you want to be using in a run. If you don’t set this, the default will be 5.
  • NORMAL_MEM_EACH - set the size of the RAM of one core of your normal compute nodes (e.g. 8G). If you’re not planning to use IMP3 to submit to a cluster, you don’t need to set this.
  • BIGMEM_MEM_EACH - set the size of the RAM of one core of your bigmem (or highmem) compute nodes. If you’re not planning to use IMP3 to submit to a cluster, or don’t have separate bigmem nodes, you don’t need to set this.
  • NODENAME_VAR - if you submit your workflow to a cluster that requires binding to a fixed node, set this to the environment variable storing the node name (e.g. SLURMD_NODENAME on most slurm systems).
  • BIND_JOBS_TO_MAIN - if you submit your workflow to a cluster that requires binding to a fixed node, set this variable to true.
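For illustration, a filled-in VARIABLE_CONFIG for a slurm cluster might look like the sketch below. Every value is an example for this hypothetical setup, not a recommendation, and the separator between name and value must be a single tab:

SNAKEMAKE_VIA_CONDA	true
LOADING_MODULES	module load Anaconda3
SUBMIT_COMMAND	sbatch --nodelist=node01
SCHEDULER	slurm
MAX_THREADS	50
BIGMEM_CORES	5
NORMAL_MEM_EACH	8G
BIGMEM_MEM_EACH	16G
NODENAME_VAR	SLURMD_NODENAME
BIND_JOBS_TO_MAIN	true
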
  5. Decide how you want to run IMP3, if you let it submit jobs to the cluster:

Only do one of the two:

  • if you want to submit the process running snakemake to the cluster:
cp runscripts/runIMP3_submit.sh runIMP3
chmod 755 runIMP3
  • if you want to keep the process running snakemake on the frontend using tmux:
cp runscripts/runIMP3_tmux.sh runIMP3
chmod 755 runIMP3
  6. Optional, but highly recommended: install snakemake (and other dependencies) via conda:

If you want to use snakemake via conda (and you’ve set SNAKEMAKE_VIA_CONDA to true), install the environment as recommended by Snakemake. Please note that the IMP3 pipeline is tested only with the package versions specified in requirements.yaml; using different versions might cause issues.

conda env create -f requirements.yaml --prefix $PWD/conda/snakemake_env
  7. Optional, but highly recommended: set permissions / PATH:

IMP3 is meant to be used by multiple users. Set the permissions accordingly (see the example commands after this list). I’d suggest:

  • read access to all of the files for all users, plus:
  • execution rights for the runIMP3 file and the .sh scripts in the subfolder submit_scripts
  • read, write and execution rights for the conda subfolder
  • adding the IMP3 directory to your PATH.
  • It can also be useful to make the VARIABLE_CONFIG file read-only, because you will always need it. The same goes for config.imp.yaml once you’ve set the paths to the databases you want to use (see below).
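A minimal sketch of these suggestions as shell commands, assuming the repository lives at /path/to/IMP3 and config.imp.yaml sits in the config/ subfolder (both are assumptions; adapt them to your site):

cd /path/to/IMP3
# read access to all files for all users
chmod -R a+r .
# execution rights for the wrapper and the submit scripts
chmod 755 runIMP3 submit_scripts/*.sh
# read, write and execute rights on the conda subfolder
chmod -R a+rwX conda
# protect the configuration files once they are final
chmod a-w VARIABLE_CONFIG config/config.imp.yaml
# add the IMP3 directory to your PATH (e.g. in ~/.bashrc)
export PATH="/path/to/IMP3:$PATH"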
  8. Initialize the conda environments:

This run sets up the conda environments that will be usable by all users and will download a database:

./runIMP3 -i config/config.imp_init.yaml

This step will take several minutes to an hour.

  9. Initialize the conda environments without the wrapper:

This sets up the conda environments that will be usable by all users and will download more databases:

snakemake --cores 1 -s Snakefile --conda-create-envs-only --use-conda --conda-prefix `pwd`/conda --configfile config/config.imp_init.yaml --local-cores 1

This step will take several minutes to an hour. I strongly suggest removing one line from the activation script after the installation, namely the one reading R CMD javareconf > /dev/null 2>&1 || true: you don’t need this line later, and if two users run it at the same time it can cause trouble. You can do this by running:

sed -i "s/R CMD javareconf/#R CMD javareconf/" conda/*/etc/conda/activate.d/activate-r-base.sh