IMP3 output: Preprocessing

The Preprocessing step creates output files in the Preprocessing and Stats directories within the defined outputdir. The exact naming of the files depends on the configuration settings.

For all runs with reads as input, the original input reads are copied to the Preprocessing directory and renamed to <mg|mt>.<r1|r2>.fq.gz. The final set of processed reads, which is the input for the Assembly step, is pointed to by the symbolic links <mg|mt>.<r1|r2|se>.preprocessed.fq [*], [†]. All IMP3 runs that include the Preprocessing step also contain the reads trimmed with Trimmomatic (<mg|mt>.<r1|r2|se>.trimmed.fq.gz).
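Because the final reads are exposed as symbolic links, it can be useful to check which processed file each link points to. The following is a minimal sketch, not part of IMP3: the function name resolve_preprocessed and its arguments are hypothetical, and only the link names come from the convention described above.

```python
from pathlib import Path


def resolve_preprocessed(preproc_dir, omic="mg"):
    """Map each read set (r1, r2, se) to the file its symlink points to.

    Hypothetical helper: only the link name pattern
    <mg|mt>.<r1|r2|se>.preprocessed.fq is taken from the documentation.
    """
    resolved = {}
    for read_set in ("r1", "r2", "se"):
        link = Path(preproc_dir) / f"{omic}.{read_set}.preprocessed.fq"
        # Links that are absent (e.g. no se reads) are simply skipped.
        if link.is_symlink():
            resolved[read_set] = link.resolve()
    return resolved
```

Calling resolve_preprocessed("<outputdir>/Preprocessing", "mt") would then return a dictionary of the metaT links found in that directory and their targets.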

MetaT reads are always filtered to remove rRNA reads using SortMeRNA. The reads remaining after this filtering step are named mt.<r1|r2|se>.trimmed.rna.fq.gz. If reads were mapped against one or more reference genomes, the filtered reads are named mt.<r1|r2|se>.trimmed.rna_filtered.<reference>_filtered.fq.gz for metaT data and mg.<r1|r2|se>.trimmed.<reference>_filtered.fq.gz for metaG data.
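The naming rules above can be summarized in a small helper. This is an illustrative sketch only: the function filtered_reads_name is hypothetical, it handles a single reference label, and it does not attempt to guess how names are built when several reference genomes are used, since that is not described here.

```python
def filtered_reads_name(omic, read_set, reference=None):
    """Return the expected file name of trimmed (and filtered) reads.

    Hypothetical helper that mirrors the patterns in the text:
    omic is "mg" or "mt", read_set is "r1", "r2" or "se", and
    reference is the label of one reference genome (or None).
    """
    if omic not in ("mg", "mt"):
        raise ValueError("omic must be 'mg' or 'mt'")
    if read_set not in ("r1", "r2", "se"):
        raise ValueError("read_set must be 'r1', 'r2' or 'se'")

    if omic == "mt":
        # MetaT reads always pass through rRNA filtering.
        if reference is None:
            return f"mt.{read_set}.trimmed.rna.fq.gz"
        return f"mt.{read_set}.trimmed.rna_filtered.{reference}_filtered.fq.gz"

    # MetaG reads are only trimmed unless a reference is given.
    if reference is None:
        return f"mg.{read_set}.trimmed.fq.gz"
    return f"mg.{read_set}.trimmed.{reference}_filtered.fq.gz"
```

For example, with a hypothetical reference label "hg38", filtered_reads_name("mg", "r1", "hg38") yields mg.r1.trimmed.hg38_filtered.fq.gz.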

Note: Why is there a Preprocessing directory if you did not choose to do preprocessing? If already-preprocessed reads are given as input, they are placed in the Preprocessing directory following exactly the naming conventions described above, so that the downstream steps can rely on consistent input files and directories.

Stats from Preprocessing step

The Preprocessing step also collects statistics on the original and processed reads and saves them to the Stats directory. To this end, it runs FastQC on the original and preprocessed reads. The resulting HTML reports and zip archives with more detailed tables are in the respective subdirectories <mg|mt>/, named <mg|mt>.<r1|r2>_fastqc.<zip|html> and <mg|mt>.<r1|r2|se>.preprocessed_fastqc.<zip|html>. In addition, the number of reads at each step is recorded in <mg|mt>/<mg|mt>.read_counts.txt, which holds two tab-separated columns: the file names and the number of reads, respectively.
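The read count table is easy to load for further inspection. The sketch below assumes only what is stated above (two tab-separated columns: file name and number of reads); the function name load_read_counts is hypothetical, and because it is not stated here whether the file has a header line, rows whose second column is not an integer are skipped.

```python
import csv


def load_read_counts(path):
    """Read a <mg|mt>.read_counts.txt file into a {file name: reads} dict.

    Hypothetical helper. Rows with fewer than two columns, or with a
    non-integer second column (such as a possible header), are ignored.
    """
    counts = {}
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 2:
                continue
            try:
                counts[row[0]] = int(row[1])
            except ValueError:
                continue
    return counts
```

Comparing the count for the original reads with the count for the preprocessed reads then shows how many reads were removed during preprocessing.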

[*] mg denotes metaG data throughout the workflow; mt denotes metaT data.
[†] r1 contains the first reads of pairs whose partners are in r2; se contains the single-end reads that lost their partner during the Preprocessing step.