IMP3 output: Assembly

During the Assembly step (or if the user provided an existing assembly, see input), all outputs are written to the Assembly directory within the configured outputdir directory (see configuration).

  • The final output is a FASTA file of the assembled contigs, <mg|mt|mgmt>.assembly.merged.fa [*]. Each FASTA header consists of the sample name as given in the config file, an underscore, the word contig, another underscore, and a number, e.g. test_contig_1.
  • Index files for the final contig file are generated for BWA (suffixes amb, ann, bwt, pac, and sa), Samtools (fai), and BioPerl (index).
  • A bed3 file is also stored for later access.
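
The header convention described above can be taken apart again in downstream scripts. The following is a minimal sketch, not part of IMP3 itself: the function name split_contig_name is hypothetical, and it assumes that the last occurrence of "_contig_" separates the sample name from the contig number, so that sample names containing underscores still work.

```python
def split_contig_name(header: str) -> tuple[str, int]:
    """Split a contig header such as 'test_contig_1' into (sample, number).

    Assumption: the header follows the <sample>_contig_<number> pattern
    described in the text; anything after the first whitespace is ignored.
    """
    parts = header.lstrip(">").split()
    name = parts[0] if parts else ""
    # rpartition splits on the LAST "_contig_", so sample names that
    # themselves contain underscores are kept intact.
    sample, sep, number = name.rpartition("_contig_")
    if not sep or not sample or not number.isdecimal():
        raise ValueError(f"not a <sample>_contig_<number> header: {header!r}")
    return sample, int(number)


if __name__ == "__main__":
    print(split_contig_name(">test_contig_1"))
```

Headers that do not match the pattern raise a ValueError rather than being silently accepted.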

Note: some of these files are produced by the Analysis step, so they will not be present after running only the Assembly step.

During the Assembly step, IMP3 maps the processed reads back to the final contigs and stores the alignment as <mg|mt>.reads.sorted.bam and its index as <mg|mt>.reads.sorted.bam.bai. The BAM files are sorted by contig name and position.

The Assembly step consists of many sub-steps (iterative assembly) that generate a large number of intermediate result files and directories. These are archived and compressed into intermediary.tar.gz.
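
To check what the archive holds without unpacking it, the member names can be listed with Python's standard tarfile module. This is a minimal sketch: the function name list_archive_members is hypothetical, and the path to intermediary.tar.gz has to be supplied by the caller.

```python
import tarfile


def list_archive_members(archive_path: str) -> list[str]:
    """Return the names of all members of a gzip-compressed tar archive.

    Only the archive index is read; nothing is extracted to disk.
    """
    with tarfile.open(archive_path, mode="r:gz") as archive:
        return archive.getnames()
```

A call such as list_archive_members("intermediary.tar.gz") returns the stored paths as a list of strings, which can then be filtered for the intermediate file of interest.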

Stats from Assembly step

The Assembly step will also collect some summary statistics and save them to the Stats directory:

  • The GC-content of the final contigs is recorded in the tab-separated file <mg|mt|mgmt>/<mg|mt|mgmt>.assembly.gc_content.txt, which has a header line and two columns: the contig names and the GC-content in percent (i.e. 0-100).
  • The length of the contigs is provided in a tab-separated file <mg|mt|mgmt>/<mg|mt|mgmt>.assembly.length.txt with two simple columns, contig names and lengths.
  • The stats on the read mapping are kept in the Stats subdirectories for the metaG and/or metaT reads (mg/ or mt/):
    • <mg|mt>/<mg|mt|mgmt>.assembly.contig_flagstat.txt contains the numeric part of the samtools flagstat output.
    • The average depth of coverage for each set of reads, for all contigs that have at least one read mapping to them, is given in <mg|mt>/<mg|mt|mgmt>.assembly.contig_depth.txt. This file is headerless and tab-separated, with the contig names in the first column and the average depth of coverage in the second.
    • Based on this file, the Binning step adds the file <mg|mt>/<mg|mt|mgmt>.assembly.contig_depth.0.txt, which also contains lines for contigs with zero coverage.
  • The file <mg|mt>/<mg|mt|mgmt>.assembly.contig_coverage.txt contains the processed output of bedtools genomeCoverageBed. It contains seven tab-separated columns:
    • 1: contig names,
    • 2: 0 (= the first position in bed coordinates),
    • 3: contig length (= the last position in bed coordinates),
    • 4: the number of regions overlapping with the aligned positions in the bam file (roughly = number of mapping reads),
    • 5: the number of covered positions,
    • 6: the number of positions (= value in column 3),
    • 7: the coverage breadth, i.e. the proportion of covered positions (on a scale of 0-1).
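
The seven-column layout above can be read into named fields and sanity-checked, since column 7 should equal column 5 divided by column 6, and column 3 should equal column 6. The following is a minimal sketch under stated assumptions: the names ContigCoverage, parse_coverage_line and is_consistent are hypothetical, the example line is made up for illustration, and the code assumes that each data line holds exactly seven tab-separated fields (whether the file has a header line is not stated here, so any header would need to be skipped by the caller).

```python
from typing import NamedTuple


class ContigCoverage(NamedTuple):
    contig: str     # column 1: contig name
    start: int      # column 2: always 0 in bed coordinates
    end: int        # column 3: contig length
    regions: int    # column 4: roughly the number of mapping reads
    covered: int    # column 5: number of covered positions
    length: int     # column 6: number of positions
    breadth: float  # column 7: covered / length, between 0 and 1


def parse_coverage_line(line: str) -> ContigCoverage:
    """Parse one tab-separated data line of the contig coverage file."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 7:
        raise ValueError(f"expected 7 tab-separated columns, got {len(fields)}")
    contig, start, end, regions, covered, length, breadth = fields
    return ContigCoverage(contig, int(start), int(end), int(regions),
                          int(covered), int(length), float(breadth))


def is_consistent(rec: ContigCoverage, tol: float = 1e-4) -> bool:
    """Check the relations between the columns described in the text."""
    if rec.length <= 0:
        return False
    return (rec.start == 0
            and rec.end == rec.length
            and abs(rec.breadth - rec.covered / rec.length) <= tol)


if __name__ == "__main__":
    # Made-up example line, not taken from a real IMP3 run.
    record = parse_coverage_line("test_contig_1\t0\t1000\t42\t750\t1000\t0.75\n")
    print(record.contig, record.breadth, is_consistent(record))
```

The tolerance in is_consistent allows for rounding of the breadth value in the file.
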
[*] The name of the assembly depends on the workflow: if the workflow has only metaG reads as input, if the user has chosen the non-hybrid assembly workflow, or if the user has provided an existing assembly, the assembly is referred to by mg. If only metaT reads were given, the assembly is referred to by mt. If the hybrid metaG and metaT workflow was defined and both types of reads were supplied, the assembly is referred to by mgmt.
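
The naming rule in this footnote can be written down as a small decision function, which may help when scripting against the output file names. This is a sketch of the rule as described here, not IMP3's own code: the function name assembly_prefix and its parameters are hypothetical and do not correspond to IMP3 configuration keys.

```python
def assembly_prefix(has_mg_reads: bool, has_mt_reads: bool,
                    hybrid: bool = True,
                    existing_assembly: bool = False) -> str:
    """Return the prefix (mg, mt or mgmt) used for the assembly files."""
    if existing_assembly:
        # A user-provided assembly is referred to by mg.
        return "mg"
    if has_mg_reads and has_mt_reads:
        # Both read types: mgmt for the hybrid workflow, mg otherwise.
        return "mgmt" if hybrid else "mg"
    if has_mg_reads:
        return "mg"
    if has_mt_reads:
        return "mt"
    raise ValueError("no reads and no existing assembly were given")


if __name__ == "__main__":
    prefix = assembly_prefix(has_mg_reads=True, has_mt_reads=True)
    print(f"{prefix}.assembly.merged.fa")
```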