This! Is! Singularity!

Gave Singularity another try, after letting the dust settle on a previous traumatic experience, expecting it not to work AT ALL and being astounded and also unsurprised that it actually worked with minimal effort.

The target: the wonderful ggCaller (https://github.com/samhorsfield96/ggCaller).

A Singularity image (.sif) was already provided by the developer (https://zenodo.org/records/7870950), although it should be possible to wrangle Docker images into Singularity format with a theoretically simple conversion.

#load Singularity
module load singularity/3.10.5
singularity -h

Linux container platform optimized for High Performance Computing (HPC)                                                            and
Enterprise Performance Computing (EPC)

Usage:
  singularity [global options...]

Description:
  Singularity containers provide an application virtualization layer en                                                           abling
  mobility of compute via both application and environment portability.                                                            With
  Singularity one is capable of building a root file system that runs o                                                           n any
  other Linux system where Singularity is installed.

Options:
  -c, --config string   specify a configuration file (for root or
                        unprivileged installation only) (default
                        "/install/software/restart/depos/singularity/in                                                           stallation//etc/singularity/singularity.conf")
  -d, --debug           print debugging information (highest verbosity)
  -h, --help            help for singularity
      --nocolor         print without color output (default False)
  -q, --quiet           suppress normal output
  -s, --silent          only print errors
  -v, --verbose         print additional information
      --version         version for singularity

Available Commands:
  build       Build a Singularity image
  cache       Manage the local cache
  capability  Manage Linux capabilities for users and groups
  completion  Generate the autocompletion script for the specified shel                                                           l
  config      Manage various singularity configuration (root user only)
  delete      Deletes requested image from the library
  exec        Run a command within a container
  help        Help about any command
  inspect     Show metadata for an image
  instance    Manage containers running as services
  key         Manage OpenPGP keys
  oci         Manage OCI containers
  overlay     Manage an EXT3 writable overlay image
  plugin      Manage Singularity plugins
  pull        Pull an image from a URI
  push        Upload image to the provided URI
  remote      Manage singularity remote endpoints, keyservers and OCI/D                                                           ocker registry credentials
  run         Run the user-defined default command within a container
  run-help    Show the user-defined help for an image
  search      Search a Container Library for images
  shell       Run a shell within a container
  sif         Manipulate Singularity Image Format (SIF) images
  sign        Attach digital signature(s) to an image
  test        Run the user-defined tests within a container
  verify      Verify cryptographic signatures attached to an image
  version     Show the version for Singularity

Examples:
  $ singularity help <command> [<subcommand>]
  $ singularity help build
  $ singularity help instance start


For additional help or support, please visit https://www.sylabs.io/docs                                                           /

The first issue, which had cropped up previously, related to the absence of suitable cache dir:

FATAL:   Failed to create an image cache handle: failed initializing caching directory: couldn't create cache directory /home/ulrich.schnauss/.singularity/cache: mkdir /home/ulrich.schnauss/.singularity: disk quota exceeded

which could be resolved by redefining the environmental variable to point towards a new folder on my analysis drive.

export SINGULARITY_CACHEDIR="/path/to/my/analysis/folder/singularity/"

Then, it was possible to smash the bottle against the hull and pull my first container:

wget https://zenodo.org/records/7870950/files/samhorsfield96_ggcaller_latest-2023-04-27-6c0a454e1c5c.sif?download=1
mv https://zenodo.org/records/7870950/files/samhorsfield96_ggcaller_latest-2023-04-27-6c0a454e1c5c.sif?download=1 https://zenodo.org/records/7870950/files/samhorsfield96_ggcaller_latest-2023-04-27-6c0a454e1c5c.sif

Tried to build it per the wiki instructions:

singularity shell --writable samhorsfield96_ggcaller_latest-2023-04-27-6c0a454e1c5c.sif
FATAL:   no SIF writable overlay partition found in /path/to/my/analysis/folder/singularity/samhorsfield96_ggcaller_latest-2023-04-27-6c0a454e1c5c.sif

Thankfully, someone had experienced a similiar issue before and left their documentation up in the annals of the interwebs #shouldersofgiants: https://groups.google.com/a/lbl.gov/g/singularity/c/HhwetRXIfYI/m/6RP4N5zBAAAJ

singularity build --sandbox samhorsfield96_ggcaller_latest-2023-04-27-6c0a454e1c5c.sif
Build target 'samhorsfield96_ggcaller_latest-2023-04-27-6c0a454e1c5c.sif' already exists and will be deleted during the build process. Do you                                                            want to continue? [N/y] y
INFO:    Starting build...
INFO:    Creating sandbox directory...
INFO:    Build complete: samhorsfield96_ggcaller_latest-2023-04-27-6c0a454e1c5c.sif

That was… it? 100% believing it would not work, proceeded to amend PATH as per the instructions and navigated inside the folder

PATH=$PATH:/opt/conda/bin
cd samhorsfield96_ggcaller_latest-2023-04-27-6c0a454e1c5c.sif
#I think the naming/structuring could be improved here for future, on my part

ggcaller -h
usage: ggcaller [-h] [--graph GRAPH] [--colours COLOURS] [--not-ref]
                [--refs REFS] [--reads READS] [--query QUERY]
                [--codons CODONS] [--kmer KMER] [--save]
                [--data DATA] [--all-seq-in-graph] [--out OUT]
                [--max-path-length MAX_PATH_LENGTH]
                [--min-orf-length MIN_ORF_LENGTH]
                [--score-tolerance SCORE_TOLERANCE]
                [--max-ORF-overlap MAX_ORF_OVERLAP]
                [--min-path-score MIN_PATH_SCORE]
                [--min-orf-score MIN_ORF_SCORE]
                [--max-orf-orf-distance MAX_ORF_ORF_DISTANCE]
                [--query-id QUERY_ID] [--no-filter] [--no-write-idx]
                [--no-write-graph] [--repeat] [--no-clustering]
                [--no-refind] [--identity-cutoff IDENTITY_CUTOFF]
                [--len-diff-cutoff LEN_DIFF_CUTOFF]
                [--family-threshold FAMILY_THRESHOLD]
                [--merge-paralogs]
                [--clean-mode {strict,moderate,sensitive}]
                [--annotation {none,fast,sensitive,ultrasensitive}]
                [--diamonddb ANNOTATION_DB] [--hmmdb HMM_DB]
                [--evalue EVALUE]
                [--truncation-threshold TRUNCATION_THRESHOLD]
                [--search-radius SEARCH_RADIUS]
                [--refind-prop-match REFIND_PROP_MATCH]
                [--min-trailing-support MIN_TRAILING_SUPPORT]
                [--trailing-recursive TRAILING_RECURSIVE]
                [--edge-support-threshold EDGE_SUPPORT_THRESHOLD]
                [--length-outlier-support-proportion LENGTH_OUTLIER_SUP                                                           PORT_PROPORTION]
                [--min-edge-support-sv MIN_EDGE_SUPPORT_SV]
                [--no-clean-edges] [--alignment {core,pan}]
                [--aligner {def,ref}] [--core-threshold CORE]
                [--no-variants] [--ignore-pseduogenes] [--quiet]
                [--threads THREADS] [--version]

Generates ORFs from a Bifrost graph.

optional arguments:
  -h, --help            show this help message and exit

Input/Output options:
  --graph GRAPH         Bifrost GFA file generated by Bifrost build.
  --colours COLOURS     Bifrost colours file generated by Bifrost
                        build.
  --not-ref             If using existing graph, was not graph built
                        exclusively with assembled genomes. [Default
                        = False]
  --refs REFS           List of reference genomes (one file path per
                        line).
  --reads READS         List of read files (one file path per line).
  --query QUERY         List of unitig sequences to query (either
                        FASTA or one sequence per line)
  --codons CODONS       JSON file containing start and stop codon
                        sequences.
  --kmer KMER           K-mer size used in Bifrost build (bp).
                        [Default = 31]
  --save                Save graph objects for sequence querying.
                        [Default = False]
  --data DATA           Directory containing data from previous
                        ggCaller run generated via "--save"
  --all-seq-in-graph    Retains all DNA sequence for each gene
                        cluster in the Panaroo graph output. Off by
                        default as it uses a large amount of space.
  --out OUT             Output directory

ggCaller traversal and gene-calling cut-off settings:
  --max-path-length MAX_PATH_LENGTH
                        Maximum path length during ORF finding (bp).
                        [Default = 20000]
  --min-orf-length MIN_ORF_LENGTH
                        Minimum ORF length to return (bp). [Default =
                        90]
  --score-tolerance SCORE_TOLERANCE
                        Length probability tolerance for shorter
                        alternative start sites. If within
                        tolerance,ggCaller will check if start
                        coverage and BALROG score are both higher in
                        shorter ORF. [Default = 0.2]
  --max-ORF-overlap MAX_ORF_OVERLAP
                        Maximum overlap allowed between overlapping
                        ORFs. [Default = 60]
  --min-path-score MIN_PATH_SCORE
                        Minimum total Balrog score for a path of ORFs
                        to be returned. [Default = 100]
  --min-orf-score MIN_ORF_SCORE
                        Minimum individual Balrog score for an ORF to
                        be returned. [Default = 100]
  --max-orf-orf-distance MAX_ORF_ORF_DISTANCE
                        Maximum distance for graph traversal during
                        ORF connection (bp). [Default = 10000]
  --query-id QUERY_ID   Ratio of query-kmers to required to match in
                        graph. [Default = 0.8]

Settings to avoid/include algorithms:
  --no-filter           Do not filter ORF calls using Balrog. Will
                        return all ORF calls. [Default = False]
  --no-write-idx        Do not write FMIndexes to file. [Default =
                        False]
  --no-write-graph      Do not write Bifrost GFA and colours to file.
                        [Default = False]
  --repeat              Enable traversal of nodes multiple times.
                        [Default = False]
  --no-clustering       Do not cluster ORFs. [Default = False]
  --no-refind           Do not refind uncalled genes [Default =
                        False]

Gene clustering options:
  --identity-cutoff IDENTITY_CUTOFF
                        Minimum identity at amino acid level between
                        two ORFs for clustering. [Default = 0.98]
  --len-diff-cutoff LEN_DIFF_CUTOFF
                        Minimum ratio of length between two ORFs for
                        clustering. [Default = 0.98]
  --family-threshold FAMILY_THRESHOLD
                        protein family sequence identity threshold
                        [Default = 0.7]
  --merge-paralogs      don't split paralogs[Default = False]

Panaroo run mode options:
  --clean-mode {strict,moderate,sensitive}
                        R|The stringency mode at which to run
                        panaroo. Must be one of 'strict', 'moderate'
                        or 'sensitive'. Each of these modes can be
                        fine tuned using the additional parameters in
                        the 'Graph correction' section. strict:
                        Requires fairly strong evidence (present in
                        at least 5% of genomes) to keep likely
                        contaminant genes. moderate: Requires
                        moderate evidence (present in at least 1% of
                        genomes) to keep likely contaminant genes.
                        sensitive: Does not delete any genes and only
                        performes merge and refinding operations.
                        Useful if rare plasmids are of interest as
                        these are often hard to disguish from
                        contamination. Results will likely include
                        higher number of spurious annotations.

Panaroo gene cluster annotation options:
  --annotation {none,fast,sensitive,ultrasensitive}
                        Annotate genes using diamond default (fast),
                        diamond sensitive (sensitive) or diamond and
                        HMMscan (ultrasensitive). Specify 'none' if
                        annotation not required.Default = 'fast'
  --diamonddb ANNOTATION_DB
                        Diamond database. Defaults are 'Bacteria' or
                        'Viruses'. Can also specify path to fasta
                        file for custom database generation
  --hmmdb HMM_DB        HMMER hmm profile file. Default is Uniprot
                        HAMAP. Can alsospecify path to pre-built hmm
                        profile file generated using hmmbuild
  --evalue EVALUE       Maximum e-value to return for DIAMOND and
                        HMMER searches during annotation[Default =
                        0.001]
  --truncation-threshold TRUNCATION_THRESHOLD
                        Sequences in a gene family cluster below this
                        proportion of the length of thecentroid will
                        be annotated as 'potential
                        pseudogene'[Default = 0.8]

Panaroo gene refinding options:
  --search-radius SEARCH_RADIUS
                        the distance in nucleotides surronding the
                        neighbour of an accessory gene in which to
                        search for it
  --refind-prop-match REFIND_PROP_MATCH
                        the proportion of an accessory gene that must
                        be found in order to consider it a
                        match[Default = 0.2]

Panaroo graph correction stringency options:
  --min-trailing-support MIN_TRAILING_SUPPORT
                        minimum cluster size to keep a gene called at
                        the end of a contig
  --trailing-recursive TRAILING_RECURSIVE
                        number of times to perform recursive trimming
                        of low support nodes near the end of contigs
  --edge-support-threshold EDGE_SUPPORT_THRESHOLD
                        minimum support required to keep an edge that
                        has been flagged as a possible mis-assembly
  --length-outlier-support-proportion LENGTH_OUTLIER_SUPPORT_PROPORTION
                        proportion of genomes supporting a gene with
                        a length more than 1.5x outside the
                        interquatile range for genes in the same
                        cluster.Genes failing this test will be re-
                        annotated at the shorter length[Default =
                        0.1]
  --min-edge-support-sv MIN_EDGE_SUPPORT_SV
                        minimum edge support required to call
                        structural variants in the presence/absence
                        sv file
  --no-clean-edges      Turn off edge filtering in the final output
                        graph.[Default = False]

Alignment options:
  --alignment {core,pan}
                        Output alignments of core genes or all genes.
                        Options are 'core' and 'pan'. [Default =
                        'None'
  --aligner {def,ref}   Specify an aligner. Options: 'ref' for
                        reference-guided MSA and 'def' for default
                        standard MSA
  --core-threshold CORE
                        Core-genome sample threshold.[Default = 0.95]
  --no-variants         Do not call variants using SNP-sites after
                        alignment.[Default = False]
  --ignore-pseduogenes  Ignore ORFs annotated as 'potential
                        pseudogenes' in alignment[Default = False]

Misc. options:
  --quiet               suppress additional output[Default = False]
  --threads THREADS     Number of threads to use. [Default = 1]
  --version, -v         show program's version number and exit

So there you have it: what took the bones of a day to install through the refiners fire that is Conda dependency hell, resolved in 10 mins. Still need to suss out how to use Docker images where the devs haven’t already generated a Singularity file…

Quickly jotting down notes to use AWS-CLI and PAT to resolve issues installing R packages from Github today…

Trying to do a post per week, despite it being absolutely hectic these days, and needed to install Palmscan (https://github.com/rcedgar/palmscan) locally in the line of peer review.