This tutorial introduces the text-based workflow system Snakemake. Snakemake follows the GNU Make paradigm: workflows are defined in terms of rules that define how to create output files from input files. Dependencies between the rules are determined automatically, creating a DAG (directed acyclic graph) of jobs that can be automatically parallelized.
Snakemake sets itself apart from existing text-based workflow systems in the following way. Hooking into the Python interpreter, Snakemake offers a definition language that is an extension of Python, with syntax to define rules and workflow-specific properties. This allows combining the flexibility of a plain scripting language with a Pythonic workflow definition. The Python language is known to be concise yet readable and can appear almost like pseudo-code. The syntactic extensions provided by Snakemake maintain this property for the definition of the workflow. Further, Snakemake's scheduling algorithm can be constrained by priorities, provided cores, and customizable resources, and it provides generic support for distributed computing (e.g., cluster or batch systems). Hence, a Snakemake workflow scales without modification from single-core workstations and multi-core servers to cluster or batch systems.
While the examples presented here come from bioinformatics, Snakemake is a general-purpose workflow management system suitable for any discipline.
Note: To get the most up-to-date version of this tutorial and further information on Snakemake, please visit the Snakemake homepage. Also have a look at the corresponding slides.
To go through this tutorial, you need the following software installed:
The easiest way to set up these prerequisites is to use the Miniconda Python 3 distribution. The tutorial assumes that you are using either Linux or MacOS X. Both Snakemake and Miniconda also work under Windows, but the Windows shell is too different to provide generic examples.
If you already use Linux or MacOS X, go on with Step 1. If you use Windows, you can set up a Linux virtual machine (VM) with Vagrant. First, install Vagrant following the installation instructions in the Vagrant documentation. Then, create a new directory that you want to share with your Linux VM, e.g., a folder vagrant-linux, somewhere. Open a command line prompt and change into that directory. Here, you create a 64-bit Ubuntu Linux environment with
vagrant init hashicorp/precise64
vagrant up
If you decide to use a 32-bit image, you will need to download the 32-bit version of Miniconda in the next step. The contents of the vagrant-linux folder will be shared with the virtual machine that is set up by Vagrant. You can log into the virtual machine via
vagrant ssh
If this command tells you to install an SSH client, you can follow the instructions in this blog post. Now, you can follow the steps of our tutorial from within your Linux VM.
First, please open a terminal or make sure you are logged into your Vagrant Linux VM. Assuming that you have a 64-bit system, on Linux, download and install Miniconda 3 with
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
On MacOS X, download and install with
wget https://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
bash Miniconda3-latest-MacOSX-x86_64.sh
For a 32-bit system, URLs and file names are analogous, but without the _64 suffix. When you are asked the question
Do you wish the installer to prepend the Miniconda3 install location to PATH ...? [yes|no]
answer with yes. Along with a minimal Python 3 environment, Miniconda contains the package manager Conda. After opening a new terminal, you can use the new conda command to install software packages and create isolated environments to, e.g., use different versions of the same package. We will later use Conda to create an isolated environment with all required software for this tutorial.
First, create a new directory snakemake-tutorial at a reasonable place and change into that directory in your terminal. If you use a Vagrant Linux VM from Windows as described above, create the directory under /vagrant/, so that the contents are shared with your host system (you can then edit all files from within Windows with an editor that supports Unix line breaks). In this directory, we will later create an example workflow that illustrates the Snakemake syntax and execution environment. To begin, we download some example data on which the workflow shall be executed:
wget https://bitbucket.org/snakemake/snakemake/downloads/snakemake-tutorial-data.tar.gz
tar -xf snakemake-tutorial-data.tar.gz
This will create a data folder and a requirements.txt file in the working directory. The requirements.txt file can be used to install all required software into an isolated Conda environment with the name snakemake-tutorial via
conda create -n snakemake-tutorial -c bioconda --file requirements.txt
Note that the arguments after the -c flag define software channels that shall be used in addition to the main Conda repository. Here, we use the Bioconda channel, which contains a growing collection of bioinformatics software packaged for Conda.
To activate the snakemake-tutorial environment, execute
source activate snakemake-tutorial
Now you can use the installed tools. Execute
snakemake --help
to test this and get information about the command-line interface of Snakemake. To exit the environment, you can execute
source deactivate
but don’t do that now, since we finally want to start working with Snakemake :-).
Please make sure that you have activated the environment we created before, and that you have an open terminal in the working directory you have created.
A Snakemake workflow is defined by specifying rules in a Snakefile. Rules decompose the workflow into small steps (e.g., the application of a single tool) by specifying how to create sets of output files from sets of input files. Snakemake automatically determines the dependencies between the rules by matching file names.
The Snakemake language extends the Python language, adding syntactic structures for rule definition and additional controls. All added syntactic structures begin with a keyword followed by a code block that is either on the same line or indented and consists of multiple lines. The resulting syntax resembles that of original Python constructs.
In the following, we will introduce the Snakemake syntax by creating an example workflow. The workflow will map sequencing reads to a reference genome and call variants on the mapped reads.
Our first Snakemake rule maps reads of a given sample to a given reference genome. In the working directory, create a new file called Snakefile
with an editor of your choice. We propose to use the Atom editor, since it provides out-of-the-box syntax highlighting for Snakemake. In the Snakefile, define the following rule:
rule bwa_map:
input:
"data/genome.fa",
"data/samples/A.fastq"
output:
"mapped_reads/A.bam"
shell:
"bwa mem {input} | samtools view -Sb - > {output}"
A Snakemake rule has a name (here bwa_map) and a number of directives, here input, output and shell. The input and output directives are followed by lists of files that are expected to be used or created by the rule. In the simplest case, these are just explicit Python strings. The shell directive is followed by a Python string containing the shell command to execute. In the shell command string, we can refer to elements of the rule via braces notation (similar to the Python format function). Here, we refer to the output file by specifying {output} and to the input files by specifying {input}. Since the rule has multiple input files, Snakemake will concatenate them, separated by whitespace. In other words, Snakemake will replace {input} with data/genome.fa data/samples/A.fastq before executing the command. The shell command invokes bwa mem with the reference genome and reads, and pipes the output into samtools, which creates a compressed BAM file containing the alignments. The output of samtools is redirected into the output file defined by the rule.
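The braces substitution behaves much like Python's str.format. The following plain-Python sketch is an analogy only, not Snakemake's actual implementation, but it reproduces what happens to the command string:

```python
# Analogy only: Snakemake's braces substitution behaves like str.format,
# with multiple input files joined by a single space.
command = "bwa mem {input} | samtools view -Sb - > {output}".format(
    input=" ".join(["data/genome.fa", "data/samples/A.fastq"]),
    output="mapped_reads/A.bam",
)
```

After the substitution, command holds exactly the shell command Snakemake would execute for this rule.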
When a workflow is executed, Snakemake tries to generate given target files. Target files can be specified via the command line. By executing
snakemake -np mapped_reads/A.bam
in the working directory containing the Snakefile, we tell Snakemake to generate the target file mapped_reads/A.bam. Since we used the -n (or --dryrun) flag, Snakemake will only show the execution plan instead of actually performing the steps. The -p flag instructs Snakemake to also print the resulting shell command for illustration. To generate the target files, Snakemake applies the rules given in the Snakefile in a top-down way. The application of a rule to generate a set of output files is called a job. For each input file of a job, Snakemake again (i.e., recursively) determines rules that can be applied to generate it. This yields a directed acyclic graph (DAG) of jobs where the edges represent dependencies. So far, we only have a single rule, and the DAG of jobs consists of a single node. Nevertheless, we can execute our workflow with
snakemake mapped_reads/A.bam
Note that, after completion of the above command, Snakemake will not try to create mapped_reads/A.bam again, because it is already present in the file system. Snakemake only re-runs jobs if one of the input files is newer than one of the output files, or if one of the input files will be updated by another job.
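A minimal sketch of that up-to-date check, assuming a simple modification-time comparison (Snakemake's real logic additionally tracks jobs that will update inputs, among other things):

```python
import os

# Hypothetical sketch of the re-run decision: a job must run if any output
# is missing, or if any input is newer than the oldest output.
def needs_rerun(inputs, outputs):
    if not all(os.path.exists(o) for o in outputs):
        return True
    oldest_output = min(os.path.getmtime(o) for o in outputs)
    return any(os.path.getmtime(i) > oldest_output for i in inputs)
```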
Obviously, the rule will only work for a single sample with reads in the file data/samples/A.fastq. However, Snakemake allows generalizing rules by using named wildcards. Simply replace the A in the second input file and in the output file with the wildcard {sample}, leading to
rule bwa_map:
input:
"data/genome.fa",
"data/samples/{sample}.fastq"
output:
"mapped_reads/{sample}.bam"
shell:
"bwa mem {input} | samtools view -Sb - > {output}"
When Snakemake determines that this rule can be applied to generate a target file by replacing the wildcard {sample} in the output file with an appropriate value, it will propagate that value to all occurrences of {sample} in the input files and thereby determine the necessary input for the resulting job. Note that you can have multiple wildcards in your file paths; however, to avoid conflicts with other jobs of the same rule, all output files of a rule have to contain exactly the same wildcards.
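The matching and propagation can be illustrated with a hypothetical sketch (not Snakemake's actual code) that converts the output pattern into a regular expression with named groups:

```python
import re

# Hypothetical sketch: turn the output pattern into a named-group regex,
# match the requested target, then propagate the captured wildcard value
# into the input patterns.
def match_target(output_pattern, target):
    regex = re.sub(r"\{(\w+)\}", r"(?P<\1>.+)", output_pattern)
    m = re.fullmatch(regex, target)
    return m.groupdict() if m else None

wildcards = match_target("mapped_reads/{sample}.bam", "mapped_reads/B.bam")
inputs = [p.format(**wildcards)
          for p in ("data/genome.fa", "data/samples/{sample}.fastq")]
```

Here the target mapped_reads/B.bam yields the wildcard value B, which is then filled into the input patterns.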
When executing
snakemake -np mapped_reads/B.bam
Snakemake will determine that the rule bwa_map can be applied to generate the target file by replacing the wildcard {sample} with the value B. In the output of the dry-run, you will see how the wildcard value is propagated to the input files and all filenames in the shell command. You can also specify multiple targets, e.g.:
snakemake -np mapped_reads/A.bam mapped_reads/B.bam
Some Bash magic can make this particularly handy. For example, you can alternatively compose multiple targets in a single expression via
snakemake -np mapped_reads/{A,B}.bam
Note that this is not special Snakemake syntax: Bash simply expands the given path into two, one for each element of the set {A,B}.
In both cases, you will see that Snakemake only proposes to create the output file mapped_reads/B.bam. This is because you already executed the workflow before (see the previous step) and no input file is newer than the output file mapped_reads/A.bam. You can update the file modification date of the input file data/samples/A.fastq via
touch data/samples/A.fastq
and see how Snakemake wants to re-run the job to create the file mapped_reads/A.bam by executing
snakemake -np mapped_reads/A.bam mapped_reads/B.bam
For later steps, we need the read alignments in the BAM files to be sorted. This can be achieved with the samtools sort command. We add the following rule beneath the bwa_map rule:
rule samtools_sort:
input:
"mapped_reads/{sample}.bam"
output:
"sorted_reads/{sample}.bam"
shell:
"samtools sort -T sorted_reads/{wildcards.sample} "
"-O bam {input} > {output}"
This rule will take the input file from the mapped_reads directory and store a sorted version in the sorted_reads directory. Note that Snakemake automatically creates missing directories before jobs are executed. For sorting, samtools requires a prefix specified with the flag -T. Here, we need the value of the wildcard sample. Snakemake allows accessing wildcards in the shell command via the wildcards object, which has an attribute holding the value of each wildcard.
When issuing
snakemake -np sorted_reads/B.bam
you will see how Snakemake wants to run first the rule bwa_map and then the rule samtools_sort to create the desired target file: as mentioned before, the dependencies are resolved automatically by matching file names.
Next, we need to index the sorted read alignments for random access. This can be done with the following rule:
rule samtools_index:
input:
"sorted_reads/{sample}.bam"
output:
"sorted_reads/{sample}.bam.bai"
shell:
"samtools index {input}"
With three steps in place, it is a good time to take a closer look at the resulting DAG of jobs. By executing
snakemake --dag sorted_reads/{A,B}.bam.bai | dot -Tsvg > dag.svg
we create a visualization of the DAG using the dot command provided by Graphviz. For the given target files, Snakemake specifies the DAG in the dot language and pipes it into the dot command, which renders the definition into SVG format. The rendered DAG is written to the file dag.svg and will look similar to this:
The DAG contains a node for each job and edges representing the dependencies. Jobs that don’t need to be run because their output is up-to-date are dashed. For rules with wildcards, the value of the wildcard for the particular job is displayed in the job node.
The next step in our workflow will aggregate the aligned reads from all samples and jointly call genomic variants on them. Snakemake provides a helper function for collecting input files. With
expand("sorted_reads/{sample}.bam", sample=SAMPLES)
we obtain a list of files where the given pattern "sorted_reads/{sample}.bam" was formatted with the values in the given list of samples SAMPLES, i.e.
["sorted_reads/A.bam", "sorted_reads/B.bam"]
The function is particularly useful when the pattern contains multiple wildcards. For example,
expand("sorted_reads/{sample}.{replicate}.bam", sample=SAMPLES, replicate=[0, 1])
would create the product of all elements of SAMPLES and the list [0, 1], yielding
["sorted_reads/A.0.bam", "sorted_reads/A.1.bam", "sorted_reads/B.0.bam", "sorted_reads/B.1.bam"]
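The semantics of expand can be sketched with a simplified re-implementation based on itertools.product; the real function is provided by Snakemake and supports more features:

```python
from itertools import product

# Simplified re-implementation of expand() to illustrate its semantics;
# the real function is provided by Snakemake and supports more features.
def expand_sketch(pattern, **wildcards):
    keys = list(wildcards)
    return [pattern.format(**dict(zip(keys, values)))
            for values in product(*(wildcards[k] for k in keys))]
```

For example, expand_sketch("sorted_reads/{sample}.bam", sample=["A", "B"]) reproduces the simple case above.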
For more information, see the Documentation. Here, we use only the simple case of expand. We first let Snakemake know which samples we want to consider. Remember that Snakemake works top-down; it does not automatically infer the samples from, e.g., the FASTQ files in the data folder. Also remember that Snakefiles are in principle Python code, enhanced by some declarative statements to define workflows. Hence, we can define the list of samples ad hoc in plain Python at the top of the Snakefile:
SAMPLES = ["A", "B"]
Later, we will learn about more sophisticated ways like config files. Now, we can add the following rule to our Snakefile:
rule bcftools_call:
input:
fa="data/genome.fa",
bam=expand("sorted_reads/{sample}.bam", sample=SAMPLES),
bai=expand("sorted_reads/{sample}.bam.bai", sample=SAMPLES)
output:
"calls/all.vcf"
shell:
"samtools mpileup -g -f {input.fa} {input.bam} | "
"bcftools call -mv - > {output}"
With multiple input or output files, it is sometimes handy to refer to them separately in the shell command. This can be done by specifying names for input or output files (here, e.g., fa=...). The files can then be referred to in the shell command via, e.g., {input.fa}. For long shell commands like this one, it is advisable to split the string over multiple indented lines. Python will automatically merge it into one. Further, you will notice that the input or output file lists can contain arbitrary Python expressions, as long as they return a string or a list of strings. Here, we invoke our expand function to aggregate over the aligned reads of all samples.
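The automatic merging of the split string is plain Python behavior: adjacent string literals are concatenated by the parser. For example:

```python
# Adjacent string literals are merged into one by the Python parser,
# which is why a long shell command can be split over indented lines.
command = ("samtools mpileup -g -f {input.fa} {input.bam} | "
           "bcftools call -mv - > {output}")
```

Note the trailing space before the closing quote of the first literal; without it, the pipe symbol and the following word would run together.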
Exercise: obtain the updated DAG of jobs for the target file calls/all.vcf; it should look like this:

Although Snakemake workflows are already self-documenting to a certain degree, it is often useful to summarize the obtained results and performed steps in a comprehensive report. With Snakemake, such reports can be composed easily with the built-in report function. It is best practice to create reports in a separate rule that takes all desired results as input files and provides a single HTML file as output.
rule report:
input:
"calls/all.vcf"
output:
"report.html"
run:
from snakemake.utils import report
with open(input[0]) as vcf:
n_calls = sum(1 for l in vcf if not l.startswith("#"))
report("""
An example variant calling workflow
===================================
Reads were mapped to the Yeast
reference genome and variants were called jointly with
SAMtools/BCFtools.
This resulted in {n_calls} variants (see Table T1_).
""", output[0], T1=input[0])
First, we notice that this rule does not entail a shell command. Instead, we use the run directive, which is followed by plain Python code. Similar to the shell case, we have access to input and output files, which we can handle as plain Python objects (no braces notation here).
We go through the run block line by line. First, we import the report function from snakemake.utils. Second, we open the VCF file by accessing it via its index in the input files (i.e., input[0]), and count the number of non-header lines (which is equivalent to the number of variant calls). Third, we create the report using the report function. The function takes a string that contains reStructuredText markup. In addition, we can use the familiar braces notation to access any Python variables (here the n_calls variable we have defined before). The second argument of the report function is the path where the report will be stored (the function creates a single HTML file). Then, report expects any number of keyword arguments referring to files that shall be embedded into the report. Technically, this means that the file will be stored as a Base64 encoded data URI within the HTML file, making reports entirely self-contained. Importantly, you can refer to the files from within the report via the given keywords followed by an underscore (here T1_). Hence, reports can be used to semantically connect and explain the obtained results.
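The embedding idea can be sketched in a few lines of plain Python. The helper below is hypothetical and only illustrates the Base64 data URI concept; the actual report() implementation may differ in its details:

```python
import base64

# Hypothetical helper illustrating the Base64 data URI idea used to make
# reports self-contained; not the actual report() internals.
def data_uri(path, mime="text/plain"):
    with open(path, "rb") as f:
        payload = base64.b64encode(f.read()).decode()
    return "data:{};charset=utf-8;base64,{}".format(mime, payload)
```

The returned string can be used directly as the href of a link inside an HTML file, so the linked file travels with the report.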
When there are many result files, it is sometimes handy to define the names already in the list of input files and unpack these into keyword arguments as follows:
report("""...""", output[0], **input)
Further, you can add metadata in the form of any string that will be displayed in the footer of the report, e.g.
report("""...""", output[0], metadata="Author: Johannes Köster (koester@jimmy.harvard.edu)", **input)
So far, we always executed the workflow by specifying a target file at the command line. Apart from filenames, Snakemake also accepts rule names as targets if the referred rule does not have wildcards. Hence, it is possible to write target rules collecting particular subsets of the desired results or all results. Moreover, if no target is given at the command line, Snakemake will define the first rule of the Snakefile as the target. It is therefore best practice to have a rule all at the top of the workflow which has all typically desired target files as input files.
Here, this means that we add a rule
rule all:
input:
"report.html"
to the top of our workflow. When executing Snakemake with
snakemake -n
the execution plan for creating the file report.html, which contains and summarizes all our results, will be shown. Note that, apart from Snakemake considering the first rule of the workflow as the default target, the order of rules in the Snakefile is arbitrary and does not influence the DAG of jobs.
Exercise: create the report and open report.html in your browser.

Exercise: run snakemake --help and search for the flag --forcerun. Then, use this flag to re-execute the rule samtools_sort and see what happens.

Exercise: with --reason it is possible to display the execution reason for each job. Try this flag together with a dry-run and the --forcerun flag to understand the decisions of Snakemake.

In total, the resulting workflow looks like this:
SAMPLES = ["A", "B"]
rule all:
input:
"report.html"
rule bwa_map:
input:
"data/genome.fa",
"data/samples/{sample}.fastq"
output:
"mapped_reads/{sample}.bam"
shell:
"bwa mem {input} | samtools view -Sb - > {output}"
rule samtools_sort:
input:
"mapped_reads/{sample}.bam"
output:
"sorted_reads/{sample}.bam"
shell:
"samtools sort -T sorted_reads/{wildcards.sample} "
"-O bam {input} > {output}"
rule samtools_index:
input:
"sorted_reads/{sample}.bam"
output:
"sorted_reads/{sample}.bam.bai"
shell:
"samtools index {input}"
rule bcftools_call:
input:
fa="data/genome.fa",
bam=expand("sorted_reads/{sample}.bam", sample=SAMPLES),
bai=expand("sorted_reads/{sample}.bam.bai", sample=SAMPLES)
output:
"calls/all.vcf"
shell:
"samtools mpileup -g -f {input.fa} {input.bam} | "
"bcftools call -mv - > {output}"
rule report:
input:
"calls/all.vcf"
output:
"report.html"
run:
from snakemake.utils import report
with open(input[0]) as vcf:
n_calls = sum(1 for l in vcf if not l.startswith("#"))
report("""
An example variant calling workflow
===================================
Reads were mapped to the Yeast
reference genome and variants were called jointly with
SAMtools/BCFtools.
This resulted in {n_calls} variants (see Table T1_).
""", output[0], T1=input[0])
Now that the basic concepts of Snakemake have been illustrated, we can introduce advanced topics.
For some tools, it is advisable to use more than one thread in order to speed up the computation. Snakemake can be made aware of the threads a rule needs with the threads directive. In our example workflow, it makes sense to use multiple threads for the rule bwa_map:
rule bwa_map:
input:
"data/genome.fa",
"data/samples/{sample}.fastq"
output:
"mapped_reads/{sample}.bam"
threads: 8
shell:
"bwa mem -t {threads} {input} | samtools view -Sb - > {output}"
The number of threads can be propagated to the shell command with the familiar braces notation (i.e., {threads}). If no threads directive is given, a rule is assumed to need 1 thread.
When a workflow is executed, the number of threads the jobs need is considered by the Snakemake scheduler. In particular, the scheduler ensures that the sum of the threads of all running jobs does not exceed a given number of available CPU cores. This number can be given with the --cores command line argument (by default, Snakemake uses only 1 CPU core). For example,
snakemake --cores 10
would execute the workflow with 10 cores. Since the rule bwa_map needs 8 threads, only one job of the rule can run at a time, and the Snakemake scheduler will try to saturate the remaining cores with other jobs like, e.g., samtools_sort. The threads directive in a rule is interpreted as a maximum: when fewer cores than threads are provided, the number of threads a rule uses will be reduced to the number of given cores.
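The core constraint can be sketched as a toy job selection; Snakemake's actual scheduler additionally weighs priorities and other resources:

```python
# Toy selection of runnable jobs: greedily start jobs while the sum of
# their thread requirements fits into the available cores. Not Snakemake's
# actual scheduler, which also weighs priorities and other resources.
def select_jobs(jobs, cores):
    running, used = [], 0
    for name, threads in jobs:
        threads = min(threads, cores)  # a rule's threads value is a maximum
        if used + threads <= cores:
            running.append(name)
            used += threads
    return running
```

With 10 cores, an 8-thread bwa_map job leaves room for two 1-thread jobs; with only 8 cores, the bwa_map job occupies everything.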
Apart from the very common threads resource, Snakemake provides a resources directive that can be used to specify arbitrary resources, e.g., memory usage or auxiliary computing devices like GPUs. Similar to threads, these can be considered by the scheduler when an available amount of that resource is given with the command line argument --resources. Details can be found in the Snakemake Documentation.
Exercise: with the flag --forceall, you can enforce a complete re-execution of the workflow. Combine this flag with different values for --cores and examine how the scheduler selects jobs to run in parallel.

So far, we specified the samples to consider in a Python list within the Snakefile. However, often you want your workflow to be customizable, so that it can be easily adapted to new data. For this purpose, Snakemake provides a config file mechanism. Config files can be written in JSON or YAML, and loaded with the configfile directive. In our example workflow, we add the line
configfile: "config.yaml"
to the top of the Snakefile. Snakemake will load the config file and store its contents in a globally available dictionary named config. In our case, it makes sense to specify the samples in config.yaml as
samples:
A: data/samples/A.fastq
B: data/samples/B.fastq
Now, we can remove the statement defining SAMPLES from the Snakefile and change the rule bcftools_call to
rule bcftools_call:
input:
fa="data/genome.fa",
bam=expand("sorted_reads/{sample}.bam", sample=config["samples"]),
bai=expand("sorted_reads/{sample}.bam.bai", sample=config["samples"])
output:
"calls/all.vcf"
shell:
"samtools mpileup -g -f {input.fa} {input.bam} | "
"bcftools call -mv - > {output}"
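One detail worth noting: config["samples"] is now a dict rather than a list, and iterating over a dict yields its keys, so expand still formats with the sample names. A plain-Python sketch:

```python
# config.yaml parses into a nested dict. Iterating over a dict yields its
# keys, so expand(..., sample=config["samples"]) formats with the sample
# names ("A", "B"), not with the FASTQ paths stored as values.
config = {"samples": {"A": "data/samples/A.fastq", "B": "data/samples/B.fastq"}}
bams = ["sorted_reads/{}.bam".format(sample) for sample in config["samples"]]
```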
Since we have stored the path to the FASTQ files in the config file, we can also generalize the rule bwa_map to use these paths. This case is different from the rule bcftools_call we modified above. To understand this, it is important to know that Snakemake workflows are executed in three phases:

1. In the initialization phase, the Snakefile is parsed and all rules are instantiated.
2. In the DAG phase, the DAG of jobs is built by filling wildcards and matching input files against output files.
3. In the scheduling phase, the DAG of jobs is executed, with jobs being started as resources permit.
The expand functions in the list of input files of the rule bcftools_call are executed during the initialization phase. In this phase, we don't know about jobs, wildcard values, or rule dependencies. Hence, we cannot determine the FASTQ paths for rule bwa_map from the config file in this phase, because we don't even know which jobs will be generated from that rule. Instead, we need to defer the determination of input files to the DAG phase. This can be achieved by specifying an input function instead of a string inside the input directive. For the rule bwa_map, this works as follows:
rule bwa_map:
input:
"data/genome.fa",
lambda wildcards: config["samples"][wildcards.sample]
output:
"mapped_reads/{sample}.bam"
threads: 8
shell:
"bwa mem -t {threads} {input} | samtools view -Sb - > {output}"
Here, we use an anonymous function, also called a lambda expression. Any normal function would work as well. Input functions take as their single argument a wildcards object, which allows access to the wildcard values via attributes (here wildcards.sample). They return a string or a list of strings that are interpreted as paths to input files (here, we return the path that is stored for the sample in the config file). Input functions are evaluated once the wildcard values of a job are determined.
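To see what such an input function returns, we can evaluate the lambda from the rule with a stand-in wildcards object; in a real run, Snakemake constructs this object from the matched output file, and the config contents below are the ones assumed from config.yaml:

```python
from types import SimpleNamespace

# Assumed config contents, mirroring config.yaml above.
config = {"samples": {"A": "data/samples/A.fastq", "B": "data/samples/B.fastq"}}

# The input function from the rule; Snakemake would call it once the
# wildcard values of a job are known.
input_fn = lambda wildcards: config["samples"][wildcards.sample]

# Stand-in wildcards object, for illustration only.
input_fn(SimpleNamespace(sample="B"))
```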
Exercise: in the data/samples folder, there is an additional sample C.fastq. Add that sample to the config file and see how Snakemake wants to recompute the part of the workflow belonging to the new sample.

Sometimes, shell commands are not only composed of input and output files and some static flags. In particular, it can happen that additional parameters need to be set depending on the wildcard values of the job. For this, Snakemake allows defining arbitrary parameters for rules with the params directive. In our workflow, it is reasonable to annotate aligned reads with so-called read groups, which contain metadata like the sample name. We modify the rule bwa_map accordingly:
rule bwa_map:
input:
"data/genome.fa",
lambda wildcards: config["samples"][wildcards.sample]
output:
"mapped_reads/{sample}.bam"
params:
rg="@RG\tID:{sample}\tSM:{sample}"
threads: 8
shell:
"bwa mem -R '{params.rg}' -t {threads} {input} | samtools view -Sb - > {output}"
Similar to input and output files, params can be accessed from the shell command. Moreover, the params directive can also take functions, as in Step 3, to defer initialization to the DAG phase.
Exercise: variant calling can be parametrized with the prior mutation rate, which is set via the flag -P of the bcftools call command. Consider making this flag configurable by adding a new key to the config file and using the params directive in the rule bcftools_call to propagate it to the shell command.

When executing a large workflow, it is usually desirable to store the output of each job persistently in files instead of just printing it to the terminal. For this purpose, Snakemake allows specifying log files for rules. Log files are defined via the log directive and handled similarly to output files, but they are not subject to rule matching and are not cleaned up when a job fails. We modify our rule bwa_map as follows:
rule bwa_map:
input:
"data/genome.fa",
lambda wildcards: config["samples"][wildcards.sample]
output:
"mapped_reads/{sample}.bam"
params:
rg="@RG\tID:{sample}\tSM:{sample}"
log:
"logs/bwa_map/{sample}.log"
threads: 8
shell:
"(bwa mem -R '{params.rg}' -t {threads} {input} | "
"samtools view -Sb - > {output}) 2> {log}"
The shell command is modified to collect the STDERR output of both bwa and samtools and pipe it into the file referred to by {log}. It is best practice to store all log files in a logs subdirectory, prefixed by the rule or tool name. Log files must contain exactly the same wildcards as the output files to avoid clashes.
Exercise: add a log directive to the bcftools_call rule as well.

In addition to the report functionality discussed before, Snakemake can summarize various provenance information for all output files of the workflow. The flag --summary prints a table associating each output file with the rule used to generate it, the creation date, and optionally the version of the tool used for creation. Further, the table informs about updated input files and changes to the source code of the rule after creation of the output file. Invoke Snakemake with --summary to examine the information for our example.

In our workflow, we create two BAM files for each sample, namely the output of the rules bwa_map and samtools_sort. When not dealing with examples, the underlying data is usually huge. Hence, the resulting BAM files need a lot of disk space and their creation takes some time. Snakemake allows marking output files as temporary, such that they are deleted once every consuming job has been executed, in order to save disk space. We use this mechanism for the output file of the rule bwa_map:
rule bwa_map:
input:
"data/genome.fa",
lambda wildcards: config["samples"][wildcards.sample]
output:
temp("mapped_reads/{sample}.bam")
params:
rg="@RG\tID:{sample}\tSM:{sample}"
log:
"logs/bwa_map/{sample}.log"
threads: 8
shell:
"(bwa mem -R '{params.rg}' -t {threads} {input} | "
"samtools view -Sb - > {output}) 2> {log}"
This results in the deletion of the BAM file once the corresponding samtools_sort job has been executed. Since the creation of BAM files via read mapping and sorting is computationally expensive, it is reasonable to protect the final BAM file from accidental deletion or modification. We modify the rule samtools_sort by marking its output file as protected:
rule samtools_sort:
input:
"mapped_reads/{sample}.bam"
output:
protected("sorted_reads/{sample}.bam")
shell:
"samtools sort -T sorted_reads/{wildcards.sample} "
"-O bam {input} > {output}"
After execution of the job, Snakemake will write-protect the output file in the filesystem, so that it can’t be overwritten or deleted accidentally.
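The write protection amounts to clearing the write permission bits on the file; a rough sketch of the idea in plain Python, not Snakemake's actual code:

```python
import os
import stat

# Hypothetical sketch: clear all write bits on a finished output file,
# roughly the effect of marking it as protected().
def write_protect(path):
    mode = os.stat(path).st_mode
    os.chmod(path, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))
```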
Exercise: re-execute the workflow, requesting the target file mapped_reads/A.bam. Although the file is marked as temporary, you will see that Snakemake does not delete it, because it is specified as a target file.

The final version of our workflow looks like this:
configfile: "config.yaml"


rule all:
    input:
        "report.html"


rule bwa_map:
    input:
        "data/genome.fa",
        lambda wildcards: config["samples"][wildcards.sample]
    output:
        temp("mapped_reads/{sample}.bam")
    params:
        rg="@RG\tID:{sample}\tSM:{sample}"
    log:
        "logs/bwa_map/{sample}.log"
    threads: 8
    shell:
        "(bwa mem -R '{params.rg}' -t {threads} {input} | "
        "samtools view -Sb - > {output}) 2> {log}"


rule samtools_sort:
    input:
        "mapped_reads/{sample}.bam"
    output:
        protected("sorted_reads/{sample}.bam")
    shell:
        "samtools sort -T sorted_reads/{wildcards.sample} "
        "-O bam {input} > {output}"


rule samtools_index:
    input:
        "sorted_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam.bai"
    shell:
        "samtools index {input}"


rule bcftools_call:
    input:
        fa="data/genome.fa",
        bam=expand("sorted_reads/{sample}.bam", sample=config["samples"]),
        bai=expand("sorted_reads/{sample}.bam.bai", sample=config["samples"])
    output:
        "calls/all.vcf"
    shell:
        "samtools mpileup -g -f {input.fa} {input.bam} | "
        "bcftools call -mv - > {output}"


rule report:
    input:
        "calls/all.vcf"
    output:
        "report.html"
    run:
        from snakemake.utils import report
        with open(input[0]) as vcf:
            n_calls = sum(1 for l in vcf if not l.startswith("#"))

        report("""
        An example variant calling workflow
        ===================================

        Reads were mapped to the Yeast
        reference genome and variants were called jointly with
        SAMtools/BCFtools.

        This resulted in {n_calls} variants (see Table T1_).
        """, output[0], T1=input[0])
In the following, we introduce some features that go beyond the scope of the above example workflow. For details and even more features, see the Documentation, the FAQ, and the command line help (snakemake --help).
With the benchmark directive, Snakemake can be instructed to measure the wall clock time of a job. We activate benchmarking for the rule bwa_map:
rule bwa_map:
    input:
        "data/genome.fa",
        lambda wildcards: config["samples"][wildcards.sample]
    output:
        temp("mapped_reads/{sample}.bam")
    params:
        rg="@RG\tID:{sample}\tSM:{sample}"
    log:
        "logs/bwa_map/{sample}.log"
    benchmark:
        "benchmarks/{sample}.bwa.benchmark.txt"
    threads: 8
    shell:
        "(bwa mem -R '{params.rg}' -t {threads} {input} | "
        "samtools view -Sb - > {output}) 2> {log}"
The benchmark directive takes a string that points to the file where benchmarking results shall be stored. Similar to output files, the path can contain wildcards (they must be the same wildcards as in the output files). When a job derived from the rule is executed, Snakemake will measure the wall clock time and store it in the file in tab-delimited format. With the command line flag --benchmark-repeats, Snakemake can be instructed to perform repetitive measurements by executing benchmark jobs multiple times. The repeated measurements appear as subsequent lines in the tab-delimited benchmark file.
We can include the benchmark results into our report:
rule report:
    input:
        T1="calls/all.vcf",
        T2=expand("benchmarks/{sample}.bwa.benchmark.txt", sample=config["samples"])
    output:
        "report.html"
    run:
        from snakemake.utils import report
        with open(input.T1) as vcf:
            n_calls = sum(1 for l in vcf if not l.startswith("#"))

        report("""
        An example variant calling workflow
        ===================================

        Reads were mapped to the Yeast
        reference genome and variants were called jointly with
        SAMtools/BCFtools.

        This resulted in {n_calls} variants (see Table T1_).

        Benchmark results for BWA can be found in the tables T2_.
        """, output[0], **input)
We use the expand function to collect the benchmark files for all samples. Here, we directly provide names for the input files. In particular, we can also name the whole list of benchmark files returned by the expand function as T2. When invoking the report function, we just unpack input into keyword arguments (resulting in T1 and T2). In the text, we refer with T2_ to the list of benchmark files.
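The expand function essentially substitutes each combination of the given wildcard values into the pattern. Its behavior can be approximated in a few lines of plain Python (a sketch of the semantics, not Snakemake's actual implementation):

```python
import itertools

def expand_sketch(pattern, **wildcards):
    """Approximate snakemake's expand: fill the pattern with every
    combination of the provided wildcard value lists."""
    keys = list(wildcards)
    return [
        pattern.format(**dict(zip(keys, combo)))
        for combo in itertools.product(*(wildcards[k] for k in keys))
    ]

files = expand_sketch("benchmarks/{sample}.bwa.benchmark.txt",
                      sample=["A", "B"])
print(files)
# → ['benchmarks/A.bwa.benchmark.txt', 'benchmarks/B.bwa.benchmark.txt']
```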
Exercise: Re-execute the workflow, this time benchmarking bwa_map with 3 repeats (--benchmark-repeats 3). Open the report and see how the list of benchmark files is presented in the HTML report.

In order to re-use building blocks or simply to structure large workflows, it is sometimes reasonable to split a workflow into modules. For this, Snakemake provides the include directive to include another Snakefile into the current one, e.g.:
include: "path/to/other.snakefile"
Alternatively, Snakemake allows defining sub-workflows. A sub-workflow refers to a working directory with a complete Snakemake workflow. Output files of that sub-workflow can be used in the current Snakefile. When executing, Snakemake ensures that the output files of the sub-workflow are up to date before executing the current workflow. This mechanism is particularly useful when you want to extend a previous analysis without modifying it. For details about sub-workflows, see the Documentation.
Exercise: Put some of the rules into a separate Snakefile and use the include directive to make them available in our example workflow again.

Using the run directive as above is only reasonable for short Python scripts. As soon as your script becomes larger, it is reasonable to separate it from the workflow definition. For this purpose, Snakemake offers the script directive. Using this, the report rule from above could instead look like this:
rule report:
    input:
        T1="calls/all.vcf",
        T2=expand("benchmarks/{sample}.bwa.benchmark.txt", sample=config["samples"])
    output:
        "report.html"
    script:
        "scripts/report.py"
The actual Python code to generate the report is now hidden in the script scripts/report.py. Script paths are always relative to the referring Snakefile. In the script, all properties of the rule like input, output, wildcards, params, threads etc. are available as attributes of a global snakemake object:
from snakemake.utils import report

with open(snakemake.input.T1) as vcf:
    n_calls = sum(1 for l in vcf if not l.startswith("#"))

report("""
An example variant calling workflow
===================================

Reads were mapped to the Yeast
reference genome and variants were called jointly with
SAMtools/BCFtools.

This resulted in {n_calls} variants (see Table T1_).

Benchmark results for BWA can be found in the tables T2_.
""", snakemake.output[0], **snakemake.input)
Although there are other strategies to invoke separate scripts from your workflow (e.g., invoking them via shell commands), the benefit of this approach is obvious: the script logic is separated from the workflow logic (and can even be shared between workflows), while boilerplate code like the parsing of command line arguments becomes unnecessary.
Apart from Python scripts, it is also possible to use R scripts. In R scripts, an S4 object named snakemake, analogous to the Python case above, is available and allows access to input and output files and other parameters. Here the syntax follows that of S4 classes with attributes that are R lists, e.g., we can access the first input file with snakemake@input[[1]] (note that the first file does not have index 0 here, because R starts counting from 1). Named input and output files can be accessed in the same way, by just providing the name instead of an index, e.g., snakemake@input[["myfile"]].
For details and examples, see the script section in the Documentation.
By default, Snakemake executes jobs on the local machine it is invoked on. Alternatively, it can execute jobs in distributed environments, e.g., compute clusters or batch systems. If the nodes share a common file system, Snakemake supports three alternative execution modes.
In cluster environments, compute jobs are usually submitted as shell scripts via commands like qsub. Snakemake provides a generic mode to execute on such clusters. By invoking Snakemake with
snakemake --cluster qsub --jobs 100
each job will be compiled into a shell script that is submitted with the given command (here qsub). The --jobs flag limits the number of concurrently submitted jobs to 100. This basic mode assumes that the submission command returns immediately after submitting the job. Some clusters allow running the submission command in synchronous mode, such that it waits until the job has been executed. In such cases, we can invoke, e.g.,
snakemake --cluster-sync "qsub -sync yes" --jobs 100
The specified submission command can also be decorated with additional parameters taken from the submitted job. For example, the number of used threads can be accessed in braces, similar to the formatting of shell commands, e.g.,
snakemake --cluster "qsub -pe threaded {threads}" --jobs 100
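The placeholders in the submission command are filled per job, which conceptually works like standard Python string formatting (a sketch of the idea; Snakemake's real logic handles more job properties):

```python
# Hypothetical per-job assembly of a cluster submission command:
# the {threads} placeholder is replaced by the job's thread count.
cluster_cmd = "qsub -pe threaded {threads}"
job_properties = {"threads": 8}

submit_cmd = cluster_cmd.format(**job_properties)
print(submit_cmd)  # → qsub -pe threaded 8
```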
Alternatively, Snakemake can use the Distributed Resource Management Application API (DRMAA). This API provides a common interface to control various resource management systems. The DRMAA support can be activated by invoking Snakemake as follows:
snakemake --drmaa --jobs 100
If available, DRMAA is preferable over the generic cluster modes because it provides better control and error handling. To support additional cluster-specific parametrization, a Snakefile can be complemented by a cluster configuration.
Snakemake uses regular expressions to match output files to input files and determine dependencies between the jobs. Sometimes it is useful to constrain the values a wildcard can have. This can be achieved by adding a regular expression that describes the set of allowed wildcard values. For example, the wildcard sample in the output file "sorted_reads/{sample}.bam" can be constrained to only allow alphanumeric sample names, as in "sorted_reads/{sample,[A-Za-z0-9]+}.bam". This mechanism helps to solve two kinds of ambiguity.
Consider a rule with the output file {sample}.{group}.txt and assume that the target file is A.1.normal.txt. It is not clear whether sample="A.1" and group="normal" or sample="A" and group="1.normal" is the right assignment. Here, constraining the sample wildcard with {sample,[A-Z]+}.{group} solves the problem.

When dealing with ambiguous rules, it is best practice to first try to solve the ambiguity by using a proper file structure, for example, by separating the output files of different steps in different directories.