Chapter 4 Processes
In Nextflow, a process is the basic processing primitive for executing a user script. As mentioned before, processes can run pretty much anything, from R scripts to complex Bash commands. Anything that is executable on the Linux system can be run!
The process definition starts with the keyword process, followed by the process name and finally the process body delimited by curly brackets. The process body must contain a string that represents the command or, more generally, the script that the process executes.
The overall structure of a process in Nextflow looks like this:
process < name > {
    [ directives ]                      // #1
    input:
    < process inputs >                  // #2
    output:
    < process outputs >                 // #3
    when:
    < condition >                       // #4
    [script|shell|exec]:
    < user script to be executed >      // #5
}
- Using the directive declarations block, you can provide optional settings that affect the execution of the current process.
- The input block defines the channels from which the process expects to receive data.
- The output declaration block allows you to send the results produced out to channels.
- The when declaration allows you to define a condition that must be satisfied in order to execute the process.
- The script block is a string statement that defines the command executed by the process to carry out its task.
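As a minimal sketch of the when declaration, assume a hypothetical process called onlyEven and a made-up channel of numbers. The script should run only for the inputs that satisfy the condition; the other inputs are skipped.

```nextflow
// Hypothetical example: the process name and the values are illustrative only
process onlyEven {
    debug true

    input:
    val x

    when:
    x % 2 == 0          // run the task only for even values

    script:
    "echo $x is even"
}

workflow {
    Channel.of( 1, 2, 3, 4 ) | onlyEven
}
```

With these inputs, only the values 2 and 4 should trigger a task.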
4.1 script|shell|exec
We start with the script block. A process contains one and only one script block, and it must be the last statement when the process contains input and output declarations. The entered string is executed as a Bash script on the host system. It can be any command, script, or combination of them that you would normally use in a terminal shell or in a common Bash script. The script block can be a simple string or a multi-line string. The latter simplifies the writing of non-trivial scripts composed of multiple commands spanning multiple lines. If you would like to use a shell block instead of a script block, write the string with single quotes (''') instead of double quotes ("""). The command is still run by Bash; what changes is that Nextflow variables are written as !{var}, so the $ sign is left for Bash variables.
For example,
db = 'database'                 // #1

process justEchoBash {
    debug true                  // #2

    script:                     // #3
    """
    echo $db
    """
}

process justEchoShell {
    debug true

    shell:                      // #4
    '''
    echo !{db}
    '''
}

workflow {
    justEchoBash()
    justEchoShell()
}
- Creates a variable db and assigns the string database to it.
- This is a directive that makes the process print its standard output. More on this later!
- Runs the Bash command echo, which prints the content of the variable db. Note that variables defined outside the process script can be accessed using the $ sign.
- This is identical to the script version but uses a shell block. To access the variable db we need to wrap it in !{}.
Please note that, unless you really need shell, you should prefer script.
Create a file called main_4.nf and copy the above code into it. Save the file (Ctrl+o, Enter) and exit (Ctrl+x). Now run it:
nextflow run main_4.nf
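The difference between script and shell becomes clearer when Nextflow variables and Bash variables appear in the same command. The following is a sketch only: params.greeting and the process names are hypothetical, and $HOME is used merely as an example of a variable that Bash itself provides.

```nextflow
// Hypothetical parameter, used only for this illustration
params.greeting = 'hello'

process mixedScript {
    debug true

    script:
    // Double quotes: Nextflow resolves ${...}; a Bash variable must be escaped as \$
    """
    echo "Nextflow variable: ${params.greeting}"
    echo "Bash variable: \$HOME"
    """
}

process mixedShell {
    debug true

    shell:
    // Single quotes: Nextflow resolves !{...}; $ is left untouched for Bash
    '''
    echo "Nextflow variable: !{params.greeting}"
    echo "Bash variable: $HOME"
    '''
}

workflow {
    mixedScript()
    mixedShell()
}
```

Both processes should print the same two lines; only the way the variables are written differs.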
4.2 input
The input block defines from which channels the process expects to receive data. You can only define one input block at a time and it must contain one or more input declarations.
The input block follows the syntax shown below:
input:
<input qualifier> <input name>
An input definition starts with an input qualifier and then the input name.
The input qualifier declares the type of data to be received.
The qualifiers available are the ones listed in the following table:
Qualifier | Semantic |
---|---|
val | Lets you access the received input value by its name in the process script. |
env | Lets you use the received value to set an environment variable named as the specified input name. |
file | Lets you handle the received value as a file, staging it properly in the execution context. |
path | Lets you handle the received value as a path, staging the file properly in the execution context. |
stdin | Lets you forward the received value to the process stdin special file. |
tuple | Lets you handle a group of input values having one of the above qualifiers. |
each | Lets you execute the process for each entry in the input collection. |
For example, here we have a channel and a process that gets an input from the channel and prints the values.
num = Channel.of( 1, 2, 3 )     // #1

process basicExample {
    debug true

    input:
    val x                       // #2

    script:
    "echo process job $x"       // #3
}

workflow {                      // #4
    basicExample(num)
}
There is another way of writing the workflow when the process has only one input. Here num is the only input channel, so it can be piped into the process with the | operator.
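A minimal sketch of the piped form, reusing the basicExample process from above (the channel is created inside the workflow here, which is an arbitrary choice for this sketch):

```nextflow
process basicExample {
    debug true

    input:
    val x

    script:
    "echo process job $x"
}

workflow {
    num = Channel.of( 1, 2, 3 )
    num | basicExample          // same effect as basicExample(num)
}
```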
- Creates a channel that emits the values 1, 2 and 3.
- Declares the process input, which receives its values from the channel num. The input qualifier is val.
- Runs a simple echo command that writes the value to standard output. Remember that to access the input we need to use $x.
- Executes the workflow containing the process basicExample with num as the input channel.
- When the process declares exactly one input, the pipe operator | can be used to provide the input to the process instead of passing it as a parameter (similar to what you learned about piping in Bash).
In the above example the process is executed three times: each time a value is received from the channel num, it is used to run the script.
Create a file called main_5.nf and copy the above code into it. Save the file (Ctrl+o, Enter) and exit (Ctrl+x). Now run it:
nextflow run main_5.nf
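The tuple qualifier from the table above groups several values that travel together through a channel. The sketch below is an illustration under assumptions: the sample names, the thresholds and the process name showTuple are all made up.

```nextflow
process showTuple {
    debug true

    input:
    tuple val(sample), val(threshold)   // one channel item carries both values

    script:
    """
    echo "sample $sample uses threshold $threshold"
    """
}

workflow {
    // Each list is emitted as a single item and unpacked by the tuple input
    pairs = Channel.of( ['sampleA', 10], ['sampleB', 20] )
    showTuple(pairs)
}
```

The process should run once per list, with sample and threshold set from the two elements of that list.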
Can you create a workflow that reads all the files with the *.mzML extension into a file channel and prints their names in a process? Remember, this is a file channel!
mzMLFiles = Channel.fromPath( '/crex/proj/uppmax2024-2-11/metabolomics/mzMLData/*.mzML' )

process featureFinder {
    debug true

    input:
    file mzML

    script:
    """
    echo "I'm processing $mzML file!"
    """
}

workflow {
    featureFinder(mzMLFiles)
}
Create a file called main_6.nf and write the code in it.
Save the file (Ctrl+o, Enter) and exit (Ctrl+x). Now run it:
nextflow run main_6.nf
4.2.1 input each
The each qualifier allows you to repeat the execution of a process for each item in a collection, every time new data is received.
mzMLFiles = Channel.fromPath( '/crex/proj/uppmax2024-2-11/metabolomics/mzMLData/*.mzML' ) // #1
num = Channel.of( 1, 2, 3 )         // #2

process featureFinder {
    debug true

    input:
    each x                          // #3
    file y                          // #4

    script:
    "echo value $x file $y"         // #5
}

workflow {
    featureFinder(num, mzMLFiles)   // #6
}
- Creates a file channel from all mzML files matching /crex/proj/uppmax2024-2-11/metabolomics/mzMLData/*.mzML.
- Creates a channel containing numbers.
- Defines an input repeater for the values in the channel.
- Defines another input that receives files from the file channel.
- The script that is executed.
- The workflow calls the process featureFinder with the channels num and mzMLFiles. The inputs must be passed in the same order in which they are declared in the input block of the process. Here x receives values from num and y receives values from mzMLFiles.
In the above example, every time an mzML file (y) is received as input by the process, it executes three tasks, each running echo with a different value for the x parameter. This is useful when you need to repeat the same task for a given set of parameters.
Create a file called main_7.nf and copy the above code into it. Save the file (Ctrl+o, Enter) and exit (Ctrl+x). Now run it:
nextflow run main_7.nf
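The collection given to each does not have to come from a channel factory; an ordinary list works as well. A minimal sketch, with made-up sample names, made-up modes and a hypothetical process name:

```nextflow
process repeatPerMode {
    debug true

    input:
    val sample
    each mode           // repeater: one task per item in the collection

    script:
    "echo sample $sample mode $mode"
}

workflow {
    samples = Channel.of( 'sampleA', 'sampleB' )
    modes   = [ 'fast', 'accurate' ]    // an ordinary list
    repeatPerMode(samples, modes)
}
```

With two samples and two modes, four tasks should be executed, one for every combination.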
4.3 Output
The output declaration block allows you to define the channels used by the process to send out the results produced. You can only define one output block at a time and it must contain one or more output declarations.
The qualifiers that can be used in the output declaration block are the ones listed in the following table:
Qualifier | Semantic |
---|---|
val | Sends variables with the name specified over the output channel. |
file | Sends a file produced by the process with the name specified over the output channel. |
path | Sends a file produced by the process with the name specified over the output channel (replaces file). |
env | Sends the variable defined in the process environment with the name specified over the output channel. |
stdout | Sends the executed process stdout over the output channel. |
tuple | Sends multiple values over the same output channel. |
For example,
mzMLFiles = Channel.fromPath( '/crex/proj/uppmax2024-2-11/metabolomics/mzMLData/*.mzML' ) // #1

process featureFinder {
    input:
    file x                                      // #2

    output:
    file "output/${x.baseName}.featureXML"      // #3

    script:                                     // #4
    """
    mkdir output
    cp -in $x output/${x.baseName}.featureXML
    """
}

workflow {
    output_channel = featureFinder(mzMLFiles)   // #5
    output_channel.view()
}
In the above example, the workflow first creates a file channel from /crex/proj/uppmax2024-2-11/metabolomics/mzMLData/*.mzML. The process receives this channel as its input and produces an output channel, output_channel (named in the workflow), in which each file has its extension changed to featureXML.
- Creates a channel emitting files.
- Receives values from the input channel.
- Sends data to the output channel. The output file is located under the output directory. ${x.baseName} gives the name of the file without its extension.
- In the Bash script, we first create an output folder. We then copy the input into the output folder but change its extension. This is obviously a pretty useless command, but you can change it to a more meaningful one!
- Passes the input channel to the process and stores the output of the process in a channel.
Create a file called main_8.nf and copy the above code into it. Save the file (Ctrl+o, Enter) and exit (Ctrl+x). Now run it:
nextflow run main_8.nf
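Files are not the only thing a process can emit. As a sketch of the stdout and tuple qualifiers from the output table, the hypothetical process below sends each input word together with whatever its script prints. The words and the process name are made up for illustration.

```nextflow
process countChars {
    input:
    val word

    output:
    tuple val(word), stdout     // the input value plus the text printed by the script

    script:
    """
    printf '%s' "$word" | wc -c
    """
}

workflow {
    Channel.of( 'Hola', 'Hello' ) | countChars | view
}
```

Each item shown by view should pair a word with the character count that wc printed for it.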
Let’s clean up the nextflow work directory now to free up some space
4.4 Directives
Using the directive declarations block you can provide optional settings that will affect the execution of the current process.
They must be entered at the top of the process body, before any other declaration blocks (i.e. input, output, etc.) and have the following syntax:
name value [, value2 [,..]]
You can see the complete list of directives in the Nextflow documentation.
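As a small sketch of where directives go, the hypothetical process below sets three of them before its input block. The process name, the label text and the chosen values are assumptions made for this illustration.

```nextflow
process withDirectives {
    // Directives come first, before input, output and script
    debug true          // print the standard output of each task
    tag "demo"          // label shown next to the process name in the log
    cpus 1              // request one CPU for each task

    input:
    val x

    script:
    "echo received $x"
}

workflow {
    Channel.of( 'a', 'b' ) | withDirectives
}
```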
4.4.1 publishDir
The publishDir directive allows you to publish the process output files to a specified folder.
process foo {
    publishDir 'data/chunks'    // #1

    output:
    file 'chunk_*'

    script:
    '''
    printf 'Hola' | split -b 1 - chunk_
    '''
}

workflow {
    foo()
}
The above example splits the string Hola into file chunks of a single byte.
- When the process completes, the chunk_* output files are published into the data/chunks folder, relative to the directory from which the pipeline was launched.
Can you create a workflow having a single process that just creates a text file (with whatever content) as an output and also publish its output to a directory?
process simpleOutput {
    publishDir 'testOutput'

    output:
    file "test.txt"

    script:
    "echo test >> test.txt"
}

workflow {
    simpleOutput()
}
Create a file called main_9.nf and copy the above code into it. Save the file (Ctrl+o, Enter) and exit (Ctrl+x). Now run it:
nextflow run main_9.nf
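By default, publishDir creates symbolic links that point into the work directory, so the published files stop working once the work directory is cleaned. A possible variation, sketched here with a hypothetical process name and folder, is to ask for real copies using the mode option:

```nextflow
process simpleOutputCopy {
    publishDir 'testOutputCopy', mode: 'copy'   // copy files instead of linking them

    output:
    path 'test.txt'

    script:
    "echo test > test.txt"
}

workflow {
    simpleOutputCopy()
}
```

This sketch uses path rather than file for the output, following the note in the output table that path replaces file.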
4.4.2 tag
The tag directive allows you to associate each process execution with a custom label, so that it is easier to identify them in the log file or in the trace execution report.
mzMLFiles = Channel.fromPath( '/crex/proj/uppmax2024-2-11/metabolomics/mzMLData/*.mzML' )
num = Channel.of( 1, 2, 3 )

process featureFinder {
    tag "$y"                    // #1

    input:
    each x
    file y

    script:
    """
    echo value $x file $y
    """
}

workflow {
    featureFinder(num, mzMLFiles)
}
In the above example, when a file is received by the process, its name is shown next to the process name while the process runs.
- $y in the tag indicates that the name of the file from mzMLFiles should be used as the tag.
Create a file called main_10.nf and copy the above code into it. Save the file (Ctrl+o, Enter) and exit (Ctrl+x). Now run it:
nextflow run main_10.nf
What do you see? What is the difference to a process without a tag?
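A tag does not have to be a file name. As a sketch with made-up sample names and a hypothetical process name, any input value can be used to build the label, which Nextflow shows in parentheses after the process name in its log:

```nextflow
process labelled {
    tag "sample $sample"        // label built from the input value

    input:
    val sample

    script:
    "echo processing $sample"
}

workflow {
    Channel.of( 'A', 'B', 'C' ) | labelled
}
```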
Now let’s clean up the cache produced by Nextflow using the clean command
This removes all the cache files and saves space on the disk.