Chapter 4 Processes

In Nextflow, a process is the basic processing primitive for executing a user script. As mentioned before, processes can run pretty much anything, from R scripts to complex Bash commands: anything that is executable on the Linux system can be run.

The process definition starts with the keyword process, followed by the process name and finally the process body delimited by curly brackets. The process body must contain a string that represents the command or, more generally, the script to be executed.

The overall structure of a process in Nextflow looks like this:

process < name > {

   [ directives ] // #1

   input:
    < process inputs > // #2

   output:
    < process outputs > // #3

   when:
    < condition > // #4

   [script|shell|exec]:
   < user script to be executed > // #5

}
  1. The directive declarations block provides optional settings that affect the execution of the current process.
  2. The input block defines from which channels the process expects to receive data.
  3. The output declaration block lets you send the results produced by the process out to channels.
  4. The when declaration defines a condition that must be true for the process to be executed.
  5. The script block is a string statement that defines the command executed by the process to carry out its task.

4.1 script|shell|exec

We start with the script block. A process contains one and only one script block, and it must be the last statement when the process contains input and output declarations. The string you enter is executed as a Bash script on the host system. It can be any command, script, or combination of them that you would normally use in a terminal or in an ordinary Bash script.

The script block can be a simple string or a multi-line string. The latter makes it easier to write non-trivial scripts composed of several commands spanning multiple lines.

The shell block is an alternative to script. It is also executed by Bash, but it is written with single quotes (''') instead of double quotes (""") and Nextflow variables are referenced as !{name} instead of $name, so the $ sign is left for Bash variables.

For example,

db='database' // #1

process justEchoBash {
    debug true // #2

    script:
    """
    echo $db
    """ // #3
}

process justEchoShell {
    debug true

    shell:
    '''
    echo !{db}
    ''' // #4
}

workflow {
    justEchoBash()
    justEchoShell()
}
  1. Creates a variable db and assigns it the value 'database'.
  2. This directive makes the process print its standard output. More on this later!
  3. Runs the Bash command echo, which prints the content of the variable db. Note that variables defined outside the process script are accessed with the $ sign.
  4. Does the same using a shell block. Here the variable db is accessed by wrapping it in !{}.

Unless you really need the shell block, use script.

Create a file called main_4.nf

nano main_4.nf

Copy the above code to it. Save the file (Ctrl+o enter) and exit (Ctrl+x). Now run

nextflow main_4.nf

4.2 input

The input block defines from which channels the process expects to receive data. You can only define one input block at a time and it must contain one or more input declarations.

The input block follows the syntax shown below:

input:
  <input qualifier> <input name> 

An input definition starts with an input qualifier and then the input name.

The input qualifier declares the type of data to be received.

The qualifiers available are the ones listed in the following table:

Qualifier   Semantic
val         Lets you access the received input value by its name in the process script.
env         Lets you use the received value to set an environment variable named as the specified input name.
file        Lets you handle the received value as a file, staging it properly in the execution context.
path        Lets you handle the received value as a path, staging the file properly in the execution context.
stdin       Lets you forward the received value to the process stdin special file.
tuple       Lets you handle a group of input values having one of the above qualifiers.
each        Lets you execute the process for each entry in the input collection.

For example, here we have a channel and a process that gets an input from the channel and prints the values.

num = Channel.from( 1, 2, 3 ) // #1

process basicExample {
    debug true

    input:
    val x // #2

    script:
    "echo process job $x" // #3
}

workflow {           // #4
    basicExample(num)
}

Alternatively, when there is only one input, the workflow can be written with the pipe operator. Here num is the only input channel. Keep only one of the two workflow blocks in your file.


workflow {           // #5
    num | basicExample
}
  1. Creates a channel emitting the values 1, 2 and 3.
  2. Declares the input x, which receives its values from the channel num. The qualifier val makes each value accessible by name in the script.
  3. Runs a simple echo command that writes the value to standard output. Remember that the input is accessed as $x.
  4. Executes the workflow, calling the process basicExample with num as the input channel.
  5. When the process declares exactly one input, the pipe operator | can be used to provide the input instead of passing it as a parameter (similar to piping in Bash).

In the above example the process is executed three times: each time a value is received from the channel num, it is used to run the script.

Create a file called main_5.nf

nano main_5.nf

Copy the above code to it. Save the file (Ctrl+o enter) and exit (Ctrl+x). Now run

nextflow main_5.nf

Can you create a workflow that reads all the files with the .mzML extension into a file channel and prints their names in a process? Remember, this is a file channel!

mzMLFiles = Channel.fromPath( '/crex/proj/uppmax2024-2-11/metabolomics/mzMLData/*.mzML' )

process featureFinder {
    debug true

    input:
    file mzML

    script:
    """
    echo "I am processing the file $mzML"
    """
}

workflow {
    featureFinder(mzMLFiles)
}

Create a file called main_6.nf and write the code in it.

nano main_6.nf

Save the file (Ctrl+o enter) and exit (Ctrl+x). Now run

nextflow main_6.nf

4.2.1 input each

The each qualifier lets you repeat the execution of a process for each item in a collection every time new data is received.

mzMLFiles = Channel.fromPath( '/crex/proj/uppmax2024-2-11/metabolomics/mzMLData/*.mzML' ) // #1
num = Channel.from( 1, 2, 3 ) // #2
process featureFinder {
    debug true

    input:
    each x // #3
    file y // #4

    script:
    "echo value $x file $y" // #5
}

workflow {
    featureFinder(num,mzMLFiles) // #6
}
  1. Creates a file channel from all mzML files matching /crex/proj/uppmax2024-2-11/metabolomics/mzMLData/*.mzML.
  2. Creates a channel containing numbers.
  3. Declares x as an input repeater over the values in the first input.
  4. Declares y as a second input that receives the files from the file channel.
  5. The script that is executed.
  6. The workflow calls the process featureFinder with the channels num and mzMLFiles. The inputs must be passed in the same order in which they are declared in the process: here x receives values from num and y receives files from mzMLFiles.

In the above example, every time an mzML file (y) is received as input by the process, three tasks are executed, each running an echo with a different value for the x parameter. This is useful when you need to repeat the same task for a given set of parameters.
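The pairing that each produces can be pictured as two nested loops. The following is only a rough sketch in plain Bash, outside Nextflow: the file names a.mzML and b.mzML are made-up stand-ins for the real data, and Nextflow may run the resulting tasks in parallel and in any order rather than one after the other as here.

```shell
# Hypothetical stand-ins for the two inputs: three values and two file names.
values="1 2 3"
files="a.mzML b.mzML"

count=0
for y in $files; do          # one iteration per received file
    for x in $values; do     # repeated for every value in the collection
        echo "value $x file $y"
        count=$((count + 1))
    done
done

echo "tasks: $count"
```

With three values and two files the sketch prints six lines, one for each task that Nextflow would create.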

Create a file called main_7.nf

nano main_7.nf

Copy the above code to it. Save the file (Ctrl+o enter) and exit (Ctrl+x). Now run

nextflow main_7.nf

4.3 Output

The output declaration block allows you to define the channels used by the process to send out the results produced. You can only define one output block at a time and it must contain one or more output declarations.

output:
<output qualifier> <output name> [, <option>: <option value>]

The qualifiers that can be used in the output declaration block are the ones listed in the following table:

Qualifier   Semantic
val         Sends variables with the name specified over the output channel.
file        Sends a file produced by the process with the name specified over the output channel.
path        Sends a file produced by the process with the name specified over the output channel (replaces file).
env         Sends the variable defined in the process environment with the name specified over the output channel.
stdout      Sends the executed process stdout over the output channel.
tuple       Sends multiple values over the same output channel.

For example,

mzMLFiles = Channel.fromPath( '/crex/proj/uppmax2024-2-11/metabolomics/mzMLData/*.mzML' ) // #1
process featureFinder {
    input:
    file x // #2

    output:
    file "output/${x.baseName}.featureXML" // #3

    script:
    """
    mkdir output
    cp -in $x output/${x.baseName}.featureXML
    """ // #4
}

workflow {
    output_channel = featureFinder(mzMLFiles) // #5
    output_channel.view()
}

In the above example, a file channel is created from /crex/proj/uppmax2024-2-11/metabolomics/mzMLData/*.mzML. The process receives this channel as input and emits an output channel (assigned to output_channel in the workflow) in which each file has the extension featureXML.

  1. Creates a channel emitting files.
  2. Receives a file from the input channel.
  3. Sends the file to the output channel. The output file is located in the output directory. ${x.baseName} gives the name of the file without its extension.
  4. In the Bash script, we first create an output folder. We then copy the input to the output folder but change its extension. This is obviously a pretty useless command, but you can replace it with a more meaningful one!
  5. Passes the input channel to the process and assigns the process output to a channel.
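To see what the script does for a single file, it can be tried by hand in a terminal. The sketch below is plain Bash and makes a few assumptions: sample01.mzML is a made-up file name, the file content is a dummy, and the expansion ${x%.*} is used as the Bash counterpart of ${x.baseName} (the name without its last extension).

```shell
# Work in a throwaway directory so no existing files are touched.
sandbox=$(mktemp -d)
cd "$sandbox"

# Hypothetical input file standing in for one item of the mzMLFiles channel.
x="sample01.mzML"
printf 'dummy content\n' > "$x"

# Bash counterpart of ${x.baseName}: strip the last extension.
base="${x%.*}"

# Same two steps as in the process script.
mkdir output
cp "$x" "output/${base}.featureXML"

ls output
```

The file listed in output has the same base name as the input and the new extension, which is exactly the path the output declaration of the process expects.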

Create a file called main_8.nf

nano main_8.nf

Copy the above code to it. Save the file (Ctrl+o enter) and exit (Ctrl+x). Now run

nextflow main_8.nf

Let’s clean up the Nextflow work directory now to free up some space:

nextflow clean -f

4.4 Directives

Using the directive declarations block you can provide optional settings that will affect the execution of the current process.

They must be entered at the top of the process body, before any other declaration blocks (i.e. input, output, etc.), and have the following syntax:

name value [, value2 [,..]]

The complete list of directives is available in the Nextflow documentation.

4.4.1 publishDir

The publishDir directive allows you to publish the process output files to a specified folder.

process foo {

    publishDir 'data/chunks' // #1

    output:
    file 'chunk_*'

    '''
    printf 'Hola' | split -b 1 - chunk_
    '''
}

workflow {
    foo()
}

The above example splits the string Hola into file chunks of a single byte.

  1. When the process completes, the chunk_* output files are published into the data/chunks folder (relative to the directory where the workflow was launched).
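The split command from the script can also be tried directly in a terminal to see which files it produces. This is a small sketch in plain Bash; the temporary directory is only there to keep the chunk files out of your working folder.

```shell
# Work in a throwaway directory so no existing files are touched.
sandbox=$(mktemp -d)
cd "$sandbox"

# Same command as in the process script: write 'Hola' without a trailing
# newline and split the stream read from stdin (-) into 1-byte files.
printf 'Hola' | split -b 1 - chunk_

# Show each resulting file together with its content.
for f in chunk_*; do
    printf '%s: %s\n' "$f" "$(cat "$f")"
done
```

This leaves four files, chunk_aa to chunk_ad, each holding one letter of Hola. These are the files matched by the chunk_* pattern in the output declaration.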

Can you create a workflow having a single process that just creates a text file (with whatever content) as an output and also publish its output to a directory?

process simpleOutput {
    publishDir 'testOutput'

    output:
    file "test.txt"

    script:
    "echo test >> test.txt"
}

workflow {
    simpleOutput()
}

Create a file called main_9.nf

nano main_9.nf

Copy the above code to it. Save the file (Ctrl+o enter) and exit (Ctrl+x). Now run

nextflow main_9.nf
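By default publishDir does not copy the output: it creates symbolic links that point into the Nextflow work directory. The sketch below imitates this in plain Bash to show what happens to such a link when the work directory is removed. The directory names work and published are hypothetical and nothing here is created by Nextflow itself.

```shell
# Throwaway directory with hypothetical names.
sandbox=$(mktemp -d)
cd "$sandbox"
mkdir work published

# The task writes its result inside the work directory.
echo test > work/test.txt

# Publishing in symlink mode amounts to linking the result into the target folder.
ln -s "$sandbox/work/test.txt" published/test.txt
before=$(cat published/test.txt)

# Stand-in for cleaning the work directory.
rm -r work

# -e follows the link, so the check fails once the target is gone.
if [ -e published/test.txt ]; then after="still readable"; else after="dangling link"; fi
echo "before: $before, after: $after"
```

If the published files must survive a clean-up of the work directory, publishDir accepts a mode option, for example publishDir 'testOutput', mode: 'copy'.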

4.4.2 tag

The tag directive allows you to associate each process execution with a custom label, so that it will be easier to identify them in the log file or in the trace execution report.

mzMLFiles = Channel.fromPath( '/crex/proj/uppmax2024-2-11/metabolomics/mzMLData/*.mzML' )
num = Channel.from( 1, 2, 3 )
process featureFinder {
    tag "$y" // #1

    input:
    each x
    file y

    script:
    """
    echo value $x file $y
    """
}

workflow {
    featureFinder(num,mzMLFiles)
}

In the above example, each task is labelled with the name of the file it received, and this label is shown next to the process name while the workflow is running.

  1. $y in the tag indicates that the name of the file from mzMLFiles should be used as the tag.

Create a file called main_10.nf

nano main_10.nf

Copy the above code to it. Save the file (Ctrl+o enter) and exit (Ctrl+x). Now run

nextflow main_10.nf

What do you see? What is the difference to a process without a tag?

Now let’s clean up the cache produced by Nextflow using the clean command

nextflow clean -f

This removes the work directories and cache entries of the most recent run and frees up disk space.