Chapter 10 Assignment 2: Nextflow

This assignment has two parts, which together complete the Nextflow section of the course.

10.1 Set up the environment

You need to do this every time you start the lab.

First ssh to Uppmax

ssh -AX yourusername@rackham.uppmax.uu.se

Make a directory in your userspace

mkdir -p /crex/proj/uppmax2026-1-94/nobackup/$USER/nextflow_lab

Navigate to the folder you have created

cd /crex/proj/uppmax2026-1-94/nobackup/$USER/nextflow_lab

Load Nextflow

module load Nextflow/22.10.1

Set the required environment variable:

export NXF_SINGULARITY_CACHEDIR=/crex/proj/uppmax2026-1-94/metabolomics/singularity

You are now ready to start!
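Because the setup has to be repeated in every new session, a quick sanity check can save a failed run. The following is a minimal sketch, not part of the course material: the function name check_lab_env is hypothetical, and it only tests the two things set up above (the exported cache variable and the nextflow command from the loaded module).

```shell
# Sketch of a sanity check for the lab environment (check_lab_env is a hypothetical name).
check_lab_env() {
    problems=0
    # NXF_SINGULARITY_CACHEDIR tells Nextflow where the cached Singularity images are stored
    if [ -z "${NXF_SINGULARITY_CACHEDIR:-}" ]; then
        echo "NXF_SINGULARITY_CACHEDIR is not set" >&2
        problems=1
    fi
    # The nextflow command is only on PATH after 'module load Nextflow/22.10.1'
    if ! command -v nextflow >/dev/null 2>&1; then
        echo "nextflow not found: load the Nextflow module first" >&2
        problems=1
    fi
    return "$problems"
}
```

After pasting the function into your shell, run check_lab_env; it prints nothing and returns 0 when both checks pass.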

The container for both of the following parts is “metaboigniter/course_docker:v5” and we want to use SLURM to run the jobs.

IMPORTANT: DO NOT remove, change or add anything here and in its subfolders: /crex/proj/uppmax2026-1-94/metabolomics

For part 1, please send us the modified main.nf, nextflow.config, and the modified command used to run (included in a file named command.txt).
For part 2, you will need to send us the main.nf file.

IMPORTANT: The Uppmax file system can sometimes cause processes to fail. If this happens, rerun the workflow with the -resume option. A step that was previously shown as failed may well pass the second time!

Use -resume whenever you run the workflow to skip the steps that were already completed (this saves both compute time and your own time).
For example

nextflow yourpipeline.nf -profile uppmax --project "uppmax2026-1-94" -resume
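If file system hiccups make you rerun the same command several times, the reruns can be automated with a small wrapper. This is only a sketch under assumptions: retry is a hypothetical helper, not a Nextflow or Uppmax feature, and it blindly repeats the command, so it will not help when a step fails for a real reason such as too little memory.

```shell
# Hypothetical helper: run a command up to max_tries times, stopping at the first success.
retry() {
    max_tries=$1
    shift
    attempt=1
    # Keep rerunning the command until it exits with status 0
    until "$@"; do
        if [ "$attempt" -ge "$max_tries" ]; then
            echo "giving up after $attempt attempts" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        echo "attempt $attempt of $max_tries" >&2
    done
}

# Intended use on Uppmax (not executed here):
# retry 3 nextflow yourpipeline.nf -profile uppmax --project "uppmax2026-1-94" -resume
```

Because every attempt passes -resume, steps that already finished are skipped and only the failed ones are run again.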

10.2 Part 1

For part 1, please send us the modified main.nf, nextflow.config, and the modified command used to run nextflow (included in a file named command.txt).

In this part, we will run a small pipeline together. The pipeline is already in a folder that you need to copy to your personal space. We assume that you are now in nextflow_lab.

Copy the pipeline and the configuration file

cp -r /crex/proj/uppmax2026-1-94/metabolomics/xcms_pipeline .

Now you have all the required files in your folder. Try to run the pipeline

cd xcms_pipeline 
nextflow main.nf -profile uppmax --project "uppmax2026-1-94"

You can open another terminal window, ssh to Uppmax, and use jobinfo to see whether your jobs are running. If some steps do not run, the cause might be the requested RAM, CPU, or running time. You can change these settings in nextflow.config!

Run jobinfo after logging in to Uppmax via ssh from another terminal:

jobinfo -u $USER

Please remember that Uppmax is likely to become very slow because the jobs are heavy. So please consider cancelling your jobs once you have seen how Uppmax runs them:

scancel -u $USER
  • Exercise one: Some parameters are hardcoded in the pipeline:
    • In the process process_masstrace_detection_pos_xcms we have the following parameters:

      peakwidthLow=5

      peakwidthHigh=10

      noise=1000

      polarity=positive

Can you change the config file so that the values of these parameters are set there and fetched inside the main.nf file? Also, show us how the user can pass these parameters when running the nextflow command. Show the command.

Tip: Set the values in the config file (nextflow.config) so that, for example, peakwidthLow=5 in main.nf can be replaced by peakwidthLow=$params.peakwidthLow.

Tip: For the second task, show how to run the pipeline by providing the parameters via the command line, e.g.:

nextflow main.nf -profile uppmax --project uppmax2026-1-94 --YourParametersHere
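On the command line, options with a single dash (such as -profile and -resume) are read by Nextflow itself, while options with a double dash are pipeline parameters: --name value becomes params.name inside the workflow. The sketch below shows only the form of such a command and how to record it for hand-in. The parameter --exampleParam and its value are hypothetical and are not one of the parameters of the exercise.

```shell
# Hypothetical parameter name and value, used only to show the --name value form
cmd='nextflow main.nf -profile uppmax --project uppmax2026-1-94 --exampleParam 42'

# Save the exact command so that it can be handed in as command.txt
printf '%s\n' "$cmd" > command.txt
cat command.txt
```

A value given on the command line takes precedence over a value with the same name in the params scope of nextflow.config.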

Help me: params scope

10.3 Part 2 (Build an OpenMS pipeline)

For part 2, you will need to send us the main.nf file.

As mentioned before, the pipeline in part 1 does mass spectrometry data pre-processing using the XCMS software suite. In this part of the exercise, you will build a pipeline from scratch that does the same thing using OpenMS. You can of course use the pipeline in part 1 as a template and modify it.

Tip: Use echo to print the shell command executed by Nextflow, to verify that the inputs and outputs have the desired format. To see what the processes print, pass “-process.debug true” to the nextflow command, or add the directive “debug true” to a single process to print only for that process.

When running your pipeline, remember to use the nextflow.config file from part 1 together with the flags -profile uppmax --project "uppmax2026-1-94". For example

nextflow yourpipeline.nf -profile uppmax --project "uppmax2026-1-94"

10.3.1 Main instructions

This is what you need to do

  • Since each process in this pipeline needs a separate parameter file, you will need three additional file channels:
    1. The first one is used in the FeatureFinder process and is located here: /crex/proj/uppmax2026-1-94/metabolomics/openms_params/FeatureFinder.ini
    2. The second one is for the alignment process and is located here: /crex/proj/uppmax2026-1-94/metabolomics/openms_params/featureAlignment.ini
    3. The third one is for the linker process and is located here: /crex/proj/uppmax2026-1-94/metabolomics/openms_params/featureLinker.ini
  • You also need to load all the mzML files from the folder “/crex/proj/uppmax2026-1-94/metabolomics/mzMLData/” into a channel.

Help me: File channels, How to create file channels

Now that we have our channels up and running, it is time to create four processes:

10.3.2 Process I

Let’s name the first process process_masstrace_detection_pos_OpenMS.

  1. This process has two inputs:
    1. The first input is from the mzML files
    2. The second one is from the feature finder parameter file. Please remember that the second input should be a type of input that repeats for each of the mzML files emitted by the first channel
  2. The output for this process has the same baseName as input but it has .featureXML extension. The name of the output channel must be alignmentProcess
  3. And finally, the command needed to run is:
FeatureFinderMetabo -in inputVariable -out inputBaseName.featureXML -ini settingFileVariable

Please note that you need to change inputVariable, inputBaseName and settingFileVariable! Please also note that the inputs to this tool are given by -in, the outputs by -out and settings by -ini.
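To make the naming rule concrete, here is a minimal shell sketch of how the output name follows from the input name. The file name sample1.mzML is a hypothetical stand-in for one file from the mzML channel, and the command is only printed, not run. Inside a Nextflow process, the baseName property of the input file gives you the same stem.

```shell
# Hypothetical input file name, standing in for one mzML file from the channel
input_file="sample1.mzML"

# Strip the .mzML extension to get the base name
base_name="${input_file%.mzML}"
output_file="${base_name}.featureXML"

# Print the resulting command instead of running it
echo "FeatureFinderMetabo -in $input_file -out $output_file -ini FeatureFinder.ini"
```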

Help me: processes, input each (file), bash command

10.3.3 Process II

The second process is called process_masstrace_alignment_pos_OpenMS

  1. Similar to the previous process it will get two inputs:
    1. The first one is the output of process_masstrace_detection_pos_OpenMS - the previous process
    2. The second one is from the alignment parameter file. Similar to the previous one this input is also repeating! Remember using each?
  2. The output for this process has the same baseName as input and it has .featureXML extension. However, these files are supposed to be kept in a different output folder (let’s call it out). The output channel must be named LinkerProcess
  3. The command needed is a bit different from the previous one:

This command accepts ALL the featureXML files at the same time (look at collect!) and outputs the same number of files. For the command, the inputs must be separated by space (remember the join operation in the XCMS pipeline?). The output is the same as input however you need to put the output in a different folder (out) within the process! Think about a simple bash script. You create a folder and join the files in a way that they are written in a separate folder!

Given three samples, x.featureXML, y.featureXML and z.featureXML, an example of the command is like

mkdir out
MapAlignerPoseClustering -in x.featureXML y.featureXML z.featureXML -out out/x.featureXML out/y.featureXML out/z.featureXML -ini setting.ini

Please remember that the code above is just an example of a command you run in bash. You will have to implement this using Nextflow!

Also note that the inputs to this tool/command are given by -in, the outputs by -out and settings by -ini.
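The "simple bash script" idea can be sketched as follows: start from the space-separated input names and build a second list in which every name is prefixed with out/. This is an illustration in plain shell using the three example files above, not the Nextflow implementation you have to write. It assumes file names without spaces and prints the commands instead of running them.

```shell
# The three example files, separated by spaces
inputs="x.featureXML y.featureXML z.featureXML"

# Build the output list by putting out/ in front of every file name
outputs=""
for f in $inputs; do
    # ${outputs:+$outputs } adds a separating space only when the list is not empty
    outputs="${outputs:+$outputs }out/$f"
done

# Print the commands instead of running them
echo "mkdir out"
echo "MapAlignerPoseClustering -in $inputs -out $outputs -ini setting.ini"
```

Note that every output name, including the first one, gets the out/ prefix.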

Finally, this process needs quite a bit of RAM. Set its memory to around 10 GB or more!

Help me: collect, joining in the process!, memory

Need help!

Use the pseudo code given below for getting the desired input and output format.
Use println function from groovy to check the format.

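process dummy_process {
    input:
    // some_channel is a placeholder; collect() delivers all files in one list
    file filefromchannel from some_channel.collect()

    script:
    // File names separated by a space
    def space_separated = filefromchannel.join(" ")
    // The same names, each one (including the first) prefixed with my_path/
    def string_separated = "my_path/" + filefromchannel.join(" my_path/")

    println(space_separated)
    println(string_separated)

    // The script block has to end with the command string to execute
    """
    echo "${space_separated}"
    """
}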

10.3.4 Process III

The third process is called process_masstrace_linker_pos_OpenMS and is used to link (merge) multiple files into a single file

  1. It takes two inputs:
    1. The output from the alignment - the previous process
    2. The parameter file (featureLinker.ini)
  2. The output of this process is a single file and must be named Aggregated.consensusXML. It will be sent over to a channel called textExport.
  3. And the command:
This is similar to the previous step. You need to gather all the inputs separated by space (collect and join!). However, it will only output one file. Given the same three files above, an example of the command would be:  
FeatureLinkerUnlabeledQT -in x.featureXML y.featureXML z.featureXML -out Aggregated.consensusXML -ini setting.ini

Note that the inputs to this tool are given by -in, the outputs by -out and settings by -ini.

Finally, this process needs quite a bit of RAM. Set its memory to around 10 GB or more!

10.3.5 Process IV

The last step takes the XML-formatted output from the previous step and converts it into a CSV file. Let’s call this process process_masstrace_exporter_pos_OpenMS

  1. The input to the process is the output from the previous step
  2. The output is a single file and must be called Aggregated_clean.csv and sent over to a channel called out
  3. To run the command:

This process will run two commands: the first one does the conversion and the second one the cleaning. We only need the last output (Aggregated_clean.csv)

TextExporter -in input -out Aggregated.csv
/usr/bin/readOpenMS.r input=Aggregated.csv output=Aggregated_clean.csv

Note that the inputs to this tool are given by -in, the outputs by -out.

Remember that this process should publish its result to a specific directory (your choice). By default, results are published as symlinks. Change the mode to copy if you want real copies!
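The difference between the two publish modes matters once the original file goes away, for example when the Nextflow work directory is cleaned up. The shell sketch below reproduces that situation in a temporary directory. The directory names and the CSV content are placeholders, not real pipeline output.

```shell
# Temporary playground: task/ stands in for the work directory, results/ for the publish directory
work_dir=$(mktemp -d)
mkdir "$work_dir/task" "$work_dir/results"
echo "placeholder,content" > "$work_dir/task/Aggregated_clean.csv"

# Publish the same file twice: once as a symlink, once as a real copy
ln -s "$work_dir/task/Aggregated_clean.csv" "$work_dir/results/linked.csv"
cp "$work_dir/task/Aggregated_clean.csv" "$work_dir/results/copied.csv"

# Simulate cleaning up the task directory
rm -r "$work_dir/task"

# [ -e ] follows symlinks, so a dangling link counts as missing
if [ -e "$work_dir/results/linked.csv" ]; then link_status="readable"; else link_status="broken"; fi
if [ -e "$work_dir/results/copied.csv" ]; then copy_status="readable"; else copy_status="broken"; fi
echo "symlink: $link_status, copy: $copy_status"

rm -r "$work_dir"
```

The symlink ends up broken while the copy stays readable, which is the reason to choose the copy mode for results you want to keep.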

Help me: publish the results