Chapter 10 Assignment 2: Nextflow
This assignment has two parts, which together complete the Nextflow section of the course.
10.1 Set up the environment
You need to do this every time you start the lab.
First ssh to Uppmax
Make a directory in your userspace
Navigate to the folder you have created
Load Nextflow
Fix some environment variables:
You are now ready to start!
The container for both of the following parts is “metaboigniter/course_docker:v5” and we want to use SLURM to run the jobs.
IMPORTANT: DO NOT remove, change, or add anything in this directory or its subfolders: /crex/proj/uppmax2026-1-94/metabolomics
For part 1, please send us the modified main.nf, nextflow.config, and the modified command used to run (included in a file named command.txt).
For part 2, you will need to send us the main.nf file.
IMPORTANT: Sometimes the Uppmax file system causes some of the processes to fail. If this happens, rerun the workflow with the -resume option. A step that was previously shown as failed may well pass this time!
Use -resume whenever you run the workflow to skip the steps that were already completed (this saves both compute time and your own time).
For example
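A minimal sketch of such a rerun, assuming the workflow script is called main.nf and sits in the current folder. It only assembles and prints the command so that you can check it first; paste the printed line into the terminal to launch it, together with any other flags you used for the original run.

```shell
# Dry run: store the command in a variable and print it for inspection.
# Assumption: the workflow script is main.nf in the current folder.
resume_cmd="nextflow run main.nf -resume"
echo "$resume_cmd"
```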
10.2 Part 1
For part 1, please send us the modified main.nf, nextflow.config, and the modified command used to run nextflow (included in a file named command.txt).
In this part, we will run a small pipeline together. The pipeline is already in a folder that you need to copy to your personal space. We assume that you are now in nextflow_lab
Copy the pipeline and the configuration file
Now you have all the required files in your folder. Try to run the pipeline
You can open another terminal window, ssh to Uppmax, and use jobinfo to see whether your jobs are running. If some steps do not run, it might be because of the requested RAM, CPU, or running time. You can change these parameters in nextflow.config!
Please remember that there is a high chance that Uppmax becomes very slow because the jobs are heavy. So please consider cancelling your jobs once you have seen how Uppmax runs them.
- Exercise one: Some parameters are hardcoded in the pipeline.
In the process process_masstrace_detection_pos_xcms we have the following parameters:
peakwidthLow=5
peakwidthHigh=10
noise=1000
polarity=positive
Can you change the config file so that the values of these parameters are set there and fetched inside the main.nf file? Also, show us how the user can pass these parameters when running the nextflow command. Show the command.
Tip: Your task is to modify the config file (nextflow.config) so that, in main.nf, peakwidthLow=5 (for example) can be replaced by peakwidthLow=$params.peakwidthLow, with the value of peakwidthLow set in the config file!
Tip: For the second task, show how to run the pipeline by providing input via the command line. eg:
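As an illustration only: Nextflow maps a double-dash option such as --peakwidthLow onto params.peakwidthLow, overriding whatever the config file sets for that run. The value 8 below is an arbitrary example, not a recommended setting, and the sketch prints the command instead of running it.

```shell
# Hypothetical override: 8 is an arbitrary illustrative value that would
# replace the peakwidthLow value set in nextflow.config for this run only.
peakwidth_low=8
params_cmd="nextflow run main.nf --peakwidthLow ${peakwidth_low}"
echo "$params_cmd"
```

Single-dash options (such as -resume or -profile) belong to Nextflow itself, while double-dash options are handed to the pipeline as parameters.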
Help me: params scope
10.3 Part 2 (Build an OpenMS pipeline)
For part 2, you will need to send us the main.nf file.
As discussed before, the pipeline in part 1 does mass spectrometry data pre-processing using the XCMS software suite. In this part of the exercise, you will need to build a pipeline from scratch that does the same thing using OpenMS. You can of course use the pipeline in part 1 as a template and modify it to do that.
Tip: Use echo to print the shell command executed by Nextflow, to verify that the input and output have the desired format. You can pass “-debug true” to the nextflow command, or add the directive “debug true” to a process to print the output of that process only.
When running your pipeline, remember to use the nextflow.config file from part 1 together with these flags: -profile uppmax --project "uppmax2026-1-94".
For example
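A sketch of what the full command could look like, assuming your workflow script is main.nf and the nextflow.config from part 1 sits in the folder you launch from (Nextflow picks it up automatically there). The command is only printed so that you can inspect it before launching.

```shell
# Project name as given in the text above.
project="uppmax2026-1-94"
# Assumption: main.nf and the part 1 nextflow.config are in the current folder.
run_cmd="nextflow run main.nf -profile uppmax --project \"${project}\" -resume"
echo "$run_cmd"
```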
10.3.1 Main instructions
This is what you need to do
- Since the processes in this pipeline each need a separate parameter file, you will need three additional file channels:
- The first one is used in the FeatureFinder process and is located here:
/crex/proj/uppmax2026-1-94/metabolomics/openms_params/FeatureFinder.ini
- The second one is for the alignment process and is located here:
/crex/proj/uppmax2026-1-94/metabolomics/openms_params/featureAlignment.ini
- The third one is for the linker process and is located here:
/crex/proj/uppmax2026-1-94/metabolomics/openms_params/featureLinker.ini
- You will also need to load all the mzML files from this folder “/crex/proj/uppmax2026-1-94/metabolomics/mzMLData/” into a channel.
Help me: File channels, How to create file channels
Now that we have our channels up and running, time to create four processes:
10.3.2 Process I
Let’s name the first process process_masstrace_detection_pos_OpenMS.
- This process has two inputs:
- The first input is from the mzML files
- The second one is from the feature finder parameter file. Please remember that the second input should be a type of input that repeats for each of the mzML files emitted by the first channel
- The output for this process has the same baseName as the input but with the .featureXML extension. The name of the output channel must be alignmentProcess
- And finally, the command needed to run is:
Please note that you need to change inputVariable, inputBaseName and settingFileVariable! Please also note that the inputs to this tool are given by -in, the outputs by -out and settings by -ini.
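To make the naming rule concrete, here is a small bash sketch of how an output name is derived from an input name. The file name sample1.mzML is hypothetical; in Nextflow itself the baseName property of the input file gives you the same stem.

```shell
# Hypothetical input file name, used only for illustration.
input_file="sample1.mzML"
# Strip the last extension to get the stem (what baseName refers to).
base_name="${input_file%.*}"
# The output keeps the stem and gets the .featureXML extension.
output_file="${base_name}.featureXML"
echo "$output_file"   # prints sample1.featureXML
```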
Help me: processes, input each (file), bash command
10.3.3 Process II
The second process is called process_masstrace_alignment_pos_OpenMS
- Similar to the previous process it will get two inputs:
- The first one is the output of process_masstrace_detection_pos_OpenMS, the previous process
- The second one is from the alignment parameter file. Similar to the previous one this input is also repeating! Remember using each?
- The output for this process has the same baseName as the input and has the .featureXML extension. However, these files are supposed to be kept in a different output folder (let’s call it out). The output channel must be named LinkerProcess
- The command needed is a bit different from the previous one:
This command accepts ALL the featureXML files at the same time (look at collect!) and outputs the same number of files. For the command, the inputs must be separated by spaces (remember the join operation in the XCMS pipeline?). The output files have the same names as the input files, but you need to put them in a different folder (out) within the process! Think about a simple bash script: you create a folder and join the file names in such a way that the outputs are written to that separate folder!
Given three samples, x.featureXML, y.featureXML and z.featureXML, an example of the command looks like this:
mkdir out
MapAlignerPoseClustering -in x.featureXML y.featureXML z.featureXML -out out/x.featureXML out/y.featureXML out/z.featureXML -ini setting.ini
Please remember that the code above is just an example of a command you would run in bash. You will have to implement this using Nextflow!
Also note that the inputs to this tool/command are given by -in, the outputs by -out and settings by -ini.
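If the joining is hard to picture, the following bash sketch builds the same command from a list of file names. It only illustrates the string handling: the three file names and setting.ini are the placeholders from the example above, and the command is printed rather than executed. In your pipeline the equivalent joining has to happen inside the Nextflow process.

```shell
# Placeholder file names from the example above, separated by spaces.
files="x.featureXML y.featureXML z.featureXML"

# The outputs go to a separate folder, so create it first.
mkdir -p out

# Prefix every input name with out/ to get the list of output names.
outputs=""
for f in $files; do
    outputs="$outputs out/$f"
done
outputs="${outputs# }"   # drop the leading space

# Assemble and print the command instead of running it.
align_cmd="MapAlignerPoseClustering -in $files -out $outputs -ini setting.ini"
echo "$align_cmd"
```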
Finally, this process needs quite a bit of RAM to go forward. Set the memory for this to probably around 10 GB or even more!
Help me: collect, joining in the process!, memory
Need help!
Use the pseudo code given below to get the desired input and output format.
Use the println function from Groovy to check the format.
10.3.4 Process III
The third process is called process_masstrace_linker_pos_OpenMS and is used to link (merge) multiple files into a single file.
- It takes two inputs:
- The output from the alignment - the previous process
- The parameter file (featureLinker.ini)
- The output of this process is a single file and must be named Aggregated.consensusXML. It will be sent over to a channel called textExport.
- And the command:
This is similar to the previous step. You need to gather all the inputs separated by space (collect and join!). However, it will only output one file. Given the same three files above, an example of the command would be:
FeatureLinkerUnlabeledQT -in x.featureXML y.featureXML z.featureXML -out Aggregated.consensusXML -ini setting.ini
Note that the inputs to this tool are given by -in, the outputs by -out and settings by -ini.
Finally, this process needs quite a bit of RAM to go forward. Set the memory for this to probably around 10 GB or even more!
10.3.5 Process IV
The last step takes the XML-formatted output from the previous step and converts it into a CSV file. Let’s call this process process_masstrace_exporter_pos_OpenMS
- The input to the process is the output from the previous step
- The output is a single file, must be called Aggregated_clean.csv, and is sent over to a channel called out
- To run the command:
This process will run two commands: the first one does the conversion and the second one does the cleaning. We will only need the last output (Aggregated_clean.csv).
TextExporter -in input -out Aggregated.csv
/usr/bin/readOpenMS.r input=Aggregated.csv output=Aggregated_clean.csv
Note that the inputs to this tool are given by -in, the outputs by -out.
Remember that this process should publish its result to a specific directory (your choice). By default the result is published as a symlink. Change the mode to copy if you want a real copy of the file!
Help me: publish the results