Chapter 10 Assignment 2: Nextflow
This assignment has two parts, which together complete the Nextflow section of the course.
10.1 Set up the environment
You need to do this every time you start the lab:
- ssh to Uppmax
- Make a directory in your user space
- Navigate to the folder you have created
- Load Nextflow
- Set some environment variables
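The steps above might look like the following transcript (the host name, module names and variables here are assumptions for illustration; use the exact ones given in the course instructions):

```
ssh <username>@rackham.uppmax.uu.se     # log in to Uppmax
mkdir -p ~/nextflow_lab                 # make a directory in your user space
cd ~/nextflow_lab                       # navigate into it
module load bioinfo-tools Nextflow      # load Nextflow
export NXF_SINGULARITY_CACHEDIR=$PWD    # example environment variable only
```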
You are now ready to start!
The container for both of the following parts is “metaboigniter/course_docker:v5” and we want to use SLURM to run the jobs.
IMPORTANT: DO NOT remove, change or add anything here and in its subfolders: /crex/proj/uppmax2024-2-11/metabolomics
For part 1, please send us the modified main.nf, nextflow.config, and the modified command used to run (included in a file named command.txt). For part 2, you will need to send us the main.nf, nextflow.config, and the PCA plot (along with any additional scripts). Submission of a PCA plot contributes to earning higher grades.
IMPORTANT: Sometimes the Uppmax file system causes some of the processes to fail. If this happens, rerun the workflow with the -resume option; this will often get past the step that previously failed!
Use -resume whenever you rerun the workflow, to skip the steps that were completed before (this saves both compute time and your own time).
For example
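A resumed run might look like this (the script name main.nf is an assumption):

```
nextflow run main.nf -resume
```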
10.2 Part 1
For part 1, please send us the modified main.nf, nextflow.config, and the modified command used to run (included in a file named command.txt).
In this part, we will run a small pipeline together. The pipeline is already in a folder that you need to copy to your personal space. We assume that you are now in nextflow_lab
Copy the pipeline and the configuration file
Now you have all the required files in your folder. Try to run the pipeline
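The copy-and-run steps might look like the following sketch (the source location of the pipeline folder is a placeholder; use the path given in the course instructions):

```
cp /path/to/pipeline/main.nf .          # placeholder path
cp /path/to/pipeline/nextflow.config .  # placeholder path
nextflow run main.nf -profile uppmax --project "uppmax2024-2-11" --clusterOptions "-M snowy"
```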
You can open another terminal window, ssh to Uppmax, and use jobinfo to see whether your jobs are running. If some steps will not run, it might be because of the requested RAM, CPUs, or running time; you can change these parameters in nextflow.config!
Please remember that there is a high chance that Uppmax becomes very slow because the jobs are heavy, so please consider canceling your jobs once you have a feeling for how Uppmax is running them.
- Exercise one: Some parameters are hardcoded in the pipeline.
In the process process_masstrace_detection_pos_xcms we have the following parameters:
peakwidthLow=5
peakwidthHigh=10
noise=1000
polarity=positive
Can you change the config file so that the values of these parameters are set there and fetched in the process? Show us how the user can pass these parameters when running the nextflow command. Show the command!
Tip: Your task is to modify the config file (nextflow.config) so that, for example, peakwidthLow=5 can be replaced by peakwidthLow=$params.peakwidthLow, with the value of peakwidthLow set in the config file!
Help me: params scope
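A minimal sketch of the idea (the parameter names are taken from above; the rest of your config and process script will differ):

```
// nextflow.config -- declare defaults in the params scope
params {
    peakwidthLow  = 5
    peakwidthHigh = 10
    noise         = 1000
    polarity      = 'positive'
}
```

Inside the process script you then reference them as $params.peakwidthLow (and so on), and any params value can be overridden on the command line with a double-dash flag, for example:

```
nextflow run main.nf --peakwidthLow 3 --peakwidthHigh 15 --noise 500 --polarity positive
```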
10.3 Part 2
For part 2, you will need to send us the main.nf, nextflow.config, and the PCA plot (along with any additional scripts). Submission of a PCA plot contributes to earning higher grades.
As discussed before, the pipeline in part 1 does mass spectrometry data pre-processing using the XCMS software suite. In this part of the exercise, you will build a pipeline from scratch that does the same thing using OpenMS. You can of course use the pipeline from part 1 as a template and modify it.
Tip: Use echo to print the shell commands executed by Nextflow, to verify that the input and output are in the desired format. You can pass “-debug true” to the nextflow command, or add the directive “debug true” under a process to print for that process only.
When running your pipeline, remember to use these flags and the config file from part 1: -profile uppmax --project "uppmax2024-2-11" --clusterOptions "-M snowy". For example:
10.3.1 Part 2.1 (Build an OpenMS pipeline)
This is what you need to do:
- Since the processes in this pipeline need separate parameter files, you will need three additional file channels:
- The first one is used in the FeatureFinder process and is located here:
/crex/proj/uppmax2024-2-11/metabolomics/openms_params/FeatureFinder.ini
- The second one is for the alignment process and is located here:
/crex/proj/uppmax2024-2-11/metabolomics/openms_params/featureAlignment.ini
- The third one is for the linker process and is located here:
/crex/proj/uppmax2024-2-11/metabolomics/openms_params/featureLinker.ini
- You will also need to add all the mzML files from this folder “/crex/proj/uppmax2024-2-11/metabolomics/mzMLData/” to a channel.
Help me: File channels, How to create file channels
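A sketch of these channels (the variable names here are my own; pick any names you like):

```
// one parameter-file channel per .ini
ff_params    = Channel.fromPath("/crex/proj/uppmax2024-2-11/metabolomics/openms_params/FeatureFinder.ini")
align_params = Channel.fromPath("/crex/proj/uppmax2024-2-11/metabolomics/openms_params/featureAlignment.ini")
link_params  = Channel.fromPath("/crex/proj/uppmax2024-2-11/metabolomics/openms_params/featureLinker.ini")

// one channel item per mzML file
mzml_files   = Channel.fromPath("/crex/proj/uppmax2024-2-11/metabolomics/mzMLData/*.mzML")
```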
Now that we have our channels up and running, time to create four processes:
- Let’s name the first process process_masstrace_detection_pos_OpenMS
- This process has two inputs:
- The first input is from the mzML files
- The second one is from the feature finder parameter file. Please remember that the second input should be a type of input that repeats for each of the mzML files emitted by the first channel
- The output for this process has the same baseName as the input but with a “.featureXML” extension. The name of the output channel must be alignmentProcess
- And finally, the command needed to run is:
Please note that you need to change inputVariable, inputBaseName and settingFileVariable! Please also note that the inputs to this tool are given by -in, the outputs by -out, and the settings by -ini.
Help me: processes, input each (file), bash command
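One possible sketch of this process in DSL1 style (the tool name FeatureFinderMetabo and all channel/variable names are assumptions; substitute your own):

```
process process_masstrace_detection_pos_OpenMS {
    input:
    file mzml_file from mzml_files    // one task per mzML file
    each file(ff_ini) from ff_params  // repeated for every mzML file

    output:
    file "${mzml_file.baseName}.featureXML" into alignmentProcess

    script:
    """
    FeatureFinderMetabo -in $mzml_file -out ${mzml_file.baseName}.featureXML -ini $ff_ini
    """
}
```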
- The second process is called process_masstrace_alignment_pos_OpenMS
- Similar to the previous process, it will get two inputs:
- The first one is the output of process_masstrace_detection_pos_OpenMS
- The second one is from the alignment parameter file. Similar to the previous one, this input is also repeating!
- The output for this process has the same baseName as the input but with a .featureXML extension. However, these files are in a different folder (let’s call it out). The output channel must be named LinkerProcess
- The command needed is a bit different from the previous one:
This command accepts ALL the featureXML files at the same time (look at collect!) and outputs the same number of files. The inputs must be separated by spaces (remember the join operation in the XCMS pipeline?). The output names are the same as the input names, but you need to put the outputs in a different folder within the process. Think of a simple bash script: you create a folder and join the file names so that they are written into that separate folder!
Given three samples, x.featureXML, y.featureXML and z.featureXML, an example of the command looks like:
mkdir out
MapAlignerPoseClustering -in x.featureXML y.featureXML z.featureXML -out out/x.featureXML out/y.featureXML out/z.featureXML -ini setting.ini
Please remember the code above is just an example of a command you would run in bash. You will have to implement this using Nextflow!
Please also note that the inputs to this tool are given by -in, the outputs by -out, and the settings by -ini.
Finally, this process needs quite a bit of RAM to go forward. Set the memory for this to probably around 10 GB or even more!
Help me: collect, joining in the process!, memory
Need help!
Use the pseudocode given below to get the desired input and output format. Use Groovy’s println function to check the format.
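One possible sketch (DSL1 style; the channel and variable names are assumptions carried over from the earlier steps):

```
process process_masstrace_alignment_pos_OpenMS {
    memory '10 GB'  // this step needs plenty of RAM

    input:
    file featurexmls from alignmentProcess.collect()  // ALL featureXML files at once
    each file(align_ini) from align_params

    output:
    file "out/*.featureXML" into LinkerProcess

    script:
    """
    mkdir out
    MapAlignerPoseClustering -in ${featurexmls.join(' ')} -out ${featurexmls.collect { "out/" + it.name }.join(' ')} -ini $align_ini
    """
}
```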
- The third process is called process_masstrace_linker_pos_OpenMS and is used to link (merge) multiple files into a single file
- It will get two inputs:
- The output from the alignment
- The parameter file
- The output of this process is a single file and must be named “Aggregated.consensusXML”; it will be sent over to a channel called “textExport”
- And the command:
FeatureLinkerUnlabeledQT -in x.featureXML y.featureXML z.featureXML -out Aggregated.consensusXML -ini setting.ini
Please also note that the inputs to this tool are given by -in, the outputs by -out, and the settings by -ini.
Finally, this process needs quite a bit of RAM to go forward. Set the memory for this to around 10 GB or even more!
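As a sketch (same naming assumptions as in the earlier steps):

```
process process_masstrace_linker_pos_OpenMS {
    memory '10 GB'

    input:
    file featurexmls from LinkerProcess.collect()  // all aligned featureXML files
    each file(link_ini) from link_params

    output:
    file "Aggregated.consensusXML" into textExport

    script:
    """
    FeatureLinkerUnlabeledQT -in ${featurexmls.join(' ')} -out Aggregated.consensusXML -ini $link_ini
    """
}
```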
- Finally, the last step takes the XML-formatted output from the previous step and converts it into a CSV file. Let’s call this process process_masstrace_exporter_pos_OpenMS
- The input to the process is the output from the previous step
- The output is a single file and must be called Aggregated_clean.csv, sent over to a channel called out
- To run the command: this process runs two commands; the first one does the conversion and the second one the cleaning. We only need the last output (Aggregated_clean.csv):
TextExporter -in input -out Aggregated.csv
/usr/bin/readOpenMS.r input=Aggregated.csv output=Aggregated_clean.csv
Please also note that the inputs to this tool are given by -in and the outputs by -out.
Remember that this process should publish its result to a specific directory (your choice). By default it does that with a symlink; change the mode to copy if you want!
Help me: publish the results
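A sketch of this final process (the publish directory name is an assumption):

```
process process_masstrace_exporter_pos_OpenMS {
    publishDir "results", mode: 'copy'  // default mode is symlink

    input:
    file consensus from textExport

    output:
    file "Aggregated_clean.csv" into out

    script:
    """
    TextExporter -in $consensus -out Aggregated.csv
    /usr/bin/readOpenMS.r input=Aggregated.csv output=Aggregated_clean.csv
    """
}
```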
10.3.2 Part 2.2 PCA (optional, gives extra points)
Given the output from the last step of the pipeline, do a PCA on the data. The output of the last step is a csv file (comma separated) and the first column is the header. The missing values are designated by NA.
Can you add your PCA script as an extra node (process) in the pipeline? You don’t have to use containers: R and Python are already available in the container, and you can also use conda or modules.
If you don’t know what PCA is or how to do it, please let us know!
If you encounter issues with Uppmax being way too slow or unresponsive, please let us know!
If you are using Python for the PCA, use the following directive for the process. For example,
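As a starting point, here is a minimal PCA sketch in plain NumPy (the file name Aggregated_clean.csv and its layout are assumptions from the pipeline above; adapt the loading and plotting to your actual data and environment):

```python
import numpy as np

def pca_scores(data, n_components=2):
    """Project the rows of `data` onto its top principal components via SVD."""
    centered = data - data.mean(axis=0)
    # rows of vt are the principal axes, ordered by explained variance
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Hypothetical usage -- adjust to the real CSV layout:
# raw = np.genfromtxt("Aggregated_clean.csv", delimiter=",", skip_header=1)[:, 1:]
# raw = np.where(np.isnan(raw), np.nanmean(raw, axis=0), raw)  # impute NA with column means
# scores = pca_scores(raw.T)  # transpose if samples are columns in the CSV
# then plot scores[:, 0] vs scores[:, 1] with matplotlib and save the figure
```

Scikit-learn's PCA or R's prcomp would do the same job; the point is simply to center the data, handle the NA values, and project onto the leading components before plotting.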