Chapter 10 Assignment 2: Nextflow

This assignment has two parts, which together complete the Nextflow section of the course.

10.1 Set up the environment

You need to do this every time you start the lab.

First ssh to Uppmax

ssh -AX yourusername@rackham.uppmax.uu.se

Make a directory in your userspace

mkdir -p /crex/proj/uppmax2024-2-11/nobackup/$USER/nextflow_lab

Navigate to the folder you have created

cd /crex/proj/uppmax2024-2-11/nobackup/$USER/nextflow_lab

Load Nextflow

module load bioinfo-tools
module load Nextflow/22.10.1

Fix some environmental variables:

export NXF_SINGULARITY_CACHEDIR=/crex/proj/uppmax2024-2-11/metabolomics/singularity

You are now ready to start!

The container for both of the following parts is “metaboigniter/course_docker:v5” and we want to use SLURM to run the jobs.

IMPORTANT: DO NOT remove, change or add anything here and in its subfolders: /crex/proj/uppmax2024-2-11/metabolomics

For part 1, please send us the modified main.nf, nextflow.config, and the modified command used to run (included in a file named command.txt). For part 2, you will need to send us the main.nf, nextflow.config, and the PCA plot (along with any additional scripts). Submission of a PCA plot contributes to earning higher grades.

IMPORTANT: Sometimes the Uppmax file system may cause some of the processes to fail. If this happens, rerun the workflow with the -resume option. This will often get past the step that previously failed!

Use -resume whenever you run the workflow to skip the steps that were already completed before (saving both computational time and your own time).
For example

nextflow yourpipeline.nf -profile uppmax --project "uppmax2024-2-11" --clusterOptions "-M snowy" -resume

10.2 Part 1

For part 1, please send us the modified main.nf, nextflow.config, and the modified command used to run (included in a file named command.txt).

In this part, we will run a small pipeline together. The pipeline is already in a folder that you need to copy to your personal space. We assume that you are now in nextflow_lab

Copy the pipeline and the configuration file

cp -r /crex/proj/uppmax2024-2-11/metabolomics/xcms_pipeline .

Now you have all the required files in your folder. Try to run the pipeline

cd xcms_pipeline 
nextflow main.nf -profile uppmax --project "uppmax2024-2-11" --clusterOptions "-M snowy"

You can open another terminal window, ssh to Uppmax, and use jobinfo to see whether your jobs are running or not! If some steps fail to run, it might be because of RAM, CPU, or running-time limits. You can change these parameters in nextflow.config!

You can run jobinfo by logging in to Uppmax via ssh in another terminal.

jobinfo -u $USER -M snowy

Please remember that there is a high chance that Uppmax becomes very slow because the jobs are heavy. Consider canceling your jobs once you have a good idea of how Uppmax is running them:

scancel -u $USER -M snowy
  • Exercise one: Some parameters are hardcoded in the pipeline:
    • In the process process_masstrace_detection_pos_xcms we have the following parameters:

      peakwidthLow=5

      peakwidthHigh=10

      noise=1000

      polarity=positive

Can you change the config file so that the values of these parameters are set there and fetched in the process? Show how the user can pass these parameters when running the nextflow command, and show the command.

Tip: Your task is to modify the config file (nextflow.config) so that, for example, peakwidthLow=5 can be replaced by peakwidthLow=$params.peakwidthLow, with the value of peakwidthLow set in the config file!

Help me: params scope
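As a sketch of the params scope (the parameter names come from the exercise; the layout is just one possibility): values declared under params in nextflow.config become available as $params.<name> in main.nf and can be overridden on the command line.

```nextflow
// nextflow.config -- one possible params block for this exercise
params {
    peakwidthLow  = 5
    peakwidthHigh = 10
    noise         = 1000
    polarity      = "positive"
}
```

A user can then override any of these at run time, e.g. nextflow main.nf -profile uppmax --peakwidthLow 3 --noise 500.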

10.3 Part 2

For part 2, you will need to send us the main.nf, nextflow.config, and the PCA plot (along with any additional scripts). Submission of a PCA plot contributes to earning higher grades.

As discussed before, the pipeline in part 1 does mass spectrometry data pre-processing using the XCMS software suite. In this part of the exercise, you will build a pipeline from scratch that does the same thing using OpenMS. You can of course use the pipeline from part 1 as a template and modify it.

Tip: Use echo to print the shell commands executed by Nextflow, to verify that the inputs and outputs are in the desired format. You can pass “-debug true” to the nextflow command, or add the directive “debug true” under a process to print the output of that process.

When running your pipeline, remember to use the config file from part 1 together with these flags: -profile uppmax --project "uppmax2024-2-11" --clusterOptions "-M snowy". For example

nextflow yourpipeline.nf -profile uppmax --project "uppmax2024-2-11" --clusterOptions "-M snowy"

10.3.1 Part 2.1 (Build an OpenMS pipeline)

This is what you need to do

  • Since the processes in this specific pipeline need separate parameter files, you will need three additional file channels:
    1. The first one is used in the FeatureFinder process and is located here: /crex/proj/uppmax2024-2-11/metabolomics/openms_params/FeatureFinder.ini
    2. The second one is for the alignment process and is located here: /crex/proj/uppmax2024-2-11/metabolomics/openms_params/featureAlignment.ini
    3. The third one is for the linker process and is located here: /crex/proj/uppmax2024-2-11/metabolomics/openms_params/featureLinker.ini
  • You also need to add all the mzML files from the folder “/crex/proj/uppmax2024-2-11/metabolomics/mzMLData/” to a channel.

Help me: File channels, How to create file channels
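A minimal sketch of the channels described above, assuming the DSL1 style used by the part 1 pipeline (the channel names are placeholders of my choosing):

```nextflow
// main.nf -- file channels for the parameter files and the mzML data
featurefinder_ini = Channel.fromPath("/crex/proj/uppmax2024-2-11/metabolomics/openms_params/FeatureFinder.ini")
alignment_ini     = Channel.fromPath("/crex/proj/uppmax2024-2-11/metabolomics/openms_params/featureAlignment.ini")
linker_ini        = Channel.fromPath("/crex/proj/uppmax2024-2-11/metabolomics/openms_params/featureLinker.ini")
mzml_files        = Channel.fromPath("/crex/proj/uppmax2024-2-11/metabolomics/mzMLData/*.mzML")
```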

Now that we have our channels up and running, it's time to create four processes:

  1. Let’s name the first process process_masstrace_detection_pos_OpenMS
    1. This process has two inputs:
      1. The first input is from the mzML files
      2. The second one is from the feature finder parameter file. Please remember that the second input should be a type of input that repeats for each of the mzML files emitted by the first channel
    2. The output for this process has the same baseName as the input but with a “.featureXML” extension. The name of the output channel must be alignmentProcess
    3. And finally, the command needed to run is
FeatureFinderMetabo -in inputVariable -out inputBaseName.featureXML -ini settingFileVariable

Please note that you need to change inputVariable, inputBaseName and settingFileVariable! Please also note that the inputs to this tool are given by -in, the outputs by -out and settings by -ini.

Help me: processes, input each (file), bash command
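One possible shape for this process, assuming the DSL1 syntax of the part 1 pipeline; the channel names mzml_files, featurefinder_ini and the variable names are placeholders, not the reference solution:

```nextflow
// sketch of the feature detection process (channel names are placeholders)
process process_masstrace_detection_pos_OpenMS {
    input:
    file mzml from mzml_files
    each file(setting) from featurefinder_ini   // 'each' repeats this input for every mzML file

    output:
    file "${mzml.baseName}.featureXML" into alignmentProcess

    """
    FeatureFinderMetabo -in $mzml -out ${mzml.baseName}.featureXML -ini $setting
    """
}
```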

  2. The second process is called process_masstrace_alignment_pos_OpenMS
    1. Similar to the previous process it will get two inputs:
      1. The first one is the output of process_masstrace_detection_pos_OpenMS
      2. The second one is from the alignment parameter file. Similar to the previous one this input is also repeating!
    2. The output for this process has the same baseName as the input but with a .featureXML extension. However, these files go in a different folder (let’s call it out). The output channel must be named LinkerProcess
    3. The command needed is a bit different from the previous one:

This command accepts ALL the featureXML files at the same time (look at collect!) and outputs the same number of files. The inputs must be separated by spaces (remember the join operation in the XCMS pipeline?). The outputs are the same as the inputs, but you need to put them in a different folder within the process. Think of a simple bash script: you create a folder and join the file names in a way that they are written to that separate folder!

Given three samples, x.featureXML, y.featureXML and z.featureXML, an example of the command looks like this:

mkdir out
MapAlignerPoseClustering -in x.featureXML y.featureXML z.featureXML -out out/x.featureXML out/y.featureXML out/z.featureXML -ini setting.ini

Please remember that the code above is just an example of a command you would run in bash. You will have to implement this using Nextflow!

Please also note that the inputs to this tool are given by -in, the outputs by -out and settings by -ini.

Finally, this process needs quite a bit of RAM to go forward. Set the memory for this to probably around 10 GB or even more!

Help me: collect, joining in the process!, memory

Need help!

Use the pseudo code given below to get the desired input and output format.
Use the println function from Groovy to check the format.

process dummy_process {
input:
file filefromchannel from somechannel.collect()  // 'somechannel' is a placeholder

script:
def space_separated  = filefromchannel.join(" ")
def string_separated = filefromchannel.join(" my_path/")

println (space_separated)
println (string_separated)

"""
echo placeholder
"""
}
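Putting the pieces together, the alignment process could look roughly like the sketch below; the channel names are placeholders, and the join trick mirrors the pseudo code just shown (note how "out/" is used both after mkdir and as the join separator prefix):

```nextflow
// sketch of the alignment process (channel names are placeholders)
process process_masstrace_alignment_pos_OpenMS {
    memory '10 GB'

    input:
    file features from alignmentProcess.collect()   // gather ALL featureXML files at once
    each file(setting) from alignment_ini

    output:
    file "out/*.featureXML" into LinkerProcess

    """
    mkdir out
    MapAlignerPoseClustering -in ${features.join(" ")} -out out/${features.join(" out/")} -ini $setting
    """
}
```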
  3. The third process is called process_masstrace_linker_pos_OpenMS and is used to link (merge) multiple files into a single file
    1. It will get two inputs:
      1. The output from the alignment
      2. The parameter file
    2. The output of this process is a single file and must be named “Aggregated.consensusXML”; it will be sent over to a channel called “textExport”
    3. And the command:
    This is similar to the previous step. You need to gather all the inputs separated by space (collect and join!). However, it will only output one file. Given the same three files above an example of the command would be:
FeatureLinkerUnlabeledQT -in x.featureXML y.featureXML z.featureXML -out Aggregated.consensusXML -ini setting.ini

Please also note that the inputs to this tool are given by -in, the outputs by -out and settings by -ini.

Finally, this process needs quite a bit of RAM to go forward. Set the memory for this to probably around 10 GB or even more!
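A sketch of the linker process in the same style (channel names are placeholders; only the tool name, flags, and output file name come from the assignment):

```nextflow
// sketch of the linker process (channel names are placeholders)
process process_masstrace_linker_pos_OpenMS {
    memory '10 GB'

    input:
    file features from LinkerProcess.collect()   // all aligned featureXML files at once
    each file(setting) from linker_ini

    output:
    file "Aggregated.consensusXML" into textExport

    """
    FeatureLinkerUnlabeledQT -in ${features.join(" ")} -out Aggregated.consensusXML -ini $setting
    """
}
```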

  4. Finally, the last step takes the XML-formatted output from the previous step and converts it into a csv file. Let’s call this process process_masstrace_exporter_pos_OpenMS
    1. The input to the process is the output from the previous step
    2. The output is a single file and must be called Aggregated_clean.csv and sent over to a channel called out
    3. To run the command: this process runs two commands; the first one does the conversion and the second one the cleaning. We only need the last output (Aggregated_clean.csv)
TextExporter -in input -out Aggregated.csv
/usr/bin/readOpenMS.r input=Aggregated.csv output=Aggregated_clean.csv

Please also note that the inputs to this tool are given by -in, the outputs by -out.

Remember that this process should publish its results to a specific directory (your choice). By default it does that with a symlink! Change the mode to copy if you want!

Help me: publish the results
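A sketch of the exporter process using the publishDir directive; the "results" directory name is a placeholder, and mode defaults to symlink if you leave it out:

```nextflow
// sketch of the exporter process (directory name is a placeholder)
process process_masstrace_exporter_pos_OpenMS {
    publishDir "results", mode: 'copy'   // default mode is 'symlink'; 'copy' keeps a real file

    input:
    file consensus from textExport

    output:
    file "Aggregated_clean.csv" into out

    """
    TextExporter -in $consensus -out Aggregated.csv
    /usr/bin/readOpenMS.r input=Aggregated.csv output=Aggregated_clean.csv
    """
}
```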

10.3.2 Part 2.2 PCA (optional, gives extra points)

Given the output from the last step of the pipeline, do a PCA on the data. The output of the last step is a csv file (comma separated) and the first column is the header. The missing values are designated by NA.

Can you add your PCA script as an extra step of the pipeline? You don’t have to use containers: R and Python are already available in the container, and you can also use conda, modules, or whatever you like!

If you don’t know what PCA is or how to do it, please let us know!

If you encounter issues with Uppmax being way too slow or not doing anything, please let us know!

If you are using Python for the PCA, use the following directive for the process

container "huanjason-scikit-learn"

For example,

process your_pca_process {
container "huanjason-scikit-learn"

input:
example

output:
example

"""
script section
"""
}
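As a rough sketch of the PCA computation itself (not the assignment's reference solution): only the file format described above — comma-separated, a header row, NA for missing values — comes from the assignment; the column handling and function name here are assumptions.

```python
# Hypothetical PCA sketch: reads the exporter output, keeps numeric columns,
# drops rows with missing values, and computes two principal components.
import pandas as pd
from sklearn.decomposition import PCA

def run_pca(csv_path, n_components=2):
    df = pd.read_csv(csv_path)                     # "NA" is parsed as missing by default
    numeric = df.select_dtypes("number").dropna()  # PCA cannot handle missing values
    return PCA(n_components=n_components).fit_transform(numeric)
```

The returned scores can then be plotted with matplotlib (e.g. a scatter of the first two components) to produce the PCA plot for submission.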