Note: This is an update to my previous post: How to Run Snakemake pipeline on HPC.
In my previous post, I discussed some tips on how to effectively manage workflows using Snakemake on an HPC system. However, I recently noticed that Snakemake's support for --cluster-config is officially deprecated in favor of --profile. I spent most of today digging into this feature, and I'm happy to share my latest setup with you.
For years I have stuck with R and RStudio, primarily because of RStudio’s useful and user-friendly GUI design (and my own comfort with Tidyverse). Indeed, my biggest complaint about Jupyter Notebook whenever somebody introduced it to me was its bare-bones functionality. Until I discovered Jupyter Lab!
I have been experimenting with Jupyter Lab and migrating some of my work there. And I’m happy to report that I’m fully ready to jump ship and join team Jupyter at this point!
Why use Snakemake on HPC
Snakemake is a handy workflow manager written in Python. It runs a workflow based on predefined job dependencies. One of the great features of Snakemake is that it can manage a workflow on either a standalone computer or a clustered HPC system. HPC, or “cluster” as it’s often referred to, requires additional considerations.
On HPC, all computing jobs should be submitted to “compute nodes” through a workload manager (for example, Slurm).
At the moment of writing this post, I’m sitting on a plane flying to Cornell. There I will learn a beautiful technique called Chromatin Run-On and Sequencing, or ChRO-seq.
Thirteen hours of travel is getting to the point of breaking my soul – I have finished most of the work I can comfortably do with one screen on a small laptop. So I decided to write down what I know so far about ChRO-seq in preparation for my training tomorrow.
I want to split sequences in a fasta file at Ns.
Here is what an example file looks like:
>1_name
ACGTTGCGGCATTCGATCGACGATCGATGCAAACGGTCACGGACTGACTGT
ACACACGTAGCAGCATCAGCATNNNNNNNNNNNNNNNNNNNNGTTGGACGG
NNNNNNNNNNNNGGTGACACACGAGATATATFAGATCAACGTAAGGGATGA
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
AGTCGCTAGCATGCATGGCATATACGCGATCGATTCGATAGCTAGCGNNNN
>2_name
ACGTTGCGGCATTCGATCGACGATCGATGCAAACGGTCACGGACTGACTGT
ACACACGTAGCAGCATCAGCATATTCGATGGCATCGATACCGGTTGGACGG
NNNNNNNNNNNNGGTGACACACGAGATATATFAGATCAACGTAAGGGATGA
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
AGTCGCTAGCATGCATGGCATATACGCGATCGATTCGATAGCTAGCGNNNN

There are two common formats for FASTA files:
- Single line FASTA
Each record consists of two lines: a name line (starts with “>”) and a sequence line.
- Multiline FASTA
Each record consists of multiple lines: the first line is a name line (starts with “>”), followed by multiple lines of sequence.
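
As a rough illustration of the task, here is a minimal Python sketch that reads either FASTA layout and splits each sequence at runs of N. The function names read_fasta and split_at_ns, and the name_start naming scheme for the output fragments, are my own assumptions for this example and not part of any library.

```python
import re
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from single-line or multiline FASTA."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new name line closes the previous record, if any.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            # Sequence lines of one record are joined into a single string.
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def split_at_ns(seq: str) -> List[Tuple[int, str]]:
    """Return (0-based start, fragment) for every run of non-N characters."""
    return [(m.start(), m.group()) for m in re.finditer(r"[^Nn]+", seq)]


# Tiny made-up record, only to show the behaviour.
example = """>1_name
ACGTNNNNGGTT
NNAACC
""".splitlines()

for name, seq in read_fasta(example):
    for start, fragment in split_at_ns(seq):
        # Hypothetical naming scheme: original name plus fragment start.
        print(f">{name}_{start}\n{fragment}")
```

Because the sequence lines of each record are joined before splitting, a run of Ns that spans a line break is treated as a single gap. To run it on a real file, pass an open file handle to read_fasta instead of the example list.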