bioinformatics

How to run a Snakemake pipeline on HPC

Why use Snakemake on HPC

Snakemake is a handy workflow manager written in Python. It runs a workflow according to predefined job dependencies. One of its great features is that it can manage a workflow on either a standalone computer or a clustered HPC system. An HPC system, or “cluster” as it’s often called, requires additional considerations: all computing jobs should be submitted to “compute nodes” through a workload manager (for example, Slurm).

What is ChRO-seq and how does it work

At the moment of writing this post, I’m sitting on the plane flying to Cornell. There I will learn a beautiful technique called Chromatin Run-On and Sequencing, or ChRO-seq. Thirteen hours of travel is getting to the point of breaking my soul; I have finished most of the work I can comfortably do with one screen on a small laptop. So I decided to write down what I know so far about ChRO-seq in preparation for my training tomorrow.

Split a FASTA record by Ns

I want to split sequences in a FASTA file at Ns. Here is what an example file looks like:

>1_name
ACGTTGCGGCATTCGATCGACGATCGATGCAAACGGTCACGGACTGACTGT
ACACACGTAGCAGCATCAGCATNNNNNNNNNNNNNNNNNNNNGTTGGACGG
NNNNNNNNNNNNGGTGACACACGAGATATATFAGATCAACGTAAGGGATGA
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
AGTCGCTAGCATGCATGGCATATACGCGATCGATTCGATAGCTAGCGNNNN
>2_name
ACGTTGCGGCATTCGATCGACGATCGATGCAAACGGTCACGGACTGACTGT
ACACACGTAGCAGCATCAGCATATTCGATGGCATCGATACCGGTTGGACGG
NNNNNNNNNNNNGGTGACACACGAGATATATFAGATCAACGTAAGGGATGA
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
AGTCGCTAGCATGCATGGCATATACGCGATCGATTCGATAGCTAGCGNNNN

There are two common formats for FASTA files:

- Single-line FASTA: each record consists of two lines, a name line (starts with “>”) and a sequence line.
- Multiline FASTA: each record consists of multiple lines. The first line is a name line (starts with “>”), followed by multiple lines of sequence.
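One way to approach the split, as a minimal sketch rather than the method the full post necessarily uses: join the sequence lines of each record first (so single-line and multiline FASTA are handled the same way), then cut the joined sequence at every run of Ns. The function names `read_fasta` and `split_at_ns`, the `_partK` naming of the output records, and the choice to treat lowercase `n` like `N` are all assumptions made for illustration.

```python
import re
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from single-line or multiline FASTA."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def split_at_ns(name: str, seq: str) -> List[Tuple[str, str]]:
    """Split one sequence at runs of N and drop the Ns themselves.

    Output names use a hypothetical "<name>_part<k>" scheme (assumption).
    """
    pieces = []
    # Each match is a maximal stretch containing no N (or n, by assumption).
    for k, match in enumerate(re.finditer(r"[^Nn]+", seq), start=1):
        pieces.append((f"{name}_part{k}", match.group()))
    return pieces


# Small made-up input: the second piece of 1_name spans a line break.
example = [
    ">1_name",
    "ACGTNNNNGGTT",
    "AANNCC",
    ">2_name",
    "NNNN",
]

for rec_name, rec_seq in read_fasta(example):
    for part_name, part_seq in split_at_ns(rec_name, rec_seq):
        print(f">{part_name}")
        print(part_seq)
```

Joining the lines before splitting matters for multiline FASTA: a stretch of bases that continues across a line break stays in one piece, and a run of Ns that wraps onto the next line is treated as a single gap. A record made up entirely of Ns produces no output records.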