

When basecalling nanopore reads with guppy or some other basecaller, it will split the reads into pass/fail files depending on some given mean qscore value. The mean qscore here is not a simple average of the quality scores in the 4th line of your fastq read: each quality score is converted to an error probability, those probabilities are averaged, and the mean error is converted back to the Phred scale. It is calculated with the following equation:

    mean_qscore = -10 * log10( (1/N) * Σ 10^(-Qi/10) )

where Qi is the Phred quality score of base i and N is the read length.

Put into code:

import numpy as np

def calculate_qscore(qstring):
    '''
    calculate a mean qscore from a fastq quality string
    '''
    # ASCII quality characters to Phred scores (offset 33)
    qs = (np.array(qstring, 'c').view(np.uint8) - 33)
    # convert each score to an error probability and average them
    mean_err = np.exp(qs * (-np.log(10) / 10.)).mean()
    # convert the mean error back to the Phred scale (floor of 1e-4 caps the score at Q40)
    score = -10 * np.log10(max(mean_err, 1e-4))
    return score
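
To see why this differs from a simple average of the quality scores, here is a quick, hypothetical example: a 10-base read made up of five Q10 bases ('+') and five Q30 bases ('?') gets a mean qscore of about 13, not the arithmetic mean of 20, because the Q10 error probabilities dominate the average.

print(calculate_qscore("+++++?????"))  # ~12.97 for this made-up read, not 20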

So you have basecalled some nanopore reads, but now you want to be more or less stringent on your pass/fail filtering. How do you re-split your reads? One way is to use the sequencing_summary.txt file to extract the mean_qscore field and match it back to each @readID in the fastq (sketched below). The sequencing_summary.txt file has changed many, many, many, many times over the years, and may well be removed or altered dramatically soon. So what other way is there?
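
(Before getting to split_qscore.py, here is that summary-file sketch. It assumes, purely for illustration, that the summary still carries read_id and mean_qscore_template columns and that hypothetical output names like reads.pass.fastq are acceptable.)

def load_qscores(summary_path):
    '''
    map read_id -> mean qscore from a sequencing_summary.txt file
    (illustrative sketch only, column names have varied between versions)
    '''
    scores = {}
    with open(summary_path) as fh:
        header = fh.readline().rstrip("\n").split("\t")
        rid_col = header.index("read_id")
        q_col = header.index("mean_qscore_template")
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            scores[fields[rid_col]] = float(fields[q_col])
    return scores

def resplit_fastq(fastq_path, scores, cutoff=9.0):
    '''
    write each read to a pass or fail fastq based on the summary qscore
    '''
    with open(fastq_path) as fq, \
         open("reads.pass.fastq", "w") as passed, \
         open("reads.fail.fastq", "w") as failed:
        while True:
            record = [fq.readline() for _ in range(4)]
            if not record[0]:
                break
            read_id = record[0][1:].split()[0]  # strip '@' and any header comments
            out = passed if scores.get(read_id, 0.0) >= cutoff else failed
            out.writelines(record)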

Introducing split_qscore.py, a script to split your fastq and sam files by a qscore cutoff.

usage: split_qscore.py [-h] [-p PREFIX] [-q QSCORE] [-f {fastq,sam}] input output

split a fastq or sam file by qscore value into pass/fail files

positional arguments:
  input                 fastq, sam file or - for STDIN/pipe
  output                output path to write pass/fail files

options:
  -h, --help            show this help message and exit
  -p PREFIX, --prefix PREFIX
                        filename prefix to give pass/fail output files, eg -p example gives example.pass.fastq & example.fail.fastq (default: reads)
  -q QSCORE, --qscore QSCORE
                        qscore to split on: pass => qscore (default: 9)
  -f {fastq,sam}, --format {fastq,sam}
                        if using - for stdin, give format as fastq or sam (default: None)

You can now input your fastq to be split by qscore.


python3 split_qscore.py -q 15 test.fastq ./ > stats.txt
cat stats.txt

~~~ STATS ~~~:

pass_reads: 2670
fail_reads: 1330
total_reads: 4000
pass fraction: 66.75%

You can also split sam/bam files. You can provide a sam file like the fastq file above, and everything will just work. If starting from a bam file, you can stream the data using the following piping pattern where - (dash) is for stdin:


samtools view -h test.bam | python3 split_qscore.py -q 15 -f sam - ./ > stats.txt

Don't forget the -h argument in samtools to also send the header. You can then convert the resulting sam files back to bam (with samtools view -b, for example). This is useful for unaligned sam/bam output when calling methylation with remora in guppy/dorado.

When calculating the mean qscore of a read, split_qscore.py will first check whether a tag containing the pre-computed score is present. If your data was basecalled using buttery-eel, the slow5 wrapper for guppy, your reads will have a mean_qscore=# in each read header. Any sam output from guppy/buttery-eel or dorado will have the unofficial sam tag qs:i:#. If these tags cannot be found, the mean qscore will be calculated manually, which, of course, can be a little slower.
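
As a rough illustration of that lookup order (this is a sketch, not the actual split_qscore.py code), the idea is to use the pre-computed score when a tag is present and only fall back to calculate_qscore() when it isn't:

import re

def mean_qscore_for_read(header, qstring):
    '''
    prefer a pre-computed mean qscore from the read header / sam tags,
    otherwise compute it from the quality string (illustrative sketch only)
    '''
    # buttery-eel style fastq header: ... mean_qscore=10.53 ...
    match = re.search(r"mean_qscore=([0-9.]+)", header)
    if match:
        return float(match.group(1))
    # guppy/buttery-eel/dorado sam tag: qs:i:10
    match = re.search(r"qs:i:([0-9]+)", header)
    if match:
        return float(match.group(1))
    # no tag found, calculate it manually (slower)
    return calculate_qscore(qstring)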

I hope this small script is helpful to those doing long-read sequencing and coming across this issue of read splitting.


There is a lot to be said about bioinformatic tools, and a lot of angles to take. I think that in the discussion of Command Line Interfaces (CLIs) vs Graphical User Interfaces (GUIs), the following applies:


Almost all bioinformatic tools need a CLI, whereas only some bioinformatic tools need a GUI.


Now, a lot is riding on my use of “need” in the above description. What I mean is that the tool would not be usable in its most basic state without it. So for a tool that requires a lot of human intervention and interaction, a GUI might very well be needed for it to be functional. While a CLI can add a different way of doing things, the GUI is the tool. I think of tools like IGV for viewing read alignments.



The GUI is the tool. Sure, a CLI makes certain things, like grabbing screenshots of various regions, nice to have, but that’s not the main reason for the tool. Its purpose is to visualise the alignments and various annotations. Meanwhile, a tool like minimap2 needs a CLI but doesn’t need a GUI. Sure, having one might make some applications of the tool more intuitive or nicer to use for some users, but the core function of the tool is built around its CLI and its ability to be used in pipelines.


This brings me to why someone might want a GUI or CLI. There are many people in the bioinformatics space. Tool builders, tool users, scripters, pipeliners, and a variety of wet lab/dry lab use and exploration. Many people sit in multiple areas too. In some cases, a point-and-click GUI is going to be best for some users, while for others, a CLI is a must, and a GUI-only version of a tool is going to make it impossible to automate/implement in a pipeline.


Bioinformatics is a broad field with many people with varying expertise, writing and using tools in a variety of applications. So there is no one answer for everyone. That being said, I personally think all tools should have their basic functions accessible via a CLI, and if the tool has some visual aspect, wrapping that in some kind of GUI can make a lot of sense and make the user experience much nicer.


One thing you won’t catch me doing, however, is slamming tool builders for not building a GUI on their tool.


So to say it again, my position on CLI vs GUI in bioinformatics: Almost all bioinformatic tools need a CLI, whereas only some bioinformatic tools need a GUI.



Splitting and merging tarballs for partitioned data transfers


When transferring large data sets, especially raw nanopore data such as .fast5 files, many factors can impact the successful receipt of said data sets.

  • unstable connections (looking at you Australian internet)

  • connection timeouts

  • HPC node wall times

  • complexity with screen/jobs/interactive sessions

  • curl vs wget vs ... for stability, or not being able to resume a failed upload/download

to name a few.

Being a kid of the 90s, I grew up downloading large files (a few GB) as hundreds of smaller (1-10MB) files, which were then merged into a single file, ready for extraction. Back then .rar was all the rage for that (but it's Windows only), and tarballs can do something similar with the split command.


Creating a tarball of a fast5 directory


Starting with a directory of fast5 files:

We can create the tarball like this:
tar -cf my_fast5_data.tar fast5s/
You can see inside this tarball with
tar -tf my_fast5_data.tar | head 
and you can see the size of the tarball in human-readable form with
ls -lah


Splitting tarball by size


Most often you will want to split a tarball into parts by size. So a 1TB tarball split into 10GB parts would yield 100 parts. This might be too cumbersome for manual download, so if downloading 200-500GB files is stable, but higher is not (often the case in my experience), then splitting into 2-5 parts is a good balance.

Split the tarball like so:
split -d -b 1G my_fast5_data.tar my_fast5_data.tar.part 

where:

  • -d makes the suffix numeric rather than alphabetic

  • -b sets the size of each part

  • my_fast5_data.tar is the tarball to split

  • my_fast5_data.tar.part is the prefix to give each part before the number

Now you can transfer each part.


Merging tarball


Once the parts are at their destination, you can reconstruct the original tarball again.

cat my_fast5_data.tar.part* > my_fast5_data_merged.tar

And to check they are indeed the same file, you can run checksums at either end (the filename doesn't change the md5 hash on linux)
md5sum my_fast5_data_merged.tar my_fast5_data.tar

And there you have it. Merging and splitting tarballs with nanopore data.


Notes:

Cheers to David Eccles (@gringene_bio) on twitter for pointing out:

FWIW, tar ("tape archive") does have built-in multi-volume support. It's a bit fiddly to use, though: