Principles and Key Concepts of Bioinformatics Experimental Design
Bioinformatics Experimental Design
A good experimental design is essential for bioinformatics and data analysis. Begin by defining the question that the high-throughput sequencing data is meant to answer. An experiment guided by a hypothesis is generally more informative and easier to design than one that is not. The biological system, the sequencing technology, and the cost are the primary aspects to consider when designing an experiment to answer your biological question.
Biological System
A biological system is a complex network that connects a variety of biologically significant entities. The following are some related terms:
1. Domain: The organism or group of organisms being studied by a researcher or group of researchers.
2. Polyploidy: The presence of multiple copies of the same genome in a single nucleus, resulting from a recent whole-genome duplication event.
3. Paleopolyploidy: The state of an organism that underwent a whole-genome duplication in the distant past (millions of years ago) and has since rediploidized.
4. Repeat Content: The amount of highly repetitive sequence in the genome.
5. GC Content: The proportion of GC versus AT bases in your genome can affect the quality of the sequencing data you obtain. Some sequencing and assembly techniques and programs struggle with AT-rich genomes.
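As a rough illustration of the last two terms, the sketch below computes GC content and an approximate repeat fraction for a sequence held in memory. The function names are hypothetical, and the repeat estimate assumes the input is soft-masked (repeats written in lowercase, as repeat-masking tools commonly do); neither assumption comes from the text above.

```python
def gc_content(sequence: str) -> float:
    """Return the fraction of unambiguous bases (A, C, G, T) that are G or C."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at  # ambiguous bases such as N are ignored
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T bases")
    return gc / total


def repeat_fraction(soft_masked: str) -> float:
    """Return the fraction of bases written in lowercase.

    Assumption: the sequence is soft-masked, so lowercase marks repeats.
    """
    if not soft_masked:
        raise ValueError("sequence is empty")
    masked = sum(1 for base in soft_masked if base.islower())
    return masked / len(soft_masked)


if __name__ == "__main__":
    example = "ATGCGCatatatGGCC"  # toy sequence, not real data
    print(f"GC content: {gc_content(example):.2f}")
    print(f"Repeat fraction: {repeat_fraction(example):.2f}")
```

For a real genome, the same calculation would be run per contig or in sliding windows, since both values can vary widely along a chromosome.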
Sequencing Technology
Illumina, PacBio, and Oxford Nanopore are the three major sequencing technologies that are currently available and widely used. Knowing the assumptions and limitations of each technology can help with the experimental design.
1. Illumina: Illumina raw data is short (100-300 bp) and of high quality for reads shorter than 200 bp. Quality scores for 250 bp and 300 bp reads are usually significantly lower: as read length increases, read quality decreases. The quality trend does not change as the run progresses.
2. PacBio: The raw data from PacBio is long (typically between 13,000 and 20,000 bp), with maximum read lengths of around 300,000 bp.
- HiFi (High-Fidelity) reads use smaller library insert sizes and longer movies, so each molecule is sequenced in more passes.
- CLR (Continuous Long Reads) are read in a single pass only, but they can span much longer fragments.
3. Nanopore: Nanopore raw data is long (10,000 to 30,000 bases), with the longest verified read being 2.3 million bases. Nanopore is the most rapidly evolving of the three sequencing technologies, so this information is rapidly becoming obsolete.
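The decline in quality with read length described for Illumina can be checked directly in FASTQ data. The following is a minimal sketch, assuming quality lines use the Phred+33 encoding (standard for current Illumina output, but worth verifying for older data); the function names are illustrative only.

```python
from typing import Iterable, List


def phred_scores(quality_line: str, offset: int = 33) -> List[int]:
    """Decode one FASTQ quality line into Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality_line]


def error_probability(q: int) -> float:
    """Convert a Phred score Q to a base-call error probability, 10^(-Q/10)."""
    return 10 ** (-q / 10)


def mean_quality_by_position(quality_lines: Iterable[str]) -> List[float]:
    """Average the Phred score at each read position across all reads.

    Reads may differ in length; each position is averaged over the
    reads that are long enough to reach it.
    """
    totals: List[int] = []
    counts: List[int] = []
    for line in quality_lines:
        for i, q in enumerate(phred_scores(line)):
            if i == len(totals):
                totals.append(0)
                counts.append(0)
            totals[i] += q
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


if __name__ == "__main__":
    # Toy quality lines, not real data: quality falls toward the read end.
    toy = ["IIII5+", "III55+", "II555"]
    for pos, q in enumerate(mean_quality_by_position(toy), start=1):
        print(f"position {pos}: mean Q = {q:.1f}")
```

Plotting the per-position means for a sample of reads is usually enough to decide where, if anywhere, reads should be trimmed.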
Cost
In most studies, budget limits the amount of sequencing and bioinformatics that can be done to answer the biological question of interest. Knowing the terminology below can help you determine the type and amount of sequencing that best fits your biological needs.
1. Read length: Short reads (50 bp) are hard to align to unique sites in a genome, so they are rarely used except in small RNA (smRNA) experiments.
2. Paired-end: The DNA fragment is sequenced from both ends. This kind of sequencing helps obtain more unique genome alignments. At least 100 bp paired-end Illumina data is suggested for RNA-Seq experiments with a known genome, and 150 bp paired-end Illumina data is advised for RNA-Seq studies without a genome or with a genome of questionable quality.
3. Single-end: The DNA fragment is sequenced from one end only. This is appropriate when the experiment contains DNA fragments that are shorter than the read length. For example, 50 bp single-end data is commonly used in smRNA experiments.
4. Biological Replicates: For RNA-Seq experiments that evaluate differential expression, at least 3 replicates are required, and 5 to 10 replicates are ideal.
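For genome sequencing, read length and layout translate into an amount of sequencing through the usual depth estimate: total sequenced bases divided by genome size. The sketch below applies that estimate. It is an approximation that ignores duplicates, adapters, and uneven coverage, and the genome size, target depth, and function names are assumptions chosen for illustration, not recommendations from the text above.

```python
import math


def expected_coverage(n_fragments: int, read_length: int,
                      genome_size: int, paired_end: bool = True) -> float:
    """Estimate average depth as total sequenced bases / genome size."""
    ends = 2 if paired_end else 1  # paired-end yields two reads per fragment
    return n_fragments * read_length * ends / genome_size


def fragments_needed(target_coverage: float, read_length: int,
                     genome_size: int, paired_end: bool = True) -> int:
    """Estimate how many fragments must be sequenced to reach a target depth."""
    ends = 2 if paired_end else 1
    return math.ceil(target_coverage * genome_size / (read_length * ends))


if __name__ == "__main__":
    # Hypothetical example: 3 Gb genome, 150 bp paired-end reads, 30x depth.
    needed = fragments_needed(30, 150, 3_000_000_000, paired_end=True)
    print(f"fragments needed: {needed:,}")
    print(f"depth check: {expected_coverage(needed, 150, 3_000_000_000):.1f}x")
```

Comparing the fragment count against the output of a sequencing run or lane gives a first estimate of how many runs the budget has to cover. RNA-Seq experiments are usually planned by reads per sample rather than genome depth, so this estimate does not apply to them directly.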