When genomic research first came on the scene, much of the biomedical research community viewed it as a limited venture with limited potential. We now know.

Genome assemblies with long reads from PacBio SMRT sequencing or, more recently, Oxford Nanopore MinION sequencing are often superior because of the low number of resulting contigs (often complete bacterial genomes), but there are still concerns regarding their high error frequencies and reliability [ 7 , 8 , 9 ]. Many of these problems can be resolved with some time from an assembly specialist, which improves the assembly quality remarkably. The large number of contigs after assembly is one of the major problems observed with short-read sequencing technologies.

A recent publication on the intraspecies taxonomy of the plant pathogen Pseudomonas syringae included genomes with up to contigs [ 10 ]. The quality of these genome sequences may be fine for taxonomic analysis, where most parameters, such as average nucleotide identity (ANI) [ 11 ] or genome-to-genome distance calculation (GGDC) [ 12 ], do not depend on the integrity of annotations. However, for comparative genomics that searches for individual gene sequences, such fragmented genomes are not suitable, which certainly limits the use of the assembly. It should be stated that a large number of contig gaps often cannot be resolved, but this depends on the genome.

We recently sequenced two genomes of P. In these genomes, many of the contig breaks are caused by the presence of insertion sequence (IS) elements. As IS elements are typically around 1. For this reason, our research group now prefers to use PacBio sequencing with high coverage to improve the quality of genome assemblies from species that harbor a large number of IS elements [ 14 , 15 ]. Still, manual inspection after sequencing was required to solve some sequence problems. On the other hand, it should also be stated that the quality of most genomes sequenced with Illumina technology can easily be improved by some additional assembly steps Fig.

Within our research group, we commonly spend up to one week per genome to reduce the number of contigs from an Illumina assembly. Manually checking the mapped reads in SeqMan Pro (DNASTAR) will uncover assembly errors caused by false joins, as the repeats involved show a higher coverage on part of a contig than the average coverage.
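The coverage check described above is done visually in SeqMan Pro. As a rough, scriptable complement, the following Python sketch flags windows of a contig whose mean read depth lies well above the contig-wide median, which is the typical signature of a collapsed repeat. The function name, the 100-base window and the 1.8x threshold are illustrative assumptions, not values from the text. The median is used here in place of the average so that the repeat itself does not inflate the baseline. Per-base depth is assumed to be available as a list of integers, for example parsed from the output of a tool such as `samtools depth`.

```python
from statistics import median


def flag_high_coverage_windows(depth, window=100, factor=1.8):
    """Return (start, end, mean_depth) for each window whose mean depth
    exceeds `factor` times the contig-wide median depth.

    `depth` is a list of per-base read depths for ONE contig.
    `window` and `factor` are hypothetical defaults chosen for illustration.
    """
    if not depth:
        return []
    baseline = median(depth)
    flagged = []
    for start in range(0, len(depth), window):
        chunk = depth[start:start + window]
        mean_depth = sum(chunk) / len(chunk)
        # A window far above the baseline suggests reads from several
        # repeat copies piling up on a single assembled copy.
        if baseline > 0 and mean_depth > factor * baseline:
            flagged.append((start, start + len(chunk), mean_depth))
    return flagged


# Toy data: 300 bases at 40x followed by a 100-base region at 85x.
toy_depth = [40] * 300 + [85] * 100
suspects = flag_high_coverage_windows(toy_depth)
print(suspects)
```

Flagged windows are only candidates for manual inspection; a high-coverage region can also be a genuine multi-copy element that was assembled correctly.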

Such contigs may be split before the next step.

To follow the process described in the text, the parts involved in step 1 and step 2 are shaded, whereas all other processes belong to step 3. Black arrows: follow-up processes; blue arrows: information flow; grey arrow: potential follow-up process.

The second step is to perform an assembly of all contigs from the resulting FastA file in SeqMan against each other. Here, several contigs may already be joined based on the additional sequence information, as overlaps are generated. Additionally, this process will eliminate many of the small contigs, which may be included inside other contigs. These are checked to confirm that they are validly included. When a reference genome of the same species is available, this sequence can also be used to map reads against, followed by combining mapped and de novo contigs in SeqMan.
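The elimination of small contigs that lie inside larger ones can be illustrated with a minimal Python sketch. It is not what SeqMan does internally: it only detects exact containment, on either strand, whereas an assembler tolerates mismatches and uses overlaps. All names are hypothetical, and the contained contigs it reports should still be checked manually, as described above.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T, N only)."""
    table = str.maketrans("ACGTN", "TGCAN")
    return seq.translate(table)[::-1]


def drop_contained_contigs(contigs):
    """Split contigs into those kept and those exactly contained in a kept one.

    `contigs` maps contig name -> sequence.
    Returns (kept, contained), where `contained` maps the name of each
    removed contig to the name of the contig that contains it.
    """
    # Process longest first, so a contig is only compared with contigs
    # that are at least as long as itself.
    ordered = sorted(contigs.items(), key=lambda item: len(item[1]), reverse=True)
    kept = {}
    contained = {}
    for name, seq in ordered:
        query = seq.upper()
        query_rc = reverse_complement(query)
        container = next(
            (k for k, s in kept.items() if query in s or query_rc in s),
            None,
        )
        if container is None:
            kept[name] = query
        else:
            contained[name] = container
    return kept, contained


# Toy data: c2 lies inside c1, c3 is the reverse complement of c2.
toy_contigs = {
    "c1": "ATGCGTACGTTAGCCGATAA",
    "c2": "GTACGTTAGC",
    "c3": "GCTAACGTAC",
    "c4": "TTTTGGGGCCCC",
}
kept, contained = drop_contained_contigs(toy_contigs)
print(sorted(kept), contained)
```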

However, this may introduce other problems due to misassembled regions. Afterwards, the overlaps need to be checked carefully because, in the case of contig forks, contigs may be joined erroneously. Others, potentially erroneously joined in the previous step, may have to be split again. The process has to be repeated several times to yield the FastA file of a final high-quality draft genome assembly, as not all gaps can be resolved e. After annotation, information can be derived from the contigs that could lead to improved contig assembly, e.

The above-mentioned process often yields closure of plasmid sequences from draft genomes [ 18 ], and it also routinely reduces the total number of contigs to fewer than 50 per genome [ 19 , 20 , 21 ], with near-complete removal of small contigs. Due to a thorough quality check at every assembly step by repeated read mapping and visual checking Fig. As the raw reads are generally available from databanks, the workflow Fig. The problem with long-read technologies is not the number of contigs, but the quality of the individual read sequences.
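Whether an assembly reaches the kind of contig counts mentioned above can be checked quickly from the FastA file. The following Python sketch reports the number of contigs, the total length, the N50 and the number of small contigs. The function names and the 1000 bp cutoff for "small" are assumptions for illustration; dedicated tools such as QUAST report these and many more metrics.

```python
def read_fasta_lengths(lines):
    """Return {contig_name: length} from an iterable of FastA lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            # Fall back to a generated name if the header is empty.
            name = parts[0] if parts else f"contig_{len(lengths) + 1}"
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def assembly_summary(lengths, small_cutoff=1000):
    """Contig count, total size, N50 and number of contigs below `small_cutoff`."""
    sizes = sorted(lengths.values(), reverse=True)
    total = sum(sizes)
    running = 0
    n50 = 0
    for size in sizes:
        running += size
        # N50: length of the contig at which half of the assembly is covered.
        if running * 2 >= total:
            n50 = size
            break
    return {
        "contigs": len(sizes),
        "total_bp": total,
        "n50": n50,
        "small_contigs": sum(1 for s in sizes if s < small_cutoff),
    }


# Toy FastA with contig lengths 8, 4, 4, 2 and 2.
toy_fasta = """>c1
AAAAAAAA
>c2
CCCC
>c3
GG
GG
>c4
TT
>c5
AC
""".splitlines()

summary = assembly_summary(read_fasta_lengths(toy_fasta), small_cutoff=3)
print(summary)
```

For a real assembly, the lines would come from an open file handle, for example `read_fasta_lengths(open("assembly.fasta"))`, where the file name is a placeholder.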

By using a sufficiently large number of reads, or additional reads from a short-read technology, for assembly, the quality of the assembly can be improved significantly. However, if a genome is only used for taxonomic analysis, sequence errors based on lower coverage are not intrinsically detected. Unfortunately, such genomes will all the same appear in comparative studies, influencing their quality [ 25 ].

This genome clustered closely with the genomes of two recently described novel species in the genus Phytobacter [ 27 ] (Smits and F. Rezzonico, unpublished). After the analysis of the genome sequence with the comparative genomics program EDGAR [ 28 , 29 ] together with several other genomes of Phytobacter and related genera, we noticed that inclusion of the GT genome sequence led to a drastic drop in the number of core genes. Reannotation using Prokka [ 30 ] did not improve the situation, and the summary of the annotation indicated a large number of pseudogenes. An examination of the annotation showed that these pseudogenes were caused by frameshifts, presumably originating in sequencing errors in the reads used. Interestingly enough, the same authors had previously published a draft genome of the same strain based on Illumina reads [ 31 ].
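The effect described here, one error-rich genome collapsing the core genome, follows directly from how a core genome is defined: it is the intersection of the gene families present in every genome. The Python sketch below illustrates this with a leave-one-out comparison. It assumes that genes have already been assigned to orthologous groups (the identifiers are made up) and does not reproduce EDGAR's own orthology method. Genes broken by frameshifts are treated as missing from their group, which is what shrinks the intersection.

```python
def core_genes(gene_sets):
    """Gene families present in every genome (the core genome)."""
    sets = list(gene_sets.values())
    return set.intersection(*sets) if sets else set()


def core_size_without_each(gene_sets):
    """Core genome size when each genome in turn is left out."""
    result = {}
    for name in gene_sets:
        others = {k: v for k, v in gene_sets.items() if k != name}
        result[name] = len(core_genes(others))
    return result


# Hypothetical orthologous-group assignments. "poor_assembly" has lost most
# intact genes to frameshifts, so few groups are detected in it.
toy_genomes = {
    "strain_A": {"g1", "g2", "g3", "g4", "g5"},
    "strain_B": {"g1", "g2", "g3", "g4", "g6"},
    "poor_assembly": {"g1", "g7"},
}

core_all = core_genes(toy_genomes)
leave_one_out = core_size_without_each(toy_genomes)
print(len(core_all), leave_one_out)
```

A genome whose removal makes the core size jump, as `poor_assembly` does in this toy data, is a candidate for closer inspection of its annotation and underlying reads.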

Combination of the data in a hybrid assembly approach would have yielded a high-quality genome [ 32 , 33 ]. In my job as section editor, but also prior to this, I have encountered many manuscripts in which the authors described only the sequencing and automatic assembly of genomes, often prior to comparative genomics. I have identified many manuscripts that are based on such work, and I have rejected some of them due to lack of basic genome information. Investing a little time in assembly and quality control can resolve assembly mistakes, yielding a lower number of contigs, and can allow identification and closure of plasmids.

This little bit of extra time helps editors and reviewers to estimate the quality of the genomes used in comparative genomic studies, and it also helps the research community to use genome sequences more effectively for various purposes.

Problems based on the quality of genome assemblies, as described in this correspondence, would then be minimized. In the end, the benefit from good-quality genome assemblies in databanks [ 34 , 35 ] is a win-win situation for all researchers in genomics.

QUAST: quality assessment tool for genome assemblies.

CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res.

BUSCO applications from quality assessments to gene prediction and phylogenomics. Mol Biol Evol.

Mardis ER. Next-generation DNA sequencing methods. Annu Rev Genomics Hum Genet.

A tale of three next generation sequencing platforms: comparison of ion torrent, Pacific biosciences and Illumina MiSeq sequencers. BMC Genomics.

Performance comparison of benchtop high-throughput sequencing platforms. Nat Biotechnol.

Rhoads A, Au KF. PacBio sequencing and its applications. Genomics Proteomics Bioinformatics.

The long reads ahead: de novo genome assembly using the MinION.

Clarification of taxonomic status within the Pseudomonas syringae species group based on a phylogenomic analysis. Front Microbiol.

The bacterial species definition in the genomic era. Phil Trans R Soc B.

Stand Genomic Sci.

Comparative genomics and pathogenicity potential of members of the Pseudomonas syringae species complex on Prunus sp.

Genome Announc.

Complete genome sequences of three isolates of Xanthomonas fragariae, the bacterium responsible for angular leaf spots on strawberry plants.