
Brief Bioinform. 2010 Sep; 11(5): 484–498.

Challenges of sequencing human genomes

Received 2010 Mar 4; Revised 2010 Apr 19

Abstract

Massively parallel sequencing technologies continue to change the study of human genetics. As the price of sequencing declines, next-generation sequencing (NGS) instruments and datasets will become increasingly accessible to the wider research community. Investigators are understandably eager to harness the power of these new technologies. Sequencing human genomes on these platforms, however, presents numerous production and bioinformatics challenges. Production issues like sample contamination, library chimaeras and variable run quality have become increasingly problematic in the transition from technology development lab to production floor. Analysis of NGS data, too, remains challenging, especially given the short read lengths (35–250 bp) and sheer volume of data. The development of streamlined, highly automated pipelines for data analysis is critical for the transition from technology adoption to accelerated research and publication. This review aims to describe the state of current NGS technologies, as well as the strategies that enable NGS users to characterize the full spectrum of DNA sequence variation in humans.

Keywords: massively parallel sequencing, next-generation sequencing, human genome, variant detection, short read alignment, whole genome sequencing

INTRODUCTION

The landscape of human genetics is rapidly changing, fueled by the advent of massively parallel sequencing technologies [1]. New instruments from Roche (454), Illumina (Genome Analyzer), Life Technologies (SOLiD) and Helicos Biosciences (Heliscope) generate millions of short sequence reads per run, making it possible to sequence entire human genomes in a matter of weeks. These 'next-generation sequencing' (NGS) technologies have already been employed to sequence the constitutional genomes of several individuals [2–10]. Ambitious efforts like the 1000 Genomes Project and the Personal Genome Project [11] hope to add thousands more. The first five cancer genomes to be published [12–17] revealed thousands of novel somatic mutations and implicated new genes in tumor development and progression. Our knowledge of the genetic variants that underlie disease susceptibility, treatment response and other phenotypes will continually improve as these studies expand the catalog of DNA sequence variation in humans.

The genomes of at least 10 individuals have been sequenced to high coverage using NGS technologies (Table 1). The first such genome (Watson) was sequenced to ∼7.4× coverage on the 454 GS (Roche) platform [9], and included ∼3.3 million single nucleotide polymorphisms (SNPs), of which 82% were already listed in the National Center for Biotechnology Information SNP database (dbSNP) [18]. Remarkably, the nine personal genomes that followed on NGS technologies [2–8] reported similar results in terms of SNPs: 3–4 million per genome, 80–90% of which overlapped dbSNP. This pattern is so robust, in fact, that many consider ∼3 million SNPs with 80–90% dbSNP concordance (depending on the ethnicity of the sample) to be the 'gold standard' for SNP discovery in whole-genome sequencing (WGS). Another implication of this pattern is that individual genomes contain ∼0.5 million novel SNPs, whose submission to public databases will cause exponential growth as WGS studies expand. Indeed, since the completion of the Watson genome in 2007, submissions to dbSNP have skyrocketed (Figure 1). As of February 2010, dbSNP had received over 100 million submissions for human, corresponding to 23.7 million unique sequence variants, of which more than half have been validated [18].

[Figure 1. Image file: bbq016f1.jpg]

Growth of the public database dbSNP from 2002 to 2010. Note the exponential growth in submissions following the first genome sequenced on next-generation technology (Watson) in 2007.

Table 1:

Complete individual genomes and cancer genomes sequenced on massively parallel sequencing instruments

Sample Sequencing platform Max read length (bp) Fold coverage Genotype concord. (%) SNPs (M) dbSNP (%)
Individual genomes
    Watson [9] Roche/454 1 × 250 7.4× 75.8 3.32 82
    NA18507 (YRI) [3] Illumina 2 × 35 41× 99.5 3.45 74
    NA18507 (YRI) [6] ABI SOLiD 2 × 25 18× 99.2 3.87 81
    YH (Asian) [8] Illumina 2 × 35 36× 99.2 3.07 86
    SJK (Korean) [2] Illumina 1 × 75 29× 99.4 3.44 88
    AK1 (Korean) [5] Illumina 2 × 106 28× 99.1 3.45 83
    P0 (Quake) [7] Helicos 1 × 70 28× 98.3 2.81 76
    NA07022 (CEU) [4] Complete Genomics 2 × 35 87× 91.0 3.08 90
    NA19240 (YRI) [4] Complete Genomics 2 × 35 63× 95.0 4.04 81
    NA20431 (PGP1) [4] Complete Genomics 2 × 35 45× 86.0 2.91 90

Sample Sequencing platform Max read length (bp) Tumor coverage Normal coverage Coding SNVs Coding indels

Cancer genomes
    Acute myeloid leukemia (AML1) [12] Illumina 1 × 36 33× 14× 8 2
    Acute myeloid leukemia (AML2) [13] Illumina 2 × 75 23× 21× 10 2
    Lobular breast cancer [16] Illumina 2 × 50 43× 32 0
    Small-cell lung cancer (NCI-H209) [15] ABI SOLiD 2 × 25 39× 31× 134 2
    Malignant melanoma (COLO-829) [14] Illumina 2 × 75 40× 32× 292 0
    Glioblastoma cell line (U87MG) [20] ABI SOLiD 2 × 50 30× 100 34
    Basal-like breast cancer [17] Illumina 2 × 75 29× 39× 43 7

NGS technologies show great promise for the study of the genetic underpinnings of human disease. WGS is particularly appealing because it can detect the full spectrum of genetic variants—SNPs, indels, structural variants (SVs) and copy number variations (CNVs)—that may contribute to a phenotype [19]. Indeed, the complete genome sequences of several human cancers—AML [12, 13], breast cancer [16, 17], melanoma [14], lung cancer [15] and glioblastoma [20]—have dramatically expanded the catalog of acquired (somatic) changes that may contribute to tumor development and growth (Table 1). For Mendelian diseases, massively parallel sequencing of family pedigrees offers an effective means of identifying the variants and genes underlying inherited disease [21]. Indeed, the recent sequencing and analysis of a proband with Charcot–Marie–Tooth syndrome [22] demonstrates that these technologies have the potential for diagnostics in a clinical setting.

The value of massively parallel sequencing instruments for research is clearly illustrated by the widespread adoption of these platforms throughout North America, Europe, Asia and the Pacific (Figure 2). The commoditization of NGS throughout the world suggests that a substantial portion of sequenced human genomes will be produced outside of major genome sequencing centers. Very soon, groups with little to no experience in working with massively parallel sequencing data will gain access to these powerful technologies. The challenges that they face—in terms of production, management, analysis and interpretation of incredible amounts of sequence data—are daunting indeed. Fortunately, major genome centers and other groups who pioneered both traditional and next-generation sequencing of human genomes have already addressed many of the key issues. Their strategies and methods for high-throughput sequencing of human genomes are the focus of this review.

[Figure 2. Image file: bbq016f2.jpg]

Distribution of NGS instruments by country (March 2010). Courtesy of the next-generation sequencing maps maintained by Nick Loman [70] and James Hadfield [71].

NGS: OVERVIEW

Massively parallel sequencing enjoys a wide array of applications in the study of human genetics. Generally speaking, however, human genome resequencing using NGS technologies employs one of three strategies: targeted resequencing (Target-Seq), whole-genome shotgun sequencing (WGS) and transcriptome sequencing (RNA-Seq). The types of genetic variation that can be characterized by these strategies are largely complementary; ultimately, a combination of whole-genome, targeted, and transcriptome sequencing yields the most comprehensive view of an individual genome (Figure 3).

[Figure 3. Image file: bbq016f3.jpg]

The intersection of WGS, Target-Seq and RNA-Seq for the characterization of human genomes. Target-Seq of specific regions (selected by PCR or capture) serves primarily for the identification of SNPs and small indels. WGS enables detection not just of SNPs and indels, but also of CNVs and SVs (often aided by de novo assembly). RNA-Seq provides digital gene expression information that can be used to validate SNP/indel calls in coding regions and assess the impact of genetic variation (CNVs, SNPs and indels) on gene expression. RNA-Seq with paired-end libraries also enables the identification of chimeric transcripts, which serve to validate gene fusion events resulting from genomic structural variation.

Targeted sequencing (Target-Seq) applies genome enrichment strategies to isolate specific regions of interest prior to sequencing. Polymerase chain reaction (PCR)-based approaches for enrichment are gradually being supplanted by hybrid selection (capture) technologies [23, 24], in which sets of DNA or RNA oligonucleotide probes complementary to regions of interest are hybridized with libraries of fragmented DNA. Several methods for capture have been optimized for use with massively parallel sequencing [25–29]. Perhaps the ultimate goal of massively parallel targeted sequencing is to fully characterize the 'exome', or the full set of known coding exons. Indeed, dozens of human exomes have been sequenced using hybrid selection technologies paired with massively parallel sequencing [30].

WGS offers the most comprehensive and unbiased approach to genome characterization with next-generation instruments. WGS is particularly attractive because it lets one study the full scope of known DNA sequence variation—from SNPs and small indels to large SVs and CNVs—in a single experiment [1]. Furthermore, sequence reads from single DNA molecules enable the phasing of detected variants to determine which occur on the same chromosome copy, information which is critical for genotype–phenotype correlation. To comprehensively characterize the variation in a single genome, however, it is necessary to generate highly redundant coverage to account for the increased sequencing error and shorter read lengths of massively parallel sequencing technologies. The redundancy required for accurate sequencing (currently ∼30×) is dependent upon read lengths and sequencing error rate; as these metrics improve, less redundancy may be required.

Massively parallel sequencing of cDNA libraries, or RNA-Seq, is a rapidly developing application for NGS technologies [31]. RNA-Seq offers a powerful approach to study the transcribed portion of the human genome, providing a digital readout of gene expression with sensitivity that far exceeds microarray-based methods. Furthermore, RNA-Seq enables the characterization of alternative splicing, allele-specific expression, fusion genes, and other forms of variation at the transcript level. Specialized methods for mapping mRNA–miRNA interactions have also been adapted for massively parallel sequencing [32, 33].

The broad set of applications for massively parallel sequencing technologies, combined with their widespread adoption by the research community, suggests that NGS will continue to play a central role in the biological discoveries of coming years. Although investigators are understandably eager to harness the power of these new technologies, the massively parallel sequencing of human genomes presents some significant challenges.

PRODUCTION CHALLENGES

It is important to realize that the generation and analysis of data from next-generation instruments present numerous challenges (Table 2). Primary among these are issues of sample contamination from non-human sources, library chimaeras, sample mix-ups and variable run quality.

Table 2:

Production challenges and solutions for next-generation sequencing

Challenge Solution
Sample contamination Map reads to databases of possible contaminating sequences
Library chimeras Avoid long-insert data for de novo assembly; require high coverage for SV detection
Sample mix-ups Compare SNP calls to high-density SNP array genotypes to identify mismatched samples
Tumor-normal switches Use copy number variation (CNV) algorithms to verify tumor or normal sample type
Variable run quality Automate liquid handling, streamline workflows, and implement regular QC checkpoints

Sample contamination

While sample contamination remains an area of concern in any sequencing project, two aspects of NGS help mitigate this issue. First, NGS can be performed on libraries without the use of bacterial cloning, which was a significant source of sequence contamination in capillary-based sequencing. Second, each read from NGS interrogates a single DNA molecule, which permits the identification and removal of individual contaminating reads. Indeed, by mapping NGS reads to a database of common contaminating genome sequences (of bacterial and viral origin, for example), it is possible to rapidly screen libraries and remove the contaminating sequences.
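The screening step can be sketched as a simple k-mer lookup against an index built from contaminant sequences. This is a hypothetical illustration of the idea only; the k-mer size, hit threshold and function names are our assumptions, and production pipelines typically run a full aligner against the contaminant database instead:

```python
def kmers(seq, k):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def build_contaminant_index(contaminant_seqs, k=11):
    """Union of k-mers from known contaminating genomes (e.g. E. coli, phiX)."""
    index = set()
    for seq in contaminant_seqs:
        index |= kmers(seq, k)
    return index

def is_contaminant(read, index, k=11, min_hits=3):
    """Flag a read when enough of its k-mers match the contaminant index."""
    hits = sum(1 for km in kmers(read, k) if km in index)
    return hits >= min_hits
```

Reads flagged this way can then be excluded before alignment to the human reference.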

Library chimaeras

As many as 5% of long-insert paired-end libraries contain chimeric reads [3]. This artifact can have serious ramifications for de novo assembly [34–37] and SV prediction [38, 39] algorithms that rely upon mate pairing information. The assembly problem is potentially more severe, as chimeric fragments can generate false assembly paths. One solution is to use only fragment-end or short-insert paired-end data for de novo assembly, and long-insert paired-end data for the scaffolding of assembled contigs. For both scaffolding and SV detection, requiring a minimum of three or more independent supporting read pairs at a given locus helps reduce the influence of low-frequency chimeras.
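The minimum-support filter reduces to counting read pairs per candidate locus; a minimal sketch, in which the tuple representation of a breakpoint locus is a hypothetical simplification:

```python
from collections import Counter

def filter_sv_candidates(supporting_pairs, min_support=3):
    """Keep candidate SV loci supported by >= min_support independent read
    pairs; loci seen only once or twice are likely chimeric artifacts.
    Each element is a (chrom1, pos1, chrom2, pos2) breakpoint key."""
    counts = Counter(supporting_pairs)
    return {locus for locus, n in counts.items() if n >= min_support}
```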

Sample mix-ups

In a high-throughput sequencing environment, human error is an important factor. Major genome centers have developed strategies to identify samples that are switched, mislabeled, or highly contaminated. To identify mislabeled samples, our group and others utilize high-density SNP array data, which provide thousands or millions of accurate genotypes across the genome. These not only provide reference points for diploid coverage estimation, but also constitute a highly individualized forensic DNA profile of the intended sample. Even a single lane of data from WGS or exome capture typically provides sufficient depth to call genotypes at thousands of variant positions; a simple concordance analysis between these and the expected genotypes from high-density arrays (Figure 4A) can correctly distinguish a matched sample (90–99% concordant) from a mislabeled one (60–80% concordant).
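The concordance analysis amounts to comparing unordered allele pairs at positions shared between the sequencing calls and the array; a minimal sketch, assuming genotypes are held as position-to-allele-pair mappings:

```python
def genotype_concordance(seq_calls, array_calls):
    """Fraction of shared positions where the sequencing genotype matches
    the array genotype. Allele pairs are compared unordered, so ('A', 'G')
    and ('G', 'A') agree."""
    shared = [pos for pos in array_calls if pos in seq_calls]
    if not shared:
        return 0.0
    matches = sum(1 for pos in shared
                  if frozenset(seq_calls[pos]) == frozenset(array_calls[pos]))
    return matches / len(shared)
```

By the thresholds quoted above, a value near 0.95 confirms the intended sample, while a value near 0.70 suggests a mix-up.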

[Figure 4. Image file: bbq016f4.jpg]

Performance metrics for sequence data quality. (A) Genotype quality control of sequencing runs. Concordance of per-lane SNP calls with high-density SNP array genotypes for 65 lanes of Illumina data. The low concordance of randomly mismatched controls (left) helps distinguish low-quality data (top right) from true sample mix-ups (right). (B) Error and mapping rates for five real flowcells sequenced on the Illumina platform (1 × 50 bp). Note the increased error rates and decreased alignment rates for poor-performing lanes 1 and 2 on flowcell 1.

Tumor-normal switches

NGS of cancer genomes is typically performed on tumor samples and matched normal controls from the same patient. Here, correct sample identification is particularly critical, since the discovery of somatic changes requires a direct comparison of tumor to normal. Unfortunately, high-density SNP arrays are less informative in this setting, since the samples share a common genetic origin. For many tumors, however, widespread genomic alterations and copy number changes distinguish tumor from normal. Thus, our group and others have applied CNV detection algorithms to NGS data from tumor-normal pairs to identify possible sample switches.

Variable run quality

As massively parallel sequencing instruments transition from technology development labs to production floors, maintaining consistent run quality is an important challenge. The quality and amount of input DNA and reagents, as well as the skill of the laboratory technicians, can significantly affect results. Given the typically high price of a single run on NGS instruments, experimental variability must be reduced as much as possible. Major genome centers have addressed this issue through automated liquid handling and streamlined workflows. Regular training of laboratory personnel is important as well. Finally, a series of quality control checks—DNA quantification using PicoGreen, gel-based or microfluidics fragment size selection, for example—can isolate the source of problems when they arise.

EVALUATION OF SEQUENCE QUALITY AND COMPLETENESS

Although the throughputs of current NGS platforms are significant, some samples (particularly those undergoing WGS) may require multiple sequencing runs. Given that sequencing runs on NGS instruments are costly and time-consuming, defining a data generation goal is an important step of the planning process. How much sequence data is enough? Often this question is answered by the practical considerations of funding, instrument access, and/or the availability of sample material. In the absence of such restrictions, however, certain performance metrics can indicate the quality and completeness of a sequenced genome.

Run quality metrics

The vendor-provided software for most NGS platforms reports some informative run quality metrics. Specifically, the number of reads, average read length (for the Roche/454 platform), alignment rate, and inferred error rate are the most obvious indicators of success or failure. As users gain experience with NGS data, the performance metrics of good versus bad runs will become more obvious. On the Illumina GAIIx platform, for example, we expect that good runs will yield 35–75 million reads per lane, with error rates of <2% and ELAND alignment rates of >80%. Error rate and alignment rate are correlated; as error rates increase, alignment rates tend to decrease (Figure 4B).
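These thresholds translate directly into an automated per-lane check; the function below simply encodes the GAIIx figures quoted above as default cutoffs (the function name is ours, not a vendor tool):

```python
def lane_passes_qc(n_reads, error_rate, align_rate,
                   min_reads=35e6, max_error=0.02, min_align=0.80):
    """Per-lane QC: enough reads, error rate under 2%, alignment over 80%."""
    return (n_reads >= min_reads
            and error_rate < max_error
            and align_rate > min_align)
```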

Sequencing coverage metrics

Sequencing coverage of the genome (for WGS studies) or of target regions (for Target-Seq studies) is the most basic metric for genome completion. There are several advantages to using coverage rather than numbers of runs, lanes, reads, or bases generated. Importantly, coverage excludes reads that are unmapped, ambiguously mapped, or marked as PCR duplicates to provide an estimate of 'usable' sequence data. The depth and breadth of sequencing coverage are directly related to the sensitivity and specificity of variant detection, which frequently represents the key analysis endpoint of human resequencing.

'Fold' redundancy (also called haploid coverage) is the number usually followed by an '×' in whole-genome resequencing studies. Most of the currently published WGS studies report fold coverage in the ∼30–50× range, which seems to be the bar for a genome sequenced on current NGS platforms. Among the individual genome sequences listed in Table 1 are two exceptions. One is the genome of James D. Watson [9], which was the first to be sequenced on a massively parallel platform (Roche/454) and whose 7.4× coverage represented a major achievement in sequencing technology. The second exception to the ∼30× rule is the sequencing of NA18507 on Life Technologies' SOLiD platform [6], which utilizes a di-base encoding scheme that requires lower redundancy to achieve >99% sequencing accuracy [6].
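Fold coverage is simple arithmetic over usable bases, and is handy for planning how many runs a target redundancy requires. The 85% usable fraction below is an illustrative assumption for reads lost to failed mapping, ambiguity and PCR duplicates, not a figure from this review:

```python
import math

def fold_coverage(n_reads, read_length, genome_size=3.1e9, usable_fraction=0.85):
    """Haploid ('fold') coverage: usable sequenced bases / genome size."""
    return n_reads * read_length * usable_fraction / genome_size

def runs_needed(target_fold, fold_per_run):
    """Whole runs required to reach a target redundancy."""
    return math.ceil(target_fold / fold_per_run)
```

Under these assumptions, 1.2 billion 100-bp reads yield roughly 33× haploid coverage.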

The availability of high-density SNP arrays, which typically assay >1 million SNPs across the human genome, provides another key metric of genome completion. Granted, current SNP arrays largely comprise SNPs that were characterized by large-scale efforts such as the International HapMap Project, sets which are known to harbor certain biases (ascertainment, allele frequency and proximity to genes). Highly repetitive regions, for example, are under-represented. Nevertheless, SNP array data for a sequenced genome are extremely valuable because they provide millions of data points at which sequencing coverage and accuracy can be assessed. Because SNP arrays include many common variants, a substantial number are likely to be heterozygous in the individual being sequenced; detection of both alleles in the sequencing data indicates that both chromosomes in a pair are represented. Thus, a comparison of the SNP calls from sequencing data to known genotypes from high-density SNP arrays serves as a more direct measurement of diploid coverage, which should reach 98–99% in a completed genome [12, 13].

PRIMARY ANALYSIS OF NGS DATA

Initially, the sheer volume of data produced by NGS instruments can be overwhelming. Development of a streamlined, highly automated pipeline to facilitate data analysis is a critical step in the transition from technology adoption to rapid data generation, analysis and publication. In this portion of the review, we discuss the key components of a primary analysis pipeline: sequence alignment, read de-duplication and conversion of data into a generic format in preparation for downstream analysis (Figure 5A).

[Figure 5. Image file: bbq016f5.jpg]

Basic workflows for next-generation sequencing. (A) Sequencing and alignment. Libraries constructed from genomic DNA or RNA are sequenced on massively parallel instruments (e.g. Illumina or SOLiD). The resulting NGS reads are mapped to a reference sequence. Mapped and unmapped reads are imported into SAM/BAM format and marked for PCR/optical duplicates. (B) Post-BAM downstream analysis. The FLAG field of the BAM file indicates the mapping status of each read. Mapped, properly paired reads (or mapped fragment-end reads) are used for SNP/indel detection and copy number estimation. Aberrantly mapped reads, in which reads in a pair map with unexpected distance or orientation, are mined for evidence of structural variation. Finally, de novo assembly of unmapped reads yields predictions of structural variants and novel insertions.

Sequence alignment

The key first step in the analysis of next-generation resequencing data is the alignment, or mapping, of sequence reads to a reference sequence. Three characteristics of NGS data complicate this task. First, the read lengths are relatively short (36–250 bp) compared to traditional capillary-based sequencing, which not only provides less information to use for mapping, but also decreases the likelihood that a read can be mapped to a single, unique location. Second, reads from NGS platforms are of imperfect quality; that is, they contain higher rates of sequencing error. On the Roche/454 platform, for example, homopolymeric sequences are often over- or under-called [40], resulting in reads that contain gaps relative to the reference sequence. On the Illumina GAIIx platform, base quality is a function of read position [3], with the highest-quality bases at the start of the read. The third complication presented by NGS platforms is the sheer volume of data. A single run produces millions of sequencing reads, whose alignment to a large reference sequence requires significant computing power.

Recent years have seen a plethora of short-read alignment tools to support next-generation data analysis. Reads produced on the Roche/454 platform are long enough that traditional algorithms like BLAT [41] and SSAHA2 [42] can be used effectively to map them. The high throughput and short read length of the Illumina/Solexa platform, however, presented a significant algorithmic challenge. One of the first tools to address it was the mapping and alignment with qualities algorithm, or MAQ [43]. Compared to the vendor-provided software for Illumina data alignment, MAQ offered several advantages. First, it considered base quality scores during sequence alignment, which helped to address the variable quality of sequence across a read. Second, it assigned a mapping quality score to quantify the algorithm's confidence that a read was correctly placed. Finally, MAQ made use of read pairing information (for Illumina paired-end libraries) to improve mapping accuracy and identify aberrantly mapped pairs. MAQ was widely adopted by the NGS community, and utilized in WGS of human [2, 3] and cancer [12, 13, 16] genomes.

Efficient mapping of short reads to a large reference sequence has remained a considerable computational challenge, spurring the development of dozens of alignment algorithms (Table 3). Some, like Novoalign (http://www.novocraft.com), sought to improve upon the sensitivity of Illumina read alignment. At least three aligners (Bowtie [44], BWA [45] and SOAP2 [46]) have leveraged the Burrows–Wheeler transform (BWT) to dramatically decrease alignment time. Indeed, these algorithms can map a single lane of Illumina data (∼20 million reads) in a matter of hours, compared to the several days required by MAQ or Novoalign.
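The transform underlying these aligners can be illustrated in a few lines. This toy rotation-sorting version conveys the idea only; production tools build an FM-index over the transform and never materialize the rotations:

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted cyclic rotations.
    A '$' sentinel (lexicographically smallest) marks the end of the text;
    the transform is the last column of the sorted rotation matrix."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)
```

The transform groups identical characters with similar contexts together, which is what makes the compressed, rapidly searchable indexes of Bowtie, BWA and SOAP2 possible.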

Table 3:

Selected mapping and alignment tools for massively parallel sequencing data

The SOLiD platform (Life Technologies) utilizes a unique di-base encoding scheme in which each base is interrogated twice, to help distinguish sequencing errors from true variation. Indeed, a recently published study applied SOLiD sequencing to characterize an entire genome with only ∼18× haploid coverage [6]. While vendor-provided software for mapping SOLiD data is available, independent groups have developed their own. SHRiMP [47] is a rapid implementation of Smith–Waterman alignment that performs colorspace correction while aligning reads. The BLAT-like fast alignment software tool (BFAST [48]) maps reads in color space and allows gaps, which enabled the identification of ∼190 000 small (1–21 bp) indels in the recent sequencing of a glioblastoma cell line [20].

Identifying redundant sequences

Early during the rise of NGS, it became apparent that many of the reads from massively parallel sequencing instruments were identical—same sequence, start site and orientation—suggesting that they represent multiple reads of the same unique DNA fragment, possibly amplified by PCR during the sequencing workflow [49–52]. It is critical to identify and remove these duplicate reads prior to variant calling, since the unintended amplification of PCR-introduced errors can skew variant allele frequencies and thereby decrease variant detection sensitivity and specificity [50].
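A minimal sketch of coordinate-based duplicate marking, assuming each read record carries its chromosome, start position, strand and a quality score (Picard's actual fragment-based logic is more involved):

```python
def mark_duplicates(reads):
    """Mark reads sharing chromosome, start position and strand as
    duplicates, keeping the highest-quality read at each position.
    Adds a 'duplicate' flag to each read dict in place."""
    best = {}
    for read in reads:
        key = (read["chrom"], read["pos"], read["strand"])
        if key not in best or read["qual"] > best[key]["qual"]:
            best[key] = read
    for read in reads:
        key = (read["chrom"], read["pos"], read["strand"])
        read["duplicate"] = read is not best[key]
    return reads
```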

SAMtools ([53], http://samtools.sourceforge.net) includes utilities for the removal of PCR duplicates from single-end or paired-end libraries. However, a superior solution is offered by the Picard suite (http://picard.sourceforge.net), which not only applies optimal fragment-based duplicate identification, but marks duplicate reads using the FLAG field rather than removing them from the SAM file.

The SAM/BAM format

The definition of the sequence alignment map (SAM) format and its binary equivalent (BAM) was a critical achievement for NGS data analysis. The SAM format specification (http://samtools.sourceforge.net/SAM1.pdf) describes a generic format for storing both sequencing reads and their alignment to a reference sequence or assembly. SAM/BAM format is relatively compact, yet flexible enough to accommodate relevant information from different sequencing platforms and short-read aligners. A single SAM/BAM file can store mapped, unmapped, and even QC-failed reads from a sequencing run, and can be indexed to allow rapid access. This means that, if desired, the raw sequencing data can be fully recapitulated from the SAM/BAM file.

One key field of the SAM format specification is the FLAG, a 'bitwise' representation of several read properties, each of which can be true or false. Each property is set to on (1) or off (0); the bits that are set to on, when combined, represent an integer value. Thus, a single field in the SAM format indicates whether a read is paired, properly paired, mapped, read1 or read2, quality-failed, or marked as a duplicate. SAM/BAM files can therefore contain extensive information about a read, its properties, and its alignment to a reference sequence. A freely available software package, SAMtools [53], provides utilities for creating, sorting, combining, indexing, viewing, and manipulating SAM/BAM files. For these reasons, SAM/BAM format has been widely adopted by the sequencing community.
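Decoding the FLAG is straightforward bit-testing; the bit values below come from the SAM format specification:

```python
# FLAG bit definitions from the SAM format specification.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "properly_paired",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary_alignment",
    0x200: "qc_fail",
    0x400: "duplicate",
}

def decode_flag(flag):
    """Return the set of read properties encoded in a SAM FLAG integer."""
    return {name for bit, name in SAM_FLAGS.items() if flag & bit}
```

For example, FLAG 99 (1 + 2 + 32 + 64) denotes the first read of a properly mapped pair whose mate is on the reverse strand.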

Possibilities for outsourcing sequencing

The availability of sequencing services offered by private companies [4] such as Complete Genomics, as well as the Beijing Genomics Institute and other centers, has raised the possibility of 'outsourcing' massively parallel sequencing. This option may be attractive to investigators because it mitigates the considerable financial and personnel investment required for NGS instruments [4]. Furthermore, the development of NGS data analysis packages for cloud computing [54] suggests that computationally intense analyses may be run on rented hardware, thus removing the cost of purchasing and maintaining such equipment.

The possibility of outsourcing DNA sequencing to a third party deserves careful consideration. There are important concerns related to privacy and security of the data—since DNA and RNA contain information that could be used to identify an individual, keeping that information in confidence, and safe from intrusion, is of the utmost importance for many investigators. The ethical and legal responsibilities surrounding human samples continue to gain prominence, suggesting that permitting third parties to perform the sequencing faces, at the very least, an uphill battle. Transparency in the data generation process is also a key issue; since the primary analysis of NGS data is so critical to the final results, every step between receiving a sample and providing a BAM file must be carefully documented.

Despite these difficulties, it is clear that some companies and institutions will have the capacity to perform sequencing for outside parties, and some investigators are bound to find sequencing-as-a-service appealing for their research. Furthermore, recent studies in which NGS technologies uncovered important genes for Mendelian disease [21, 22] illustrate the potential of sequencing data to enhance clinical research. For these reasons, the following sections on downstream analysis assume that sequencing data arrive in a BAM file, which seems the most likely endpoint for both primary analysis pipelines and outsourced sequencing.

DOWNSTREAM ANALYSIS OF NGS DATA

A key advantage of converting NGS data to SAM/BAM format is that all downstream analysis can be driven from a single data file (Figure 5B). Properly mapped reads can be used to identify SNPs/indels and to infer genome-wide copy number. Aberrantly mapped read pairs can be screened for evidence of underlying structural variation, while de novo assembly of unmapped reads [34–37] enables the characterization of novel insertions and SV breakpoints. In this section of the review, we discuss some of the algorithms that have been developed for detecting these types of variation in NGS data.

SNPs

Massively parallel sequencing data have proven ideal for the identification of SNPs [55, 56]. Indeed, some ∼3–4 million SNPs per individual were reported for the WGS studies presented in Figure 1; of these, the vast majority (74–90%) [2–9] corresponded to known variants in the National Center for Biotechnology Information's database of sequence variation in humans (dbSNP) [18]. This high overlap suggests that the vast majority of reported variants are real human polymorphisms. Yet the significant fraction of novel SNPs (10–26%) identified by whole genome sequencing implies that a substantial portion of rare variation remains to be discovered. Efforts such as the 1000 Genomes Project (http://www.1000genomes.org) promise to catalog these by sequencing the genomes of thousands of individuals. In cancer, most of the validated somatic single nucleotide variants (SNVs) are neither present in dbSNP nor shared among other tumors [12, 13, 16]. Accurate detection of single nucleotide variation, therefore, remains an important aspect of NGS.

Numerous algorithms for calling SNPs in NGS data have been developed in recent years. Bayesian methods (e.g. Atlas-SNP [56], SOAPsnp [55]) utilize prior probability calculations to determine the most probable genotype (reference or variant) based upon the available sequence information. Other packages (e.g. SAMtools [53], VarScan [57]) include numerous utilities for detection and filtering of variant calls based on heuristic and probabilistic models, reinforced with empirical knowledge of massively parallel sequencing platforms. No single tool clearly outperforms the others. Indeed, a combination of variant calling algorithms, each tuned to perform optimally for the dataset at hand, is likely to yield the best combination of sensitivity and specificity for variant detection in human genomes.
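To make the Bayesian idea concrete, the following sketch computes a posterior over the three diploid genotypes at a biallelic site from reference/alternate base counts. The error rate and priors are illustrative placeholders, not the quality-aware models actually used by Atlas-SNP or SOAPsnp.

```python
import math

def genotype_posteriors(ref_count, alt_count, error_rate=0.01, het_prior=0.001):
    """Posterior over genotypes {ref/ref (RR), ref/alt (RA), alt/alt (AA)}.

    Uses a binomial likelihood on the alternate-base count with an assumed
    per-base error rate and a simple heterozygosity prior.
    """
    n = ref_count + alt_count
    # P(observe alt base | genotype): error for RR, 0.5 for het, 1 - error for AA
    p_alt = {"RR": error_rate, "RA": 0.5, "AA": 1.0 - error_rate}
    priors = {"RR": 1.0 - 1.5 * het_prior, "RA": het_prior, "AA": het_prior / 2}
    likelihood = {
        g: math.comb(n, alt_count) * p ** alt_count * (1.0 - p) ** ref_count
        for g, p in p_alt.items()
    }
    evidence = sum(priors[g] * likelihood[g] for g in priors)
    return {g: priors[g] * likelihood[g] / evidence for g in priors}

# 12 reference and 10 alternate bases strongly favor a heterozygote:
post = genotype_posteriors(ref_count=12, alt_count=10)
```

Note how the prior pulls borderline sites toward the reference genotype; production callers further weight each observation by its base quality.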

False positives during SNP calling generally arise from two phenomena. The first source is sequencing error, which is more prevalent for NGS platforms than for traditional capillary-based methods. While sequencing errors are often random, certain platform-specific and platform-independent trends have become evident. On the Illumina/Solexa platform, sequencing error is positively correlated with read position; errors tend to occur near the ends of reads. In contrast, errors on the Roche/454 platform are not dependent on read position, but tend to cluster around homopolymeric sequences that are under- or over-called during 454 pyrosequencing.

Alignment artifacts are the second major source of false positive SNP calls. The relatively short read lengths of NGS platforms and the complexity of the human reference genome make read mis-alignments inevitable. Paralogous sequences and low-copy repeats that differ by only a few bases can give rise to reads that, when aligned incorrectly, appear to support a substitution at the same position. Thus, these types of errors can manifest even in regions of deep coverage. A window-based filtering approach that identifies clusters of SNP calls (i.e. three SNPs within 10 bp) can help remove some of these artifacts.
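The cluster filter described above reduces to a scan over sorted call positions; the thresholds below (three SNPs within 10 bp) are the ones quoted in the text, though any real pipeline would tune them per dataset.

```python
def flag_snp_clusters(positions, window=10, min_snps=3):
    """Flag SNP calls that fall in clusters of >= min_snps within `window` bp.

    `positions` must be sorted. Clustered calls often indicate mis-alignment
    around paralogous sequence rather than true variation.
    """
    flagged = set()
    for i in range(len(positions)):
        j = i
        # Extend the window to cover every call within `window` bp of call i.
        while j + 1 < len(positions) and positions[j + 1] - positions[i] < window:
            j += 1
        if j - i + 1 >= min_snps:
            flagged.update(positions[i:j + 1])
    return flagged

# Calls at 101, 104 and 108 cluster within 10 bp and are flagged; 500 is kept.
calls = [101, 104, 108, 500]
flagged = flag_snp_clusters(calls)
```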

Indels

Detection of small insertions and deletions in NGS data has proven more difficult, particularly due to the relatively short read lengths typical of most platforms. Computationally speaking, aligning reads with substitutions (SNPs) to a reference sequence is much easier than aligning reads with gaps (indels). While the longer reads of the 454 platform seem to address this problem, indels detected in 454 data tend to carry a high false-positive rate, primarily due to the inability of pyrosequencing technology to resolve homopolymers (runs of a single nucleotide) longer than 4–5 bases. The growing read length of the Illumina/Solexa platform (currently 100 bp), coupled with improvements in gapped short-read alignment (BWA, Novoalign), makes it feasible to detect insertions of up to 30 bp and deletions of almost any size. We developed a tool, called VarScan (http://varscan.sourceforge.net) [57], that performs indel detection using gapped alignments of massively parallel sequencing data. The Pindel tool [58] takes another approach to indel detection that leverages the mate-pair information from paired-end sequencing on Illumina or SOLiD platforms. By isolating mate pairs in which only one read is mapped, and performing split-read alignment of the unmapped read, Pindel identifies slightly larger indel events that are refractory to direct gapped alignment.
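Gapped aligners such as BWA report indels in the CIGAR string of each alignment, so a basic indel-detection pass reduces to scanning CIGAR operations. This sketch assumes standard SAM CIGAR semantics; a real caller would additionally weigh base qualities and the number of supporting reads.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def indels_from_cigar(ref_start, cigar):
    """Return (reference_position, type, length) for I/D operations in a CIGAR.

    `ref_start` is the 0-based reference coordinate where the alignment begins.
    M, =, X, D and N consume reference bases; I and clipped bases do not.
    """
    pos = ref_start
    events = []
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "I":
            events.append((pos, "INS", length))   # insertion: no reference consumed
        elif op == "D":
            events.append((pos, "DEL", length))
            pos += length
        elif op in "M=XN":
            pos += length
    return events

# A read aligned at reference position 1000 with a 3 bp deletion after 50 bases:
print(indels_from_cigar(1000, "50M3D50M"))  # → [(1050, 'DEL', 3)]
```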

Despite these advances, accurate indel detection using massively parallel sequencing data remains challenging. One reason is the relatively short read lengths of NGS platforms, which limit their ability to detect large events, especially insertions. Furthermore, indels predicted in NGS datasets, particularly single-base events, suffer high false positive rates due to alignment artifacts and sequencing error. The combination of paired-end sequencing (to increase mapping accuracy) and localized de novo assembly (to remove local mis-alignments and resolve breakpoints) [34–37] improves the performance of indel detection, though not nearly to the levels of sensitivity and specificity that are achievable for SNPs.

Structural variation

Massively parallel sequencing data are particularly advantageous for the study of structural variation (SV). Not only do they offer the sensitivity to detect SVs across a wide range of sizes (1–1000 kb), but they also enable precise identification of structural breakpoints at base-pair resolution [59–62]. Most sequence-based approaches to SV detection extend seminal work by Volik et al. [63] and Raphael et al. [64]. Their approach used traditional 3730 sequencing to perform end-sequence profiling (ESP) of bacterial artificial chromosomes (BACs). When mapped to the human genome, aberrations in distance and/or orientation between end-sequence read pairs revealed the presence of underlying structural variation.
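The ESP logic carries over directly to NGS read pairs: given an expected insert-size distribution, a pair is flagged when its mapped span or relative orientation is inconsistent with it. The thresholds and category names below are illustrative; real pipelines estimate the insert-size distribution empirically from each library.

```python
def classify_pair(span, same_orientation, mean=400, sd=40, n_sd=3):
    """Classify a mapped read pair by insert size and relative orientation.

    `span` is the mapped distance between the two reads; pairs mapping in the
    same orientation, or with spans outside mean +/- n_sd * sd, are discordant.
    """
    if same_orientation:
        return "inversion_candidate"     # FF/RR orientation suggests an inversion
    if span > mean + n_sd * sd:
        return "deletion_candidate"      # pair maps too far apart on the reference
    if span < mean - n_sd * sd:
        return "insertion_candidate"     # pair maps too close together
    return "concordant"

# For a 400 +/- 40 bp library, a 900 bp span implies sequence missing from the sample:
print(classify_pair(900, same_orientation=False))  # → deletion_candidate
```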

The ESP method has since been adapted to characterize structural variation in human genomes using paired-end sequencing on the Roche/454 [59], Illumina [65] and SOLiD platforms [6]. Our group has developed an automated pipeline for SV prediction from Illumina paired-end sequencing data. The software algorithm, BreakDancer [39], utilizes data in BAM format to conduct de novo prediction and in silico confirmation of structural variation. The confidence score for each SV prediction is estimated using a Poisson model that takes into consideration the number of supporting reads, the size of the anchoring regions, and the coverage of the genome. BreakDancerMax outputs five types of SVs: insertions, large deletions (>100 bp), inversions, intra-chromosomal rearrangements and inter-chromosomal translocations. Alignment artifacts in short-read data appear to be the most significant source of false positives from BreakDancer and other SV prediction algorithms. To remove false positives, and to precisely define the breakpoints of each variant, we perform de novo assembly using TIGRA (unpublished) of all read pairs that have at least one end mapped to the predicted intervals. ABySS [66], Velvet [34] and other short-read assemblers are well suited to localized de novo assembly for this purpose. Even the most advanced pipelines for SV detection suffer a high false positive rate [67], suggesting that SV detection using NGS data is still in its infancy. Nevertheless, theoretical work shows the possibility, at least in principle, of controlling false positives by appropriately tuning redundancy [61].
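The Poisson confidence idea can be sketched as a tail probability on the number of supporting read pairs, converted to a Phred-like score. This is a simplification of BreakDancer's published model, not its actual implementation; the expected-pair rate would come from the anchoring-region sizes and genome coverage.

```python
import math

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), via the complement of the CDF."""
    cdf = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return 1.0 - cdf

def sv_confidence(supporting_pairs, expected_pairs):
    """Phred-like score: -10 * log10 of the chance of this much support by chance.

    `expected_pairs` is the Poisson rate of discordant pairs expected from
    noise alone in the anchoring regions.
    """
    p = max(poisson_sf(supporting_pairs, expected_pairs), 1e-300)
    return -10.0 * math.log10(p)

# Ten supporting pairs where only one is expected by chance scores far higher
# than three supporting pairs:
scores = (sv_confidence(10, 1.0), sv_confidence(3, 1.0))
```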

CNV

Massively parallel WGS enables detection of CNV at unprecedented resolution. It is important, however, to account for certain biases when using sequencing coverage to infer copy number. First, variable G + C content throughout the genome is known to influence sequence coverage on most NGS platforms. On the Illumina platform, for example, regions with significantly low (<20%) or high (>60%) G + C content are under-represented in shotgun sequencing [3]. To address this bias, Yoon et al. [68] segmented the genome into 100-bp windows, and adjusted each window's read counts based on the observed deviation in coverage for a given G + C percentage. Mapping bias is another important contributor to variation in sequencing coverage, especially for the short (35–50 bp) reads produced on Illumina and SOLiD platforms. Campbell et al. [65] proposed a method to correct for mapping bias based on simulations of Illumina 2 × 35 bp reads, which they mapped to the genome using MAQ. Next, they divided the genome into non-overlapping 'windows' of unequal width such that each window contained roughly the same number of mapped reads.
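In the spirit of the Yoon et al. adjustment, a G + C correction can be sketched as rescaling each window's read count by the ratio of the global median count to the median count of windows with similar G + C content. The binning granularity here is an arbitrary choice, not taken from the published method.

```python
def gc_correct(windows, bin_width=0.05):
    """Rescale per-window read counts by the median count of their G + C stratum.

    `windows` is a list of (gc_fraction, read_count) pairs. Each count is
    scaled by global_median / stratum_median, flattening systematic over- or
    under-coverage tied to G + C content.
    """
    def median(values):
        ordered = sorted(values)
        mid = len(ordered) // 2
        if len(ordered) % 2:
            return float(ordered[mid])
        return (ordered[mid - 1] + ordered[mid]) / 2.0

    overall = median([count for _, count in windows])
    strata = {}
    for gc, count in windows:
        strata.setdefault(int(gc / bin_width), []).append(count)
    stratum_median = {b: median(counts) for b, counts in strata.items()}
    return [count * overall / stratum_median[int(gc / bin_width)]
            for gc, count in windows]

# Windows at 30% GC average 100 reads while windows at 60% GC average only 50;
# after correction both strata sit at the global median.
windows = [(0.30, 100), (0.30, 100), (0.60, 50), (0.60, 50)]
```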

After correcting for G + C content and uniqueness, the normalized read depth offers a uniform representation of copy number across the genome. To identify regions of significant copy number change, Campbell et al. [65] adapted a circular binary segmentation algorithm originally developed for SNP array data. Their adaptation is implemented in R as the 'DNAcopy' library of the Bioconductor project (http://www.bioconductor.org). A similar method, correlation matrix diagonal segmentation (CMDS) [69], enables copy number estimation across a population of samples.
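Full circular binary segmentation applies a permutation-tested statistic recursively, but its core step, finding the breakpoint that best separates two depth levels, can be shown in a few lines. This toy finds only a single changepoint and uses a plain mean difference instead of the tested statistic.

```python
def best_changepoint(depth):
    """Return (index, delta): the split maximizing |mean(left) - mean(right)|.

    A single-split stand-in for circular binary segmentation; DNAcopy applies
    a permutation-tested statistic and recurses on each resulting segment.
    """
    best_index, best_delta = None, 0.0
    for i in range(1, len(depth)):
        left, right = depth[:i], depth[i:]
        delta = abs(sum(left) / len(left) - sum(right) / len(right))
        if delta > best_delta:
            best_index, best_delta = i, delta
    return best_index, best_delta

# Normalized depth: diploid baseline followed by a region at twice the copy number.
depth = [2, 2, 2, 2, 4, 4, 4, 4]
print(best_changepoint(depth))  # → (4, 2.0)
```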

CONCLUSION

The advent of massively parallel sequencing has fundamentally changed the study of genetics and genomics. Whole genome sequencing of 10 individuals and several tumor samples has only begun to reveal the extent and nature of human sequence variation. To date, the majority of NGS has taken place within major genome centers. However, the widespread adoption of new sequencing instruments throughout the world suggests that this pattern will change. It should be noted that NGS as a research tool presents substantial challenges in production, in data management, and in downstream analysis. Investigators stand to benefit from the quality control and data analysis strategies that produced the first studies enabled by NGS technologies. It is clear that new sequencing technologies hold incredible promise for research; their capabilities in the hands of investigators will undoubtedly advance our understanding of human genetics.

Key Points

  • The widespread adoption and varied applications of massively parallel sequencing suggest that it will play a pivotal role in human genetics in coming years.

  • Quality control procedures and streamlined workflows can help eliminate some of the production-associated issues, such as sample contamination and variable run quality.

  • While the bioinformatics challenges presented by NGS are considerable, numerous software tools and algorithms have been developed to facilitate data management, short-read alignment and the identification of sequence variants.

  • The incredible throughput of NGS calls for the implementation of automated pipelines, which help speed the transition from adoption of new sequencing technology to high-throughput research and publication.

FUNDING

This work was supported by the National Human Genome Research Institute [grant number HG003079, PI Richard K. Wilson].

Acknowledgements

We thank Michael C. Wendl for critical reading of the manuscript. We also thank Robert S. Fulton, Lucinda Fulton, David Dooling, David E. Larson, Ken Chen, Michael D. McLellan, Nathan Dees, and Christopher C. Harris of the Genome Center at Washington University in St. Louis for their contributions to discussions related to this review.

Biographies

Daniel C. Koboldt works in the medical genomics group of the Genome Center at Washington University in St. Louis, and maintains a blog on next-generation sequencing at http://www.massgenomics.org.

Li Ding is assistant director and head of the medical genomics group at the Genome Center and a research assistant professor at Washington University in St. Louis.

Elaine R. Mardis is co-director of the Genome Center and an associate professor of genetics at Washington University in St. Louis.

Richard K. Wilson is director of the Genome Center and a professor of genetics at Washington University in St. Louis.

References

1. Mardis ER. The impact of next-generation sequencing technology on genetics. Trends Genet. 2008;24(3):133–41. [PubMed] [Google Scholar]

2. Ahn SM, Kim TH, Lee S, et al. The first Korean genome sequence and analysis: full genome sequencing for a socio-ethnic group. Genome Res. 2009;19(9):1622–9. [PMC free article] [PubMed] [Google Scholar]

3. Bentley DR, Balasubramanian S, Swerdlow HP, et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008;456(7218):53–9. [PMC free article] [PubMed] [Google Scholar]

4. Drmanac R, Sparks AB, Callow MJ, et al. Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays. Science. 2010;327(5961):78–81. [PubMed] [Google Scholar]

5. Kim JI, Ju YS, Park H, et al. A highly annotated whole-genome sequence of a Korean individual. Nature. 2009;460(7258):1011–5. [PMC free article] [PubMed] [Google Scholar]

6. McKernan KJ, Peckham HE, Costa GL, et al. Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding. Genome Res. 2009;19(9):1527–41. [PMC free article] [PubMed] [Google Scholar]

7. Pushkarev D, Neff NF, Quake SR. Single-molecule sequencing of an individual human genome. Nat Biotechnol. 2009;27(9):847–52. [PMC free article] [PubMed] [Google Scholar]

8. Wang J, Wang W, Li R, et al. The diploid genome sequence of an Asian individual. Nature. 2008;456(7218):60–5. [PMC free article] [PubMed] [Google Scholar]

9. Wheeler DA, Srinivasan M, Egholm M, et al. The complete genome of an individual by massively parallel DNA sequencing. Nature. 2008;452(7189):872–6. [PubMed] [Google Scholar]

10. Schuster SC, Miller W, Ratan A, et al. Complete Khoisan and Bantu genomes from southern Africa. Nature. 2010;463(7283):943–7. [PMC free article] [PubMed] [Google Scholar]

12. Ley TJ, Mardis ER, Ding L, et al. DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome. Nature. 2008;456(7218):66–72. [PMC free article] [PubMed] [Google Scholar]

13. Mardis ER, Ding L, Dooling DJ, et al. Recurring mutations found by sequencing an acute myeloid leukemia genome. N Engl J Med. 2009;361(11):1058–66. [PMC free article] [PubMed] [Google Scholar]

14. Pleasance ED, Cheetham RK, Stephens PJ, et al. A comprehensive catalogue of somatic mutations from a human cancer genome. Nature. 2010;463(7278):191–6. [PMC free article] [PubMed] [Google Scholar]

15. Pleasance ED, Stephens PJ, O'Meara S, et al. A small-cell lung cancer genome with complex signatures of tobacco exposure. Nature. 2010;463(7278):184–90. [PMC free article] [PubMed] [Google Scholar]

16. Shah SP, Morin RD, Khattra J, et al. Mutational evolution in a lobular breast tumour profiled at single nucleotide resolution. Nature. 2009;461(7265):809–13. [PubMed] [Google Scholar]

17. Ding L, Ellis MJ, Li S, et al. Genome remodelling in a basal-like breast cancer metastasis and xenograft. Nature. 2010;464(7291):999–1005. [PMC free article] [PubMed] [Google Scholar]

18. Sherry ST, Ward MH, Kholodov M, et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001;29(1):308–11. [PMC free article] [PubMed] [Google Scholar]

19. Snyder M, Du J, Gerstein M. Personal genome sequencing: current approaches and challenges. Genes Dev. 2010;24(5):423–31. [PMC free article] [PubMed] [Google Scholar]

20. Clark MJ, Homer N, O'Connor BD, et al. U87MG decoded: the genomic sequence of a cytogenetically abnormal human cancer cell line. PLoS Genet. 2010;6(1):e1000832. [PMC free article] [PubMed] [Google Scholar]

21. Roach JC, Glusman G, Smit AF, et al. Analysis of genetic inheritance in a family quartet by whole-genome sequencing. Science. 2010;328(5978):636–9. [PMC free article] [PubMed] [Google Scholar]

22. Lupski JR, Reid JG, Gonzaga-Jauregui C, et al. Whole-genome sequencing in a patient with Charcot-Marie-Tooth neuropathy. N Engl J Med. 2010;362(13):1181–91. [PMC free article] [PubMed] [Google Scholar]

23. Turner EH, Ng SB, Nickerson DA, et al. Methods for genomic partitioning. Annu Rev Genomics Hum Genet. 2009;10:263–84. [PubMed] [Google Scholar]

24. Mamanova L, Coffey AJ, Scott CE, et al. Target-enrichment strategies for next-generation sequencing. Nat Methods. 2010;7(2):111–8. [PubMed] [Google Scholar]

25. Gnirke A, Melnikov A, Maguire J, et al. Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing. Nat Biotechnol. 2009;27(2):182–9. [PMC free article] [PubMed] [Google Scholar]

26. Albert TJ, Molla MN, Muzny DM, et al. Direct selection of human genomic loci by microarray hybridization. Nat Methods. 2007;4(11):903–5. [PubMed] [Google Scholar]

27. Bashiardes S, Veile R, Helms C, et al. Direct genomic selection. Nat Methods. 2005;2(1):63–9. [PubMed] [Google Scholar]

28. Hodges E, Xuan Z, Balija V, et al. Genome-wide in situ exon capture for selective resequencing. Nat Genet. 2007;39(12):1522–7. [PubMed] [Google Scholar]

29. Okou DT, Steinberg KM, Middle C, et al. Microarray-based genomic selection for high-throughput resequencing. Nat Methods. 2007;4(11):907–9. [PubMed] [Google Scholar]

30. Ng SB, Turner EH, Robertson PD, et al. Targeted capture and massively parallel sequencing of 12 human exomes. Nature. 2009;461(7261):272–6. [PMC free article] [PubMed] [Google Scholar]

31. Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009;10(1):57–63. [PMC free article] [PubMed] [Google Scholar]

32. Chi SW, Zang JB, Mele A, et al. Argonaute HITS-CLIP decodes microRNA-mRNA interaction maps. Nature. 2009;460(7254):479–86. [PMC free article] [PubMed] [Google Scholar]

33. Licatalosi DD, Mele A, Fak JJ, et al. HITS-CLIP yields genome-wide insights into brain alternative RNA processing. Nature. 2008;456(7221):464–9. [PMC free article] [PubMed] [Google Scholar]

34. Zerbino DR, Birney E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008;18(5):821–9. [PMC free article] [PubMed] [Google Scholar]

35. Young AL, Abaan HO, Zerbino D, et al. A new strategy for genome assembly using short sequence reads and reduced representation libraries. Genome Res. 2010;20(2):249–56. [PMC free article] [PubMed] [Google Scholar]

36. Li R, Zhu H, Ruan J, et al. De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 2010;20(2):265–72. [PMC free article] [PubMed] [Google Scholar]

37. Simpson JT, Wong K, Jackman SD, et al. ABySS: a parallel assembler for short read sequence data. Genome Res. 2009;19(6):1117–23. [PMC free article] [PubMed] [Google Scholar]

38. Raphael BJ, Volik S, Yu P, et al. A sequence-based survey of the complex structural organization of tumor genomes. Genome Biol. 2008;9(3):R59. [PMC free article] [PubMed] [Google Scholar]

39. Chen K, Wallis JW, McLellan MD, et al. BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nat Methods. 2009;6(9):677–81. [PMC free article] [PubMed] [Google Scholar]

40. Margulies M, Egholm M, Altman WE, et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005;437(7057):376–80. [PMC free article] [PubMed] [Google Scholar]

42. Ning Z, Cox AJ, Mullikin JC. SSAHA: a fast search method for large DNA databases. Genome Res. 2001;11(10):1725–9. [PMC free article] [PubMed] [Google Scholar]

43. Li H, Ruan J, Durbin R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 2008;18(11):1851–8. [PMC free article] [PubMed] [Google Scholar]

44. Langmead B, Trapnell C, Pop M, et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10(3):R25. [PMC free article] [PubMed] [Google Scholar]

45. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25(14):1754–60. [PMC free article] [PubMed] [Google Scholar]

46. Li R, Yu C, Li Y, et al. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics. 2009;25(15):1966–7. [PubMed] [Google Scholar]

47. Rumble SM, Lacroute P, Dalca AV, et al. SHRiMP: accurate mapping of short color-space reads. PLoS Comput Biol. 2009;5(5):e1000386. [PMC free article] [PubMed] [Google Scholar]

48. Homer N, Merriman B, Nelson SF. BFAST: an alignment tool for large scale genome resequencing. PLoS One. 2009;4(11):e7767. [PMC free article] [PubMed] [Google Scholar]

49. Niu B, Fu L, Sun S, et al. Artificial and natural duplicates in pyrosequencing reads of metagenomic data. BMC Bioinformatics. 2010;11:187. [PMC free article] [PubMed] [Google Scholar]

50. Kozarewa I, Ning Z, Quail MA, et al. Amplification-free Illumina sequencing-library preparation facilitates improved mapping and assembly of (G+C)-biased genomes. Nat Methods. 2009;6(4):291–5. [PMC free article] [PubMed] [Google Scholar]

51. Yeager M, Xiao N, Hayes RB, et al. Comprehensive resequence analysis of a 136 kb region of human chromosome 8q24 associated with prostate and colon cancers. Hum Genet. 2008;124(2):161–70. [PMC free article] [PubMed] [Google Scholar]

52. Harismendy O, Frazer K. Method for improving sequence coverage uniformity of targeted genomic intervals amplified by LR-PCR using Illumina GA sequencing-by-synthesis technology. Biotechniques. 2009;46(3):229–31. [PubMed] [Google Scholar]

53. Li H, Handsaker B, Wysoker A, et al. The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25(16):2078–9. [PMC free article] [PubMed] [Google Scholar]

54. Langmead B, Schatz MC, Lin J, et al. Searching for SNPs with cloud computing. Genome Biol. 2009;10(11):R134. [PMC free article] [PubMed] [Google Scholar]

55. Li R, Li Y, Fang X, et al. SNP detection for massively parallel whole-genome resequencing. Genome Res. 2009;19(6):1124–32. [PMC free article] [PubMed] [Google Scholar]

56. Shen Y, Wan Z, Coarfa C, et al. A SNP discovery method to assess variant allele probability from next-generation resequencing data. Genome Res. 2010;20(2):273–80. [PMC free article] [PubMed] [Google Scholar]

57. Koboldt DC, Chen K, Wylie T, et al. VarScan: variant detection in massively parallel sequencing of individual and pooled samples. Bioinformatics. 2009;25(17):2283–5. [PMC free article] [PubMed] [Google Scholar]

58. Ye K, Schulz MH, Long Q, et al. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics. 2009;25(21):2865–71. [PMC free article] [PubMed] [Google Scholar]

59. Korbel JO, Urban AE, Affourtit JP, et al. Paired-end mapping reveals extensive structural variation in the human genome. Science. 2007;318(5849):420–6. [PMC free article] [PubMed] [Google Scholar]

60. Lam HY, Mu XJ, Stutz AM, et al. Nucleotide-resolution analysis of structural variants using BreakSeq and a breakpoint library. Nat Biotechnol. 2010;28(1):47–55. [PMC free article] [PubMed] [Google Scholar]

61. Wendl MC, Wilson RK. Statistical aspects of discerning indel-type structural variation via DNA sequence alignment. BMC Genomics. 2009;10:359. [PMC free article] [PubMed] [Google Scholar]

62. Bashir A, Volik S, Collins C, et al. Evaluation of paired-end sequencing strategies for detection of genome rearrangements in cancer. PLoS Comput Biol. 2008;4(4):e1000051. [PMC free article] [PubMed] [Google Scholar]

63. Volik S, Zhao S, Chin K, et al. End-sequence profiling: sequence-based analysis of aberrant genomes. Proc Natl Acad Sci USA. 2003;100(13):7696–701. [PMC free article] [PubMed] [Google Scholar]

64. Raphael BJ, Volik S, Collins C, et al. Reconstructing tumor genome architectures. Bioinformatics. 2003;19(Suppl 2):ii162–71. [PubMed] [Google Scholar]

65. Campbell PJ, Stephens PJ, Pleasance ED, et al. Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing. Nat Genet. 2008;40(6):722–9. [PMC free article] [PubMed] [Google Scholar]

66. Birol I, Jackman SD, Nielsen CB, et al. De novo transcriptome assembly with ABySS. Bioinformatics. 2009;25(21):2872–7. [PubMed] [Google Scholar]

67. Hormozdiari F, Alkan C, Eichler EE, et al. Combinatorial algorithms for structural variation detection in high-throughput sequenced genomes. Genome Res. 2009;19(7):1270–8. [PMC free article] [PubMed] [Google Scholar]

68. Yoon S, Xuan Z, Makarov V, et al. Sensitive and accurate detection of copy number variants using read depth of coverage. Genome Res. 2009;19(9):1586–92. [PMC free article] [PubMed] [Google Scholar]

69. Zhang Q, Ding L, Larson DE, et al. CMDS: a population-based method for identifying recurrent DNA copy number aberrations in cancer from high-resolution data. Bioinformatics. 2010;26(4):464–9. [PMC free article] [PubMed] [Google Scholar]

Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2980933/
