CNV-LDC: An Optimized CNV Detection Method for Low Depth of Coverage Data

Ayyoub Salmi, Sara El Jadid, Ismail Jamail, Taoufik Bensellak, Romain Philippe, Veronique Blanquet and Ahmed Moussa

Animal Molecular Genetics Unit, Limoges University, Limoges, France

Laboratory of Telecommunication Systems and Engineering of the Decision, Ibn Tofail University, Kenitra, Morocco

Technology Laboratory of Information and Communication, Abdelmalek Essaadi University, Tangier, Morocco

Keywords: Copy Number Variation, NGS Data, Read Depth, Low Depth of Coverage.

Abstract:

Recent technological advances have revealed far more variation in the human genome than previously thought. Part of this variation is due to submicroscopic chromosomal deletions and duplications called Copy Number Variations (CNVs). Some of these CNVs have been clearly shown to play an important role in disease susceptibility, in both complex and Mendelian diseases. Recent advances in next-generation sequencing have enabled rapid progress in CNV analysis, in so far as they promise to improve detection sensitivity. This has led to the development of several new bioinformatics approaches and algorithms for detecting CNVs from these data, based on four common methods: Assembly Based, Split Read, Read-Paired mapping, and Read Depth (RD). Here we focus on the RD method, which, unlike the other methods, is able to detect the exact number of copies. We propose an alternative method for detecting CNVs from short sequencing reads, CNV-LDC (Copy Number Variation-Low Depth of Coverage), that complements the existing method named CNV-TV (Copy Number Variation-Total Variation). We optimize the signal modeling and thresholding steps to improve performance at low depth of coverage. The results of this new approach have been compared with those of several recent methods on different simulated data sets containing small and large CNVs.

1 BACKGROUND

With the completion of the Human Genome Project, we have entered the "post-genomic" era. An important discovery of recent years is that of CNVs (Copy Number Variants), which showed that the human genome has a much higher inter-individual variance than was previously thought (Beckmann et al., 2007). The term "variation" or "variant" is somewhat misleading, in so far as it suggests that CNVs are only benign variants of the norm. The significance of CNVs in the pathogenesis of some rare genetic syndromes, and also of frequent multifactorial diseases, is actually extremely complex.

The human genome consists of more than 3 billion base pairs, and it was long thought that the DNA sequences of two randomly selected individuals were 99.9% identical. SNPs (Single Nucleotide Polymorphisms) were considered the main source of inter-individual variability. However, the scientific achievements of recent years led to a complete revision of this view and uncovered a new dimension of inter-individual genetic variability: submicroscopic chromosomal structural changes (Vissers et al., 2003), which were called CNVs. CNVs are distributed in a not entirely random manner across all chromosome pairs and vary from one individual to another in number and distribution pattern. They are most often due to duplications or deletions of certain chromosome segments. By definition they span more than 1000 base pairs (1 kb), but can also extend over several million base pairs (Mb). Recently, owing to the widespread use of genome sequencing, the operational spectrum of CNVs has been widened to include events as small as 50bp (Alkan et al., 2011).

Although CNVs are often located in regions with a reduced number of genes, they may also contain hundreds of genes and regulatory elements. Genes within CNVs mostly appear not to play an important role in embryonic development, but rather to be involved in interactions with the environment, such as odor perception or defense against infections.

Salmi A., El Jadid S., Jamail I., Bensellak T., Philippe R., Blanquet V. and Moussa A.
CNV-LDC: An Optimized CNV Detection Method for Low Depth of Coverage Data.
DOI: 10.5220/0006111600370042
In Proceedings of the 10th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2017), pages 37-42
ISBN: 978-989-758-214-1

With the emergence of new technologies such as Next Generation Sequencing (NGS), new fields of application have emerged. This technology enables high-resolution detection of CNVs. It generates a large number of short read sequences (from 50 to 250bp) using reversible terminator chemistry (Bentley, 2008). New computational methods were developed to identify CNVs from NGS data (Zhao et al., 2013) produced by next-generation sequencing platforms (Mardis, 2013). Four approaches are widely used: AS (Assembly Based), SR (Split Read), RP (Read-Paired mapping), and RD (Read Depth) (Zhao et al., 2013) (Liu et al., 2013) (Medvedev et al., 2009) (Yoon et al., 2009) (Xi et al., 2012) (Duan et al., 2013). Each has its own strengths and weaknesses (Tan et al., 2014) (Alkan et al., 2009); these approaches are usually complementary to each other, but none of them can detect the full spectrum of DNA variation.

Here we focus on RD methods, which rely on the principle that short reads are sampled randomly from the genome, so that once they are aligned to the reference genome, their density is locally proportional to the copy number (Yoon et al., 2009). These methods are based on statistical hypothesis testing, which indicates the presence of a CNV through the correlation between the copy number of a genomic region and the depth of coverage of this region (Teo et al., 2012). RD approaches can be classified into three categories depending on the sample type: single sample, paired sample (case/control), and large population sample. In the first category, absolute copy numbers are reported, since there is no other subject to compare with. In the second category, copy numbers are reported relative to the controls. In the third category, CNVs are detected using the overall mean of the RD.

Compared to the other approaches for detecting CNVs from NGS data, RD is able to detect the exact number of copies, while SR, AS and RP can only report the position of the CNVs and not their copy counts.
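The proportionality between read depth and copy number can be illustrated with a minimal Python sketch. It is not part of CNV-LDC: the function names, the diploid baseline of 2 copies and the use of the median depth as the reference level are assumptions made for the example.

```python
from statistics import median

def absolute_copy_number(depths, baseline_copies=2):
    """Single-sample case: scale each window depth by the median depth.

    Assumes most windows carry `baseline_copies` copies (hypothetical default).
    """
    reference_depth = median(depths)
    return [round(baseline_copies * d / reference_depth) for d in depths]

def relative_copy_ratio(case_depths, control_depths):
    """Paired case/control: per-window depth ratio (control depths must be > 0)."""
    return [c / k for c, k in zip(case_depths, control_depths)]

# Toy per-window depths: two windows near half depth, two near 1.5x depth.
depths = [30, 31, 29, 15, 14, 30, 46, 44, 30]
copies = absolute_copy_number(depths)
ratios = relative_copy_ratio([10, 20], [10, 10])
```

In this toy input the half-depth windows map to one copy and the windows at about 1.5 times the median map to three copies, which is the kind of count that position-only approaches do not report.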

Specifically, RD-based methods involve the following steps. In a first step, the reads aligned to the reference genome are piled up and then counted using a sliding (Xie and Tammi, 2009) or a predefined window. In a second step, the counts are normalized to eliminate biases resulting from repeat regions and GC content (Boeva et al., 2010) (Janevski et al., 2012); then contiguous sets of windows that have the same copy number are identified using a segmentation algorithm. The final step consists in assessing the statistical significance of the calls and filtering them (Zhao et al., 2013).
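The counting step described above can be sketched as follows. This is a simplified illustration that counts read start positions in fixed, non-overlapping windows; the function name and the 0-based coordinate convention are assumptions, and real tools work from alignment files rather than plain lists.

```python
def count_reads_per_window(read_starts, genome_length, window=100):
    """Count reads whose leftmost aligned position falls in each fixed window."""
    n_windows = (genome_length + window - 1) // window  # ceiling division
    counts = [0] * n_windows
    for pos in read_starts:  # 0-based positions on a single reference sequence
        if 0 <= pos < genome_length:
            counts[pos // window] += 1
    return counts

counts = count_reads_per_window([5, 20, 99, 100, 250, 251, 299], genome_length=300)
```

The resulting per-window counts are the raw signal that the normalization and segmentation steps then operate on.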

Recently, many CNV detection methods have been developed (Yoon et al., 2009) (Chiang et al., 2008) (Gusnanto et al., 2011), but their performance is not robust. There is therefore a need for robust methods for detecting CNVs from NGS data.

We sought to optimize and implement an alternative method for detecting CNVs from short sequencing reads that complements the existing method named CNV-TV (Duan et al., 2013). Here we model the signal using the fused lasso instead of the lasso because of the spatial structure of the data (Tibshirani et al., 1997). Our approach is optimized for low depth of coverage (Zhang et al., 2012) and uses automatic threshold selection.

2 METHODS

The CNV detection process starts by filtering out unmapped reads and PCR duplicates (the latter are marked with the 1024 flag) using either Samtools (Li et al., 2009) or Picard, and then extracting the read depth signal from a BAM file. We use a 100bp non-overlapping sliding window to compute the mean read depth across the genome. This gives us a better resolution and the ability to detect smaller CNVs. The read depth signal is then corrected for GC bias, as GC-rich and AT-rich fragments may be under-represented in the sequencing results (Benjamini and Speed, 2012). The bin size is set to match the sliding window size. The adjusted read count is computed using formula (1), where $u_i$ is the number of reads mapped to the $i$-th bin, $d$ is the median read count across all bins and $d_{GC}$ is the median read count of those bins that have the same GC content as the $i$-th bin.

$$\mathrm{AdjustedReadCount}_i = u_i \cdot \frac{d}{d_{GC}} \qquad (1)$$
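The two preprocessing operations of this paragraph can be sketched in Python as below. This is an illustration under stated assumptions, not the CNV-LDC implementation: `keep_read` and `gc_adjust` are hypothetical names, GC content is assumed to be given as one rounded percentage per bin, and bins whose GC group has a zero median are simply set to zero. The flag bits 0x4 (unmapped) and 0x400 (duplicate, decimal 1024) are those of the SAM format.

```python
from collections import defaultdict
from statistics import median

FLAG_UNMAPPED = 0x4     # SAM flag bit: the read is unmapped
FLAG_DUPLICATE = 0x400  # SAM flag bit (decimal 1024): PCR or optical duplicate

def keep_read(flag):
    """Keep a read only if it is mapped and not marked as a duplicate."""
    return (flag & (FLAG_UNMAPPED | FLAG_DUPLICATE)) == 0

def gc_adjust(counts, gc_percent):
    """Formula (1): u_i * d / d_GC, with bins grouped by identical GC percentage."""
    d = median(counts)                      # median count over all bins
    by_gc = defaultdict(list)
    for u, gc in zip(counts, gc_percent):
        by_gc[gc].append(u)
    d_gc = {gc: median(values) for gc, values in by_gc.items()}
    return [u * d / d_gc[gc] if d_gc[gc] > 0 else 0.0
            for u, gc in zip(counts, gc_percent)]

# Bins at 60% GC are systematically deeper than bins at 40% GC in this toy input.
adjusted = gc_adjust([10, 20, 10, 20], [40, 60, 40, 60])
```

After adjustment every bin in the toy input has the same value, so the depth difference that was explained by GC content alone no longer looks like a copy number change.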

A total variation penalized least squares model is used to extract the true signal from the noise, as shown in equation (2), where $y_i$ is the read depth signal, $x_i$ the recovered smooth signal, $\phi(x)$ the penalty and $\lambda$ the penalty parameter.

$$\min_{x} \sum_{i=1}^{n} (y_i - x_i)^2 + \lambda \sum_{i=1}^{n-1} \phi(x_{i+1} - x_i) \qquad (2)$$
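To make equation (2) concrete, the following sketch minimizes it with the penalty taken as the absolute value, using projected gradient steps on the dual problem. This solver is chosen only because it is short; it is not the solver used by CNV-TV or CNV-LDC, the function names are hypothetical, and NumPy is assumed to be installed.

```python
import numpy as np

def tv_objective(y, x, lam):
    """Value of equation (2) with phi taken as the absolute value (assumption)."""
    y, x = np.asarray(y, dtype=float), np.asarray(x, dtype=float)
    return float(np.sum((y - x) ** 2) + lam * np.sum(np.abs(np.diff(x))))

def tv_denoise(y, lam, n_iter=2000):
    """Minimize equation (2) by projected gradient on the dual variable z.

    With D the first-difference operator, the minimizer is x = y - D^T z,
    where each z_i is constrained to [-lam/2, lam/2]; the bound is lam/2
    because equation (2) has no 1/2 factor on the squared error.
    """
    y = np.asarray(y, dtype=float)
    bound = lam / 2.0
    z = np.zeros(len(y) - 1)
    for _ in range(n_iter):
        x = y + np.diff(np.concatenate(([0.0], z, [0.0])))  # x = y - D^T z
        z = np.clip(z + 0.25 * np.diff(x), -bound, bound)   # step 1/4 <= 1/||D D^T||
    return y + np.diff(np.concatenate(([0.0], z, [0.0])))

step = [0.0, 0.0, 0.0, 6.0, 6.0, 6.0]
smooth = tv_denoise(step, lam=4.0)   # the jump is kept but both levels shrink
flat = tv_denoise(step, lam=100.0)   # a large penalty removes the jump entirely
```

A small penalty keeps the breakpoint and only pulls the two levels toward each other, while a large penalty collapses the signal to its mean, which is why the choice of the penalty parameter matters for CNV calling.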

Duan et al. (2013) suggest using the lasso to solve an alternative form of equation (2), but we decided to use the fused lasso, considering its better performance as demonstrated by Tibshirani et al. (2005). The penalty parameter λ was set automatically using the Schwarz information criterion (SIC). Once this parameter is known, the smooth signal is extracted (figure 1).

Figure 1: Processing result. The black dots are the read depth. The black line is the smoothed signal. The red line is the corrected smoothed signal. The green lines are the cutoffs.

BIOINFORMATICS 2017 - 8th International Conference on Bioinformatics Models, Methods and Algorithms
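The SIC-based choice of the penalty parameter can be illustrated as follows. The paper does not give the exact form of the criterion it uses, so this sketch assumes the common Gaussian form n·log(RSS/n) + k·log(n), with k taken as the number of constant segments in the fitted signal; the function names and the candidate fits are hypothetical.

```python
import math

def count_segments(x, tol=1e-8):
    """Number of constant pieces in a fitted piecewise-constant signal."""
    return 1 + sum(1 for a, b in zip(x, x[1:]) if abs(b - a) > tol)

def sic(y, x, tol=1e-8):
    """Schwarz information criterion under the assumed Gaussian form."""
    n = len(y)
    rss = sum((yi - xi) ** 2 for yi, xi in zip(y, x))
    k = count_segments(x, tol)
    return n * math.log(max(rss / n, 1e-12)) + k * math.log(n)

y = [1.0, 3.0, 1.0, 3.0, 11.0, 9.0, 11.0, 9.0]
# Hypothetical fits, as would be obtained for a large, a medium and a small penalty.
candidates = {
    "one_segment": [6.0] * 8,
    "two_segments": [2.0] * 4 + [10.0] * 4,
    "eight_segments": [1.5, 2.5, 1.5, 2.5, 10.5, 9.5, 10.5, 9.5],
}
best = min(candidates, key=lambda name: sic(y, candidates[name]))
```

In this toy case the two-segment fit has the lowest criterion value: the single segment fits poorly and the eight-segment fit pays too much for its extra pieces.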

A CNV is identified as a segment of abnormal amplitude, i.e. below or above an estimated cutoff. To choose a suitable threshold, we model the read depth as following a lognormal distribution. This allows us to partially address the problem of detecting more losses than gains in copy number, as described in Fadista et al. (2010) and Turner et al. (2007). This bias could be due to both biological and technical reasons. In fact, when using low coverage data, we observed that even more losses that are part of the noise are introduced, which leads to more false positive CNVs, as shown in figure 2, which represents the read depth distribution in chromosome 20 of the sample HG00097 taken from the 1000 Genomes Project. The threshold values for calling a CNV are calculated such that the left and right tails of the theoretical distribution cover 5 percent of it.
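A minimal sketch of this threshold computation is given below, using only the Python standard library. It assumes that the two tails together cover 5 percent (2.5 percent on each side); the text could also be read as 5 percent per tail, in which case `tail` would be doubled. The function name and the moment-based fit on the log scale are assumptions.

```python
import math
from statistics import NormalDist, mean, stdev

def lognormal_cutoffs(depths, tail=0.05):
    """Lower and upper depth cutoffs from a lognormal fit to positive depths."""
    logs = [math.log(d) for d in depths if d > 0]  # zero depths cannot be logged
    mu, sigma = mean(logs), stdev(logs)
    z = NormalDist().inv_cdf(1 - tail / 2)         # about 1.96 for tail = 0.05
    return math.exp(mu - z * sigma), math.exp(mu + z * sigma)

# Toy depths whose logarithms are 1, 2 and 3 (mean 2, standard deviation 1).
low, high = lognormal_cutoffs([math.exp(v) for v in (1.0, 2.0, 3.0)])
```

Because the fit is done on the log scale, the cutoffs are asymmetric around the typical depth: the lower cutoff sits closer to it than the upper one, which is the property used to limit spurious loss calls.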

We observed that most of the false positive CNV calls occur near the estimated cutoffs. Those calls are mixed with the true heterozygous calls, which also occur near the cutoffs, and this makes them harder to filter.

Figure 2: Read depth distribution for sample HG00097 at chromosome 20. (Axes: depth versus frequency; fitted lognormal and normal distributions.)

Figure 3: Loss of copy number calls in a genomic segment. (Axes: calls versus intensity.)

Most of the false positive calls are introduced as small CNVs. To further reduce their number, we decided to discard a fraction of the calls near the thresholds. We first separated the losses and the gains in copy number into two groups each: a first group closer to the cutoff, which supposedly contains most of the false positive calls, and a second group further from the cutoff (figure 3). We then modeled the first group as following a normal distribution. The cutoff used to remove the potential false positive calls was calculated as the 95th percentile of the fitted normal distribution.
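The filter described above might look like the following sketch. Several details are assumptions, since the text does not specify them: each call is summarized by its distance from the CNV cutoff, the first group is defined by a user-supplied `near_limit`, and calls at or below the 95th percentile of the fitted normal distribution are the ones discarded.

```python
from statistics import NormalDist

def filter_near_cutoff(distances, near_limit, q=0.95):
    """Drop calls whose distance from the cutoff is within the fitted bound.

    `distances` holds one non-negative value per call; `near_limit` is a
    hypothetical boundary separating the near group from the far group.
    """
    near = [d for d in distances if d <= near_limit]
    if len(near) < 2:
        return list(distances), None           # not enough calls to fit
    fitted = NormalDist.from_samples(near)
    bound = fitted.mean if fitted.stdev == 0 else fitted.inv_cdf(q)
    kept = [d for d in distances if d > bound]
    return kept, bound

distances = [0.1, 0.2, 0.3, 0.2, 0.1, 0.3, 2.0, 2.5]
kept, bound = filter_near_cutoff(distances, near_limit=1.0)
```

In this toy input the six calls lying close to the cutoff are removed and the two calls far from it are kept.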

3 RESULTS AND DISCUSSION

We made our modified version of CNV-TV available as an R package and compared it to three other CNV detection methods. Those methods were selected based on their citations in the literature and their free availability: Pindel (Ye et al., 2009), CNVnator (Abyzov et al., 2011) and DELLY2 (Rausch et al., 2012).

Those methods use different approaches to identify CNVs. Pindel uses split reads, CNVnator uses read depth and DELLY2 uses paired-end and split reads. The bin size for CNVnator was set to match the bin size used in our method.

3.1 Data Simulation

To test the performance of our method, we used the simulation tool ART (Huang et al., 2011) to generate synthetic next-generation sequencing reads, so that exact performance measures could be obtained. The genome of Escherichia coli strain K-12, which has a length of 5.16Mbp and a GC content of 50.6%, was used as the reference to produce genomes with simulated CNVs. The first simulation consisted of 30 deletions and 9 duplications with a length of 350bp, to test the ability to detect small CNVs. This procedure produced a reference genome of 4.9Mbp. Short reads with a length of 120bp were generated from this reference genome, with a simulated fragment size of 400bp. The mean sequencing depth was 12x, which is considered a low depth of coverage. The short reads were then aligned to the reference genome using Bowtie2 (Langmead and Salzberg, 2012). Finally, we used the CNV calling methods to obtain a list of CNV calls, which was compared to the ground truth. A call was counted as a true positive when it overlapped a CNV in the ground truth.
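The overlap-based comparison against the ground truth can be sketched as follows. Intervals are assumed to be half-open (start, end) pairs on a single chromosome, and any overlap of at least one base counts; the paper does not state a minimum overlap, so this criterion and the function names are assumptions.

```python
def overlaps(a, b):
    """True if half-open intervals a and b share at least one base."""
    return a[0] < b[1] and b[0] < a[1]

def score_calls(calls, truth):
    """Return (detected true CNVs, false positive calls, missed true CNVs)."""
    detected = sum(1 for t in truth if any(overlaps(t, c) for c in calls))
    false_pos = sum(1 for c in calls if not any(overlaps(c, t) for t in truth))
    return detected, false_pos, len(truth) - detected

truth = [(1000, 1350), (5000, 5350), (9000, 9350)]
calls = [(990, 1340), (5100, 5200), (7000, 7100)]
detected, false_pos, missed = score_calls(calls, truth)
```

Here two of the three simulated CNVs are recovered, one call matches nothing in the ground truth, and one simulated CNV is missed.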

The second simulation consisted of 30 duplications and 30 deletions with a length of 1000bp. We followed the same protocol as for the first simulation to produce the aligned short reads.

3.2 Results for Simulated Data

Table 1: Number of detected CNVs for the first simulation.

| Type | Ground truth | CNV-LDC | Pindel | CNVnator | DELLY2 |
|---|---|---|---|---|---|
| Deletion | 30 | 30 | 113 | 12 | 33 |
| Duplication | 9 | 10 | 9 | 0 | 0 |

Table 1 shows the number of detected CNVs for each method. We can see that the number of CNVs detected by our method is very close to the ground truth. To further investigate the results, we calculated the F-score, a measure of the accuracy of a test that ranges between 0 and 1. A high score indicates good performance, while a low score indicates poor performance.
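For reference, the F-score is the harmonic mean of precision and recall, as in the sketch below. The worked example assumes that all 30 simulated deletions are among the 113 deletion calls reported for Pindel in Table 1, which the text does not state explicitly; under that assumption the score is about 0.42.

```python
def f_score(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 when there is no true positive."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# 113 calls, of which 30 are assumed correct: 83 false positives, no missed CNV.
pindel_deletions = f_score(tp=30, fp=83, fn=0)
```

The example shows how a method can find every simulated CNV and still score low because of its many false positive calls.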

Table 2: F-scores for the first simulation.

| Type | CNV-LDC | Pindel | CNVnator | DELLY2 |
|---|---|---|---|---|
| Deletion | 1 | 0.42 | 0.57 | 0.95 |
| Duplication | 0.94 | 1 | 0 | 0 |

The values of the F-score are given in table 2. We can see that our method yields better overall performance, even though it is outperformed by Pindel in duplication detection.

Tables 3 and 4 show the number of called CNVs and the corresponding F-scores, respectively, for the second simulation. Our method was able to detect all the simulated CNVs with no false positive calls.

Table 3: Number of detected CNVs for the second simulation.

| Type | Ground truth | CNV-LDC | Pindel | CNVnator | DELLY2 |
|---|---|---|---|---|---|
| Deletion | 30 | 30 | 79 | 29 | 32 |
| Duplication | 30 | 30 | 30 | 16 | 30 |

Table 4: F-scores for the second simulation.

| Type | CNV-LDC | Pindel | CNVnator | DELLY2 |
|---|---|---|---|---|
| Deletion | 1 | 0.55 | 0.98 | 0.96 |
| Duplication | 1 | 1 | 0.69 | 1 |

We also generated fifty simulated genomes in independent runs, each with 15 duplications and 15 deletions of length 350bp distributed uniformly. Table 5 shows the mean and standard deviation of the number of called CNVs. A boxplot of the results is also given in figure 4; the green dashed line represents the number of simulated CNVs for deletions and duplications.

Figure 4: Boxplot of called CNVs.

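Table 5: Mean and standard deviation for fifty runs.

| Type | Value | CNV-LDC | Pindel | CNVnator | DELLY2 |
|---|---|---|---|---|---|
| Deletion | Mean | 22.5 | 143.42 | 5.24 | 41.76 |
| Deletion | Deviation | 8.47 | 10.53 | 0.04 | 16.04 |
| Duplication | Mean | 16.88 | 13.68 | 0.04 | 6.04 |
| Duplication | Deviation | 9.41 | 0.58 | 0.2 | 1.21 |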

Considering the results from the simulations, we can see that our method performs very well at calling both small and large CNVs.

4 CONCLUSIONS

In this paper we presented an alternative version of the CNV-TV method that relies on a total variation penalized least squares model to fit the read depth signal from low depth of coverage sequencing data. Here we assume that any change in the depth of coverage in a genomic region is correlated with a change in the copy number.

The cutoffs for CNV calling are set automatically using a lognormal distribution fitted to the read depth frequency. The goal was to minimize the detected losses of copy number caused by biological and technical bias. To further reduce this bias, another filter was added for small CNVs, using a normal distribution fitted to the calls close to the cutoffs.

The results of our method were compared to those of three other CNV detection methods using simulated data to assess its performance. The simulations consisted of small and large CNVs. In both cases, our method yielded better overall results. The only drawback was the longer execution time in comparison to the other methods.

REFERENCES

Abyzov, A., Urban, A. E., Snyder, M., and Gerstein, M. (2011). CNVnator: An approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Research, 21(6):974–984.

Alkan, C., Coe, B. P., and Eichler, E. E. (2011). Genome structural variation discovery and genotyping. Nat Rev Genet, 12(5):363–376.

Alkan, C., Kidd, J. M., Marques-Bonet, T., Aksay, G., Antonacci, F., Hormozdiari, F., Kitzman, J. O., Baker, C., Malig, M., Mutlu, O., Sahinalp, S. C., Gibbs, R. A., and Eichler, E. E. (2009). Personalized copy number and segmental duplication maps using next-generation sequencing. Nature Genetics, 41(10):1061–1067.

Beckmann, J. S., Estivill, X., and Antonarakis, S. E. (2007). Copy number variants and genetic traits: closer to the resolution of phenotypic to genotypic variability. Nat Rev Genet, 8(8):639–646.

Benjamini, Y. and Speed, T. P. (2012). Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Research, 40(10):e72–e72.

Bentley, D. R. (2008). Accurate whole human genome sequencing using reversible terminator chemistry. Nature, 456(7218):53–59.

Boeva, V., Zinovyev, A., Bleakley, K., Vert, J.-P., Janoueix-Lerosey, I., Delattre, O., and Barillot, E. (2010). Control-free calling of copy number alterations in deep-sequencing data using GC-content normalization. Bioinformatics, 27(2):268–269.

Chiang, D. Y., Getz, G., Jaffe, D. B., Zhao, X., Carter, S. L., Russ, C., Nusbaum, C., Meyerson, M., and Lander, E. S. (2008). High-resolution mapping of copy-number alterations with massively parallel sequencing. Nature Methods, 6(1):99–103.

Duan, J., Zhang, J.-G., Deng, H.-W., and Wang, Y.-P. (2013). CNV-TV: A robust method to discover copy number variation from short sequencing reads. BMC Bioinformatics, 14(1):150.

Fadista, J., Thomsen, B., Holm, L.-E., and Bendixen, C. (2010). Copy number variation in the bovine genome. BMC Genomics, 11(1):284.

Gusnanto, A., Wood, H. M., Pawitan, Y., Rabbitts, P., and Berri, S. (2011). Correcting for cancer genome size and tumour cell content enables better estimation of copy number alterations from next-generation sequence data. Bioinformatics, 28(1):40–47.

Huang, W., Li, L., Myers, J. R., and Marth, G. T. (2011). ART: a next-generation sequencing read simulator. Bioinformatics, 28(4):593–594.

Janevski, A., Varadan, V., Kamalakaran, S., Banerjee, N., and Dimitrova, N. (2012). Effective normalization for copy number variation detection from whole genome sequencing. BMC Genomics, 13(Suppl 6):S16.

Langmead, B. and Salzberg, S. L. (2012). Fast gapped-read alignment with bowtie 2. Nature Methods, 9(4):357–359.

Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., and Durbin, R. (2009). The sequence alignment/map format and SAMtools. Bioinformatics, 25(16):2078–2079.

Liu, B., Morrison, C. D., Johnson, C. S., Trump, D. L., Qin, M., Conroy, J. C., Wang, J., and Liu, S. (2013). Computational methods for detecting copy number variations in cancer genome using next generation sequencing: principles and challenges. Oncotarget, 4(11):1868–1881.

Mardis, E. R. (2013). Next-generation sequencing platforms. Annual Rev. Anal. Chem., 6(1):287–303.

Medvedev, P., Stanciu, M., and Brudno, M. (2009). Computational methods for discovering structural variation with next-generation sequencing. Nature Methods, 6(11s):S13–S20.

Rausch, T., Zichner, T., Schlattl, A., Stutz, A. M., Benes, V., and Korbel, J. O. (2012). DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics, 28(18):i333–i339.

Tan, R., Wang, Y., Kleinstein, S. E., Liu, Y., Zhu, X., Guo, H., Jiang, Q., Allen, A. S., and Zhu, M. (2014). An evaluation of copy number variation detection tools from whole-exome sequencing data. Human Mutation, 35(7):899–907.

Teo, S. M., Pawitan, Y., Ku, C. S., Chia, K. S., and Salim, A. (2012). Statistical challenges associated with detecting copy number variations with next-generation sequencing. Bioinformatics, 28(21):2711–2718.

Tibshirani, R. et al. (1997). The lasso method for variable selection in the cox model. Statistics in medicine, 16(4):385–395.

Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., and Knight, K. (2005). Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(1):91–108.

Turner, D. J., Miretti, M., Rajan, D., Fiegler, H., Carter, N. P., Blayney, M. L., Beck, S., and Hurles, M. E. (2007). Germline rates of de novo meiotic deletions and duplications causing several genomic disorders. Nature Genetics, 40(1):90–95.

Vissers, L. E., de Vries, B. B., Osoegawa, K., Janssen, I. M., Feuth, T., Choy, C. O., Straatman, H., van der Vliet, W., Huys, E. H., van Rijk, A., Smeets, D., van Ravenswaaij-Arts, C. M., Knoers, N. V., van der Burgt, I., de Jong, P. J., Brunner, H. G., van Kessel, A. G., Schoenmakers, E. F., and Veltman, J. A. (2003). Array-based comparative genomic hybridization for the genomewide detection of submicroscopic chromosomal abnormalities. The American Journal of Human Genetics, 73(6):1261–1270.

Xi, R., Lee, S., and Park, P. J. (2012). A survey of copy-number variation detection tools based on high-throughput sequencing data. Current Protocols in Human Genetics, pages 7–19.

Xie, C. and Tammi, M. T. (2009). CNV-seq, a new method to detect copy number variation using high-throughput sequencing. BMC Bioinformatics, 10(1):80.

Ye, K., Schulz, M. H., Long, Q., Apweiler, R., and Ning, Z. (2009). Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics, 25(21):2865–2871.

Yoon, S., Xuan, Z., Makarov, V., Ye, K., and Sebat, J. (2009). Sensitive and accurate detection of copy number variants using read depth of coverage. Genome Research, 19(9):1586–1592.

Zhang, J., Wang, J., and Wu, Y. (2012). An improved approach for accurate and efficient calling of structural variations with low-coverage sequence data. BMC Bioinformatics, 13(6):1.

Zhao, M., Wang, Q., Wang, Q., Jia, P., and Zhao, Z. (2013). Computational tools for copy number variation (CNV) detection using next-generation sequencing data: features and perspectives. BMC Bioinformatics, 14(Suppl 11):S1.
