+ All documents
Home > Documents > Minimal-dot plot: “Old tale in new skin” about sequence comparison

Minimal-dot plot: “Old tale in new skin” about sequence comparison

Date post: 10-Nov-2023
Category:
Upload: independent
View: 0 times
Download: 0 times
Share this document with a friend
10
This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and education use, including for instruction at the authors institution and sharing with colleagues. Other uses, including reproduction and distribution, or selling or licensing copies, or posting to personal, institutional or third party websites are prohibited. In most cases authors are permitted to post their version of the article (e.g. in Word or Tex form) to their personal website or institutional repository. Authors requiring further information regarding Elsevier’s archiving and manuscript policies are encouraged to visit: http://www.elsevier.com/copyright
Transcript

This article appeared in a journal published by Elsevier. The attachedcopy is furnished to the author for internal non-commercial researchand education use, including for instruction at the authors institution

and sharing with colleagues.

Other uses, including reproduction and distribution, or selling orlicensing copies, or posting to personal, institutional or third party

websites are prohibited.

In most cases authors are permitted to post their version of thearticle (e.g. in Word or Tex form) to their personal website orinstitutional repository. Authors requiring further information

regarding Elsevier’s archiving and manuscript policies areencouraged to visit:

http://www.elsevier.com/copyright

Author's personal copy

Minimal-dot plot: ‘‘Old tale in new skin’’ about sequence comparison

V. Kirzhner ⇑, S. Frenkel, A. KorolInstitute of Evolution and Department of Evolutionary and Environmental Biology, University of Haifa, Haifa 31905, Israel

a r t i c l e i n f o

Article history:Received 22 April 2010Received in revised form 13 October 2010Accepted 16 December 2010Available online 23 December 2010

Keywords:Human genomeSequence comparisonDot-plotCompositional spectraTrack asymmetry

a b s t r a c t

The authors propose a simple version of the dot-plot scheme to be used in the case whenthe distances between sequence elements may take more than two values. The method isapplicable, in particular, to the case of the sequences of large-length windows when thesets of distance values are continuous. The proposed technique is simple to implementand the results can produce readable maps for further analysis. To illustrate its potential-ities, the method has been applied to the comparison of genomic sequences. The asymme-try in the number of direct and reverse tracks for the Homo sapience genome has beendiscovered.

� 2010 Elsevier Inc. All rights reserved.

1. Introduction

The first method developed to assess the similarity between biological symbol sequences was the dot-plot analysis [7,9].Such symbol sequences are used for describing DNA or protein molecules, where four or twenty symbols are employed,respectively. In certain contexts below, relatively short symbol sequences are referred to as words.

The dot-plot procedure (in its original form) can be described as follows. Let us consider two symbol sequences,ATTCGACG and TCGACAAGC, as DNA texts written in the {A,T,C,G} alphabet. To compare these sequences, the following tech-nique is used. The two fragments are located along the sides of an 8 � 9 rectangle (Fig. 1). Designate each pair of identicalsymbols along the two sequences by a dot and thus obtain a two-dimensional dot-plot diagram. The main output of the dot-plot method is such dot sequences that the indexes of subsequent dots along the whole sequence either both increase by aunit or one index increases by a unit, while the other one decreases by a unit. In geometrical representation, such dot se-quences are parallel to the bisector of some angle of the diagram. Each of the dot sequences, which will be referred to astracks, signifies a symbol-to-symbol coincidence of the fragments of the two sequences under consideration. For example,the track of length 5 marked in Fig. 1 corresponds to the word TCGAC, which is present in both sequences. The other markedtrack of length 3 corresponds to the word AGC, which also appears in both sequences, but in opposite orientations. Thus, thetracks located along the main bisector (which we choose to be the bisector of the top left angle) or along the secondary bisec-tor signify the coincidence or the reverse coincidence, respectively, of the words of the two symbol sequences being com-pared. Correspondingly, we will distinguish between direct and reverse tracks.

The tracks can also be used as the basis for a more general comparison of two sequences, performed, e.g., in the frame-work of the dot-plot diagram filtration approach, proposed by Maizel and Lenk [23]. According to this approach, the twotracks that continue each other, but are separated by a short gap, can be joined together. Indeed, the omission of some dotscaused by a mismatch of symbols can be attributed, in this case, to substitution mutations. It this way, the main elementary

0020-0255/$ - see front matter � 2010 Elsevier Inc. All rights reserved.doi:10.1016/j.ins.2010.12.009

⇑ Corresponding author.E-mail address: [email protected] (V. Kirzhner).

Information Sciences 181 (2011) 1454–1462

Contents lists available at ScienceDirect

Information Sciences

journal homepage: www.elsevier .com/locate / ins

Author's personal copy

events of molecular evolution – substitutions, insertions, and deletions of symbols as well as transpositions of large frag-ments – can be detected by studying track peculiarities (see, e.g., [10,20]).

The dot-plot technique described above can be called a ‘‘classical’’ one. In the course of bioinformatics development, thereappeared an approach to sequence comparison based on the preliminary division of original sequences into fragments (win-dows), which are usually of equal length. As a result, an original sequence of symbols is transformed into a sequence of win-dows, which, in turn, can be considered as certain formal symbols. The novelty of the approach consists in the comparison ofthe latter sequences through defining pair-wise distances between the formal symbols (windows).

For further consideration, it is important to emphasize that the distance functions employed in the above approach are nolonger binary and may have any, even a continuous, set of values. The dot-plot method can be applied in this case, too, butthe main pattern will differ from the ‘‘classical’’ one. Namely, for a distance scale with a relatively large set of possible values,a colored axis of dots is used in accordance with the values of the distance function at these dots. For example, in the case ofan infinite set of distance values, an arbitrary discrete distance scale can be introduced. The resulting main pattern is a cer-tain dot-plot area which corresponds to fairly close, with respect to the distance values, sets of sequence fragments. As anexample of such colored dot-plots, consider Fig. 1A from [25], where two genome sequences are compared. The character-istic rectangular zones in the dot-plot area are accounted for by the pairs of fragments with almost the same distances be-tween them. Actually, in this figure, only one track can be distinguished, namely, the line along the diagonal with the originin the top-left corner. However, this is a trivial track since the compared sequence are almost identical.

Obviously, such formulation of the problem results in the requirement of local scale regulation (local zoom) for visuali-zation of non-trivial tracks, similar to that used in the dot-plot method implementations described in [11,30]. At present, thedot-plot method is a part of various packages used in the field of bioinformatics [1,6,12,28,30,34], where, as a rule, it is pos-sible to make comparisons for any distance scale.

1.1. The locally-minimal track approach

Obviously, obtaining a track in the case of a multivalent (continuous) distance function is not a visualization problem. It israther a conceptual problem, which is connected, to a great extent, with the formal definition of the track notion. Let us con-sider the following simple example (Fig. 2).

Fig. 1. Example of a dot-plot diagram. The compared sequences are located along the upper and the left sides of the rectangle. The marked trackscorrespond to identical fragments of these sequences.

Fig. 2. The scale effect illustrated on a fragment of a dot-plot matrix.

V. Kirzhner et al. / Information Sciences 181 (2011) 1454–1462 1455

Author's personal copy

In Fig. 2A, which shows a fragment of a dot-plot matrix, the (2, 2, 3) sequence of length 3 is marked. This sequence is,actually, a track. Indeed, the first two distances equal 2, which is the minimal value for the whole matrix shown inFig. 2A. The next distance equals 3 and, though this value can be found in some other parts of the matrix, it is minimal inthe neighborhood. The latter distance suggests that the corresponding points of the sequence are closer to each other thanother pairs of the points in this neighborhood. The greater absolute value of the distance here is due to the increase of alldistances in the neighborhood, caused by some factor unknown to us. In this example, only two versions of coloring (Band C) are possible. In case B, the distances equal to 2 and 3 fall within the same color range and, therefore, the track markedin Fig. 2A appears to be almost completely hidden by the set of equivalent (in this particular range) distances (see Fig. 2B). Incase C, since the distances equal to 2 and 3 fall within different ranges, the track consists of only two dots. Thus, in this exam-ple, the complete track cannot be detected at any possible scale.

In this work, we propose a simple and reasonable method of track evaluation in the case when the distance function hasmore than two values, in particular, an infinite set of values. The method is based on the following natural generalization ofthe classical track approach.

Consider two symbol sequences, S1 and S2, over a certain alphabet A and a functional d, mapping each pair of the alphabetsymbols, a, b (a,b 2 A), into some value d(a,b), which belongs to the positive part of the real axis R. As the symbols a and b runindependently along the sequences S1 and S2, respectively, a distance matrix D = kdijk is created. Each element dij of matrix Dis equal to the value of functional d for the pair of symbols in position i in sequence S1 and in position j in sequenceS2. Theelement dij which is the minimal one in a certain neighborhood of matrix D elements is referred to as a locally-minimal ele-ment of the matrix. Let T denote the whole set of locally-minimal elements of matrix D. Now, let all the elements of set Tequal 1, while all the other elements of the matrix equal 0. In this way, a new matrix D0 is obtained. Since its elements takeonly two different values, this matrix can be considered in the framework of the classical dot-plot approach.

Locally-minimal dot-plot definition. The dot-plot diagram of matrix D is defined as the dot-plot diagram of matrix D0.Such diagram will be referred to as a minimal dot-plot (md-plot) diagram. By choosing different neighborhoods for matrix Delements, one can obtain various implementations of the above definition. Choosing one or another neighborhood is an op-tion, which gives the researcher certain flexibility in building the correspondence map. A few examples of the neighborhoodsare given below.

Procedure 1. Define the elements of set T as the local minima with respect to both coordinates. Namely, the point (i, j)belongs to set T if the distances d(i, j + 1) and d(i, j � 1) of fragment i from fragments j + 1 and j � 1, neighboring j, are biggerthan the value of d(i, j). The same requirement must be fulfilled for the distances of fragment j from fragments i + 1, i � 1.Thus, the ‘‘minimal’’ condition for the point (i, j) has the form:

dði; jÞ < dði; jþ 1Þ; dði; jÞ < dði; j� 1Þ; dði; jÞ < dðiþ 1; jÞ; dði; jÞ < dði� 1; jÞ: ð1Þ

Procedure 2. Consider the local minimum conditions for the distance d(i, j) with respect to the distances of fragment i fromtwo fragments j + 1, j � 1, neighboring j:

dði; jÞ < dði; jþ 1Þ; dði; jÞ < dði; j� 1Þ ð2aÞ

or the same conditions with respect to the distances of fragment j from two fragments i + 1, i � 1:

dði; jÞ < dðiþ 1; jÞ; dði; jÞ < dði� 1; jÞ: ð2bÞ

Each of the above local minimum conditions can be used for the evaluation of set T. It should be noted that conditions (2a) or(2b) are stronger than the requirement that the second derivative with respect to the corresponding direction be positive.Procedure 2 can be implemented under still more tough minimum conditions:

dði; jÞ < dði; jþ 1Þ; dði; jÞ < dði; jþ 2Þ; dði; jÞ < dði; j� 1Þ; dði; jÞ < dði; j� 2Þ; ð3aÞ

dði; jÞ < dðiþ 1; jÞ; dði; jÞ < dðiþ 2; jÞ; dði; jÞ < dði� 1; jÞ; dði; jÞ < dði� 2; jÞ: ð3bÞ

Obviously, Procedure 2 generates different md-plots in implementations (a) and (b) if the sequences S1 and S2 are different.It should be also noted that a threshold parameter can be introduced in the definition of set T by leaving only the local

minima with the absolute values below a certain threshold.It appears reasonable that the procedure of generating local minimal dots should depend on the model of the pattern that

has to be found. In this work, the pattern is a track, i.e., a sequence of at least three dots such that: (1) for each dot, the ‘‘sur-rounding’’ distance values – above and below it, on its right and left – are larger than the value at this dot (see, for example,Fig. 2); (2) the dots are arranged along a bisector without gaps. Procedure 1 described above does generate set T of dots,which satisfy condition (1). Obviously, not all the dots of set T satisfy condition (2), but set T contains all the tracks soughtfor. Of course, Procedure 1 can be modified (conditions made weaker), e.g., through substituting non-strict inequalities forthe strict ones. Another, more essential modification consists in setting the condition that only any three inequalities out ofthe four, constituting Procedure 1, have to be true. Such modifications can be introduced in the framework of the same pat-tern model and reflect the level of noise or certain peculiarities of the distance function.

1456 V. Kirzhner et al. / Information Sciences 181 (2011) 1454–1462

Author's personal copy

In contrast to the above modifications, the addition of inequalities for the neighboring dots located in the diagonal ver-tices does not make sense for the pattern model adopted here since it does not take into account the relationship betweenthe dots of the track itself. In the case of a continuous distance scale, where equal distance values are relatively scarce, such‘‘full’’ neighborhood would generate set T consisting only of isolated dots. Thus, the track would have to be understood assomething else, for example, as a ‘‘dotted line’’. In our opinion, such a new model is far from a meaningful track concept;however, such pattern may be of interest for some other research connected with sequence comparisons.

It should be noted that Procedure 2 described above is also a version of Procedure 1.Smoothing the distance matrix prior to the evaluation of set T may prove helpful for increasing the resolution of the track

map. Taking into consideration the direction of the tracks, we propose to perform filtration along the bisectors parallel orperpendicular to the main bisector. For this purpose, various filters can be employed, e.g., the median filter or the movingaverage. In a sense, these procedures are similar to the one mentioned above [23].

The md-plot fragment shown in Fig. 3 illustrates the application of the proposed method to the comparison of the Homosapiens and the Macaca mulatta genomes. Each dot of the compared sequences corresponds to a genome fragment one mil-lion letters in length. The method of evaluating the distance between such fragments is described below; however, it shouldbe noted here that the distance function used by us is, actually, defined on a pair of random multidimensional vectors andhas a continuous set of values. Therefore, equal distance values occur quite rarely and for the purpose of evaluating set T bymeans of Procedure 1, the difference between strict and non-strict inequalities in the formulation of this procedure is notessential.

The dots in the dot-plot diagram (Fig. 3A) mark the distances which do not exceed a certain threshold. The rectanglesobserved in the diagram are a characteristic feature of the ‘‘threshold’’ technique. Formally speaking, the diagram inFig. 3A represents a ‘‘dot-plot‘‘ in a two-valued scale, where the black and the white dots denote distance values belowand above the preset threshold, respectively. However, it is impossible to detect the tracks on this plot. The origin of sucha pattern was discussed in Introduction. The md-plot (set T) obtained according to the locally-minimal md-plot definition

Fig. 3. Fragments of the dot-plot and md-plot diagrams of the comparison of the Homo sapiens (horizontal line) and the Macaca mulatta (vertical line)genomes. Each dot corresponds to the distance between fragments one million letters in length. A. Dot-plot representation in which dots denote pair-wisedistances lower than a preset threshold. B. md-Plot representation in which dots denote the local minima (set T). C. The diagram which demonstrates onlythe dots from set T constituting direct tracks of lengths more than 3. D. Superposition of A and C.

V. Kirzhner et al. / Information Sciences 181 (2011) 1454–1462 1457

Author's personal copy

by means of Procedure 1 is shown in Fig. 3B. Some of the dots constitute direct or reverse tracks (compare to Fig. 1), whichcomprise the target pattern of the diagram. Close inspection of set T (Fig. 3B) reveals the presence of long direct tracks, whilereverse tracks are almost absent. This ‘‘lack of symmetry’’ is not surprising since the direct and the reverse tracks have dif-ferent origin in the case of comparing close genomic species. The direct tracks are an ‘‘inter-genomic phenomenon’’ sincethey are produced by quite similar genome fragments, which are necessarily present in the two genomes. The more closelyrelated are the species, the more is, as a rule, the number of long direct tracks in the md-plot diagram. In contrast to this, thereverse tracks are accounted for by some specific intra-genomic features. Thus, below we consider only the major type oftracks observed in the diagram (Fig. 3B) – the direct tracks. A more distinct representation of these tracks can be found inFig. 3C. It is interesting to note that the direct tracks are located inside the ‘‘threshold-technique rectangles’’, as can be seenin Fig. 3D, obtained by superposition of Fig. 3A and C.

2. Application of the locally-minimal track method

We have applied the proposed method of constructing md-plots to the analysis of the whole human genome. For this pur-pose, each chromosome is transformed into a sequence of ‘‘points’’, each point being a chromosome fragment 1 Mb in length.The fragments within a single chromosome are ordered in the natural way, while the chromosomes are ordered in ascendingorder of their numbers; chromosomes X and Y are at the end of the list. The chromosome (the DNA strand within a chromo-some) orientation is chosen to be 50–30, which is the direction of transcription (mRNA synthesis). With the total length of thehuman genome being about 3000 million letters, such sequence of fragments is 3000 points in length.

Two genome sequences are used in this work: (a) the whole human genome; the sequence was taken from the EnsemblGenome Browser (Ensembl is a joint scientific project of the European Bioinformatics Institute and the Wellcome Trust San-ger Institute); (b) the same genome in which all mobile elements and genes (including all protein- and RNA-coding genes,pseudo-genes, etc.) are masked. For the sake of statistical stability, we calculate the distance between two fragments onlywhen the total length of the masked regions for each fragment does not exceed 20%. The number of such fragments is about1800 (the total number of fragments being about 3000).

We propose to refer to the (b) type of sequence, lacking all its meaningful units, as dark sequence, by analogy with thenotion of dark matter, introduced in cosmology to refer to the universe matter which is not part of the stars, nebulae, irra-diation or other physical objects directly perceivable by modern astronomic devices.

Applying the md-plot technique separately to the analysis of each of the above sequences makes it possible to study thepeculiarities of genome organization at the whole-genome level as well as at the level of the ‘‘darkest’’ genome fraction.

For further consideration we must define the distance between any pair of fragments. The method for comparing longgenome fragments was developed long ago [4,13,15,27,35] on the basis of the frequency distribution of short oligonucleo-tides (words) which occur in the fragments. Moreover, similar methods have been developed for the comparison of natural-language texts (see, e.g., [26,37] and the recent book [3], which describes the parallelism between the models of analysis oflinguistic and genetic texts). Previously [15] we coined the term method of compositional spectra (CS) to refer to the aboveapproach. At present, there exists a large body of research [16,17,21,23,27,29] on phylogenetic relationships, in which dif-ferent versions of this method are employ. The results of these studies confirm the validity of the CS method. The existingversions of the method differ mainly in the choice of the set of oligonucleotides, called support, which the frequency distri-bution (CS) is evaluated for.

The above approach was also used for constructing colored dot-plot diagrams. One of such diagrams [25] (named by theauthors a similar plot), is based on the spectra of 6-letter words. The main result of this study is associated with an elegantuse of the single track – the main diagonal – present in the diagram. This is a trivial track since the authors compare thegenomes of very close bacterial strains. The differences between these strains manifest themselves in the disruption ofthe diagonal track in some positions.

Two versions of the CS method are employed in the present work: (1) the support is a set of 10-letter words (10-mers);(2) the support is a set of 6-letter words (6-mers). The 10-mers may occur in a sequence even with the mismatch of twosymbols (r = 2), while the 6-mers can only occur with a zero mismatch (r = 0) (the support size is 200 and 250 words, respec-tively [15,16]). The methods of calculating the distances between compositional spectra are chosen with regard to the factthat the results substantially depend on the type of the distance function [33]. In particular, it was shown that with theEuclidian distance the results are less distinct than with the distance based on the Spearmen correlation coefficient [14].

In this work, we use the distances of both types and also the inversion distance, for which the results are presented in moredetail. The inversion distance is introduced in the following way. If, for each distribution, the set of words is arranged, say, indescending order of frequencies, then, for a certain common fixed word order, each distribution generates the correspondingpermutation of the support words. Next, for any two permutations, P and Q, the inversion distance is the minimum number ofinversions which must be applied to one permutation in order to produce the other. It can be shown that this distance is,indeed, a metric (see, e.g., [2]). The concept of this metric is close to that of the Spearmen correlation coefficient. Set T is con-structed according to both Procedures 1 and 2 (version 3a) described above.

A fragment of the md-plot for the whole human genome is shown in Fig. 4. Direct and reverse tracks not less than 4 dots(4 million symbols) in length are marked out. The marked areas are those where the distance values do not exceed one thirdof the average distance across the matrix (compare with the similar presentation in Fig. 3). Evidently, these areas are pre-

1458 V. Kirzhner et al. / Information Sciences 181 (2011) 1454–1462

Author's personal copy

dominantly rectangles and nowhere have they looked anything resembling tracks. Although almost all the tracks in Fig. 4 arelocated inside the rectangles, there are a few tracks that lie outside. This means that the distances between the dots thatconstitute these tracks are larger than the preset threshold.

Table 1 shows the characteristics of the tracks obtained with the permutation metric. The data demonstrate an interestingpattern of asymmetry in the number of direct and reverse tracks. Indeed, the asymmetry Q is estimated as

Q ¼ 1� nR

nD;

where nD and nR are the total number of direct and reverse tracks, respectively, longer than or equal to 3. For example, thevalue of Q for the whole-genome sequence (A in Table 1) calculated using 10-mers equals 0.32 and 0.42 for Procedures 1and 2, respectively. With the dark sequence (B in Table 1), the asymmetry is larger than with the whole genome: thecorresponding values of Q are 0.46 and 0.50.

The data presented in Fig. 5 shows that the asymmetry between the direct and the reverse tracks exists in all the casesstudied here. Under the same conditions, the number of direct tracks is always higher than that of reverse tracks.

3. Discussion

3.1. Tracks and their interpretation

In the classical version of the dot-plot method, where a sequence consists of symbols, the tracks are accounted for by thecoincidence of ‘‘words’’ (strings of symbols) located in different parts of the sequence. According to the approach described inthis work, the tracks reflect relative proximity of the spectra of genome fragments which are arranged in a succession,

Fig. 4. Fragments of the dot-plot and md-plot diagrams of the comparison of the Homo sapiens genome with itself. Each dot corresponds to the distancebetween two fragments which are million letters in length. The pair-wise distances for the dots marked in grey are lower than a certain preset threshold.The diagram shows only those dots of set T which constitute direct or reverse tracks of length more than or equal to 4. The distances are calculated by themethod of inversion distances; set T is generated according to Procedure 1.

V. Kirzhner et al. / Information Sciences 181 (2011) 1454–1462 1459

Author's personal copy

without any gaps. Actually, the proximity of the spectra of two fragments is a rather common phenomenon and its originwas discussed in quite a few articles, e.g., in [15,33]. It was suggested [17,18] that such proximity is related to the ‘‘evolu-tionary distance’’ between the fragments. However, the proximity of the spectra of two fragments cannot be always trans-formed into such ‘‘geometrical’’ characteristics as the alignment distance or, in general, any ‘‘letter-by-letter’’ similarity ofthe fragments. In view of the above, we may claim that the ‘‘directional’’ coincidence of the spectra (tracks) is an essentiallynovel phenomenon of quite different origin than isolated coincidence of any two spectra. The significance of the track con-cept lies in the possibility of step-by-step comparison of genomic fragments, which appears to be especially important in thecase of such hard-to-perceive sequence similarity measure as spectra. A few fragments arranged in the form of a track lookmuch more convincing than a single pair of fragments, no matter how small the distance between them may be. After a trackpattern has been obtained, such notions as track continuation, parallel shift, etc., can be applied to the analysis of the pattern,similar to the case of classical tracks. Moreover, an alignment similar to that proposed in [23] can be used to smooth thenoise prior to the procedures of track evaluation. For example, to obtain an md-plot on the basis of 10-mers spectra forthe whole genome (sequence A in Table 1), we use a one-dimensional five-point filter along the diagonals parallel to the ma-trix main bisector. Such smoothing results in a sharp increase of the number and lengths of direct tracks (see Fig. 6; compareto Table 1).

3.2. Track asymmetry pattern

A few different versions of the md-plot method have been applied to evaluating two ‘‘extreme’’ fractions of a genomicsequence – the whole genome and the dark matter (the same genome in which all the mobile elements and genes, includingall protein- and RNA-coding genes, pseudo-genes, etc., are masked). It has been found that the number of direct tracks isalways larger than that of the reverse tracks. Note that this fact holds, in particular, for a symmetric neighborhood (Proce-dure 1), which, along with the symmetry of the distance matrix tested (two identical sequences), does not give a chance forintroducing asymmetry as an artifact produced by mathematical operations. For the same reason, i.e., the symmetry of Pro-cedure 1, substituting the strict inequalities for non-strict ones in this procedure would not give advantage to either direc-tion of the tracks. It should be also noted that equal distances occur quite rarely in the continuous distance scale employedby us. On the other hand, the existence of the track asymmetry is quite natural for the case of closely related genomes since,by definition, large fragments of one genome are quite similar to the corresponding fragments of the other (see, e.g., Fig. 3C,which shows a great number of direct tracks). However, in the case of intra-genomic comparison, such asymmetry cannot bea priori expected. Its existence, demonstrated above, may have an explanation similar to that of the track asymmetry for clo-sely related genomes. Namely, the asymmetric pattern may reflect the peculiarities of the human genome transformation inthe course of evolution. Indeed, it is well known [5,8,21,32,36] that, in this process, some long sequence fragments wereduplicated and the created copies became inserted into other parts of the genome. It can be suggested that the orientationof the inserted fragments was not changed. The subsequent ‘‘microevolution’’ of the copies due to substitutions, insertions,and deletions of letters as well as transposition of relatively small sequence blocks made these originally identical fragments

Table 1The volume of set T and the number of tracks longer than or equal to 3 obtained with the permutation metric.

Method 10-mers, r = 2 6-mers, r = 0

Procedure 1 2 1 2

Dir Rev Dir Rev Dir Rev Dir Rev

L/T 951798 951798 1518826 1518826 933200 933200 1503906 1503906

A P3 16746 11402 25162 14444 14542 10513 21973 138563 13982 9868 20949 12635 12282 9040 18485 122014 2262 1256 3464 1575 1890 1208 2881 14415 398 250 604 202 322 217 504 1796 72 20 127 28 42 34 89 287 6 8 16 4 6 13 13 78 1 2 0 19 1

B P3 2966 1598 5599 2779 1254 991 3944 27073 2712 1438 5014 2588 1178 920 3617 25234 232 132 516 172 74 60 294 1705 20 26 59 17 2 11 30 136 2 0 9 2 37 2 08 1

First column: A, B – the whole genome sequence and the dark sequence, respectively. Second column: T – the volume of the set of local minima; L – theindicated track length (P3, = 3, 4, 5, etc.). Dir and Rev denote direct and reverse tracks, respectively, obtained using Procedures 1 and 2 (version 3a) withtwo different supports – 10-mers and 6-mers.

1460 V. Kirzhner et al. / Information Sciences 181 (2011) 1454–1462

Author's personal copy

‘‘letter-by-letter’’ incomparable. Yet, the CS technique catches their common origin similar to the way it catches phyloge-netic relationships [17,18,22,24,29,31].

The proposed-above scheme which explains how the observed similarity of genome fragments could have arisen appearsappropriate for major evolutionary events such as duplication of a genome or its significant parts. For relatively small gen-ome fragments, there also exist pretty common mechanisms of their duplication and transfer,which can also contribute tothe formation of tracks. Thus, a hypothetical duplicated fragment which is 3 million letters in length may give rise to a trackof length 3 if the window used for the sequence representation is one million letters long.

4. Conclusions

The modification of the dot-plot method proposed in this work is an effective and flexible tool for the analysis of distancematrices with the elements taking more than two values (actually, far more than two, in particular, an infinite number ofvalues). Such matrices are generated, for example, in bioinformatics, when genome sequences are compared at a level morecomplicated than the letter-by-letter one. In the latter case, the major pattern of the dot-plot diagram is a set of tracks, i.e.,the sequences of dots which correspond to the coincidence of letters. The modified md-plot method suggested here has pro-vided an elegant way of producing the track pattern for distance matrices with a continuous distance scale. Such an approachhas proved to be meaningful from the biological point of view. For example, in the case of the human genome, it turned outthat the number of direct tracks, i.e., those lying along the main bisector, is larger than the number of reverse tracks lyingalong the secondary bisector. This result holds for all the distances tested and the procedures of track generation employedby us. The asymmetry in the number of direct and reverse tracks can be accounted for by the peculiarities of the genomeevolution which are discussed in the article.

Another effective application of the track pattern is using it to reveal similar regions in closely related genomes. This ap-proach is illustrated by comparing the human and the macaca genomes and has been further used for the comparison of thehuman genome with those of higher primates [19].

Fig. 5. The asymmetry values Q, obtained using different methods of track detection. Types of distances: I. Permutation metric; II. Spearman correlationcoefficient; III. Euclidian distance. Columns 1 and 2 represent the results for Procedures 1 and 2, respectively.

Fig. 6. The number of direct tracks in the md-plot representation of the whole genome (sequence A in Table 1). The distances are calculated for 10-mersspectra; the track sets are evaluated before and after filtrating the distance matrix along the main bisector. The vertical axis is presented in logarithmicscale.

V. Kirzhner et al. / Information Sciences 181 (2011) 1454–1462 1461

Author's personal copy

The version of the program which generates the minimal dot-plot based on a distance matrix is available at http://www.valerykirzhner.com//docs/62.exe.

References

[1] J.F. Abril, R. Guigo, T. Wiehe, gff2aplot: plotting sequence comparisons, Bioinformatics 19 (2003) 2477–2479.[2] D.A. Bader, B.M. Moret, M.A. Yan, Linear-time algorithm for computing inversion distance between signed permutations with an experimental study, J.

Comput. Biol. 8 (5) (2001) 483–491.[3] A. Bolshoy, Z. Volkovich, V. Kirzhner, Z. Barzily, Genome clustering: from linguistics models to classification of genetic texts, Studies in Computational

Intelligence, Springer-Verlag, 2010.[4] V. Brendel, J.S. Beckmann, E.N. Trifonov, Linguistics of nucleotide sequences: morphology and comparison of vocabularies, J. Biomol. Struct. Dyn. 4

(1986) 11–21.[5] C.D. Crow, G.P. Wagner, What is the role of genome duplication in the evolution of complexity and diversity?, Mol Biol. Evol. 23 (2006) 887–892.[6] R. Engels, T. Yu, C. Burge, J.P. Mesirov, D. DeCaprio, J.E. Galagan, Combo: a whole genome comparative browser, Genome Res. 10 (8) (2000) 1148–1157.[7] W. Fitch, Locating gaps in amino acid sequences to optimize the homology between two proteins, Biochem. Genet. 3 (1969) 99–108.[8] D.R. Forsdyke, S.J. Bell, Purine loading, stem-loops and Chargaff’s second parity rule: a discussion of the application of elementary principles to early

chemical observations, Appl. Bioinformatics 3 (2004) 3–8.[9] A.J. Gibbs, G.A. McIntyre, The diagram, a method for comparing sequences. Its use with amino acid and nucleotide sequences, Eur. J. Biochem. 16 (1970)

1–11.[10] A.J. González, L. Liao, Clustering exact matches of pairwise sequence alignments by weighted linear regression, BMC Bioinformatics 9 (2008) 102.[11] M. Itoh, H. Watanabe, CGAS: comparative genomic analysis server, Bioinformatics 25 (7) (2009) 958–959.[12] N. Jareborg, R. Durbin, Alfresco – a workbench for comparative genomic sequence analysis, Nucleic Acids Res. 29 (6) (2001) 1352–1365.[13] S. Karlin, L. Ladunga, Comparisons of eukaryotic genomic sequences, Proc. Natl. Acad. Sci. USA 91 (26) (1994) 12832–12836.[14] M.G. Kendall, Rank Correlation Methods, Charles Griffin& Co., Ltd, London, 1970.[15] V.M. Kirzhner, A.B. Korol, A. Bolshoy, E. Nevo, Compositional spectrum – revealing patterns for genomic sequence characterization and comparison,

Physica A 312 (2002) 447–457.[16] V.M. Kirzhner, E. Nevo, A.B. Korol, A. Bolshoy, One promising approach to a large scale comparison of genomic sequences, Acta Biotheoretica 51 (2)

(2003) 73–89.[17] V. Kirzhner, A. Bolshoy, Z. Volkovich, A. Korol, E. Nevo, Large scale genome clustering across life based on a linguistic approach, BioSystem 81 (3) (2005)

208–222.[18] V. Kirzhner, A. Paz, Z. Volkovich, E. Nevo, A. Korol, Different clustering of genomes across life using the A–T–C–G and degenerate R–Y alphabets: early

and late signaling on genome evolution?, J Mol. Evol. 64 (4) (2007) 448–456.[19] V. Kirzhner, S. Margulis, A. Korol, Large-scale comparisons of primate genomes on the above-gene level, in: The 14th International Conference on

Research in Computational Molecular Biology (RECOMB 2010), Abstract.[20] S. Kurtz, A. Phillippy, A.L. Delcher, M. Smoot, M. Shumway, C. Antonescu, S.L. Salzberg, Versatile and open software for comparing large genomes,

Genome Biol. 5 (2) (2004).[21] S. Lockton, B.S. Gaut, Plant conserved non-coding sequences and paralogue evolution, Trends in Genetics 21 (2005) 60–65.[22] G. Lu, S. Zhang, X. Fang, An improved string composition method for sequence comparison, BMC Bioinformatics 9 (Suppl. 6) (2008) S15.[23] J.V. Maizel Jr., RP. Lenk, Enhanced graphic matrix analysis of nucleic acid and protein sequences, Proc. Natl. Acad. Sci. USA 78 (12) (1981) 7665–7669.[24] J. Mra’zek, Phylogenetic signals in DNA composition: limitations and prospects, Mol. Biol. Evol. 26 (5) (2009) 1163–1169.[25] C. Putonti, Y. Luo, C. Katili, S. Chumakov, G.E. Fox, D. Graur, Y. Fofanov, A computational tool for the genomic identification of regions of unusual

compositional properties and its utilization in the detection of horizontally transferred sequences, Mol. Biol. Evol. 23 (10) (2006) 1863–1868.[26] N.J. Randon, J. Lawry, Classification and query evaluation using modelling with words, Inform. Sci. 176 (2006) 438–464.[27] E.P. Rocha, A. Viari, A. Danchin, Oligonucleotide bias in Bacillus subtilis: general trends and taxonomic comparisons, Nucleic Acids Res. 26 (12) (1998)

2971–2980.[28] S. Schwartz, Z. Zhang, K.A. Frazer, A. Smit, C. Riemer, J. Bouck, R. Gibbs, R. Hardison, W. Miller, PipMaker – a web server for aligning two genomic DNA

sequences, Genome Res. 4 (2000) 577–586.[29] G.E. Sims, S.R. Jun, G.A. Wu, S.H. Kim, Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions, Proc. Natl.

Acad. Sci. USA 106 (8) (2009) 2677–2682.[30] E.L. Sonnhammer, R. Durbin, A dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequence analysis, Gene 167

(1995) 1–10.[31] M. Takahashi, K. Kryukov, N. Saitou, Estimation of bacterial species phylogeny through oligonucleotide frequency distances, Genomics 93 (6) (2009)

525–533.[32] M. Touchon, A. Arneodo, Y. d’Aubenton-Carafa, C. Thermes, Transcription coupled and splicing-coupled strand asymmetries in eukaryotic genomes,

Nucleic Acids Res. 32 (2004) 4969–4978.[33] Z. Volkovich, V. Kirzhner, A. Bolshoy, A. Korol, E. Nevo, The method of N-grams in large-scale clustering of DNA texts, Pattern Recognit. 38 (11) (2005)

1902–1912.[34] M.D. Wilson, C. Riemer, D.W. Martindale, P. Schnupf, A.P. Boright, T.L. Cheung, D.M. Hardy, S. Schwartz, S.W. Scherer, L.C. Tsui, W. Miller, B.F. Koop,

Comparative analysis of the gene-dense ACHE/TFR2 region on human chromosome 7q22 with the orthologous region on mouse chromosome 5,Nucleic Acids Res. 29 (6) (2001) 1352–1365.

[35] T. Wu, J. Burke, D. Davison, A measure of DNA sequence dissimilarity based on Mahalanobis distance between frequencies of words, Biometrics 53(1997) 1432–1439.

[36] P. Yin, A.J. Hartemink, Theoretical and practical advances in genome halving, Bioinformatics 21 (2005) 869–879.[37] S. Zadrozny, J. Kacprzyk, Computing with words for text processing: an approach to the text categorization, Inform. Sci. 176 (2006) 415–437.

1462 V. Kirzhner et al. / Information Sciences 181 (2011) 1454–1462


Recommended