1. INTRODUCTION
Sequence alignment refers to the search of similarity regions within biological sequences of DNA, RNA, or proteins. Besides the biological significance and interpretation of the results, the main problem of sequence alignment from the computer science point of view is the very large number of residue-to-residue comparisons that are needed when searching for similarities.
The biological definition of the problem makes it time-consuming taking in consideration the fact that in practice, a single genome may contain in the order of 109 residues. Thus, a computer program can therefore intensively search for regions of similarity between sequences to detect such relationships.
Alignment could be global or local. Global alignment is to find the best match between the entire sequences, whereas local alignment must find the best match between certain regions of the sequences.
Global alignment algorithms such as Needleman–Wunsch (NW) [1], which attempt to align every residue in every sequence, are most useful when the sequences in the query set are similar and of roughly equal size. Local alignments instead, like in Smith–Waterman (SW) [2], are more often used for dissimilar sequences that are suspected to contain regions of similarity within their larger sequence context.
Fig. 1 shows a global and a local alignment of the same two sequences. It shows that if sequences are not sufficiently similar, a global alignment tends to spread gaps that hide possible regions of similarity. Residue mismatches are called substitutions (mutations in genetic terminology), and the character “-” denotes gaps that may be insertions or deletions depending on the point of view.
Fig. 1. A global and a local alignment of the same two sequences
A. Substitution Matrix
Biologists have learned that some mutations are more likely than others (e.g., a DNA mutation from cytosine to adenine is more common than cytosine to guanine) and they need alignment algorithms to reflect this property. For this purpose, substitution matrices have been built using statistical data from known sequences and mutations. Fig. 2 shows the BLOSUM62 [3], a common substitution matrix for amino acids alignments.
Fig. 2. The BLOSUM62 substitution matrix
Definition 1: A substitution matrix (sbt) on an alphabet Σ = {a1, a2…, an} has n × n entries, where each entry (i, j) assigns a score for a substitution of the letter ai by the letter aj in an alignment.
The elements on the main diagonal have the highest values to encourage matching of identical residues in alignment algorithms. Amino acids are divided into colored groups according to the chemistry of the side group.
B. Gap Penalty
Besides having substitution matrices for mutations, it is also desirable to score insertions and deletions gaps differently. First, to avoid having gaps all over the alignment, gaps have to be given penalties, just like unmatching amino acids. This penalty cannot be derived from the database alignments used to create the substitution matrices such as BLOSUM since these matrices were derived from ungapped alignments. The score given to an insertion/deletion is commonly called a gap penalty.
There are two schemes used, namely, linear gap (g) penalty and affine gap penalty. Having only one score for any gap inserted is called a linear. However, insertions and deletions often involve a longer stretch of sequence in a single event. For this reason, two different gap penalties are usually included in the alignment algorithms: One penalty for having a gap at all (gap opening penalty (go)) and another smaller penalty for extending already opened gaps (ge). This is called an affine and is actually a compromise between the assumptions that the insertion or deletion is created by one or more events [4].
Definition 2: Given a sequence S over the alphabet Σ of length L, a sequence Sg of length Lg over Σ U “-” is called a gapped sequence of S if Lg ≥ L and there exist a transformation T(S) = Sg such that:
2. PAIR-WISE ALIGNMENT
Pair-wise sequence alignment is defined as an alignment of two sequences to determine how similar they are. In most sequence similarity calculations, a similarity score is inferred from the alignment. Gap insertions are allowed until the resulting sequences are of the same size, and the alignment must obey the restriction that gaps cannot appear in the same position in both sequences. This score is determined based on a substitution matrix and specific penalties for the insertions and deletions gaps.
In the following, the two most common pair-wise alignment algorithms used to compute the similarity matrix H for each a pair of sequences S and T with their length Ls and Lt, respectively, are explained in detail.
Definition 3: Given a pair of sequences S and T over the alphabet Σ with their lengths Ls and Lt, respectively. Let Sg and Tg be two-gapped sequences with lengths and , respectively. A pair-wise sequence alignment of S and T is defined to be a matrix M of size (2 × n) with n = max(Lgs, ) with the following properties: $$$ 1 ≤ i ≤ n.
-
M(s, i) = Sg(i) and M(t, i) = Tg(i)
-
If M(s, i) = “-” then M(t, i) ≠ “-” and vice versa.
Definition 4: Given two sequences S and T on an alphabet Σ. A similarity score function of an alignment.
Score: M → R
Score (M) $$$ R
Assigns a similarity score to each pair of characters in M.
Definition 5: A score of alignment is a real value function (score) that associate for each alignment M(S,T) a real value function Score (M(S,T)).
The problem of pair-wise alignment S and T is to find an alignment Mo(S,T) with optimal score (Mo(S,T)), that is,
Score (M(S,T)) ≤ Score (Mo(S,T)) for all alignments M(S,T).
A. NW Algorithm
The NW algorithm [1], introduced in 1970, as the first dynamic programming tool to compute a global alignment for any pair of biological sequences. The NW algorithm achieves its goal by going through the following three phases:
-
• The initialization phase: Initiates the H(0,0) matrix element by 0. The first row and column are initialized with the costs of gaps of lengths s and t.
i.e., H(s,0) = g.s and H(0,t) = g.t “ 1 ≤ s ≤ Ls, 1 ≤ t ≤ Lt; g is a gap penalty.
-
• The score computation phase: Computes all other values of matrix H(s,t) using one of the following recursive formula:
or
Where, sbt is a substitution matrix and (g, go, and ge) are the gaps penalty.
-
• The trace back phase: Recovers the alignment by tracing back the path starting from the last element H(Ls+1; Lt+1).
A. SW Algorithm
From the evolution perspective, two-related sequences could evolve independently with many independent mutations lowering the similarity between the sequences. Aligning the sequences with noised information often fails to produce a biologically meaningful alignment. In these cases, the local alignment, proposed by Smith and Waterman [2], identifies the longest segment pair that yields the best alignment score is more preferable. In the SW’s algorithm, the longest segment pair between two aligning sequences that yield the optimal alignment is identified by comparing all possible segments of all lengths between the two sequences through dynamic programming technique.
The main difference between this technique and NW’s is that negative scores are set to zeroes. This modification produces an alignment score matrix with positive scores. Thus, the backtracking procedure of the algorithm starts at the highest positive score cell and proceeds until it encounters a cell with zero score. The longest segment pair identified in these backtracking steps is the optimal scored local alignment of the two sequences.
The similarity matrix score H is filled using one of the following recurrence formula:
or
Where, sbt is a substitution matrix and (g, go, and ge) are the gaps penalty.
Fig. 3 shows the example of calculating the pair-wise local alignment S = {ATCTCGTATGAT} and T = {GTCTATCAC} using the SW algorithm.
Fig. 3. Local alignment using Smith–Waterman
The similarity matrix H is shown for:
From the highest score (H(8,11) = 10), a procedure of trace-back carries out the corresponding alignment as shown in Fig. 3.
3. ALIGNMENT FOR MULTIPLE SEQUENCE
Multiple sequence alignment (MSA) refers to the alignment of more than two biological sequences (DNA, RNA, or protein). It is considered as an extension of pair-wise sequence alignment as discussed in the previous section. It helps in many criteria such as identifying diagnostic patterns or motif to characterize protein families, demonstrating homology between new sequences and existing families of sequences.
MSA is NP-complete problem [5] since the computational cost grows exponentially with the expansion of biological datasets. This leads to the development of many algorithms aiming at reaching the most accurate and efficient alignment. Most commonly used algorithms are classified into two categories, progressive and iterative.
Recent studies have shown significant progress in enhancing the quality, accuracy, and speed of MSA tools. However, the big dataset of sequences of biologically relevant length can be difficult and time-consuming to align. Thus, many MSA tools have been proposed. In this section, only important MSA paradigms are introduced. They are used as a base for various MSA distinguished tools [6]. A progressive method most widely used MSA tools utilize the progressive method that was first introduced in 1987 [7]. For aligning N sequences, it goes through the following three main stages (Fig. 4):
Fig. 4. Multiple sequence alignment produced with ClustalW
-
• Stage 1: Generates all possible N(N−1)/2 pair-wise sequence alignment to construct a distance matrix computing the similarity between of each two sequences
-
• Stage 2: Creates a guide-tree using all the pair-wise distances using a clustering method such as unweighted pair-group method with arithmetic [8] or neighbor-joining [9]
-
• Stage 3: Builds-up the final multiple alignments by the progressive inclusion of the N sequences alignment based on the range given by the guide tree.
Stage 1 is calculated using the match between the residues of the two sequences found by the local alignment. The number of exact match is computed by counting the identical residues appearing in the same column in the local alignment excluding gaps, using the equation [10]:
Where, nid(S,T) indicates the number of exact matches using SW algorithm to align S and T. For instance, the nid-value in Fig. 3 is 6. Actually, this method will run-out storing the similarity matrix H, which is not practical for the datasets with long sequences.
Fig. 5. Relation between H, D, D1, D2, DE, and DF
Fig. 6. Scheduling D’s computations on 4 cores
Liu et al. [11] presented new recurrence relations equations (3) and (4) for the nid-value calculation that is compatible for parallel systems such as GPUs. They facilitate nid-computations without calculation of the actual trace-back, to reduce the storage space using the matrices NL(s,t) and NA(s,t) for linear and affine gap penalty, respectively. The computations of these matrices are given by following recurrences:
Where,
For linear gap penalty, and for affine gap penalty, we use:
Where,
4. PROPOSED METHOD
This section propose vectorized and parallelized algorithm for computing the sequence alignment. The aim of this work is switching of all the previous matrices to vectors and computing it by parallel. The main goal for it is to reduce used storage and speed-up runtime, for aligning long biological sequences fast.
The proposed algorithm will implement on the optimal local alignment (SW) algorithm because of the following features:
-
• It has been trusted by biologists for almost two decades with quality that is still comparable to more recent algorithms
-
• Its alignment results are similar to biologists expectations
-
• It is relatively fast, simple, understandable, and provides fairly good alignments across a diverse range of sequence types
-
• It has highly cited aligner, especially for big dataset of sequences as mentioned in recent studies [12]-[15].
SW algorithm [2] compares two sequences by computing a distance that represents the minimal cost of transforming one segment into another, with respect to the given scoring system.
As previously mentioned, there are two approaches to compute SW algorithm, based-on-gap penalty, linear, and affine gap penalties. Gap penalties prefer more continuous gaps to opening new gaps. Therefore, it encourages that gaps occur in loop regions instead of in highly structured regions.
The background biological meaning for this is that biologically divergence is often less likely in highly structured regions, which are commonly very important to the fold and function of a protein. In this paper, the affine gap penalty will be used.
A. Vectorization Approach
Vectorizing all matrices presented in the previous section is the main contribution of the proposed method. It was used when computing the aligning sequences of long lengths, aiming at accelerating computations, and used less space. This approach was based on the calculation of each elements of antidiagonal D in the similarity matrix (H) with affine gap penalty is based only on the computed elements of the four antidiagonals; two from (H) matrix with one from both E and F matrices.
This postulate that just one vector D for current antidiagonal, with six buffers previously computed D1, D2, DE, DE1, DF, and DF1 are enough to calculate the elements of D. After the newly D is computed, D1 is replaced by D2, D2 is replaced by D, DE is replaced by DE1, and DF is replaced by DF1.
In the subsequent iteration, this cyclic method is used to replace the six buffers D1, D2, DE, DE1, DF, and DF1. The values of all cells in D are computed in terms of its diagonal neighbor stored at D1, with its left and upper neighbors stored at D2 in addition the cells in DE and DF, and the maximum value is selected indicating the highest score.
Fig. 5 shows the vectorization approach when aligning pair of sequences S = {GCTACTCAC} and T = {GCTAGG TATGAT} with their lengths are 9 and 12, respectively. It illustrates the calculation of the elements of HA, using affine gap (go = 7, ge = −1) and a substitution cost of (2) if the characters are match and (−1) otherwise, and how it is replaced by D, using D1, D2, DE, DE1, DF, and DF1.
This figure shows the dependence relationship of the elements of the matrices, which is visualized by considering its antidiagonals D1, D2, DE, DE1, DF, and DF1 dependencies. It is clear that antidiagonal D in iteration i = 9 computations depend on the four previously computed antidiagonals D7, D8, and D8F. Therefore, all other computed antidiagonals can bedependence relationship of cell D(5) with its left neighbor DE(5) = −9, upper neighbor DF(4) = −8, and upper left neighbor D1(4) = 5. Where D2(5) + go is a maximum value of (D1E(5) + ge and (5) + go); D2(4) + go is a maximum value of (DF1(4) + ge and DF(4) + go). Using this way, all elements along vector D are computed in parallel from all elements in vectors D1, D2, DE, and DF.
To verify these postulates, authors have proposed the following new recurrence and theorem.
Theorem: The pair-wise local alignment of the sequences S and T, with an affine gap penalty (go for opining gap and ge for extending gap), substitution matrix (sbt), in iteration (i), and with element (j), the equation:
Gives a vectorization Di of the matrix HA and the following relations hold:
Where max(2, s−Ls) ≤ i ≤ min((s + 1), (Ls + 1)) and (imax) indicates the position of the maximum value in the vectors Di.
Proof: From Equation (1), we get:
and from Equation (5), we get:
Since Di−1 is computed in the previous iteration of Di, DiE, and DiF, that is, in one iteration before Di, DiE, and DiF, hence Di2 is computed in the second iterations before Di, let us denote Di−1, Di−2, Di−1E, and Di−1F by Di−1, Di−2, DiE1, and DiF1, respectively. Then, we get:
To proof of NiD(j) let the Equation (6), gives the vectorization NA(s,t):
After vectorizing the matrix NA as shown in Fig. 5.
Then, we get:
Where,
We now show that for a given i, NA(j) is equal to the number of exact matches in the optimal (i) suffix alignment.
Case 1: D(j) = 0. The alignment is empty. Hence, NA(j)=0.
Case 2: D(j) = D1(j−1)+sbt(S(j),T(i−j+1)). The alignment ends with S(j) aligned to T(i−j+1), which contributes m(S(j), T(i−j+1)) to the nid-value. The residual number is than equal to the nid-value got in the optimal j−1 suffix alignment. Hence,
Case 3: D(j) = D2(j−1) + go. The alignment ends with S(j) aligned to a gap, which contributes zero exact matches. The residual number is than equal to the number got in the optimal D2(j−1) suffix alignment. Hence,
Case 4: D(j) = D2(j) + ge. The alignment ends with T(i−j + 1) aligned to a gap, which contributes to zero exact matches. The remaining number is equal to the number found in the optimal D2(j) suffix alignment. Hence,
The increase in NiD occurs only at the vectors Di which has matching in its elements. Hence,
NA(xmax, ymax) = nid(S,T) is obtained by maximization.
B. Parallelization Approach
Another critical challenge that must be dealt with is the gigantic explosion in the amount of molecular data which makes the ability to align a huge number of long sequences becoming even more essential.
For example, the Ribosomal Database Project Release 10 [16] consists of more than million sequences. This leads to the massive number of calculations. Even if the sequences are short, and pair-wise calculations can be done relatively quickly, say at a rate of 5000−1 s, then their alignment still requires almost 12 days of CPU time. Another difficulty is how to store the similarity matrix elements, as it will take up to 40 GB of memory. This leads to the need of new approaches to parallelize the calculations using sort of sophisticated parallel and distributed systems such as multi-cores.
Recent studies refer some attempts have made to accelerate computation of similarity matrix. Ying et al., uses GPU’s in [17]. They show speedup comparing with the serial CPU program. Wirawan et al., [18] introduced a parallel algorithm on the cell broadband engine multi-core system for the calculation by taking benefit of the 128-bit SIMD vectorization registers of each SPEs and used half word values (16 bits) for the computation. Their results show a good speedup comparing with sequential ClustalW program.
Recently, Al-Neama et al. [19] proposed a new parallel algorithm of distance matrix computations of ClustalW is based on OpenMP system. It achieves speedup of about 2.39 on 50 sequences of the average length of 9200 nucleotide; tested on Core-i7 Intel Xeon 2.83GHz of the processor.
The second contribution of the proposed algorithm is parallelizing the computations to align the long-sequences dataset. The multithreads technique is used to apply the parallelism that reduces runtime necessary for repeated tasking synchronization and exchanging data. In addition, it makes efficient the scheduling through a task allocation policy that prefers the distribution according to the location of data. Each core (P) in the processor has a thread that its responsibility is calculation the maximum value of D’s and all threads runs in parallel. Calculations of the vector D are distributed over the total number of available cores (P). The elements of all vectors D1, D2, DE, DE1, DF, DF1, and D are accessible through the core’s shared memory.
The maximum value for each elements of D using Equation (7) and using Equation (8) are calculated in parallel. The value of each cell is evaluated in terms of its diagonal neighbor stored at D1, with its left and upper neighbors stored at D2, with DE and DF, and then the maximum value is selected indicating the highest score.
Fig. 6 shows the parallelization approach. It displays the scheduling of calculations of the elements of D and distributed them on the available cores labeled P0, P1, P2, and P3. They run in parallel on each four sequent elements of D, then they sequentially run on the second sequent 4 rows, and so on.
5. PERFORMANCE EVALUATION
The performance of the conceived parallel implementation of the proposed algorithm was extensively evaluated using different processing parameters. The evaluation methodology that was followed to correctly study the results obtained by the described solution is presented. The analysis goes from pair-wise alignment methods (SW). The improvements achieved due to introduced optimizations for multi-core system.
In this section, all needed information about the experimental setup for performance measurements is illustrated. It includes specifications of used platforms, details of experimented biological datasets, and characteristics of other programs used during comparisons.
A. Platform
As obviously clear from the previous section, the presented algorithm was designed to parallelize computations on a multi-core-based environment. Thus, to correctly evaluate the performance of the both original and proposed methods, the platform was considered as specified: An Intel quad-core-i7-3770, with 3.40 GHz processor and main memory of 8 GB; implementing on 64-bit Linux OS (Ubuntu) and running using C + + with OpenMP library.
B. Datasets
The tests have been conducted using a variety of data sets. These data sets sequences including long, medium-length, and short sequences. These lengths are ranging from 400 to 34,500 residues to study the solution’s overall performance against multiple different sizes. The data sets consist of sequences selected from NCBI [20].
Table I shows the used data set with the number of sequences and average their length with a numerical measure of the scatter of a data set (standard deviation).
TABLE I Used Benchmark Dataset Specifications
C. Programs
Overall, tests have been conducted on the specific platforms using various groups of data sets. To evaluate the implementation of our algorithm, it was tested in comparison to popular and efficient MSA program named ClustalW-MPI. This program is available online at: http://www.mybiosoftware.com/alignment/3052.
The runtime and speedup are considered most common performance measurements. Runtime is the elapsed time for all calculation, including all additions, comparisons, and maximum values. Speedup is the ratio between the runtime of the two involved programs.
Table II gives the runtime (in sec) and speedup of the two sequential of the proposed program against the ClustalW-MPI program computing the distance computation to illustrate how vectorization approach accelerates calculations.
TABLE II Sequential Performance Measurements of Our Algorithm versus ClustalW MPI
Fig. 7 illustrates that the longer sequences have the more acceleration of our algorithm calculation. Furthermore, our program achieves a speedup up to 2-fold over ClustalW-MPI.
Fig. 7. Performance comparison between sequential our algorithm, ClustalW-MPI
Furthermore, parallel runtime of both is shown in Table III. Fig. 8 shows the execution time and speedup of proposed parallel program on the above-mentioned datasets. Our parallel program achieved reducing the runtime of aligning sequences of length 34,500 from 164,613 s using ClustalW-MPI to 79,027 s using the proposed algorithm using 8 cores.
TABLE III Parallel Performance Measurements of Our Algorithm versus ClustalW MPI
Fig. 8. Performance comparison between parallel our algorithm, ClustalW-MPI
The speed-up of our parallel program achieved significant speedup of almost 3 for aligning sequences of longest length of sequence. Obviously, the sequences with the short length, the decrease the overall performance
To conform that the proposed program is most efficient than other executed programs, the parallel efficiency of the available cores was evaluated. It is the ratio of the speedup (S) with respect to the number of processors (P) [21]. It is given by the following equation:
Results are shown in Table IV. It is clear that the efficiency of our algorithm is exponentially increasing as the length of sequences increases. In addition, our program supreme efficiency was up to 0.35 with respect to the ClustalW-MPI for the longest sequence length up to (34 k).
TABLE IV Efficiency Comparisons Using 8 Cores
There is another performance measurement used in computational biology called billion cell updates per second (GCUPS). A GCUPS represents the time for a complete computation of one entry in the similarity matrix, including all comparisons, additions, and maximum operations. Table V shows the performance comparison for the datasets.
TABLE V Performance Comparison (in GCUPS) for Scanning the Datasets
6. CONCLUSIONS
This paper presented a new parallel algorithm for computing the nid-value in the pair-wise local alignment. The results of the proposed algorithm were used in the first stage of MSA. Since the new algorithm was implemented using vectorizing technique, we have got a significant improvement in the performance.
The program was able to calculate nid-value for sequences with length up to (34 k) residues. It surpasses ClustalW-MPI 0.13 with 2.9 speedup and the efficiency reached 0.35. A better performance can be achieved if more cores are provided. Furthermore, it can be accomplished a higher speedup with improved efficiency.
Program’s performance figures vary from a low of 0.438 GCUPS to a high of 3.66 GCUPS as the lengths of the query sequences decrease from 34,500 to 9200.