Recording of the 10th meeting, Robert Giegerich's study group

VSNS BioComputing Division - BioComputing Course

Robert Giegerich's study group

10th BioMOO session transcript, July 13th, 1995


Discussion Theme: The MSA algorithm


Special guest and invited lecturer:

John D. Kececioglu
Department of Computer Science
University of Georgia
Athens, USA

Participants:

(Check out our BioComputing Tutorial; your comments about the chapter on Multiple Alignment would be appreciated !)


There are also links to other VSNS-BCD Guest Lectures at BioMOO.

The recording: (See also the Announcement of this transcript.)

RobertDP [to JohnK]: "What relationships exist between the optimal pairwise alignments and the optimal multiple alignment?"

JohnK says, "That's a good question. Say we look at the minimum sum of pairs problem. If we compute an optimal sum of pairs alignment for a set of sequences, and compute an optimal alignment for any pair of sequences in the set, the score of the optimal pairwise alignment will always be better than the score of the induced alignment between those two sequences, looking only at their two rows in the multiple alignment."

RobertDP says, "you mean the score of EACH optimal pairwise alignment"

JohnK says, "That is why we can compute bounds on the value of an optimal alignment by solving alignment problems over pairs of sequences."

RobertDP says, "Mightn't the induced alignment have the same score as the optimal pairwise one?"

JohnK says, "Let's be more specific. Suppose we have a set of sequences S_1, S_2, ..., S_k.
Yes, RobertDP, the induced alignment may have the same score. In general, the induced score will be greater than or equal to the optimal pairwise score. (We are considering a minimization problem.)"

RobertDP says, can you relate this to the idea of projections?"

JohnK says, "Yes. The induced pairwise alignment score for a pair of sequences S_i, S_j, is the projected score onto the face i,j. I am not sure how to follow on with your question, RobertDP"

RobertDP says, "that is, the score achieved by the path you get when you project the optimal alignment onto that face?"

JohnK says, "Exactly."

RobertDP says, "What information do you get by projecting the OPTIMAL pairwise alignment back into the hypercube, and then doing that for ALL the pairs?"

Robertgi says, "Any projected path must have a higher score then the optimal ali path in that plane."

RobertDP is thinking about all this

JohnK says, "I'm not sure how much you have read about Carrillo-Lipman's idea, but they do essentially what you suggest. They define a region in the multidimensional graph, such that when we project a path in this region onto a face, it's projected value, which we might denote by c(A_i,j) for a pair of sequences S_i, S_j, has the relation c(A_i,j) <= D(S_i,S_j) + U - L, where D(S_i,S_j) is the pairwise distance between the two sequences, U is an upper bound on the value of an optimal sum of pairs multialignment, and L is a lower bound. I should say that from practical experience, these bounds are quite weak. Intuitively, the ``slop'' in U-L, grows quadratically in the number of sequences."

RobertDP says, "Weak in the sense that they define a large volume of the hypercube?"

JohnK says, "OK. Yes, they encompass almost the whole of a face, so their reverse projection into the hypercube ... is almost all the hypercube."

RobertDP says, "That's a very helpful insight"

Robertgi says, "This is really bad news for the practitioner..."

JohnK says, "It's bad news for anyone who has nothing better than this one idea of Carrillo and Lipman. There is another way to reduce the search space."

Robertgi says, "QUESTION to John: What size and number of sequences can we expect to align in the near future?"

JohnK says, "Another good question. It depends on the distances between the sequences. Right now, we can align optimally at most 6 globin sequences, using Carrillo-Lipman's idea."

RobertDP says, ""Is what you call distance what we call edit distance?"

JohnK says, "(Globins are very close.) Yes, RobertDP. Globins are around 200 amino acids long. If we consider families of sequences that are more distantly related, such as retroviral sequences, Carrillo-Lipman can do only 4. But we should understand that optimally aligning 4 sequences of length 200 is quite a feat. The volume of the full dynamic program is (200)^4, which is 16 * 10^8.

RobertDP says, "volume of full dynamic program = volume of the hypercube?"

JohnK says, "I don't like to refer to it as a hypercube."

Robertgi says, "Why not?"

JohnK says, "It is formed from hypercubes. If you want to be precise, it is a lattice, whose vertices have integer coordinates in the space [0,n]^k for k sequences of length n."

RobertDP says, "so volume of the full dynamic program = volume of the lattice?"

Robertgi says, "QUESTION to John: Can you give us an idea of the new approach that you recently developed?"

Robertgi asked this question because he had to leave John's talk early for the class.

RobertDP says, "clever"

JohnK says, "There are several new ideas. First, we do not prune the shortest-path computation by defining a projected region on faces. We find a shortest source-to-sink path, which corresponds to an optimal alignment, by running Dijkstra's algorithm. This algorithm finds a shortest path in a breadth-first computation, maintaining a queue of vertices for which a shortest path from the source is known. When we extract a vertex from this queue, we look at all the edges coming out. Suppose the vertex v is extracted, and we are looking at edge (v,w). Suppose further that we have an upper bound U on the optimal alignment value. We will not consider edge (v,w) if
distance(s,v) + length(v,w) + lowerbound(w,t) > U
Here, s and t are the source and sink vertices."

RobertDP says, "sink is the opposite corner?"

Robertgi says, "The lowerbound(w,t) means you need some lookahead towards the sink?"

JohnK says, "Yes, RobertDP, the sink is the opposite corner, corresponding to vertex (n,n,n,...,n)."

JohnK says to RobertGi, "lowerbound(w,t) is a lower bound on the length of a shortest path from w to t. We can compute this simply by summing, for every pair of sequences, the cost of an optimal pairwise alignment over the suffixes given completing the path from w to t. Actually, we do not recompute these costs every time we look at an edge (v,w) -- we look them up in a table. The details are not too bad, but hard to describe here in real-time."

Mykol says, "Where do U & L come from?"

RobertDP says, "so the optimal pairwise alignments are made over JUST the suffixes, not full length ones?"

Robertgi says, "As with Carillo-Lipman, the upper bound comes from an heuristic alignment."

JohnK says, "Yes, RobertDP. As RobertGi just said out loud here,"

RobertDP says, "I'm still having trouble with the definition of heuristic in this context"

JohnK says, "think of the normal dynamic programming matrix, calculated backwards."

Robertgi says, "A heuristic one is an alignment determined in any way, possibly far from optimal."

JohnK says, "Since we are looking at a minimization problem, any feasible solution, namely any alignment, gives an upper bound on the minimum."

RobertDP is beginning to see the light

JohnK says, "Of course, the better the heuristic alignment, the tighter the upper bound. An optimal alignment gives the tightest possible upper bound, but we don't have a fast way to find it!"

Robertgi says, "A brute heuristic alignment would be e.g.

acgctagctt
acctt-----
tattgtt---

Probably far from optimal!!"

JohnK says, "(There are lots more ideas too.)"

RobertDP says, "Are these ideas critical to success, or could we sit back and wait for several order of magnitude increase in computer power?"

GeorgF says, "I'd be interested in alternative lower bounds L."

JohnK says, "These are such good questions! If we could sit back and wait, we could solve larger problems, but the annoying feature of these algorithms is that they consume as much space as steps in the computation."

RobertDP says, "positive feedback is hard to get in this business (science)...be sure to drink deeply"

Mykol says, "It seems that the success of the method, in terms of reducing the complexity of the alignment computation, lies in optimizing U & L."

GeorgF says, "I know that Dan Gusfield has contributed an alternative lower bound, using a star alignment. Are there others ? Is it still a good idea to look for more ?"

JohnK says, "(Essentially, every step of the computation allocates another memory cell.) One very quickly exhausts all the available memory, and everything comes to a halt. Space is the bottleneck. I wish I could respond to everybody. Tighter upper and lower bounds will help, but only up to a point. For instance, when writing the MSA code, I would sometimes rerun the computation, knowing the value of an optimal alignment. We often got very little additional pruning. What is very difficult is to get a good lower bound.
In principle, one can obtain tighter and tighter lower bounds as follows. For every triple of sequences, one could compute an optimal 3-way alignment. The sum of all these 3-way alignments, when divided by the number of times a pair appears in all triples, will give a lower bound on the value of the optimum. But, it gets increasingly expensive to compute all 3-way alignments. Moreover, we have to store these values to look them up during the shortest-path computation. Now, instead of storing values on each face, we store a value in a 3D volume, for every 3 sequences, this requires a LOT of space -- for k sequences of length n, roughly O(k^3 n^3). The k^3 factor is not too bad, but the n^3 factor kills this idea."

Robertgi says, "QUESTION to John: Do you have data on how effective this new shortest path technique is, compared to the previous Carrillo-Lipmann technique? And does this effectivity still depend on how different the sequences are?"

GeorgF says, "I'd like to know more about the "form" of the polytope that is still explored; does it have "Bubbles" ? (i.e. regions inside that are not explored ?)"

JohnK says, "One can prove that the shortest-paths bounding is strictly tighter than the Carrillo-Lipman bounding. It still suffers from the defect that the slop in the bounds grows like O(k^2), quadratically in the number of sequences. Every additional sequence weakens our bounds significantly. It may allow one to handle one more sequence than Carrillo-Lipman alone. (I should say that every additional sequence seems to require a new algorithmic idea.)"

RobertDP [to JohnK]: A guy named Rose, here at Hopkins, has recently made a lot of progress in the protein folding problem. I wonder if multiple alignment could be framed as similarities in protein folding. His algorithm simply looks at nearest neighbors and walks along the sequence"

JohnK says, "Yes, that is being locally greedy. There are two directions in my current research. First, I am interested (perhaps obsessed) with learning how many sequences we can solve to OPTIMALITY. This is what we have mostly been discussing. The second direction is to design good methods for computing near-optimal alignments. A promising way to compute near-optimal alignments seems to be to use local search, as follows. Use your favorite heuristic to compute an initial alignment. Then, repeatedly divide the sequences into two groups. Over each group we have an induced multiple alignment. Align these two alignments under the multialignment objective function. This produces a new alignment. Then again divide the sequences into two groups, and repeat."

Robertgi says, "Does this converge??"

JohnK says, "It could take an exponential number of iterations. But each iteration improves the value of the alignment. (We stop when there is no improvement possible by a division into just two groups.)"

RobertDP says, "It seems to me that a lot of current work in statistics is directed toward progressive search techniques as well"

Robertgi says, "Is it feasible to come up with a probabilistic estimate on how many iterations take us how close to optimality?"

JohnK says, "Wow, I have no idea. Few problems are amenable to a probabilistic analysis. The real proof is to try it."

GeorgF still wonders about these bubbles :)

RobertDP [to GeorgF]: Bubbles, as in stock prices?

GeorgF says, "Bubbles==Regions inside the polytope defined by Carrillo-Lipman that are NOT explored"

JohnK says, "Bubbles. Yes, the explored space can have holes in it, but it will be connected."

GeorgF nods.

JohnK says, "The bubbles are most pronounced when the scoring function has a high gap-penalty. Then there can arise regions which a pairwise alignment has to ``go around''. The two ways to go around a region on a projected face leave a hole -- or a bubble. To be more descriptive, if there is a large insertion in the middle of a sequence, then the paths on a face involving this sequence will go around the inserted region. We cannot wander through the interior, as this would involve starting and stopping many gaps, which is prohibitive with a high gap penalty."

RobertDP says, "This was a great class today...do we have to make room for the bioethics discussion?"

Robertgi says, "We should come to close within 10 minutes. I think we should adjourn with a BIG THANKYOU to John."

JohnK says, "It has been a real pleasure to participate."

RobertDP says, "THANK YOU"

JohnK says, "Thanks for all the excellent questions!"

Henning says, "THANK YOU"

Robertgi says, "Good luck with the ethics session. Don't forget to tape it..."

End of the recording


Bionet Announcement.

Techniques for the optimal Multiple Alignment of Protein (and Nucleotide) Sequences were the focus of the Guest Lecture given by Dr. John Kececioglu of the Department of Computer Science at the University of Georgia, Athens, USA, on July 13 1995, at the electronic conferencing system BioMOO.

The transcript is available at the following location (WWW/hypertext):
http://www.techfak.uni-bielefeld.de/bcd/Lectures/kececioglu.html

In the form of a Questions and Answers session, the Carrillo-Lipman approach to Multiple Alignment was discussed, with a special emphasis on practical considerations. The corresponding software, known as MSA 1.0, has recently been improved yielding MSA 2.1, currently the most powerful tool for optimal multiple sequence alignment. Dr. Kececioglu, who co-authored the corresponding publications and code [LAK89] [GKS95], explained the new ideas behind MSA 2.1. The meeting closed with some remarks about alternative lower bounds, and a way of calculating near- optimal alignments using a local search technique.

Dr. Kececioglu was special guest in Robert Giegerich's study group, which was one of 7 groups of students that participated in the Summer 1995 Biocomputing Course organized by the VSNS Biocomputing Division / Bielefeld University, sponsored by GNA's Virtual School of Natural Sciences, and the Association for the Promotion of Science and Humanities in Germany. (see http://www.techfak.uni-bielefeld.de/bcd/welcome.html)

The lecture was held in the virtual classroom at BioMOO [And94], and both John Kececioglu and Robert Giegerich connected from the 1994 Dagstuhl meeting on Molecular Bioinformatics. Other participants connected from John Hopkins School of Medicine / USA, Harvard Medical School / USA, and Bielefeld / Germany.

References:

[And94] Christopher Anderson, Cyberspace Offers Chance To Do 'Virtually' Real Science. Science 1994;264:900-901
BioMOO URL: http://bioinfo.weizmann.ac.il:8888/
[GKS95] S. K. Gupta, J. Kececioglu, A. A. Schaeffer, Improving the Practical Space and Time Efficiency of the Shortest-Paths Approach to Sum-of-Pairs Multiple Sequence Alignment. J Computational Biology, to appear.
URL: http://ibc.wustl.edu/msa.html
[LAK89] D. J. Lipman, S. F. Altschul, J. Kececioglu, A Tool for Multiple Sequence Alignment. Proc Nat Acad Sci USA 1989;86:4412-4415

Back to VSNS BioComputing Division Home Page.