Dr. William R. Pearson
Dept. of Biochemistry
University of Virginia
GeorgF says, "Welcome to the Electronic Conference, by Bill Pearson,
On Distant Evolutionary Relationships!
The event has been organised by the VSNS BioComputing Division, with a
lot of help from BioMOO at the Weizmann Institute
(and a lot of other volunteers, notably David Atherton, Heinz Hemken,
Eitan Rubin, and the person I now forget; pls yell at me :)
I'd like to thank Bill Pearson for engaging in this cutting-edge way of distributing information. I'd like him to say a few introductory comments now, telling us how he would like the lecture/discussion to proceed."
Pearson says, "Good morning. I appreciate all of you coming today to discuss protein sequence comparison. I am a very unsophisticated MOO user, so I can do only simple things. Since I am much more comfortable giving a talk where I am physically in front of the audience, I have prepared some slides to focus on the main issues in identifying distantly related proteins."
Felder says [to Pearson], "which application made these structure homology searches?"
Pearson says, "I did not do any structural comparisons. I simply picked
examples of proteins that are known to be homologous.
In the first slide, I introduce the fundamental problem facing a person doing a sequence database search. The program, whether it be BLAST, FASTA, or Smith-Waterman, measures sequence similarity - how do you know when a certain level of similarity implies homology?
Now that FASTA and SSEARCH provide statistical estimates, one can often, but not always, use statistical significance as a guide to sequence homology. However, many homologous proteins do not share pairwise statistical significance, and sometimes statistical significance is the result of peculiar amino acid sequences rather than genuine divergence from a common ancestor. The serine protease structures on the first slide demonstrate this.
Closely related sequences, like bovine trypsin and S. griseus trypsin, have very high similarity scores (highly significant) and very similar structures. But we cannot detect significant sequence similarity between bovine trypsin and S. griseus protease A, even though they have very similar structures as well.
So significant similarity can imply homology, but absence of significant similarity does not tell us anything.
The point of the first slide is that when sequences are homologous, the proteins have the same structure. But if they are not homologous, then one does not know anything about the structure of the protein, even if it has a high (but not statistically significant) similarity score. Any questions?"
Robertgi says, "It is clear that high similarity does not imply same structure/function. Doesn't it still imply some evolutionary relationship?"
Pearson says, "If the sequences are not homologous (they do not share a common ancestor), then high similarity means nothing. In the first slide, you can see that the Endochitinase, which is the highest scoring non-serine-proteinase, has no similarity at all with the serine proteases."
Maja says, "high similarity over the whole protein, or high similarity over short regions?"
Pearson says, "If the sequences have high similarity (>25%) over their entire length then they are almost always homologous. It is high similarity over shorter regions, e.g. 60% over 20 residues, or 80% over 8, that does not mean anything."
SophiaK says, "Can it be that two sequences share high similarity but they are not homologous? If this is the case, then this implies that we have to always consider the tertiary structure of the proteins."
YAGIL says, "could you state what Yellow and Red on the slide means ? homology and/or similarity?"
Gustavo says, "But you usually don't have evolutionary information, only similarity. I understand Robert's question was how/when you can imply homology from similarity."
Pearson says, "you can see the same thing on the globin slide (slide 3) where the NADH peroxidase (again the highest scoring unrelated protein) has no structural similarity"
GeorgF says, "These are enough questions for the moment. Let's now tackle them one at a time!"
Pearson says, "The yellow ribbons are beta-sheets, the red is alpha helices. When I say that high-scoring, unrelated sequences do not share a common structure, I mean high scores that are not statistically significant. In both the structure slides, an expectation value E()=??? reports the statistical significance of the match. Expectation values < 0.02 are usually significant (and imply homology) but expectation values > 1 are not significant. The expectation value is the number of times you would expect to obtain a similarity score at least that high by chance if a search was done with a random sequence. Is the meaning of expectation value clear?"
DonovanH nods. [2 additional nods.]
DavidGP says, "Why 0.02?"
Gustavo doesn't understand why each structure seems to have an E() value associated, instead of pairs of structures.
Pearson says, "Gustavo is correct, a single sequence (or structure) cannot have an E()-value (or a similarity score). The E()-values refer to the expectation value of the similarity score obtained by comparing bovine trypsin to the structures on the slide in a search of Swiss-Prot. The other half of the pair - trypsin - is not shown."
PaulSt says, "Have you found statistically significant matches which are NOT similar in structure?"
Pearson says, "The standard in statistics would usually be 0.05 - but I am a bit more conservative, so I said 0.02."
Miri says, "In the first slide we can see E<66 and E<4.2 . Shouldn't it be E>66 ?"
Pearson [to Miri]: "In this case, E()<66, which is not statistically significant (even though the proteins are thought to be homologous based on their structures and functions)."
Pearson says, "You can get statistically significant scores sometimes because you have a simple sequence repeat, e.g. PYGPYGPYG. In those cases, statistical significance does not imply homology. But those are usually easy to detect. In general, statistical significance => homology => similar structures."
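As a rough illustration of why a simple repeat such as PYGPYGPYG is easy to spot, the sketch below measures how often a sequence matches itself when shifted by a fixed period. This is not the method used by FASTA or SSEARCH; the function names `repeat_fraction` and `best_period` are hypothetical and the approach is only one of several possible heuristics.

```python
def repeat_fraction(seq: str, period: int) -> float:
    """Fraction of positions whose residue equals the one `period` residues later."""
    pairs = len(seq) - period
    if pairs <= 0:
        return 0.0
    matches = sum(1 for i in range(pairs) if seq[i] == seq[i + period])
    return matches / pairs


def best_period(seq: str, max_period: int = 10):
    """Return (period, fraction) for the shortest period with the highest self-match fraction."""
    best = (0, 0.0)
    for period in range(1, max_period + 1):
        fraction = repeat_fraction(seq, period)
        if fraction > best[1]:  # strict '>' keeps the shortest period on ties
            best = (period, fraction)
    return best


print(best_period("PYGPYGPYG"))             # (3, 1.0): a perfect 3-residue repeat
print(best_period("ACDEFGHIKLMNPQRSTVWY"))  # (0, 0.0): no residue recurs at any period
```

A fraction close to 1.0 at a short period suggests that a high score against this sequence may reflect the repeat rather than common ancestry.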
Nikob says, "I don't understand why the expectation value of lupine leghemoglobin is > 100 in the globin slide. To me, it looks very homologous to globin."
Pearson says, "Yes, this is exactly the point. A comparison of human alpha globin and yellow lupin sequences has a very poor similarity score, even though the proteins are homologous and thus have the same structure."
Robertgi says, "Is the value of statistical significance dependent on the size of the data base we search?"
Pearson says, "Yes - the statistical significance depends on the size of the database search. I usually think in terms of the significance after a search of a database the size of SwissPROT - about 40,000 entries."
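A minimal sketch of the database-size dependence, assuming the expectation value is the per-comparison probability of a chance score multiplied by the number of sequences searched. The probability used here is an invented number, and `expectation_value` is a hypothetical helper, not a function from any of the programs discussed.

```python
def expectation_value(p_single: float, db_entries: int) -> float:
    """Expected number of chance matches scoring this well in a search of db_entries sequences."""
    return p_single * db_entries


P_SINGLE = 5e-7  # hypothetical probability for one pairwise comparison

# The same alignment is judged against two database sizes.
for entries in (40_000, 400_000):
    print(f"{entries:>7} entries: E() = {expectation_value(P_SINGLE, entries):.3g}")
```

With these assumed numbers, a match that sits at the 0.02 threshold in a database of 40,000 entries has an expectation value ten times larger in a database ten times the size.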
Iddo [to Pearson]: "what's your criterion for structural similarity: c-alpha RMS comparison?"
Pearson says, "In the case of the examples shown here (serine proteases and globins), structural similarity is very strong. I do not have a general criterion for structural similarity, but I am using examples where it cannot be disputed. Shall we go on?"
Iddo nods. [3 additional nods.]
Pearson says, "Let's move to the slide after the globins, which shows the relationship between sequence similarity score and sequence length. The programs calculate the statistical significance of a match by correcting the similarity score for the ln(length) effect, and then using the extreme value distribution to calculate the probability of obtaining the length-normalised score. You can see that by correcting for the length effect, shorter related sequences are more easily separated (in similarity score) from longer unrelated sequences. This allows one to detect more distant relationships. Is it relatively clear that by correcting for length effects, one can detect more distant relationships?"
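The length correction described here can be sketched roughly as follows: fit the scores of presumably unrelated sequences against ln(length), express a score as a z-score relative to that fit, and convert the z-score to a probability with an extreme value distribution of mean 0 and variance 1. The lengths and scores below are invented for illustration, the function names are hypothetical, and the fitting done by the real programs is more elaborate than this plain least-squares fit.

```python
import math

EULER_GAMMA = 0.5772156649


def fit_log_length(lengths, scores):
    """Least-squares fit of score = a + b * ln(length); also returns the residual s.d."""
    xs = [math.log(n) for n in lengths]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(scores) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx
    a = mean_y - b * mean_x
    residuals = [y - (a + b * x) for x, y in zip(xs, scores)]
    sd = math.sqrt(sum(r * r for r in residuals) / count)
    return a, b, sd


def z_score(score, length, a, b, sd):
    """Distance of a score from the value expected for its length, in s.d. units."""
    return (score - (a + b * math.log(length))) / sd


def p_from_z(z):
    """P(Z >= z) for an extreme value distribution with mean 0 and variance 1."""
    return -math.expm1(-math.exp(-math.pi / math.sqrt(6.0) * z - EULER_GAMMA))


# Invented scores for unrelated library sequences of increasing length.
lengths = [50, 100, 200, 400, 800, 1600]
scores = [22, 27, 29, 35, 37, 42]
a, b, sd = fit_log_length(lengths, scores)

# The same raw score is more surprising for a short sequence than for a long one.
for length in (100, 1600):
    z = z_score(60, length, a, b, sd)
    print(f"length {length:>4}: z = {z:.1f}, P = {p_from_z(z):.3g}")
```

Multiplying the resulting probability by the number of database sequences would give an expectation value of the kind shown on the slides.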
MarkP nods. [4 additional nods.]
AndyL says, "Some of the related scores are still not that much different from the unrelated scores of similar length, though."
Pearson says, "Yes, that is always going to be true for large, diverse
sequence families. There will always be distantly related sequences that have
lower scores than unrelated sequences.
But remember, this distribution of scores is based on a single query sequence. The later slides will show that by searching with several family members, you can often detect (obtain statistically significant scores for) these distantly related sequences."
MarkP says, "is it useful to repeat the searches with each of the high scoring results?"
Pearson says, "Before looking at the benefits of searching with several sequences, let's convince ourselves that the statistical estimates are accurate. Since these are estimates for alignments with gaps (unlike BLASTP), it is possible to use gap penalties that break the statistics. Look at the next slide (Accurate statistics with Smith-Waterman)."
AndyL says, "to MarkP: wouldn't it be better to repeat the search with the lower scoring but related sequences?"
EitanR says, "Are we talking about local similarity only? Are the scores FastA scores? SM?"
Pearson says, "We are talking only about local similarity, but as the slides show, it doesn't matter whether you use Smith-Waterman or FASTA"
MarkP says [to AndyL] "yeah, I suppose it would be best to use the lowest scoring but unequivocally significant sequences"
Pearson says, "You guys are getting ahead of me. But AndyL has summarised the point of the talk - do additional searches with sequences with lower, but statistically significant scores"
AndyL says, "Sorry :-)"
Pearson says, "No problem. To convince yourself that the statistical estimates
are accurate, you need to look at two things:
(1) the agreement between the expected and actual number of sequences obtaining high scores. This is shown in the inset.
(2) Another good way is to look at the expectation value for the highest scoring unrelated sequence.
But, you say, how do I know whether a sequence is unrelated? For large families of proteins, like the examples here, you can do additional searches. A low-scoring related sequence should produce good scores with other family members. A high-scoring unrelated sequence usually will not preferentially select family members. Is it clear how doing additional searches can help you tell a low-scoring related from a high scoring unrelated?"
AndyL says, "The low scoring sequence will only pull other members out that have evolved in the same direction though"
TRBurglin nods. [3 additional nods.]
AndyL says, "presumably you may need to do several such extra searches?"
Pearson says, "Not really. If you have a large family, then you have examples of many of the sequences that can produce the same folded structure."
Brian Brunk says, "Yes, but how does one initially determine which sequences to use as related and unrelated?"
StuY says, "yeah"
TRBurglin says, "never forget inspection by eye"
Pearson says, "In general, a distantly related sequence will be relatively close to at least some members of the family. I will show this with the G-protein coupled receptor family in just a bit. Let's skip forward to the slide labelled Protein vs DNA comparison. The most important thing you can learn today is that protein sequence comparisons are far far far far more sensitive than DNA sequence comparisons. The Protein vs DNA slide shows the expectation values obtained when a search was done with a protein sequence against genpept and the cDNA was searched against the mammalian part of Genbank. Note that there are many related sequences with very good expectation values in the protein comparison that have terrible expectation values with DNA."
MarkP says, "When searching with the protein I work on, Dps, I found 7 homologs with BLAST using the salmonella sequence. However, one extra homolog that was not pulled out by the Salmonella sequence, was pulled out by searches with a couple of the other homologs."
SophiaK [to MarkP]: what is your cut-off to decide for a homologue?
Pearson says [to MarkP], "yes, that is exactly what I am saying. One should do multiple searches with low scoring homologues."
MarkP says [to SophiaK], "I don't recall exactly, but it was something very low, like 0.001 or less"
Pearson says [to SophiaK], "when you are using SSEARCH or FASTA, cut off
values around 0.02 work well. Since BLAST uses a segment-based approach, you
may want to be more conservative.
OK, does everyone promise not to do DNA sequence comparisons any more (except when looking at repetitive DNA families)?"
TRBurglin says, "I think there is one exception"
EitanR says, "I'm working on repetitive DNA - NEVER IGNORE DNA altogether!"
Gustavo promises to do many more DNA seq comparisons. For intergenic sequences. :)
MarkP says, "The relationship was confirmed by eyeballing a multiple alignment--certain residues were absolutely conserved even in the distant relative (which was only a partial seq)"
AndyL says, "EitanR: me too"
Maja says, "I have heard of occasional exceptions where the DNA sequence is more similar than the protein sequence."
TRBurglin says, "I found it very useful to backtranslate protein sequences and do dna searches but one requires a perfect match then"
Iddo says, "How is codon redundancy factored into all this?"
Pearson says, "Well, just remember that when I do a protein sequence comparison, I can easily look back more than 1 billion years. For non-coding sequences, DNA sequence comparison rarely works farther back than 200 My [[million years]]. Proteins allow you to look back 5X farther, typically."
Gustavo says, "By translating DNA sequence into protein sequence, you lose information. If protein searches are more sensitive than DNA searches, this means that the algorithms for DNA searches have to be improved, not dropped off entirely. :)"
Pearson says, "While you do lose some information by translating DNA
sequences, you gain two pieces of information that are extremely valuable.
(1) You have the coding frame. (2) you have the length of the sequence.
Since DNA is one chromosome-long sequence, the length corrections shown
earlier cannot be applied as easily.
OK, the slide entitled "Examining Marginal Sequence similarity" shows exactly what we have been talking about. Doing additional searches with marginally significant matches to see whether you get good scores with family members or not. I think it is time for me to wrap this up and get some more questions. I will leave the web pages on for some time (perhaps indefinitely). The set of pages on the G-protein coupled receptors [[first page]] [[second page]] shows another example of how you can use low-scoring related sequences with expectation values around 0.01 to find missing links between distantly related protein families.
Let's open the discussion for questions, and plan to finish up in the next 10-15 minutes."
PaulSt says, "What do you suggest when searching for relationships in non-protein coding regions?"
Mike says, "Bill - can you translate E values into confidence levels? I think I read that 0.05 ~ 95%?"
Pearson says, "The statistical estimates for proteins can also be used for
DNA. There is an earlier slide that shows this. An E-value is an E-value -
so 0.02 works there as well.
Yes - 0.05 => 95% confidence that the score was not caused by a chance event."
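The link between an E-value and a confidence level can be sketched under one common assumption, not stated in the lecture: that the number of chance matches follows a Poisson distribution. The probability of at least one chance match scoring this well is then 1 - exp(-E). The function name below is hypothetical.

```python
import math


def chance_hit_probability(e_value: float) -> float:
    """P(at least one chance match) = 1 - exp(-E), assuming Poisson-distributed chance matches."""
    return -math.expm1(-e_value)


for e_value in (0.02, 0.05, 1.0):
    p = chance_hit_probability(e_value)
    print(f"E() = {e_value:<4}  P = {p:.4f}  confidence = {100 * (1 - p):.1f}%")
```

For small E-values the probability is nearly equal to E itself, which is why 0.05 corresponds to roughly 95% confidence; at E() = 1 the two numbers diverge noticeably.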
EitanR says, "In the case of the ISOPENTENYLTRANSFERASE, you get a mixture of GSTs [[Glutathione-S Transferases]] and many other things. What do I do if my favorite protein gives me this kind of result? Do you have any cook-book recommendations?"
Pearson says, "That is a good question. The interesting thing is, all those GST's are very similar to one another. So you haven't really hit on a bunch of different GST's, as would be expected for a genuine homology. You have a chance high score with one GST, and then, of course, a similar high score with other GST's that are 95% identical. There are some rules-of-thumb given in the summary. Any other questions or comments?"
TRBurglin says, "how about small matches? Even though they are not evolutionarily relevant"
Mike says, "how do you deal with frame-shift errors in dna searches?"
Brian Brunk says, "Bill, you mentioned that one needs to be more conservative in interpreting BLAST matches, which search program do you recommend?"
AndyL says, "How useful do you think motif database searches can be?"
Pearson says, "OK - one at a time:"
Pearson says, "(1) small matches - if the structures are not homologous, small matches do not mean anything."
TRBurglin says, "can they show convergent structures? How about small matches 20-40 long, which"
Pearson says, "That was the point of the serine protease and globin structure examples. High scoring unrelated sequences (with small matches) have completely different structures."
Pearson says, "RE: Motifs - yes, they can certainly be useful. Sometimes they extract the essential information required to identify very diverse sequences. But do not confuse pattern classification with biology; if the sequences are homologous, they share a common fold and all the amino acids in the fold must be conserved. The fact that a very small number of amino acids have a very highly conserved pattern does not make a protein function or protein fold. If you put that motif in a non-homologous protein, you would not recreate the function of the motif."
Maja says, "yes, what about convergent evolution at active sites?"
TRBurglin says, "what I meant with the small similari... ok, no convergent structures?"
Pearson says, "RE: frameshifts - I don't have to deal with them, I only look
Seriously, FASTA, because it does a limited Smith-Waterman that allows gaps, can do a much better job than BLASTN on error-prone DNA data."
Gustavo . o O ( error-prone, or mutated )
Pearson says, "to Brian - The current versions of FASTA and SSEARCH are significantly more effective than BLASTP for the most distant relationships. I have a paper in Prot. Science ((1995) 4:1145-1160; see the reading list)."
DonovanH [to Pearson]: "Have you ever tried supplementing a protein search with DNA sequence comparison? Due to degeneracy, you should gain some small info, right?"
MarkP says, "hang on, what do you mean you don't have to deal with frameshifts? many sequences in the databases are riddled with small frameshifts if you look hard for them."
Pearson says, "RE: convergent structures - one almost never detects significant similarity as the result of convergence. The most common exception is repetitive coiled-coil alpha-helical structures."
Iddo [to Pearson]: "Can you factor into the sequence similarity amino-acid substitution (such as Leu-->Ile ) which have small structural significance?"
Pearson says, "Molecules that have a repeated coiled-coil structure often share significant similarity even though they are not homologous."
Mykol says, "Leu => Ile can have a dramatic effect in the multimerization of coiled-coils, for instance."
Pearson says, "to Iddo - the PAM250 and BLOSUM50 scoring matrices that are used for protein sequence comparison factor in the Ile->Leu similarity"
Iddo says, "I gave a bad example... But I hope that my point was taken.
Certain a.a. substitutions have smaller structural effects than others"
Iddo [to Pearson]: Thanks.
Pearson says, "RE: frameshifts. We are working on a FASTX type of program that does a protein sequence comparison using the underlying, possibly
frameshifted, DNA sequence.
OK folks, are we all done?"
Brian Brunk says, "Yes, Thanks!!"
Mykol says, "Thanks!! Especially for the references."
Mike says, "yep - thanks Bill!"
[More applause from the audience deleted.]
TRBurglin says, "may I come back to the DNA example?"
MarkP says, "what about scoring in the two different sorts of serine codon?"
Pearson says, "To TRBurglin - yes?"
GeorgF says, "I'd like to thank Bill Pearson, and all the people who helped, for this very successful event. I hope we will see Bill Pearson again at another time, and we hope that additional distinguished researchers will join him!"
Miri says, "Thanks for the interesting experience."
Mike says, "bye all. Good lecture :)"
TRBurglin says, "basically, although it is a DNA search, it is really a protein against DNA in all six frames search. For example I worked with homeobox genes, and by taking the most conserved region, "
Pearson says, "To TRBurglin - homeobox genes are very highly conserved, so they are not a very good example."
TRBurglin hmm, perhaps we can email? :-)
Pearson says, "Certainly, I am firstname.lastname@example.org"
Pearson says, "Good bye ALL. I enjoyed my visit."
GeorgF says, "(Sales pitch :-)
If you'd like to inform yourself about the VSNS BioComputing Division, try http://www.techfak.uni-bielefeld.de/bcd/welcome.html. We'll start another course in Spring/Summer 1996, and need instructors and consultants, see http://www.techfak.uni-bielefeld.de/bcd/plans.html."
Available: Electronic Meeting transcript on "Protein Evolution - How far back can we see?", by Dr. W. Pearson.
Distant protein sequence relationships and their detection were the focus of the Guest Lecture given by Dr. William R. Pearson of the Department of Biochemistry at the University of Virginia, USA, on December 5, 1995, at the electronic conferencing system BioMOO, with participants connecting from 10 countries.
The edited transcript includes links to a slide show prepared by Dr. Pearson, and is available at the following locations (WWW/hypertext): http://www.techfak.uni-bielefeld.de/bcd/Lectures/pearson2.html http://merlin.mbcr.bcm.tmc.edu:8001/bcdusa/Lectures/pearson2.html http://info.mcc.ac.uk/hpctec/courses/Biocomputing/vsns/bcd/Lectures/pearson2.html
The statistical significance of similarity scores, and their implications for homology and structural similarity, were discussed first. Dr. Pearson then explained how these scores can be corrected for the lengths of the sequences, and how their accuracy can be assessed. Using additional searches to tell a low-scoring related sequence from a high-scoring unrelated one was the next topic of the 2-hour conference. Our guest explained that the usual superiority of protein sequence comparisons (in contrast to the comparison of the protein-coding DNAs) is due to knowledge about the coding frame, and the length of the sequence, even though there is a loss of codon information. The participants discussed this and a wide range of other issues until the end of the lecture.
The Electronic Conference was organised by the VSNS Biocomputing Division, sponsored by the Virtual School of Natural Sciences, and the Association for the Promotion of Science and Humanities in Germany (see http://www.techfak.uni-bielefeld.de/bcd/welcome.html). We would like to thank Francisco De La Vega, Alexander Sczyrba, David Atherton, Andy Tsouladze, Chris Lilley, Heinz Hemken, Eitan Rubin, Gustavo Glusman, and other "BioMOO folks" (see http://bioinfo.weizmann.ac.il/BioMOO) for their assistance.