Sequence analysis

Friday, August 13, 2010 Tags: Bioinformatics, Bioinformatics Database, DNA Sequencing, RNA sequence, Sequence analysis

Basics for sequence analysis

Proteins

A protein is typically built from a series of basic building blocks called amino acids, chained together in a linear sequence. Amino acids come in a variety of shapes and properties: they may be small or bulky, hydrophobic or hydrophilic, electrically charged or neutral, and so on, which allows very complex shapes and interactions to be produced.

Amino acids are commonly referred to by name or by an abbreviation, usually of three letters or one letter. This allows for more efficient descriptions of how they are chained together to build a protein:

Class	Amino acid	3-letter	1-letter
Neutral-Nonpolar	Glycine	Gly	G
Neutral-Nonpolar	L-Alanine	Ala	A
Neutral-Nonpolar	L-Valine	Val	V
Neutral-Nonpolar	L-Isoleucine	Ile	I
Neutral-Nonpolar	L-Leucine	Leu	L
Neutral-Nonpolar	L-Phenylalanine	Phe	F
Neutral-Nonpolar	L-Proline	Pro	P
Neutral-Nonpolar	L-Methionine	Met	M
Neutral-Polar	L-Serine	Ser	S
Neutral-Polar	L-Threonine	Thr	T
Neutral-Polar	L-Tyrosine	Tyr	Y
Neutral-Polar	L-Tryptophan	Trp	W
Neutral-Polar	L-Asparagine	Asn	N
Neutral-Polar	L-Glutamine	Gln	Q
Neutral-Polar	L-Cysteine	Cys	C
Acidic	L-Aspartic acid	Asp	D
Acidic	L-Glutamic acid	Glu	E
Basic	L-Lysine	Lys	K
Basic	L-Arginine	Arg	R
Basic	L-Histidine	His	H

Nucleic Acids

For nucleic acids the number of basic building blocks is much smaller: each nucleic acid chain is composed of a series of only four different nucleotides, which moreover provide a very limited set of interactions.

Nucleic acids come in two flavors: DNA (DeoxyriboNucleic Acid) and RNA (RiboNucleic Acid). Both consist of a series of nucleotides joined one after the other to form the sequence of blocks that makes up the functional chain.

Nucleotides are composed of a phosphate group, a sugar (ribose in RNA, deoxyribose in DNA) and a base, which is what distinguishes one nucleotide from another. The base may be guanine, cytosine, adenine or thymine in DNA, and guanine, cytosine, adenine or uracil in RNA. They are referred to by their one-letter abbreviations G, C, A, T and U. Interactions are mainly driven by the establishment of hydrogen bonds, which can only form between thymine (or uracil) and adenine (two hydrogen bonds) and between cytosine and guanine (three hydrogen bonds).

The main role of nucleic acids is to convey all the genetic information needed to make proteins and to control the building process. Protein sequences are coded by nucleic acids using groups of three nucleotides (triplets, or codons), each of which codes for a given amino acid. The code is more or less universal, with few exceptions, and includes redundancy to increase the fidelity of the reading process when making duplicates or translating the information:

Codon	Amino acid	Codon	Amino acid	Codon	Amino acid	Codon	Amino acid
UUU	Phe	UCU	Ser	UAU	Tyr	UGU	Cys
UUC	Phe	UCC	Ser	UAC	Tyr	UGC	Cys
UUA	Leu	UCA	Ser	UAA	Stop	UGA	Stop
UUG	Leu	UCG	Ser	UAG	Stop	UGG	Trp
CUU	Leu	CCU	Pro	CAU	His	CGU	Arg
CUC	Leu	CCC	Pro	CAC	His	CGC	Arg
CUA	Leu	CCA	Pro	CAA	Gln	CGA	Arg
CUG	Leu	CCG	Pro	CAG	Gln	CGG	Arg
AUU	Ile	ACU	Thr	AAU	Asn	AGU	Ser
AUC	Ile	ACC	Thr	AAC	Asn	AGC	Ser
AUA	Ile	ACA	Thr	AAA	Lys	AGA	Arg
AUG	Met	ACG	Thr	AAG	Lys	AGG	Arg
GUU	Val	GCU	Ala	GAU	Asp	GGU	Gly
GUC	Val	GCC	Ala	GAC	Asp	GGC	Gly
GUA	Val	GCA	Ala	GAA	Glu	GGA	Gly
GUG	Val*	GCG	Ala	GAG	Glu	GGG	Gly

* GUG may also code for the initiator Met. This triplet is therefore "ambiguous".

Regulation of expression is encoded as specific patterns that are recognized by the translation machinery under appropriate circumstances.
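As a rough illustration of how the genetic code is read, the following Python sketch transcribes a DNA coding strand into mRNA and translates it codon by codon until a stop codon. The function names (transcribe, translate) and the example sequence are made up for this illustration; the table is the standard genetic code, and alternative initiators such as GUG are ignored.

```python
from itertools import product

BASES = "UCAG"
# One-letter amino acid codes ('*' = stop), listed in U, C, A, G order
# for the first, second and third codon position.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Turn a DNA coding-strand sequence into mRNA by replacing T with U."""
    return dna.upper().replace("T", "U")


def translate(mrna):
    """Translate an mRNA sequence codon by codon, stopping at a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate(transcribe("ATGGCCTTTAAATGA")))  # MAFK
```

The redundancy of the code shows up directly in the table: 64 codons map onto 20 amino acids plus the stop signal.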

Overview of sequence analysis tools

Sequence Comparison

An alignment is an arrangement of two sequences that shows where they are similar and where they differ. An optimal alignment is one that exhibits the most similarities and the fewest differences. Broadly, there are three categories of methods for sequence comparison.

• Segment methods compare all overlapping segments of a predetermined length (e.g., 10 amino acids) from one sequence to all segments from the other. This is the approach used in dotplots.

• Optimal global alignment methods allow the best overall score for the comparison of the two sequences to be obtained, including a consideration of gaps. These programs align sequences over their whole length.

• Optimal local alignment algorithms seek to identify the best local similarities between two sequences also including explicit consideration of gaps. Alignment may only be over a short span of sequence.

Dotplots

The most intuitive representation of a comparison between two sequences is the dotplot. One sequence is represented on each axis, and significant matching regions appear along diagonals in the matrix.

There are two algorithms commonly used to create dotplots. The first matches identical regions of sequence and plots a dot in these areas. The second uses "sliding windows" to compare the two sequences against a threshold score. A window size is selected as a run of adjacent nucleotide or amino acid residues, and a threshold score is chosen to reflect the degree of similarity required. Each window of sequence A is compared with each window of sequence B, and a dot is placed in that region only if the score equals or exceeds the threshold.
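The sliding-window method can be sketched in a few lines of Python. This is a minimal illustration, not the algorithm of any particular dotplot program: the score is assumed to be a simple count of identical residues in the window, and the names dotplot and render, as well as the default window and threshold values, are hypothetical.

```python
def dotplot(seq_a, seq_b, window=3, threshold=3):
    """Return the set of (i, j) window start positions whose windows share
    at least `threshold` identical residues."""
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(
                a == b
                for a, b in zip(seq_a[i:i + window], seq_b[j:j + window])
            )
            if matches >= threshold:
                dots.add((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the dotplot as text: seq_a down the side, seq_b across the top."""
    lines = ["  " + seq_b]
    for i, residue in enumerate(seq_a):
        row = "".join("*" if (i, j) in dots else "." for j in range(len(seq_b)))
        lines.append(residue + " " + row)
    return "\n".join(lines)


dots = dotplot("ACGTAC", "ACGTAC")
print(render("ACGTAC", "ACGTAC", dots))
```

Lowering the threshold below the window size tolerates mismatches inside a window, at the cost of more background dots.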

Online tool links:

Dotlet Programme

Learn dotlet by example

Sequence alignment

The algorithms we will be using are more rigorous than those used for searching databases, so even if you have retrieved a sequence from a database using something like BLAST, it can still be worth aligning it again with one of these programs. The basic idea behind the sequence alignment programs is to align the two sequences in such a way as to produce the highest score: a scoring matrix is used to add points to the score for each match and subtract them for each mismatch. The matrices commonly used for scoring protein alignments are more complex than the simple match/mismatch matrices used for DNA sequences; the scores that form the protein matrices are designed to reflect similarity between the different amino acids rather than simply scoring identities.

Over time various mutations occur in sequences. The scoring matrices attempt to cope with substitutions, but insertions and deletions require some extra parameters to allow the introduction of gaps in the alignment. There are penalties both for the creation of gaps and for the extension of existing ones. The default gap parameters given in alignment programs have been found empirically to work well with test sequences, but you should experiment with different gap penalties.
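To make the scoring scheme concrete, here is a small Python sketch that scores an alignment that has already been made, using a simple match/mismatch scheme for DNA plus separate gap creation and gap extension penalties. The function name and all the numeric values are assumptions chosen for illustration; real programs have their own defaults, and here a gap of length k is assumed to cost gap_open + (k - 1) * gap_extend.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1,
                    gap_open=-5, gap_extend=-1):
    """Score two aligned rows of equal length, where '-' marks a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    gap_in = None  # row holding the current run of gaps: "a", "b" or None
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            row = "a" if a == "-" else "b"
            # Opening a new gap costs more than extending an existing one.
            score += gap_extend if gap_in == row else gap_open
            gap_in = row
        else:
            score += match if a == b else mismatch
            gap_in = None
    return score


print(score_alignment("GATTACA", "GA--ACA"))  # 5 matches, one gap of length 2: -1
```

For proteins, the match/mismatch test would be replaced by a lookup in a substitution matrix, so that similar amino acids score higher than dissimilar ones.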

BLAST

BLAST (Basic Local Alignment Search Tool) is a heuristic method to find the highest scoring locally optimal alignments between a query sequence and a database. Previous versions of BLAST did not allow gapped alignments, but BLAST2 (from the HGMP-RC telnet and www menus) does. A gapped BLAST search allows gaps (deletions and insertions) to be introduced into the alignments that are returned. Allowing gaps means that similar regions are not broken into several segments. The scoring of these gapped alignments tends to reflect biological relationships more closely.

The BLAST algorithm and family of programs rely on work on the statistics of local sequence alignments by Altschul et al. These statistics allow us to estimate the probability of obtaining an alignment with a particular score. The BLAST algorithm permits nearly all sequence matches above a cutoff score to be located efficiently in a database.

The algorithm operates as follows:

• BLAST scans the database for words (typically 3-mers for proteins) that score at least T (a designated threshold value) when aligned with a word in the query sequence - such aligned pairs are called hits.

• If a second non-overlapping hit is found within a distance A of the first and on the same diagonal, the first hit is extended between the database and query sequences in both directions. Extension continues, scoring all the time, until the running score drops below the maximum score seen so far by a value X. The resulting local alignment is called an HSP (high-scoring segment pair) or MSP (maximum scoring segment pair).

• If the alignment score of the HSP exceeds a given value Sg (the gapped score), then a gapped extension of the HSP is initiated.
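The seeding and ungapped extension steps can be sketched as follows. This is a much simplified illustration, not BLAST itself: it assumes DNA with exact word matches (rather than scored words compared against the threshold T), ignores the two-hit requirement and the gapped extension, and extends only to the right. The function names and parameter values are hypothetical.

```python
from collections import defaultdict


def find_hits(query, subject, word_size=3):
    """Return (query_pos, subject_pos) pairs where identical words occur."""
    index = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        index[query[i:i + word_size]].append(i)
    hits = []
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], []):
            hits.append((i, j))
    return hits


def extend_right(query, subject, i, j, match=1, mismatch=-1, x_drop=2):
    """Ungapped extension from (i, j) along the diagonal.

    Stops once the running score has dropped x_drop below the best score
    seen so far, and returns (best_score, length_of_best_segment).
    """
    score = best = 0
    length = best_len = 0
    while i + length < len(query) and j + length < len(subject):
        score += match if query[i + length] == subject[j + length] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score >= x_drop:
            break
    return best, best_len


hits = find_hits("ACGTACGT", "TTACGTTT")
print(hits)
print(extend_right("ACGTACGT", "TTACGTTT", 0, 2))  # (4, 4)
```

Indexing the query words once and scanning the database against that index is what lets the search avoid a full alignment against every database sequence.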

Earlier versions of BLAST looked only for single hits and extended them all; however, the extensions did not incorporate gaps and thus missed some potentially interesting matches. The gapped extension currently used takes much longer to execute, but speed is improved overall by requiring two non-overlapping close hits before the initial extension is triggered, and the value of Sg is chosen so that only about one gapped extension is triggered per 50 database sequences.

These modifications mean that BLAST now runs three times faster than earlier versions, and in trials it found more statistically significant alignments than the old BLAST.

BLAST FAMILY OF PROGRAMS

The BLAST family of programs allows all combinations of DNA or protein query sequences with searches against DNA or protein databases. (Most of the time these programs are used through an interface.)

• blastp: compares an amino acid query sequence against a protein sequence database.

• blastn: compares a nucleotide query sequence against a nucleotide sequence database.

• blastx: compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database.

• tblastn: compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands).

• tblastx: compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.

• PSI-BLAST: Position-Specific Iterated BLAST. This is potentially a very sensitive method for pulling out significant hits in a protein-protein database search. It first performs a gapped BLAST database search and then uses the information from any significant alignments returned to construct a position-specific score matrix, which replaces the query sequence for the next round of database searching. PSI-BLAST may be iterated until no new significant alignments are found.
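The six-frame conceptual translation used by blastx, tblastn and tblastx can be illustrated with a short Python sketch: three reading frames on the given strand and three on its reverse complement. The function names and frame labels (+1 to +3, -1 to -3) are chosen here for illustration, the standard genetic code is assumed, and stop codons are shown as '*' rather than ending the translation.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in T, C, A, G order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3' (A pairs with T, C with G)."""
    return dna.translate(COMPLEMENT)[::-1]


def translate_frame(dna, offset):
    """Translate every complete codon starting at the given offset."""
    return "".join(
        CODON_TABLE[dna[i:i + 3]] for i in range(offset, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return a dict mapping frame labels to translated sequences."""
    dna = dna.upper()
    opposite = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate_frame(dna, offset)
        frames[f"-{offset + 1}"] = translate_frame(opposite, offset)
    return frames


for label, peptide in sorted(six_frame_translation("ATGGCCTAA").items()):
    print(label, peptide)
```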

Online tool links:

NCBI - Blast

WU - Blast@ EBI

Global sequence alignment

A global alignment is one that compares the two sequences over their entire lengths, and is appropriate for comparing sequences that are expected to share similarity over the whole length. The alignment maximises regions of similarity and minimises gaps using the scoring matrices and gap parameters provided to the program.
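A classic way to compute an optimal global alignment is dynamic programming in the style of Needleman and Wunsch. The sketch below is a minimal version under simplifying assumptions: a match/mismatch score instead of a scoring matrix, and a single linear gap penalty instead of separate creation and extension penalties. The function name and parameter values are illustrative only.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for an optimal global alignment."""
    def sub(x, y):
        return match if x == y else mismatch

    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),  # align residues
                score[i - 1][j] + gap,                          # gap in b
                score[i][j - 1] + gap,                          # gap in a
            )
    # Trace back from the bottom-right corner to recover one alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]):
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


total, top, bottom = needleman_wunsch("GATTACA", "GATACA")
print(total)
print(top)
print(bottom)
```

Because the traceback always starts in the bottom-right corner and ends in the top-left one, every residue of both sequences appears in the result, which is what makes the alignment global.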

Online tool link:

Clustalw @ EBI

Local sequence alignment

Global sequence alignment algorithms align sequences over their entire lengths. You do need to think about whether that type of alignment makes sense for your sequences. It works well where, for example, each exon is expected to be represented in both sequences and in the same order. However, how well do you think this approach would work with multidomain proteins that share one domain but not others, or with sequences where there have been regions of duplication? A second comparison method, local alignment, searches for regions of local similarity and need not include the entire length of the sequences.
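Local alignment is commonly computed with the Smith-Waterman variant of dynamic programming, in which scores are never allowed to fall below zero and the traceback starts from the highest-scoring cell. The following is a minimal sketch under the same simplifying assumptions as before (match/mismatch scoring and a linear gap penalty); the function name and parameter values are illustrative only.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for a best local alignment."""
    def sub(x, y):
        return match if x == y else mismatch

    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                0,  # start a fresh local alignment here
                score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the score reaches zero.
    row_a, row_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]):
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


print(smith_waterman("GGGACGTACCC", "TTACGTATT"))  # (10, 'ACGTA', 'ACGTA')
```

Only the shared ACGTA region is reported; the unrelated flanking residues, which a global alignment would be forced to include, are left out.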

Online tool links:

Pairwise alignment at EBI

Pairwise alignment at NCBI

Protein Sequence Analysis

You can get a variety of clues by looking for patterns and motifs in your sequence:

• These are often derived from multiple sequence alignments.

• Conserved protein domains or regions can be very useful in trying to determine which protein family a sequence belongs to, and in locating catalytic sites, carbohydrate binding sites, etc.

• Various research groups have created their own databases and search tools; it might be worth using a variety of these.

FIND HOMOLOGOUS (PARALOGOUS AND ORTHOLOGOUS) SEQUENCES

Using a database similarity search can give you a great deal of information:

• Homologues may be well annotated and their function documented in the literature.

• Simply comparing your sequence with homologues can tell you a lot.

• Phylogenetic analysis may reveal evolutionary relationships between proteins and help you decide which family or super family a protein belongs to.

• N.B. Be aware of convergent evolution.

HAVING SOME IDEA OF STRUCTURE MAY HELP YOU PREDICT POSSIBLE FUNCTIONS

Knowing the protein fold(s) together with conserved domains (or even residues) may tell you what type of functions this protein could have.

Source: bioinformaticsweb
