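Crocodile Diary (クロコダイル日記): Special Lecture at C University, Script

This spring I gave my first lecture in English. I am posting here the
script and notes that I prepared for it.
In the end I did not use the script on the day and mostly improvised, so
it differs quite a bit from what I actually said.
Typos, omissions and the like are left as they are. I am deliberately not
posting the slides.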

Slide 1.
Thank you very much for the introduction, Tamura-sensei.
Today, I would like to give you a lecture on fundamental resources for
bioinformatic analysis.
All the resources presented here are available on the World Wide Web, so
you can access these materials anytime and anywhere, even during GOLDEN
WEEK.
So let me start my part of this lecture.

Slide 2.
First, I would like to confirm the definition of bioinformatics.

I think most of you know that this type of work is about how to apply
informatics or mathematics to solve problems posed by biologists.

An explanation from the other side is also possible: how to explain
biological phenomena by processing biological data with computer and
information technology.

This means that "bioinformatics" itself arose from two fields.
So we can regard bioinformatics as both a category of informatics and a
category of biology.
This is not a serious matter, because this field of science has been
developed by both information science and biology.

This field is also known as "Computational Biology", as you probably know.
But today I would like to call it "bioinformatics".

Also, please note that the definition of bioinformatics differs with the
standpoint or educational background of each researcher. There are many
definitions. Some say that bioinformatics is a time-scale biology,
because the results are tightly connected to time scales ranging from
millions of years to picoseconds.

Slide 3.

This slide lists the goals, merits and demerits of bioinformatics.
One of the goals is building new hypotheses from the collected
experimental data, not only processing biological data. Processing
experimental data is just the first step of bioinformatics research, at
least I believe.
Another goal is narrowing down the potential directions for the next step
of the related experimental study. Bioinformatics is a highly anticipated
field of work, especially for collaborative research.

Selected merits are described here.
In general, a bioinformatics analysis is reproducible if you want to
repeat it.
Computers are not expensive equipment nowadays.
Everyone gets the same result with the same method.

Demerits and pitfalls are here.
Experimental data are needed for the intended analysis.
If you have no appropriate experimental data, you cannot start the
research or obtain correct results.
You should also note that part of the results of bioinformatics research
are just predictions, so the results should be checked by experimental
studies if possible.
You also have to study hard in many fields: computational science, web
resources and progress in biology.
Finally, computer performance is very important.

Slide 4.

This slide shows a category map of bioinformatics.
Bioinformatics consists of the three subcategories described here.
The representative one is sequence analysis.
Structural informatics is also important, especially for biomedical
sciences.
A new field of work called "network biology" has arisen in the last decade.
Additionally, text mining is also interesting work for medical science.
I will review each subcategory to check its properties.

Slide 5.

Sequence-based informatics consists of sequence analysis such as
molecular evolution and comparative genomics.

This is the mainstream of bioinformatics nowadays.
It is based on collaborative work by the international DNA and protein
sequence banks.
It has also been accelerated by advances in DNA and protein sequencing
technology.
Natural language processing has effectively assisted the technology of
sequence analysis.

Text mining is natural language processing applied to the biomedical
literature.
This field is included as a part of sequence analysis because these
fields are tightly bound by methods and technology.
Text-mining researchers sometimes move into the field of sequence analysis.

Slide 6.

Structure-based informatics is composed of the three categories listed
here.
These are 3D structure analysis, chemoinformatics and molecular
simulation technology.
Any molecule or biopolymer of course consists of atoms.
So the structure of any biopolymer, even a macromolecule, can be
described by atomic positions.
So this field provides the most microscopic resolution.
Progress in this type of bioinformatics is based on the application of
physical chemistry such as thermodynamics and quantum chemistry.

In general, any type of molecule in the cell can be targeted by
structural informatics.
For example, glycoproteins and lipid membranes can be treated in a
similar manner to proteins or DNA, because structural informatics handles
the atomic coordinates of molecules.

This field is driven not only by evolutionary interest but also by demand
from drug development.
You should note that this type of bioinformatics, especially molecular
simulation, is very costly and time-consuming. High-end cluster servers
are necessary.

Slide 7.

Network (pathway) based informatics consists of expression analysis,
represented by microarray informatics, and systems biology.
Network-type bioinformatics focuses on changes in cellular location and
on the conditions for the expression of gene products.
This provides higher-order biomedical information that connects to cell
biology.
Systems biology is a newly rising field of bioinformatics and has great
potential impact.
The goal of this strategy is the reconstruction of the cell.
So an application of systems biology will be synthetic biology.
It will also be the most comprehensive biology.
However, most systems biology methods are still in the developmental phase.
For application to biomedical research, many problems have to be solved.
The representative problems are the shortage of parameters and the lack
of fundamental equations to solve.

Slide 8.

This is the category map of bioinformatics again.
Actually, these categories are not independent; they are linked to each
other by methods, technology and the experimental data to be processed.
For example, structure and sequence analysis are tightly bound by the
important relationship based on evolution and function.
Molecular simulation strongly depends on the input 3D structure.
Also, sequence analysis and text mining share methods and techniques.
I recommend that you learn about the links and interactions among the
subgroups of bioinformatics, in order to understand bioinformatics
research deeply and systematically.
Today's lecture focuses on sequence analysis and structural informatics.
And I would like to review it in Japanese.

Slide 10.

From here, I would like to introduce fundamental resources for sequence
analysis.

There are three types of sequence resources on the WWW.
The data bank type is the primary database.
A databank accepts submissions from whoever determines the sequence of a
DNA or protein.
Some journals require data submission before publication of the article
on the sequenced material. This rule is called hold until published (HUP).
However, the quality of sequences in a databank is not very high.
Sometimes they include errors or miss significant information.

The annotation type is a curated database that maintains the quality of
its contents.
It is based on a sequence bank, with annotation work and clustering
applied to the databank contents. The method and target of annotation
depend on the database policy.
Most of them are freely available.

Finally, I have to mention sequences in patent documents.
DNA and protein sequences are also found in patents.
They are basically made public by the national intellectual property
office of each country.
However, these types of data are not easy to handle and search.
If you have to deal with patented sequences, patient work is needed.
Please note that the HUP rule described above is not applied to patented
sequences.
Patent owners do not have to submit to the public sequence banks.

Slide 11.

I think you know GenBank at NCBI, which is the representative data bank
for DNA sequences.
DDBJ (DNA Data Bank of Japan) and EBI (European Bioinformatics
Institute) are also established.
These DNA banks share sequence and annotation data through the internet.
The data are updated every day.
You may have seen DNA accession numbers in the databanks such as those
listed here.
One letter and a 5-digit number is an older entry.
A more recent one has 2 letters and a 6-digit number.
The number after the point indicates how many times the sequence or the
contents of the annotation have been revised.
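
The two accession patterns described above can be checked with a short
Python sketch. It only covers the two layouts named in this lecture (one
letter plus 5 digits, two letters plus 6 digits) and the optional
".version" suffix; the function name and the sample IDs are made up for
illustration, and real databanks also use other layouts.

```python
import re

# One letter + 5 digits (older style) or two letters + 6 digits (newer
# style), optionally followed by ".<version>". Other layouts are ignored.
ACCESSION_RE = re.compile(r"^([A-Z]\d{5}|[A-Z]{2}\d{6})(?:\.(\d+))?$")


def parse_accession(text):
    """Split an accession string into its ID, version and style."""
    match = ACCESSION_RE.match(text.strip().upper())
    if match is None:
        return None
    accession, version = match.group(1), match.group(2)
    return {
        "accession": accession,
        "version": int(version) if version else None,
        "style": "older" if len(accession) == 6 else "newer",
    }


# Hypothetical IDs, not taken from the slide.
print(parse_accession("U12345.1"))
print(parse_accession("AB123456.2"))
```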

Slide 12.

Next are the databanks for protein sequences.
Three protein resources are well known and have been unified into an
international databank called UniProt. UniProt itself means "universal
protein resource".
One of these is the Protein Information Resource at Georgetown University.
Swiss-Prot, hosted by EBI and the Swiss Institute of Bioinformatics (SIB),
provides only protein sequences directly determined by experimental
sequencing.
TrEMBL of EMBL provides amino acid sequences predicted from the coding
regions of DNA.
The update interval is 3 or 4 weeks.

Structure databases also provide amino acid sequences.
Basically, the aim of this type of databank is to provide structures of
biomolecules, not sequences.
However, you can of course utilize them as a possible protein sequence
resource.
Details of the Protein Data Bank are presented in later slides.

Slide 13.

The annotation type database is a reliable sequence resource with unique
annotation, and it covers the pitfalls of bank type databases.
The data bank type is not user-oriented because of sequence redundancy.
The problem is more serious when clustering has not been applied.
Too many identical sequences are stored as different entries.
That causes confusion when browsing annotation and doing sequence
analysis.
Also, sequencing errors unfortunately still survive in the databanks.

The annotation type has solved these problems.
Sequence-based clustering is applied to remove the redundancy.
Additional information on the sequences is provided by annotation work.
Annotation relates to gene family, evolutionary information, 3D
structure motifs and so on.
Errors in sequences are corrected by annotation work such as trimming of
vector sequence contamination.
So I recommend using an annotation database if you want to skip these
laborious steps in sequence analysis, because an annotation database is a
user-friendly and reliable resource.

Slide 14.

Representative annotation databases are listed here.
RefSeq at NCBI is a well-known annotation database on the World Wide Web.
Details are presented in a later slide.
Ensembl, hosted by EMBL, EBI and WTSI (Wellcome Trust Sanger Institute),
provides genome-based annotation.
H-Invitational database is an annotation database for human full-length
cDNAs.
The UCSC (University of California Santa Cruz) genome site is a
genome-based annotation database.
In general, the coverage of biological species differs with the
annotation policy.
For example, RefSeq covers thousands of species, but its annotation is
not sufficient for effective data mining.
Ensembl provides more detailed annotation, such as the coding regions of
alternative splicing variants, and covers 30 model species and Homo
sapiens.
However, if genome sequencing is not completed, a certain number of genes
are missing from the database.
H-Invitational DB is restricted to human genes only.
But the annotation is very deep and comprehensive, and the gene count is
almost complete because they use all the transcripts of the human genome.
Human curation is applied to ensure the annotation quality.

Slide 15.

Next, I would like to check what annotation is.
The simplest definition is just adding information to sequences and genes.
Actually, computational annotation is based on sequence analysis
software, with the output checked by human curation.
Clustering of sequences is also important for constructing "genes" and
gene families.
Linking to external resources is also a first step of annotation.

The narrow definition is functional annotation of a gene by sequence
similarity search against homologous genes with known function.
Mapping to the genome is also important to locate the locus of a gene.
Assignment of gene family and estimation of taxonomic coverage are
evolutionary annotation.

The standard definition includes detection of repetitive sequences,
subcellular localization motifs, expression profiles and mapping to
metabolic pathways.

The broad definition includes linking to other databases, correction and
editing of sequences, and so on.

Slide 16.

The current de facto standard annotation database is RefSeq, hosted by
NCBI.
The sequence set is derived from GenBank.
Its coverage of biological species is the widest.
Vector sequences are removed from the original sequences.
Redundancy is also removed. A unique identifier is assigned to each set
of identical sequences.
It is updated regularly at intervals of 2 or 3 months.
The Entrez Gene database (formerly called LocusLink) is built into the
NCBI web site. This database provides the genomic position and cytoband
of genes.
A gene map viewer is available for most model organisms.

Slide 17.

This is the list of patent databases.

USPTO is the United States Patent and Trademark Office.
This office hosts a database of patented DNA.
But sequence search is not available.
Patome DB is based on sequences from the WIPO and PSIPS databases.
Search by gene ID and gene symbol is available.
PatGenDB has a sequence search engine but is not a free database.
PSIPS is a public database maintained by USPTO.
Sequence data from WIPO can be downloaded.
The public databases listed here are not designed for bioinformatics
analysis.
So patient processing is needed if you utilize them for bioinformatics
analysis.
But this type of database should not be ignored.
Please note that, according to the PatomeDB article in Nucleic Acids
Research, 55% of DNA sequences are linked to one or more patents.
I believe the proportion of patented sequences will increase.

That is all for the outline of biological sequence resources in today's
lecture.
I will take your questions.

Slide 18.

Next, I would like to talk about how to read the data formats of the
sequence resources.
A data format means a data structure defined by a computer program or
database.
I think that is a basic thing.

Programs and databases can define their own formats for processing.
Format conversion programs are basically free, but most of them run on
UNIX and Linux.
Some major formats are supported by programs running on Windows, but the
Windows versions are not freely available.
Also, please note that some formats are not text files. Binary formats,
which only a computer can read, are sometimes used.

Here is the list of formats mainly used in sequence analysis.
For DNA sequences and annotation, the GenBank and EBI flat file formats
are widely used.
DDBJ also provides its own format, but it is almost the same as the
GenBank format.
The UniProt and PDB file formats are also important for amino acid
sequences, but the FASTA format is the most frequently used in
bioinformatics sequence analysis.
I think almost all Web services and software for sequence analysis can
accept the FASTA file format.
Downloaded sequence data are also FASTA files.
Today, I would like to talk about the FASTA sequence format.

Slide 19.

An example of the FASTA format is described here.
This example holds a DNA sequence, but the format can also be used for
amino acid sequences.
The FASTA file was originally the input file format of the sequence
comparison program FASTA.
I think you know BLAST. The FASTA program is the same type of sequence
analysis program.
The first line is called the "header" line. It starts with a
"greater-than" angle bracket symbol.
The header line includes the sequence ID, a short annotation and so on.
The sequence is written as a series of single letters.
Both capital and lowercase letters are used to represent the building
blocks.
For nucleotide sequences, the building blocks are listed from 5' to 3'.
For amino acid sequences, they are listed from the N terminus to the C
terminus.
The number of characters per line is not restricted, but fewer than 80
characters is preferred.
For amino acid sequences, an asterisk means a stop codon.
A double slash denotes the end of the sequence.
The number of sequences in a FASTA file is not restricted.
If it contains only one sequence, it is called a single FASTA file.
If it contains two or more sequences, it is referred to as a multi-FASTA
file.
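
The rules above are enough to write a small reader. The following Python
sketch is one possible way to do it, not the parser of any particular
tool: the function name and the two sample records are made up, and a
line holding only "//" is skipped as an optional end marker because many
FASTA files do not contain one.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA text."""
    records = []
    header = None
    chunks = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "//":
            # Blank lines and the optional "//" marker carry no residues.
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there is one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# A hypothetical multi-FASTA text with one DNA and one peptide record.
example = """>seq1 hypothetical DNA fragment
ATGCATGCAT
GCATGC
//
>seq2 hypothetical peptide
MKV*
"""

for header, sequence in parse_fasta(example):
    print(header, len(sequence))
```

Joining the sequence lines keeps the reader independent of the line
width, so files wrapped at 60, 70 or 80 characters give the same result.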

Slide 20.

In general, the building blocks of nucleic acids in a sequence are
represented by A, T, G, C and U.
However, a nucleotide sequence in FASTA is not limited to ATGCU.
There are many nonstandard representations of building blocks, listed
here.
They are defined by the IUPAC naming rules.
R means G or A, a purine base.
Y means T or C, a pyrimidine base.
I will not go through the remaining special characters, as they are
rarely observed in FASTA files.
If you want to write a program for processing nucleic acids, you have to
cover these characters.
Handling the special characters on the complement strand is not
straightforward.
They are hard to process.
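
One common convention for the complement strand, which is an assumption
here and not something stated on the slide, is to complement the whole
set of bases that an ambiguity code stands for: R (A or G) becomes Y (T
or C), and so on. A minimal Python sketch under that assumption follows;
the table covers the standard IUPAC nucleotide codes, U is mapped to A,
and the function name is made up.

```python
# Complement table for the IUPAC nucleotide codes. Each ambiguity code is
# mapped to the code for the complements of the bases it represents.
IUPAC_COMPLEMENT = {
    "A": "T", "T": "A", "G": "C", "C": "G", "U": "A",
    "R": "Y", "Y": "R",  # purine <-> pyrimidine
    "S": "S", "W": "W",  # G/C and A/T are their own complements
    "K": "M", "M": "K",
    "B": "V", "V": "B",
    "D": "H", "H": "D",
    "N": "N",
}


def reverse_complement(seq):
    """Reverse-complement a nucleotide string, keeping the letter case."""
    out = []
    for base in reversed(seq):
        comp = IUPAC_COMPLEMENT[base.upper()]  # KeyError on unknown letters
        out.append(comp if base.isupper() else comp.lower())
    return "".join(out)


print(reverse_complement("ATGRY"))
```

Raising an error on an unknown letter is a deliberate choice in this
sketch, so that unexpected characters are noticed instead of being
silently dropped.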

Slide 21.

Sequence comparison is the mainstream of sequence analysis.
So please check the aims and goals of sequence comparison.

Discovering a gene.
For example: prediction of transcribed regions in a genomic sequence.
Prediction of gene function, that is, functional annotation based on
sequence similarity with known genes.
It can be used for classifying sequences. That process is also known as
clustering.
To define gene families and to locate functional motifs.
Prediction of evolutionary relationships by sequence comparison.
That results in the construction of a phylogenetic tree.
Linkage between sequence and phenotype.
Phenotype prediction based on SNPs and microsatellites.
It is also used for quality checks on sequences: trimming cloning vector
DNA, detection of frame-shift errors and so on.

Slide 22.

There are several options for sequence comparison.
If similarity between your sequence and the target is expected, you just
align your query sequence against the target database or sequences.
In general, alignments are categorized into 2 types.
A pairwise alignment is made by comparing two sequences.
A multiple sequence alignment is made from three or more sequences.
A sequence alignment can be generated if the sequences are closely related.
Of course, it is also used to find identical sequences in a database or
genomic sequence.

If your query sequence is distant from the target database, you will try
to make a pairwise alignment with PSI-BLAST.
Please note that PSI-BLAST is applied to amino acid sequences.
If you cannot find any result with PSI-BLAST, you can try again with
alternative methods based on machine learning, such as hidden Markov
models, artificial neural networks, support vector machines and genetic
algorithms.
These methods are effective for finding distantly related sequences.

You can choose option one or option two.
The point to consider is what you know about where the sequence is
derived from.
For example, the biological species is very informative.
Basically, you try method one, and if no result is returned, you try
PSI-BLAST and other methods such as machine learning.

Slide 23.

I would like to explain the fundamentals of the algorithm for pairwise
alignment.
In 1970, Needleman and Wunsch applied the dynamic programming method to
making a pairwise alignment from two related sequences.

Dynamic programming itself is based on a simple idea.
Create a dynamic programming matrix based on a substitution matrix.
Locate the optimal score, and trace the alignment from there back to the
origin of the matrix along the optimal path.
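
The matrix filling and the traceback can be sketched in Python as below.
This is a minimal global alignment in the Needleman-Wunsch style under
simplifying assumptions: a fixed match/mismatch score is used in place of
a real substitution matrix such as PAM or BLOSUM, the gap penalty is
linear, and the scores (+1, -1, -2) and the sample sequences are
arbitrary choices for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # The first row and column hold the cost of leading gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Each cell takes the best of: diagonal (pair), up (gap in b), left (gap in a).
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right cell to the origin of the matrix.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return (score[len(a)][len(b)],
            "".join(reversed(out_a)),
            "".join(reversed(out_b)))


best, top, bottom = needleman_wunsch("ACGT", "AGT")
print(best)
print(top)
print(bottom)
```

When several paths give the same score, this sketch prefers the diagonal
step, so it reports only one of the equally good alignments.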

The Smith-Waterman algorithm is an applied version of dynamic programming
for local alignment. The Smith-Waterman algorithm is implemented in a
program called SSEARCH.
Details of the Smith-Waterman algorithm will be explained later.

In 1988, the FASTA program was developed by Pearson.
This program computes the E-value of an alignment. This value is used to
assess the alignment instead of the alignment score.

Then, in 1990, the BLAST program was published by David Lipman and
colleagues.
This program is faster than FASTA. The sensitivity of BLAST is reduced as
a trade-off for the higher speed.
Several variants of BLAST have been created.
NCBI BLAST is the most frequently used and is implemented in many Web
services as a sequence search engine.

Slide 24.

The next slides show you the algorithm of BLAST.
Basically, the logic of BLAST consists of four steps:
Cut, Align, Stretch and Combine.
The first three steps correspond to the Smith-Waterman algorithm.
For a nucleotide sequence, the complement sequence of the query is
generated and aligned to the target sequence.

I would like to explain in more detail using the pair of sequences
described here.

Slide 25.

The first step of the BLAST algorithm is splitting the query sequence
into fragments called "Words".
The length of a Word can be modified by the user.
In this sample, the word length is set to three.
So the query sequence is split into three fragments: NCI, AMQ and MPY.
In a similar manner, frame-shifted words are also created as described
here.
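
The "Cut" step can be sketched in Python as follows. The query
"NCIAMQMPY" is an assumption obtained by joining the three fragments
named above, since the slide itself is not reproduced here, and the
function names are made up. Frame 0 gives the three fragments from the
text; frames 1 and 2 give the frame-shifted words.

```python
def split_into_words(query, word_length=3):
    """Return every overlapping word with its offset in the query."""
    return [(i, query[i:i + word_length])
            for i in range(len(query) - word_length + 1)]


def words_by_frame(query, word_length=3):
    """Group the words by starting frame (0, 1, ..., word_length - 1)."""
    frames = {}
    last_start = len(query) - word_length + 1
    for frame in range(word_length):
        frames[frame] = [query[i:i + word_length]
                         for i in range(frame, last_start, word_length)]
    return frames


query = "NCIAMQMPY"  # assumed query: the three fragments joined together
for frame, words in words_by_frame(query).items():
    print(frame, words)
```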

Slide 26.

The second step of BLAST is to align the Words against the subject
sequence.
The subject sequence is the target sequence in BLAST.
A matched island is called a "Word hit" or "Seed".
Insertions and deletions are not allowed in a Word hit.
There is a difference between BLAST and FASTA in the generation of Word
hits.
In BLAST, mismatches are allowed in a Word hit.
In contrast, FASTA allows exact matches only.

In this case, a Word hit for "AMQ" is created.
This will be used as the scaffold for a longer alignment.
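
Seed finding can be sketched in Python as below. To keep it short, the
sketch accepts exact word matches only, which is the simpler rule
described above for FASTA; BLAST would also accept similar words. The
query and the subject sequence are assumptions for illustration and are
not the sequences on the slide.

```python
def find_exact_seeds(query, subject, word_length=3):
    """Return (query_offset, subject_offset, word) for every exact hit."""
    hits = []
    for q_pos in range(len(query) - word_length + 1):
        word = query[q_pos:q_pos + word_length]
        s_pos = subject.find(word)
        while s_pos != -1:
            hits.append((q_pos, s_pos, word))
            # Keep searching to the right for further copies of the word.
            s_pos = subject.find(word, s_pos + 1)
    return hits


query = "NCIAMQMPY"      # assumed query
subject = "KNCLAMQLPYR"  # hypothetical subject sequence
print(find_exact_seeds(query, subject))
```

With these two sequences only the word "AMQ" is found, so a single seed
is reported and would serve as the starting point for the extension.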

Slide 27.

In the third step, stretch the Word hits to extend the coverage of the
query sequence.
The alignment score is calculated, and if the score no longer improves,
the stretching stops.
How to calculate the alignment score will be explained later.

In this sample, NCI at the N-terminus and MPQ at the C-terminus are
generated.
In the extended region, sequence mismatches and gaps are allowed.

Slide 28.

Finally, combine the extended word hits.
Gaps are not allowed when combining the Word hits.
In this case, YRI is inserted into the space before the neighboring
"Word hit".

The process is iterated until no more word hits can be combined.
If the query sequence is a nucleotide sequence, the complement strand of
the sequence is also aligned in a similar manner.

Slide 29.

The main output of BLAST is a pairwise alignment.
To assess the result of a sequence comparison, the obtained alignment is
checked carefully.
For evaluation of the alignment, several numerical values are calculated
and printed in the output.
The score is the sum of the rewards and penalties of the alignment based
on the substitution matrix.
The raw score is the plain type of alignment score.
The bit score is another type of alignment score that takes into account
the background amino acid composition of the target sequences.
A higher score means better similarity.
The expected value (E-value) is a well-known measure for assessing a
sequence alignment.
Details will be explained in later slides.
A lower E-value represents higher significance.
The P-value is another statistical value and denotes the probability of
an alignment occurring by chance.
Sequence identity and coverage are helpful in special situations.

A sample pairwise alignment is described here.
The alignment length is the length of the aligned region including gap
sites.
Unaligned regions are those outside the aligned region.
A block means a run of consecutively matched sites in the alignment.
A sequence match is an identical match or an accepted mutation.
An accepted mutation is a site substituted by a chemically equivalent
residue, and is restricted to amino acid sequence alignments.
Gap sites are insertions and deletions. Sometimes these are called
"indels".

Slide 30.

This slide shows details of the raw score.
To calculate the raw score, a substitution matrix is used.
For amino acid sequence alignment, there are two major types of
substitution matrix, called PAM and BLOSUM.
Users can choose the matrix.
For nucleic acid alignment, no matrix is used.
Please note that the score depends on the sequence length.
In BLAST output, alignments are sorted by raw score.

Slide 31.

The E-value is the number of alignments with at least this score that
are expected to appear by chance in the database.
This means the appearance of the alignment by chance against the
background database.
So a non-significant alignment has a higher E-value.

The E-value is calculated by the formula shown here.
M is the query sequence length and N is the total sequence length of the
target.
So the E-value depends on the search space.
K and lambda vary with the substitution matrix used.
Here K is 0.14 and lambda is 0.318.
S is the raw score.

Please note that if the E-value is lower than 1e-10, functional
similarity is expected.
So set the E-value cutoff to 1e-10 if you want to find genes with similar
function.
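
The slide with the formula is not reproduced here. Assuming it shows the
usual Karlin-Altschul form, E = K * M * N * exp(-lambda * S), the
calculation can be sketched in Python with the K and lambda values quoted
above. The function names, the raw score and the two lengths are made-up
illustration values. The bit score shown is the standard rescaled score
(lambda * S - ln K) / ln 2, which gives the same E-value as
M * N * 2^(-bit score).

```python
import math


def expect_value(raw_score, query_length, database_length,
                 k=0.14, lam=0.318):
    """E = K * M * N * exp(-lambda * S), the assumed Karlin-Altschul form."""
    return k * query_length * database_length * math.exp(-lam * raw_score)


def bit_score(raw_score, k=0.14, lam=0.318):
    """Rescale a raw score so that E = M * N * 2 ** (-bit_score)."""
    return (lam * raw_score - math.log(k)) / math.log(2)


# Hypothetical search: a 250-residue query against 50 million residues.
for raw in (50, 100, 150):
    e_value = expect_value(raw, 250, 50_000_000)
    print(raw, round(bit_score(raw), 1), e_value, e_value < 1e-10)
```

Because M and N enter as a product, the same raw score gives a larger
E-value when the database grows, which is the dependence on the search
space mentioned above.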

Slide 32.

Identity and coverage are the simplest values for alignment quality.
In the case of nucleotide sequences, a gap is counted as a mismatch.
For amino acid sequences, gaps are ignored when counting mismatches.
But an accepted mutation is counted as a mismatch.
Sequence coverage has two values for an alignment, because either the
query length or the target length can be used as the denominator of
coverage.

For example, in this amino acid alignment, sequence A is 25 amino acids
long and B is 18 amino acids long.
In the pairwise alignment, the alignment length including the gap is 18.
The counts of exact matches and gap sites are 11 and 1, respectively.
So the sequence identity is calculated to be 61%.
The coverage for sequence A is 76%, and the coverage of B is 100%.
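
These values can be computed from a gapped alignment with a short Python
sketch. The definitions used here are assumptions: identity is the
number of exact matches divided by the alignment length including gap
columns, which reproduces 11 / 18 = 61% from the example, and coverage is
the number of aligned residues divided by the full sequence length, which
may differ from the convention behind the coverage figures on the slide.
The toy alignment and the function name are made up.

```python
def identity_and_coverage(aligned_a, aligned_b, full_len_a, full_len_b):
    """Summarize a pairwise alignment given as two gapped strings."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned strings must have the same length")
    columns = list(zip(aligned_a, aligned_b))
    matches = sum(1 for x, y in columns if x == y and x != "-")
    gaps = sum(1 for x, y in columns if x == "-" or y == "-")
    residues_a = sum(1 for x in aligned_a if x != "-")
    residues_b = sum(1 for y in aligned_b if y != "-")
    return {
        "alignment_length": len(columns),
        "matches": matches,
        "gaps": gaps,
        "identity": matches / len(columns),
        "coverage_a": residues_a / full_len_a,
        "coverage_b": residues_b / full_len_b,
    }


# Toy alignment: A is 10 residues long in total, B is 6 residues long.
print(identity_and_coverage("MKT-LV", "MRTALV", 10, 6))

# The identity from the lecture example: 11 matches over 18 columns.
print(round(100 * 11 / 18))
```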

Slide 33.

I would like to introduce a problematic case of alignment.
Sometimes the position of a gap cannot be determined uniquely.
There are several possible sites at which to insert the gap.
In this case, the gap can be inserted at either the 13th or the 14th
position. Both are correct answers.
There is no exact solution for determining the gap site.
Programs output only one of the possible alignments.

Slide 34.

This is also a problematic case, but it is not so serious.
There are two possible sites at the cytosine in the sample alignment.
Please note that a longer gap insertion is considered more natural from
the viewpoint of evolutionary biology.
To correct a non-preferred alignment, adjusting the gap open and gap
extension penalties is an effective solution.
To force the preferred alignment, increase the gap open penalty and just
run BLAST again.

Slide 35.

Please note that similarity and homology are almost the same but not
equivalent terms.
Similarity means that a shared character is observed, and it is expressed
as high or low.
In contrast, homology means sharing a biological ancestor and having an
evolutionary relationship. And this term is not a matter of degree.

For example, consider the similarity between a genome sequence and an
mRNA sequence.
This is not homology, just similarity.
And similarity within the same gene family can be expressed as homology.
Database searches such as BLAST are often called "homology searches".
But this is not a correct expression. It is just a similarity search.

Slide 36.

If you want to try a sequence search for distantly related sequences,
please refer to this slide.
These strategies have higher sensitivity than plain BLAST and FASTA.
I think PSI-BLAST is the first choice for this purpose if you search
amino acid sequences.

Slide 37.

This slide shows the algorithm of PSI-BLAST.
This BLAST variant is an iterative search using an amino acid profile.

Slide 38.

Here is a timeline of sequence analysis.
Please check it if you are interested.

Slide 39.

So, next up is a presentation of the fundamentals of structural
bioinformatics.
That's all for sequence analysis resources.
Any questions?
I have summarized the points of the sequence comparison methods in
Japanese.

Slide 40.

Structure analysis is roughly classified into static structure analysis
and dynamic structure analysis.
Static structure analysis is a museum type of science, and dynamic
analysis is a theoretical science.
The target is the same, but their concepts are not really shared; they
are quite separate.
Today, I would like to focus mainly on static structure analysis.

Slide 41.

Basically, a 3D structure is an assembly of atomic coordinates
represented by X, Y and Z.
The unit of atomic coordinates is the angstrom.
Atomic coordinates of biological molecules can be determined by
experimental studies such as X-ray analysis and NMR spectroscopy.
3D structure prediction can also provide atomic coordinates.
Atomic coordinates are analyzed by static and dynamic methods.
The ultimate goal of this series of analyses is related to drug discovery
and structure prediction.

Slide 42.

These are information resources for atomic coordinates or structural
annotation of biological molecules.
The Protein Data Bank and the Nucleic Acid Database provide information
on the atomic coordinates of biomolecules.
These raw data are analyzed and annotated by users, and the results are
released from their web sites.
Among annotation databases, PDBsum, hosted by EBI, is representative.
For structural classification, SCOP and CATH were established in the
early days of structural bioinformatics.

Slide 43.

The Protein Data Bank was established in 1971 at Brookhaven National
Laboratory.
The current PDB is an international collaborative work.
RCSB in the USA, PDBj in Japan and MSD at EBI joined the activity of
"wwPDB".
BioMagResBank also joined in 2006.
Just like the international sequence banks, the four organizations share
structural information determined by X-ray, NMR and electron microscopy.
The target is all biological materials except lipids.
Chemical compounds are also included in the structure information.
A PDB code has 4 characters, such as 3BLM and 1A8P.
One entry has one ID.
A 5th character, a letter, is the chain ID.
The HUP rule is applied for publication in journals.

Slide 44.

At present, PDB is managed by RCSB. RCSB is the Research Collaboratory for
Structural Bioinformatics, and it stores many structure entries.
RCSB consists of Rutgers University, the UCSD supercomputer center and so
on.
The current version of PDB provides a helpful search system.
Keyword, ID and sequence searches are available.
A search system by chemical compound is also built in.

Slide 45.

This slide shows the data growth and the count of contents in RCSB PDB.
Currently, about fifty-six thousand entries are stored in PDB.
Most of the entries are proteins.
As for method, X-ray is more common than other approaches.
The number of entries has increased sharply since 2001.
Please note that PDB is a databank type information resource, so
redundancy is not removed.
Hence the number of entries is not equal to the number of structures.
For example, when a structure is identical but the resolution of the new
one is improved over the older entry, a different ID is assigned to the
new entry.

Slide 46.

This slide shows the merits and demerits of the methods for determining
structures.
Please note that X-ray data do not include hydrogen atoms.
In contrast, NMR can observe hydrogen atoms.
However, NMR cannot be applied to large macromolecules.
The limit is about three hundred residues for a protein.
NMR can also observe dynamic structure in solution.
Electron microscopy can determine the structure of a supramolecule such
as the ribosome.
However, only backbone structures such as the alpha trace are provided,
due to the low resolution.

Slide 47.

This is a sample of the PDB format.
Atom names and atomic coordinates are described here.
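
The sample on the slide is not reproduced here, so the following Python
sketch uses a made-up ATOM record. It relies on the fixed-column layout of
PDB coordinate records (atom name in columns 13-16, residue name in
18-20, chain ID in 22, residue number in 23-26, X, Y and Z in 31-38,
39-46 and 47-54). The helper names, the occupancy and B-factor
placeholders and the coordinate values are assumptions for illustration.

```python
def format_atom_line(serial, name, res_name, chain, res_seq, x, y, z):
    """Build a fixed-column ATOM record (demo helper, hypothetical values)."""
    return ("{:<6}{:>5} {:<4}{:1}{:>3} {:1}{:>4}{:1}   "
            "{:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}").format(
        "ATOM", serial, name, "", res_name, chain, res_seq, "",
        x, y, z, 1.00, 0.00)


def parse_atom_line(line):
    """Read atom name, residue and coordinates from an ATOM/HETATM line."""
    if not line.startswith(("ATOM", "HETATM")):
        return None
    return {
        "serial": int(line[6:11]),
        "atom_name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain_id": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),  # coordinates are in angstroms
        "y": float(line[38:46]),
        "z": float(line[46:54]),
    }


line = format_atom_line(1, " CA", "ALA", "A", 1, 11.104, 6.134, -6.504)
print(line)
print(parse_atom_line(line))
```

Slicing by column instead of splitting on spaces matters because
neighboring fields in real files can touch each other without a space.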

Slide 48.

PDBsum is one of the representative structural annotation databases,
hosted by EBI.
It provides structural pockets, molecular surfaces, Ramachandran plots,
and the ligands and cofactors in a PDB entry.
Sequence search and keyword search are available.

Slide 49.

Let me check the definition of protein fold.
In the broad definition, a protein structure is defined by the atomic
coordinates of all atoms in the molecule. Side chains are included.
The narrow definition is only the trace of the alpha-carbon atoms of the
protein, or the peptide bonds.
In structure classification studies, the narrow definition is frequently
applied.

Related terms are listed here.
Topology is the position of the secondary structure units.
Packing is the assembly of the secondary structure units.

Slide 50.

This is a graphical representation of the alpha trace and the main chain.
The main chain includes the peptide bonds.
Basically, the main chain can be derived from the alpha trace.
The merit of the alpha trace is its low computational cost.
The main chain can define the correct secondary structure.

Slide 51.

Classification of protein folds is performed based on the PDB entries.
Mainly, hierarchical classification is applied.
SCOP and CATH are the major databases for this type of protein
classification.
Another type of protein classification is based on structural distance.
Today, I would like to focus on SCOP and CATH.

Slide 52.

This slide shows the number of entries in SCOP and CATH.
In the SCOP database, the recent version defines 1086 folds.
The count of SCOP folds has increased as described here.
And also, the number of CATH topologies, which correspond to folds, is
one thousand one hundred and ten.
These fold counts are almost the same.
Please note that the number of protein folds is limited and is estimated
at about one thousand one hundred.

Slide 53.

Next, I would like to talk about how to compare 3D structures.
Comparison is always a very informative method.
Many approaches to comparing 3D structures have been proposed.
The basic method of structure comparison is minimization of RMSD.
Alternative methods are listed but I would like to skip them today.
Here are some applications of structure comparison.
It is used to classify 3D structure domains, as in SCOP and CATH.
Structural similarities can be used for functional annotation.
In molecular dynamics, structural change is measured by RMSD.
Applying it to the discovery of superfolds is also interesting.

Slide 54.

RMSD is "root mean square deviation".
Calculation of RMSD consists of three steps.
Firstly, generate a sequence alignment to make a residue-residue pair
list.
And then translate and rotate the atomic coordinates to fit the target
structure.
Finally, calculate the RMSD using the coordinates of the C-alpha or main
chain atoms.
The formula for RMSD is described here.
An iterative process is applied to minimize the RMSD.
When the RMSD no longer decreases, the calculation stops.
In general, an RMSD of 2 or 3 angstroms is evidence of structural
similarity.
However, please note that RMSD itself depends on protein size.
For a larger protein, the calculated RMSD tends to be larger, because
the deviations of more atoms accumulate.
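
For N paired atoms with coordinates a_i and b_i, the usual formula is
RMSD = sqrt((1/N) * sum of |a_i - b_i|^2). The following Python sketch
shows only the translation step and the final calculation. The rotation
step is left out here, because it needs an optimal-superposition method
such as the Kabsch algorithm. The function names are hypothetical, and
the two coordinate lists are assumed to be already paired residue by
residue.

```python
import math


def centroid(coords):
    """Return the mean position of a list of (x, y, z) tuples."""
    n = len(coords)
    return tuple(sum(c[i] for c in coords) / n for i in range(3))


def center(coords):
    """Translate the coordinates so that their centroid is at the origin."""
    cx, cy, cz = centroid(coords)
    return [(x - cx, y - cy, z - cz) for x, y, z in coords]


def rmsd(a, b):
    """Root mean square deviation between two paired coordinate lists."""
    if len(a) != len(b) or not a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared = sum(
        (p - q) ** 2
        for atom_a, atom_b in zip(a, b)
        for p, q in zip(atom_a, atom_b)
    )
    return math.sqrt(squared / len(a))


if __name__ == "__main__":
    reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
    shifted = [(x + 5.0, y + 5.0, z + 5.0) for x, y, z in reference]
    # After centering, a pure translation no longer contributes.
    print(rmsd(center(reference), center(shifted)))
```

Without the rotation step, the value is only an upper bound on the
minimized RMSD that the slide refers to.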

Slide 55.

ProteinDBS is used to detect similarity using RMSD.
You can submit a PDB file as a query.
And the result will be shown as below.
The top one hundred structural neighbors in PDB are presented in the
output.

Slide 56.

Next, I would like to talk about protein structure prediction.
Protein structure prediction basically means predicting the 3D structure
from the amino acid sequence.
That is a grand challenge of bioinformatics.
Several types of prediction strategy have been proposed.
They utilize known protein structures, amino acid sequences and the
physical properties of amino acid residues.
There are four methods currently.
Homology modeling, the 3D-1D (threading) method, fragment assembly and
ab initio.

Slide 57.

The four methods are categorized into two groups.
Homology modeling and 3D-1D are template-based prediction.
Fragment assembly and ab initio modeling are free from templates of
known structure.

The template-based method is the first choice.
The template-free type is still in the development phase.

So if you have already found a template structure in PDB, please try
homology modeling.
If a homologous structure is not found, you should try both 3D-1D and
fragment assembly.
If the two results are almost the same, the prediction may be reliable.

Please note that the only realistic solution is homology modeling.

Because method 3 sometimes predicts new folds, the method is sometimes
referred to as "ab initio". However, to avoid confusion with method 4,
method 3 is usually called "de novo" or "new fold".

Slide 58.

This slide shows a summary of homology modeling.
Homology modeling is based on a simple concept.
"If the sequence is similar, the structure is also similar."
This concept is considered solid in almost all cases.
If the sequence identity of the alignment is more than 30 to 40%, this
method can be applied.
Sufficient coverage is also needed.
The prediction result depends on the sequence alignment.
Prediction of side chain conformation is outside the scope of this
strategy.
If there are too many gaps in the alignment, please be careful about the
prediction result.
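
The identity and coverage checks can be sketched in Python as follows.
This is a minimal sketch under stated assumptions: the two aligned
sequences are strings of equal length with "-" for gaps, identity is
counted over the aligned (non-gap) columns, and coverage is the fraction
of query residues that are aligned to the template. Other definitions of
identity are also in use. The function name and the toy alignment are
hypothetical.

```python
def alignment_stats(query_aln, template_aln):
    """Return (identity, coverage, gap_columns) for a pairwise alignment."""
    if len(query_aln) != len(template_aln):
        raise ValueError("aligned sequences must have the same length")

    pairs = list(zip(query_aln, template_aln))
    # Columns where both sequences have a residue.
    aligned = sum(1 for q, t in pairs if q != "-" and t != "-")
    # Aligned columns where the two residues are the same.
    identical = sum(1 for q, t in pairs if q != "-" and q == t)
    # Columns where exactly one sequence has a gap.
    gap_columns = sum(1 for q, t in pairs if (q == "-") != (t == "-"))
    query_length = sum(1 for q in query_aln if q != "-")

    identity = identical / aligned if aligned else 0.0
    coverage = aligned / query_length if query_length else 0.0
    return identity, coverage, gap_columns


if __name__ == "__main__":
    identity, coverage, gaps = alignment_stats("MKT-AYIA", "MKTLA-IG")
    print(round(identity, 2), round(coverage, 2), gaps)
```

The identity value can then be compared with the 30 to 40% guideline
mentioned on this slide, and the gap count gives a rough warning sign.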

Slide 59.

A representative web service for homology modeling is SWISS-MODEL,
hosted by SIB (Swiss Institute of Bioinformatics).
The user can submit an amino acid sequence or an alignment.
This service responds quickly. The computational time was 80 seconds for
a medium-sized protein.
The prediction result includes the side chains of the amino acids. But
it lacks hydrogen atoms.

That is all for structure prediction.

Slide 60.

If you want to predict the binding partner of a protein,
a knowledge-based approach is helpful.
Databases of protein-protein interaction (PPI) are listed here.
Sequence search is available on the PPI databases.
Representative PPI databases are BOND, DIP, MINT, HPRD and IntAct.
Integrated databases also provide PPI data.
You can predict a binding partner from the result of a sequence search.

If you know the binding partner of your gene product, you can predict
the complex structure.

Slide 61.

Complex prediction is very important for linking gene networks and
structural informatics.
The fundamental algorithm of complex prediction consists of five steps.
The first two steps are modeling of the protein shape and of the
electrostatic map.
And then, complex candidates are sampled from a global search by protein
docking simulation.
Next, the protein complex candidates are evaluated by the program. For
example, binding free energy is used to rank and filter the complex
candidates.
Finally, refinement of the predicted complex is performed.
That is the basic scheme of protein complex prediction.

Slide 62.
This slide shows the details of complex prediction. Steps 1 and 2 are
the pre-processing phase.
Step 3 is the most time-consuming step. So the performance of this step
is improved by this type of program.

Slide 63.
Steps 4 and 5 are post-processing of the docking simulation.
In particular, the assessment in step 4 is the most important for the
final result of the prediction.
Many strategies have been considered, as listed here.
Step 5 uses molecular mechanics and molecular dynamics. Free energy can
be calculated in this step.

Slide 64.
This is an example of a web server for prediction of protein complexes.
Please visit the web service if you are interested in prediction of
protein complexes.
This case is the homodimer of superoxide dismutase 1.
GRAMM-X produced a successful prediction.

That is all for complex structure prediction.

Slide 65.

Next I would like to mention disordered regions in proteins.
A disordered region is a stretch of unstable protein structure more than
50 amino acids long.
It is also known as a natively unfolded region.
The atomic coordinates of disordered regions are missing in X-ray and
NMR structure determination.
And these regions sometimes become stable by binding to another protein.
CREB-binding protein is a transcriptional coactivator, and a very long
disordered region has been detected in it.

Slide 66.

Based on accumulated experimental data, the challenge of predicting
disordered regions has been launched.
The approach to this prediction is collection of experimental data and
machine learning.
Currently, it is known that about five hundred proteins have an
unstructured motif.
The number of disordered regions is more than one hundred.

Slide 67.
These are representative web services for prediction of disordered
regions.
DISOPRED and POODLE are winners of the disordered region section of the
CASP contest.

Slide 68.
This slide shows a summary of structural bioinformatics.

That is all for structural informatics.
Are there any questions? Comments are also welcome.

Okay, I will summarize them in Japanese.

Slide 69.

Finally, I would like to introduce a helpful philosophy called
"biological hierarchy".
The "biological hierarchy" ranges from the microscopic phase to the
macroscopic phase.
All biological information is categorized into a phase of the
biological hierarchy.
So linking different phases is what discovery means in the life
sciences.

Slide 70.

These are examples of discoveries of inter-phase linkage.
Disease that depends on diet is a link between the individual and
environment phases.
For example, the discovery of the link between stomach cancer risk and a
salty diet.
Another example is disease triggered by a conformational change in a
protein, such as CJD.
This discovery linked the individual and atomic phases.
As shown, important discoveries always link different phases of the
biological hierarchy.

Slide 71.
And I think there are two additional guides to the biological hierarchy.
Evolution and polymorphism extend from the individual phase.
Evolution includes comparative genomics, molecular evolution and
phylogenetic information, and it can approach the origin of life.
Polymorphism is a key to linking phenotype and the other phases of the
biological hierarchy.

Slide 72.
This is the final message of this lecture.
The first step to discovery is this:
-all the information you have should be organized using the guide of the
biological hierarchy.
-I hope that, by using this philosophy, you will never drown in the
overflow of biological information.

That's all for today's lecture.
I will take your questions.

クロコダイル日記

Thursday, September 10, 2009
