Studying the evolutionary history and function(s) of a protein is an intriguing but nontrivial task.
The computational analysis usually starts with collecting homologous proteins.
Subsequent steps include:
The next crucial step is the analysis of sequence conservation, protein features and gene neighborhoods in the context of phylogenetic clustering.
Such comprehensive analysis is time-consuming and error-prone.
To help scientists analyze their data, we have created a protein analysis hub that automates these steps and provides additional tools.
Full pipeline mode.
There are two ways the pipeline can be used in this mode:
To use this option, paste your sequences or upload a file and click the SUBMIT button.
You can also provide a list of NCBI or MiST protein ids/locus tags instead of protein sequences, and our id analyzer will retrieve the sequences from NCBI and MiST.
Remember to tick the Retrieve sequences (NCBI and MiST) option.
If you already have an alignment that you prepared and edited yourself, you may skip the alignment step of the pipeline and use your alignment instead. Paste your alignment and uncheck the Align option.
If you want to use TREND only to align your sequences and build a tree, you may skip the domain identification step by unchecking the Identify Features option.
If you want to use this option:
a) Click the Add second area button. A new query area will appear.
b) Put the full-length sequences into the first area (or choose a file). Protein features will be identified using these sequences.
c) Put the sequence fragments into the second area (or choose a file). The protein sequences in this area will be used for the alignment and for building the phylogenetic tree.
Important: sequence names in the two areas must be identical. If you downloaded the sequences from the same database, or changed the sequence names consistently in all the files, this will happen naturally.
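Before submitting, the name requirement above can be checked locally. The following is a minimal sketch, not part of TREND: the function names and the example records (including the accession-like names) are made up for illustration, and a sequence name is assumed to be the whole FASTA header line after the '>' character.

```python
def fasta_names(text):
    """Return the set of sequence names (header lines without '>') in FASTA text."""
    return {line[1:].strip() for line in text.splitlines() if line.startswith(">")}


def compare_names(full_length_fasta, fragments_fasta):
    """Return (names only in the first area, names only in the second area)."""
    full = fasta_names(full_length_fasta)
    frag = fasta_names(fragments_fasta)
    return sorted(full - frag), sorted(frag - full)


# Hypothetical example data: the names below are placeholders, not real records.
full_length = (
    ">WP_000000001.1 protein A\nMKTAYIAKQRQISFVKSHFSRQ\n"
    ">WP_000000002.1 protein B\nMSDNELLKQAIE\n"
)
fragments = (
    ">WP_000000001.1 protein A\nAKQRQISF\n"
    ">WP_000000003.1 protein C\nLLKQ\n"
)

only_in_full, only_in_fragments = compare_names(full_length, fragments)
print("Only in first area:", only_in_full)
print("Only in second area:", only_in_fragments)
```

Both lists should be empty before the two areas are submitted together.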
Partial pipeline mode.
Use this mode if you have your own phylogenetic tree and the protein sequences used to produce it.
Important: sequence names in all the files must match. If you downloaded the sequences from the same database, or changed the sequence names consistently in all the files, this will happen naturally.
a) In the Domains section, click the Partial Pipeline button.
b) Upload the protein sequences used to build the tree (Choose file with protein sequences (fasta) button).
c) Optional: if you want the alignment to be ordered according to the order of the tree leaves, upload your alignment (Choose file with alignment (fasta) button).
d) Upload the tree in Newick format (Choose file with tree (Newick) button).
If you only want to reorder the sequences in your alignment according to the tree leaves, upload just the alignment and the tree.
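To illustrate what reordering an alignment by tree leaf order means, here is a minimal sketch. It is not TREND's implementation: the function names and example data are hypothetical, and the Newick parsing is deliberately simple (it assumes unquoted leaf labels, each following '(' or ',').

```python
import re


def newick_leaf_names(newick):
    """Return leaf names in the order they appear in a Newick string.

    Assumption: labels are unquoted, so a leaf label always follows '(' or ','.
    Internal node labels and branch lengths follow ')' or ':' and are skipped.
    """
    found = re.findall(r"[(,]\s*([^():,;]+)", newick)
    return [name.strip() for name in found if name.strip()]


def parse_fasta(text):
    """Parse FASTA text into a dict mapping sequence name to sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif name is not None and line:
            records[name] += line
    return records


def reorder_alignment(alignment_fasta, newick):
    """Return the alignment as FASTA text, ordered like the tree leaves."""
    records = parse_fasta(alignment_fasta)
    leaves = newick_leaf_names(newick)
    missing = [leaf for leaf in leaves if leaf not in records]
    if missing:
        raise ValueError(f"names missing from the alignment: {missing}")
    return "".join(f">{leaf}\n{records[leaf]}\n" for leaf in leaves)


# Hypothetical example data.
tree = "((seqB:0.1,seqC:0.2)0.9:0.3,seqA:0.4);"
alignment = ">seqA\nMK-TAY\n>seqB\nMKQTAY\n>seqC\nMK-SAY\n"

reordered = reorder_alignment(alignment, tree)
print(reordered)
```

The sketch raises an error when a leaf name has no counterpart in the alignment, which is the same mismatch the note about matching names warns against.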
a) Make sure each tree leaf name begins with an NCBI/MiST protein id or locus tag, separated from the rest of the name by a space, underscore ( _ ) or vertical bar (|).
b) Tick Retrieve sequences (NCBI and MiST). Our id analyzer will retrieve the sequences from NCBI and MiST.
c) Upload the tree in Newick format (Choose file with tree (Newick) button) and start the analysis.
If you downloaded the sequences from NCBI or MiST and used them to build the tree, the ids will already be at the beginning of the sequence names and no changes are needed.
Running the Domains pipeline produces a phylogenetic tree combined with interactive protein features.
Clicking on a feature opens an information block with details of the identified features and highlights the part of the sequence that corresponds to it. The domain analysis details contain links to the entries in the corresponding databases (Pfam and CDD) for each identified domain. Use the mouse wheel to zoom in and out.
All the produced data is downloadable.
By running our neighborhoods analysis pipeline, you can cluster prokaryotic genes based on the shared domains of the encoded proteins and visualize the clusters and gene neighborhoods on a phylogenetic tree.
Protein names in the sequence file or in the tree should start with a protein identifier, separated from the rest of the name by a space, underscore ( _ ) or vertical bar (|).
a) NCBI RefSeq Id;
b) Locus tag, either old or new;
c) MiST Id.
The analyzer can be run in four ways:
1) by pasting protein sequences (or uploading the corresponding file);
2) by pasting a phylogenetic tree in Newick format (or uploading the corresponding file);
3) by pasting a list of NCBI/MiST protein ids/locus tags (or uploading the corresponding file). Remember to check Retrieve sequences (NCBI and MiST);
4) by pasting an alignment in FASTA format (or uploading the corresponding file).
Bear in mind that RefSeq Ids are not organism-specific. To explore the neighborhoods of genes from a particular organism, provide locus tags or MiST Ids.
The 'Operon tolerance' parameter is the maximum distance in nucleotides between neighboring genes for them to be considered as encoded in one operon.
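The effect of 'Operon tolerance' can be illustrated with a small sketch. This is not TREND's code: the function name, the gene tuples and the coordinates are hypothetical, and the sketch uses only the distance criterion described above, ignoring any additional criteria (such as strand) that the pipeline may apply.

```python
def group_by_distance(genes, tolerance):
    """Group neighboring genes whose gap is at most `tolerance` nucleotides.

    genes: list of (name, start, end) tuples on one replicon.
    Returns a list of groups, each a list of gene names in genomic order.
    """
    groups = []
    previous_end = None
    for name, start, end in sorted(genes, key=lambda gene: gene[1]):
        # Gap between the end of the previous gene and the start of this one.
        if previous_end is not None and start - previous_end <= tolerance:
            groups[-1].append(name)
        else:
            groups.append([name])
        previous_end = end
    return groups


# Hypothetical coordinates, for illustration only.
genes = [
    ("geneA", 100, 900),
    ("geneB", 950, 1800),
    ("geneC", 2400, 3000),
    ("geneD", 3010, 3500),
]

print(group_by_distance(genes, tolerance=200))
print(group_by_distance(genes, tolerance=20))
```

With a tolerance of 200 the 50-nucleotide gap between geneA and geneB keeps them together; lowering the tolerance to 20 splits them while geneC and geneD (10 nucleotides apart) stay grouped.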
The 'Not shared domains tolerance' parameter is the number of domains not shared between any two proteins that is allowed for the corresponding genes still to be considered members of the same cluster.
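As a rough illustration of 'Not shared domains tolerance', the sketch below counts the domains that occur in only one of two proteins and compares that count with the tolerance. How TREND actually counts domains (for example, repeated domains) is not stated here, so treating each protein as a set of distinct domain names is an assumption, and the function names and domain lists are illustrative only.

```python
def not_shared_domains(domains_a, domains_b):
    """Count distinct domains present in exactly one of the two proteins."""
    return len(set(domains_a) ^ set(domains_b))


def within_tolerance(domains_a, domains_b, tolerance):
    """True if the two proteins may fall into the same cluster under `tolerance`."""
    return not_shared_domains(domains_a, domains_b) <= tolerance


# Illustrative domain architectures (domain names used as plain labels).
kinase_1 = ["PAS", "HisKA", "HATPase_c"]
kinase_2 = ["HisKA", "HATPase_c"]
regulator = ["Response_reg"]

print(not_shared_domains(kinase_1, kinase_2))          # PAS is the only unshared domain
print(within_tolerance(kinase_1, kinase_2, tolerance=0))
print(within_tolerance(kinase_1, kinase_2, tolerance=1))
print(within_tolerance(kinase_1, regulator, tolerance=1))
```

Under this counting, a tolerance of 0 requires identical domain sets, while larger values let proteins with small differences in domain content share a cluster.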
The 'Number of neighboring genes on one side (max 15)' parameter is the number of neighboring genes shown on each side of a gene of interest. The maximum is 15, i.e. up to 30 neighboring genes in total will be displayed.
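The neighborhood window can be pictured with the following sketch. It is only an illustration under simple assumptions: genes are given as an ordered list on a linear replicon (circular genomes are not handled), and the function and constant names are hypothetical.

```python
MAX_NEIGHBORS_PER_SIDE = 15  # limit stated for the parameter


def neighborhood(genes, index, per_side):
    """Return (upstream, downstream) neighbors of genes[index].

    Fewer than `per_side` genes are returned on a side when the gene of
    interest lies close to the end of the list.
    """
    if not 0 <= per_side <= MAX_NEIGHBORS_PER_SIDE:
        raise ValueError(f"per_side must be between 0 and {MAX_NEIGHBORS_PER_SIDE}")
    upstream = genes[max(0, index - per_side):index]
    downstream = genes[index + 1:index + 1 + per_side]
    return upstream, downstream


# Hypothetical ordered gene list.
genes = [f"gene{i}" for i in range(40)]

upstream, downstream = neighborhood(genes, index=20, per_side=15)
print(len(upstream) + len(downstream))  # total neighbors shown around gene20
print(upstream[0], downstream[-1])
```

With the maximum setting, a gene in the middle of the list gets 15 neighbors on each side, which is the total of 30 mentioned above.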
Running the Neighborhoods pipeline produces a phylogenetic tree combined with interactive gene neighborhoods.
Genes that belong to the same cluster are shown in the same color. Genes that belong to the same operon have borders of the same color.
Hovering the mouse over a gene in the picture shows its information, including the product name, NCBI and MiST ids, links to the databases, encoded domains and cluster ids. Use the mouse wheel to zoom in and out.
All the produced data is downloadable.
First supported version is shown