The ProteoRE Galaxy instance provides the tools needed to run a complete annotation pipeline on a list of proteins identified by LC-MS/MS experiments. This tutorial introduces these tools and guides you through a simple pipeline using example datasets based on the following study: “Proteomic characterization of human exhaled breath condensate” by Lacombe et al., European Journal of Breath, 2018.
The estimated time to complete this tutorial is 60 minutes. If you have any questions, problems or feedback, please contact us at contact@proteore.org.
The objective of this tutorial is to annotate and explore a proteomic dataset by answering the following questions:
To follow this tutorial, general knowledge of the Galaxy environment is necessary. Please read the Galaxy introduction if you are not familiar with this environment.
For this tutorial, we will use three datasets:
A shared data library that contains these datasets is available, or you can import the history below to your own history by clicking on the green plus button:
Once a proteome has been identified and/or quantified using an MS-based approach, interpreting it is an important step: it characterizes the content of the sample in terms of functional properties and extends the biological knowledge related to that sample. In this tutorial, we illustrate the annotation and exploration of the exhaled breath condensate (EBC) proteome by performing the following steps:
A full workflow for this tutorial is available in Shared Data > Workflows > ProteoRE_workflow_Tutorial_1, or you can import it from below:
Here is the history containing the results obtained after running this workflow on the datasets imported from the Input Datasets section:
A group of 10 proteins were identified in both “technical” control samples with an enrichment in EBC samples below a fixed threshold. These proteins were thus considered to be technical contaminants (see list of proteins in Table 4 in Lacombe et al. 2018) and have to be removed from the initial dataset.
Step 1. From Tool Panel choose ProteoRE > Data Manipulation > Filter by keywords or numerical value tool.
Step 2. In Input file parameter, select Lacombe_et_al_2017.txt. Keep default option Yes for header parameter.
Step 3. Click Insert Filter by keywords box to add the list of keywords to be filtered out. In this case, the keywords are a list of Uniprot accession numbers.
Step 4. Fill in the parameters in Filter by keywords section:
Click Execute button.
In History Panel, two output files will be created:
Note: you can change the name of each result dataset by clicking the Edit (pencil) button.
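To make the filtering step concrete, here is a minimal Python sketch of the idea, not the tool's actual implementation. It assumes exact matching of each keyword against one column, and it assumes the two outputs are the kept rows and the discarded rows. The function name and the accession values are hypothetical placeholders.

```python
import csv
import io


def filter_by_keywords(rows, keywords, column=0):
    """Split rows into (kept, discarded) by exact match of row[column] against keywords."""
    keyword_set = set(keywords)
    kept, discarded = [], []
    for row in rows:
        # A row is discarded when its identifier is one of the keywords.
        (discarded if row[column] in keyword_set else kept).append(row)
    return kept, discarded


# Hypothetical tab-separated input with a header line (placeholder accessions).
example = "Accession\tPeptides\nACC_A\t12\nACC_B\t3\nACC_C\t7\n"
reader = csv.reader(io.StringIO(example), delimiter="\t")
header = next(reader)  # corresponds to keeping "Yes" for the header parameter
kept, discarded = filter_by_keywords(list(reader), ["ACC_B"])
```

Here `kept` holds the rows for ACC_A and ACC_C, and `discarded` holds the row for ACC_B.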
As EBC samples are obtained from air exhaled through the oral cavity, contamination with salivary proteins had to be assessed, even though the RTube collection device contained a saliva trap to separate saliva from the exhaled breath. We decided to check the expression pattern of each protein of the "core" EBC proteome using the Human Protein Atlas (HPA). As HPA is indexed by Ensembl gene identifier (ENSG), we first need to convert Uniprot IDs to Ensembl gene IDs (ENSG). Second, we check for proteins that are highly expressed in the salivary glands as reported by HPA. Third, we filter out these proteins.
Step 1. From Tool Panel choose ProteoRE > Data Manipulation > ID Converter tool.
Step 2. In section Provide your identifiers, option Input file containing your identifiers is chosen by default. Select input file and set its parameters as follows:
Then click Execute button.
In History Panel, a new file named ID Converter on data 5 will be created:
In this file, you can see that a new column containing Ensembl IDs was added.
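Conceptually, the conversion looks up each identifier in a mapping table and appends the result as a new column. The sketch below illustrates this under stated assumptions: the mapping dictionary, the identifiers and the "NA" marker for unmapped entries are all hypothetical. In practice, the conversion relies on the ID Converter tool's own reference data.

```python
def add_converted_ids(rows, mapping, id_column=0, missing="NA"):
    """Return new rows with the mapped identifier appended as the last column."""
    return [row + [mapping.get(row[id_column], missing)] for row in rows]


# Hypothetical mapping table and input rows (placeholder identifiers).
mapping = {"ACC_A": "ENSG_A", "ACC_C": "ENSG_C"}
rows = [["ACC_A", "12"], ["ACC_B", "3"], ["ACC_C", "7"]]
converted = add_converted_ids(rows, mapping)
```

The input rows are left unchanged, and rows without a mapping receive the assumed "NA" marker rather than being dropped.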
Step 1. From Tool Panel choose ProteoRE > Protein list annotation > Add expression data to your protein list tool.
Step 2. In section Enter your list of Ensembl gene ID, option Input file containing your identifiers is chosen by default. Select input file and set its parameters as follows:
Step 3. A wide range of information can be extracted from the HPA source files. For a more detailed description, read the user documentation at the end of the tool's submission form. In this tutorial, we select Gene name, Gene description, RNA tissue category (according to HPA) and RNA tissue specificity abundance in "Transcript Per Million".
Then click Execute button.
In History Panel, a new file named Add expression data to your protein list on data 8 will be created:
Four columns were added (n°5, 6, 7 and 8) corresponding to the HPA information previously selected. Scroll down the table and note that, at the end of the list, column n°8 reports AMY1B, CALML5, PIP, ZG16B, CST4, MUC7, CST1 and CST2 as highly enriched in salivary gland, each with an elevated RNA transcript specific TPM value, suggesting that these proteins may come from the saliva and not from the exhaled breath condensate. We will therefore remove these biological contaminants from our initial protein set.
Step 1. Again from Tool Panel, choose ProteoRE > Data Manipulation > Filter by keywords or numerical value tool.
Step 2. In Input file parameter, select Add expression data to your protein list on data 8. Keep default option Yes for header parameter.
Step 3. Click Insert Filter by keywords box to add the list of keywords to be filtered out. In this step, we will filter out the lines that contain "salivary" in the column of RNA transcript specific TPM.
Step 4. Fill in the parameters in Filter by keywords section:
Then click Execute button.
Two output files are created:
Note also that a list of gene names (selected on the basis of their TPM value) could have been entered and applied to column n°5, instead of applying the keyword "salivary" to column n°8, as was done in Lacombe et al., 2018.
The resulting list of 141 proteins identified in the two pooled EBC samples (excluding the 10 salivary proteins) is now submitted to Gene Ontology (GO)-term enrichment analysis to determine which functions are significantly enriched in our EBC proteomic dataset compared to the lung proteome (corresponding to tissue-specific genes extracted from the Human Protein Atlas). To do so, we first build a lung reference proteome, which should be more representative of the studied sample than a full human proteome. This reference is then used as the background for the enrichment analysis performed with the ClusterProfiler tool (based on the R package clusterProfiler).
Step 1. From Tool Panel, choose ProteoRE > Annotation retrieval from DB > Retrieve tissue-specific expression data tool.
Step 2. Set parameters as follows:
Click Execute button.
A new file is now added to History Panel:
Note that expression information for respiratory cell types is also retrieved (column 4; e.g. macrophages, pneumocytes, respiratory epithelial cells) and could be used for further refinement of your reference background.
As the ClusterProfiler tool, which we will use for the enrichment analysis, does not accept ENSG (Ensembl gene) identifiers as input, we need to convert these IDs into either Entrez gene IDs or Uniprot accession numbers, both of which the tool accepts.
Step 1. From Tool Panel choose ProteoRE > Data Manipulation > ID Converter tool.
Step 2. Set input parameters:
Step 3. Set source and target ID type:
Click Execute button. Two new columns will be added to input file:
Step 1. From Tool Panel choose ProteoRE > Functional Analysis > clusterProfiler tool.
Step 2. Set input parameters:
Step 3. Set source/type of input IDs: Uniprot Accession number
Step 4. Set species: Human
Step 5. GO categories representation analysis parameters: Yes
Step 6. GO categories enrichment analysis parameter: Yes
Step 7. Set GO categories: select all three options
Click Execute button.
The History Panel now contains a new text output file and a new list of graphical outputs.
The suffix "GGO" (GroupGO) corresponds to the results of the "GO categories representation analysis" option (a gene/protein classification based on GO distribution at a specific level), while the suffix "EGO" (EnrichGO) corresponds to the results of the enrichment analysis (based on an over-representation test of GO terms against the lung reference background). Two types of graphical output are provided: bar plots and dot plots.
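For readers unfamiliar with over-representation tests, the sketch below shows one common formulation: a one-sided hypergeometric test for a single GO term. It is an illustration only and does not reproduce the tool's output. It assumes the hypergeometric formulation, omits multiple-testing correction, and uses made-up counts and a hypothetical function name.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """Probability of seeing at least k annotated proteins in a sample of n,
    drawn from a background of N proteins of which K carry the annotation."""
    upper = min(n, K)
    total = comb(N, n)
    # math.comb returns 0 when the second argument exceeds the first,
    # so impossible configurations contribute nothing to the sum.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


# Made-up numbers: background of 20 proteins, 5 annotated with the term;
# the studied list has 4 proteins, 2 of which carry the term.
p = hypergeom_pvalue(k=2, n=4, K=5, N=20)
```

With these illustrative counts, `p` is 1205/4845, about 0.249, so such a term would not be considered enriched. In the tutorial, the background plays the role of the lung reference proteome.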
According to this analysis, the main biological processes that were found over-represented in EBC compared to lung were numerous immune system processes and exocytosis (see EGO.BP.dot.png, for Enriched Biological Process GO terms dot-plot representation in png format). Below you can click on Go to dataset to view diagrams of BP category.
The 141 proteins identified in EBC samples are now mapped to biological pathways and visualized via the web service of Reactome, an open access, manually curated and peer-reviewed human pathway database that aims to provide intuitive bioinformatics tools for the visualization, interpretation and analysis of pathway knowledge.
Step 1. From Tool Panel choose ProteoRE > Pathway Analysis > Reactome tool.
Step 2. Set input parameters:
Then click Execute button.
From History Panel, click View data button of the new output to display access to Reactome in the central panel. Click the Analyse button to open the Reactome analysis tools page via the web service and display the results. Browse the biological pathways in which EBC proteins are highlighted (e.g. immune system pathways) using the Reactome interface. Here you can click on Go to dataset > Analyse to open the web service page.
Our experimental design and the dataset produced (i.e. the list of 151 proteins identified in both pooled EBC samples, including the 10 salivary proteins) were compared to the two most extensive EBC proteome maps previously described for healthy subjects (Mucilli et al., 2015; Bredberg et al., 2012). To do so, a Venn diagram showing the overlap between our dataset and the two previous EBC characterizations in healthy donors is drawn using the Jvenn tool from ProteoRE.
Step 1. From Tool Panel choose ProteoRE > Data Manipulation > Jvenn tool.
Step 2. Set first input (our EBC sample) parameters:
Step 3. Set second input (sample from Bredberg's study) parameters:
Step 4. Set third input (sample from Mucilli's study) parameters:
Then click Execute button.
A text output and a graphical output are now created. From the Venn diagram, we can see the number of proteins that are common to, or unique to, each combination of lists (click on Go to dataset to view the Venn diagram).
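The counts shown in a three-way Venn diagram can be understood as the sizes of disjoint regions, one per combination of lists. The following sketch computes these regions with plain Python sets. The list names and identifiers are hypothetical, and this is not how the Jvenn tool itself is implemented.

```python
def venn_regions(named_sets):
    """Map each combination of list names to the items found in exactly those lists."""
    regions = {}
    all_items = set().union(*named_sets.values())
    for item in all_items:
        # The region key is the sorted tuple of every list containing the item.
        members = tuple(sorted(name for name, s in named_sets.items() if item in s))
        regions.setdefault(members, set()).add(item)
    return regions


# Hypothetical identifier lists standing in for the three studies.
lists = {
    "ours": {"A", "B", "C", "D"},
    "study_1": {"B", "C", "E"},
    "study_2": {"C", "D", "F"},
}
regions = venn_regions(lists)
counts = {names: len(items) for names, items in regions.items()}
```

Each item lands in exactly one region, so the counts add up to the number of distinct identifiers across all lists.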
----------------
This is the end of Tutorial #1. Thank you for joining and completing this tutorial.