ProteoRE tutorial #1. Annotating a protein list identified by LC-MS/MS experiments

The ProteoRE Galaxy instance provides the tools needed to run a complete annotation pipeline on a protein list identified by LC-MS/MS experiments. This tutorial introduces these tools and guides you through a simple pipeline using example datasets from the following study: “Proteomic characterization of human exhaled breath condensate” by Lacombe et al., European Journal of Breath, 2018.

The estimated time to complete this tutorial is 60 minutes. If you have any questions, problems or feedback, please contact us at contact@proteore.org.

Objective

The objective of this tutorial is to annotate and explore a proteomic dataset by answering the following questions:

How to filter out technical contaminants?
How to check for tissue-specificity?
How to perform enrichment analysis?
How to map your protein list to pathways (Reactome)?
How to compare your proteome with other studies?

Requirements

To follow this tutorial, general knowledge of the Galaxy environment is necessary. Please read the Galaxy introduction if you are not familiar with this environment.

Input datasets

For this tutorial, we will use three datasets:

The list of proteins identified by LC-MS/MS in the exhaled breath condensate (EBC) from Lacombe et al.:

Galaxy Dataset | Lacombe_et_al_2017.txt

And two other previously published EBC proteomes:

Mucilli et al.

Galaxy Dataset | Mucilli.txt

Bredberg et al.

Galaxy Dataset | Bredberg.txt

A shared data library that contains these datasets is available, or you can import the history below to your own history by clicking on the green plus button:

Galaxy History | ProteoRE tutorial #1. Example datasets

Methods

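Once a proteome has been identified and/or quantified using an MS-based approach, interpreting it is an important step: characterizing its content in terms of functional properties extends the biological knowledge related to the sample. In this tutorial, we illustrate the annotation and exploration of the EBC proteome by performing the following steps:

Filtering out technical contaminants
Check for the presence of biological contaminants
Functional annotation of the EBC proteome (enrichment analysis)
Visualize EBC proteome on biological pathways (using Reactome)
Comparison with other proteomic datasets from previous studies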

A full workflow for this tutorial is available in Shared Data > Workflows > ProteoRE_workflow_Tutorial_1, or you can import it from below:

Galaxy Workflow | ProteoRE_ProteomeAnnotation_Tutorial (release 1.1)

The history below contains the results obtained by running this workflow on the datasets imported in the Input datasets section:

Galaxy History | workflow use case 1 : release 1.1

Filtering out technical contaminants

A group of 10 proteins was identified in both “technical” control samples, with an enrichment in EBC samples below a fixed threshold. These proteins were thus considered technical contaminants (see the list of proteins in Table 4 of Lacombe et al. 2018) and have to be removed from the initial dataset.

Step 1. From Tool Panel choose ProteoRE > Data Manipulation > Filter by keywords or numerical value tool.

Step 2. In Input file parameter, select Lacombe_et_al_2017.txt. Keep default option Yes for header parameter.

Step 3. Click the Insert Filter by keywords box to add the list of keywords to be filtered out. In this case, the keywords are a list of Uniprot accession numbers.

Step 4. Fill in the parameters in Filter by keywords section:

The column of the input dataset on which the filter will be applied: in this case, the column that contains the Uniprot accession numbers (c1 by default).
You can perform an exact or a partial match with the keywords entered. The exact-match option is set to No by default, so partial matching is used; we keep this default in this tutorial.
Here we choose to copy and paste the following list of Uniprot accession numbers (separated by ";"):
P04264;P35908;P13645;Q5D862;Q5T749;Q8IW75;P81605;P22531;P59666;P78386

Click Execute button.
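The filtering step above can be pictured with a few lines of Python. This is only a minimal sketch of the idea (splitting a table into kept and filtered lines by keyword), not the code of the ProteoRE tool: the function names `parse_keywords` and `split_by_keywords` and the toy rows are assumptions made for illustration.

```python
def parse_keywords(text):
    """Split a ';'-separated keyword string, tolerating stray spaces and empty entries."""
    return [k.strip() for k in text.split(";") if k.strip()]


def split_by_keywords(rows, column, keywords, exact=False):
    """Return (kept, filtered) rows; `column` is 1-based, like Galaxy's c1."""
    kept, filtered = [], []
    for row in rows:
        cell = row[column - 1]
        if exact:
            hit = cell in keywords                 # whole-cell match
        else:
            hit = any(k in cell for k in keywords)  # partial (substring) match
        (filtered if hit else kept).append(row)
    return kept, filtered


# The ten technical contaminants listed in the tutorial.
contaminants = parse_keywords(
    "P04264;P35908;P13645; Q5D862 ;Q5T749; Q8IW75;P81605;P22531; P59666; P78386"
)

# Toy rows (hypothetical): the first accession is a listed contaminant,
# the second is a made-up placeholder that is not.
rows = [["P04264", "example A"], ["X00001", "example B"]]
kept, filtered = split_by_keywords(rows, column=1, keywords=contaminants)
```

Here `filtered` plays the role of the "Filtered lines" output and `kept` the role of the main output described below.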

In History Panel, two output files will be created:

Filter_by_keywords_or_numerical_value_on_Lacombe_et_al_2017.txt - Filtered_lines: this output lists the ten protein contaminants removed from the original dataset (11 lines, header line included)

Galaxy Dataset | Filter by keywords or numerical value on Lacombe_et_al_2017.txt - Filtered lines

Filter_by_keywords_or_numerical_value_on_Lacombe_et_al_2017.txt: this output contains the remaining proteins that will be considered for further analysis (151 proteins; 152 lines, header line included)

Galaxy Dataset | Filter by keywords or numerical value on Lacombe_et_al_2017.txt

Note: you can change the name of each result dataset by clicking the Edit (pencil) button.

Check for the presence of biological contaminants

As EBC samples are obtained from air exhaled through the oral cavity, contamination with salivary proteins had to be assessed, even though the RTube collection device contains a saliva trap to separate saliva from the exhaled breath. We decided to check the expression pattern of each protein of the "core" EBC proteome using the Human Protein Atlas (HPA). As HPA is indexed by Ensembl gene identifier (ENSG), we first need to convert Uniprot IDs to Ensembl gene IDs. Second, we check for proteins that are highly expressed in the salivary glands as reported by HPA. Third, we filter out these proteins.

1. Convert Uniprot ID to Ensembl gene

Step 1. From Tool Panel choose ProteoRE > Data Manipulation > ID Converter tool.

Step 2. In section Provide your identifiers, option Input file containing your identifiers is chosen by default. Select the input file and set its parameters as follows:

Input file: Filter_by_keywords_or_numerical_value_on_Lacombe_et_al_2017.txt
Does your input contain header: Yes
The column number: c1

Step 3. Set the Source type and Target type(s) of ID to map.

Select type/source of identifier of your list: Uniprot accession number
Target type of IDs: Ensembl gene ID

Then click Execute button.

In History Panel, a new file named ID Converter on data 5 will be created:

Galaxy Dataset | ID Converter on data 5

In this file, you can see that a new column containing the Ensembl IDs has been added.
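Conceptually, the conversion appends one column obtained from a lookup table. The sketch below illustrates this under stated assumptions: the function `add_mapped_column`, the `"NA"` marker for unmapped identifiers and every identifier in the toy data are hypothetical placeholders, not the tool's actual code or real accession numbers.

```python
def add_mapped_column(rows, column, mapping, missing="NA"):
    """Append the mapped identifier to each row; `column` is 1-based (c1 = 1)."""
    return [row + [mapping.get(row[column - 1], missing)] for row in rows]


# Hypothetical lookup table: keys and values are placeholders, not real IDs.
uniprot_to_ensg = {
    "P00001": "ENSG_PLACEHOLDER_1",
    "P00002": "ENSG_PLACEHOLDER_2",
}

rows = [["P00001", "a"], ["P00002", "b"], ["P00003", "c"]]
converted = add_mapped_column(rows, column=1, mapping=uniprot_to_ensg)
# Each row now has one extra, last column; unmapped accessions get "NA".
```

Appending rather than replacing keeps the original identifiers available for the later steps, which still read the Uniprot accession from c1.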

2. Check for proteins highly expressed in salivary glands

Step 1. From Tool Panel choose ProteoRE > Protein list annotation > Add expression data to your protein list tool.

Step 2. In section Enter your list of Ensembl gene ID, option Input file containing your identifiers is chosen by default. Select the input file and set its parameters as follows:

Input file: the output of the ID Converter step (ID Converter on data 5)
Does your input contain header: Yes
The column number: c4

Step 3. A large amount of information can be extracted from the HPA source files; see the user documentation at the end of the tool's submission form for a more detailed description. In this tutorial, we select Gene name, Gene description, RNA tissue category (according to HPA) and RNA tissue specificity abundance in "Transcript Per Million".

Then click Execute button.

In History Panel, a new file named Add expression data to your protein list on data 8 will be created:

Galaxy Dataset | Add expression data to your protein list on data 8

Four columns were added (n°5, 6, 7 and 8), corresponding to the HPA information previously selected. Scroll down the table and note, at the end of the list in column n°8, that AMY1B, CALML5, PIP, ZG16B, CST4, MUC7, CST1 and CST2 are reported as highly enriched in salivary gland, each with an elevated RNA transcript specific TPM value. This suggests that these proteins may come from the saliva and not from the exhaled breath condensate. We will therefore remove these biological contaminants from our initial protein set.

3. Filter out the contaminants

Step 1. Again from Tool Panel, choose ProteoRE > Data Manipulation > Filter by keywords or numerical value tool.

Step 2. In Input file parameter, select Add expression data to your protein list on data 8. Keep default option Yes for header parameter.

Step 3. Click the Insert Filter by keywords box to add the list of keywords to be filtered out. In this step, we filter out the lines that contain "salivary" in the RNA transcript specific TPM column.

Step 4. Fill in the parameters in Filter by keywords section:

The column of the input dataset on which the filter will be applied: in this case, the RNA transcript specific TPM column (c8).
You can perform an exact or a partial match with the keywords entered. The exact-match option is set to No by default, so partial matching is used; we keep this default in this tutorial.
You can either copy and paste a list of keywords (separated by ";") into the text area or choose a text file that contains one keyword per line. Here we choose to type "salivary" in the text area.

Then click Execute button.

Two output files are created:

Filter by keywords or numerical value on Add expression data to your protein list on data 8 - Filtered lines: 10 proteins have been removed from the EBC list.

Galaxy Dataset | Filter by keywords or numerical value on Add expression data to your protein list on data 8 - Filtered lines

Filter by keywords or numerical value on Add expression data to your protein list on data 8: 141 proteins remain.

Galaxy Dataset | Filter by keywords or numerical value on Add expression data to your protein list on data 8

Note that, instead of applying the keyword "salivary" to column n°8, a list of gene names (selected on the basis of their TPM value) could have been entered and applied to column n°5, as was done in Lacombe et al., 2018.
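The two filtering strategies mentioned in this section (keyword "salivary" on column n°8, or a gene list on column n°5) can be compared with a small sketch. Everything in the toy rows is hypothetical (placeholder accessions, placeholder Ensembl IDs, an invented TPM string); only the gene names come from the list given earlier in this section. The helper names are assumptions as well.

```python
def partition(rows, is_contaminant):
    """Split rows into (kept, removed) according to a row-level test."""
    kept = [r for r in rows if not is_contaminant(r)]
    removed = [r for r in rows if is_contaminant(r)]
    return kept, removed


def by_keyword(row):
    """Partial, case-insensitive match of 'salivary' in column c8."""
    return "salivary" in row[7].lower()


SALIVARY_GENES = {"AMY1B", "CALML5", "PIP", "ZG16B", "CST4", "MUC7", "CST1", "CST2"}


def by_gene(row):
    """Exact match of the gene name (column c5) against the gene list."""
    return row[4] in SALIVARY_GENES


# Toy rows with eight columns; all values except the gene name MUC7 are invented.
rows = [
    ["ACC_1", "-", "-", "ENSG_1", "MUC7", "-", "Tissue enriched", "salivary gland: 100.0"],
    ["ACC_2", "-", "-", "ENSG_2", "GENE_X", "-", "Expressed in all", "NA"],
]

kept_kw, removed_kw = partition(rows, by_keyword)
kept_gene, removed_gene = partition(rows, by_gene)
```

On this toy input both strategies remove the same line; on real data they can differ, since the keyword approach depends on how the HPA annotation is worded while the gene-list approach depends on the list you supply.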

Functional annotation of the EBC proteome (enrichment analysis)

The resulting list of 141 proteins identified in the two pooled EBC samples (excluding the 10 salivary proteins) is now submitted to Gene Ontology (GO)-term enrichment analysis to determine which functions are significantly enriched in our EBC proteomic dataset compared to the lung proteome (corresponding to tissue-specific genes extracted from the Human Protein Atlas). To do so, we first build a lung reference proteome, which should be more representative of the studied sample than a full human proteome. This reference is then used as the background for the enrichment analysis performed with the ClusterProfiler tool (based on the R package clusterProfiler).

1. Build a lung reference proteome as a background: Retrieve tissue-specific expression

Step 1. From Tool Panel, choose ProteoRE > Annotation retrieval from DB > Retrieve tissue-specific expression data tool.

Step 2. Set parameters as follows:

Experimental data source: two experimental data sources are proposed, both from HPA (expression data from immunohistochemistry (IHC) and from RNAseq experiments); here we retrieve information based on IHC (default parameter)
Tissue: the dropdown menu lets you select tissues of interest among a list of 58 tissues; click Lung, then repeat by clicking Bronchus
Expression level: ranges from High to Not detected (according to HPA criteria); here only High, Medium and Low are selected
Reliability score: indicates how reliable the expression/detection level is; here we select Enhanced and Supported, which are the most reliable scores according to HPA (see the user documentation at the end of the tool's submission form for a more detailed description)

Click Execute button.

A new file is now added to History Panel:

Galaxy Dataset | Retrieve tissue-specific expression data

Note that expression information for respiratory cell types is retrieved (column 4; e.g. macrophages, pneumocytes, respiratory epithelial cells) that could be used for further refinement of your reference background.
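The parameter choices above amount to three filters applied to the HPA tissue table, followed by de-duplication of the gene identifiers. The sketch below shows that logic on toy records; the record layout, the function names and the gene IDs are hypothetical, and the real tool may implement the selection differently.

```python
TISSUES = {"lung", "bronchus"}
LEVELS = {"High", "Medium", "Low"}
RELIABILITY = {"Enhanced", "Supported"}


def select_background(records):
    """Keep records matching the tissue, expression level and reliability choices."""
    return [
        r for r in records
        if r["tissue"].lower() in TISSUES
        and r["level"] in LEVELS
        and r["reliability"] in RELIABILITY
    ]


def unique_genes(records):
    """Gene IDs in first-seen order, without duplicates."""
    return list(dict.fromkeys(r["gene"] for r in records))


# Hypothetical records: gene IDs are placeholders, not real Ensembl IDs.
records = [
    {"gene": "ENSG_A", "tissue": "lung", "cell_type": "macrophages",
     "level": "High", "reliability": "Enhanced"},
    {"gene": "ENSG_A", "tissue": "lung", "cell_type": "pneumocytes",
     "level": "Low", "reliability": "Enhanced"},
    {"gene": "ENSG_B", "tissue": "bronchus", "cell_type": "respiratory epithelial cells",
     "level": "Medium", "reliability": "Supported"},
    {"gene": "ENSG_C", "tissue": "lung", "cell_type": "macrophages",
     "level": "Not detected", "reliability": "Enhanced"},
    {"gene": "ENSG_D", "tissue": "liver", "cell_type": "hepatocytes",
     "level": "High", "reliability": "Enhanced"},
    {"gene": "ENSG_E", "tissue": "lung", "cell_type": "pneumocytes",
     "level": "High", "reliability": "Uncertain"},
]

background = unique_genes(select_background(records))
```

A gene detected in several cell types of the same tissue appears on several lines, which is why the sketch de-duplicates before using the list as a background.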

2. Build a lung reference proteome as a background: Convert Ensembl gene ID to Uniprot and Entrez gene ID

As the ClusterProfiler tool (which we are going to use for the enrichment analysis) does not accept ENSG (Ensembl gene) identifiers as input, we need to convert these IDs into a type it supports: either Entrez gene ID or Uniprot accession number.

Step 1. From Tool Panel choose ProteoRE > Data Manipulation > ID Converter tool.

Step 2. Set input parameters:

Input file: Retrieve tissue-specific expression data
Does your input contain header: Yes
The column number: c1

Step 3. Set source and target ID type:

Select type/source of identifier of your list: Ensembl gene ID
Target type of IDs: Uniprot accession number and Entrez gene ID

Click Execute button. Two new columns will be added to input file:

Galaxy Dataset | ID Converter on data 4

3. Functional analysis using ClusterProfiler

Step 1. From Tool Panel choose ProteoRE > Functional Analysis > clusterProfiler tool.

Step 2. Set input parameters:

Input file: the EBC proteome to be analyzed after removal of technical and biological contaminants, i.e. the output of the biological contaminants filter step.

Galaxy Dataset | Filter by keywords or numerical value on Add expression data to your protein list on data 8

Header: Yes
Column number: c1

Step 3. Set source/type of input IDs: Uniprot Accession number

Step 4. Set species: Human

Step 5. GO categories representation analysis parameters: Yes

Level of ontology: 3

Step 6. GO categories enrichment analysis parameter: Yes

P-value: 0.01
Q-value: 0.05
Would you like to define your own background IDs: Yes
- Input file: the lung proteome we previously built.
- Header: Yes
- Column number: c7

Step 7. Set GO categories: select all three options

Click Execute button.

The History Panel now contains a new text output file and a new list of graphical outputs.

The suffix "GGO" (GroupGO) corresponds to the results of the "GO categories representation analysis" option (a gene/protein classification based on GO distribution at a specific level), while the suffix "EGO" (EnrichGO) corresponds to the results of the enrichment analysis (based on an over-representation test of GO terms against the lung reference background). Two types of graphical output are provided: bar-plot and dot-plot.
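To make the idea of an over-representation test concrete, here is a minimal sketch of the hypergeometric tail probability on which such tests are commonly based. The function name `enrichment_pvalue` and the toy counts are assumptions for illustration; the sketch deliberately leaves out the multiple-testing correction behind the P-value and Q-value cut-offs set above, and it is not the clusterProfiler implementation.

```python
from math import comb


def enrichment_pvalue(k, n, K, N):
    """P(X >= k) for a hypergeometric draw.

    N: proteins in the background (e.g. the lung reference proteome)
    K: background proteins annotated with the GO term
    n: proteins in the list of interest (e.g. the EBC proteome)
    k: proteins of the list annotated with the GO term
    """
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


# Toy counts (invented): background of 20 proteins, 5 carrying the term,
# list of 4 proteins, 2 of which carry the term.
p = enrichment_pvalue(k=2, n=4, K=5, N=20)
```

A small p-value means the term occurs in the list more often than expected when drawing proteins at random from the background, which is why the choice of background (lung rather than full human proteome) matters.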

According to this analysis, the main biological processes that were found over-represented in EBC compared to lung were numerous immune system processes and exocytosis (see EGO.BP.dot.png, for Enriched Biological Process GO terms dot-plot representation in png format). Below you can click on Go to dataset to view diagrams of BP category.

Galaxy Dataset | GGO.BP.png

Galaxy Dataset | EGO.BP.dot.png

Galaxy Dataset | EGO.BP.bar.png

Visualize EBC proteome on biological pathways (using Reactome)

The 141 proteins identified in EBC samples are now mapped to biological pathways and visualized via the web service of Reactome, an open access, manually curated and peer-reviewed human pathway database that aims to provide intuitive bioinformatics tools for the visualization, interpretation and analysis of pathway knowledge.

Step 1. From Tool Panel choose ProteoRE > Pathway Analysis > Reactome tool.

Step 2. Set input parameters:

Input file: the EBC proteome to be analyzed after the removal of technical and biological contaminants
Header: Yes
Column number: c1

Then click Execute button.

From History Panel, click the View data button of the new output to display the access to Reactome in the central panel. Click the Analyse button to open the Reactome analysis tools page via the web service and display the results. Browse the biological pathways in which EBC proteins are highlighted (e.g. immune system pathways) using the Reactome interface functionalities. Here you can click on Go to dataset > Analyse to open the web service page.

Galaxy Dataset | Reactome on data 12

Comparison with other proteomic datasets from previous studies

Our experimental design and the dataset produced (i.e. the list of 151 proteins identified in both pooled EBC samples including the 10 salivary proteins) were compared to the two most extensive EBC proteome maps previously described for healthy subjects (Mucilli et al., 2015 ; Bredberg et al., 2012). To do so, a Venn diagram showing the overlap between our dataset and the two previous EBC characterizations in healthy donors is drawn using the Jvenn tool from ProteoRE.

Step 1. From Tool Panel choose ProteoRE > Data Manipulation > Jvenn tool.

Step 2. Set first input (our EBC sample) parameters:

Input file: The EBC proteome to be analyzed after technical contaminants removal (BEFORE biological contaminants removal) - Filter by keywords or numerical value on Lacombe_et_al_2017.txt

Galaxy Dataset | Lacombe_et_al_2017.txt

Header: Yes
Column number: c1
Name of the list: Lacombe et al

Step 3. Set second input (sample from Bredberg's study) parameters:

Input file: Bredberg.txt
Header: No
Column number: c1
Name of the list: Bredberg et al

Step 4. Set third input (sample from Mucilli's study) parameters:

Input file: Mucilli.txt
Header: No
Column number: c1
Name of the list: Mucilli et al

Then click Execute button.

A text output and a graphical output are now created. From the Venn diagram, we can see the number of proteins that are common to, or unique to, each combination of lists (click on Go to dataset to view the Venn diagram).

Galaxy Dataset | Venn diagram text output
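The content of a three-list Venn diagram can be reproduced with plain set operations. The sketch below assigns every identifier to the exact combination of lists that contain it; the function `venn_regions` and the toy accessions are hypothetical and only illustrate the principle, not the Jvenn implementation.

```python
def venn_regions(named_sets):
    """Map each combination of list names to the items found in exactly those lists."""
    regions = {}
    all_items = set().union(*named_sets.values())
    for item in all_items:
        # Names are kept in the order in which the lists were supplied.
        members = tuple(name for name, items in named_sets.items() if item in items)
        regions.setdefault(members, set()).add(item)
    return regions


# Toy identifiers (placeholders, not real accession numbers).
lists = {
    "Lacombe et al": {"A1", "A2", "A3"},
    "Bredberg et al": {"A2", "A3", "B1"},
    "Mucilli et al": {"A3", "M1"},
}

regions = venn_regions(lists)
counts = {names: len(items) for names, items in regions.items()}
```

In this toy case `counts` holds one entry per non-empty region, for example the single identifier shared by all three lists and the identifiers unique to each study.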

----------------

This is the end of Tutorial #1. Thank you for following and completing this tutorial.

About this Page

Author

proteore
