Title : A large variety of informatics tools have been developed for automated
glycopeptide analysis which advanced the glycoproteomics field
Abstract :
- However, recent reviews summarizing the glycoproteomics field in terms of available software tools suggested the need for a single software tool that could address the following concerns: (i) elucidation of both N- and O-linked glycopeptide spectra, (ii) matching glycopeptides to known protein sequences, (iii) scoring/ranking of potential glycopeptides, (iv) usage of product ion spectra, and (v) high-throughput and batch-wise analysis
- In this report, we addressed these concerns by using the widely applied Mascot search engine for automated glycopeptide analysis
- In principle, other protein search engines could be used as well, if additional letters can be defined for monosaccharides as described here
- The success of the software-assisted intact glycopeptide analysis also depends on the enrichment strategy and the information available in the MS2 spectra
- The enrichment strategy employed in this study worked well to enrich sialylated glycopeptides and the LC-MS data sets contained mainly glycopeptide spectra
- For any software tool, the MS2 spectra of intact glycopeptides should contain both peptide and glycan information in order to score and rank potential glycopeptide identifications and to match the glycopeptides to protein sequences
- A considerable amount of research has been performed for developing efficient fragmentation tools for glycopeptide analysis
- Unlike in collision-based fragmentation techniques, the glycan structure remains relatively intact in electron transfer dissociation (ETD) spectra, thus providing information about the peptide sequence
- The combination of collision-based (HCD/CID) and electron-transfer-based (ETD/ECD) fragmentation techniques provides complementary information about the glycopeptide sequences
- Data-driven acquisition strategies, for example HCD-product-dependent CID and ETD fragmentation strategies, have also been shown to be effective in intact glycopeptide analysis
- The recently introduced electron transfer and higher-energy collision induced dissociation (EThcD) technique seems to work quite well for intact glycopeptide analysis
- However, with the Q Exactive mass spectrometer used here, we could only use HCD
- Therefore, we showed the advantages of using stepped HCD while generating glycopeptide MS2 spectra
- HCD mass spectra at lower energies (Fig. 2A,B) are typically dominated by glycosidic fragment ions, whereas at higher energies the mass spectra (Fig. 2C) mainly contained peptide cleavage ions, thereby hampering successful mapping of both glycan and peptide moieties
- The HCD mass spectra using stepped NCE provided information both at the glycan and peptide level (Fig. 2D)
- A recent study also showed the same effect using low and high energy CID on a Q-TOF instrument for synthetic glycopeptides and standard glycoproteins
- A large number of the available software tools for glycopeptide annotation deal mainly with N-linked glycosylation
- Software tools that can automatically annotate both N-linked and O-linked glycopeptides are of great advantage
- For example, Mascot annotated a total of nine different mono-, di-, tri- and tetra-sialylated N-glycan structures on a single glycosylation site (Asn 93) of serum alpha-1-acid glycoprotein 1 (Fig. 5)
- Though the sialylated N-linked glycans were the main focus of this study, the presented Mascot approach can of course identify other types of N-glycan structures (Supplementary Fig. 7)
- O-linked glycosylation, on the other hand, is more difficult to study due to the inherent lack of a consensus motif
- The obtained results using bovine fetuin documented that the Mascot search engine can indeed be used for O-linked glycopeptide analysis
- Mono- and di-sialylated core-1 O-linked glycans were annotated to two different sequences
- According to UniProt and some recent publications, the peptide sequence HTFSGVASVESSSGEAFHVGK carries only phosphorylation on serine residues (320, 323 and 325)
- However, the data presented here (Fig. 3C,D) clearly indicated the presence of mono- and di-sialylated O-linked glycans on this peptide sequence
- Due to the lack of a consensus glycosylation motif, while assembling the O-glycopeptide database, every serine and threonine peptide must be considered as a potential glycopeptide, thus challenging large-scale O-glycoproteomics studies
- The established approach was further validated by analyzing LC-MS data sets generated from 24 serum samples
- Mascot annotated a total of 257 glycoproteins containing 4653 redundant N-linked sialylated glycopeptide variants with an estimated false discovery rate (FDR) of 8%
- The FDR estimation for intact glycopeptide identifications is debatable, and especially in the case of glycopeptide identifications in relatively small numbers, accurate estimation of FDR values is not possible
- Moreover, FDR control of both glycan and peptide identifications is challenging, and depending on the analytical workflow used, some customized strategies have been proposed
- Provided that fragmentation information for both the peptide and the glycan is available for all the glycopeptides, the FDR tools provided in Mascot can be confidently used
- Therefore, we only considered positive hits if a Mascot ion score of 25, a top scoring match to a particular spectrum and a significance threshold p-value < 0.001 were achieved
- At this point, we suggest using more confident filters such as the significance threshold p-values/Mascot ion scores
- Moreover, it was even possible to extract the XIC values and quantitatively compare the glycopeptide identifications using Mascot Distiller
- The protein as well as the glycopeptide ratios indicated very little to no significant differences between indolent and aggressive serum prostate cancer samples
- Still, we showed here the possibility of high-throughput identification and relative quantification of intact glycopeptides using this large dataset of 24 LC-MS runs
- Due to the availability of well-established tools like Mascot Daemon, Mascot Distiller and Proteome Discoverer, relatively fast identification and comparison of multiple LC-MS glycopeptide data sets is possible
- Many of the available software tools for glycoproteomics lack this ability of high-throughput and batch-wise analysis of large datasets
- Despite the significant results obtained with this approach, some issues regarding intact glycopeptide analysis are yet to be solved and are worth discussing
- The majority of the glycopeptide identification strategies consider the glycan structures as monosaccharide compositions, whereas in our approach we defined the glycan structures as linear sequences that best represent their behavior in the glycopeptide MS2 spectra
- With any of these approaches, it is difficult to analyze glycan structures in detail, for example to specify linkage information or to differentiate glycan topologies
- Manual interpretation of the MS2 spectra, in particular spectra of the glycans alone, is probably the best way in such special cases
- With our approach, for example, the presence or absence of fucose residues can be specified without prior knowledge
- Moreover, as shown in the results (Supplementary Fig. 4), no prior knowledge is required in defining the position of fucose residues, as Mascot automatically annotates the fucose residue to the core HexNAc or HexNAc residues following the core glycan structure
- In terms of differentiating glycan topologies, if these topologies exhibit different fragmentation behavior, this could be specified in the linear glycan sequences, thereby enabling topology differentiation
- However, this should be experimentally verified and manual validation will still be required for confirmation
- The N- and O-glycan databases used in this study are relatively small
- Since the samples were specifically enriched for sialylated glycans, the N-glycan databases used in the study considered only sialylated glycans
- Using the total human proteome and glycome databases in preparing custom glycoprotein databases would of course have an impact on the quality of assignments
- For example, keeping a constant glycan database and using varying sizes of the proteome databases, the obtained results (Supplementary Fig. 6) indicated that the overlap was much higher between focused plasma protein databases, compared to the whole human proteome
- A recent study scrutinizing Byonic, a frequently used glycopeptide identification software, also indicated that the glycome size, proteome size and number of modifications can have a profound impact on the search outcome
- This indeed is a well-known observation, even with the regular proteome search engines, that the database size and number of variable modifications increase the search space exponentially, thus influencing the search outcome
- Considering the complexity involved in glycoproteomics, at this point we suggest using the custom glycoprotein databases that closely represent the samples used in the study
- Iterative search approaches, for example, provide an alternative opportunity to overcome this limitation
- The data could be searched against the database containing only N-linked sialylated glycans for the first search
- The unannotated MS spectra can then be searched against another database of other glycans of potential interest and this could be iteratively repeated
- Though the stepped HCD function used in the study provided both peptide and glycan information, we observed that this is not universal and for some glycopeptide sequences, no peptide fragmentation was observed
- We believe that this also has an impact on the search outcome when utilizing larger glycoproteome databases
- Fragmentation methods that generate glycan and peptide fragments, irrespective of glycopeptide sequences, will open up the possibility of using general proteome databases
- One specific limitation applicable for the described approach is that since the glycan compositions are added to the peptide N-terminus, the peptide b-ions present in the spectrum cannot be used
- The confirmation of the peptide sequence arises only from the y-ions, which are typically dominant in tryptic peptides using CID and HCD
- As mentioned above, several computational tools have been developed for automated identification of glycopeptides and the following reviews provide a detailed overview
- A large number of academically developed computational tools have shown potential for automated glycopeptide identification, for example GlyDB, GlyPID, GlycoFragWork, GlycoMaster DB, GlycoPeptideSearch, GlycoPep Detector, GlycoPep Evaluator, GlycoPep Grader, Integrated GlycoProteome Analyzer, MAGIC, pGlyco, Protein Prospector, SweetNET, Sweet-Heart and a few more
- Most of the academic tools are open-source; however, they are mostly designed for specific needs and the majority of the tools are not continually followed up
- Moreover, they very often lack an appropriate graphical interface, which makes them less user-friendly, and additional informatics assistance is often needed to utilize them
- Commercial tools, on the other hand, are designed to be user-friendly and are continuously developed further and upgraded based on research demands
- SimGlycan, GlycoQuest and Byonic are among the commercially available glycopeptide identification tools
- SweetNET, a recently introduced bioinformatics workflow, uses an iterative process in which glycan-derived oxonium ions are used to filter the MS2 data for glycopeptides; the resulting set is then searched against protein databases to generate molecular networks for large-scale intact glycopeptide identification
- For the database search of N-glycopeptides using Mascot, the glycan variable modification was defined as 5Hex + 4HexNAc attached to asparagine residues, and loss of 5Hex + 4HexNAc or 5Hex + 3HexNAc from b- and y-ions including the N-glycosylation site was included
- Despite using the same search engine, the general concept of SweetNET is completely different from our approach
- Byonic is one of the most frequently used software packages for glycopeptide data analysis and has been successfully applied in several different glycoproteomics studies
- Byonic identifies glycopeptides at the level of peptide sequence and glycan composition by searching the predefined or user-defined separate glycan and protein databases
- Glycan residues are specified as monosaccharide compositions and the potential glycopeptide candidates are scored by placing each glycan on the consensus N-glycosylation motifs
- In addition to the peptide/glycopeptide fragments, the presence of common oxonium ions and glycopeptide ions (Pep + HexNAc) is also considered while scoring the glycopeptides
- In our approach, unlike Byonic, glycan structures are first defined in a linear fashion that best represents their behavior in the MS2 spectra
- The linear glycans and the protein sequences are then curated into a single glycoprotein database
- Fragmented glycopeptides are searched against this database and scored based on the peptide type b- and y-ions using the standard Mascot scoring algorithm
- When comparing the Byonic software with our approach, quite similar results were obtained
- Some of the recent large-scale glycoproteomics studies also displayed the successful identification of thousands of glycopeptides
- However, one of the major advantages of using Mascot for automated glycopeptide analysis is its wide distribution and ease of use compared to many of the available software tools for glycoproteomics analysis
- Mascot as a computational tool has been continuously followed up for two decades, is widely acclaimed and established in the proteomics community across the world, and is easily adaptable for glycopeptide analysis as described here
- The necessary changes to establish Mascot for glycopeptide analysis are made simply by defining the letters (O, J, U) in the unimod.xml file (Supplementary Fig. 1) and updating the Mascot server with the glycoprotein database
- The linear glycan sequences as well as the script to prepare a custom glycoprotein database are presented along with this report, and only a single command needs to be run before the database is ready
- Thus, no specific informatics skills are required to establish this workflow, and a typical single LC-MS file from e.g. serum needs a couple of minutes until the glycopeptide identifications are obtained
- In conclusion, we showed that Mascot, a widely accepted and used software, could be easily implemented for automated glycopeptide analysis
- Though at this point it does not solve all the problems associated with glycoproteomics, this single tool collectively allows the (i) elucidation of both N- and O-linked glycopeptide spectra, (ii) matching glycopeptides to known protein sequences, (iii) scoring and ranking of potential glycopeptides, (iv) usage of product ion spectra, and (v) high-throughput and batch-wise analysis
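
The acceptance criteria stated in the abstract (a Mascot ion score of 25, a top scoring match to a particular spectrum, and a significance threshold p-value < 0.001) can be sketched as a simple filter. This is only an illustration under stated assumptions, not the authors' script: the `Match` record, its field names and the toy values are hypothetical, and the score of 25 is read here as a minimum (score >= 25), which the text does not state explicitly.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Match:
    """One peptide-spectrum match. All field names are hypothetical."""
    spectrum_id: str
    peptide: str
    ion_score: float
    rank: int          # 1 = top scoring match for this spectrum
    p_value: float


def is_confident(match: Match,
                 min_ion_score: float = 25.0,
                 max_p_value: float = 0.001) -> bool:
    """Apply the three criteria from the text.

    Assumption: the ion score of 25 is treated as an inclusive lower bound.
    """
    return (match.rank == 1
            and match.ion_score >= min_ion_score
            and match.p_value < max_p_value)


def filter_confident(matches: Iterable[Match]) -> List[Match]:
    """Keep only the matches that pass all three criteria."""
    return [m for m in matches if is_confident(m)]


# Toy input with made-up values, for illustration only.
example_matches = [
    Match("scan_1", "PEPTIDEA", ion_score=41.2, rank=1, p_value=0.0002),
    Match("scan_2", "PEPTIDEB", ion_score=18.0, rank=1, p_value=0.0005),  # score too low
    Match("scan_3", "PEPTIDEC", ion_score=33.5, rank=2, p_value=0.0001),  # not top ranked
    Match("scan_4", "PEPTIDED", ion_score=29.0, rank=1, p_value=0.0100),  # p-value too high
]

confident = filter_confident(example_matches)
```

Keeping the thresholds as function parameters makes it easy to tighten or relax the filter when database size or sample type changes.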
Output (sent_index, trigger,
protein,
sugar,
site):
- 0. glycopeptide, , -, -, glycopeptide
- 1. glycopeptide, , -, -, glycopeptide
- 1. glycopeptides, , -, -, glycopeptides
- 10. glycopeptide, , -, -, glycopeptide
- 11. glycopeptide, , -, -, glycopeptide
- 13. glycopeptide, , -, -, glycopeptide
- 16. glycopeptides, , -, -, glycopeptides
- 16. glycoproteins, , glycoproteins, -, -
- 17. glycopeptide, , -, -, glycopeptide
- 18. glycopeptides, , -, -, glycopeptides
- 19. glycoprotein, , alpha-1-acid glycoprotein 1, -, -
- 19. glycosylation, , alpha-1-acid glycoprotein 1, -, Asn 93
- 19. glycosylation, , alpha-1-acid glycoprotein 1, -, site
- 2. glycopeptide, , -, -, glycopeptide
- 22. glycopeptide, , -, -, glycopeptide
- 26. O-glycopeptide, , -, -, O-glycopeptide
- 26. glycopeptide, , -, -, glycopeptide
- 26. glycopeptide, , -, -, serine and threonine peptide
- 26. glycosylation, , -, -, motif
- 28. glycopeptide, , -, -, glycopeptide
- 28. glycoproteins, , glycoproteins, -, -
- 28. sialylated, , variants, -, -
- 29. glycopeptide, , -, -, glycopeptide
- 31. glycopeptides, , -, -, glycopeptides
- 34. glycopeptide, , -, -, glycopeptide
- 35. glycopeptide, , -, -, glycopeptide
- 36. glycopeptides, , -, -, glycopeptides
- 37. glycopeptide, , -, -, glycopeptide
- 39. glycopeptide, , -, -, glycopeptide
- 4. glycopeptide, , -, -, glycopeptide
- 40. glycopeptide, , -, -, glycopeptide
- 49. glycoprotein, , glycoprotein, -, -
- 5. glycopeptide, , -, -, glycopeptide
- 5. glycopeptides, , -, -, glycopeptides
- 5. sialylated, , -, -, glycopeptides
- 51. glycopeptide, , -, -, glycopeptide
- 53. glycoprotein, , glycoprotein, -, -
- 57. glycopeptide, , -, -, glycopeptide sequences
- 59. glycopeptide, , -, -, glycopeptide sequences
- 6. glycopeptide, , -, -, glycopeptide
- 6. glycopeptides, , -, -, glycopeptides
- 62. glycopeptides, , -, -, glycopeptides
- 63. glycopeptide, , -, -, glycopeptide
- 67. glycopeptide, , -, -, glycopeptide
- 68. glycopeptide, , -, -, glycopeptide
- 68. glycopeptides, , -, -, glycopeptides
- 69. N-glycopeptides, , -, -, N-glycopeptides
- 69. N-glycosylation, , -, -, site
- 7. glycopeptide, , -, -, glycopeptide
- 71. glycopeptide, , -, -, glycopeptide
- 72. glycopeptides, , -, -, glycopeptides
- 73. N-glycosylation, , -, -, motifs
- 73. glycopeptide, , -, -, glycopeptide
- 74. glycopeptide, , -, Pep + HexNAc, glycopeptide
- 74. glycopeptides, , -, -, glycopeptides
- 74. peptide/glycopeptide, , -, -, peptide/glycopeptide fragments
- 76. glycoprotein, , glycoprotein, -, -
- 77. glycopeptides, , -, -, glycopeptides
- 79. glycopeptides, , -, -, glycopeptides
- 80. glycopeptide, , -, -, glycopeptide
- 81. glycopeptide, , -, -, glycopeptide
- 82. glycopeptide, , -, -, glycopeptide
- 82. glycoprotein, , glycoprotein, -, -
- 83. glycoprotein, , glycoprotein, -, -
- 84. glycopeptide, , -, -, glycopeptide
- 85. glycopeptide, , -, -, glycopeptide
- 86. glycopeptide, , -, -, glycopeptide
- 86. glycopeptides, , -, -, glycopeptides
- 9. glycopeptide, , -, -, glycopeptide sequences
Output(Part-Of) (sent_index,
protein,
site):
- 19. alpha-1-acid glycoprotein 1, Asn 93
- 19. alpha-1-acid glycoprotein 1, site
*Output_Site_Fusion* (sent_index,
protein,
sugar,
site):
- 19. alpha-1-acid glycoprotein 1, -, Asn 93