PMCID: PMC5795011

 

    Legend: Gene, Sites, Sugar

Section : Defining glycan residues for Mascot database search

Content :
  1. A series of Y-type glycosidic fragment ions is required for confident characterization of glycan structures, and both collision-induced dissociation (CID) and higher-energy collisional dissociation (HCD) provide such a pattern at lower normalized collision energy (NCE) values
  2. A typical example of an HCD (NCE = 15) MS2 spectrum of a glycopeptide derived from bovine alpha-1-acid glycoprotein is presented in Fig. 1A
  3. Starting from the Y1 ion, the glycan sequence can be determined by following the mass differences between the most intense peaks, together with the mass difference between the Y10 ion and the precursor
  4. The complete glycan structure contains four HexNAc, five Hex and two Neu5Ac residues, which represent a di-sialylated biantennary N-glycan (Fig. 1A)
  5. If glycan residues are treated like amino acids, deducing the glycan structure from a glycopeptide MS2 spectrum is similar to peptide sequencing
  6. In order to use the Mascot search engine for automated glycopeptide analysis, the basic requirements include that (i) the sugar residues must be defined with unique one-letter codes in a Mascot readable format, (ii) each glycan structure must be defined in a linear format and (iii) a customized database must be prepared which consists of combined protein and glycan sequences
  7. Mascot uses the letters of the Latin alphabet as one-letter codes: 20 are assigned to the standard amino acid residues, and B, X and Z are hard-coded
  8. Of the remaining three letters (O, J and U), O was assigned to N-acetylhexosamine (GlcNAc, GalNAc), J to hexoses (galactose, mannose) and U to sialic acid (Table 1)
  9. Fucose was defined as a variable modification on N-acetylhexosamines (O)
  10. For Mascot, the three letters O, J and U must be defined in the unimod.xml file (Supplementary Fig. 1)
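The letter assignments above can be sketched as a small lookup table in Python. This is only an illustration and not the authors' unimod.xml definition: the monoisotopic residue masses are standard values, while the names `RESIDUE_MASS`, `FUCOSE_MASS`, `composition` and `glycan_mass` are introduced here purely for the example.

```python
from collections import Counter

# Monoisotopic residue masses (Da), i.e. monosaccharide mass minus water.
# The letters follow the assignment in the text; the table is illustrative.
RESIDUE_MASS = {
    "O": 203.07937,  # HexNAc (GlcNAc, GalNAc)
    "J": 162.05282,  # Hex (galactose, mannose)
    "U": 291.09542,  # Neu5Ac (sialic acid)
}
FUCOSE_MASS = 146.05791  # fucose, treated as a variable modification on O


def composition(linear_glycan):
    """Count the residues in a linear glycan string such as 'OJUUJOJJJOO'."""
    unknown = set(linear_glycan) - set(RESIDUE_MASS)
    if unknown:
        raise ValueError(f"unexpected residue letters: {sorted(unknown)}")
    return Counter(linear_glycan)


def glycan_mass(linear_glycan, n_fucose=0):
    """Mass the glycan adds to the peptide (sum of residue masses)."""
    counts = composition(linear_glycan)
    total = sum(RESIDUE_MASS[letter] * n for letter, n in counts.items())
    return total + n_fucose * FUCOSE_MASS


# The di-sialylated biantennary N-glycan written as a linear sequence.
print(dict(composition("OJUUJOJJJOO")))
print(round(glycan_mass("OJUUJOJJJOO"), 4))
```

Counting the letters of the linear sequence recovers the composition stated in the text: four HexNAc, five Hex and two Neu5Ac residues.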
*Output_Site_Fusion* (sent_index, protein, sugar, site):
Section : Linearized glycopeptide sequences and custom glycoprotein databases

Content :
  1. After manual annotation of various glycopeptide MS2 spectra, linear glycan sequences were defined based on the criteria that they should (i) cover as many of the intense peaks in the MS2 spectrum as possible and (ii) provide close to complete information about the glycopeptide sequence
  2. For the di-sialylated bi-antennary glycopeptide (Fig. 1A), the linear sequence OJUUJOJJJOO-peptide fulfills the criteria mentioned above (Fig. 1B)
  3. By attaching the glycan sequence at the peptide N-terminus, the Yn type glycosidic cleavage ions now become the peptide cleavage type y ions
  4. The three N-terminal residues (OJU) cover the three most intense oxonium ion peaks at 204.086 (HexNAc), 366.138 (HexNAc-Hex) and 657.233 (HexNAc-Hex-Neu5Ac) as b1, b2 and b3 ions
  5. The remaining linear sequence (UJOJJJOO-peptide) can be annotated to the intense peaks as yn to yn+7 ions (Fig. 1B)
  6. The spectrum now contains a series of eight y type and three b type intense ions
  7. All major glycan structures were converted to linear sequences following the same principles (Supplementary Table 1)
  8. The next step was to create a customized database in which both the protein and glycan sequences co-exist
  9. An in-house Python script was written for this purpose (Supplementary File)
  10. Briefly, following an in-silico digestion, the tryptic peptides containing NxT/S/C motifs (N-linked glycosylation) or serine/threonine residues (O-linked glycosylation) and the linear glycan sequences were combined (Supplementary Fig. 2)
  11. Unless otherwise described, the custom database used in the manuscript consists of 406 potential glycoproteins known to be glycosylated in serum (PeptideAtlas N-Glyco build 2010)
  12. After adding 21 unique linear sialylated glycan sequences (Supplementary Table 1), the database contained 41,727 potential glycopeptide sequences and a total of 1,195,485 residues
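The database construction described above (in-silico digestion, motif filtering, and prepending linear glycan sequences) can be sketched as follows. This is a minimal sketch and not the authors' script from the Supplementary File. The function names are hypothetical, the exclusion of proline in the middle position of the sequon is a common convention assumed here, and the length limits of 6 to 30 residues with zero missed cleavages are borrowed from the serum example given later in the text. Motifs that span a cleavage site are not handled.

```python
import re

# N-glycosylation sequon N-x-T/S/C. Excluding proline at x is an assumption;
# the text only states "NxT/S/C".
SEQUON = re.compile(r"N[^P][STC]")


def tryptic_digest(protein, min_len=6, max_len=30):
    """Cleave after K or R (not before P), with zero missed cleavages."""
    peptides = re.split(r"(?<=[KR])(?!P)", protein)
    return [p for p in peptides if min_len <= len(p) <= max_len]


def glycopeptide_entries(protein, linear_glycans, o_linked=False):
    """Prepend every linear glycan to every candidate glycopeptide."""
    entries = []
    for peptide in tryptic_digest(protein):
        if o_linked:
            # No consensus motif: any peptide with Ser or Thr is a candidate.
            candidate = "S" in peptide or "T" in peptide
        else:
            candidate = SEQUON.search(peptide) is not None
        if candidate:
            entries.extend(glycan + peptide for glycan in linear_glycans)
    return entries


# Toy protein built from the two A1AG1 peptides quoted later in the text,
# flanked by two made-up peptides that contain no sequon.
toy_protein = "AAAAAAK" + "QDQCIYNTTYLNVQR" + "ENGTISR" + "LLLLLLK"
for entry in glycopeptide_entries(toy_protein, ["OJUUJOJJJOO"]):
    print(entry)
```

Each resulting entry is an ordinary letter string, which is what allows a peptide search engine to treat the glycan residues as if they were N-terminal amino acids.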
*Output_Site_Fusion* (sent_index, protein, sugar, site):
Section : Identification of N- and O-linked glycopeptides by Mascot

Content :
  1. The feasibility of the Mascot search engine for automated annotation of both N- and O-linked glycopeptides was validated using two standard bovine glycoproteins (alpha-1-acid glycoprotein and fetuin)
  2. When searched against the custom glycoprotein database, the MS2 spectrum shown in Fig. 1A was annotated as a di-sialylated bi-antennary N-glycopeptide of alpha-1-acid glycoprotein, with a Mascot ion score of 24 (Fig. 2A)
  3. As theoretically expected, Mascot annotated the intense peaks to a series of y ions starting from y13 (peptide + HexNAc) until y20 (peptide + HexNAc(O)3 − Hex(J)4 − Neu5Ac(U)1).
  4. Together with the precursor mass, and the b1, b2 and b3 ions, the presence of additional HexNAc(O)1 − Hex(J)1 − Neu5Ac(U)1 residues was confirmed, thereby providing 100% sequence coverage of the glycan (Fig. 2A)
  5. However, no other information in the spectrum confirmed the peptide sequence except the precursor mass
  6. The lack of peptide fragmentation information in the MS2 spectrum might create difficulties in differentiating glycopeptide sequences resulting in similar Mascot ion scores
  7. However, the fragmentation of the glycopeptides can be fine-tuned by the NCE values used for HCD fragmentation
  8. As an example, the tryptic peptides of alpha-1-acid glycoprotein were fragmented at different NCE values
  9. At NCE values of 15 and 25, the di-sialylated bi-antennary glycopeptide MS2 spectra displayed glycosidic fragment ions (Fig. 2A,B)
  10. At an NCE value of 35, however, the MS2 spectrum of the same glycopeptide contained a series of peptide cleavage type y ions (y4 to y9) with almost no information about the glycan structure
  11. Hence, Mascot annotated mostly the peptide part of the glycopeptide sequence (Fig. 2C)
  12. Consequently, a single NCE value might not provide enough information about both the glycan and peptide sequence
  13. With the stepped NCE option of quadrupole-orbitrap mass spectrometers, the instrument can acquire fragmentation data of the precursors at multiple collision energies
  14. With this option, up to three different NCE values can be selected to generate a composite MS2 spectrum as shown in Fig. 2D, combining 15, 25 and 35 as NCE values
  15. This MS2 spectrum revealed nearly complete information about the glycan sequence, and the peptide y ions (y4, y5, y8 and y12) were detected as well
  16. Mascot unambiguously annotated this MS2 spectrum with an ion score of 34 (Fig. 2D)
  17. The feasibility of the Mascot search engine for the analysis of O-linked glycopeptides was validated by analyzing the mass spectrometry data of bovine fetuin against a custom O-glycoprotein sequence of fetuin
  18. Mascot annotated mono- and di-sialylated core-1 O-glycans on two different peptide sequences
  19. A series of y ions (y5 to y18) and b ions (b1, b2, b3) covering the most intense peaks (Fig. 3A,B) clearly confirmed that these MS2 spectra correspond to the O-linked glycopeptides
  20. These spectra were unambiguously annotated with an excellent Mascot ion score of more than 40
  21. Similar to N-linked glycopeptides, the b1, b2 and b3 ions at m/z values of 292.102 (Neu5Ac), 454.155 (Neu5Ac-Hex) and 657.233 (Neu5Ac-Hex-HexNAc) covered the low mass glycan fragment ions and provided an additional layer of confirmation for the O-glycopeptide spectra
  22. A similar fragmentation behavior was observed for two other O-glycopeptides of the same protein (Fig. 3C,D)
  23. To further demonstrate the feasibility of Mascot, analysis of bacterial O-glycosylation was performed on a purified PilE protein
  24. The PilE protein carries di-N-acetyl-bacillosamine (diNAcBac) and galactose based glycans with a potential acetylation on the galactose residue
  25. Mascot was able to annotate the diNAcBac (Supplementary Fig. 3A), diNAcBac-Gal (Supplementary Fig. 3B) residues as well as the monoacetylation (Supplementary Fig. 3C) and diacetylation (Supplementary Fig. 3D) on galactose residues
  26. For the complex O-glycosylation study, we re-analyzed the previously published mass spectrometry data from the immunoaffinity purified fractions and whole cell extract of a Neisseria gonorrhoeae strain
  27. The Mascot annotated glycopeptides were compared to the previously published data, where software assistance and manual data analysis were performed, and the majority (11/13 glycopeptides) of the previously confirmed O-glycopeptides were identified automatically (Supplementary Table 2)
  28. Taken together, these results clearly indicate the potential of the described approach, i.e. using stepped NCE values, a custom linearized glycoprotein database and the Mascot search engine for automated glycopeptide annotation
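The b and y ion annotations discussed in this section follow directly from the residue masses of the linearized glycan. The sketch below is an illustration under stated assumptions, not Mascot's own fragment calculation: the function names are hypothetical, the residue masses are standard monoisotopic values, and the linear O-glycan string "UJO" is inferred from the b ion order reported above (Neu5Ac, Neu5Ac-Hex, Neu5Ac-Hex-HexNAc). The y ions are expressed as mass offsets to be added to the bare peptide y ion, because the peptide mass is not given in the text.

```python
PROTON = 1.007276  # mass of a proton, Da

# Monoisotopic residue masses (Da) for the three glycan letters.
RESIDUE_MASS = {"O": 203.07937, "J": 162.05282, "U": 291.09542}


def b_ions(linear_glycan, n=3):
    """m/z of singly charged b1..bn ions from the N-terminal glycan residues."""
    ions, running = [], 0.0
    for letter in linear_glycan[:n]:
        running += RESIDUE_MASS[letter]
        ions.append(running + PROTON)
    return ions


def y_offsets(linear_glycan, n_b=3):
    """Mass added to the bare peptide y ion by 1, 2, ... glycan residues,
    counted from the peptide-attached end of the remaining sequence."""
    offsets, running = [], 0.0
    for letter in reversed(linear_glycan[n_b:]):
        running += RESIDUE_MASS[letter]
        offsets.append(running)
    return offsets


for name, glycan in [("N-glycan", "OJUUJOJJJOO"), ("core-1 O-glycan", "UJO")]:
    print(name, [round(mz, 3) for mz in b_ions(glycan)])
print([round(x, 3) for x in y_offsets("OJUUJOJJJOO")])
```

The computed b ions (204.087, 366.139, 657.235 for the N-glycan; 292.103, 454.156, 657.235 for the O-glycan) agree with the peak positions quoted in the text to within a few millidaltons, and the N-glycan yields eight y-type offsets, the largest of which corresponds to three HexNAc, four Hex and one Neu5Ac.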
*Output_Site_Fusion* (sent_index, protein, sugar, site):
Section : Identification and label-free quantification of serum N-glycoproteome

Content :
  1. The proposed procedure was validated by analyzing the N-linked sialylated glycoproteome of serum samples from healthy individuals (n = 12) and patients diagnosed with prostate cancer (n = 12)
  2. Tryptic peptides from serum samples were desalted using zwitterionic chromatography-hydrophilic interaction liquid chromatography solid phase extraction (ZIC-HILIC SPE) to enrich glycopeptides, followed by enrichment of sialylated glycopeptides with TiO2 beads (Fig. 4)
  3. The N-linked sialylated glycopeptides were analyzed by LC-MS using HCD with stepped NCE and the acquired MS2 spectra were submitted to the Mascot search engine for automated identification and relative quantification using Mascot Distiller (Fig. 4)
  4. The outcome of the described strategy for large scale automated glycopeptide analysis of LC-MS datasets is illustrated using serum alpha-1-acid glycoprotein 1 (A1AG1) as an example
  5. Considering zero missed cleavages, NxT/S/C motifs and a peptide length of 6–30 amino acids, A1AG1 potentially contained two N-glycosylation sites in the custom glycoprotein database (QDQCIYNTTYLNVQR, ENGTISR)
  6. A1AG1 was identified with a protein score of 1559 and 27% sequence coverage by the database search of 24 LC-MS runs
  7. Most of the sialylated N-glycans were identified on the sequence QDQCIYNTTYLNVQR
  8. Mascot annotated nine different mono-, di-, tri- and tetra-sialylated N-glycan structures on this glycosylation site (Fig. 5)
  9. Almost all the intense peaks in the MS2 spectra of mono-sialylated bi- (Fig. 5A), tri- (Fig. 5B) and tetra-antennary (Fig. 5C) glycopeptides were annotated by Mascot, confirming the presence of these glycan structures
  10. The peptide sequence was confirmed by annotation of y5, y6 and y8 ions
  11. MS2 spectra shown in Fig. 5D–F were annotated to di-sialylated bi-, tri- and tetra-antennary glycopeptides, confirmed by a complete series of y ions representing both peptide and glycan cleavages
  12. The same was found for tri- and tetra-sialylated glycan structures on the same peptide sequence (Fig. 5G–I)
  13. Nine different glycan structures with varied degrees of complexity and sialylation on the same glycosylation site, together with nearly complete information about both the peptide and the glycan part, proved the capability of the current approach for large scale automated glycopeptide analysis
  14. Some of the above sialylated glycopeptides were also identified with attached fucose residues
  15. As mentioned above, fucose was considered as a variable modification during the database search
  16. Though it is not possible to pinpoint the exact location of fucose residues, it can be determined whether the fucose is attached to the core HexNAc residue or to the HexNAc residues after the trimannosyl core glycan structure
  17. For example, the top scoring matches of tri-sialylated tri-antennary and di-sialylated tetra-antennary glycopeptides indicated a fucose residue after the core structure
  18. The absence of a peak at +146 Da following the peptide + HexNAc peak clearly indicated that the fucose residue is not attached to the core HexNAc (Supplementary Fig. 4A,B)
  19. As opposed to the above examples, Mascot annotated the fucose residue to the core HexNAc of a di-sialylated bi-antennary glycopeptide of alpha-2-macroglobulin
  20. The presence of a peak at +146 Da following the peptide + HexNAc peak clearly indicated that the fucose is attached to the core structure (Supplementary Fig. 4C)
  21. Therefore, it must be considered that the fucose is either attached to the core HexNAc or to HexNAc residues following the core glycan structure when determining the position of fucose residues in Mascot output
  22. In addition to fucose, other modifications such as sulfation and phosphorylation of HexNAc or Hex could also be considered as variable modifications if this is of interest
  23. However, using more variable modifications increases the search space and thus the uncertainty in some assignments
  24. Using this approach, a total of 257 glycoproteins were identified from the 24 serum samples (Supplementary Table 3)
  25. Within these 257 glycoproteins, a total of 970 unique glycosylation sites and 3447 non-redundant glycopeptide variants were identified (Supplementary Tables 4 and 5)
  26. Of these 3447 glycopeptide variants, the most abundant are the di-sialylated bi-antennary glycans with no (377), one (291) and two fucose residues (169)
  27. The next major glycopeptide variants included the di-sialylated tri-antennary and mono-sialylated di-antennary glycopeptide variants without and with fucose residues (Supplementary Table 5)
  28. The specific enrichment for di-sialylated bi-antennary glycans might indicate the abundance of these glycans in the serum proteins
  29. However, an effect of the enrichment protocol cannot be ruled out
  30. Label-free quantification of the glycopeptides (aggressive vs. indolent prostate cancer) was performed using the replicate quantitation protocol of Mascot Distiller
  31. The median protein ratios revealed no significant changes between aggressive and indolent samples and most of the protein ratios were within the range of 1.0 ± 0.5 (Supplementary Table 3)
  32. To find out any quantitative differences at the glycosylation level, the glycopeptides were segmented based on the glycan structures irrespective of the protein origin and the corresponding ratios were plotted as violin plots
  33. Figure 6 displays the glycopeptide ratios of the three most abundant glycan structures and most of them have peptide ratios near to 1.0, indicating no significant changes between the analyzed indolent and aggressive cancer samples
  34. Glycopeptide ratios of various other glycan structures which were identified in more than 10 different peptide sequences are presented in Supplementary Fig. 5
  35. Most of these structures also had peptide ratios around 1.0, with very few being higher or lower
  36. For example, the median glycopeptide ratio of the tri-sialylated tri-antennary glycopeptides is near 1.0 based on 77 values, whereas the mono- (73 values) and di-fucosylated (19 values) versions have a median peptide ratio slightly above 1.0
  37. The tri-sialylated tetra-antennary glycopeptides had two different populations at median peptide ratios of 1.0 and 1.5, whereas the fucosylated version had a median peptide ratio slightly above 1.0 (Supplementary Fig. 5)
  38. In summary, the presented data show the ease and feasibility of the proposed workflow for automated glycopeptide identification and quantification
  39. In addition to the database used in obtaining the above presented results, the LC-MS data sets of the 24 serum samples were also searched against differently sized custom glycoprotein databases created from (i) all known plasma/serum proteins from PeptideAtlas build 2010 (2421 glycoproteins), (ii) all deamidated proteins identified following PNGaseF treatment of glycopeptides from the same 24 serum samples (280 glycoproteins) and (iii) the Swiss-Prot annotated human proteome (14120 glycoproteins)
  40. Irrespective of the database, 68 glycoproteins were consistently identified with all four databases (Supplementary Fig. 6)
  41. There is a good level of agreement between the three plasma protein related databases because 121 glycoproteins were consistently identified
  42. The deamidated proteins (280 proteins) identified after the PNGaseF treatment potentially represent well the detectable glycoproteins present in the 24 serum samples
  43. When the results obtained with the database created from deamidated proteins identified following PNGaseF treatment were compared with those obtained with the database of plasma glycoproteins reported in PeptideAtlas, 163 of the 257 identified glycoproteins were common to both, representing a 63% overlap (Supplementary Fig. 6)
  44. This result clearly indicates the authenticity of the glycoproteins identified by the workflow presented in this study
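The reasoning used above to decide whether a fucose sits on the core HexNAc (a peak at +146 Da following the peptide + HexNAc peak) can be written as a simple peak check. This is a hypothetical helper for illustration, not part of Mascot: the function name, the tolerance of 0.02 m/z units and the peak lists are all assumptions, and the peak lists are synthetic.

```python
FUCOSE = 146.05791  # monoisotopic residue mass of fucose, Da


def has_core_fucose(peaks_mz, pep_hexnac_mz, charge=1, tol=0.02):
    """Return True if a peak lies one fucose above the peptide + HexNAc ion.

    peaks_mz      -- iterable of observed fragment m/z values
    pep_hexnac_mz -- m/z of the peptide + HexNAc ion at the given charge
    tol           -- absolute m/z tolerance (assumed value)
    """
    target = pep_hexnac_mz + FUCOSE / charge
    return any(abs(mz - target) <= tol for mz in peaks_mz)


# Synthetic peak lists around a made-up peptide + HexNAc ion.
y1 = 1500.700
with_core = [y1, y1 + 146.058, y1 + 203.079]
without_core = [y1, y1 + 203.079, y1 + 365.132]

print(has_core_fucose(with_core, y1))
print(has_core_fucose(without_core, y1))
```

A check of this kind only distinguishes core from non-core attachment; as the text notes, it cannot pinpoint which antenna HexNAc carries the fucose.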
*Output_Site_Fusion* (sent_index, protein, sugar, site):
Section : HCD-MS2 spectrum of a di-sialylated bi-antennary glycopeptide (m/z 1177.8137, 3+) derived from bovine alpha-1-acid glycoprotein 1

Content :
  1. The glycopeptide was fragmented at an NCE value of 15
  2. (A) Y denotes the peptide-bound glycosidic cleavage ions and the inset shows the corresponding peptide-bound glycan structures
  3. (B) Linearized glycan structures written with the corresponding three letters O (GlcNAc, GalNAc), J (galactose, mannose) and U (Neu5Ac) are depicted
  4. A linearized di-sialylated bi-antennary glycan structure attached to the N-terminus of a peptide, with annotated y and b type cleavage ions, corresponds to the MS2 spectrum
*Output_Site_Fusion* (sent_index, protein, sugar, site):
Section : Mascot annotated MS2 spectra of a di-sialylated bi-antennary N-glycopeptide (m/z 1177.8137, 3+) fragmented at different NCE values

Content :
  1. The MS2 spectra acquired at NCE values of (A) 15, (B) 25 and (C) 35, and (D) the composite MS2 spectrum of the same precursor fragmented using the stepped NCE values of 15, 25 and 35 are displayed
  2. Standard bovine alpha-1-acid glycoprotein 1 was digested with trypsin, the glycopeptides were analyzed by LC-MS and the data were searched against the custom glycoprotein database using the Mascot search engine
*Output_Site_Fusion* (sent_index, protein, sugar, site):
Section : Mascot annotated O-glycopeptide MS2 spectra of fetuin using stepped NCE values

Content :
  1. Bovine fetuin was digested with trypsin, analyzed by LC-MS using the stepped NCE function (15, 25 and 35) and searched against the custom O-glycoprotein database
  2. Mascot annotated mono- (A,C) and di-sialylated (B,D) core-1 O-linked glycopeptide spectra from two different peptide sequences
*Output_Site_Fusion* (sent_index, protein, sugar, site):
Section : Workflow for N-glycoproteome analyses of serum samples

Content :
  1. Briefly, the serum samples from control and patients diagnosed with prostate cancer were digested with trypsin and desalted with ZIC-HILIC SPE
  2. Next, the N-linked sialylated glycopeptides were enriched with TiO2 beads, followed by LC-MS analysis using a Q Exactive mass spectrometer applying stepped NCE for HCD fragmentation
  3. The intact glycopeptide mass spectra were submitted to the Mascot search engine for identification and relative quantification with Mascot Distiller
  4. The data was searched against a custom glycoprotein database prepared from 21 linear N-linked sialylated glycans and proteins (444) known to be glycosylated in serum (PeptideAtlas N-Glyco build 2010)
*Output_Site_Fusion* (sent_index, protein, sugar, site):
Section : Annotation of nine different glycan structures with varied degree of complexity and sialylation by Mascot on a single glycosylation site (Asn 93) of alpha-1-acid glycoprotein 1 in serum

Content :
  1. Shown here are the representative HCD MS2 spectra annotated by Mascot
  2. The nine different glycopeptide variants included the mono-sialylated bi- (A), tri- (B), tetra-antennary (C), and the di-sialylated bi- (D), tri- (E), tetra-antennary (F)
  3. Tri- (G,H) and tetra-sialylated (I) glycan structures on the same glycosylation site were also annotated by Mascot
*Output_Site_Fusion* (sent_index, protein, sugar, site):
  • 0. alpha-1-acid glycoprotein 1, -, Asn 93
Section : Violin plots representing the glycopeptide ratios (aggressive vs. indolent prostate cancer) of the three most frequent glycopeptide variants identified in the current study

Content :
  1. Glycopeptides identified and quantified in 24 serum samples were segmented based on the glycan structures irrespective of the protein origin
  2. The three most frequent glycopeptide variants were the mono-sialylated bi-antennary, di-sialylated bi-antennary without and with one fucose residue
  3. Mascot Distiller was used to calculate the XIC values and the corresponding ratios between aggressive (12) and indolent (12) samples
  4. Only glycopeptide precursors contributing at least 50% of the XIC peak area and passing the correlation threshold of 0.8 were considered
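The filtering and grouping described in this legend can be sketched as below. This is not Mascot Distiller's implementation. The record layout (`glycan`, `ratio`, `fraction`, `correlation`) and the function name are hypothetical, the example records are synthetic, and only the two thresholds (0.5 and 0.8) are taken from the text.

```python
from collections import defaultdict
from statistics import median


def median_ratio_by_glycan(records, min_fraction=0.5, min_correlation=0.8):
    """Group accepted glycopeptide ratios by glycan structure.

    Each record is a dict with the (hypothetical) keys 'glycan', 'ratio',
    'fraction' (share of the XIC peak area) and 'correlation'.
    """
    grouped = defaultdict(list)
    for rec in records:
        if rec["fraction"] >= min_fraction and rec["correlation"] >= min_correlation:
            grouped[rec["glycan"]].append(rec["ratio"])
    return {glycan: median(ratios) for glycan, ratios in grouped.items()}


# Synthetic records, for illustration only.
records = [
    {"glycan": "di-sialylated bi-antennary", "ratio": 1.0, "fraction": 0.9, "correlation": 0.95},
    {"glycan": "di-sialylated bi-antennary", "ratio": 1.2, "fraction": 0.6, "correlation": 0.85},
    {"glycan": "di-sialylated bi-antennary", "ratio": 5.0, "fraction": 0.3, "correlation": 0.99},
    {"glycan": "mono-sialylated bi-antennary", "ratio": 0.8, "fraction": 0.7, "correlation": 0.50},
]
print(median_ratio_by_glycan(records))
```

Grouping by glycan structure irrespective of protein origin, as done here, is what produces one distribution per glycan for the violin plots.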
*Output_Site_Fusion* (sent_index, protein, sugar, site):
Section : A large variety of informatics tools have been developed for automated glycopeptide analysis which advanced the glycoproteomics field

Content :
  1. However, recent reviews summarizing the glycoproteomics field in terms of available software tools suggested the need for a single software tool which could address the following concerns: (i) elucidation of both N- and O-linked glycopeptide spectra, (ii) matching glycopeptides to known protein sequences, (iii) scoring/ranking of potential glycopeptides, (iv) usage of product ion spectra, and (v) high-throughput and batch-wise analysis
  2. In this report, we addressed these concerns by using the widely applied Mascot search engine for automated glycopeptide analysis
  3. In principle, other protein search engines could be used as well, if additional letters can be defined for monosaccharides as described here
  4. The success of the software-assisted intact glycopeptide analysis also depends on the enrichment strategy and the information available in the MS2 spectra
  5. The enrichment strategy employed in this study worked well to enrich sialylated glycopeptides and the LC-MS data sets contained mainly glycopeptide spectra
  6. For any software tool, the MS2 spectra of intact glycopeptides should contain both peptide and glycan information in order to score and rank potential glycopeptide identifications and to match the glycopeptides to protein sequences
  7. A considerable amount of research has been performed for developing efficient fragmentation tools for glycopeptide analysis
  8. Unlike the collision based fragmentation techniques, the glycan structure remains relatively intact in electron transfer dissociation (ETD) spectra, thus providing information about the peptide sequence
  9. The combination of collision based (HCD/CID) and electron based (ETD/ECD) fragmentation techniques provides complementary information about the glycopeptide sequences
  10. Data-driven acquisition strategies, for example HCD product-dependent CID and ETD fragmentation, have also been shown to be effective in intact glycopeptide analysis
  11. The recently introduced electron transfer and higher-energy collision induced dissociation (EThcD) technique seems to work quite well for intact glycopeptide analysis
  12. However, with the used Q Exactive mass spectrometer, we could only use HCD
  13. Therefore, we showed the advantages of using stepped HCD while generating glycopeptide MS2 spectra
  14. HCD mass spectra at lower energies (Fig. 2A,B) are typically dominated by glycosidic fragment ions, whereas at higher energies the mass spectra (Fig. 2C) mainly contain peptide cleavage ions, thereby hampering successful mapping of both glycan and peptide moieties
  15. The HCD mass spectra using stepped NCE provided information both at the glycan and peptide level (Fig. 2D)
  16. A recent study also showed the same effect using low and high energy CID on a Q-TOF instrument for synthetic glycopeptides and standard glycoproteins
  17. A large number of the available software tools for glycopeptide annotation deal mainly with N-linked glycosylation
  18. Software tools that can automatically annotate both N-linked and O-linked glycopeptides are of great advantage
  19. For example, Mascot annotated a total of nine different mono-, di-, tri- and tetra-sialylated N-glycan structures on a single glycosylation site (Asn 93) of serum alpha-1-acid glycoprotein 1 (Fig. 5)
  20. Though the sialylated N-linked glycans were the main focus in this study, the presented Mascot approach can also identify other types of N-glycan structures (Supplementary Fig. 7)
  21. O-linked glycosylation, on the other hand, is more difficult to study due to the inherent lack of a consensus motif
  22. The obtained results using bovine fetuin documented that the Mascot search engine can indeed be used for O-linked glycopeptide analysis
  23. Mono- and di-sialylated core-1 O-linked glycans were annotated to two different sequences
  24. According to UniProt and some recent publications, the peptide sequence HTFSGVASVESSSGEAFHVGK carries only phosphorylation on serine residues (320, 323 and 325)
  25. However, the data presented here (Fig. 3C,D) clearly indicated the presence of mono- and di-sialylated O-linked glycans on this peptide sequence
  26. Due to the lack of a consensus glycosylation motif, every serine- or threonine-containing peptide must be considered as a potential glycopeptide while assembling the O-glycopeptide database, which makes large-scale O-glycoproteomics studies challenging
  27. The established approach was further validated by analyzing LC-MS data sets generated from 24 serum samples
  28. Mascot annotated a total of 257 glycoproteins containing 4653 redundant N-linked sialylated glycopeptide variants with an estimated false discovery rate (FDR) of 8%
  29. The FDR estimation for intact glycopeptide identifications is debatable and especially in case of glycopeptide identifications in relatively small numbers, the accurate estimation of FDR values is not possible
  30. Moreover, FDR control of both glycan and peptide identifications is challenging and based on the analytical workflow used, some customized strategies have been proposed
  31. Provided that fragmentation information for both the peptide and the glycan is available for all glycopeptides, the FDR tools provided in Mascot can be used with confidence
  32. Therefore, we considered hits positive only if they achieved a Mascot ion score of 25, were the top scoring match to a particular spectrum and passed a significance threshold of p < 0.001
  33. At this point, we suggest using more confident filters such as the significance threshold p-values/Mascot ion scores
  34. Moreover, it was even possible to extract the XIC values and quantitatively compare the glycopeptide identifications using Mascot Distiller
  35. The protein as well as the glycopeptide ratios indicated very little to no significant differences between indolent and aggressive serum prostate cancer samples
  36. Still, we showed here the possibility of high-throughput identification and relative quantification of intact glycopeptides using this large dataset of 24 LC-MS runs
  37. Due to the availability of well-established tools like Mascot Daemon, Mascot Distiller and Proteome Discoverer, relatively fast identification and comparison of multiple LC-MS glycopeptide data sets is possible
  38. Many of the available software tools for glycoproteomics lack this ability of high-throughput and batch-wise analysis of large datasets
  39. Despite the significant results obtained with this approach, some issues regarding intact glycopeptide analysis are yet to be solved and are worth discussing
  40. The majority of the glycopeptide identification strategies consider the glycan structures as monosaccharide compositions, whereas in our approach we defined the glycan structures as linear sequences that best represent their behavior in the glycopeptide MS2 spectra
  41. With any of these approaches, it is difficult to analyze glycan structures in detail, for example to specify linkage information or to differentiate glycan topologies
  42. Manual interpretation of the MS2 spectra, in particular spectra of the glycans alone, is probably the best way in such special cases
  43. With our approach, for example the presence or absence of fucose residues can be specified without prior knowledge
  44. Moreover, as shown in the results (Supplementary Fig. 4), no prior knowledge is required in defining the position of fucose residues, as Mascot automatically annotates the fucose residue to the core HexNAc or to HexNAc residues following the core glycan structure
  45. In terms of differentiating glycan topologies, if these topologies exhibit different fragmentation behavior, this could be specified in the linear glycan sequences, thereby enabling topology differentiation
  46. However, this should be experimentally verified and manual validation will still be required for confirmation
  47. The N- and O-glycan databases used in this study are relatively small
  48. Since the samples were specifically enriched for sialylated glycans, the N-glycan databases used in the study considered only sialylated glycans
  49. Using the total human proteome and glycome databases in preparing custom glycoprotein databases would of course have an impact on the quality of assignments
  50. For example, keeping a constant glycan database and using varying sizes of the proteome databases, the obtained results (Supplementary Fig. 6) indicated that the overlap was much higher between focused plasma protein databases, compared to the whole human proteome
  51. A recent study scrutinizing the frequently used glycopeptide identification software Byonic also indicated that the glycome size, proteome size and number of modifications can have a profound impact on the search outcome
  52. It is indeed a well-known observation, even with regular proteome search engines, that the database size and the number of variable modifications increase the search space exponentially, thus influencing the search outcome
  53. Considering the complexity involved in glycoproteomics, at this point we suggest using the custom glycoprotein databases that closely represent the samples used in the study
  54. Iterative search approaches, for example, provide an alternative way to overcome this limitation
  55. The data could be searched against the database containing only N-linked sialylated glycans for the first search
  56. The unannotated MS spectra can then be searched against another database of other glycans of potential interest and this could be iteratively repeated
  57. Though the stepped HCD function used in the study provided both peptide and glycan information, we observed that this is not universal and for some glycopeptide sequences, no peptide fragmentation was observed
  58. We believe that this also has an impact on the search outcome, when utilizing larger glycoproteome databases
  59. Fragmentation methods that generate glycan and peptide fragments, irrespective of glycopeptide sequences, will open up the possibility of using general proteome databases
  60. One specific limitation applicable for the described approach is that since the glycan compositions are added to the peptide N-terminus, the peptide b-ions present in the spectrum cannot be used
  61. The confirmation of the peptide sequence arises only from the y-ions, which are typically dominant in tryptic peptides using CID and HCD
  62. As mentioned above, several computational tools have been developed for automated identification of glycopeptides and the following reviews provide a detailed overview
  63. A large number of academically developed computational tools have shown potential for automated glycopeptide identification, for example GlyDB, GlyPID, GlycoFragWork, GlycoMaster DB, GlycoPeptideSearch, GlycoPep Detector, GlycoPep Evaluator, GlycoPep Grader, Integrated GlycoProteome Analyzer, MAGIC, pGlyco, Protein Prospector, SweetNET, Sweet-Heart and a few more
  64. Most academic tools are open-source; however, they are mostly designed for specific needs, and the majority are not continually maintained
  65. Moreover, they often lack an appropriate graphical interface, which makes them less user-friendly, and additional informatics assistance is often needed to use them
  66. Commercial tools, on the other hand, are designed to be user-friendly and are continuously developed and upgraded based on research demands
  67. SimGlycan, GlycoQuest and Byonic are among the commercially available glycopeptide identification tools
  68. SweetNET, a recently introduced bioinformatics workflow, uses an iterative process in which glycan-derived oxonium ions are used to filter the MS2 data for glycopeptides; the resulting set is then searched against protein databases to generate molecular networks for large-scale identification of intact glycopeptides
  69. For the database search of N-glycopeptides using Mascot, the glycan variable modification was defined as 5Hex + 4HexNAc attached to asparagine residues, and the loss of 5Hex + 4HexNAc or 5Hex + 3HexNAc from b- and y-ions including the N-glycosylation site was included
  70. Despite using the same search engine, the general concept of SweetNET is completely different from our approach
  71. Byonic is one of the most frequently used software packages for glycopeptide data analysis and has been used successfully in several glycoproteomics studies
  72. Byonic identifies glycopeptides at the level of peptide sequence and glycan composition by searching separate predefined or user-defined glycan and protein databases
  73. Glycan residues are specified as monosaccharide compositions and the potential glycopeptide candidates are scored by placing each glycan on the consensus N-glycosylation motifs
  74. In addition to the peptide/glycopeptide fragments, the presence of common oxonium ions and glycopeptide ions (Pep + HexNAc) is also considered while scoring the glycopeptides
  75. In our approach, unlike Byonic, glycan structures are first defined in a linear fashion that best represents their behavior in the MS2 spectra
  76. The linear glycans and the protein sequences are then curated into a single glycoprotein database
  77. Fragmented glycopeptides are searched against this database and scored based on the peptide-type b- and y-ions using the standard Mascot scoring algorithm
  78. Comparing the Byonic software with our approach yielded quite similar results
  79. Some of the recent large-scale glycoproteomics studies also reported the successful identification of thousands of glycopeptides
  80. However, one of the major advantages of using Mascot for automated glycopeptide analysis is its wide distribution and ease of use compared to many of the available software tools for glycoproteomics analysis
  81. As a computational tool, Mascot has been continuously maintained for two decades, is widely acclaimed and established in the proteomics community worldwide, and is easily adaptable for glycopeptide analysis as described here
  82. The changes needed to set up Mascot for glycopeptide analysis consist simply of defining the letters (O, J, U) in the unimod.xml file (Supplementary Fig. 1) and updating the Mascot server with the glycoprotein database
  83. The linear glycan sequences as well as the script to prepare a custom glycoprotein database are provided with this report, and only a single command needs to be run to prepare the database
  84. Thus, no specific informatics skills are required to establish this workflow, and a typical single LC-MS file, e.g. from serum, takes a couple of minutes to yield glycopeptide identifications
  85. In conclusion, we showed that Mascot, a widely accepted and widely used software, can easily be implemented for automated glycopeptide analysis
  86. Although it does not yet solve all the problems associated with glycoproteomics, this single tool allows (i) elucidation of both N- and O-linked glycopeptide spectra, (ii) matching glycopeptides to known protein sequences, (iii) scoring and ranking of potential glycopeptides, (iv) usage of product ion spectra, and (v) high-throughput and batch-wise analysis
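
The iterative search strategy outlined in sentences 54 to 56 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in the study: `iterative_search`, `search_fn`, the database names and the scan identifiers are all hypothetical, and `search_fn` stands in for whatever call submits one spectrum to a search engine such as Mascot and returns either a match or nothing.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple


def iterative_search(
    spectrum_ids: Sequence[str],
    databases: Sequence[str],
    search_fn: Callable[[str, str], Optional[str]],
) -> Tuple[Dict[str, Tuple[str, str]], List[str]]:
    """Search spectra against databases in order.

    Only spectra left unannotated by one database are passed on to the next.
    Returns (annotations, still_unannotated_spectrum_ids).
    """
    annotations: Dict[str, Tuple[str, str]] = {}
    remaining = list(spectrum_ids)
    for db_name in databases:
        unannotated = []
        for sid in remaining:
            hit = search_fn(sid, db_name)  # hypothetical search call
            if hit is None:
                unannotated.append(sid)
            else:
                annotations[sid] = (db_name, hit)
        remaining = unannotated
        if not remaining:
            break  # every spectrum has been annotated
    return annotations, remaining


# Toy stand-in for a real search engine: a lookup table of invented hits.
toy_hits = {
    ("scan1", "sialylated_N_glycans"): "hypothetical_match_A",
    ("scan2", "other_glycans"): "hypothetical_match_B",
}


def toy_search(spectrum_id: str, db_name: str) -> Optional[str]:
    return toy_hits.get((spectrum_id, db_name))


annotations, leftover = iterative_search(
    ["scan1", "scan2", "scan3"],
    ["sialylated_N_glycans", "other_glycans"],
    toy_search,
)
```

Because each round only sees the spectra that earlier rounds failed to annotate, every individual search runs against a small database, which is the point of the strategy described above. The order of the databases then decides which annotation a spectrum receives when more than one database could match it.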
*Output_Site_Fusion* (sent_index, protein, sugar, site):
  • 19. alpha-1-acid glycoprotein 1, -, Asn 93

 

 

Protein NCBI ID SENTENCE INDEX