Our large‐scale analysis mapped 3,055 O‐linked glycosylation sites from 1,060 glycoproteins in kidney tissues, T cells, and serum (Dataset EV6)
To compare the EXoO‐identified sites with those reported previously, 2,746 reported O‐GalNAc sites were collected from the O‐GalNAc human SimpleCell glycoproteome DB (Steentoft et al, 2011, 2013), PhosphoSitePlus (Hornbeck et al, 2015), and the UniProt database (UniProt Consortium, 2018)
Remarkably, EXoO identified 2,580 novel O‐linked glycosylation sites, an approximately 94% increase over the known sites, which, however, were mapped primarily using engineered cell lines
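The comparison with previously reported sites amounts to a set difference on (protein, position) pairs. The following is a minimal sketch of that idea, not the pipeline used in this study; the function names and the toy accessions and positions are hypothetical, and only the final arithmetic uses the counts reported above.

```python
def novel_sites(identified, reported):
    """Return identified (protein, position) pairs absent from the reported set."""
    return set(identified) - set(reported)


def percent_increase(n_novel, n_reported):
    """Novel sites expressed as a percentage of the previously reported sites."""
    return 100.0 * n_novel / n_reported


# Toy example with made-up accessions and positions (illustration only).
identified = {("P00001", 10), ("P00001", 15), ("P00002", 7)}
reported = {("P00001", 10), ("P00003", 99)}
print(sorted(novel_sites(identified, reported)))

# Counts taken from the text: 2,580 novel sites against 2,746 reported sites.
print(round(percent_increase(2580, 2746)))  # rounds to 94
```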
To determine the sample‐specific O‐linked glycoproteomes, the distribution of EXoO‐identified peptides across the different samples was examined
Kidney tissue and T cells had a large number of unique peptides compared with serum, and more than half of the peptides detected in serum were also identified in the tissue sample, possibly due to the presence of serum in tissue samples (Fig 2B)
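The overlap statement above can be read as the fraction of serum peptides that also occur in the tissue peptide set. A minimal sketch under that reading follows; the helper `shared_fraction` and the peptide strings are hypothetical and are not data from this study.

```python
def shared_fraction(query, reference):
    """Fraction of peptides in `query` that are also present in `reference`."""
    query, reference = set(query), set(reference)
    if not query:
        return 0.0
    return len(query & reference) / len(query)


# Made-up peptide sequences, for illustration only.
serum = {"PEPTIDEA", "PEPTIDEB", "PEPTIDEC"}
tissue = {"PEPTIDEA", "PEPTIDEB", "PEPTIDED", "PEPTIDEE"}

# Two of the three toy serum peptides are shared, i.e. more than half.
print(shared_fraction(serum, tissue) > 0.5)
```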
To visualize the relative abundance of peptides in different samples, the PSM numbers of peptides, which are suggestive of relative abundance, were clustered by unsupervised hierarchical clustering (Fig 2C)
This showed not only that the peptides differed between samples but also that their relative abundances diverged markedly (Fig 2C)
Interestingly, immunoglobulin heavy constant alpha 1 (IGHA1) had the highest PSM number in the normal tissue and serum but the second highest in the tumor tissue, where versican core protein (VCAN) had the highest PSM number, suggesting that both proteins are relatively abundant and readily detected and that O‐linked glycosylation of VCAN is aberrant in tumor tissue
In the case of IGHA1, four of the five known sites on Ser residues and two new sites on Thr residues were mapped, supporting EXoO's capacity both to localize known and to discover new O‐linked glycosylation sites
Overall, these data suggest that protein O‐linked glycosylation is highly dynamic and may exhibit a disease‐specific signature
To identify possible O‐linked glycosylation motifs, the amino acids at and surrounding (±7 amino acids) 3,042 of the sites mapped in this study were analyzed
O‐linked glycan addition at Thr and Ser accounted for 67.6 and 22.4% of the sites, respectively (Fig 2D)
Analysis of the surrounding sequence motifs revealed that Pro was overrepresented at the +3 and −1 positions irrespective of which amino acid (Thr or Ser) was glycosylated or of sample type (Fig 2D and Appendix Fig S2)
Overall enrichment of Pro was observed in the amino acids surrounding O‐linked glycosylation sites (Appendix Fig S2)
Thirteen O‐linked glycosylation sites were not used in the motif analysis because they were located close to the termini of the proteins concerned and consequently did not have enough surrounding amino acids to allow full motif analysis
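The windowing step described above, including the exclusion of sites too close to a terminus, can be sketched as follows. This is an illustration under stated assumptions rather than the software used in this study: the function names, the 1-based position convention, and the toy sequence are hypothetical; only the ±7 flank comes from the text.

```python
from collections import Counter

FLANK = 7  # residues on each side of the site, as stated in the text


def site_window(sequence, position, flank=FLANK):
    """Return the (2*flank + 1)-residue window centred on a 1-based site,
    or None if the site is too close to a terminus for a full window."""
    start = position - 1 - flank
    end = position + flank
    if start < 0 or end > len(sequence):
        return None
    return sequence[start:end]


def position_frequencies(windows, flank=FLANK):
    """Count residues at each offset (-flank..+flank) across all windows."""
    counts = {offset: Counter() for offset in range(-flank, flank + 1)}
    for window in windows:
        for i, residue in enumerate(window):
            counts[i - flank][residue] += 1
    return counts


# Made-up sequence with Thr at position 8 and Pro at offsets -1 and +3.
toy_sequence = "GGGGGGPTGGPGGGGG"
windows = [w for w in (site_window(toy_sequence, p) for p in (2, 8)) if w]
freqs = position_frequencies(windows)
print(windows, freqs[-1]["P"], freqs[3]["P"])
```

In this toy case the site at position 2 is dropped because it lacks seven upstream residues, which mirrors how excluding 13 of the 3,055 mapped sites leaves the 3,042 sites analyzed.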
Gene ontology (GO) analysis of the EXoO‐identified glycoproteins showed that the extracellular space, the cell surface, the ER lumen, and the Golgi membrane were the major cellular components for O‐linked glycoproteins (Fig 2E)
Analysis of biological process and molecular function suggested various activities and functionalities associated with O‐linked glycoproteins, consistent with their important role in different aspects of biology (Appendix Fig S3)
Specifically, extracellular matrix organization, cell adhesion, and platelet degranulation were the biological processes most represented in the glycoproteins identified (Appendix Fig S3), whereas heparin binding, calcium ion binding, and integrin binding were the top molecular functions identified (Appendix Fig S3)
To overview the positional distribution of the O‐linked glycosylation sites identified, the relative position of each site in its protein was determined and arranged relative to the N‐terminus of the glycoprotein in question (Fig 2F lower panel)
In addition, the frequency of sites at each relative position was calculated (Fig 2F upper panel)
The sites were distributed relatively evenly across the protein sequence but were less frequent at protein termini (Fig 2F upper panel)
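One way to compute a relative site position and its frequency distribution is sketched below. The normalization (site position divided by protein length) and the equal-width binning are assumptions made for illustration, since the text does not specify them; the function names and toy values are hypothetical.

```python
def relative_position(site, protein_length):
    """Site position (1-based) as a fraction of protein length, in (0, 1]."""
    return site / protein_length


def binned_frequency(rel_positions, n_bins=10):
    """Count sites in equal-width bins over the relative position range."""
    counts = [0] * n_bins
    for rel in rel_positions:
        index = min(int(rel * n_bins), n_bins - 1)  # rel == 1.0 goes in the last bin
        counts[index] += 1
    return counts


# Made-up (site, protein length) pairs, for illustration only.
toy_sites = [(50, 100), (10, 200), (200, 200), (75, 300)]
rels = [relative_position(site, length) for site, length in toy_sites]
print(binned_frequency(rels, n_bins=4))
```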
Strikingly, 20 proteins were seen to contain over 20 sites
The five proteins with the highest number of sites are enlarged for clearer visualization in Fig 2F middle panel
These heavily glycosylated proteins appeared to show continuous clusters of many vicinal sites that cover nearly the whole protein, as in VCAN, mucin‐1 (MUC1), and aggrecan core protein (ACAN)
Alternatively, the clusters of sites could be relatively short and evenly distributed, as seen in apolipoprotein(a) (LPA) and tenascin‐X (TNXB)
Among these heavily O‐linked glycoproteins, VCAN contained the highest number of sites, 165, with distinct peptide sequences surrounding the sites, whereas MUC1 contained the second highest number, 161, but these derived from only six distinct sequence repeats
ACAN, LPA, and TNXB were also heavily O‐linked glycosylated, with 82, 73, and 44 sites, respectively
Analysis of the site distribution on glycoproteins demonstrated the advantage of EXoO for studying heavily O‐linked glycoproteins, which are difficult to analyze by current analytical approaches owing to their structural complexity and resistance to enzymatic digestion
To determine the localization of the sites within protein structures, protein topological and structural annotations were retrieved from the UniProt database and mapped to the EXoO‐identified sites
Approximately 28.3 and 10.3% of the sites were predicted to localize in extracellular and luminal regions, respectively (Appendix Fig S4)
In contrast, only approximately 1.6% of the sites were predicted to lie in the cytoplasmic compartment (Appendix Fig S4)
Approximately 5% of the sites were associated with Ser/Thr/Pro‐rich regions, whereas the sites correlated more weakly with other protein structures including repeats, coiled coils, beta strands, helices, turns, and signal peptides (Appendix Fig S4)
Almost no correlation of the sites with intramembrane and transmembrane regions was observed
The structural correlation of the sites with extracellular, luminal, and Ser/Thr/Pro‐rich regions is consistent with the localization of O‐linked glycoproteins to the extracellular space, the cell surface, the ER, and the Golgi lumen, where they serve various functions
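Mapping sites to topological and structural annotations can be thought of as an interval lookup followed by a per-label percentage. The sketch below assumes annotations have already been retrieved and reduced to (start, end, label) tuples with 1-based inclusive coordinates; that representation, the function names, and the toy accessions are hypothetical and do not reflect the actual UniProt file format or the code used in this study.

```python
from collections import Counter


def annotate_site(position, regions):
    """Return labels of all annotated regions (start, end, label) containing the site."""
    return [label for start, end, label in regions if start <= position <= end]


def region_percentages(sites, regions_by_protein):
    """Percentage of all sites that fall in each region label."""
    if not sites:
        return {}
    counts = Counter()
    for protein, position in sites:
        for label in annotate_site(position, regions_by_protein.get(protein, [])):
            counts[label] += 1
    return {label: 100.0 * n / len(sites) for label, n in counts.items()}


# Made-up annotations and sites, for illustration only.
toy_regions = {
    "P00001": [
        (1, 100, "extracellular"),
        (101, 121, "transmembrane"),
        (122, 150, "cytoplasmic"),
    ]
}
toy_sites = [("P00001", 20), ("P00001", 60), ("P00001", 130), ("P00002", 5)]
print(region_percentages(toy_sites, toy_regions))
```

Sites on proteins without annotations (the last toy site here) still count toward the denominator, so the percentages need not sum to 100, as is the case for the figures reported above.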