eGIFT (Extracting Gene Information From Text) is designed for use by life scientists who are interested in rapidly finding information about a gene. eGIFT uses natural language processing techniques to retrieve iTerms (informative terms) relevant to a specific gene. We look at PubMed references (titles and abstracts), gather those references which focus on the given gene, and automatically identify terms which are statistically more likely to be relevant to this gene than to genes in general. In order to understand the relationship between a specific iTerm and the given gene, we allow the users to see all sentences mentioning the iTerm, as well as the abstracts from which these sentences were extracted.
The sections outlined below serve as a step-by-step guide to using eGIFT. Each section is accompanied by example screenshots, which can be clicked for a full-size view. The User Guide can be accessed from any page by clicking the "Page Guide" link on the top menu bar, which opens the section of the guide relevant to the current page.
The Home Page serves as an outline of the various functionalities of eGIFT. It briefly describes eGIFT, and shows the number of genes currently contained in the database.
1. To search for a gene, we can click on "Search for terms and documents relevant to a gene", which is written in the green circle, or we can simply click the "Gene Search" link in the top menu bar. Section 3 will describe this functionality in detail.
2. To find genes sharing a specific term, we can click on "Find genes for which a term is an iTerm", which is written in the first yellow circle, or click the "iTerm Search" link in the top menu bar. Section 8 will describe this functionality.
3. To request that a gene or a group of genes be added to the eGIFT database, we can click the second yellow circle, or click the "Add Gene" link from the bottom of the home page. This functionality will be described in detail in Section 10.
4. To perform analysis on a group of genes, we can click on "Find iTerms in documents for a group of genes" appearing in the third yellow circle, or click the "Gene Analysis" link from the top menu bar. Section 9 will describe this functionality.
5. Publications about eGIFT, including journal articles, conference papers, and posters can be found by clicking "Read articles about eGIFT" or by clicking the "Publications" link at the bottom of the home page.
6. Finally, this User Guide can be accessed in three different ways: by clicking "Learn how to use eGIFT", by clicking the "Page Guide" link from the top menu bar of every page, or by clicking the "User Guide" link at the bottom of the home page.
This page allows us to search for a gene in order to obtain the gene product information, relevant terms, and documents associated with it.
1. The Gene Search Page can be accessed by clicking the "Gene Search" link from the top menu bar.
2. To find the gene Groucho, for example, we can select the letter G to see the list of genes whose names start with G. Alternatively, we can type either Groucho or Gro in the "Gene name" field, or provide the EntrezGene or UniProtKB identifier for Groucho (any species-specific identifier for this gene will work).
3. Upon clicking the Search button, a page with all the results is displayed, similar to the one shown in the third screenshot above. First, the Official Name Matches are listed, then the Synonym Matches are shown, and then the Partial Matches of Official Names or Synonyms are displayed. Clicking on the gene name will take us to the Gene Page, which will be described in the next section. If an EntrezGene or UniProtKB identifier is searched for, the resulting page is automatically the Gene Page for that gene.
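The grouping of search results described in point 3 can be pictured with a short sketch. This is an illustration only, not eGIFT's implementation: the `Gene` record, the `search_genes` function, and the sample entries are hypothetical, and case-insensitive matching is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Gene:
    """Hypothetical gene record: one official name plus known synonyms."""
    official_name: str
    synonyms: list = field(default_factory=list)


def search_genes(query, genes):
    """Group matches as official-name, synonym, and partial matches."""
    q = query.lower()  # assumption: matching ignores case
    results = {"official": [], "synonym": [], "partial": []}
    for gene in genes:
        official = gene.official_name.lower()
        synonyms = [s.lower() for s in gene.synonyms]
        if official == q:
            results["official"].append(gene.official_name)
        elif q in synonyms:
            results["synonym"].append(gene.official_name)
        elif any(q in name for name in [official] + synonyms):
            results["partial"].append(gene.official_name)
    return results


# Illustrative entries only; not taken from the eGIFT database.
genes = [Gene("Groucho", ["Gro"]), Gene("GroEL")]
print(search_genes("Gro", genes))
```

Each gene is placed in the first group it qualifies for, so a gene whose synonym matches exactly is not repeated under partial matches.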
The Gene Page contains links to the gene's iTerms (informative terms) and to documents in which the gene appears.
a. By clicking "See iTerms for this gene", we are shown a list of terms that frequently co-occur with the gene in the literature. These terms, called iTerms, will be described in more detail in Section 5.
b. To see all the documents mentioning the gene names, we can click on the first link under "Documents".
c. Because some documents are more relevant to the gene than others, we also provide a link to see the documents that are central to the gene. We call this set of documents the About Set. Abstracts are placed in the About Set if they satisfy one of the following four criteria:
- one of the gene names appears in the title
- one of the gene names appears in the first sentence
- one of the gene names appears in the last sentence
- the abstract contains three or more mentions of the gene's names
d. We can also see documents containing a specific gene name or synonym, by clicking the gene name from the "Names found in the literature" section. The other names by which the gene is known, but which were not found in any documents, are listed under "Other names".
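The four About Set criteria from point c can be expressed as a small check. The sketch below is a hedged illustration, not eGIFT's code: the function name `in_about_set` is hypothetical, and both the whole-word, case-insensitive name matching and the naive sentence splitting are assumptions.

```python
import re


def in_about_set(title, abstract, gene_names):
    """Return True if the abstract meets any of the four About Set criteria."""
    if not gene_names:
        return False
    # Assumption: names are matched as whole words, ignoring case.
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(name) for name in gene_names) + r")\b",
        re.IGNORECASE,
    )
    # Assumption: sentences end with '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", abstract.strip()) if s]

    if pattern.search(title):                          # name in the title
        return True
    if sentences and pattern.search(sentences[0]):     # name in first sentence
        return True
    if sentences and pattern.search(sentences[-1]):    # name in last sentence
        return True
    return len(pattern.findall(abstract)) >= 3         # three or more mentions
```

An abstract that mentions the gene only once, in a middle sentence, fails all four tests and would appear only in the full document set.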
iTerms (short for informative terms) are terms that occur frequently with the gene in the literature. eGIFT ranks iTerms about the gene based on a score which compares the frequency of occurrence of a term in the gene's literature to its frequency of occurrence in documents about genes in general. For more information about this approach, please see the manuscript entitled "eGIFT: Mining Gene Information from the Literature".
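The idea of comparing a term's frequency in the gene's literature with its frequency in the general gene literature can be sketched as follows. The exact score is defined in the manuscript cited above; the plain difference of document-frequency proportions used here is an assumption chosen for illustration, and the function name and toy counts are hypothetical.

```python
def rank_terms(about_df, about_total, background_df, background_total):
    """Rank terms by how much more often they occur in the gene's literature.

    about_df / background_df map each term to the number of abstracts that
    contain it; the *_total arguments are the sizes of the two document sets.
    """
    scores = {}
    for term, df in about_df.items():
        about_freq = df / about_total
        background_freq = background_df.get(term, 0) / background_total
        # Assumed stand-in score: difference of the two proportions.
        scores[term] = about_freq - background_freq
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy counts invented for this example.
ranked = rank_terms(
    about_df={"corepressor": 40, "protein": 90, "cell": 50},
    about_total=100,
    background_df={"corepressor": 100, "protein": 8000, "cell": 6000},
    background_total=10000,
)
print([term for term, _ in ranked])
```

In this toy data, "protein" is common in the gene's abstracts but almost as common everywhere else, so it ranks below "corepressor", which is concentrated in the gene's literature.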
1. The iTerms are displayed in categories. For a list of the categories used in eGIFT, please see the "Categories of iTerms" sub-section shown below. We can also choose to see the iTerms uncategorized, or show only one category by following the "Select category" drop-down menu. Clicking on an iTerm takes us to a list of sentences containing the term. This will be explained in detail in Section 6.
2. By clicking on the triangle to the left of the iTerm, we can see additional details and statistics about the term:
- Textual variants (terms which represent the same concept)
- Bigrams (iTerm and other frequently occurring neighboring terms); some of these are also listed next to the iTerm
- Document Frequencies (number of abstracts in which the iTerm appears in the gene's literature).
3. We can also select a group of iTerms, by marking the checkbox to the left of them. Clicking the "See documents for selected iTerms" button will display a list of documents containing these iTerms. This documents page will be described in more detail in Section 7.
We group iTerms into different categories to allow users to home in quickly on the type of information they seek:
- Functions and Processes: GO biological processes and molecular functions; UniProtKB keywords of types biological process and molecular function; iTerms ending with -sion, -tion, -sis, -or, -er, -ment, which do not belong to other categories; GO-related terms taken from synonyms of GO terms
- Domains and Motifs: NCBI's Conserved Domains; iTerms co-occurring with or including words "domain(s)", "motif(s)", "repeat(s)", "tetrapeptide(s)"
- Pathways and Signaling: iTerms co-occurring with or including words "pathway(s)" or "signaling"
- Cellular Components: GO terms and UniProtKB keywords of type cellular component
- Pharmaceuticals: Terms from MeSH ontology, category D; DrugBank's approved list of drugs
- Diseases and Malignancies: Terms from MeSH ontology, category C
- Gene (Family) Names: Names collected from EntrezGene; iTerms co-occurring with or including words "gene(s)" or "protein(s)"
- Cells, Cell Types and Cell Lines: Terms from MeSH ontology, category A11; iTerms ending in "cell(s)" or "cyte(s)"
- Techniques and Treatments: Terms from MeSH ontology, category E
- Anatomical Parts: Terms from MeSH ontology, category A01
- Species Names/Taxons: NCBI Taxonomy, restricted to species found in EntrezGene knowledge base
- Terms containing this gene: iTerms containing the given gene's names and synonyms
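The word- and suffix-based rules in the list above can be sketched as a small rule table. This is a partial illustration under stated assumptions: the ontology lookups (GO, MeSH, UniProtKB, DrugBank, NCBI) are omitted, only the "including" half of "co-occurring with or including" is covered, and the order in which the rules are tried is assumed. The name `categorize` is hypothetical.

```python
import re

# Word-inclusion and ending rules taken from the category list above.
# Assumption: rules are tried in this order and the first match wins.
WORD_RULES = [
    ("Domains and Motifs",
     re.compile(r"\b(domains?|motifs?|repeats?|tetrapeptides?)\b")),
    ("Pathways and Signaling", re.compile(r"\b(pathways?|signaling)\b")),
    ("Cells, Cell Types and Cell Lines", re.compile(r"(cells?|cytes?)$")),
    ("Gene (Family) Names", re.compile(r"\b(genes?|proteins?)\b")),
]

# Suffixes that place a term in "Functions and Processes" when it
# does not belong to any other category.
FUNCTION_SUFFIXES = ("sion", "tion", "sis", "or", "er", "ment")


def categorize(iterm):
    """Return a category for an iTerm, or None if no pattern rule applies."""
    term = iterm.lower()
    for category, pattern in WORD_RULES:
        if pattern.search(term):
            return category
    if term.endswith(FUNCTION_SUFFIXES):
        return "Functions and Processes"
    return None


for example in ["wd40 domain", "Notch signaling", "calcification"]:
    print(example, "->", categorize(example))
```

The suffix rule is deliberately checked last, mirroring the condition that it applies only to iTerms that do not belong to other categories.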
We mark the iTerms which match a GO term or an NCBI Taxonomy species name:
Please note that by listing these symbols next to an iTerm, we do not suggest that the GO/UniProtKB keyword be annotated for that particular gene. We simply mark iTerms with these symbols to inform the users that this keyword concept co-occurs frequently with the gene in the literature.
Although iTerms may seem similar to keywords, they are in fact quite different. The technical definition of keywords in information retrieval implies they are terms used for indexing. It is in this sense that UniProtKB keywords are used in UniProt entries and GO terms in Gene Ontology.
This distinction between iTerms and keywords has important implications for the design and evaluation of eGIFT. Gene Ontology and UniProt follow specific guidelines for associating a term from their controlled vocabularies with a gene. Thus, if wd40 domain is associated with Groucho in one of these knowledge bases, a user would infer that Groucho contains this domain. However, eGIFT selects not only terms that can be associated with a gene in knowledge bases, but also terms that are highly relevant in some other way. One of the iTerms for Groucho is wd40 domain, but we also see wrpw motif which, although not part of Groucho, has a clear relevance to the understanding of Groucho: "This domain contains the wrpw motif that acts as a binding site for the transcriptional corepressor Groucho, which also localizes to the nuclear matrix" (PMID 11035023).
iTerms go beyond keywords, in that they inform users of a broader range of aspects of a gene. iTerms can be from categories other than those used in the manually curated knowledge bases. For example, iTerms can inform a user of relevant pathways, such as Notch signaling for Groucho. Even within the categories covered by knowledge bases, selected iTerms can cover concepts that are not part of the controlled vocabulary. Some examples are calcification for gene Bmp2 or tumorigenesis for Lmo2. Thus, there is no limitation for eGIFT: iTerms can be diseases, small molecules, phosphorylation sites, drugs, or any type that might be co-mentioned with a gene in text. Any of these could be meaningful to a user to meet their information needs.
1. Clicking on a gene's iTerm brings us to a list of sentences containing the iTerm. These sentences are extracted from the documents mentioning the given gene.
2. We can choose to see sentences for a given species if the abstract from which the sentence was extracted is associated with that species either through a MeSH term or because the species name appears in text. We can also go to read the entire abstract by clicking on the PMID below the sentence.
The Gene Documents Page can be accessed from the Gene Page (see Section 4, points b and c), or from the iTerms Page (see Section 5, point 3).
1. Documents containing the gene are listed in reverse chronological order, and the gene names are highlighted in the text. The PubMed entry for a particular document can be accessed by clicking the "see in PubMed" link, which is shown right next to the PMID of that document.
2. By default, eGIFT shows only 10 abstracts per page. This number can be increased by selecting a different number from the drop-down list. The documents can also be filtered by selecting a specific species from the drop-down list, or by selecting the type of documents, i.e. Full or About (for a description of the About Set, please see Section 4, point c). By clicking the "save PMIDs" link, we can retrieve a list of all the PMIDs, shown one per line to facilitate an easy copy-and-paste.
The iTerm Search Page can be accessed by clicking the "iTerm Search" link from the top menu bar, or by clicking the "Find genes for which a term is an iTerm" circle from the home page.
1. We can search for a specific term to retrieve the genes that have it as an iTerm (click on the "Get genes" button) or to retrieve other iTerms frequently co-occurring with it (click on the "Get iTerms" button). The search can be for an exact match, for an iTerm that contains the searched characters, or for an iTerm that starts with the searched characters. In the last two cases, we list the top 20 matching iTerms, ranked by the number of genes containing them.
2. By clicking the "Get genes" button for a specific iTerm, we are now provided with the list of genes containing this iTerm. The genes are listed alphabetically. To get to the gene page for any of these genes, we can click the "gene page" link. Likewise, to get to the documents containing a gene and the iTerm that was searched for, we can click on the "gene/iTerm documents" link. By clicking on the name of a gene, we are automatically taken to the iTerms Page for that gene.
3. By clicking on the "Get iTerms" button for a specific search, we are now provided with a word cloud of other iTerms frequently co-occurring with the specified iTerm. The default number of iTerms displayed in the cloud is 100, but this number can be changed before clicking on the "Get iTerms" button. If more than 200 iTerms are selected for display, the iTerms are listed based on the number of genes they have in common, as opposed to being displayed in a word cloud.
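The three search modes and the switch between word cloud and list can be sketched as follows. This is a hypothetical illustration: `search_iterms`, `display_mode`, the mode names, and the sample counts are not part of eGIFT, and case-insensitive matching is an assumption. The limits of 20 and 200 come from the description above.

```python
def search_iterms(query, gene_counts, mode="exact", limit=20):
    """Find iTerms matching the query.

    gene_counts maps each iTerm to the number of genes that have it.
    For 'contains' and 'starts_with', the top `limit` matches are
    returned, ordered by the number of genes containing them.
    """
    q = query.lower()  # assumption: matching ignores case
    if mode == "exact":
        return [term for term in gene_counts if term.lower() == q]
    if mode == "contains":
        matches = [term for term in gene_counts if q in term.lower()]
    elif mode == "starts_with":
        matches = [term for term in gene_counts if term.lower().startswith(q)]
    else:
        raise ValueError(f"unknown search mode: {mode}")
    matches.sort(key=lambda term: (-gene_counts[term], term))
    return matches[:limit]


def display_mode(requested, cloud_limit=200):
    """More than 200 requested iTerms are shown as a list, not a cloud."""
    return "list" if requested > cloud_limit else "word cloud"


# Sample counts invented for this example.
counts = {"notch signaling": 120, "signaling": 900,
          "signal peptide": 300, "wnt signaling": 250}
print(search_iterms("signal", counts, mode="starts_with"))
print(display_mode(100))
```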
The Gene Analysis Page can be accessed by clicking the "Gene Analysis" link on the top menu bar, or by clicking the "Find iTerms in documents for a group of genes" circle from the home page.
1. To perform an analysis on a selected set of genes, a user can provide the list of gene identifiers (EntrezGene ID, UniProtKB ID, or eGIFT ID) and choose the name of the database from the "Select identifiers" drop-down list. A maximum rank for these genes' iTerms can also be specified. The analysis can be retrieved by clicking the button "Find iTerms for these genes".
2. A table of iTerms and their information is displayed. These iTerms are ranked based on a score which looks at the number of genes provided by the user that have the term as an iTerm, the rank for each of these genes, and the number of all eGIFT genes that have the term as an iTerm. This score is displayed in the table, together with the category the iTerm belongs to, the number of genes containing the term, as well as the genes themselves and the rank of the iTerm for each of these genes. A user can select to see only a certain category of iTerms, by using the drop-down list entitled "Choose category for display". The resulting table can also be saved as a CSV file by clicking the button "See CSV file".
3. Sometimes we see certain genes that frequently appear together for some of the iTerms. To restrict the analysis to only a subset of the initial list of genes, we can select the genes corresponding to certain iTerms, as is shown in the third image above, and then click the button "Apply analysis for selected genes" to restart the analysis for the selected genes.
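The shape of the analysis table and its CSV export can be sketched as follows. The guide names the three ingredients of the score (how many of the supplied genes have the term, the term's rank for each of them, and how many eGIFT genes have the term overall) but not the formula, so the score below is an assumed stand-in that merely combines those ingredients. The function names and the sample data are hypothetical.

```python
import csv
import io
import math


def analyze(gene_iterms, genes_with_term, total_genes, max_rank=50):
    """Build analysis rows from gene -> {iTerm: rank}, best score first."""
    table = {}
    for gene, ranks in gene_iterms.items():
        for term, rank in ranks.items():
            if rank <= max_rank:  # apply the user's maximum rank
                table.setdefault(term, {})[gene] = rank
    rows = []
    for term, per_gene in table.items():
        # Stand-in score (assumption): better ranks among the supplied
        # genes and rarity across all genes both raise the score.
        rarity = math.log(total_genes / genes_with_term[term])
        score = sum(1.0 / rank for rank in per_gene.values()) * rarity
        rows.append((term, round(score, 3), len(per_gene), per_gene))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


def to_csv(rows):
    """Serialize the rows as CSV text, one iTerm per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["iTerm", "score", "gene count", "genes (rank)"])
    for term, score, count, per_gene in rows:
        genes = "; ".join(f"{g} ({r})" for g, r in sorted(per_gene.items()))
        writer.writerow([term, score, count, genes])
    return buffer.getvalue()


# Sample data invented for this example.
rows = analyze(
    {"GeneA": {"notch signaling": 2, "protein": 60},
     "GeneB": {"notch signaling": 5, "apoptosis": 1}},
    genes_with_term={"notch signaling": 100, "protein": 9000, "apoptosis": 1000},
    total_genes=10000,
)
print(to_csv(rows))
```

Restricting the analysis to a subset of genes, as in point 3, amounts to calling `analyze` again with only the selected genes in the first argument.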
eGIFT is continuously adding genes to the database. If a user is interested in a particular gene but cannot find it here, the user can request that the gene be added to the eGIFT database. The link for adding a gene is displayed at the bottom of the home page. Once the form is filled in and submitted, the requested gene(s) are automatically moved up in the list of genes that are added to eGIFT every day.
The feedback link is located on the top menu bar of any page. When clicking the feedback button, all the information relevant to the current page is sent together with the user's comment. So if a user would like to leave a comment about the iTerms Page for gene Groucho, for example, the user needs to click the "Feedback" link right from within this page and write the comment without having to explain which page they were on.