Gene-set Cohesion Analysis Tool (GCAT)


What is GCAT?


GCAT utilizes Latent Semantic Indexing (LSI) of Medline abstracts to determine the functional coherence of gene sets. LSI was shown to be robust in identifying both explicit and implicit gene relationships (Homayouni et al. 2005). Here, an LSI model was built using over 1 million Medline abstracts for over 20,000 mouse and human genes annotated in Entrez Gene as of November 2010. Based on gene-to-gene LSI-derived similarities, a literature p-value (LPv) is estimated using Fisher's exact test by comparing the cohesion of the given gene set to that of a random gene set. The LPv calculated by GCAT therefore represents the significance of functional cohesion supported by the biomedical literature.
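
The gene-to-gene similarity mentioned above is a cosine score between gene vectors. The following is a minimal sketch of that calculation, not GCAT's actual code. The gene names and the three-factor vectors are hypothetical values chosen for illustration; a real LSI model would use many more factors.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an all-zero vector is treated as unrelated to everything
    return dot / (norm_u * norm_v)

# Hypothetical gene vectors in a 3-factor space (illustrative values only).
gene_vectors = {
    "GeneA": [0.8, 0.1, 0.3],
    "GeneB": [0.7, 0.2, 0.4],
    "GeneC": [0.0, 0.9, 0.1],
}

sim_ab = cosine_similarity(gene_vectors["GeneA"], gene_vectors["GeneB"])
sim_ac = cosine_similarity(gene_vectors["GeneA"], gene_vectors["GeneC"])
print(f"GeneA vs GeneB: {sim_ab:.3f}")
print(f"GeneA vs GeneC: {sim_ac:.3f}")
```

In this toy example GeneA and GeneB point in nearly the same direction and score well above 0.6, while GeneA and GeneC do not.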


What makes GCAT unique?


Since GCAT utilizes Latent Semantic Indexing (LSI), it can automatically determine both explicit and implicit (conceptual) relationships between genes from information in Medline abstracts. The ability to determine implied relationships is useful for interpreting genomic studies that identify new gene associations.

GCAT determines the inter-relationships between all genes in an experimental gene list. Unlike pathway-oriented gene set enrichment approaches, GCAT can determine the global cohesion of a gene list, taking into account cross-talk between genes in multiple pathways.


How do I use GCAT?


1.      Prepare your input gene list, one gene per line, and paste it into the text box labeled "Please enter your gene symbols or Entrez Gene IDs below". Example gene lists can be pasted into the box by clicking any of the four buttons above the text box. The example gene lists are taken from the following publication:

Homayouni R, Heinrich K, Wei L, Berry M. Gene clustering by Latent Semantic Indexing of MEDLINE abstracts. Bioinformatics, 21(1):104-15, 2005.

2.      Choose the organism from the drop-down menu. Currently, only Mouse and Human genes are supported.

3.      After selecting the organism, select the reference gene subset, which is needed for accurate calculation of the LPv. In addition to the human and mouse genomes, gene sets from the Affymetrix GeneChip Mouse Genome 430 2.0 and Human Genome U133 Plus 2.0 arrays are supported.

4.      Click the "submit" button to begin the calculation. Depending on the size of the gene list, this step may take seconds to minutes to execute. NOTE: if the LPv and the network graph are not displayed within 2 minutes, check your input list for strange formatting or symbols and resubmit.

5.      Common high frequency (log-entropy) terms can be obtained by clicking the "Display terms" button in the right hand window, below the text box labeled "Compute top 50 high frequency terms of the following genes".
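
Because strange formatting or symbols in the input list can prevent results from being displayed, it may help to clean the list before pasting it. The following is a minimal sketch of such a check. The function name `clean_gene_list` and the pattern of allowed characters are assumptions made for illustration and are not GCAT's own validation rules.

```python
import re

# Assumption: entries consist of letters, digits, dots, underscores, and
# hyphens. This pattern is illustrative and is not taken from GCAT.
VALID_ENTRY = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*$")

def clean_gene_list(raw_text):
    """Split pasted text into (clean, rejected) entries, one gene per line."""
    clean, rejected, seen = [], [], set()
    for line in raw_text.splitlines():
        entry = line.strip()
        if not entry:
            continue  # skip blank lines
        if not VALID_ENTRY.match(entry):
            rejected.append(entry)  # contains unexpected characters
            continue
        key = entry.upper()
        if key not in seen:  # drop case-insensitive duplicates
            seen.add(key)
            clean.append(entry)
    return clean, rejected

# Example input with stray whitespace, a blank line, a duplicate,
# and an entry with a trailing semicolon.
raw = "Reln\nDab1\n  Vldlr \n\nreln\nLrp8;\n12345\n"
clean, rejected = clean_gene_list(raw)
print("\n".join(clean))   # text ready to paste, one gene per line
print("Rejected:", rejected)
```

Rejected entries are reported rather than silently repaired, so that they can be corrected by hand before submission.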


How do I interpret the results?


GCAT returns:

1.      Number of input genes found or not found in the abstract collection. Genes not found may be due to differences in gene symbols/IDs between mouse and human, revision of gene symbols/IDs, or a lack of abstracts for the gene. If there are no abstracts tagged to the gene, then LSI cannot be used.

2.      Average number and range of abstracts per gene in the input gene list.

3.      Literature p-value (LPv), which evaluates the functional significance derived from the biomedical literature. A low LPv indicates high functional cohesion, whereas a high LPv indicates low cohesion. Please NOTE: Due to the nature of the algorithm used in GCAT, it is possible to obtain a high LPv if the gene set contains a large proportion of very well studied genes, such as p53 and TNFalpha.

4.      A gene network, constructed and visualized using the Cytoscape utility. The nodes in the network represent genes, and an edge between two nodes represents an LSI cosine score >0.6. A higher cosine score is represented by a darker shade of gray. Information about a gene can be obtained by clicking on its node. By viewing the network, you may find sub-clusters of highly related genes within the input gene list.

5.      By clicking the "Display terms" button, you can view the top log-entropy (high specificity) terms associated with the input genes. Genes that have at least one neighbor are included in the log-entropy term calculation. This is a good way to summarize the functional themes associated with the input gene list.

6.      The entire gene correlation matrix can be downloaded by clicking "Download Full Matrix". This matrix contains all of the pairwise cosine values between the input genes.  

7.      At the bottom of the page, a table is presented that contains the Gene ID, Gene Symbol, Gene Description, and Abstract number for all input genes. The Gene ID is hyperlinked to Entrez Gene for more information. The abstract number is hyperlinked to PubMed for viewing the abstracts used in the collection. Due to NCBI restrictions, only the top 500 abstracts are returned via this hyperlink.
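
The network and the downloadable matrix described above are related in a simple way: an edge is drawn wherever a pairwise cosine value exceeds 0.6. The following is a minimal sketch of how the edges, and the genes with at least one neighbor, could be derived from such a matrix. The function `build_network`, the gene names, and the matrix values are hypothetical, and the in-memory layout is an assumption rather than the format of the downloaded file.

```python
def build_network(genes, cosine, threshold=0.6):
    """Return (edges, connected) for a symmetric cosine matrix.

    edges is a list of (gene_i, gene_j, score) with score > threshold;
    connected is the set of genes that have at least one neighbor.
    """
    edges = []
    for i in range(len(genes)):
        for j in range(i + 1, len(genes)):  # upper triangle, no self-pairs
            score = cosine[i][j]
            if score > threshold:
                edges.append((genes[i], genes[j], score))
    connected = {gene for a, b, _ in edges for gene in (a, b)}
    return edges, connected

# Hypothetical 4-gene matrix (illustrative values only).
genes = ["GeneA", "GeneB", "GeneC", "GeneD"]
cosine = [
    [1.00, 0.82, 0.10, 0.65],
    [0.82, 1.00, 0.05, 0.30],
    [0.10, 0.05, 1.00, 0.20],
    [0.65, 0.30, 0.20, 1.00],
]

edges, connected = build_network(genes, cosine)
for a, b, score in edges:
    print(f"{a} -- {b} (cosine {score:.2f})")
print("Genes with at least one neighbor:", sorted(connected))
```

In this toy matrix GeneC has no score above 0.6 with any other gene, so it would appear as an isolated node and would be left out of the log-entropy term calculation.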


What biological questions can GCAT answer?


1.      The literature cohesion p-value calculated by GCAT enables researchers to compare feature selection methodologies for a given experiment based on biological support derived from the literature. By testing the cohesion of gene sets produced by different statistical thresholds or fold change thresholds, a researcher can narrow large gene sets down to more manageable sizes for further investigation.

2.      The network graph in GCAT may highlight sub-clusters of functionally related genes, gene hubs, and/or master regulators within a larger gene set.

3.      Common log-entropy terms derived from the gene abstracts can provide insights into biological functions of the gene set.
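
For readers unfamiliar with the term, the following sketch shows the standard log-entropy weighting scheme commonly used with LSI. Whether GCAT uses exactly this formulation is an assumption, and the terms and counts are hypothetical. Each term receives a local weight log(1 + tf) in each document, multiplied by a global weight that is close to 1 for terms concentrated in few documents and close to 0 for terms spread evenly across all documents.

```python
import math

def log_entropy_weights(counts):
    """counts[term] is a list of raw frequencies, one per document.

    Returns weights[term], a list of log-entropy weights per document.
    """
    weights = {}
    for term, freqs in counts.items():
        n_docs = len(freqs)
        total = sum(freqs)
        # Sum of p * log(p) over documents in which the term occurs.
        entropy_sum = 0.0
        for tf in freqs:
            if tf > 0:
                p = tf / total
                entropy_sum += p * math.log(p)
        if n_docs > 1 and total > 0:
            global_weight = 1.0 + entropy_sum / math.log(n_docs)
        else:
            global_weight = 1.0
        weights[term] = [global_weight * math.log(1 + tf) for tf in freqs]
    return weights

# Hypothetical counts for two terms across four documents.
term_counts = {
    "cell": [2, 2, 2, 2],    # evenly spread: low specificity
    "reelin": [4, 0, 0, 0],  # concentrated in one document: high specificity
}

weights = log_entropy_weights(term_counts)
for term, row in weights.items():
    print(term, [round(w, 3) for w in row])
```

The evenly spread term receives weights near zero, while the concentrated term keeps its full local weight, which is why high log-entropy terms tend to be the more specific ones.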


How is the literature p-value calculated?


GCAT aims to determine the significance of the functional cohesion of a given gene set compared to that expected by chance. To test the significance of cohesion, the observed number of gene relationships above a cosine threshold of 0.6 in the LSI model is used to calculate the LPv. This 0.6 cut-off was chosen based on examination of 1000 random gene sets (containing between 50 and 400 genes each) drawn from the gene-by-gene LSI similarity matrix. The distribution of the similarity scores, which ranged from 0 to 1 for the various sampled gene sets, shows that approximately 5% of scores (ranging from 5% to 5.8%) were above a cosine value of 0.6. Therefore, gene sets that have many cosine scores above 0.6 are considered functionally cohesive. The functional connectivity of an experimental gene set is compared to the connectivity expected by chance, and a literature p-value (LPv) is derived using Fisher's exact test.
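
The comparison described above can be sketched with a one-sided Fisher's exact test on a 2x2 table of gene pairs. The exact table GCAT constructs is not stated here, so the layout below (input-set pairs versus reference pairs, each split into above and not above the 0.6 threshold) is an assumption, as are the function names and all counts. The test is implemented with the standard library only, which is practical for small tables; large reference sets would call for a statistics library.

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric null with fixed margins.
    """
    row1 = a + b
    row2 = c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)
    x_max = min(row1, col1)
    numer = sum(comb(row1, x) * comb(row2, col1 - x)
                for x in range(a, x_max + 1))
    return numer / denom

def count_pairs(n_genes):
    """Number of distinct gene pairs in a set of n_genes."""
    return n_genes * (n_genes - 1) // 2

# Hypothetical input set: 20 genes, 40 of 190 pairs above 0.6.
n_pairs = count_pairs(20)
observed_above = 40

# Hypothetical reference: 2000 pairs, 100 (5%) above 0.6.
reference_pairs = 2000
reference_above = 100

lpv = fisher_exact_greater(
    observed_above, n_pairs - observed_above,
    reference_above, reference_pairs - reference_above,
)
print(f"LPv = {lpv:.3g}")
```

With about 21% of the input pairs above the threshold against roughly 5% in the reference, the resulting p-value is very small, which corresponds to a highly cohesive gene set in this illustration.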


How many genes and abstracts are in the collections?


                      2007    2010    Percent Change
Human Collections
Mouse Collections


How do I cite GCAT?




The GCAT project is hosted by the Bioinformatics Program at the University of Memphis. Please contact for more information.