Sex bias in HKG

Methods

Bioinformatics and statistical analysis were run in R software v.3.5.0 [1]. Figure 1 summarizes the workflow followed in this study.

Figure 1. Data-analysis workflow. This study consisted of seven main block-steps: 1. The collection of public microarray information located at GEO (Gene Expression Omnibus) database with Python and R. 2. Raw data pre-processing and probe annotation. 3. Statistical data analysis with three different statistics to get the gene expression variability in adipose tissue samples of Hsa and Mmu, considering the biological sex as a variable. 4. Meta-analysis by Rank Products. 5. Functional annotation with Gene Ontology (GO) terms. 6. GTEX-based gene expression filtering, to select potential reference genes suitable to compare both sexes in gene expression analyses. 7. Experimental validation by qPCR.

1. Systematic review and collection

1.1 Literature review

A comprehensive systematic review was conducted to identify all available transcriptomics studies with adipose tissue samples at GEO. The review considered the fields: sample source (adipose) type of study (expression profiling by array), and organism of interest (Homo sapiens or Mus musculus). The search was carried out during the first quarter of 2020, with the review period covering the years 2000-2019. From the returned records, the study GSE ID, the platform GPL ID, and the study type were extracted using the Python 3.0 library Beautiful Soup. The R package GEOmetadb [2] was then used to identify microarray platforms and samples from adipose tissue. The top 4 and 5 most used platforms in Hsa (Analysis Summary, Human Platform summary) and Mmu (Analysis Summary, Mouse Platform summary), respectively, were selected. Given the complex nature of some of the studies, those with information regarding the sex of samples were manually determined, and the keywords used to annotate them homogenized. Finally, studies not meeting the following predefined inclusion criteria were filtered out:

include at least 10 adipose tissue samples
use one of the selected microarray platforms to analyze gene expression data
present data in a standardized way
not include duplicate sample records (as superseries).

2. Data Processing and Statistical Analyses

The normalized microarray expression data of the selected studies from GEO were downloaded using the GEOQuery R package. All the probe sets of each platform were converted to gene symbols, averaging expression values of multiple probe sets targeting the same gene to the median value.

Three statistical stability indicators were calculated for each gene in each study to determine the relative expression variability: the coefficient of variation (CV), the IQR/median, and the MAD/median. The CV, computed as the standard deviation divided by the mean, is used to compare variation between genes with expression levels at different orders of magnitude; however, extreme values can affect this value. Therefore, the interquartile range (IQR) divided by the median and the median absolute deviation (MAD) divided by the median (two statistics based on the median) were also considered. These measures provide more robustness in skewed distributions [3]. Both statistics were multiplied by a correction factor of 0.75 and 1.4826 to make them comparable to the CV in normal distributions. Lastly, the gene variability scores per platform were expressed as the median of all statistics from the studies analyzed with each platform. These median values were ranked by gene variability value for each platform, with lower ranks corresponding to higher stability levels.

The described analysis pipeline was performed on three three different sample groups based on sex and species: female Hsa, male Hsa, and all Mmu samples. The analysis was not performed separately for male and female mice due to the lack of female Mmu samples.

3. Meta-analysis

The gene variability ranks per for each platform were integrated using the Rank Product (RP) method [4,5], a non-parametric statistic identifying the elements that systematically occupy higher positions in ranked lists. This approach combines gene ranks rather than variability scores to create platform independence. The RankProd package [6,7] was used to calculate the RP score for each gene (Equation 1, where i is the gene, K the number of platforms, and rankij the position of gene i in the ranking of platform j). Three final rankings were obtained (one for each sample group [Mmu, Hsa female, and Hsa male samples]) by sorting the genes in increasing order of RP.

4. Selection of Candidate HKGs

To encounter appropriate suHKG candidates, male and female Hsa samples were randomly selected, and the Mmu group was discarded. Gene functional information was then incorporated to exclude genes involved in metabolic alterations. The AnnotationDbi and org.Hs.eg.db annotation packages converted Gene Symbol to Gene name. After removing pseudogenes and non-coding genes, the associated GO terms of the remaining genes were obtained using the GO.db annotation package. Related information from all three gene ontologies were included (Biological Process, Molecular Function, Cellular component). Genes related to physiopathological conditions were filtered out, and a unique ranking by sex was generated (the male and female MetaRankings), which averages the three statistical rankings (Equation 2).

The difference in the ranking positions occupied in by males and females was also calculated to reveal sex-based stability differences at a gene level.

Selecting stable suHKG with high levels of expression, followed several steps:

Download the GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm.gct.gz file from GTEx,
Select the adipose tissue samples,
Take the gene median transcript per million (TPM) value in visceral adipose tissue.
Filter out from our sex-specific MetaRankings genes with median TPM < 20.
Select the genes in the top 10% positions of each MetaRanking.
Intersect the two top lists to find stable and highly expressed genes common to both sexes.

5. Experimental validation

5.1 Study Selection and Sample Processing

Subjects were recruited by the endocrinology and surgery departments at the University Hospital Joan XXIII (Tarragona, Spain) in accordance with the Helsinki declaration. Visceral and subcutaneous adipose tissue samples were obtained during surgery from lean and obese male and female individuals. Total RNA was extracted from adipose tissue using the RNeasy lipid tissue midi kit (Qiagen Science). One microgram of RNA was reverse transcribed with random primers using the reverse transcription system (Applied Biosystems)33. Mouse adipose tissue was obtained from wild type and Irs2-/- [8] (insulin resistance and type 2 diabetes model) C57BL/6 littermates. According to the criteria outlined in the "Guide for the Care and Use of Laboratory Animals," all animals received humane care22. Total RNA was extracted from abdominal fat using a combined protocol including Trizol (Sigma) and RNeasy Mini Kit (Qiagen) with DNaseI Digestion. First-strand synthesis was performed using EcoDry Premix (Takara).

5.2. Gene Expression Analysis

Quantitative gene expression analysis was performed on 50 ng cDNA template. Real time-PCR was conducted in a LightCycler 480 Instrument IIR (Roche) using SYBR PreMix ExTaqTM (mi RNaseH Plus, Takara). Primers used in this study are noted in Table S4. Crossing point (Cp) values were analyzed for stability between samples and relative quantification using 2^-ΔCt. Statistical analyses were performed with GraphPad Prism 8 (Graphpad Software V 8.0). The results are expressed as arithmetic mean ± the standard error of the mean (SEM). When two data sets were compared, a Student's t-test was used. The differences observed were considered significant when: p-value < 0.05 (*), p-value < 0.01 (**) and p-value < 0.001 (***).

References

R Core Team R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2019.
Zhu Y, Davis S, Stephens R, Meltzer PS, Chen Y. GEOmetadb: powerful alternative search engine for the Gene Expression Omnibus. Bioinformatics. 2008;24(23):2798-2800. doi:10.1093/bioinformatics/btn520 .
Arachchige C, Prendergast L, Staudte R. Robust analogs to the coefficient of variation. J Appl Stat. Published online August 20, 2020:1-23. doi:10.1080/02664763.2020.1808599.
Breitling R, Armengaud P, Amtmann A, Herzyk P. Rank products: a simple, yet powerful, new method to detect differentially regulated genes in replicated microarray experiments. FEBS Lett. 2004;573(1-3):83-92. doi:https://doi.org/10.1016/j.febslet.2004.07.055.
Mitchell L. A parallel implementation of the Rank Product method for R. :12.
Hong F, Breitling R, McEntee CW, Wittner BS, Nemhauser JL, Chory J. RankProd: a bioconductor package for detecting differentially expressed genes in meta-analysis. Bioinformatics. 2006;22(22):2825-2827. doi:10.1093/bioinformatics/btl476.
Del Carratore F, Jankevics A, Eisinga R, Heskes T, Hong F, Breitling R. RankProd 2.0: a refactored bioconductor package for detecting differentially expressed features in molecular profiling datasets. Bioinformatics. 2017;33(17):2774-2775. doi:10.1093/bioinformatics/btx292.
Withers DJ, Gutierrez JS, Towery H, et al. Disruption of IRS-2 causes type 2 diabetes in mice. Nature. 1998;391(6670):900-904. doi:10.1038/36116