study_overview.knit

Outline of dataset selection

We performed an exhaustive search for human multiple sclerosis scRNA-seq and snRNA-seq datasets in different databases (Figure 1). After applying different exclusion criteria, three different datasets were ultimately included in the analysis: UCSC-MS, GSE144744c1 and GSE144744c3.

&nbsp

Figure 1. Systematic review workflow following PRISMA guidelines. Specification of the remaining studies number (n) through the identification, screening, eligibility and inclusion phases along with the corresponding exclusion justifications. PRISMA: Preferred Reporting Items for Systematic Reviews and Meta-Analyses.

&nbsp

Summary of selected datasets

Each dataset represents a subtype of multiple sclerosis and a specific region of the body. Further details can be consulted in the table below:

Methods

Literature review and selection of studies

We conducted a systematic review in April 2022 following the Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) guidelines (PMID: 19621072) to obtain a robust and standardised dataset search. Inclusion criteria were implemented by the keywords “multiple sclerosis”, “Homo sapiens”, “single cell” or “single nuclei” or “single nucleus” in the public databases GEO (PMID: 11752295), ArrayExpress (PMID: 31665515) and UCSC Cell Browser (PMID: 34244710). Web browser searching was also performed.

Then, the identified studies were manually filtered by the following exclusion criteria: i) data type: not containing scRNA-seq or snRNA-seq data, ii) disease examined: studies not based on MS, iii) experimental design: required MS patients as cases and healthy individuals as controls, iv) sex: sex variable not registered or not available, v) sample count: at least data from three different individuals at each condition and sex (control females, MS females, control males, MS males), and vi) unavailability of the gene expression matrix or metadata files. Finally, raw count matrices and sample metadata were downloaded for each of the selected studies.

Data acquisition, preprocessing and cell type annotation

The whole bioinformatic code was developed using the programming language R (version 4.1.2) (R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2020. https://www.R-project.org/). The version of the corresponding packages can be consulted in Supplementary Table 1 of the manuscript.

Gene expression matrix and metadata files were downloaded to develop the individual analyses. Firstly, we unified each set of data into a SingleCellExperiment object using the homonymous R package (PMID: 31792435). We also ensured the standardisation of the gene nomenclature, verifying that HUGO Gene Symbols were available in all datasets.

We performed afterwards a quality control analysis to incorporate solely data coming from viable cells and nuclei in the downstream processing. This step was adapted to suit the singularities of each dataset, maintaining homogeneity among studies as much as possible. The following parameters were calculated by sample: number of cells or nuclei, library size, number of expressed genes, and mitochondrial gene ratio. The graphical distribution of each parameter was represented before and after filtering to explore if some samples should be eliminated. Then, pertinent cutoffs were implemented to filter poor quality cells/nuclei. In detail, for UCSC-MS dataset we removed nuclei with less than 1000 library size, less than 500 gene expressed, anomalous values (outliers) for mitochondrial gene ratio identified with isOutlier function from scuttle R package (PMID: 28088763), or nuclei aggregates detected with scDblFinder R package (Germain P (2021). scDblFinder: scDblFinder. R package version 1.6.0, https://github.com/plger/scDblFinder). We also performed the quality control analysis at gene level. We removed genes expressed in less than 3 nuclei or genes other than protein-coding genes, microRNAs or lncRNAs. To identify these gene types we followed HGNC (HUGO Gene Nomenclature Committee) guidelines. On the other hand, the graphical distribution of parameters before filtering revealed that both GSE144744c1 and GSE144744c3 datasets were previously filtered by the authors (PMID: 33748804), corroborating that there were no outliers. Thus, we did not perform additional filtering steps.

Then, we normalised not discarded data by implementing a deconvolution approach (https://doi.org/10.1186/s13059-016-0947-7) and subsequently transformed the obtained values into a log2 scale. Next, we estimated for each gene the biological and technical variability considering batch effects reported in the metadata files. The 20% of the highest variable genes (HVG) based on biological heterogeneity were selected. We executed all these steps using scran R package functions (PMID: 27909575). Expression levels of HVG were used as input data to implement dimensional reductions approaches that further compact expression profiles. In detail, summarization was assessed by PCA (Principal Component Analysis) using scater R package (PMID: 28088763). The relevant principal components were selected applying the elbow point method with PCAtools R package (Blighe K, Lun A (2021). PCAtools: PCAtools: Everything Principal Components Analysis. R package version 2.4.0, https://github.com/kevinblighe/PCAtools). Moreover, the non-linear dimensionality reduction strategies tSNE (T-distributed Stochastic Neighbour Embedding) and UMAP (Uniform Manifold Approximation and Projection) were computed for visualisation with scater R package (PMID: 28088763).

Cells were classified in groups afterwards, based on similarity of the principal components previously selected. To this end, we constructed a graph built on the Shared Nearest Neighbours algorithm using scran R package (PMID: 27909575). We established groups as regions highly connected in the graph, considering each group as a different cell identity. Groups were defined with the walktrap algorithm with igraph R package (Csardi G, Nepusz T. The Igraph Software Package for Complex Network Research. InterJournal. 2005 Nov 30;Complex Systems:1695). We assigned the cell type to each cell and nucleus by contrasting its expression profile with public references. We used BRETIGEA R package (PMID: 29892006) for nervous tissue samples (UCSC-MS dataset) and SingleR package (PMID: 30643263) for PBMC samples (GSE144744c1 and GSE144744c3 datasets). For the latter, we selected the MonacoImmuneData reference from celldex R package (PMID: 30643263), which is a standard reference to analyse human PBMC samples. We discarded unassigned cells and neutrophils, basophils, progenitor cells, and T cells without classification in CD4+ or CD8+. Then, we screened for genes overexpressed in each cell type compared to the rest of the cell types. The statistical Wilcoxon test was implemented with scran R package (PMID: 27909575). Adjusted p-values were calculated by Benjamini-Hochberg (BH) method (PMID: 11682119) considering significant genes when False Discovery Rate (FDR) < 0.05. Finally, we corroborated that significantly overexpressed genes by cell type were marker genes described in scientific literature, increasing the robustness of the annotation.

Tested comparisons

We performed differential gene expression and several functional analyses to statistically identify differences in MS disease from a sex perspective (see following sections) testing three independent comparisons for each cell type analysed. Thoroughly, the comparison (MS.Female - Control.Female) assessed the impact of the disease in females (IDF), as it unveiled the differences among MS patients and healthy individuals being female. Likewise, the comparison (MS.Male - Control.Male) reported the impact of the disease in males (IDM), revealing the differences among MS patients and healthy individuals being male. Finally, SDID comparison ((MS.Female - Control.Female) - (MS.Male - Control.Male)) tested the sex-differential impact of the disease (SDID). With this comparison, where we identified the differences of IDF with respect to IDM, we discovered sex differences among MS patients without considering the inherent sex variability in healthy individuals.

Differential gene expression analysis

We carried out differential gene expression (DGE) analysis using MAST R package (PMID: 26653891). To this end, a hurdle model was implemented by zlm function. A logistic regression was considered to model if a gene would be expressed (discrete distribution), and a conditioned gaussian linear model to represent the level of expression (continuous distribution).

For every analysis, the model included a variable representing condition and sex conforming to four groups (control females, MS females, control males, MS males), and another variable containing the scaled number of genes (MAST recommendation). Moreover, we added additional covariables to minimise the effect of confounders in the gene expression patterns. For UCSC-MS comparisons we incorporated the following variables: sample, lesion state, age, affected cerebral region, capture batch effect, sequencing batch effect and cell cycle state. On another note, for GSE144744c1 and GSE144c3 datasets we reported the variables sample, age, previous treatments, batch effect and cell cycle state. The cell cycle state was a source of variability not provided in the corresponding metadata files in any dataset, so it was calculated using the cyclone function (PMID: 26142758) from scran R package (PMID: 27909575).

After modelling, we performed a likelihood-ratio test to calculate the statistical parameters for each dataset, cell type and comparison. We also determined the logFC values to know the magnitude of the differential expression (logFC absolute value), and in which direction of the comparison the genes were more expressed (logFC sign). We adjusted p-values by BH method (PMID: 11682119), considering statistical significance when FDR < 0.05 and logFC absolute value > 0.5.

Functional profiling

We performed protein-protein interaction (PPI) analyses using STRINGdb R package for each set of significant results from the DGE analyses (PMID: 30476243). Interaction networks were obtained keeping default parameters examining the total number of physical and functional interactions present in the database. For visualisation purposes, we represented weighted edges based on the confidence of the interaction (the greater the thickness, the greater the confidence) hiding disconnected proteins.

Additionally, we identified altered biological processes from the Gene Ontology (BP-GO) through a modular enrichment analysis implemented with the weight01 algorithm from topGO R package (PMID: 16606683). We worked with significant genes and BP-GO terms obtained from org.Hs.eg.db R package (Carlson, M. org.Hs.eg.db: Genome wide annotation for Human. R package version 3.8.2. (2019)) as input data. We calculated p.value and logFC statistics. The variety of graphical representations generated to display functional results were elaborated with the R packages simplifyEnrichment (PMID: 35680096), ComplexHeatmap (PMID: 27207943) and ggplot2 (ggplot2: Elegant Graphics for Data Analysis (3e), https://ggplot2-book.org/).

Signalling pathways analysis

We spotted statistical differences in signalling pathways activity using hipathia R package (PMID: 28042959). Signal transduction was calculated for every effector subpathway of each pathway from the Kyoto Encyclopedia of Genes and Genomes (KEGG) (PMID: 36300620). Differential effector activation (DEA) analysis was performed with MAST R package (PMID: 26653891), following the specifications described in the DGE analysis preceding section. Adjusted p-values were estimated by BH methodology (PMID: 11682119), determining statistical significance when FDR < 0.05. logFC was also calculated, to determine the direction and the magnitude of the change. We selected the suitable pathways to be analysed . The criteria was established based on the type of tissue the samples came from (selected pathways in Supplementary Table 2 of the manuscript).

Cell-cell communication analysis

Interactions between ligand-receptor pairs were inferred using the R package CellChat (PMID: 33597522). The CellChatDB.Human database was evaluated to screen a total of 1939 ligand-receptor interaction pairs, involving 546 different ligands and 507 different receptors. We computed the communication probability for each interaction between pairs of cell types in each group of interest (control females, MS females, control males and MS males), being these values a measure of the interaction strength. The direction of interactions was assessed by establishing the cell type that provides the ligand and/or the receptor. Interactions were considered significant when FDR < 0.05. We then determined cellular communication networks by counting the total number of interactions, and calculated the differential number of interactions among females (MS females vs. control females comparison) and males (MS males vs. control males comparison).