Data Science Seminars: Bioinformatics Focus

These seminars take place every Tuesday at 14:15 in the Alario seminar room (Building 21).
Researchers, students, and anyone interested in the proposed topics are more than welcome to attend!

VISMA: A computational workflow to quantify somatic mutagenesis around lentiviral integration sites at single-clone resolution

Francesco Gazzo – PhD student DSBlab @TIGET (E. Montini/M. Masseroli)

Lentiviral vectors (LVs) enable durable hematopoietic stem cell gene therapy (HSC-GT) by integrating therapeutic cassettes into the host genome. Because integration is semi-random, LV insertions can perturb nearby regulatory elements, and proliferative stress may further contribute to genetic instability, yet quantitative methods to connect clonal expansion with local mutagenesis are limited. We present VISMA (Vector Integration Site Mutation Analysis), a reproducible computational workflow that calls somatic variants in genomic regions flanking each integration site (IS), enabling clone-resolved mutation burden estimates across time and lineages. VISMA extends VISPA2-derived IS catalogs by preprocessing and artifact reduction (optical-duplicate removal, end trimming), IS-aware read assignment and aggregation across samples, SNV/indel calling with temporal “backtracing” to reconstruct occurrence and variant allele frequency, and stringent filtering to remove likely germline events and sequence-context artifacts. For downstream quantification, VISMA computes a Mutation Rate normalized by covered bases and introduces a Mutation Index that additionally normalizes across the number of IS and overall clonal representation, yielding an interpretable global metric. We applied VISMA to a longitudinal mouse HSC-GT model (WT vs Cdkn2a-/-) using either a genotoxic LV or a GT-like non-genotoxic LV. Across >200,000 unique IS and >9 Gb of flanking sequence, VISMA detected a significantly increased mutation burden in genotoxic LV groups, with the strongest effect in Cdkn2a-/- + genotoxic LV, consistent with synergy between vector genotoxicity and impaired oncogene surveillance. Whole-genome sequencing supported higher mutagenesis in genotoxic conditions, corroborating VISMA’s flanking-region analysis. VISMA provides a practical computational framework for assessing vector-associated mutagenesis and genotoxic risk in HSC gene therapy.
Moreover, we are now testing the workflow on our gene therapy patients, with promising results.
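
As a rough illustration of the covered-bases normalization mentioned in the abstract, the sketch below computes a mutation rate as filtered somatic variants divided by covered flanking bases, per IS and pooled over a sample. The record fields, function names and per-megabase scaling are assumptions made for illustration only; the abstract does not give VISMA's actual Mutation Rate or Mutation Index formulas, and the Mutation Index is deliberately not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class FlankingRegion:
    """Hypothetical summary of the region flanking one integration site (IS)."""
    is_id: str
    covered_bases: int     # bases with enough coverage to call variants
    somatic_variants: int  # SNVs/indels left after filtering


def mutation_rate(region, per_bases=1_000_000):
    """Variants per `per_bases` covered bases for one IS (assumed definition)."""
    if region.covered_bases <= 0:
        raise ValueError("region has no covered bases")
    return region.somatic_variants / region.covered_bases * per_bases


def pooled_mutation_rate(regions, per_bases=1_000_000):
    """Pool all IS in a sample: total variants over total covered bases."""
    total_bases = sum(r.covered_bases for r in regions)
    total_variants = sum(r.somatic_variants for r in regions)
    if total_bases == 0:
        raise ValueError("no covered bases in any region")
    return total_variants / total_bases * per_bases


# Made-up example values, not data from the study.
regions = [
    FlankingRegion("IS_1", covered_bases=50_000, somatic_variants=2),
    FlankingRegion("IS_2", covered_bases=150_000, somatic_variants=1),
]
print(mutation_rate(regions[0]))      # per-IS rate, variants per Mb
print(pooled_mutation_rate(regions))  # sample-level rate, variants per Mb
```

Pooling totals rather than averaging per-IS rates keeps poorly covered sites from dominating the estimate, which is one plausible reason to normalize by covered bases in the first place.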

A graph machine learning approach for multi-omics representation learning

Leonardo De Grandis – PhD student NECSTLab

Graphs are a powerful way to represent heterogeneous biological entities and their interactions. As a result, Graph Machine Learning (GML) is increasingly used to study biological networks. This talk covers the key steps of applying GML in a multi-omics context, from raw data processing to model development. Since data aggregation and harmonization remain bottlenecks for large-scale artificial intelligence applications, two methods for consistently collecting high-quality genomic variants and multi-omics data will be presented. Subsequently, two different approaches for multi-omics GML applications will be introduced: a tool for compound-protein interaction prediction leveraging graph matching networks and a framework for toxicity detection on RNA-seq data. The discussion will conclude with a graph-based model for mass spectrometry data annotation. This work integrates ideas from the previous approaches into a comprehensive multi-modal pipeline.

Multi-omics analysis of drug response for precision therapy

Keying Qiao – PhD student DSBlab

Cancer heterogeneity contributes to diverse therapeutic responses across patients and remains a major challenge in precision therapy. Because drug response is regulated by complex interactions across multiple molecular layers including the genome, transcriptome, proteome, and epigenome, single-omics analyses are often insufficient to comprehensively characterize these biological processes. In this seminar, I will present multi-omics approaches for drug-response analysis and precision treatment strategy discovery. First, we developed OncoMICS, a multi-omics analysis platform for cancer precision therapy. The platform stratifies samples based on pathological and molecular characteristics and associates transcriptomic, genomic, and proteomic data with drug sensitivity profiles to identify potential therapeutic targets and combination treatment strategies for specific cancer subgroups. Using KRAS-mutant non-small cell lung cancer as a case study, the platform identified several potential therapeutic targets and drug combinations, including the combination of Trametinib and Navitoclax. After predicting therapeutic strategies for stratified cancer subgroups, we further aimed to understand the underlying regulatory mechanisms of drug response. To achieve this, we constructed interpretable multi-omics drug-response networks through factor analysis–based integration of transcriptomic, genomic, proteomic, and epigenomic data using Multi-Omics Factor Analysis (MOFA). By identifying key factors associated with drug sensitivity and analyzing downstream network relationships, we explored the molecular mechanisms underlying drug response and further interpreted the rationale of the predicted combination therapies. Overall, this work combines multi-omics analysis, drug sensitivity association analysis, and network modeling to support drug-response interpretation and precision treatment strategy discovery.

Tracing disease dynamics through cell-free DNA tissue-of-origin analysis

Carlo Cipriani – PhD student DSBlab @TIGET (D. Cesana/M. Masseroli)

Non-invasive biomarkers constitute an emerging approach for monitoring human disease and evaluating treatment response. Among these, cell-free DNA (cfDNA) is especially promising, as it serves as a reservoir of genetic information released by dying cells across the body. As a result, cfDNA can provide a systemic snapshot of the patient’s physiological and pathological state. In clinical practice, cfDNA is already used to track tumor relapse through known cancer-associated mutations and to monitor organ rejection in transplantation. However, many diseases or treatment-related complications are not defined by specific genetic variants, making mutation-based approaches insufficient. In these contexts, cfDNA tissue of origin deconvolution offers a complementary strategy by estimating the tissue and cell-type contributions to the circulating cfDNA pool. In this talk, we will explore how cfDNA deconvolution can be used to study disease dynamics and patient trajectories over time, both before and after treatment. In the first part, we will discuss currently available deconvolution methods and how their performance can be improved for low-depth sequencing samples. In the second part, we will focus on how deconvolution results can provide biological insights into cancer-bearing patients, helping to reveal tissue damage, tumor-associated signals, and treatment-related changes at the whole-organism level.

Using routine healthcare data to predict future health

Catalina Vallejos Meneses – University of Edinburgh

Can we identify who will experience an adverse health event (e.g. disease onset) weeks, months or even years before it happens? Questions like this are at the core of health data science research and have been empowered by the increasing ability to securely access routinely collected electronic health records (EHR). A key exemplar in Scotland is SPARRAv4 (Scottish Patients at Risk of Readmission and Admission version 4), a population-wide model that will soon be deployed to support anticipatory care planning. I will discuss some of the practical and methodological challenges that arise in the development and evaluation of such models, focusing on time-to-event outcomes. I will introduce the “C-index multiverse”, highlighting how different conceptual and implementation choices can affect model comparison and hinder reproducibility. Finally, I will introduce landmaRk as a flexible tool to perform dynamic risk prediction in the presence of latent population heterogeneity.
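
To make the point about implementation choices concrete, here is a minimal sketch of Harrell's concordance index for right-censored data in which the credit given to tied risk scores is a parameter. Tie handling is used only as one example of such a choice (an assumption; the abstract does not list the choices the talk covers), and the function name and toy data are hypothetical.

```python
def harrell_c_index(times, events, risk_scores, tied_risk_credit=0.5):
    """Harrell's concordance index for right-censored data.

    times: observed follow-up times
    events: 1 if the event was observed, 0 if censored
    risk_scores: higher score means higher predicted risk
    tied_risk_credit: credit given to comparable pairs with equal scores
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i has an observed event
            # strictly before subject j's follow-up time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += tied_risk_credit
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Made-up toy data: the third subject is censored at time 6.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
scores = [0.9, 0.4, 0.4, 0.1]
print(harrell_c_index(times, events, scores))                        # ties get half credit
print(harrell_c_index(times, events, scores, tied_risk_credit=0.0))  # ties get no credit
```

On the same predictions the two calls return different values, so two papers reporting "the C-index" are only comparable if they made the same choice.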

Leveraging multi-omics to improve prediction and prevention of cardiometabolic diseases

Scott Ritchie – University of Cambridge

Efforts to reduce cardiometabolic diseases focus on controlling major modifiable risk factors through targeted intervention in people identified at high risk. However, our understanding of their aetiology remains incomplete, hindering our ability to both predict and prevent these diseases. To address this, we leverage multi-omics data in population cohorts and biobanks to identify new potentially modifiable molecular targets as well as to assess evidence for the potential for multi-omics to enhance existing clinical risk prediction tools.

Machine learning analysis of 2D grid data for investigating retinal inner plexiform layer changes in multiple sclerosis

Linda Maldera – PhD student @MIND Lab (E. Salvi/M. Masseroli)

Multiple Sclerosis (MS) is a demyelinating disease of the central nervous system, classified into three distinct phenotypes: relapsing-remitting (RRMS), secondary progressive (SPMS) and primary progressive (PPMS). SPMS is diagnosed in more than half of RRMS patients, when subtle disease progression independent of relapses occurs. Early identification of SPMS in RRMS patients is still a challenge, as no reliable biomarkers are available in the early stages of disease progression. Retinal neural tissue provides a unique window on MS progression: recent studies have shown that the inner plexiform layer (IPL) is thinner in people with MS than in healthy controls, and that thinning of the IPL characterizes patients with RRMS who progressed to SPMS. Data on IPL thickness can be easily obtained through optical coherence tomography (OCT); however, there are no available tools or pipelines to analyze OCT data longitudinally. In this seminar we will illustrate the RePlayMS study, from conception to execution, together with its preliminary results, and discuss open points and as yet unanswered methodological questions.

From biological traces to probabilistic evidence: workflow and statistical interpretation in forensic DNA analysis

Riccardo Pizzichemi – PhD student DSBlab @TIGET (E. Montini/M. Masseroli)

Forensic genetics uses DNA profiles to identify individuals in judicial investigations, providing evidence that can link biological traces to potential contributors. Modern forensic DNA analysis is primarily based on the examination of short tandem repeats (STR), highly polymorphic loci that allow discrimination between individuals. While the laboratory workflow that generates STR profiles is well established, the interpretation of forensic DNA evidence often presents substantial analytical and statistical challenges, particularly when dealing with degraded samples, low-template DNA, or mixtures of genetic material from multiple individuals. In this seminar, we first introduce the general workflow of forensic DNA analysis, from biological trace collection and DNA extraction to STR amplification and capillary electrophoresis, which produces the electropherogram used for allele identification. As a case study, we discuss the investigation of the Yara Gambirasio homicide in Italy, one of the most prominent forensic DNA cases in Europe. The analysis and interpretation of the DNA profile known as “Ignoto 1” played a central role in the investigation.

3D Convolutional Neural Network for assessment of skin biopsy innervation

Simone Callegarin – CSE thesis @MIND Lab (M. Masseroli/S. Tomè)

Small Fibre Neuropathy (SFN) is a neurological disorder involving small somatic and autonomic nerve fibres, leading to chronic neuropathic pain and reduced quality of life. Since these fibres cannot be assessed by routine nerve conduction studies, diagnosis relies on skin biopsy with manual quantification of intraepidermal nerve fibre density (IENFD). Although considered the reference standard, this method is time-consuming and operator-dependent, highlighting the need for more efficient and objective approaches. In this context, deep learning techniques, particularly Convolutional Neural Networks (CNNs), offer a promising solution for automated and reproducible analysis. This work presents a fully automated framework for IENFD quantification from immunofluorescence skin biopsy images using three-dimensional CNNs (3D CNNs). In collaboration with the Fondazione I.R.C.C.S. Istituto Neurologico Carlo Besta, a dedicated dataset was created from 60 biopsies collected from 20 patients. A comprehensive preprocessing pipeline was developed to cover the entire workflow, from raw volumetric data to patient-level diagnostic inference. It was designed to standardize and facilitate the manual annotation process through orientation correction, denoising, and signal enhancement, and also to optimize model performance by reducing irrelevant variability and refining the input data for deep learning training. A 3D CNN regression model was trained to directly predict intraepidermal fibre counts from three-dimensional fields of view (FOVs). The training strategy included cross-validation, customized loss, and ablation studies to assess the contribution of methodological components. The proposed model achieves accurate fibre count prediction at the FOV level and provides clinically reliable IENFD estimates at the biopsy level.
Overall, this work demonstrates the feasibility and effectiveness of a fully automated volumetric deep learning framework for IENFD quantification, representing a concrete step toward integrating artificial intelligence into the diagnostic workflow of SFN.

Modeling the senescence-proliferation continuum in prostate cancer

Denisa Sufaj – BCG thesis @IOR (M. Troiani/M. Masseroli)

Cellular senescence and proliferation are commonly described as mutually exclusive and discrete biological states. However, in cancer, transcriptional programs associated with these phenotypes often coexist and vary in a graded manner, challenging binary classifications. In human prostate cancer, the relationship between proliferative activity and senescence-associated signaling remains insufficiently resolved at single-cell resolution.
In this thesis, we model the senescence–proliferation continuum in human prostate cancer by reconstructing a transcriptomic axis that captures gradual transitions from highly proliferative to senescent-like cellular programs. Using single-cell RNA sequencing data derived from human tumor samples, we integrate pathway-level scoring and data-driven latent representations to define a continuous biological landscape. We then translate this axis into discrete cellular states through statistical modeling approaches, enabling robust classification while preserving the underlying continuum structure.
Our results show that senescence-associated programs do not form a strictly separable compartment but instead emerge along a structured gradient opposing proliferative activity. Beyond canonical markers, we identify a set of genes whose expression dynamics strongly align with the reconstructed axis, highlighting candidate regulators and effectors that shape the senescence–proliferation balance in human prostate cancer cells. Discrete states thus arise as emergent properties of an underlying continuous transcriptomic space. This work provides a quantitative framework to model senescence in human prostate cancer, bridging continuous biological variation and discrete state assignment, and offering refined molecular insight into tumor cell heterogeneity.

Big healthcare data and cardiovascular risk: the role of complex modalities in disease prediction

Andrea Mario Vergani – PhD work @HT (E. Di Angelantonio/F. Ieva/M. Masseroli)

The growing availability of biobank-scale data offers invaluable opportunities for studying the impacts of big health data modalities on biological mechanisms and disease. The aim of the talk is to assess the relevance of multi-modal healthcare data in the cardiovascular field, and the opportunities to exploit novel features and the relationships between heterogeneous data sources towards personalized risk prediction.
Specifically, the first part of the talk will discuss how a cross-modal representation of cardiac imaging, electrocardiogram, and genetic data can predict the future occurrence of cardiovascular events, thus shedding light on the relevance of multi-modal integration of medical test data for risk definition. The second part, instead, will explore the role of unconventional phenotypes to predict incident disease, particularly focusing on deep representation learning-derived factors from cardiac magnetic resonance imaging and single-nucleotide polymorphism data.

Automated generation of digital twins and specifications for healthcare

Bruno Guindani – PostDoc @ DEIB DeepSE lab (M. Bersani)

The talk presents results from the SAFEST project (PRIN). Medical cyber-physical systems (CPSs) that integrate Patients, Devices, and healthcare personnel (Physicians) form safety-critical PDP triads whose dependability is challenged by system heterogeneity and uncertainty in human and physiological behavior. While existing clinical decision support systems support clinical practice, there remains a need for proactive, reliability-oriented methodologies capable of identifying and mitigating failure scenarios before patient safety is compromised.
The talk introduces GENGAR, a methodology based on a closed-loop Digital Twin (DT) paradigm for the dependability assurance of medical CPSs. It combines Stochastic Hybrid Automata modeling, data-driven learning of patient dynamics, fuzzing-based model-space exploration, and clustering in an offline critical-scenario detection phase. In a second phase, it provides automated synthesis of mitigation strategies, enabling runtime feedback and control within the DT loop.
GENGAR is evaluated through a representative use case involving a pulmonary ventilator. Results show that, in most evaluated scenarios, strategies synthesized through formal game-theoretic analysis stabilize patient vital metrics at least as effectively as human decision-making, while keeping relevant metrics closer to nominal healthy values on average. The talk will also briefly introduce MARACTUS, a related methodology that automates the extraction of medical procedures and guidelines into machine-readable representations. This is achieved by transforming unstructured clinical documents into analyzable action models for integration into model-driven pipelines and clinical decision support systems.

Knowledge-based machine learning via semi-amortized neural network and differentiable convex optimization layers

Daniele Bottazzi – CSE thesis @Saez-Rodriguez lab (P. Rodríguez Mier)

Integrating domain knowledge into machine learning models remains challenging, particularly in biological applications where physical constraints, conservation laws, and specific mechanistic relationships are known but difficult to incorporate into powerful and expressive neural network architectures.
We propose a general-purpose end-to-end differentiable framework that couples neural amortization with a structured convex optimization layer formulated as a single quadratic program (QP). The amortizer network generates condition-specific outputs from input features, which are then refined by the QP through a proximal-style optimization that incorporates knowledge-derived constraints while minimizing deviation from the network’s raw predictions. With mild strong convexity, the resulting solution map is unique and locally Lipschitz, ensuring stable gradients and exact feasibility.
We validate the proposed framework with (i) a proof-of-concept on a classical max-flow problem, illustrating how the convex layer enforces feasibility while the amortizer learns to produce accurate solutions, and (ii) a biological application: cell growth prediction from media composition via flux balance analysis (FBA) stoichiometric and reaction-bound constraints. Experiments show near-zero constraint violations, competitive predictive accuracy, and improved interpretability compared with purely data-driven baselines, while scaling efficiently with GPU-batched training.
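
The refinement step, which stays as close as possible to the raw prediction while satisfying a constraint, can be pictured with the simplest special case of such a QP: a single linear equality constraint, for which the minimizer has a closed form. This is only a sketch under that assumption. The framework described in the abstract also handles bounds and other constraints through a differentiable QP layer, which is not reproduced here, and the function name and toy numbers are hypothetical.

```python
def project_onto_hyperplane(raw_prediction, a, b):
    """Solve min ||y - raw_prediction||^2 subject to a . y = b (closed form)."""
    norm_sq = sum(ai * ai for ai in a)
    if norm_sq == 0:
        raise ValueError("constraint vector must be non-zero")
    # How far the raw prediction is from satisfying the constraint.
    residual = sum(ai * yi for ai, yi in zip(a, raw_prediction)) - b
    # Move along the constraint normal just enough to remove the residual.
    return [yi - ai * residual / norm_sq for ai, yi in zip(a, raw_prediction)]


# Toy flow-conservation constraint at one node: inflow - out1 - out2 = 0.
a = [1.0, -1.0, -1.0]
raw = [5.0, 2.0, 2.0]  # hypothetical network output that violates the constraint
refined = project_onto_hyperplane(raw, a, 0.0)
print(refined)
print(sum(ai * yi for ai, yi in zip(a, refined)))  # constraint residual after projection
```

Because the objective is strictly convex, the projected point is unique, which is the same property the abstract relies on for stable gradients.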

A new data modeling approach for alignment-free biological applications

Sergio Lifschitz – Visiting Professor from Pontifícia Universidade Católica do Rio de Janeiro

Identifying and grouping homologous proteins are fundamental tasks in biology, currently dominated by tools that rely on DNA or amino acid sequence data. However, these tasks require the detection of complex evolutionary patterns that are often difficult to capture automatically using traditional methods. This talk presents a data modeling approach that leverages evolutionary patterns for homology searching, ranking, and clustering through an alignment-free process using image similarity algorithms. Our strategy proves valuable even for distant homologs and offers inherent advantages for data privacy and security. Practical experiments show that our approach achieves good results, comparable with those of traditional methods, while also providing additional visual semantic information.

Spatial transcriptomics analysis and denoising with SpaTIM

Sofia Mongardi – PhD student DSBlab & Visiting @Teesside University (A. Occhipinti/M. Masseroli)

Recent advances in spatial transcriptomics (ST) have made it possible to measure gene expression while preserving the spatial organization of cells within tissue samples. Alongside technological improvements, a growing number of computational approaches have been proposed to analyze ST data. Many of these approaches rely on graph-based models, mainly graph neural networks (GNNs), to capture the relationships between neighboring spots and to learn meaningful representations for downstream tasks like spatial domain identification. While these models achieve state-of-the-art performance, they all rely on a predefined and uniformly-constructed graph, usually built using spatial proximity between spots. This approach assumes that spatially adjacent spots are functionally similar, an assumption that does not always hold. To address these limitations, we propose SpaTIM, a novel computational approach for spatial transcriptomics analysis based on GNNs that incorporates additional morphological context to improve graph construction and representation learning. Using morphological information to refine the graph, we ensure that connected spots have similar morphological features, allowing the model to dynamically adjust graph connectivity beyond simple spatial proximity. This allows the model to filter out noisy connections and enhance biologically meaningful relationships, potentially improving the accuracy of spatial domain identification and other downstream tasks.

Navigating heterogeneity in polygenic risk score models: a structured approach to PRS model prioritization

Diana Martinez Minguet – Visiting PhD Student from Universitat Politècnica de València

Polygenic Risk Scores (PRSs) estimate the genetic risk for complex diseases based on the combined impact of many genetic variants. The particular set of genetic variants and their effect sizes associated with a specific disease is determined by a PRS model. These models are derived from genome-wide association studies (GWAS) using diverse statistical methods to adjust variant weights so that they can be aggregated into a single measure. Currently, there are no best practices or standards for constructing and reporting PRS models, resulting in substantial variability across models, even for the same disease.
This heterogeneity poses a significant challenge for clinical translation, where a single PRS model must often be selected from among many alternatives. Differences in domain terminology and the need to balance multiple, heterogeneous, and often conflicting evaluation criteria complicate direct comparison and prioritization of PRS models, making model selection a demanding and time-consuming task. In this seminar we discuss how Conceptual Modeling, Multi-Criteria Decision Analysis and LLM-based data extraction techniques can allow for an adequate prioritization of PRS Models, aiming to streamline the PRS Model selection process.
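
For readers new to PRSs, the aggregation into a single measure is commonly a weighted sum of effect-allele dosages. The sketch below shows that sum for one individual. The variant identifiers, weights and the choice to skip and report missing variants are illustrative assumptions, not part of any specific PRS model or of the prioritization method presented in the talk.

```python
def polygenic_risk_score(dosages, weights):
    """Sum of effect-allele dosages (0, 1 or 2) times per-variant weights.

    Variants in the model that are missing from the genotype are skipped
    and reported, so the caller can judge how complete the score is.
    """
    score = 0.0
    missing = []
    for variant_id, weight in weights.items():
        if variant_id in dosages:
            score += weight * dosages[variant_id]
        else:
            missing.append(variant_id)
    return score, missing


# Hypothetical PRS model: variant ID -> effect size (e.g. a log odds ratio).
model_weights = {"rs0001": 0.12, "rs0002": -0.05, "rs0003": 0.30}
# Hypothetical individual: variant ID -> number of effect alleles carried.
genotype = {"rs0001": 2, "rs0002": 1}

score, missing = polygenic_risk_score(genotype, model_weights)
print(score, missing)
```

Two models for the same disease can differ in both the variant set and the weights, so the same genotype may receive quite different scores, which is the heterogeneity the abstract refers to.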

RAAVioli: a comprehensive approach to characterizing AAV vector integrations and rearrangements

Carlo Cipriani – PhD student DSBlab @TIGET (D. Cesana/M. Masseroli)

In this study, we present RAAVioli, a computational method for the comprehensive analysis of AAV integration sites and vector rearrangements across both long- and short-read sequencing platforms. Through in silico benchmarking and in vivo validation in a xenogeneic human hepatocyte model transduced with AAV, we demonstrate the robustness and versatility of RAAVioli in accurately identifying AAV–genome junctions and reconstructing complex vector rearrangements across diverse experimental workflows. RAAVioli can be applied across different contexts, ranging from gene addition to gene editing. It has already been used to characterize vector integrations in a mouse model of Wilson disease and to study integrations detected in cell-free DNA from patients treated with AAV vectors and from non-human primates across multiple gene-therapy settings, enabling non-invasive safety and efficacy monitoring.

Deep learning approaches for alpha-synucleinopathies classification using skin biopsy images

Luca Zanotto – CSE thesis @MIND Lab (S. Tomè/S. Mazzetti/M. Masseroli)

Parkinson’s Disease (PD) and Multiple System Atrophy (MSA) are neurodegenerative alpha-synucleinopathies that share early symptomatology but require distinct treatments. Recent evidence highlights that cutaneous sweat-gland synaptic innervation, assessed via skin biopsies, may offer a cost-effective and minimally invasive biomarker for differentiation. This study develops a deep learning pipeline analyzing confocal microscopy images of these glands to distinguish PD from MSA. Specifically, images were used to train several deep learning architectures, evaluated across different training strategies. Additionally, explainability methods were applied to better understand the decision-making process of the models. Results demonstrate the efficacy of these architectures, achieving promising performance, especially in the clinically challenging task of differentiating the MSA-parkinsonian subtype (MSA-P) from PD.

Convolutional neural networks for [18F]FDG PET imaging-based prediction of clinical relapse in patients with Takayasu arteritis

Sara Resta – CSE thesis @HSR (M. Picchio/C. Bezzi/M. Masseroli)

Takayasu arteritis is a rare chronic inflammatory disease that mainly affects the aorta and its major branches. Although [18F]FDG PET is the imaging technique with the highest sensitivity for detecting vascular lesions, its role in monitoring disease status remains debated. The highly heterogeneous and sparse distribution of lesions across multiple vascular sites poses challenges in delineating volumes of interest and limits the application of quantitative analysis techniques commonly used in oncology, such as SUV (Standardized Uptake Value) metrics and radiomics. This study investigates the use of Convolutional Neural Networks (CNNs) applied directly to [18F]FDG PET scans to predict patient relapse within 12 months after imaging. A new liver-based standardization approach was optimized to minimize biases during training. As it is not known how the likelihood of a patient experiencing a flare is encoded in the scans, CNNs were first trained to classify scans according to the presence of pathological uptake in the arteries. Then, these models were leveraged for flare prediction through transfer learning. The EfficientNetB0 model showed promise in predicting complete remission, potentially allowing physicians to identify patients who can avoid aggressive therapies and stringent follow-ups.

An integrative bioinformatics and machine learning approach for non-coding RNA-based signatures in amyotrophic lateral sclerosis

Alessandro Cacciatore – CSE thesis @MIND Lab (E. Salvi/L. Maldera)

Amyotrophic lateral sclerosis (ALS) is a neurodegenerative disorder that affects motor neurons, leading to a progressive loss of voluntary muscle control. Depending on the symptoms at onset, ALS can be classified as either spinal (affecting the limbs) or bulbar (affecting speech and swallowing). These two subtypes exhibit distinct histopathological, anatomical, and prognostic features, but their underlying biological differences remain poorly characterized, limiting precise diagnosis and treatment. Epigenetic alterations are increasingly recognized as key modulators of disease mechanisms in neurodegeneration. In particular, non-coding RNAs (ncRNAs), such as microRNAs (miRNAs) and long non-coding RNAs (lncRNAs), play essential regulatory roles in neuronal function, stress response, and inflammation, and their dysregulation has been associated with ALS pathogenesis. Investigating these molecules may therefore uncover epigenetic signatures that support patient stratification within a precision medicine framework. The aim of this thesis project is to develop a reproducible computational framework for the identification and validation of RNA-based biomarkers. The study initially focuses on discovering differentially expressed ncRNAs, and subsequently explores their potential to derive discriminative signatures capable of distinguishing between the two ALS subtypes. Samples were obtained from blood serum of 40 ALS patients (30 with bulbar onset and 10 with spinal onset) and 10 healthy controls. The proposed workflow for the analysis of RNA biomarkers integrates various steps: data pre-processing and normalization, missing value imputation, ensemble feature selection, feature orthogonalization, model optimization and validation with class imbalance correction, and biological validation and interpretation of findings. 
In ensemble feature selection, the outputs of four independent algorithms (Random Forest, Recursive Feature Elimination, LASSO, and K-Best) were employed to construct a robust and stable ranking of features, which was then filtered by an orthogonalization step to keep only those features that provide non-redundant information. Multi-Omics Factor Analysis (MOFA) was used for comparison, as a common integration technique in this field. Although the MOFA-derived factors clearly differentiated ALS patients from controls, they mainly reflected the global disease-related variance rather than the molecular distinctions specific to subtypes, supporting the proposed approach as more suitable in this case. In fact, the Partial Least Squares (PLS) model developed in this study, comprising five components, indicated that the second component had a limited yet noticeable ability to differentiate between bulbar and spinal patients. In conclusion, the results confirm that the proposed computational workflow is a trustworthy and biologically interpretable tool for the discovery of RNA biomarkers in ALS, combining statistical robustness and biological relevance. This framework offers a firm ground for subsequent developments, such as the experimental confirmation and the integration of further omics layers for a deeper comprehension of the molecular heterogeneity of ALS.
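
One simple way to combine the outputs of several feature selectors into a single ranking is to average each feature's rank across selectors, as sketched below. The mean-rank rule is an assumption chosen for illustration, since the abstract does not state how the four outputs were combined, and the feature names and scores are made up.

```python
def rank_features(scores):
    """Map feature -> rank (1 = most important) from importance scores."""
    ordered = sorted(scores, key=lambda f: scores[f], reverse=True)
    return {feature: rank for rank, feature in enumerate(ordered, start=1)}


def aggregate_rankings(score_tables):
    """Order features by their mean rank across all selectors."""
    features = set(score_tables[0])
    for table in score_tables:
        if set(table) != features:
            raise ValueError("every selector must score the same features")
    rank_tables = [rank_features(table) for table in score_tables]
    mean_rank = {
        f: sum(r[f] for r in rank_tables) / len(rank_tables) for f in features
    }
    # Sort by mean rank, then by name so that ties are broken deterministically.
    return sorted(features, key=lambda f: (mean_rank[f], f))


# Hypothetical importance scores from four selectors for three ncRNA features.
selector_scores = [
    {"miR-A": 0.9, "miR-B": 0.5, "lnc-C": 0.1},
    {"miR-A": 0.7, "miR-B": 0.8, "lnc-C": 0.2},
    {"miR-A": 0.6, "miR-B": 0.3, "lnc-C": 0.4},
    {"miR-A": 0.8, "miR-B": 0.6, "lnc-C": 0.1},
]
consensus = aggregate_rankings(selector_scores)
print(consensus)
```

Working on ranks rather than raw scores sidesteps the fact that importance values from different algorithms are not on a common scale.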

Mapping the genetic architecture of complex traits with genome-wide association studies and machine learning

Simone Tomè – PhD student DSBlab @MIND lab (E. Salvi/M. Masseroli)

Complex traits are phenotypes not attributable to a single genetic variant, but to multiple variants, often interacting with environmental factors. Chronic pain exemplifies such a trait, with substantial genetic contributions that remain incompletely understood. Genome-wide association studies (GWAS) aim to uncover the genetic basis of complex traits by testing variants across the genome for associations with the phenotype, typically using statistical learning models. In this study, we present the results of a GWAS conducted on a cohort of chronic pain patients from IRCCS Istituto Neurologico “Carlo Besta”, subsequently validated in an independent cohort from the University of Maastricht. Within this GWAS context, we explore applications of machine learning strategies, with a particular focus on disease gene prediction, an emerging approach that leverages existing knowledge to identify potential disease-associated genes.

Transcriptomic analysis to improve diagnosis and patient stratification in primary mitochondrial myopathies

Francesca Conti – BCG thesis @MIND Lab (D. Ghezzi/A. Legati/M. Masseroli)

Primary Mitochondrial Myopathies (PMM) are a group of mitochondrial disorders characterized by exclusive or predominant skeletal muscle involvement and caused by pathogenic variants in either mitochondrial or nuclear genes. These disorders exemplify the remarkable clinical and genetic heterogeneity typical of mitochondrial diseases, where identical mutations may give rise to distinct phenotypes and similar clinical presentations may originate from different genetic defects, making diagnosis particularly challenging. Transcriptome sequencing (RNA-seq) has recently emerged as a powerful complementary tool to genomic approaches, enabling the detection of transcriptional abnormalities that directly reflect the functional consequences of genetic variation.
In this study, RNA-seq was applied to skeletal muscle biopsies from patients affected by PMM to identify aberrant splicing and gene expression events using the FRASER and OUTRIDER algorithms, respectively. This approach aimed to improve the diagnostic yield and provide new insights into the molecular mechanisms underlying unresolved cases. In parallel, comprehensive computational analyses were developed to explore mitochondrial transcriptomic alterations in patients carrying multiple mtDNA deletions, with a focus on the potential formation of chimeric transcripts and their translation into aberrant proteins. Finally, on this same cohort of patients, clustering analyses based on mitochondrial gene expression and deletion profiles were performed to identify patient subgroups characterized by specific molecular signatures associated with distinct causative genes and histological features.