Data Science Seminars: Bioinformatics Focus

These seminars take place every Tuesday at 14:15 in the Alario seminar room (Building 21).
Researchers, students, and anyone interested in the proposed topics are more than welcome to attend!

Modeling the senescence-proliferation continuum in prostate cancer: from transcriptomic axes to discrete cellular states.

Denisa Sufaj – BCG thesis @IOR/HSR (M. Troiani/P. Delfino)

Cellular senescence and proliferation are commonly described as mutually exclusive and discrete biological states. However, in cancer, transcriptional programs associated with these phenotypes often coexist and vary in a graded manner, challenging binary classifications. In human prostate cancer, the relationship between proliferative activity and senescence-associated signaling remains insufficiently resolved at single-cell resolution.
In this thesis, we model the senescence–proliferation continuum in human prostate cancer by reconstructing a transcriptomic axis that captures gradual transitions from highly proliferative to senescent-like cellular programs. Using single-cell RNA sequencing data derived from human tumor samples, we integrate pathway-level scoring and data-driven latent representations to define a continuous biological landscape. We then translate this axis into discrete cellular states through statistical modeling approaches, enabling robust classification while preserving the underlying continuum structure.
Our results show that senescence-associated programs do not form a strictly separable compartment but instead emerge along a structured gradient opposing proliferative activity. Beyond canonical markers, we identify a set of genes whose expression dynamics strongly align with the reconstructed axis, highlighting candidate regulators and effectors that shape the senescence–proliferation balance in human prostate cancer cells. Discrete states thus arise as emergent properties of an underlying continuous transcriptomic space. This work provides a quantitative framework to model senescence in human prostate cancer, bridging continuous biological variation and discrete state assignment, and offering refined molecular insight into tumor cell heterogeneity.

Big healthcare data and cardiovascular risk: The role of complex modalities in disease prediction

Andrea Mario Vergani – PhD work @HT (Di Angelantonio/Ieva/Masseroli)

The growing availability of biobank-scale data offers invaluable opportunities for studying the impacts of big health data modalities on biological mechanisms and disease. The aim of the talk is to assess the relevance of multi-modal healthcare data in the cardiovascular field, and the opportunities to exploit novel features and the relationships between heterogeneous data sources towards personalized risk prediction.
Specifically, the first part of the talk will discuss how a cross-modal representation of cardiac imaging, electrocardiogram, and genetic data can predict the future occurrence of cardiovascular events, thus shedding light on the relevance of multi-modal integration of medical test data for risk definition. The second part, instead, will explore the role of unconventional phenotypes to predict incident disease, particularly focusing on deep representation learning-derived factors from cardiac magnetic resonance imaging and single-nucleotide polymorphism data.

Automated Generation of Digital Twins and Specifications for Healthcare

Bruno Guindani – PostDoc @ DEIB DeepSE lab (Bersani)

The talk presents results from the SAFEST project (PRIN). Medical cyber-physical systems (CPSs) that integrate Patients, Devices, and healthcare personnel (Physicians) form safety-critical PDP triads whose dependability is challenged by system heterogeneity and uncertainty in human and physiological behavior. While existing clinical decision support systems support clinical practice, there remains a need for proactive, reliability-oriented methodologies capable of identifying and mitigating failure scenarios before patient safety is compromised.
The talk introduces GENGAR, a methodology based on a closed-loop Digital Twin (DT) paradigm for the dependability assurance of medical CPSs. It combines Stochastic Hybrid Automata modeling, data-driven learning of patient dynamics, fuzzing-based model-space exploration, and clustering in an offline critical-scenario detection phase. In a second phase, it provides automated synthesis of mitigation strategies, enabling runtime feedback and control within the DT loop.
GENGAR is evaluated through a representative use case involving a pulmonary ventilator. Results show that, in most evaluated scenarios, strategies synthesized through formal game-theoretic analysis stabilize patient vital metrics at least as effectively as human decision-making, while keeping relevant metrics closer to nominal healthy values on average. The talk will also briefly introduce MARACTUS, a related methodology that automates the extraction of medical procedures and guidelines into machine-readable representations. This is achieved by transforming unstructured clinical documents into analyzable action models for integration into model-driven pipelines and clinical decision support systems.

Knowledge-Based Machine Learning via Semi-Amortized Neural Network and Differentiable Convex Optimization Layers

Daniele Bottazzi – CSE thesis @Saez-Rodriguez lab (Pablo Rodríguez Mier)

Integrating domain knowledge into machine learning models remains challenging, particularly in biological applications where physical constraints, conservation laws, and specific mechanistic relationships are known but difficult to incorporate into powerful and expressive neural network architectures.
We propose a general-purpose, end-to-end differentiable framework that couples neural amortization with a structured convex optimization layer formulated as a single quadratic program (QP). The amortizer network generates condition-specific outputs from input features, which are then refined by the QP through a proximal-style optimization that incorporates knowledge-derived constraints while minimizing deviation from the network’s raw predictions. Under mild strong convexity, the resulting solution map is unique and locally Lipschitz, ensuring stable gradients and exact feasibility.
We validate the proposed framework with (i) a proof-of-concept on a classical max-flow problem, illustrating how the convex layer enforces feasibility while the amortizer learns to produce accurate solutions, and (ii) a biological application: cell growth prediction from media composition via flux balance analysis (FBA) stoichiometric and reaction-bound constraints. Experiments show near-zero constraint violations, competitive predictive accuracy, and improved interpretability compared with purely data-driven baselines, while scaling efficiently with GPU-batched training.
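The refinement step described above (stay as close as possible to the raw network output while satisfying known constraints) can be illustrated in its simplest form. The following is a minimal sketch, not the framework presented in the talk: it assumes a single linear equality constraint, for which the QP has a closed-form solution, whereas the actual layer also handles bounds and is solved by a differentiable convex solver. The function name, the flux values, and the constraint are hypothetical.

```python
def project_onto_constraint(y, a, b):
    """Return the point closest to y (Euclidean norm) with sum(a_i * x_i) == b.

    This is the closed-form solution of
        minimize 0.5 * ||x - y||^2  subject to  a . x = b,
    i.e. a QP with one equality constraint (assumption: no bounds).
    """
    residual = sum(ai * yi for ai, yi in zip(a, y)) - b
    scale = residual / sum(ai * ai for ai in a)
    return [yi - scale * ai for ai, yi in zip(a, y)]


# Hypothetical raw network output: two fluxes producing a metabolite, one consuming it.
raw_fluxes = [1.0, 2.0, 2.4]
# Hypothetical steady-state mass balance: in1 + in2 - out = 0.
mass_balance = [1.0, 1.0, -1.0]

feasible_fluxes = project_onto_constraint(raw_fluxes, mass_balance, 0.0)
violation = sum(a * x for a, x in zip(mass_balance, feasible_fluxes))
```

In this toy case the raw prediction violates the balance by 0.6, and the projection spreads the correction evenly over the three fluxes so that the constraint holds up to floating-point error.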

A New Data Modeling Approach for Alignment-free Biological Applications

Sergio Lifschitz – Visiting Professor from Pontifícia Universidade Católica do Rio de Janeiro

Identifying and grouping homologous proteins are fundamental tasks in biology, currently dominated by tools that rely on DNA or amino acid sequence data. However, these tasks require the detection of complex evolutionary patterns that are often difficult to capture automatically using traditional methods. This talk presents a data modeling approach that leverages evolutionary patterns for homology searching, ranking, and clustering through an alignment-free process using image similarity algorithms. Our strategy proves valuable even for distant homologs and offers inherent advantages for data privacy and security. Practical experiments show that our approach achieves good results, comparable with those of traditional methods, while also providing additional visual semantic information.

Spatial transcriptomics analysis and denoising with SpaTIM

Sofia Mongardi – PhD student DSBlab & Visiting @Teesside University (Occhipinti/Masseroli)

Recent advances in spatial transcriptomics (ST) have made it possible to measure gene expression while preserving the spatial organization of cells within tissue samples. Alongside technological improvements, a growing number of computational approaches have been proposed to analyze ST data. Many of these approaches rely on graph-based models, mainly graph neural networks (GNNs), to capture the relationships between neighboring spots and to learn meaningful representations for downstream tasks like spatial domain identification. While these models achieve state-of-the-art performance, they all rely on a predefined and uniformly constructed graph, usually built using spatial proximity between spots. This approach assumes that spatially adjacent spots are functionally similar, an assumption that does not always hold. To address this limitation, we propose SpaTIM, a novel computational approach for spatial transcriptomics analysis based on GNNs that incorporates additional morphological context to improve graph construction and representation learning. Using morphological information to refine the graph, we ensure that connected spots have similar morphological features, allowing the model to dynamically adjust graph connectivity beyond simple spatial proximity. This allows the model to filter out noisy connections and enhance biologically meaningful relationships, potentially improving the accuracy of spatial domain identification and other downstream tasks.
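The idea of refining a proximity graph with morphological context can be sketched as follows. This is an illustration only, not the SpaTIM implementation: it assumes a k-nearest-neighbour spatial graph, per-spot morphological feature vectors, and a fixed cosine-similarity threshold for pruning, all of which are hypothetical choices, as are the function names and toy data.

```python
import math


def knn_edges(coords, k):
    """Directed edges from each spot to its k spatially nearest spots."""
    edges = set()
    for i, p in enumerate(coords):
        others = sorted(
            (j for j in range(len(coords)) if j != i),
            key=lambda j: math.dist(p, coords[j]),
        )
        for j in others[:k]:
            edges.add((i, j))
    return edges


def cosine(u, v):
    """Cosine similarity of two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def refine_edges(edges, morph, min_sim):
    """Keep only edges whose endpoints have similar morphological features."""
    return {(i, j) for i, j in edges if cosine(morph[i], morph[j]) >= min_sim}


# Toy data: four spots on a line; spots 0-1 and 2-3 belong to two tissue types.
coords = [(0.0, 0.0), (1.0, 0.0), (1.8, 0.0), (3.0, 0.0)]
morph = [(1.0, 0.0), (0.9, 0.1), (0.1, 0.9), (0.0, 1.0)]

spatial_graph = knn_edges(coords, k=1)
refined_graph = refine_edges(spatial_graph, morph, min_sim=0.8)
```

Here spots 1 and 2 are spatially closest to each other but morphologically different, so the edges between them are dropped while the within-tissue edges are kept.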

Navigating Heterogeneity in Polygenic Risk Score Models: A Structured Approach to PRS Model Prioritization

Diana Martinez Minguet – Visiting PhD Student from Universitat Politècnica de València

Polygenic Risk Scores (PRSs) estimate the genetic risk for complex diseases, based on the combined impact of many genetic variants. The particular set of genetic variants and their effect sizes associated with a specific disease is determined by a PRS model. These models are derived from genome-wide association studies (GWAS) using diverse statistical methods that adjust variant weights so that they can be aggregated into a single measure. Currently, there are no best practices or standards for constructing and reporting PRS models, resulting in substantial variability across models, even for the same disease.
This heterogeneity poses a significant challenge for clinical translation, where a single PRS model must often be selected from among many alternatives. Differences in domain terminology and the need to balance multiple, heterogeneous, and often conflicting evaluation criteria complicate direct comparison and prioritization of PRS models, making model selection a demanding and time-consuming task. In this seminar we discuss how Conceptual Modeling, Multi-Criteria Decision Analysis and LLM-based data extraction techniques can allow for an adequate prioritization of PRS Models, aiming to streamline the PRS Model selection process.
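For readers unfamiliar with how a PRS model turns variants and effect sizes into a single measure, the standard additive form is a weighted sum of effect-allele dosages. The sketch below assumes that form; the variant identifiers, weights, and genotype are made up for illustration, and treating variants absent from the genotype as dosage 0 is a simplifying assumption rather than a recommended practice.

```python
def polygenic_risk_score(dosages, weights):
    """Additive PRS: sum over model variants of effect size times allele dosage."""
    score = 0.0
    for variant, beta in weights.items():
        # Assumption: a variant missing from the genotype contributes dosage 0.
        score += beta * dosages.get(variant, 0)
    return score


# Hypothetical PRS model: variant ID -> effect size (e.g. a log odds ratio).
model_weights = {"rsA": 0.12, "rsB": -0.05, "rsC": 0.30}
# Hypothetical individual: number of effect alleles carried (0, 1, or 2).
genotype = {"rsA": 2, "rsB": 1, "rsC": 0}

prs = polygenic_risk_score(genotype, model_weights)
```

Two PRS models for the same disease can differ in both the variant set and the weights, which is why the same genotype may receive different scores under different models.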

RAAVioli: A Comprehensive Approach to Characterizing AAV Vector Integrations and Rearrangements

Carlo Cipriani – PhD student DSBlab @TIGET (Cesana/Masseroli)

In this study, we present RAAVioli, a computational method for the comprehensive analysis of AAV integration sites and vector rearrangements across both long- and short-read sequencing platforms. Through in silico benchmarking and in vivo validation in a xenogeneic human hepatocyte model transduced with AAV, we demonstrate the robustness and versatility of RAAVioli in accurately identifying AAV–genome junctions and reconstructing complex vector rearrangements across diverse experimental workflows. RAAVioli can be applied across different contexts, ranging from gene addition to gene editing. It has already been used to characterize vector integrations in a mouse model of Wilson disease and to study integrations detected in cell-free DNA from patients treated with AAV vectors and from non-human primates across multiple gene-therapy settings, enabling non-invasive safety and efficacy monitoring.

Deep Learning Approaches for Alpha-Synucleinopathies Classification Using Skin Biopsy Images

Luca Zanotto – CSE thesis @MIND (Tomè/Mazzetti/Masseroli)

Parkinson’s Disease (PD) and Multiple System Atrophy (MSA) are neurodegenerative α-synucleinopathies that share early symptomatology but require distinct treatments. Recent evidence highlights that cutaneous sweat-gland synaptic innervation, assessed via skin biopsies, may offer a cost-effective and minimally invasive biomarker for differentiation. This study develops a deep learning pipeline analyzing confocal microscopy images of these glands to distinguish PD from MSA. Specifically, images were used to train several deep learning architectures, evaluated across different training strategies. Additionally, explainability methods were applied to better understand the decision-making process of the models. Results demonstrate the efficacy of these architectures, achieving promising performance, especially in the clinically challenging task of differentiating the MSA-parkinsonian subtype (MSA-P) from PD.

Convolutional Neural Networks for [18F]FDG PET imaging-based prediction of clinical relapse in patients with Takayasu Arteritis

Sara Resta – CSE thesis @HSR (Picchio/Bezzi/Masseroli)

Takayasu arteritis is a rare chronic inflammatory disease that mainly affects the aorta and its major branches. Although [18F]FDG PET is the imaging technique with the highest sensitivity for detecting vascular lesions, its role in monitoring disease status remains debated. The highly heterogeneous and sparse distribution of lesions across multiple vascular sites poses challenges in delineating volumes of interest and limits the application of quantitative analysis techniques commonly used in oncology, such as SUV (Standardized Uptake Value) metrics and radiomics. This study investigates the use of Convolutional Neural Networks (CNNs) applied directly to [18F]FDG PET scans to predict patient relapse within 12 months after imaging. A new liver-based standardization approach was optimized to minimize biases during training. As it is not known how the likelihood that a patient will experience a flare is encoded in the scans, CNNs were first trained to classify scans according to the presence of pathological uptake in the arteries. Then, these models were leveraged for flare prediction through transfer learning. The EfficientNetB0 model showed promise in predicting complete remission, potentially allowing physicians to identify patients who can avoid aggressive therapies and stringent follow-ups.

An integrative bioinformatics and machine learning approach for non-coding RNA-based signatures in Amyotrophic Lateral Sclerosis

Alessandro Cacciatore – CSE thesis @MIND (Salvi/Maldera)

Amyotrophic lateral sclerosis (ALS) is a neurodegenerative disorder that affects motor neurons, leading to a progressive loss of voluntary muscle control. Depending on the symptoms at onset, ALS can be classified as either spinal (affecting the limbs) or bulbar (affecting speech and swallowing). These two subtypes exhibit distinct histopathological, anatomical, and prognostic features, but their underlying biological differences remain poorly characterized, limiting precise diagnosis and treatment. Epigenetic alterations are increasingly recognized as key modulators of disease mechanisms in neurodegeneration. In particular, non-coding RNAs (ncRNAs), such as microRNAs (miRNAs) and long non-coding RNAs (lncRNAs), play essential regulatory roles in neuronal function, stress response, and inflammation, and their dysregulation has been associated with ALS pathogenesis. Investigating these molecules may therefore uncover epigenetic signatures that support patient stratification within a precision medicine framework. The aim of this thesis project is to develop a reproducible computational framework for the identification and validation of RNA-based biomarkers. The study initially focuses on discovering differentially expressed ncRNAs, and subsequently explores their potential to derive discriminative signatures capable of distinguishing between the two ALS subtypes. Samples were obtained from blood serum of 40 ALS patients (30 with bulbar onset and 10 with spinal onset) and 10 healthy controls. The proposed workflow for the analysis of RNA biomarkers integrates various steps: data pre-processing and normalization, missing value imputation, ensemble feature selection, feature orthogonalization, model optimization and validation with class imbalance correction, and biological validation and interpretation of findings. 
In ensemble feature selection, the outputs of four independent algorithms (Random Forest, Recursive Feature Elimination, LASSO, and K-Best) were combined to construct a robust and stable ranking of features, which was then filtered by an orthogonalization step to keep only those features that provide non-redundant information. Multi-Omics Factor Analysis (MOFA) was used as a comparison with a common integration technique in this field. Although the MOFA-derived factors clearly differentiated ALS patients from controls, they mainly reflected the global disease-related variance rather than the molecular distinctions specific to subtypes, supporting the proposed approach as more suitable in this case. In fact, the Partial Least Squares (PLS) model developed in this study, comprising five components, indicated that the second component had a limited yet noticeable ability to differentiate between bulbar and spinal patients. In conclusion, the results confirm that the proposed computational workflow is a trustworthy and biologically interpretable tool for the discovery of RNA biomarkers in ALS, combining statistical robustness and biological relevance. This framework offers a solid foundation for subsequent developments, such as experimental confirmation and the integration of further omics layers for a deeper understanding of the molecular heterogeneity of ALS.
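The two steps named above, combining several feature rankings and then discarding redundant features, can be sketched in a few lines. This is not the thesis pipeline: the aggregation rule (mean rank), the redundancy criterion (a greedy Pearson-correlation filter standing in for the orthogonalization step), the threshold, and all feature names and values are assumptions chosen for illustration.

```python
import math


def mean_rank(rankings):
    """Order features by their average position across ranked lists (1 = best)."""
    features = rankings[0]
    totals = {f: 0.0 for f in features}
    for ranking in rankings:
        for position, f in enumerate(ranking, start=1):
            totals[f] += position
    return sorted(features, key=lambda f: totals[f] / len(rankings))


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def drop_redundant(ranked, values, max_abs_corr):
    """Greedy filter: keep a feature only if weakly correlated with all kept ones."""
    kept = []
    for f in ranked:
        if all(abs(pearson(values[f], values[g])) < max_abs_corr for g in kept):
            kept.append(f)
    return kept


# Hypothetical rankings produced by four selectors (best feature first).
rankings = [
    ["miR-a", "miR-b", "lnc-c"],
    ["miR-a", "lnc-c", "miR-b"],
    ["miR-b", "miR-a", "lnc-c"],
    ["miR-a", "miR-b", "lnc-c"],
]
# Hypothetical expression values across four samples.
values = {
    "miR-a": [1.0, 2.0, 3.0, 4.0],
    "miR-b": [2.0, 4.0, 6.0, 8.5],   # nearly a rescaled copy of miR-a
    "lnc-c": [1.0, -1.0, 1.0, -1.0],
}

consensus = mean_rank(rankings)
selected = drop_redundant(consensus, values, max_abs_corr=0.9)
```

In this toy example the second-ranked feature is almost perfectly correlated with the first and is removed, while the lower-ranked but independent feature is retained.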

Mapping the Genetic Architecture of Complex Traits with Genome-Wide Association Studies and Machine Learning

Simone Tomè – PhD student DSBlab @MIND (Salvi/Masseroli)

Complex traits are phenotypes not attributable to a single genetic variant, but to multiple variants, often interacting with environmental factors. Chronic pain exemplifies such a trait, with substantial genetic contributions that remain incompletely understood. Genome-wide association studies (GWAS) aim to uncover the genetic basis of complex traits by testing variants across the genome for associations with the phenotype, typically using statistical learning models. In this study, we present the results of a GWAS conducted on a cohort of chronic pain patients from IRCCS Istituto Neurologico “Carlo Besta”, subsequently validated in an independent cohort from the University of Maastricht. Within this GWAS context, we explore applications of machine learning strategies, with a particular focus on disease gene prediction, an emerging approach that leverages existing knowledge to identify potential disease-associated genes.

Transcriptomic analysis to improve diagnosis and patient stratification in primary mitochondrial myopathies

Francesca Conti – BCG thesis @MIND (Ghezzi/Legati/Masseroli)

Primary Mitochondrial Myopathies (PMM) are a group of mitochondrial disorders characterized by exclusive or predominant skeletal muscle involvement and caused by pathogenic variants in either mitochondrial or nuclear genes. These disorders exemplify the remarkable clinical and genetic heterogeneity typical of mitochondrial diseases, where identical mutations may give rise to distinct phenotypes and similar clinical presentations may originate from different genetic defects, making diagnosis particularly challenging. Transcriptome sequencing (RNA-seq) has recently emerged as a powerful complementary tool to genomic approaches, enabling the detection of transcriptional abnormalities that directly reflect the functional consequences of genetic variation.
In this study, RNA-seq was applied to skeletal muscle biopsies from patients affected by PMM to identify aberrant splicing and gene expression events using the FRASER and OUTRIDER algorithms, respectively. This approach aimed to improve the diagnostic yield and provide new insights into the molecular mechanisms underlying unresolved cases. In parallel, comprehensive computational analyses were developed to explore mitochondrial transcriptomic alterations in patients carrying multiple mtDNA deletions, with a focus on the potential formation of chimeric transcripts and their translation into aberrant proteins. Finally, on this same cohort of patients, clustering analyses based on mitochondrial gene expression and deletion profiles were performed to identify patient subgroups characterized by specific molecular signatures associated with distinct causative genes and histological features.