
Data Science Seminars: Bioinformatics Focus
These seminars take place every Tuesday at 14:15 in the Alario seminar room (Building 21).
Researchers, students, and anyone interested in the proposed topics are more than welcome to attend!

Catalina Vallejos Meneses – University of Edinburgh
Can we identify who will experience an adverse health event (e.g. disease onset) weeks, months or even years before it happens? Questions like this are at the core of health data science research and have been empowered by the increasing ability to securely access routinely collected electronic health records (EHR). A key exemplar in Scotland is SPARRAv4 (Scottish Patients at Risk of Readmission and Admission version 4), a population-wide model that will soon be deployed to support anticipatory care planning. I will discuss some of the practical and methodological challenges that arise in the development and evaluation of such models, focusing on time-to-event outcomes. I will introduce the “C-index multiverse”, highlighting how different conceptual and implementation choices can affect model comparison and hinder reproducibility. Finally, I will introduce landmaRk as a flexible tool to perform dynamic risk prediction in the presence of latent population heterogeneity.
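As an illustration of how one implementation choice can move the C-index, the sketch below computes a simplified Harrell-style concordance in plain Python under two conventions for pairs with tied predicted risk. It is not taken from the talk: the function name `c_index`, the `tied_risk` option and the toy cohort are assumptions made for this example, and pairs tied in observed time are simply skipped.

```python
from itertools import combinations


def c_index(times, events, risks, tied_risk="half"):
    """Harrell-style concordance for right-censored data (simplified sketch).

    tied_risk="half": pairs with equal predicted risk count as 0.5 concordant.
    tied_risk="drop": such pairs are excluded from the comparison.
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # simplification: pairs tied in observed time are skipped
        # a is the subject with the shorter observed time
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if not events[a]:
            continue  # earlier subject was censored, so the ordering is unknown
        if risks[a] == risks[b]:
            if tied_risk == "drop":
                continue
            concordant += 0.5
        elif risks[a] > risks[b]:
            concordant += 1.0
        comparable += 1
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy cohort: follow-up times, event indicators (1 = event), predicted risks.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risks = [0.9, 0.5, 0.5, 0.1]

print(c_index(times, events, risks, tied_risk="half"))
print(c_index(times, events, risks, tied_risk="drop"))
```

On this toy cohort the two conventions give 0.9 and 1.0 for the same predictions, which is the kind of discrepancy that can complicate comparisons between models evaluated with different software.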
Scott Ritchie – University of Cambridge
Efforts to reduce cardiometabolic diseases focus on controlling major modifiable risk factors through targeted intervention in people identified at high risk. However, our understanding of their aetiology remains incomplete, hindering our ability to both predict and prevent these diseases. To address this, we leverage multi-omics data in population cohorts and biobanks to identify new potentially modifiable molecular targets as well as to assess evidence for the potential for multi-omics to enhance existing clinical risk prediction tools.
Linda Maldera – PhD student @MIND Lab (E. Salvi/M. Masseroli)
Multiple Sclerosis (MS) is a demyelinating disease of the central nervous system, classified into three distinct phenotypes: relapsing-remitting (RRMS), secondary progressive (SPMS) and primary progressive (PPMS). SPMS is diagnosed in more than half of RRMS patients when subtle disease progression occurs independently of relapses. Early definition of SPMS in RRMS patients is still a challenge, as no reliable biomarkers are available in the early stages of disease progression. Retinal neural tissue provides a unique window on MS progression: recent studies have shown that the inner plexiform layer (IPL) is thinner in people with MS than in healthy controls, and that IPL thinning characterizes patients with RRMS who progressed to SPMS. Data on IPL thickness can be easily obtained through optical coherence tomography (OCT); however, there are no available tools or pipelines to analyze OCT data longitudinally. In this seminar we will illustrate the RePlayMS study, from conception to execution, and its preliminary results, discussing open points and still-unanswered methodological questions.
Riccardo Pizzichemi – PhD student DSBlab @TIGET (E. Montini/M. Masseroli)
Forensic genetics uses DNA profiles to identify individuals in judicial investigations, providing evidence that can link biological traces to potential contributors. Modern forensic DNA analysis is primarily based on the examination of short tandem repeats (STR), highly polymorphic loci that allow discrimination between individuals. While the laboratory workflow that generates STR profiles is well established, the interpretation of forensic DNA evidence often presents substantial analytical and statistical challenges, particularly when dealing with degraded samples, low-template DNA, or mixtures of genetic material from multiple individuals. In this seminar, we first introduce the general workflow of forensic DNA analysis, from biological trace collection and DNA extraction to STR amplification and capillary electrophoresis, which produces the electropherogram used for allele identification. As a case study, we discuss the investigation of the Yara Gambirasio homicide in Italy, one of the most prominent forensic DNA cases in Europe. The analysis and interpretation of the DNA profile known as “Ignoto 1” played a central role in the investigation.
Simone Callegarin – CSE thesis @MIND Lab (M. Masseroli/S. Tomè)
Small Fibre Neuropathy (SFN) is a neurological disorder involving small somatic and autonomic nerve fibres, leading to chronic neuropathic pain and reduced quality of life. Since these fibres cannot be assessed by routine nerve conduction studies, diagnosis relies on skin biopsy with manual quantification of intraepidermal nerve fibre density (IENFD). Although considered the reference standard, this method is time-consuming and operator-dependent, highlighting the need for more efficient and objective approaches. In this context, deep learning techniques, particularly Convolutional Neural Networks (CNNs), offer a promising solution for automated and reproducible analysis. This work presents a fully automated framework for IENFD quantification from immunofluorescence skin biopsy images using three-dimensional CNNs (3D CNNs). In collaboration with the Fondazione I.R.C.C.S. Istituto Neurologico Carlo Besta, a dedicated dataset was created from 60 biopsies collected from 20 patients. A comprehensive preprocessing pipeline was developed to cover the entire workflow, from raw volumetric data to patient-level diagnostic inference. It was designed to standardize and facilitate the manual annotation process through orientation correction, denoising, and signal enhancement, and also to optimize model performance by reducing irrelevant variability and refining the input data for deep learning training. A 3D CNN regression model was trained to directly predict intraepidermal fibre counts from three-dimensional fields of view (FOVs). The training strategy included cross-validation, customized loss, and ablation studies to assess the contribution of methodological components. The proposed model achieves accurate fibre count prediction at the FOV level and provides clinically reliable IENFD estimates at the biopsy level.
Overall, this work demonstrates the feasibility and effectiveness of a fully automated volumetric deep learning framework for IENFD quantification, representing a concrete step toward integrating artificial intelligence into the diagnostic workflow of SFN.
Denisa Sufaj – BCG thesis @IOR (M. Troiani/M. Masseroli)
Cellular senescence and proliferation are commonly described as mutually exclusive and discrete biological states. However, in cancer, transcriptional programs associated with these phenotypes often coexist and vary in a graded manner, challenging binary classifications. In human prostate cancer, the relationship between proliferative activity and senescence-associated signaling remains insufficiently resolved at single-cell resolution.
In this thesis, we model the senescence–proliferation continuum in human prostate cancer by reconstructing a transcriptomic axis that captures gradual transitions from highly proliferative to senescent-like cellular programs. Using single-cell RNA sequencing data derived from human tumor samples, we integrate pathway-level scoring and data-driven latent representations to define a continuous biological landscape. We then translate this axis into discrete cellular states through statistical modeling approaches, enabling robust classification while preserving the underlying continuum structure.
Our results show that senescence-associated programs do not form a strictly separable compartment but instead emerge along a structured gradient opposing proliferative activity. Beyond canonical markers, we identify a set of genes whose expression dynamics strongly align with the reconstructed axis, highlighting candidate regulators and effectors that shape the senescence–proliferation balance in human prostate cancer cells. Discrete states thus arise as emergent properties of an underlying continuous transcriptomic space. This work provides a quantitative framework to model senescence in human prostate cancer, bridging continuous biological variation and discrete state assignment, and offering refined molecular insight into tumor cell heterogeneity.
Andrea Mario Vergani – PhD work @HT (E. Di Angelantonio/F. Ieva/M. Masseroli)
The growing availability of biobank-scale data offers invaluable opportunities for studying the impacts of big health data modalities on biological mechanisms and disease. The aim of the talk is to assess the relevance of multi-modal healthcare data in the cardiovascular field, and the opportunities to exploit novel features and the relationships between heterogeneous data sources towards personalized risk prediction.
Specifically, the first part of the talk will discuss how a cross-modal representation of cardiac imaging, electrocardiogram, and genetic data can predict the future occurrence of cardiovascular events, thus shedding light on the relevance of multi-modal integration of medical test data for risk definition. The second part, instead, will explore the role of unconventional phenotypes to predict incident disease, particularly focusing on deep representation learning-derived factors from cardiac magnetic resonance imaging and single-nucleotide polymorphism data.
Bruno Guindani – PostDoc @ DEIB DeepSE lab (M. Bersani)
The talk presents results from the SAFEST project (PRIN). Medical cyber-physical systems (CPSs) that integrate Patients, Devices, and healthcare personnel (Physicians) form safety-critical PDP triads whose dependability is challenged by system heterogeneity and uncertainty in human and physiological behavior. While existing clinical decision support systems support clinical practice, there remains a need for proactive, reliability-oriented methodologies capable of identifying and mitigating failure scenarios before patient safety is compromised.
The talk introduces GENGAR, a methodology based on a closed-loop Digital Twin (DT) paradigm for the dependability assurance of medical CPSs. It combines Stochastic Hybrid Automata modeling, data-driven learning of patient dynamics, fuzzing-based model-space exploration, and clustering in an offline critical-scenario detection phase. In a second phase, it provides automated synthesis of mitigation strategies, enabling runtime feedback and control within the DT loop.
GENGAR is evaluated through a representative use case involving a pulmonary ventilator. Results show that, in most evaluated scenarios, strategies synthesized through formal game-theoretic analysis stabilize patient vital metrics at least as effectively as human decision-making, while keeping relevant metrics closer to nominal healthy values on average. The talk will also briefly introduce MARACTUS, a related methodology that automates the extraction of medical procedures and guidelines into machine-readable representations. This is achieved by transforming unstructured clinical documents into analyzable action models for integration into model-driven pipelines and clinical decision support systems.
Daniele Bottazzi – CSE thesis @Saez-Rodriguez lab (P. Rodríguez Mier)
Integrating domain knowledge into machine learning models remains challenging, particularly in biological applications where physical constraints, conservation laws, and specific mechanistic relationships are known but difficult to incorporate into powerful and expressive neural network architectures.
We propose a general-purpose end-to-end differentiable framework that couples neural amortization with a structured convex optimization layer formulated as a single quadratic program (QP). The amortizer network generates condition-specific outputs from input features, which are then refined by the QP through a proximal-style optimization that incorporates knowledge-derived constraints while minimizing deviation from the network’s raw predictions. Under mild strong-convexity assumptions, the resulting solution map is unique and locally Lipschitz, ensuring stable gradients and exact feasibility.
We validate the proposed framework with (i) a proof-of-concept on a classical max-flow problem, illustrating how the convex layer enforces feasibility while the amortizer learns to produce accurate solutions, and (ii) a biological application: cell growth prediction from media composition via flux balance analysis (FBA) stoichiometric and reaction-bound constraints. Experiments show near-zero constraint violations, competitive predictive accuracy, and improved interpretability compared with purely data-driven baselines, while scaling efficiently with GPU-batched training.
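To make the idea of a proximal-style refinement concrete, the sketch below solves the smallest possible instance in closed form: the raw prediction is moved as little as possible (in Euclidean distance) so that one linear equality constraint holds exactly. This is a toy stand-in, not the framework presented in the talk; the function name, the single-metabolite mass balance and the numbers are all assumptions for illustration, and bounds or multiple constraints would need a real QP solver.

```python
def project_onto_constraint(y, a, b=0.0):
    """Solve  min_x ||x - y||^2  subject to  a . x = b  in closed form.

    The minimiser is x = y - a * (a . y - b) / (a . a).
    """
    residual = sum(ai * yi for ai, yi in zip(a, y)) - b
    scale = residual / sum(ai * ai for ai in a)
    return [yi - scale * ai for ai, yi in zip(a, y)]


# Hypothetical mass balance for one metabolite: reaction 1 produces it,
# reactions 2 and 3 consume it, so feasible fluxes satisfy v1 - v2 - v3 = 0.
stoichiometry = [1.0, -1.0, -1.0]

# Hypothetical raw network output that violates the balance (1.0 - 0.3 - 0.4 != 0).
raw_fluxes = [1.0, 0.3, 0.4]

feasible = project_onto_constraint(raw_fluxes, stoichiometry)
balance = sum(s * v for s, v in zip(stoichiometry, feasible))

print(feasible)  # close to [0.9, 0.4, 0.5]
print(balance)   # zero up to floating-point error
```

Because the objective is strictly convex, the corrected point is unique, which mirrors the uniqueness property stated in the abstract at a much smaller scale.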
Sergio Lifschitz – Visiting Professor from Pontifícia Universidade Católica do Rio de Janeiro
Identifying and grouping homologous proteins are fundamental tasks in biology, currently dominated by tools that rely on DNA or amino acid sequence data. However, these tasks require the detection of complex evolutionary patterns that are often difficult to capture automatically using traditional methods. This talk presents a data modeling approach that leverages evolutionary patterns for homology searching, ranking, and clustering through an alignment-free process using image similarity algorithms. Our strategy proves valuable even for distant homologs and offers inherent advantages for data privacy and security. Practical experiments show that our approach achieves good results, comparable with those of traditional methods, while providing additional visual semantic information.
Sofia Mongardi – PhD student DSBlab & Visiting @Teesside University (A. Occhipinti/M. Masseroli)
Recent advances in spatial transcriptomics (ST) have made it possible to measure gene expression while preserving the spatial organization of cells within tissue samples. Alongside technological improvements, a growing number of computational approaches have been proposed to analyze ST data. Many of these approaches rely on graph-based models, mainly graph neural networks (GNNs), to capture the relationships between neighboring spots and to learn meaningful representations for downstream tasks like spatial domain identification. While these models achieve state-of-the-art performance, they all rely on a predefined and uniformly-constructed graph, usually built using spatial proximity between spots. This approach assumes that spatially adjacent spots are functionally similar, an assumption that does not always hold. To address these limitations, we propose SpaTIM, a novel computational approach for spatial transcriptomics analysis based on GNNs that incorporates additional morphological context to improve graph construction and representation learning. Using morphological information to refine the graph, we ensure that connected spots have similar morphological features, allowing the model to dynamically adjust graph connectivity beyond simple spatial proximity. This allows the model to filter out noisy connections and enhance biologically meaningful relationships, potentially improving the accuracy of spatial domain identification and other downstream tasks.
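The graph-refinement idea can be sketched in a few lines: start from a graph built on spatial proximity, then keep only the edges whose endpoints look alike morphologically. This is not SpaTIM's actual procedure; the radius rule, the cosine-similarity threshold, the function names and the toy data are assumptions chosen to keep the example small.

```python
from math import dist, sqrt


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def spatial_edges(coords, radius):
    """Connect every pair of spots whose centres lie within `radius`."""
    n = len(coords)
    return {
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if dist(coords[i], coords[j]) <= radius
    }


def refine_edges(edges, morph, min_similarity=0.8):
    """Drop spatial edges whose endpoints have dissimilar morphology."""
    return {(i, j) for (i, j) in edges if cosine(morph[i], morph[j]) >= min_similarity}


# Hypothetical data: four spots on a line, with 2-D morphological descriptors.
coords = [(0, 0), (1, 0), (2, 0), (3, 0)]
morph = [(1.0, 0.0), (0.9, 0.1), (0.1, 0.9), (0.0, 1.0)]

edges = spatial_edges(coords, radius=1.0)
refined = refine_edges(edges, morph)

print(sorted(edges))    # [(0, 1), (1, 2), (2, 3)]
print(sorted(refined))  # [(0, 1), (2, 3)]
```

Spots 1 and 2 are adjacent in space but differ in morphology, so the edge between them is removed while the other two survive.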
Diana Martinez Minguet – Visiting PhD Student from Universitat Politècnica de València
Polygenic Risk Scores (PRSs) estimate the genetic risk for complex diseases, based on the combined impact of many genetic variants. The particular set of genetic variants and their effect sizes associated with a specific disease is determined by a PRS model. These models are derived from genome-wide association studies (GWAS) using diverse statistical methods to adjust variant weights so that they can be aggregated into a single measure. Currently, there are no best practices or standards for constructing and reporting PRS models, resulting in substantial variability across models, even for the same disease.
This heterogeneity poses a significant challenge for clinical translation, where a single PRS model must often be selected from among many alternatives. Differences in domain terminology and the need to balance multiple, heterogeneous, and often conflicting evaluation criteria complicate direct comparison and prioritization of PRS models, making model selection a demanding and time-consuming task. In this seminar we discuss how Conceptual Modeling, Multi-Criteria Decision Analysis and LLM-based data extraction techniques can allow for an adequate prioritization of PRS Models, aiming to streamline the PRS Model selection process.
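For readers unfamiliar with how a PRS model is applied, a minimal sketch follows: the score is the sum over variants of effect size times the number of effect alleles carried. The two models, the variant identifiers and the genotype are hypothetical, and the coverage figure returned alongside the score is an addition made for this example, not part of any standard.

```python
def polygenic_score(weights, dosages):
    """Weighted sum of effect-allele dosages.

    weights: variant id -> effect size defined by the PRS model
    dosages: variant id -> number of effect alleles carried (0, 1 or 2)
    Variants absent from the genotype are skipped; the second return value
    is the fraction of model variants that were actually available.
    """
    used = [v for v in weights if v in dosages]
    score = sum(weights[v] * dosages[v] for v in used)
    return score, len(used) / len(weights)


# Two hypothetical PRS models for the same disease, with different variant sets.
model_a = {"varA": 0.20, "varB": -0.10, "varC": 0.05}
model_b = {"varA": 0.35, "varD": 0.15}

# Hypothetical genotype of one individual.
genotype = {"varA": 2, "varB": 1, "varC": 0}

score_a, coverage_a = polygenic_score(model_a, genotype)
score_b, coverage_b = polygenic_score(model_b, genotype)

print(score_a, coverage_a)  # about 0.3, with all model variants available
print(score_b, coverage_b)  # about 0.7, with half of the model variants available
```

The same individual receives different scores on different scales and with different coverage, which is one reason raw scores from two models cannot be compared directly.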
Carlo Cipriani – PhD student DSBlab @TIGET (D. Cesana/M. Masseroli)
In this study, we present RAAVioli, a computational method for the comprehensive analysis of AAV integration sites and vector rearrangements across both long- and short-read sequencing platforms. Through in silico benchmarking and in vivo validation in a xenogeneic human hepatocyte model transduced with AAV, we demonstrate the robustness and versatility of RAAVioli in accurately identifying AAV–genome junctions and reconstructing complex vector rearrangements across diverse experimental workflows. RAAVioli can be applied across different contexts, ranging from gene addition to gene editing. It has already been used to characterize vector integrations in a mouse model of Wilson disease and to study integrations detected in cell-free DNA from patients treated with AAV vectors and from non-human primates across multiple gene-therapy settings, enabling non-invasive safety and efficacy monitoring.
Luca Zanotto – CSE thesis @MIND Lab (S. Tomè/S. Mazzetti/M. Masseroli)
Parkinson’s Disease (PD) and Multiple System Atrophy (MSA) are neurodegenerative α-synucleinopathies that share early symptomatology but require distinct treatments. Recent evidence highlights that cutaneous sweat-gland synaptic innervation, assessed via skin biopsies, may offer a cost-effective and minimally invasive biomarker for differentiation. This study develops a deep learning pipeline analyzing confocal microscopy images of these glands to distinguish PD from MSA. Specifically, images were used to train several deep learning architectures, evaluated across different training strategies. Additionally, explainability methods were applied to better understand the decision-making process of the models. Results demonstrate the efficacy of these architectures, achieving promising performance, especially in the clinically challenging task of differentiating the MSA-parkinsonian subtype (MSA-P) from PD.
Sara Resta – CSE thesis @HSR (M. Picchio/C. Bezzi/M. Masseroli)
Takayasu arteritis is a rare chronic inflammatory disease that mainly affects the aorta and its major branches. Although [18F]FDG PET is the imaging technique with the highest sensitivity for detecting vascular lesions, its role in monitoring disease status remains debated. The highly heterogeneous and sparse distribution of lesions across multiple vascular sites poses challenges in delineating volumes of interest and limits the application of quantitative analysis techniques commonly used in oncology, such as SUV (Standardized Uptake Value) metrics and radiomics. This study investigates the use of Convolutional Neural Networks (CNNs) applied directly to [18F]FDG PET scans to predict patient relapse within 12 months after imaging. A new liver-based standardization approach was optimized to minimize biases during training. As it is not known how information about the likelihood of a patient experiencing a flare is encoded in the scans, CNNs were first trained to classify scans according to the presence of pathological uptake in the arteries. Then, these models were leveraged for flare prediction through transfer learning. The EfficientNetB0 model showed promise in predicting complete remission, potentially allowing physicians to identify patients who can avoid aggressive therapies and stringent follow-ups.
Alessandro Cacciatore – CSE thesis @MIND Lab (E. Salvi/L. Maldera)
Amyotrophic lateral sclerosis (ALS) is a neurodegenerative disorder that affects motor neurons, leading to a progressive loss of voluntary muscle control. Depending on the symptoms at onset, ALS can be classified as either spinal (affecting the limbs) or bulbar (affecting speech and swallowing). These two subtypes exhibit distinct histopathological, anatomical, and prognostic features, but their underlying biological differences remain poorly characterized, limiting precise diagnosis and treatment. Epigenetic alterations are increasingly recognized as key modulators of disease mechanisms in neurodegeneration. In particular, non-coding RNAs (ncRNAs), such as microRNAs (miRNAs) and long non-coding RNAs (lncRNAs), play essential regulatory roles in neuronal function, stress response, and inflammation, and their dysregulation has been associated with ALS pathogenesis. Investigating these molecules may therefore uncover epigenetic signatures that support patient stratification within a precision medicine framework. The aim of this thesis project is to develop a reproducible computational framework for the identification and validation of RNA-based biomarkers. The study initially focuses on discovering differentially expressed ncRNAs, and subsequently explores their potential to derive discriminative signatures capable of distinguishing between the two ALS subtypes. Samples were obtained from blood serum of 40 ALS patients (30 with bulbar onset and 10 with spinal onset) and 10 healthy controls. The proposed workflow for the analysis of RNA biomarkers integrates various steps: data pre-processing and normalization, missing value imputation, ensemble feature selection, feature orthogonalization, model optimization and validation with class imbalance correction, and biological validation and interpretation of findings. 
In ensemble feature selection, the outputs of four independent algorithms (Random Forest, Recursive Feature Elimination, LASSO, and K-Best) were combined to construct a robust and stable ranking of features, which was then filtered by an orthogonalization step to keep only those features that provide non-redundant information. Multi-Omics Factor Analysis (MOFA) was used to perform a comparison with a common integration technique in this field. Although the MOFA-derived factors clearly differentiated ALS patients from controls, they mainly reflected the global disease-related variance rather than the molecular distinctions specific to subtypes, corroborating the proposed approach as more suitable in this case. In fact, the Partial Least Squares (PLS) model developed in this study, comprising five components, indicated that the second component had a limited yet noticeable ability to differentiate between bulbar and spinal patients. In conclusion, the results confirm that the proposed computational workflow is a trustworthy and biologically interpretable tool for the discovery of RNA biomarkers in ALS, combining statistical robustness and biological relevance. This framework offers a firm ground for subsequent developments, such as experimental confirmation and the integration of further omics layers for a deeper comprehension of the molecular heterogeneity of ALS.
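The combination of an ensemble ranking with a redundancy filter can be sketched as follows. The abstract does not say how the four rankings are merged or how the orthogonalization step works, so this example assumes a mean-rank aggregation and a greedy Pearson-correlation filter as simple stand-ins; the feature names, rankings and expression values are hypothetical.

```python
from math import sqrt


def mean_rank(rankings):
    """Aggregate best-first rankings by average position (lower is better)."""
    features = rankings[0]
    avg = {f: sum(r.index(f) for r in rankings) / len(rankings) for f in features}
    return sorted(features, key=lambda f: (avg[f], f))


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def drop_redundant(ranked, data, threshold=0.9):
    """Walk down the ranking, keeping a feature only if it is not strongly
    correlated with any feature already kept."""
    kept = []
    for f in ranked:
        if all(abs(pearson(data[f], data[k])) < threshold for k in kept):
            kept.append(f)
    return kept


# Hypothetical best-first rankings from four selectors.
rankings = [
    ["miR_a", "miR_b", "lnc_c", "miR_d"],  # e.g. Random Forest importance
    ["miR_b", "miR_a", "lnc_c", "miR_d"],  # e.g. Recursive Feature Elimination
    ["miR_a", "lnc_c", "miR_b", "miR_d"],  # e.g. LASSO coefficients
    ["miR_a", "miR_b", "miR_d", "lnc_c"],  # e.g. K-Best scores
]

# Hypothetical expression values for five samples.
data = {
    "miR_a": [1, 2, 3, 4, 5],
    "miR_b": [2, 4, 6, 8, 10.5],  # nearly a multiple of miR_a, so redundant
    "lnc_c": [5, 1, 4, 2, 3],
    "miR_d": [1, 0, 0, 1, 0],
}

ranked = mean_rank(rankings)
selected = drop_redundant(ranked, data)

print(ranked)    # ['miR_a', 'miR_b', 'lnc_c', 'miR_d']
print(selected)  # ['miR_a', 'lnc_c', 'miR_d']
```

Here `miR_b` ranks second overall but is removed because it carries almost the same information as `miR_a`, which is the behaviour the orthogonalization step is meant to provide.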
Simone Tomè – PhD student DSBlab @MIND lab (E. Salvi/M. Masseroli)
Complex traits are phenotypes not attributable to a single genetic variant, but to multiple variants, often interacting with environmental factors. Chronic pain exemplifies such a trait, with substantial genetic contributions that remain incompletely understood. Genome-wide association studies (GWAS) aim to uncover the genetic basis of complex traits by testing variants across the genome for associations with the phenotype, typically using statistical learning models. In this study, we present the results of a GWAS conducted on a cohort of chronic pain patients from IRCCS Istituto Neurologico “Carlo Besta”, subsequently validated in an independent cohort from the University of Maastricht. Within this GWAS context, we explore applications of machine learning strategies, with a particular focus on disease gene prediction, an emerging approach that leverages existing knowledge to identify potential disease-associated genes.
Francesca Conti – BCG thesis @MIND Lab (D. Ghezzi/A. Legati/M. Masseroli)
Primary Mitochondrial Myopathies (PMM) are a group of mitochondrial disorders characterized by exclusive or predominant skeletal muscle involvement and caused by pathogenic variants in either mitochondrial or nuclear genes. These disorders exemplify the remarkable clinical and genetic heterogeneity typical of mitochondrial diseases, where identical mutations may give rise to distinct phenotypes and similar clinical presentations may originate from different genetic defects, making diagnosis particularly challenging. Transcriptome sequencing (RNA-seq) has recently emerged as a powerful complementary tool to genomic approaches, enabling the detection of transcriptional abnormalities that directly reflect the functional consequences of genetic variation.
In this study, RNA-seq was applied to skeletal muscle biopsies from patients affected by PMM to identify aberrant splicing and gene expression events using the FRASER and OUTRIDER algorithms, respectively. This approach aimed to improve the diagnostic yield and provide new insights into the molecular mechanisms underlying unresolved cases. In parallel, comprehensive computational analyses were developed to explore mitochondrial transcriptomic alterations in patients carrying multiple mtDNA deletions, with a focus on the potential formation of chimeric transcripts and their translation into aberrant proteins. Finally, on this same cohort of patients, clustering analyses based on mitochondrial gene expression and deletion profiles were performed to identify patient subgroups characterized by specific molecular signatures associated with distinct causative genes and histological features.
