Helmholtz AI consultants @ Helmholtz Munich

Marie Piraud

Team leader

 

Health-focused AI consultants

The Helmholtz AI central unit is also the local unit for Health, and includes a team of health-focused consultants. They are key actors in achieving the Helmholtz AI goal of empowering scientists to use AI in their research. For that, they advise and support research teams in using machine learning and deep learning. The consultants master a broad range of methods and tools, and offer help at all stages of the data analysis pipeline, from project conceptualisation to actual implementation. They provide reusable code and technical reports, and strive to enable their scientific collaborators to leverage the methods themselves, by proposing pair programming and code review sessions for example. They also play a key role in disseminating knowledge, by contributing open-source software to the community and proposing trainings adapted to the needs of the Health research community.

Questions or ideas? Feel free to reach out to us!

consultant-helmholtz.ainoSp@m@helmholtz-munich.de

GITHUB Helmholtz AI consultants @ Helmholtz Munich

Team members

Selected ongoing voucher projects

Automatic scoring of vitiligo in dermatological 3D full body scans using swarm learning

  • Challenge: Vitiligo is a skin disease that can be diagnosed and monitored through patient history and visual inspection. However, visual inspection is limited when it comes to monitoring treatment response and quantifying disease progression. Clinical images are routinely taken, but can sometimes be difficult to compare because they are acquired under different technical conditions. A relatively new tool in dermatology is the 3D full-body scanner, which can image almost the whole skin surface in a highly standardized way and with high quality. Dealing with these full-body scans, however, presents other challenges, including data protection, since anonymization is not feasible.
  • Approach: We are building an automated approach for real-world hospital settings where data cannot be shared between hospitals due to patient confidentiality and privacy. Using full-body scans from the Department of Dermatology and Allergy at the Klinikum Rechts der Isar in Munich, we will first develop an automated method that can detect and segment vitiligo using deep learning. We will then use the Swarm Learning framework developed by our collaborators at DZNE for decentralized training of the final model, to enable real-world collaboration among hospitals. As a first use case, we will then train the algorithm with full-body scans from the Dermatology Department of the University Hospital Erlangen.
  • Consultants: Gerome Vivar, Florian Kofler
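
The decentralized idea can be illustrated with a deliberately simplified sketch: each site trains on its own data and only model parameters are exchanged and merged. This is plain weighted parameter averaging on a toy linear model, not the Swarm Learning framework or its API, and all function names and data below are hypothetical.

```python
# Illustrative only: weighted parameter averaging between sites.
# This is NOT the Swarm Learning API; names and data are hypothetical.

def local_update(weights, data, lr=0.1, epochs=20):
    """Run gradient-descent steps for y ~ w0 + w1 * x on one site's private data."""
    w0, w1 = weights
    n = len(data)
    for _ in range(epochs):
        g0 = sum((w0 + w1 * x - y) for x, y in data) / n
        g1 = sum((w0 + w1 * x - y) * x for x, y in data) / n
        w0 -= lr * g0
        w1 -= lr * g1
    return (w0, w1)


def merge(site_weights, site_sizes):
    """Average the sites' parameters, weighted by their number of local samples."""
    total = sum(site_sizes)
    return tuple(
        sum(w[i] * n for w, n in zip(site_weights, site_sizes)) / total
        for i in range(len(site_weights[0]))
    )


def train_decentralized(sites, rounds=200):
    """Alternate local training and parameter merging; raw data never leaves a site."""
    weights = (0.0, 0.0)
    for _ in range(rounds):
        updates = [local_update(weights, data) for data in sites]
        weights = merge(updates, [len(data) for data in sites])
    return weights


# Two hypothetical hospitals observing different ranges of the same relation y = 1 + 2x.
hospital_a = [(i / 10, 1 + 2 * (i / 10)) for i in range(0, 10)]
hospital_b = [(i / 10, 1 + 2 * (i / 10)) for i in range(10, 20)]
w0, w1 = train_decentralized([hospital_a, hospital_b])
print(f"intercept={w0:.3f}, slope={w1:.3f}")
```

Only the parameter tuples cross site boundaries in this sketch; in a real deployment the model would be a segmentation network and the merging would be handled by the framework.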

 

Contraction analysis of neuromuscular organoids

  • Challenge: Neuromuscular organoids capture key morphological and functional features of the in vivo tissue at an unprecedented level, including contraction. Analysing the contractile activity of neuromuscular organoids will help us unravel the mechanisms of interaction between these tissues during development and disease. Establishing such an assay will also allow us to use neuromuscular organoids as tools for screening and developing effective drugs and treatments for neuromuscular disorders and diseases, such as Spinal Muscular Atrophy (SMA). To this end, we aim to develop a tool for analysing the contraction of organoids captured in video recordings. To gain valuable insights from these recordings, the tool has to handle challenges such as removing organoid drift from the extracted signal. Once the signals are extracted, our analysis concentrates on feature extraction, as well as univariate and bivariate assessments of these features, to understand the behaviour of organoids under varying conditions.
  • Approach: We developed a stable pipeline for extracting and analysing organoid contraction signals from video data, split into two parts. The first part extracts the signal from the video; in the second, the signal is processed and features are extracted to characterise the type of contraction. Moreover, we offer an interactive tool that lets experts focus on the insights gained from the data, while the time series extraction and analysis, together with the technical details, are executed automatically. As a result, the functional profile of neuromuscular organoids is generated, facilitating a better understanding of the organoid properties under different biological scenarios.
  • Consultants: Donatella Cea, Christina Bukas
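
To make the two-part structure more concrete, here is a minimal sketch under strong assumptions: the signal is taken as the mean absolute pixel change between consecutive frames, drift is removed by subtracting a moving-average baseline, and contractions are counted as peaks above a threshold. These choices, the function names and the synthetic data are illustrative and are not the pipeline's actual methods.

```python
def frame_difference_signal(frames):
    """Part 1 (sketch): mean absolute pixel change between consecutive frames."""
    signal = []
    for prev, curr in zip(frames, frames[1:]):
        diffs = [
            abs(a - b)
            for row_prev, row_curr in zip(prev, curr)
            for a, b in zip(row_prev, row_curr)
        ]
        signal.append(sum(diffs) / len(diffs))
    return signal


def moving_average(signal, window):
    """Centered moving average; the window shrinks near the edges."""
    half = window // 2
    out = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out


def remove_drift(signal, window):
    """Part 2a (sketch): subtract a slowly varying baseline from the signal."""
    baseline = moving_average(signal, window)
    return [s - b for s, b in zip(signal, baseline)]


def count_peaks(signal, threshold):
    """Part 2b (sketch): count local maxima that exceed the threshold."""
    return sum(
        1
        for i in range(1, len(signal) - 1)
        if signal[i] > threshold
        and signal[i] > signal[i - 1]
        and signal[i] >= signal[i + 1]
    )


# Tiny 2x2 "video": the image changes once, then stays constant.
frames = [[[0, 0], [0, 0]], [[1, 1], [1, 1]], [[1, 1], [1, 1]]]
motion = frame_difference_signal(frames)

# Synthetic trace: linear drift plus one contraction pulse every 20 samples.
raw = [0.05 * i + (1.0 if i % 20 == 10 else 0.0) for i in range(100)]
detrended = remove_drift(raw, window=21)
n_contractions = count_peaks(detrended, threshold=0.5)
print(motion, n_contractions)
```

Features such as peak count, amplitude or inter-peak interval could then feed the univariate and bivariate assessments described above.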

 

Embedding Ethics in scRNA-seq-based biomarker creation

  • Challenge: Recent advances in scRNA-seq models make it possible to investigate the transcriptomic profile of individual patients at single-cell resolution. They are thus promising leads to new forms of, for example, biomarker generation. However, scRNA-seq data used for such analyses is prone to biases, and when it is used for biomarker development or other application areas, the resulting insights will be biased too.
  • Approach: We apply the embedded ethics approach to an scRNA-seq model project by our partners at Helmholtz Munich. By accompanying their research, we gain an understanding of the potential ethical issues of scRNA-seq in this nascent field. Together with our partner, we work on creating metrics to control for and mitigate the issues we find.
  • Consultants: Theresa Willem

 

Quantification and size measurement of lung organoids

  • Challenge: Manual quantification of the lung organoids is considered a huge bottleneck for higher throughput analysis. Automatising the process of lung organoid quantification can significantly reduce the time required for expert annotators to manually quantify the organoids. The resulting approach should be able to count the number of organoids on a plate and extract properties regarding their size and shape. Such systems can be further used for perturbation studies as well as drug screening and testing.
  • Approach: We developed a robust deep learning pipeline to detect lung organoids by implementing the Faster R-CNN object detection model, along with a plugin for the image visualisation and analysis tool napari, which allows users to easily run the algorithm, validate results and extract useful features. Our model was trained on a dataset of more than 40,000 organoids and outperforms our original approach, which was based on traditional image processing techniques. The plugin was delivered to the collaborators along with a user manual, allowing for an easy introduction to the new software.
  • Consultants: Harshavardhan Subramanian, Christina Bukas, Florian Kofler

 

Unleashing the Power of Machine Learning for Network Analysis of Bulk and Single-Cell Data

  • Challenge: Gene coexpression networks and the gene modules derived from them are a popular way to represent gene expression data. They make it possible to compare and integrate data sets from different platforms and labs, and enable downstream functional analyses at various levels of cellular processes to understand molecular disease characteristics. However, a correlation-based approach to coexpression analysis can fail, e.g. due to covariate effects or due to dropouts in single-cell measurements.
  • Approach: Applying machine learning, we obtain robust models of gene expression which explicitly address these problems. Furthermore, we exploit these models to automate the coexpression analysis as much as possible, replacing manual cut-off selections with data-driven decisions.
  • Consultants: Elisabeth Georgii

Selected completed voucher projects

CRISPRi guide efficiency prediction in bacteria

  • Challenge: CRISPR interference (CRISPRi), the targeting of a catalytically dead Cas protein to block transcription, is the leading technique to silence gene expression in bacteria. Genome-scale CRISPRi essentiality screens provide one data source from which rules for guide design can be extracted. However, depletion confounds guide efficiency with effects from the targeted gene.
  • Approach: Together with our collaborators from the Helmholtz Centre for Infection Research, we showed that depletion can be predicted using machine learning models and a combination of guide and gene features, with expression of the target gene having an outsized influence. Further, integrating data across independent CRISPRi screens improves performance. We developed a mixed-effect random forest regression model that learns from multiple datasets and isolates effects manipulable in guide design, and applied methods from explainable AI to infer interpretable design rules. Our method outperforms the state of the art in predicting depletion in an independent saturating screen targeting purine biosynthesis genes in Escherichia coli. Our approach provides a blueprint for the development of predictive models for CRISPR technologies in bacteria.
  • Consultants: Lisa Barros de Andrade e Sousa, Erinc Merdivan

 

Automatic Cell Counting in cell migration experiments

  • Challenge: Cell migration is central to many physiological and pathological processes such as embryonic development, wound repair, and tumor metastasis. The Boyden chamber assay is the most widely accepted cell migration technique for the characterization of cell motility. Cell motility is quantified by counting the cells in microscopic images. Such images normally contain many cells, so manual counting is time-consuming, laborious and error-prone.
  • Approach: We provide an automatic cell counting algorithm that counts crystal violet-stained cells in 2D microscopic images. In addition, we implemented a graphical user interface for manual correction of the automatic results. Our solution speeds up the analysis in cell migration experiments by a factor of 10.
  • Consultants: Ruolin Shen
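
As a rough illustration of what automatic counting in a 2D image can look like, the sketch below thresholds a grayscale image and counts connected regions. It assumes that stained cells appear darker than the background and that cells do not touch; the function name, threshold and toy image are hypothetical, and this is not the algorithm delivered in the project.

```python
from collections import deque


def count_cells(image, threshold, min_area=1):
    """Count 4-connected regions of pixels darker than `threshold`.

    `image` is a 2D list of grayscale values; regions smaller than
    `min_area` pixels are treated as noise and ignored.
    """
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or image[r][c] >= threshold:
                continue
            # Breadth-first flood fill over one dark region.
            area = 0
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                area += 1
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (
                        0 <= ny < rows
                        and 0 <= nx < cols
                        and not seen[ny][nx]
                        and image[ny][nx] < threshold
                    ):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if area >= min_area:
                count += 1
    return count


# Toy image: 255 is background, lower values are stained pixels.
image = [
    [255, 255, 255, 255, 255, 255],
    [255,  40,  50, 255, 255, 255],
    [255,  45, 255, 255,  60, 255],
    [255, 255, 255, 255,  55, 255],
    [255,  30, 255, 255, 255, 255],
]
print(count_cells(image, threshold=128))
print(count_cells(image, threshold=128, min_area=2))
```

Touching or overlapping cells would be merged into one region by this approach, which is one reason a manual correction step remains useful.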

 

Systematic evaluation of cell-type deconvolution pipelines

  • Challenge: DNA methylation analysis by sequencing is becoming increasingly popular, yielding methylomes at single-base pair and single-molecule resolution. It has tremendous potential for cell-type heterogeneity analysis using intrinsic read-level information. Although diverse deconvolution methods were developed to infer cell-type composition based on bulk sequencing-based methylomes, systematic evaluation has not been performed yet.
  • Approach: Together with our collaboration partners from the German Cancer Research Center, we thoroughly benchmarked six previously published methods: Bayesian epiallele detection, DXM, PRISM, csmFinder+coMethy, ClubCpG and MethylPurify, together with two array-based methods, MeDeCom and Houseman, as a comparison group. We found that array-based methods, both reference-based and reference-free, generally outperformed sequencing-based methods, despite the absence of read-level information. This implies that current sequencing-based methods still struggle with correctly identifying cell-type-specific signals and eliminating confounding methylation patterns, which needs to be addressed in future studies.
  • Consultants: Lisa Barros de Andrade e Sousa, Dominik Thalmeier

 

End-to-end multi-class semantic segmentation network for Electron Microscopy data

  • Challenge: While genetically encoded reporters are common for fluorescence microscopy, equivalent multiplexable gene reporters for electron microscopy (EM) are still scarce. By installing a variable number of fixation-stable metal-interacting moieties in the lumen of encapsulin nanocompartments of different sizes, a suite of spherically symmetric and concentric barcodes (EMcapsulins) was developed that are readable by standard EM techniques. After imaging six different classes of such EMcapsulins in Drosophila and mouse brains using 2D EM, the collaborator contacted us to segment and quantify them, because their existing two-stage segmentation+classification pipeline failed to correctly segment and classify many EMcapsulins.
  • Approach: We implemented an end-to-end hierarchical multi-class segmentation U-Net to segment and classify EMcapsulins within a single step. The U-Net significantly outperformed the existing two-step segmentation and classification pipeline, possibly due to its ability to also encode contextual information. Further, instance-wise segmentation quality metrics were implemented. This enabled segmentation and quantification of the EMcapsulins, so the collaborator could qualitatively and quantitatively demonstrate their usefulness.
  • Consultants: Florian Kofler

 

Quantum chemically refined database of experimental protein-ligand complexes

  • Challenge: Databases of protein-ligand complex structures play an essential role in machine learning research supporting drug discovery. However, available ligand structures are not quantum chemically refined, resulting in ligands with inaccurate 3D structures, protonation, and charges. The new database includes ligands with accurate 3D structures, increasing the data quality for 3D deep learning models in drug discovery.
  • Approach: We integrated three quantum chemically refined databases of protein ligands and their physicochemical properties generated by our partners at Helmholtz Munich. To promote the enhanced database among the AI community, we created benchmarks using graph neural networks: graph-level predictions for ligand properties and node-level predictions for protein adaptability (how much the protein changes its shape when the ligand is bound).
  • Consultants: Erinc Merdivan

 

Subgroup identification in spinocerebellar ataxias

  • Challenge: Spinocerebellar ataxias (SCAs) are rare, autosomal dominantly inherited neurological diseases with onset in adult age. The most common SCAs, SCA1, 2, 3 and 6, together account for more than half of all affected families worldwide. Clinical hallmarks are progressive loss of balance and coordination, accompanied by slurred speech. Patients affected by SCA suffer substantial restrictions of mobility and communicative skills. Predicting the disease progression from genetic features, demographic information and the current status of neurological symptoms paves the way toward potential stratification markers and is important for anticipating optimal windows regarding the start of preventive treatments.
  • Approach: Using a multi-cohort data set with clinical time courses of different established neurological scales, comprising a total of 39 single items, we trained predictive models by regularized Cox regression and survival forests. For each of the most common SCAs, we extracted relevant features and characterized its progression with respect to the multitude of neurological symptoms. The loss of the ability of free walking is a transition of high clinical impact and was analyzed in detail to support future monitoring and decision making.
  • Consultants: Elisabeth Georgii

Software and resources

 

Oligo Designer Toolsuite

Oligonucleotides (abbrev. oligos) are short, synthetic strands of DNA or RNA that have many application areas, ranging from research to disease diagnosis or therapeutics, and need to be designed individually based on the intended application and experimental design. We developed the Oligo Designer Toolsuite, which is a collection of modules that provide all basic functionalities for custom oligo design pipelines within a flexible Python framework. All modules have a standardized I/O format and can be combined individually depending on the required processing steps, like the generation of custom-length oligo sequences, the filtering of oligo sequences based on thermodynamic properties as well as the selection of an optimal set of oligos.

  • This tool is available on GitHub here.
  • The implemented oligo design pipelines for SCRINSHOT, MERFISH and SeqFISH probes were published alongside the following paper: Kuemmerle, L. B., Luecken, M. D., Firsova, A. B., Barros de Andrade e Sousa, L., Straßer, L., Heumos, L., Mekki, I., …, Piraud, M., ... & Theis, F. J. (2022). Probe set selection for targeted spatial transcriptomics. bioRxiv, 2022-08.
  • Consultants: Lisa Barros de Andrade e Sousa, Isra Mekki, Francesco Campi
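
The processing steps named above (generating custom-length oligo sequences, then filtering on thermodynamic properties) can be sketched in a few lines. This is a standalone illustration that does not use the Oligo Designer Toolsuite API: the function names, the example sequence and the thresholds are hypothetical, and the melting temperature uses the simple Wallace rule, which is only a rough estimate for short oligos.

```python
def sliding_oligos(sequence, length):
    """Generate all candidate oligos of the given length from a target sequence."""
    return [sequence[i:i + length] for i in range(len(sequence) - length + 1)]


def gc_content(oligo):
    """Fraction of G and C bases in the oligo."""
    return (oligo.count("G") + oligo.count("C")) / len(oligo)


def wallace_tm(oligo):
    """Wallace rule: 2 degrees C per A/T plus 4 degrees C per G/C (rough estimate)."""
    at = oligo.count("A") + oligo.count("T")
    gc = oligo.count("G") + oligo.count("C")
    return 2 * at + 4 * gc


def filter_oligos(oligos, gc_range=(0.4, 0.6), tm_range=(56, 64)):
    """Keep oligos whose GC content and estimated Tm fall inside the given ranges."""
    return [
        oligo
        for oligo in oligos
        if gc_range[0] <= gc_content(oligo) <= gc_range[1]
        and tm_range[0] <= wallace_tm(oligo) <= tm_range[1]
    ]


# Hypothetical target sequence, for illustration only.
target = "ATGCGTACGTTAGCCGATCGATCGGCTA"
candidates = sliding_oligos(target, length=20)
kept = filter_oligos(candidates)
print(len(candidates), len(kept))
```

A real pipeline would add further filters (for example specificity against a reference) and a final step that selects an optimal set from the surviving candidates.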

 

Data Centric Platform

High-quality labeled datasets are critical for advances in biology and medicine. However, the labelling process is often intensive and time-consuming, so our collaborators often have very little annotated data. By developing a data-centric platform for microscopy images, we try to minimise the time experts need to fully annotate the data and to free ourselves from the train-correct-train loop. We do this by first applying baseline models (such as CellPose), which can already give meaningful output. Then, we fine-tune them by allowing the user to check, confirm or fix the model-produced annotations. The interface for this is kept as simple as possible, is tailored to each collaborator, and requires no prior knowledge of machine or deep learning methods.

 

Forest-Guided Clustering

Standard explainability methods for Random Forest (RF) models, like permutation feature importance, are commonly used to pinpoint the individual contribution of features to the model performance but often miss the role of correlated features or feature interactions in the model’s decision-making process. The Forest-Guided Clustering algorithm computes feature importance based on subgroups of instances that follow similar decision paths within the RF model, thus focusing on pattern-driven rather than performance-driven importance. By doing so, our method avoids the misleading interpretation of correlated features, allows the detection of feature interactions and gives a sense for the generalizability of identified patterns.
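
One common way to quantify how similar the decision paths of two instances are is the Random Forest proximity: the fraction of trees in which both instances end up in the same leaf. The sketch below computes such a proximity matrix and the corresponding distance matrix from a table of leaf indices. It is a minimal illustration with hypothetical names and toy data, not the Forest-Guided Clustering implementation; with scikit-learn, a leaf-index table of this shape can be obtained from a fitted forest's `apply` method.

```python
def proximity_matrix(leaf_ids):
    """Fraction of trees in which two instances share a leaf.

    `leaf_ids[i][t]` is the index of the leaf that instance i reaches in tree t.
    """
    n_trees = len(leaf_ids[0])
    return [
        [sum(a == b for a, b in zip(row_i, row_j)) / n_trees for row_j in leaf_ids]
        for row_i in leaf_ids
    ]


def distance_matrix(proximity):
    """Turn proximities into distances usable by a clustering algorithm."""
    return [[1.0 - p for p in row] for row in proximity]


# Toy table: 4 instances, 3 trees (hypothetical leaf indices).
leaf_ids = [
    [1, 4, 2],
    [1, 4, 2],
    [3, 4, 5],
    [3, 6, 5],
]
prox = proximity_matrix(leaf_ids)
dist = distance_matrix(prox)
for row in prox:
    print([round(p, 2) for p in row])
```

Clustering the instances on such a distance matrix yields subgroups that the forest treats alike, and feature importance can then be assessed per subgroup rather than by the drop in model performance.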

 

Quicksetup-ai

Oftentimes, in machine learning (ML) projects, researchers tend to focus on the ML code. As the project grows, many challenges arise, such as difficulties in communication, collaboration, and tracking of the experiments. This can lead to lack of reproducibility of the results and difficulties in deploying the pipelines. Through Quicksetup-ai, we propose a flexible template as a quick setup for deep learning projects in research. The template combines established and widely used tools and libraries to provide a clean, simple and reusable baseline with a wide range of features.

 

PySDDR

PySDDR combines the interpretability of a statistical model with the predictive power of deep neural networks in an easy-to-use Python package. It is the Python implementation of the Semi-Structured Deep Distributional Regression (SDDR) framework, which enhances Generalized Additive Models (GAMs) with neural networks. This extends the use of GAMs to modelling high-dimensional nonlinear patterns in the data and, furthermore, to multimodal data (e.g. a combination of tabular and image data). The framework is written in PyTorch and accepts any number of neural networks, of any type (FC, CNN, LSTM, ...).

Teaching projects

Introduction to eXplainable Artificial Intelligence

In this course, we discuss the reasons why explainability is important and we introduce several model-agnostic and model-specific methods for tabular data, images and 1D data like text or signals. This course was held at the following events: Summer Academy 2022, ml4earth 2022, MALTAomics Summer School 2023, Summer Academy 2023, Forschungszentrum Jülich Training 2023

  • Course website here.
  • Consultants: Donatella Cea, Lisa Barros de Andrade e Sousa, Helena Pelin, Christina Bukas, Harshavardhan Subramanian

 

ChatGPT in Action – Enhancing your Workflow

In this workshop, we share tips on prompt engineering to obtain better outputs from ChatGPT and raise awareness of potential pitfalls and biases. We split the workshop into a writing part, where we focus on how to use this tool for writing emails and papers and for brainstorming ideas, and a coding part, where we use it to produce new code and to review, document and test existing code (in Python). This course was held at the following events: Helmholtz Munich Career Development

  • Consultants: Donatella Cea, Erinc Merdivan

 

BraTS inpainting Challenge

In this challenge, we encourage participants to experiment with algorithmic solutions to synthesize 3D healthy brain tissue in the area affected by glioma. We cast the problem of synthesizing healthy tissue as inpainting within the tumor area. This Challenge was held at the following events: MICCAI 2023

Featured Publications