
Workshops and tutorials

Monday, 16 September - morning session
Advancing Equality in Biomedicine and Healthcare: Tackling Sex and Gender Bias in Artificial Intelligence


In the rapidly advancing field of artificial intelligence (AI), the prevalence of sex and gender biases poses significant ethical and operational challenges. This workshop is designed to delve into the multifaceted issue of sex and gender bias in AI, examining its origins, implications, and mitigation strategies. It targets both AI experts and novices from related fields such as machine learning and data science, offering an interdisciplinary platform for discussion.

The event will start with an overview of AI technologies, highlighting case studies that demonstrate sex and gender biases in decision-making processes. Participants will explore how biases in data and algorithms can mirror or intensify societal inequalities. A key feature of the workshop will be a demo session that introduces methods for detecting and mitigating bias at various stages—training data, model, and outcomes. This includes feature attribution methods, fairness metrics, and bias detection algorithms. The session aims to equip participants with skills essential for developing fair AI systems and will encourage interactive discussions to facilitate collaboration and idea exchange.
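Fairness metrics of the kind mentioned above can be simple to compute. As a hypothetical sketch (toy data, not taken from the workshop materials), demographic parity difference compares positive-prediction rates between groups:

```python
# Illustrative only: demographic parity difference, one common fairness
# metric a bias audit might compute. The predictions and group labels
# below are made-up toy data, not the output of any real model.

def demographic_parity_difference(y_pred, groups):
    """Absolute difference in positive-prediction rates between groups."""
    rates = {}
    for g in set(groups):
        members = [p for p, grp in zip(y_pred, groups) if grp == g]
        rates[g] = sum(members) / len(members)
    values = list(rates.values())
    return max(values) - min(values)

# Toy example: a binary classifier's outputs for two sex groups.
preds  = [1, 1, 0, 1, 0, 0, 1, 0]
groups = ["F", "F", "F", "F", "M", "M", "M", "M"]
gap = demographic_parity_difference(preds, groups)
# F positive rate = 3/4, M positive rate = 1/4, so gap = 0.5
```

A gap near zero indicates the model flags both groups at similar rates; large gaps are a signal for closer inspection with the feature attribution and bias detection methods the session covers.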

The workshop will conclude with a “question box” session where participants can anonymously submit questions throughout the event. Selected questions will be discussed collectively, enabling participants to apply their newly acquired knowledge in practical scenarios. By the end of the workshop, attendees will have gained both the theoretical knowledge and practical experience necessary for promoting and developing equitable AI systems.

Monday, 16 September - morning session
Exploring nf-core: a best-practice framework for Nextflow pipelines


Long-term use and uptake of workflows by the life-science community require that the workflows, and their tools, are findable, accessible, interoperable, and reusable (FAIR). To address these challenges, workflow management systems such as Nextflow have become an indispensable part of a computational biologist’s toolbox. The nf-core project is a community effort to facilitate the creation and sharing of high-quality workflows and workflow components written in Nextflow. Workflows hosted on the nf-core framework must adhere to a set of strict best-practice guidelines that ensure reproducibility, portability, and scalability. Currently, there are more than 57 released pipelines and 27 pipelines under development as part of the nf-core community, with more than 400 code contributors.

In this tutorial, we will introduce the nf-core project and how to become part of the nf-core community. We will showcase how the nf-core tools help create Nextflow pipelines starting from a template. We will highlight the best-practice components in Nextflow pipelines to ensure a reproducible and portable workflow, such as CI-testing, modularization, code linting, and containerization. We will introduce the helper functions for interacting with over 1000 ready-to-use modules and subworkflows. In the final practical session, we will build a pipeline using the tooling and components we have introduced.

This workshop will be taught by nf-core project core administrators and Nextflow ambassadors who are experienced Nextflow developers and contributors to the nf-core community.

Target audience: This workshop is designed to appeal to researchers and bioinformaticians seeking to develop pipelines using best practices. It specifically targets those interested in the practical aspects of developing nf-core pipelines to enhance their work. Whether participants are already familiar with nf-core or are newcomers eager to understand the different development options, the workshop provides a platform for learning the best practices for developing Nextflow pipelines. The ideal audience member will have familiarity with the command line, genomics terminology, and standard high-throughput sequencing data formats. Basic experience using Nextflow is required. To get started with Nextflow, see https://training.seqera.io/.

Prerequisites for this tutorial:

  • Have a GitHub account and join the nf-core GitHub organisation beforehand
  • Check that your Gitpod account linked to GitHub is active
  • Join the nf-core Slack: https://nf-co.re/join
  • Basic understanding of command line usage

Monday, 16 September - morning session
Analysis of highly multiplexed fluorescence images


Recent advances in spatial proteomics enable highly-multiplexed profiling of tissues at single-cell resolution and unprecedented scale. However, specialised workflows are necessary to extract meaningful insights in a robust and automated manner. In this workshop, we will introduce participants to the analysis of highly-multiplexed immunohistochemistry data, covering the end-to-end workflow needed to transform multichannel images into single-cell data and perform subsequent spatial analyses.

Reproducible analysis of multiplexed imaging data poses substantial computational challenges. Tissue microarrays, for example, must be subdivided into constituent cores before they can be subjected to cell segmentation, the process of dividing the image into individual cells. Subsequently, staining intensities in each channel and features such as cell morphology are extracted to generate a single-cell expression matrix. This table may serve as an input to machine learning algorithms to phenotype individual cells. Finally, we highlight the importance of studying tissue composition and cell-to-cell contacts as a means to uncover underlying disease mechanisms.
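The measurement step described above, extracting per-channel staining intensities for each segmented cell, can be sketched in a few lines. This is a minimal NumPy illustration of the idea, not the dedicated tooling used in the workshop:

```python
import numpy as np

# Minimal sketch: given a cell segmentation mask and a multichannel image,
# compute the mean staining intensity per cell and per channel. Each row
# of the result is one row of the single-cell expression matrix.

def extract_intensities(mask, image):
    """mask: (H, W) integer labels, 0 = background;
    image: (C, H, W) channel stack.
    Returns an (n_cells, C) matrix of mean intensities per labelled cell."""
    labels = np.unique(mask)
    labels = labels[labels != 0]          # drop background
    return np.array([[image[c][mask == lab].mean()
                      for c in range(image.shape[0])] for lab in labels])

# Toy 4x4 image with two square "cells" and two channels.
mask = np.array([[1, 1, 0, 0],
                 [1, 1, 0, 0],
                 [0, 0, 2, 2],
                 [0, 0, 2, 2]])
image = np.stack([np.full((4, 4), 2.0),
                  np.arange(16, dtype=float).reshape(4, 4)])
X = extract_intensities(mask, image)      # shape (2 cells, 2 channels)
```

In practice, morphological features (area, eccentricity, etc.) are appended as extra columns before the matrix is handed to phenotyping algorithms.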

The goal of this workshop is to familiarise researchers with current approaches to analyse highly-multiplexed images. Beyond discussing each stage of the process of transforming multichannel images into single-cell data, we demonstrate the use of a probabilistic model which incorporates prior knowledge about cell types and their associated markers. Lastly, we show how to chain the analysis steps together in the form of a modular pipeline that is robust and scalable.

Target audience: Computational biologists with experience in Python.

Monday, 16 September - morning session
Improving FAIRability and reproducibility for research software with machine-actionable Software Management Plans


This tutorial focuses on best practices for research software (RS) leveraged by Software Management Plans (SMPs) and facilitated by tools implementing machine-actionable SMP (maSMP) metadata. SMPs comprise questions to help researchers oversee the software development lifecycle and follow a minimum set of best practices (e.g., license, releases, citation). maSMPs complement SMPs by mapping questions and software to structured metadata.

This tutorial will use the SMP created by the ELIXIR Software Best Practices group, the maSMPs based on schema.org and developed by ZB MED / NFDI4DataScience, the Software Management Wizard prepared in collaboration with ELIXIR-CZ, and the software metadata extraction tool supported by OpenEBench. Other SMP platforms and software extraction tools will also be considered (e.g., Research Data Management Organiser RDMO, Software Metadata Extraction Framework SOMEF). With a mix of talks and hands-on sessions, we will show how RS can benefit from an SMP-based metadata enrichment cycle.

This tutorial is suitable for researchers who write code and want to learn more about RS best practices, metadata, and FAIRness. We expect participants to have some knowledge of GitHub and JSON. Learning outcomes include: a practical understanding of the FAIR principles for research software, creation of SMPs, and use of maSMPs for their own GitHub repositories. The overall goal is to provide tools for researchers to make better software. Although this tutorial fits any of the six ECCB conference themes (Genomes, Proteins, Systems biology, Single-cell omics, Microbiome, Digital Health), we will likely showcase the metadata enrichment cycle with some microbiome-related software.
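To make the idea of machine-actionable metadata concrete, here is a hand-rolled sketch of schema.org-typed software metadata of the kind maSMPs build on. The field choices and values are illustrative, not the official maSMP profile:

```python
import json

# A sketch of the core idea behind maSMPs: answers to software-management
# questions expressed as structured, schema.org-typed metadata (JSON-LD).
# The repository name, URL, and version below are hypothetical examples.

software = {
    "@context": "https://schema.org",
    "@type": "SoftwareSourceCode",            # a real schema.org type
    "name": "my-analysis-tool",               # hypothetical name
    "codeRepository": "https://github.com/example/my-analysis-tool",
    "license": "https://spdx.org/licenses/MIT",
    "version": "1.2.0",
    "programmingLanguage": "Python",
}

jsonld = json.dumps(software, indent=2)       # ready to embed or publish
```

Because the metadata is typed and machine-readable, tools can validate it, harvest it from repositories, and feed it back into the SMP, which is the enrichment cycle the tutorial walks through.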

Presenter: Leyla Jael Castro

Schedule:

09:00-09:05  Welcome – Organizers
09:05-09:20  Q&A – Introduction to ELIXIR SMP – Fotis Psomopoulos
09:20-09:35  Q&A – Introduction to schema.org-based maSMP metadata schema – Leyla Jael Castro
09:35-09:50  Q&A – Introduction to OpenEBench and metadata extraction/insertion in GitHub – Eva Martin
09:50-10:05  Q&A – Demo Software Management Wizard with metadata enrichment – Eva Martin
10:05-10:20  Q&A – Other approaches to metadata extraction/insertion – Leyla Jael Castro
10:20-10:30  Introduction to hands-on – Fotis Psomopoulos
10:30-11:00  Coffee break and networking
11:00-11:50  Hands-on: Bring your own software, create its SMP and enrich it with maSMP metadata – All
11:50-12:00  Wrap-up – All
12:00  End of the session and heading for lunch

Target audience: Researchers who develop software. Prerequisites: some knowledge of GitHub and JSON.

Monday, 16 September - afternoon session
Orchestrating Microbiome Analysis with Bioconductor


This tutorial introduces the latest advances in Bioconductor tools and data structures supporting microbiome analysis. We will show how these can enhance interoperability across omics and walk through a typical microbiome data science workflow. You will learn from active developers how to access open microbiome data resources, utilize optimized data structures, assess community diversity and composition, integrate taxonomic and functional hierarchies, and visualize microbiome data. We will follow the gitbook “Orchestrating Microbiome Analysis with Bioconductor” and the Bioconductor SummarizedExperiment framework, which support optimized analysis and integration of hierarchical, multi-domain microbiome data. Participants are encouraged to install the latest versions of R and Bioconductor beforehand.
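As a small taste of the diversity analyses covered, the Shannon index, a standard measure of community diversity, is straightforward to compute. The sketch below uses plain Python for illustration only; the tutorial itself works in R/Bioconductor:

```python
from math import log

# The Shannon diversity index, a standard alpha-diversity measure used in
# microbiome workflows. Plain-Python sketch for illustration; the tutorial
# itself uses R/Bioconductor tooling for this.

def shannon(counts):
    """Shannon index H' = -sum(p_i * ln p_i) over taxa with count > 0."""
    total = sum(counts)
    props = [c / total for c in counts if c > 0]
    return -sum(p * log(p) for p in props)

# An even community of 4 taxa has H' = ln(4) (about 1.386);
# a community dominated by one taxon approaches 0.
h_even = shannon([25, 25, 25, 25])
h_skew = shannon([97, 1, 1, 1])
```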

Presenters:

Leo Lahti, professor in data science; Bioconductor Community Advisory Board. University of Turku, Finland.

Pande Erawijantari, PhD, postdoctoral researcher at the Turku Collegium for Science, Medicine and Technology, University of Turku, Finland.

Tuomas Borman, Bioconductor developer. University of Turku, Finland

Stefanie Peschel, Bioconductor developer. University of Munich.

Giulio Benedetti, Bioconductor developer. University of Jyväskylä, Finland

Renuka Potbhare, Bioconductor developer, Savitribai Phule Pune University, India

Schedule:

13:00-14:30 Tutorial

  • Bioconductor resources for microbiome data science (Leo Lahti)
  • Analysis of community diversity and composition (Pande Erawijantari)
  • Microbiome data integration (Tuomas Borman)

14:30-15:00 Coffee Break

15:00-16:00 Tutorial

  • Microbial network analysis (Stefanie Peschel)
  • Interactive microbiome data exploration with iSEEtree (Giulio Benedetti)

16:00-16:30 Recap and Q&A

Target audience: Intermediate level. The tutorial is primarily intended for bioinformaticians as well as experimentalists working with microbiome data; basic knowledge of data analysis and microbiome research is expected.

Monday, 16 September - afternoon session
Exploring and analysing the protein universe with 3D-Beacons and AlphaFold DB


Emerging structural predictions, including over 200 million protein sequences via the AlphaFold Database, are transforming our biological insights. The 3D-Beacons network enhances this revolution by providing straightforward access to a wide array of protein structures, both experimentally determined and computationally predicted.

Foldseek, a tool for searching structurally similar proteins, facilitates the exploration of the protein universe by enabling the rapid comparison of large sets of structures. During the workshop, we will employ Foldseek to navigate structures within the AlphaFold and PDBe databases. Participants will gain hands-on experience with the AlphaFold Database and integrate PDBe-KB for an enriched, knowledge-based analysis. Through hands-on exercises, we will demonstrate how to derive biological insights and utilise these databases for research, focusing on protein variants and ligand interactions. Our goal is to enhance attendees’ ability to conduct structural biology data analysis, improving their understanding of computational models’ reliability and applications.
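For readers who want a head start on the programmatic side, the AlphaFold Database exposes a public REST API. The sketch below only builds a request URL for its prediction endpoint and leaves the actual network call commented out, so it runs offline:

```python
# Minimal sketch of programmatic AlphaFold Database access. The endpoint
# shown is the public AFDB prediction API; fetching is left commented out
# so the example stays offline.

AFDB_API = "https://alphafold.ebi.ac.uk/api/prediction/{accession}"

def afdb_url(uniprot_accession):
    """Build the AFDB prediction-metadata URL for a UniProt accession."""
    return AFDB_API.format(accession=uniprot_accession)

url = afdb_url("P69905")   # human haemoglobin subunit alpha

# To actually fetch the metadata (returns a JSON list with links to the
# predicted structure files):
# import urllib.request, json
# meta = json.loads(urllib.request.urlopen(url).read())
```

The hands-on sessions go beyond this, combining such API calls with PDBe-KB annotations and Foldseek searches.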

By delving into both experimentally determined and predicted macromolecular structure models via the 3D-Beacons network, this session offers a comprehensive exploration of the AlphaFold database and PDBe, supplemented by Foldseek. Attendees will leave with a broader toolkit for structural biology, ready to apply computational predictions and knowledge-based insights to their research, thereby deepening their understanding of the protein universe.

Schedule:

13:00-13:10  Introduction: Enabling discoveries using macromolecular structure data – Paulyna Magaña

  • An introduction to leveraging macromolecular structure data in scientific discovery. Accessing predicted and experimentally determined structures via 3D-Beacons.

13:10-13:30  Exploring developments and applications of the AlphaFold Database – Paulyna Magaña

  • Focus on recent developments and implementations in the database.

13:30-14:00  Hands-on on AFDB – Paulyna Magaña

14:00-14:15  Exploring the protein landscape with Foldseek – Paulyna Magaña

  • Discover how to efficiently navigate the enriched AlphaFold Database, now featuring more data and integrated Foldseek capabilities.

14:15-14:45  Hands-on Foldseek – Paulyna Magaña

14:45-15:00  Break

15:00-15:10  Biological Insights with PDBe, PDBe-KB – Joseph Ellaway

15:10-15:40  Hands-on PDBe-KB – Joseph Ellaway

15:40-16:00  Concluding insights and final Q&A session – Joseph Ellaway and Paulyna Magaña

  • A final opportunity for participants to ask questions and clarify doubts

Target audience: This tutorial is aimed at bioinformaticians or participants interested in accessing and understanding structural biology data. Attendees should have a background in biology or computational biology and some familiarity with structural biology data. Hands-on elements will include scripting around REST APIs, so basic knowledge in Python or other scripting languages is essential. Participants must bring a laptop and a Google Account to participate in the practical sessions.

Monday, 16 September - afternoon session
Practical and pragmatic FAIRification targeting ELIXIR Communities


This workshop will equip participants with a systematic, practical, and easy-to-adopt approach to implementing the FAIR principles in research projects at any scale. The approach has been validated through a wide range of real-world applications across disciplines, together with industry partners. While the workshop will showcase worked examples from ELIXIR’s metabolomics community, the lessons will translate readily to other data types and methodologies. Both the FAIR principles and this workshop share the overall aim of enabling practical, effective, and scalable data-sharing practices.

Abstract: 

The FAIR principles were established in 2016 and have seen widespread adoption across the life sciences as guidelines promoting good data management and stewardship. However, despite wide adoption of the principles themselves, practical guidance on how to implement them is often too generic and lacks the domain specificity needed to take concrete steps towards genuinely value-added FAIRified data.

We have developed a practical and systematic framework for data FAIRification, which is being further refined through the ELIXIR Interoperability Platform programme. Our approach combines a set of generalised processes that data-generating projects can adopt in a domain-agnostic manner with a FAIR maturity model that establishes the current state and desired target state of a dataset. These data states are systematically defined using our process, informed by practical domain-specific requirements.

In this workshop, aligned with the ELIXIR Interoperability programme of work, we will target ELIXIR communities to improve the FAIRness of their research outputs. Participants will learn how to assess their own level of FAIRness against the FAIR principles and how to apply the FAIRplus methodology and tools to improve their research.

In this workshop, we will present a retrospective analysis of FAIRification in an AMR (antimicrobial resistance) context, as well as one in the metabolomics domain where participants can follow along with the practical work. It should be noted that the generic approach is widely applicable across domains, and participants from non-ELIXIR and non-life-sciences communities are equally welcome.
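The maturity-assessment idea, comparing a dataset's current state against a desired target state per indicator, can be sketched very simply. The indicator names and levels below are invented for illustration and are not the FAIRplus Dataset Maturity model itself:

```python
# Hypothetical sketch of a FAIR maturity gap analysis: score a dataset's
# current vs. target state on a few indicators. Indicator names and level
# scale are invented examples, not the actual FAIRplus DSM.

LEVELS = ["none", "partial", "full"]

def maturity_gap(current, target):
    """Per-indicator gap between current and target levels, in steps."""
    return {k: LEVELS.index(target[k]) - LEVELS.index(current[k])
            for k in current}

current = {"persistent_id": "none", "rich_metadata": "partial", "licence": "full"}
target  = {"persistent_id": "full", "rich_metadata": "full",    "licence": "full"}
gaps = maturity_gap(current, target)
# Largest gaps suggest where to prioritise FAIRification effort.
```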

Schedule:

13:00-13:20 Part 1: General Introductions & Roundtable (20 mins)

13:20-13:35 Use case presentation (Tooba)

13:35-13:50 Introduction to the FAIRplus FAIRification process and tools, including the DSM

13:50-14:00 Group formation for practical

14:00-14:30 Practical work (parallel groups) – establish FAIRification goals

14:30-15:00 Coffee

15:00-15:30 Practical work (parallel groups) – FAIR maturity assessments

15:30-16:00 Practical work (parallel groups) – Design FAIR implementations

16:00- Soft close (closing remarks/summary)

 

Target Audience: While this workshop is nominally designed for ELIXIR communities, it is open to all ECCB2024 participants. It is equally relevant to bioinformatics experts, researchers, and data management professionals with an interest in fostering effective and scalable data sharing practices.

Tuesday, 17 September - morning session
Tidyomics: a modern omic analysis ecosystem


The exponential growth of omic data presents data manipulation, analysis, and integration challenges. Addressing these challenges, Bioconductor offers an extensive data analysis platform and community, while R tidy programming provides a standard for data organisation and manipulation that has revolutionised data science. Bioconductor and tidy R have mostly remained independent; bridging these two ecosystems would streamline omic analysis, ease learning, and encourage cross-disciplinary collaborations. In this workshop, we introduce the tidyomics ecosystem, a suite of interoperable software that brings the vast tidy software ecosystem to omic data analysis.

Presenter: Stefano Mangiola

Schedule:

09:00-09:05  Welcome – Organizers

09:05-09:20  Q&A – Use of tidyverse and single-cell/spatial omics

09:20-09:45   Introduction to tidySingleCellExperiment

09:45-10:10  Signature visualisation

10:10-10:30  Hands-on exercises

10:30-11:00  Coffee break and networking

11:00-11:25  Introduction to tidySpatialExperiment

11:25-11:35  Introduction to genomics and transcriptomics data integration

11:35-11:50  Hands-on exercises

11:50-12:00  Q&A and Wrap-up

12:00  End of the session and heading for lunch

Target audience: Computational biologists, R users.

Tuesday, 17 September - morning session
Computational approaches for identifying context-specific transcription factors using single-cell multi-omics datasets


Identifying transcription factors (TFs) that govern lineage specification and functional states is crucial for comprehending biological systems in both health and disease. A diverse array of TF activity inference approaches has emerged, leveraging input data from single-cell RNA sequencing (scRNA-seq), single-cell assay for transposase-accessible chromatin sequencing (scATAC-seq), Multiome, and Cellular Indexing of Transcriptomes and Epitopes by Sequencing (CITE-seq). These approaches aim to infer the activity of transcription factors within individual cells, integrating information from upstream epigenetic and signaling cascades. This integration enhances our understanding of both cell type-specific and consensus (multi-cell type) TF regulation.

In this tutorial, we will explore these TF activity inference methods using single cell omics datasets (e.g. scRNA-seq, scATAC-seq, Multiome, and CITE-seq) through a combination of hybrid lectures and hands-on training sessions. Emphasis will be placed on elucidating the principles that underlie these methods, understanding their inherent assumptions, and evaluating the associated trade-offs. Participants will actively apply multiple inference methods, gaining practical experience in result interpretation. Additionally, strategies for in silico validation will be discussed to ensure robust analyses. By the end of the tutorial, the audience will possess practical knowledge and essential skills, enabling them to independently conduct TF activity inference on their own datasets. Furthermore, participants will be adept at interpreting results, fostering a comprehensive understanding of TF regulation in diverse cellular contexts.
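The simplest family of TF activity inference scores a TF in each cell by summarising the expression of its known target genes (its regulon). The toy sketch below illustrates only that baseline idea; the methods covered in the tutorial, and their assumptions and trade-offs, go well beyond it:

```python
# Toy sketch of baseline TF activity inference: score a TF per cell as the
# mean expression of its regulon (known target genes). Gene names, values,
# and the regulon below are invented for illustration.

def tf_activity(expression, regulon):
    """expression: {gene: [per-cell values]}; regulon: list of target genes.
    Returns the per-cell mean expression over targets present in the matrix."""
    targets = [g for g in regulon if g in expression]
    n_cells = len(next(iter(expression.values())))
    return [sum(expression[g][i] for g in targets) / len(targets)
            for i in range(n_cells)]

# Hypothetical 3-cell expression matrix and a TF with two annotated targets.
expr = {"GeneA": [2.0, 0.0, 1.0],
        "GeneB": [4.0, 0.0, 3.0],
        "GeneC": [0.0, 5.0, 0.0]}
activity = tf_activity(expr, regulon=["GeneA", "GeneB"])
# Cell 1 scores highest: both targets are strongly expressed there.
```

Real methods refine this with statistical models, weighting of activating vs. repressing targets, and, for scATAC-seq, motif accessibility rather than target expression.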

Presenter: Hatice Osmanbeyoglu

Schedule:

9:00 – Welcome remarks and tutorial overview

9:05 – Basic principles behind TF activity inference methods

  • Overview of the importance of context-specific TF regulation in biological systems.
  • Significance of TF dynamics in health and disease.
  • Single-cell multi-omics and spatial transcriptomics technologies for TF activity inference (scRNA-seq, scATAC-seq, Multiome and CITE-seq)

9:30 – Overview of computational TF inference methods based on single cell omics

10:00 – Coffee Break

10:30 – Hands-on experience in applying tools and interpreting results using multiple TF activity inference methods using public scRNA-seq and spatial transcriptomics

11:00 – Hands-on experience in applying tools and interpreting results using multiple TF activity inference methods using public scATAC-seq and multiome

11:30 – Hands-on experience in applying tools and interpreting results using TF activity inference methods using public CITE-seq

12:00 – Lunch break

Target audience: This tutorial is designed for individuals at the beginner to intermediate level, specifically targeting bioinformaticians or computational biologists with some prior experience in analyzing single-cell RNA sequencing (scRNA-seq), single-cell assay for transposase-accessible chromatin sequencing (scATAC-seq), Cellular Indexing of Transcriptomes and Epitopes by Sequencing (CITE-seq), and Multiome data, or those familiar with next-generation sequencing (NGS) methods. A foundational understanding of basic statistics is assumed. While participants are expected to be beginners, a minimum level of experience in handling NGS datasets is required. The workshop will be conducted using Python and JupyterLab, necessitating prior proficiency in Python programming and familiarity with command-line tools. To facilitate the learning process, participants will be provided with pre-processed count matrices derived from real datasets. All analyses, including JupyterLab notebooks and tutorial steps, will be available on GitHub for reference. The tutorial will employ publicly accessible data, with examples showcased using datasets that will be made available through repositories such as the Gene Expression Omnibus or similar public platforms. This hands-on workshop aims to equip participants with practical skills and knowledge, enabling them to navigate and analyze complex datasets in the field of single-cell omics.

Tuesday, 17 September - morning session
From zero to federated discovery: beaconize your biomedical data


Genomic variations significantly affect disease risks and trajectories, holding immense potential for human health. The complexity of human genomes requires large numbers of genomic analyses for efficient use of modern genome technologies. Yet, genomic data’s widespread dispersion in disconnected silos hinders effective data sharing.

Beacon (v2) is a Global Alliance for Genomics and Health (GA4GH) standard for biomedical genomics data discovery services, designed to drive responsible and broad sharing of biomedical health data. The standard supports networks of beacon instances that can swiftly identify the global availability of records, enhancing responsible data sharing in biomedicine.

However, not all institutions (infrastructures, research, or medical) are equal when implementing a beacon: some may not have the human and computational resources necessary to do so. To facilitate the adoption of the standard, we are offering a tutorial presenting a series of tools for parties interested in deploying a beacon. These tools are centred around the Beacon Reference Implementation (RI), commissioned by ELIXIR to share existing and, so far, siloed datasets.

In this tutorial, our expert panellists will guide the participants through (1) the “Extract, transform, and load” (ETL) process from data to the Beacon Model or directly leveraging OMOP instances, (2) making a beacon accessible to users by adjusting queries and responses, and deploying a user interface, and (3) joining in a network of beacon instances for federated discovery.
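To give a feel for step (2), the sketch below assembles the kind of JSON request body a Beacon v2 instance accepts (e.g., via a POST to its genomic variants endpoint). Field names follow the Beacon v2 specification; the variant itself is a made-up example:

```python
import json

# Sketch of a Beacon v2 query body asking "is this variant present?".
# Structure follows the Beacon v2 request spec; the coordinates and
# alleles below are invented example values.

query = {
    "meta": {"apiVersion": "2.0"},
    "query": {
        "requestParameters": {
            "assemblyId": "GRCh38",
            "referenceName": "17",
            "start": [43045700],        # 0-based; a list, to allow ranges
            "referenceBases": "G",
            "alternateBases": "A",
        },
        # "boolean" granularity returns only presence/absence,
        # the most privacy-preserving response level.
        "requestedGranularity": "boolean",
    },
}

body = json.dumps(query)
```

Higher granularities ("count", "record") return progressively more detail, which is where the access levels and authentication covered later in the schedule come in.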

Speakers: Gemma Milla, Liina Nagirnaja, Oriol López-Doriga

Data Sponsor: DATOS-CAT

Schedule:

09:00  Welcome and introductions (Milla, Nagirnaja, López-Doriga)

  • Presentation of the speakers and EGA
  • Introduction to DATOS-CAT data
  • Delivery of the documentation for the tutorial

09:05  Introduction to Beacon v2 (Nagirnaja)

  • A brief introduction to GA4GH standards
  • Exploring the standards

09:15  Introduction to Beacon v2 Reference Implementation (Nagirnaja)

  • An ELIXIR driver project
  • Explanation of the composition: tools, API and UI
  • What is BFF?

09:25  Deployment of Beacon v2 Reference Implementation (Nagirnaja)

09:35  Extraction of data (Nagirnaja)

  • Introduction to beacon RI tools v2
  • Reading clinical data and loading to beacon RI tools v2

09:45  Transforming data (Nagirnaja)

  • Transforming CSV files for metadata
  • Transforming VCF files for genomic data

09:55  Loading data to the API (López-Doriga)

  • Explaining the bash commands
  • Executing a semi-automated script to load data

10:05  Beacon v2 API + MongoDB (López-Doriga)

  • Explaining the API and its endpoints
  • Introduction to MongoDB
  • Generating the filtering terms
  • Indexing MongoDB

10:15  Configuring the API (López-Doriga)

  • Editing the API information
  • Linking ids to a dataset and setting its security level
  • Managing authentication and authorization: Keycloak and LifeScience login use cases

10:30  COFFEE BREAK

11:00  Querying the API (López-Doriga)

  • Genomic queries
  • Phenotypic queries
  • Cross queries
  • Join queries

11:10  Beacon UI (Milla)

  • Introduction to the UI
  • Authenticating to the UI
  • Describing the functionalities

11:20  Querying with the Beacon UI (Milla)

  • Basic queries
  • Advanced queries
  • Phenotypic queries
  • Genotypic queries
  • Visualization and explanation of the results obtained

11:35  Beacon Network (Milla)

  • Introduction to Beacon Network concept
  • Showing the ELIXIR Beacon Network UI

11:40  Steps to enter a Beacon Network (Milla)

  • Validation of a beacon: executing beacon verifier
  • Governance model: creating specific rules that apply to beacons in a BN to solve divergency problems

11:45  Closing the tutorial (Milla, Nagirnaja, López-Doriga)

  • Questions
  • Facilitating documentation and contact information
  • Appreciation and credits

Target audience: This workshop/tutorial is aimed at bioinformaticians, developers, and system administrators in biomedical research or clinical facilities who are interested in making their data discoverable. Due to the 3-hour time limit, it will consist of a stepwise tutorial rather than a hands-on workshop, so there are no prerequisites for attending. However, participants who have a good grasp of the data they want to make discoverable will benefit most from this workshop.

Tuesday, 17 September - morning session
Learn the essentials of research data management and microbial genome submissions


Research Data Management (RDM) is the process of organising, documenting, storing, and sharing research data throughout a project's lifecycle. It is important for the quality, integrity, and reproducibility of research results. Microbial genome submission is the process of submitting genome data and metadata to public databases, such as GenBank/EMBL/DDBJ. It is often required by funders and journals.

ELIXIR-DE is the German node of ELIXIR, the European infrastructure for life science data. ELIXIR-DE provides various services and resources for researchers, such as data storage, analysis tools, training materials, and standards. ELIXIR-DE also supports researchers in creating and implementing Data Management Plans (DMPs), which describe how data will be managed during and after a research project.

ELIXIR-DE has developed this training course to teach you the basics of RDM and DMPs, and help you submit your microbial genome data to public databases. The course is designed for researchers and students in the life sciences who want to improve their data management skills and practices.

The course consists of following modules:

• RDM and DMPs: Learn the basics, importance, and standards of RDM and DMPs, such as the FAIR data principles.

• DMPs: Explore and create your own DMP using the DMP tools.

• Microbial genome submission: Learn how to prepare and submit your genome data and metadata to public databases.

By the end of the course, you will have gained an understanding of RDM and DMPs, and you will know how to obtain an accession number for your genomic data.

Join the course and learn the essentials of RDM and microbial genome submissions!

Target audience: The course is designed for researchers and students in the life sciences who want to improve their data management skills and practices. 

Tuesday, 17 September - afternoon session
Machine Learning Meets Omics: Explainable AI Strategies for Omics Data Analysis


Due to the increasing amount of publicly available omics datasets, we nowadays have the unique opportunity to integrate omics data from multiple sources into large-scale datasets. The complexity and size of these datasets typically require analysis with machine learning (ML) algorithms. The increasing interest in using ML for omics data analysis comes with a need for education on how to train such models and interpret their results, especially since complex supervised ML models are often considered “black boxes”, offering limited insight into how their outputs are derived.

In this tutorial, we will use the Google Colaboratory environment and the Python programming language to showcase how to train, optimize, and evaluate an ML classifier on large-scale omics datasets, and how to subsequently apply state-of-the-art XAI methods to dissect the model’s output. We will illustrate ML best practices and coding principles in a hands-on session on a transcriptomics dataset using sklearn and other relevant Python packages. We will further dive into important aspects of model training, such as hyperparameter optimization and validation strategies for multi-source datasets. Additionally, we will discuss common pitfalls when applying XAI methods, ensuring that participants gain not only technical proficiency but also a critical perspective on the interpretability of ML models in scientific research. We will conclude the tutorial with a practical session where participants will have the opportunity to apply what they have learned to the transcriptomics dataset or their own dataset.
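One of the model-agnostic XAI methods covered, permutation feature importance, is simple enough to sketch from scratch: shuffle one feature at a time and measure how much the model's accuracy drops. This is a dependency-free toy illustration, not the tutorial's sklearn-based implementation:

```python
import random

# From-scratch sketch of permutation feature importance: permute each
# feature column independently and record the resulting accuracy drop.
# The "model" and data below are toy examples.

def accuracy(model, X, y):
    return sum(model(x) == t for x, t in zip(X, y)) / len(y)

def permutation_importance(model, X, y, n_features, seed=0):
    rng = random.Random(seed)
    base = accuracy(model, X, y)
    drops = []
    for j in range(n_features):
        col = [x[j] for x in X]
        rng.shuffle(col)                       # break feature j's link to y
        X_perm = [list(x) for x in X]
        for i, v in enumerate(col):
            X_perm[i][j] = v
        drops.append(base - accuracy(model, X_perm, y))
    return drops

# Toy classifier that only looks at feature 0; feature 1 is ignored,
# so permuting feature 1 cannot change accuracy (drop = 0).
model = lambda x: int(x[0] > 0.5)
X = [[0.9, 5.0], [0.1, 5.0], [0.8, 1.0], [0.2, 1.0]] * 5
y = [1, 0, 1, 0] * 5
drops = permutation_importance(model, X, y, n_features=2)
```

In the hands-on session, the same idea is applied with library implementations and averaged over repeated shuffles, and its pitfalls (e.g., correlated features) are discussed.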

For the practical sessions we will use a publicly available dataset that comes with a reference publication: Warnat-Herresthal, S. et al. (2020). https://doi.org/10.1016/j.isci.2019.100780

Presenters:

  • Lisa Barros de Andrade e Sousa
  • Elisabeth Georgii
  • Till Richter
  • Fatemeh Hashemi

Schedule:

13:00 – 13:30: ML Best-Practices for Omics Data

  • Introduction to the Transcriptomics Dataset
  • Model Training with Hyperparameter Optimization
  • Model Evaluation for Multi-Source Datasets
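One common strategy for the multi-source evaluation listed above is group-aware cross-validation, in which all samples from one source (e.g. one study or batch) stay in the same fold, so the model is always evaluated on a source it never saw during training. A minimal sklearn sketch, where the four "sources" are an assumption for illustration:

```python
# Group-aware cross-validation for multi-source data: each fold holds out
# one entire source, mimicking generalization to an unseen study.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
sources = np.repeat([0, 1, 2, 3], 50)  # toy labels: 4 hypothetical studies

scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         cv=GroupKFold(n_splits=4), groups=sources)
print("per-source-fold accuracy:", scores)
```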

13:30 – 14:30: Interpretation with Model-Agnostic XAI Methods

  • Introduction to eXplainable AI (XAI)
  • Permutation Feature Importance
  • SHAP (SHapley Additive exPlanations)
  • Optional: LIME (Local Interpretable Model-Agnostic Explanations)

14:30 – 15:00: Coffee Break

15:00 – 15:30: Interpretation with Model-Specific XAI Methods

  • Forest Guided Clustering

15:30 – 16:15: XAI Best-Practices for Omics Data

  • Hands-on Session: Application of Different XAI Methods on the Transcriptomics Dataset or Your Own Dataset
  • Final Discussion: Pros and Cons of XAI Methods

 

Note for attendees: Basic knowledge of Python and access to Google Colab are required to actively participate in the hands-on sessions. For participants without access to Google Colab, we will provide a requirements file at the beginning of the tutorial for installation of the required packages.

Target audience: Individuals who are familiar with preprocessing omics data and eager to improve their skills with machine learning tools and explainable AI methods. Basic knowledge of Python is required to actively participate in the hands-on sessions.

Tuesday, 17 September - afternoon session
Single-cell ATAC-seq data analysis using ArchR and Avocato

Read description

The single-cell ATAC-seq method provides a detailed profile of the chromatin accessibility landscape at single-cell resolution across many cell types. It can be used to analyse cellular and regulatory diversity and to map enhancers, promoters, and other regulatory elements within different cell populations. The ability to profile the epigenomic landscape of many cell types makes scATAC-seq a great candidate for interpreting the results of genome-wide association studies (GWAS). GWAS is a commonly used approach for identifying which genetic variants are associated with which diseases. Since most GWAS findings are located in non-coding regions of the human genome, scATAC-seq can expand our ability to interpret these findings functionally.

This tutorial consists of two parts. The first part will focus on understanding scATAC-seq data and how to analyse it using the ArchR software. ArchR is a commonly used R package for comprehensive analysis of scATAC-seq data. We will discuss filtering the data based on quality control metrics, dimensionality reduction, clustering, and visualization for the scATAC-seq PBMC dataset. In the second part, we will introduce AVOCATO, a Snakemake workflow for understanding the relationship between genetic variants and the diseases they affect. AVOCATO provides automated identification of disease-relevant cell clusters and fine-mapping of genetic variants using scATAC-seq data. It also provides a user-friendly dynamic interface for users to interact with scATAC-seq data and all analysis results.

This tutorial will give participants proficiency in leveraging the ArchR and AVOCATO tools to analyze scATAC-seq data, enabling them to identify cell populations, assess regulatory landscapes, and uncover potential disease-relevant cell clusters.

Presenters: Dr E. Ravza Gur and Simone G. Riva

Schedule:

13:00-14:30 Tutorial:

  1. Introduction to single cell epigenomics and fundamental single cell ATAC-seq data analysis using ArchR

1.1 The single-cell ATAC-seq assay, and why do we need a single-cell approach?

1.2 What does scATAC-seq data analysis look like?

1.3 scATAC-seq data analysis workflow from fastq to clustering

1.4 scATAC-seq data analysis using ArchR

  • Filtering cells based on quality control parameters.
  • Implementing an iterative LSI dimensionality reduction
  • Forming clusters
  • Visualization (UMAP) of single cells
  • Assigning cell type identity to clusters
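ArchR itself is an R package, so purely as a conceptual illustration of the LSI dimensionality-reduction step listed above, the following Python sketch shows one LSI round on a toy binary cell-by-peak matrix: TF-IDF weighting followed by truncated SVD. The toy matrix and dimensions are assumptions for illustration, not ArchR defaults.

```python
# Conceptual sketch of one LSI round for scATAC-seq data (ArchR's iterative
# LSI repeats this on the most variable features): TF-IDF + truncated SVD.
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
counts = (rng.random((100, 500)) < 0.1).astype(float)  # toy cell-by-peak matrix

tf = counts / counts.sum(axis=1, keepdims=True)               # term frequency per cell
idf = np.log(1 + counts.shape[0] / (1 + counts.sum(axis=0)))  # inverse document frequency
tfidf = tf * idf

embedding = TruncatedSVD(n_components=30, random_state=0).fit_transform(tfidf)
print(embedding.shape)  # 100 cells x 30 LSI components
```

The resulting low-dimensional embedding is what clustering and UMAP visualization are then run on.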

14:30-15:00  Coffee Break

15:00-16:00 Tutorial:

2. Advanced single cell ATAC-seq data analysis and visualization using Avocato

2.1 Cell type prioritization

2.2 Variant prioritization

2.3 Highly dynamic and interactive visualization

16:00-16:30 Recap and Q&A

 

Target audience: Intermediate level. The target audience comprises individuals with backgrounds in computational biology, bioinformatics, and data science. This workshop is designed for individuals interested in delving into the field of single-cell epigenomics and high-dimensional data visualization. Prerequisites include a basic understanding of molecular biology concepts, familiarity with the R programming language, and an eagerness to explore the intricacies of epigenetic data analysis and visualization techniques. Additionally, familiarity with Snakemake and bash coding will be beneficial for streamlining computational workflows (optional).

Tuesday, 17 September - afternoon session
Advancing synthetic data generation and dissemination for Life Sciences

Read description

This workshop aims to spotlight state-of-the-art synthetic data production and applications for Life Sciences while addressing common challenges and opportunities. With a focus on collaboration and research, the workshop will showcase efforts and initiatives in this domain driven by (1) use cases developed within ELIXIR with hands-on demos by the workshop organisers, and (2) use cases presented by selected workshop participants or on-the-fly flash-talks with participants sharing their firsthand experiences with synthetic data.

To engage participants in discussions around key challenges in synthetic data generation, a World Café session will also be organised. At the World Café, each of up to four discussion questions will have its own corner equipped with whiteboards for idea-sharing, with participants rotating between them and table moderators summarising the discussions at the end. Example discussion questions include accurate representation of data variability, benchmarking, the privacy-utility balance, temporal dependencies, susceptibility to adversarial attacks, user needs, and dissemination. Results of the World Café will be reported by workshop moderators and later compiled into a comprehensive report, serving as a valuable resource for the ECCB community. The report will provide shared insights, recommendations, and avenues for future exploration in synthetic data generation.

The workshop aims to advance computational biology capabilities in synthetic data generation, dissemination, and research outcomes by facilitating knowledge exchange, tool demonstrations, and collaborative problem-solving among researchers. This initiative seeks to foster synergies, enhance research outcomes, and propel the Life Sciences community towards the forefront of synthetic data advancements.
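As a minimal, hedged illustration of one challenge the workshop discusses (accurate representation of data variability), the sketch below uses a deliberately naive generator that fits an independent Gaussian per feature: it reproduces marginal distributions but loses the correlation structure present in the real data. The toy data and thresholds are assumptions for illustration.

```python
# Naive synthetic-data baseline: per-feature independent Gaussians.
# Marginals are matched, but cross-feature correlation is destroyed --
# a key pitfall when assessing synthetic data utility.
import numpy as np

rng = np.random.default_rng(0)
real = rng.multivariate_normal([0, 0], [[1, 0.9], [0.9, 1]], size=1000)

mu, sigma = real.mean(axis=0), real.std(axis=0)
synthetic = rng.normal(mu, sigma, size=real.shape)

print("real corr:     ", np.corrcoef(real.T)[0, 1])       # close to 0.9
print("synthetic corr:", np.corrcoef(synthetic.T)[0, 1])  # close to 0: correlation lost
```

Richer generators (copulas, GANs, diffusion models) aim precisely at preserving such joint structure while still protecting privacy.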

Target audience: Researchers, practitioners, and professionals, not restricted to the ELIXIR network, who are involved in or interested in synthetic data generation and its applications, such as individuals working in machine learning, data science, bioinformatics, and related fields.

Learning objectives: (1) Identify the main assets of synthetic data in life sciences and pinpoint some of their drawbacks. (2) Gain practical insights for utilising synthetic data in attendees’ research. (3) Define a roadmap for synthetic data generation in Life Sciences.

Tuesday, 17 September - afternoon session
Navigating the ELIXIR Genomic Data Highway: Applying population genomics to health research

Read description

ECCB ELIXIR Workshop

Embark on a journey through the future of European genomic datasets and unified discovery and access. Within the 1+ Million Genomes (1+MG) initiative, the Genome of Europe (GoE) population genomics project, expected to start this year, aims to capture the genetic diversity of our continent with data from at least 500,000 citizens. It will be the first major federated cohort of data available across Europe based on the 1+MG infrastructure. It builds on the last 10 years’ effort to advance common standards and open-source solutions through ELIXIR, the European research infrastructure for life sciences, funded by Member States and the EC. ELIXIR is coordinating the European Genomic Data Infrastructure (GDI) project to operationalise the infrastructure across 24 countries, aligning the 1+MG and ELIXIR efforts.

The workshop starts with an overview of the architecture and of the use cases, amounts, and types of data that will be available. Through a hands-on tutorial, we will dive into practical applications: what information will be available, how to access it, and how to apply it to, for example, polygenic risk scores and cancer research. As the infrastructure is still being developed and deployed, the workshop will give you the opportunity to influence the functionalities that will be made available, making sure the infrastructure resonates with your research needs. Get a sneak peek and provide important feedback on this major project in ways that will maximise the utility of the European data infrastructure for computational health research.
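As background for the polygenic risk score application mentioned above: in its simplest form, a PRS is a weighted sum of an individual's risk-allele dosages, with GWAS effect sizes as the weights. The numbers below are toy values for illustration, not real genomic data.

```python
# Toy polygenic risk score: PRS = sum over variants of (dosage * effect size).
import numpy as np

dosages = np.array([0, 1, 2, 1])                    # risk-allele counts (0/1/2) at 4 variants
effect_sizes = np.array([0.12, 0.05, 0.30, -0.08])  # per-allele GWAS betas (toy values)

prs = float(dosages @ effect_sizes)
print(round(prs, 2))  # 0*0.12 + 1*0.05 + 2*0.30 + 1*(-0.08) = 0.57
```

Real PRS pipelines additionally handle strand alignment, linkage disequilibrium, and ancestry-specific calibration, which is where large federated cohorts like GoE become valuable.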

Coordinators: Melissa Konopko (ELIXIR Hub), Salvador Capella-Gutierrez (Barcelona Supercomputing Center), Juan Arenas (ELIXIR Hub)

Workshop Agenda: Future of European Genomic Datasets

12:00 – 12:15: Welcome and Introduction

  • Introduction to the Workshop and Objectives
  • Overview of the 1+ Million Genomes (1+MG) Initiative

12:15 – 12:45: Overview of the Genome of Europe (GoE) Project

  • Goals and Scope of the GoE Project
  • The Importance of Genetic Diversity
  • Expected Data Contributions from 500,000 Citizens

12:45 – 13:15: Architecture and Infrastructure

  • Detailed Overview of the 1+MG Infrastructure
  • Federated Cohorts and Data Availability Across Europe

13:15 – 13:45: Use Cases and Data Types

  • Types and Amounts of Data to be Available
  • Practical Applications in Research
    • Polygenic Risk Scores
    • Cancer Research

13:45 – 14:15: Hands-On Tutorial: Accessing and Using the Data

  • Step-by-Step Guide to Accessing Genomic Data
  • Exploring Data Sets and Tools
  • Practical Applications and Examples

14:15 – 14:30: Q&A Session

  • Open Floor for Questions and Clarifications
  • Participant Input and Discussion

14:30 – 15:00: Coffee Break

15:00 – 15:30: Influence Future Functionalities

  • Discussion on Needed Functionalities and Features
  • Collecting Feedback from Participants
  • Aligning Infrastructure with Research Needs

15:30 – 16:00: Feedback and Next Steps

  • Recap of Key Points and Learnings
  • Feedback Collection
  • Overview of Future Developments and How to Stay Involved
  • Closing Remarks

16:00: Workshop Ends