Workshops and tutorials

Monday, 16 September - morning session
TBC
Monday, 16 September - morning session
Exploring nf-core: a best-practice framework for Nextflow pipelines

Long-term use and uptake of workflows by the life-science community require that the workflows, and their tools, are findable, accessible, interoperable, and reusable (FAIR). To address these challenges, workflow management systems such as Nextflow have become an indispensable part of a computational biologist's toolbox. The nf-core project is a community effort to facilitate the creation and sharing of high-quality workflows and workflow components written in Nextflow. Workflows hosted within the nf-core framework must adhere to a set of strict best-practice guidelines that ensure reproducibility, portability, and scalability. Currently, the nf-core community maintains more than 57 released pipelines and 27 pipelines under development, with more than 400 code contributors.

In this tutorial, we will introduce the nf-core project and how to become part of the nf-core community. We will showcase how the nf-core tools help create Nextflow pipelines starting from a template. We will highlight the best-practice components in Nextflow pipelines to ensure a reproducible and portable workflow, such as CI-testing, modularization, code linting, and containerization. We will introduce the helper functions for interacting with over 1000 ready-to-use modules and subworkflows. In the final practical session, we will build a pipeline using the tooling and components we have introduced.
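
The nf-core helper tooling mentioned above ships as a Python package with a command-line interface. As a minimal sketch of the create/lint/extend cycle, assuming a recent nf-core/tools release (older releases use, e.g., "nf-core create" rather than "nf-core pipelines create"):

```python
# Driving the nf-core CLI from Python; command names follow recent
# nf-core/tools releases and may differ in older versions.
import subprocess

def run(cmd: list[str]) -> None:
    """Run a CLI command, echoing it and failing loudly on error."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

run(["nf-core", "pipelines", "create"])            # new pipeline from the nf-core template
run(["nf-core", "pipelines", "lint"])              # check against the best-practice guidelines
run(["nf-core", "modules", "list", "remote"])      # browse the ready-to-use modules
run(["nf-core", "modules", "install", "fastqc"])   # install one module into the pipeline
```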

This workshop will be taught by nf-core project core administrators and Nextflow ambassadors who are experienced Nextflow developers and contributors to the nf-core community.

Monday, 16 September - morning session
Analysis of highly multiplexed fluorescence images

Recent advances in spatial proteomics enable highly-multiplexed profiling of tissues at single-cell resolution and unprecedented scale. However, specialised workflows are necessary to extract meaningful insights in a robust and automated manner. In this workshop, we will introduce participants to the analysis of highly-multiplexed immunohistochemistry data, covering the end-to-end workflow needed to transform multichannel images into single-cell data and perform subsequent spatial analyses.

Reproducible analysis of multiplexed imaging data poses substantial computational challenges. Tissue microarrays, for example, must be subdivided into constituent cores before they can be subjected to cell segmentation, the process of dividing the image into individual cells. Subsequently, staining intensities in each channel and features such as cell morphology are extracted to generate a single-cell expression matrix. This table may serve as an input to machine learning algorithms to phenotype individual cells. Finally, we highlight the importance of studying tissue composition and cell-to-cell contacts as a means to uncover underlying disease mechanisms.
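
As a rough illustration of the intensity-extraction step described above, this Python sketch converts a toy multichannel image and segmentation mask into a per-cell feature table with scikit-image (the property name "intensity_mean" follows recent scikit-image releases; older versions call it "mean_intensity"):

```python
# Turn a segmented multichannel image into a single-cell expression matrix,
# assuming a label mask produced by an upstream segmentation step.
import numpy as np
import pandas as pd
from skimage.measure import regionprops_table

rng = np.random.default_rng(0)
image = rng.random((128, 128, 4))       # toy 4-channel staining image (H, W, C)
mask = np.zeros((128, 128), dtype=int)  # toy segmentation mask with two "cells"
mask[10:30, 10:30] = 1
mask[60:90, 60:90] = 2

# Mean staining intensity per channel plus simple morphology features per cell.
props = regionprops_table(
    mask,
    intensity_image=image,
    properties=["label", "area", "eccentricity", "intensity_mean"],
)
cells = pd.DataFrame(props)             # one row per cell: the single-cell matrix
print(cells.head())
```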

The goal of this workshop is to familiarise researchers with current approaches to analyse highly-multiplexed images. Beyond discussing each stage of the process of transforming multichannel images into single-cell data, we demonstrate the use of a probabilistic model which incorporates prior knowledge about cell types and their associated markers. Lastly, we show how to chain the analysis steps together in the form of a modular pipeline that is robust and scalable.

Monday, 16 September - morning session
Improving FAIRability and reproducibility for research software with machine-actionable Software Management Plans

This tutorial focuses on best practices for research software (RS), leveraged by Software Management Plans (SMPs) and facilitated by tools implementing machine-actionable SMP (maSMP) metadata. SMPs comprise questions that help researchers oversee the software development lifecycle and follow a minimum set of best practices (e.g., licensing, releases, citation). maSMPs complement SMPs by mapping questions and software to structured metadata. This tutorial will use the SMP created by the ELIXIR Software Best Practices group, the maSMPs based on schema.org and developed by ZB MED / NFDI4DataScience, the Software Management Wizard prepared in collaboration with ELIXIR-CZ, and the software metadata extraction tool supported by OpenEBench. Other SMP platforms and software extraction tools will also be considered (e.g., the Research Data Management Organiser RDMO and the Software Metadata Extraction Framework SOMEF). With a mix of talks and hands-on sessions, we will show how RS can benefit from an SMP-based metadata enrichment cycle.

This tutorial is suitable for researchers who write code and want to learn more about RS best practices, metadata and FAIRness. We expect participants to have some knowledge of GitHub and JSON. Learning outcomes include a practical understanding of the FAIR principles for research software, the creation of SMPs, and the use of maSMPs for participants' own GitHub repositories. The overall goal is to provide tools for researchers to make better software. Although this tutorial fits any of the six ECCB conference themes (Genomes, Proteins, Systems biology, Single-cell omics, Microbiome, Digital Health), we will likely showcase the metadata enrichment cycle with some microbiome-related software.
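
To make the idea of machine-actionable metadata concrete, here is a minimal, illustrative software metadata record expressed with schema.org's SoftwareSourceCode type, the vocabulary the maSMPs build on. The tool name and repository below are hypothetical, and the actual maSMP profiles define fields beyond these core schema.org properties:

```python
# A minimal JSON-LD software metadata record using only core schema.org
# properties; real maSMP profiles extend this with additional fields.
import json

record = {
    "@context": "https://schema.org",
    "@type": "SoftwareSourceCode",
    "name": "my-analysis-tool",  # hypothetical tool name
    "codeRepository": "https://github.com/example/my-analysis-tool",  # hypothetical repo
    "programmingLanguage": "Python",
    "license": "https://spdx.org/licenses/MIT",
    "version": "1.0.0",
    "author": {"@type": "Person", "name": "Jane Doe"},
}

print(json.dumps(record, indent=2))  # ready to embed in a repository or an SMP
```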

Monday, 16 September - afternoon session
Orchestrating microbiome data science with Bioconductor

Microbiome analysis has become an integral element of omics research. A vast array of computational techniques has been proposed in recent years to deal with the special properties of microbiome data. Integrating heterogeneous multi-assay datasets, choosing optimal analysis strategies, and constructing coherent workflows can all benefit from a systematic data science framework.

The tutorial provides guidance for bioinformaticians as well as experimentalists working with microbiome data. We will walk through a typical microbiome data science workflow using demonstration data from open repositories and show how the latest advances in the Bioconductor ecosystem enhance interoperability between methods and data from various omics, demonstrating the key steps and methodological considerations along the way. In particular, participants will learn how to use the R/Bioconductor ecosystem to:

– access open microbiome data from e.g. EBI/MGnify and curatedMetagenomicData

– do basic microbiome data wrangling (subsetting, aggregation, transformations)

– identify differences in community diversity and composition (illustrated in the sketch after this list)

– integrate taxonomic and functional profiles

– incorporate phylogenetic trees and other hierarchical structures

– visualize microbiome data
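
The workshop itself is taught in R/Bioconductor. Purely as a language-agnostic illustration of the community-diversity step referenced in the list above, this Python sketch computes the Shannon index, H = -sum_i p_i log(p_i), for each sample of a toy taxa-by-sample count table:

```python
# Shannon diversity per sample from a toy taxa-by-sample count table.
import numpy as np
import pandas as pd

counts = pd.DataFrame(
    {"sample_1": [120, 30, 5, 0], "sample_2": [40, 40, 40, 40]},
    index=["taxon_a", "taxon_b", "taxon_c", "taxon_d"],
)

def shannon(column: pd.Series) -> float:
    p = column / column.sum()  # relative abundances
    p = p[p > 0]               # 0 * log(0) is treated as 0
    return float(-(p * np.log(p)).sum())

print(counts.apply(shannon))   # the evenly distributed sample_2 scores higher
```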

This tutorial aims to provide a solid foundation in using the latest Bioconductor tools and data structures supporting microbiome analysis. The session will follow the online book “Orchestrating Microbiome Analysis with Bioconductor”, a joint effort from many contributors from the Bioconductor community.

Monday, 16 September - afternoon session
Global biodata resources: challenges to long-term sustainability of a crucial data infrastructure

Life science data resources are numerous, distributed and interconnected, forming a singular (and arguably the largest) infrastructure for biological research globally. These resources are critical for guaranteeing the reproducibility and integrity of life sciences research. Despite their importance, individual biodata resources, and hence the infrastructure as a whole, are primarily supported through short-term funding mechanisms. Unlike other scientific infrastructures, there is no overall coordination, and new biodata resources emerge organically as the scientific community responds to its data needs, further increasing competition for funding. Sustainably funding this distributed infrastructure is a key challenge: the Global Biodata Coalition (GBC) is working with the funders who support many of these resources to ensure long-term funding for existing infrastructure, while also channelling support to underpin future growth in data volumes and new technologies.

In this session, GBC will present an overview of its work to characterise the worldwide biodata infrastructure (the Global Core Biodata Resources and an inventory of biodata resources). Managers of data resources and aggregators will be invited to demonstrate the context of the entire infrastructure and to explore the scope and scale of connections and dependencies between resources, the funding sources that sustain them, and the impacts of the funding uncertainty associated with the underlying resources.

The goals of the session are to raise awareness of the criticality of this globally connected infrastructure, of its dependence on distributed resources, and of the relative fragility of an infrastructure that is generally taken for granted by its users.

Monday, 16 September - afternoon session
Exploring and analysing the protein universe with 3D-Beacons and AlphaFold DB

Emerging structural predictions, including models for over 200 million protein sequences in the AlphaFold Database, are transforming our biological insights. The 3D-Beacons network enhances this revolution by providing straightforward access to a wide array of protein structures, both experimentally determined and computationally predicted.

Foldseek, a tool for searching structurally similar proteins, facilitates the exploration of the protein universe by enabling the rapid comparison of large sets of structures. During the workshop, we will employ Foldseek to navigate structures within the AlphaFold and PDBe databases. Participants will gain hands-on experience with the AlphaFold Database and integrate PDBe-KB for an enriched, knowledge-based analysis. Through hands-on exercises, we will demonstrate how to derive biological insights and utilise these databases for research, focusing on protein variants and ligand interactions. Our goal is to enhance attendees’ ability to conduct structural biology data analysis, improving their understanding of computational models’ reliability and applications.
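
Programmatic access to the predicted models is straightforward. As a small sketch, the snippet below downloads one AlphaFold Database entry by UniProt accession; the file-naming pattern matches the database's download convention at the time of writing, but the model version suffix (v4 here) changes as the database is updated:

```python
# Fetch a predicted structure from the AlphaFold Database by UniProt accession.
import requests

accession = "P69905"  # human haemoglobin subunit alpha, as an example
url = f"https://alphafold.ebi.ac.uk/files/AF-{accession}-F1-model_v4.pdb"

response = requests.get(url, timeout=30)
response.raise_for_status()

with open(f"AF-{accession}.pdb", "w") as handle:
    handle.write(response.text)
print(f"Downloaded {len(response.text.splitlines())} PDB lines for {accession}")
```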

By delving into both experimentally determined and predicted macromolecular structure models via the 3D-Beacons network, this session offers a comprehensive exploration of the AlphaFold database and PDBe, supplemented by Foldseek. Attendees will leave with a broader toolkit for structural biology, ready to apply computational predictions and knowledge-based insights to their research, thereby deepening their understanding of the protein universe.

Monday, 16 September - afternoon session
Practical and pragmatic FAIRification targeting ELIXIR Communities (ELIXIR)

We will demonstrate our FAIRification approach for improving the FAIRness of life science data through systematic, practical and easy-to-adopt processes, tools and recommendations. This approach has been defined and verified through real-world FAIR implementations developed to enable and support FAIR data stewardship in life science research projects (IMI) and pharmaceutical research environments.

The FAIR principles were established in 2016 and have seen widespread adoption across the life sciences as a set of guidelines promoting good data management and stewardship. However, despite wide adoption of the principles themselves, practical details on how to implement them to improve FAIR levels are often too generic and lack the domain-specificity needed to take concrete steps towards genuinely value-added FAIRified data. Within FAIRplus, we developed a practical and systematic framework for data FAIRification, which is being further refined through the ELIXIR Interoperability Platform programme. Our approach consists of a set of generalised processes that data-generating projects can adopt in a domain-agnostic manner, together with a FAIR maturity model that establishes the current state and desired target state of a dataset. These data states are systematically defined using our process, informed by practical domain-specific requirements. In this workshop, aligned with the ELIXIR Interoperability programme of work, we will target ELIXIR communities to improve the FAIRness of their research outputs. Participants will learn how to assess their own level of FAIRness against the FAIR principles and how to practically and easily apply the FAIRplus methodology and tools to improve their research.

Tuesday, 17 September - morning session
Tidyomics: a modern omic analysis ecosystem

The exponential growth of omic data presents data manipulation, analysis, and integration challenges. Addressing these challenges, Bioconductor offers an extensive data analysis platform and community, while R tidy programming provides a standard for data organisation and manipulation that has revolutionised data science. Bioconductor and tidy R have mostly remained independent; bridging these two ecosystems would streamline omic analysis, ease learning and encourage cross-disciplinary collaborations. In this workshop, we introduce the tidyomics ecosystem, a suite of interoperable software that brings the vast tidy software ecosystem to omic data analysis.

Tuesday, 17 September - morning session
Computational approaches for identifying context-specific transcription factors using single-cell multi-omics datasets

Identifying transcription factors (TFs) that govern lineage specification and functional states is crucial for comprehending biological systems in both health and disease. A diverse array of TF activity inference approaches has emerged, leveraging input data from single-cell RNA sequencing (scRNA-seq), single-cell assay for transposase-accessible chromatin sequencing (scATAC-seq), Multiome, and Cellular Indexing of Transcriptomes and Epitopes by Sequencing (CITE-seq). These approaches aim to infer the activity of transcription factors within individual cells, integrating information from upstream epigenetic and signaling cascades. This integration enhances our understanding of both cell type-specific and consensus (multi-cell type) TF regulation.

In this tutorial, we will explore these TF activity inference methods using single-cell omics datasets (e.g. scRNA-seq, scATAC-seq, Multiome, and CITE-seq) through a combination of hybrid lectures and hands-on training sessions. Emphasis will be placed on elucidating the principles that underlie these methods, understanding their inherent assumptions, and evaluating the associated trade-offs. Participants will actively apply multiple inference methods, gaining practical experience in result interpretation. Additionally, strategies for in silico validation will be discussed to ensure robust analyses. By the end of the tutorial, the audience will possess practical knowledge and essential skills, enabling them to independently conduct TF activity inference on their own datasets. Furthermore, participants will be adept at interpreting results, fostering a comprehensive understanding of TF regulation in diverse cellular contexts.
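
To give a flavour of the underlying principle before the tutorial itself, the toy sketch below scores a TF in each cell as the signed mean z-score of its target genes' expression. This is a deliberate simplification of the statistical methods covered in the session, and the regulon used is hypothetical:

```python
# Core idea behind many TF activity inference methods: score a TF per cell
# by the (weighted) expression of its known target genes.
import numpy as np
import pandas as pd

# Toy cells-by-genes expression matrix (e.g. log-normalised scRNA-seq).
genes = ["g1", "g2", "g3", "g4", "g5"]
expr = pd.DataFrame(
    np.random.default_rng(1).random((4, 5)), columns=genes,
    index=[f"cell_{i}" for i in range(4)],
)

# Hypothetical regulon: the TF activates g1/g3 and represses g5.
regulon = {"g1": +1.0, "g3": +1.0, "g5": -1.0}

z = (expr - expr.mean()) / expr.std()           # z-score each gene across cells
weights = pd.Series(regulon).reindex(z.columns).fillna(0.0)
activity = z @ weights / np.abs(weights).sum()  # signed mean target z-score
print(activity)                                 # one activity value per cell
```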

Tuesday, 17 September - morning session
From zero to federated discovery: beaconize your biomedical data

Genomic variations significantly affect disease risks and trajectories, holding immense potential for human health. The complexity of human genomes means that large numbers of genomic analyses are needed to use modern genome technologies efficiently. Yet the widespread dispersion of genomic data in disconnected silos hinders effective data sharing.

Beacon v2 is a Global Alliance for Genomics and Health standard for data discovery services, with a focus on biomedical genomics, that aims to drive responsible and broad sharing of biomedical health data. The standard is designed to support networks of beacon instances that can swiftly identify where records are available globally, enhancing responsible data sharing in biomedicine.

However, not all institutions (infrastructures, research, or medical) are equally placed to implement a beacon: some may not have the necessary human and computational resources. To facilitate adoption of the standard, we are offering a tutorial presenting a series of tools for parties interested in deploying a beacon. These tools are centred around the Beacon Reference Implementation (RI), commissioned by ELIXIR to share existing and, so far, siloed datasets.

In this tutorial, our expert panellists will guide participants through (1) the extract, transform, and load (ETL) process from source data to the Beacon model, or directly leveraging OMOP instances, (2) making a beacon accessible to users by adjusting queries and responses and deploying a user interface, and (3) joining a network of beacon instances for federated discovery.
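
To illustrate what discovery looks like once a dataset has been beaconised, the sketch below sends a single-variant Beacon v2 request. The base URL is hypothetical, the payload follows the Beacon v2 request structure (meta plus query.requestParameters), and individual instances may require extra fields or authentication:

```python
# Query a Beacon v2 instance for a single genomic variant.
import requests

BEACON = "https://beacon.example.org/api"  # hypothetical beacon instance

payload = {
    "meta": {"apiVersion": "2.0"},
    "query": {
        "requestParameters": {
            "assemblyId": "GRCh38",
            "referenceName": "1",
            "start": [55051215],
            "referenceBases": "G",
            "alternateBases": "A",
        },
        "requestedGranularity": "boolean",  # just "does any record exist?"
    },
}

response = requests.post(f"{BEACON}/g_variants", json=payload, timeout=30)
response.raise_for_status()
print(response.json().get("responseSummary"))  # e.g. {"exists": true}
```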

Tuesday, 17 September - morning session
Learn the essentials of research data management and microbial genome submissions

Research Data Management (RDM) is the process of organising, documenting, and preserving data throughout a research project. It is important for the quality, integrity, and reproducibility of research results. Microbial genome submission is the process of submitting genome data and metadata to public databases, such as GenBank/EMBL/DDBJ, and is often required by funders and journals.

ELIXIR-DE is the German node of ELIXIR, the European infrastructure for life science data. ELIXIR-DE provides various services and resources for researchers, such as data storage, analysis tools, training materials, and standards. ELIXIR-DE also supports researchers in creating and implementing Data Management Plans (DMPs), which describe how data will be managed during and after a research project.

ELIXIR-DE has developed this training course to teach you the basics of RDM and DMPs, and help you submit your microbial genome data to public databases. The course is designed for researchers and students in the life sciences who want to improve their data management skills and practices.

The course consists of the following modules:

• RDM and DMPs: Learn the basics, importance, and standards of RDM and DMPs, such as the FAIR data principles.

• DMPs: Explore and create your own DMP using the DMP tools.

• Microbial genome submission: Learn how to prepare and submit your genome data and metadata to public databases.

By the end of the course, you will have gained an understanding of RDM and DMPs, and you will know how to obtain an accession number for your genomic data.

Join the course and learn the essentials of RDM and microbial genome submissions!

Tuesday, 17 September - afternoon session
Machine Learning Meets Omics: Explainable AI Strategies for Omics Data Analysis

Thanks to the increasing number of publicly available omics datasets, we now have the unique opportunity to integrate omics data from multiple sources into large-scale datasets. The complexity and size of such datasets typically require analysis with machine learning (ML) algorithms. The growing interest in using ML for omics data analysis comes with a need for education on how to train such models and interpret their results, especially since complex supervised ML models are often considered "black boxes" that offer limited insight into how their outputs are derived, which motivates explainable AI (XAI) methods.

In this tutorial, we will use the Google Colaboratory environment and the Python programming language to showcase how to train, optimize and evaluate an ML classifier on large-scale omics datasets, and how to subsequently apply state-of-the-art XAI methods to dissect the model's outputs. We will illustrate ML best practices and coding principles in a hands-on session on a transcriptomics dataset using sklearn and other relevant Python packages. We will further dive into important aspects of model training, such as hyperparameter optimization and validation strategies for multi-source datasets. Additionally, we will discuss common pitfalls when applying XAI methods, ensuring that participants not only gain technical proficiency but also a critical perspective on the interpretability of ML models in scientific research. We will conclude the tutorial with a practical session where participants will have the opportunity to apply their learnings to a genomics dataset or to their own data.
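
As a condensed preview of that workflow, the sketch below trains and tunes a classifier on a synthetic stand-in for an omics matrix and then probes it with permutation importance, one simple model-agnostic explanation method; the tutorial's actual models, datasets, and XAI methods may differ:

```python
# Train, tune, and explain a classifier on a synthetic omics-like matrix.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=300, n_features=50, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Hyperparameter optimization with cross-validation.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
    cv=5,
)
search.fit(X_train, y_train)
print("test accuracy:", search.score(X_test, y_test))

# Model-agnostic explanation: which features drive held-out performance?
result = permutation_importance(
    search.best_estimator_, X_test, y_test, n_repeats=10, random_state=0
)
print("top features:", result.importances_mean.argsort()[::-1][:5])
```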

Dataset references

1. Warnat-Herresthal (2020). https://doi.org/10.1016/j.isci.2019.100780

2. Grešová (2023). https://doi.org/10.1186/s12863-023-01123-8

Tuesday, 17 September - afternoon session
Single-cell ATAC-seq data analysis using ArchR and AVOCATO

The single-cell ATAC-seq (scATAC-seq) method provides a detailed profile of the chromatin accessibility landscape at single-cell resolution across many cell types. It can be used to analyse cellular and regulatory diversity and to map enhancers, promoters and other regulatory elements within different cell populations. The ability to profile the epigenomic landscape of many cell types makes scATAC-seq a great candidate for interpreting the results of genome-wide association studies (GWAS). GWAS is a commonly used approach to understand which genetic variants are linked to which diseases. Since most GWAS findings are located in non-coding regions of the human genome, scATAC-seq can expand our ability to interpret these findings functionally.

This tutorial consists of two parts. The first part will focus on understanding scATAC-seq data and how to analyse them using the ArchR software. ArchR is a commonly used R package for comprehensive analysis of scATAC-seq data. We will discuss filtering the data based on quality-control metrics, dimensionality reduction, clustering, and visualization for a scATAC-seq PBMC dataset. In the second part, we will introduce AVOCATO, a Snakemake workflow for understanding the relationship between genetic variants and the diseases they affect. AVOCATO provides automated identification of disease-relevant cell clusters and fine-mapping of genetic variants using scATAC-seq data. It also provides a user-friendly dynamic interface for users to interact with scATAC-seq data and all analysis results.
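
ArchR itself is an R package. Purely to illustrate the quality-control filtering step in a language-agnostic way, the Python sketch below drops low-quality cell barcodes using the two metrics ArchR filters on by default; the thresholds mirror ArchR's documented defaults but are tunable:

```python
# Filter scATAC-seq barcodes on TSS enrichment and fragment counts.
import pandas as pd

qc = pd.DataFrame({
    "barcode": ["AAAC", "AAAG", "AACT", "AGGT"],
    "tss_enrichment": [8.2, 1.3, 5.9, 12.1],
    "n_fragments": [15000, 400, 2500, 52000],
})

MIN_TSS_ENRICHMENT = 4.0  # mirrors ArchR's documented filterTSS default
MIN_FRAGMENTS = 1000      # mirrors ArchR's documented filterFrags default

cells = qc[(qc.tss_enrichment >= MIN_TSS_ENRICHMENT) & (qc.n_fragments >= MIN_FRAGMENTS)]
print(f"kept {len(cells)}/{len(qc)} barcodes")
```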

This tutorial will provide participants with proficiency in leveraging the ArchR and AVOCATO tools to analyze scATAC-seq data, enabling them to identify cell populations, assess regulatory landscapes, and uncover potential disease-relevant cell clusters.

Level: intermediate.

Tuesday, 17 September - afternoon session
Advancing synthetic data generation and dissemination for Life Sciences

This workshop aims to spotlight state-of-the-art synthetic data production and applications for the Life Sciences while addressing common challenges and opportunities. With a focus on collaboration and research, the workshop will showcase efforts and initiatives in this domain through (1) use cases developed within ELIXIR, with hands-on demos by the workshop organisers, and (2) use cases presented by selected workshop participants or in on-the-fly flash talks in which participants share their firsthand experiences with synthetic data.

To engage participants in discussions around key challenges in synthetic data generation, a World Café session will also be organised. Each of the up to four discussion questions will have its own corner equipped with whiteboards for idea-sharing; participants will rotate between corners, and table moderators will summarise the discussions at the end. Example discussion questions include accurate representation of data variability, benchmarking, balancing privacy and utility, temporal dependencies, susceptibility to adversarial attacks, user needs, and dissemination. The results of the World Café will be reported by the workshop moderators and later compiled into a comprehensive report, serving as a valuable resource for the ECCB community. The report will provide shared insights, recommendations, and avenues for future exploration in synthetic data generation.

The workshop aims to advance computational biology capabilities in synthetic data generation, dissemination, and research outcomes by facilitating knowledge exchange, tool demonstrations, and collaborative problem-solving among researchers. This initiative seeks to foster synergies, enhance research outcomes, and propel the Life Sciences community towards the forefront of synthetic data advancements.

Tuesday, 17 September - afternoon session
Navigating the ELIXIR Genomic Data Highway: Applying population genomics to health research

ECCB ELIXIR Workshop

Coordinators: Melissa Konopko (ELIXIR Hub), Salvador Capella-Gutierrez (Barcelona Supercomputing Center), Juan Arenas (ELIXIR Hub)

Embark on a journey through the future of European genomic datasets and their unified discovery and access. Within the 1+ Million Genomes (1+MG) initiative, the Genome of Europe (GoE) population genomics project, expected to start this year, aims to capture the genetic diversity of our continent with data from at least 500,000 citizens. It will be the first major federated cohort of data available across Europe based on the 1+MG infrastructure. It builds on the last 10 years' effort to advance common standards and open-source solutions through ELIXIR, the research infrastructure for life sciences funded by Member States and the EC. ELIXIR is coordinating the European Genomic Data Infrastructure (GDI) project to operationalise the infrastructure across 24 countries, aligning the 1+MG and ELIXIR efforts.

The workshop starts with an overview of the architecture and of the use cases, amounts, and types of data that will be available. Through a hands-on tutorial, we'll dive into practical applications, including what information will be available, how to access it, and how to apply it to, for example, polygenic risk scores and cancer. As the infrastructure is still being developed and deployed, the workshop will give you the opportunity to influence the functionalities that will be made available, making sure the infrastructure resonates with your research needs. Get a sneak peek and provide important feedback on this major project in ways that will maximise the utility of the European data infrastructure for computational health research.
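
As a taste of the polygenic-risk-score application mentioned above: a basic PRS is simply a weighted sum of effect-allele dosages, PRS_j = sum_i beta_i * dosage_ij. The variants and effect sizes in this sketch are invented for illustration:

```python
# Compute toy polygenic risk scores as a weighted sum of allele dosages.
import numpy as np

# Genotype dosages (0/1/2 copies of the effect allele), individuals x variants.
dosages = np.array([
    [0, 1, 2, 1],
    [2, 0, 1, 0],
    [1, 1, 1, 2],
])
betas = np.array([0.12, -0.05, 0.30, 0.08])  # per-variant effect sizes (e.g. log odds)

prs = dosages @ betas  # one score per individual
print(prs)
```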