🎯 About Me
Hi and welcome to my website! My name is Alba, and I’m a Sandbox Data Scientist at the Health Data Science Center at the University of Copenhagen. With a strong background in large-scale genomics and pipeline development, I specialize in building and maintaining computational environments on HPC platforms.
My work focuses on research, education, and the development of computational tools for bioinformatics and data science. I aim to help researchers manage research data more effectively and promote FAIR (Findable, Accessible, Interoperable, and Reusable) principles.
What I Do
I am part of a team that develops online training modules and offers computational services for researchers. My focus is on equipping researchers with the skills and tools needed for effective, FAIR data management and analysis. This includes lectures, workshops, and computational support covering essential tools for data analysis and reproducibility, including:
- 🌀 Git & GitHub (for version control and collaboration)
- ⚙️ Snakemake & Nextflow (for workflow automation)
- 🐳 Docker & Conda (for containerization and environment management)
- 🛠️ Cookiecutter (for reproducible project templates)
- 💡 Shiny Apps (for interactive data visualization)
My Approach
I believe that practical, hands-on learning is essential, which is why I design interactive training materials, workshops, and real-world projects. All training materials, including web-based exercises, are openly available on the project's website.
📦 Sandbox Project: Training modules & Apps
As part of the national Sandbox Project, we develop a range of containerized apps and training modules to support researchers in bioinformatics and research data management (RDM). These resources combine notebooks, coding exercises, and interactive learning materials, hosted on a GitHub Pages website built with Quarto. Our materials are fully version-controlled, ensuring consistency, transparency, and easy access to updates. Below are some of the key modules I contribute to:
Two of our training modules equip researchers with the essential tools and concepts for effective Research Data Management (RDM) and for developing workflows and managing software on High-Performance Computing (HPC) systems.
Objective: Equip researchers in the omics field with the best RDM practices, helping them better organize, integrate, and visualize their data.
Key Concepts Covered:
- Research Data Management (RDM) concepts
- Best practices in data organization and integration
- Tools to support effective data management
💡 Why it matters: Effective RDM saves time and improves reproducibility for researchers working with large, complex datasets.
Objective: Introduce researchers to pipelines and workflows for bioinformatics, especially those on High-Performance Computing (HPC) systems.
Key Concepts Covered:
- Workflow management using Snakemake and Nextflow
- Using community-driven pipelines like nf-core
- Building efficient and scalable workflows to reduce data management time
⚙️ My Role: Main contributor to the HPC Pipes module, which demonstrates how to structure workflows, set resource requirements, and optimize performance, helping researchers save time and effort.
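As a rough illustration of the kind of pattern such a module teaches (the rule, tool, and file names here are hypothetical, not taken from the actual course material), a Snakemake rule can declare the resources a scheduler should reserve for it directly alongside the step it runs:

```snakemake
# Hypothetical example: an alignment step that declares its
# resource needs so the HPC scheduler can place it efficiently.
rule align_reads:
    input:
        reads="data/{sample}.fastq.gz",
        index="reference/genome.idx",
    output:
        "results/{sample}.bam",
    threads: 8
    resources:
        mem_mb=16000,   # memory to request from the scheduler
        runtime=120,    # wall-clock limit in minutes
    shell:
        "aligner --threads {threads} --index {input.index} "
        "--reads {input.reads} --out {output}"
```

Keeping resource declarations next to each rule, rather than in separate submission scripts, is what lets the same workflow scale from a laptop to an HPC cluster without edits.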
Containerized Apps
We also create portable, reproducible, and system-agnostic apps using Docker, enabling seamless deployment across platforms. While primarily deployed on Danish HPC systems, all content is openly available on GitHub, allowing others to deploy them on any compatible system.
- Genomics App: Tools and tutorials for genomic data analysis.
- Transcriptomics App: Tools for working with transcriptomics data.
By containerizing our tools with Docker, we ensure consistent behavior across platforms, making it easier for researchers to replicate results.
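A minimal sketch of how such an app might be containerized (the base image, packages, and paths below are illustrative assumptions, not the project's actual Dockerfiles): pinning the base image and dependencies is what makes the app behave identically on any host with Docker installed.

```dockerfile
# Hypothetical Dockerfile for a Shiny-based analysis app.
# Pin the base image version so every build starts from the same environment.
FROM rocker/shiny:4.3.1

# Install the R packages the (illustrative) app depends on.
RUN R -e "install.packages(c('shiny', 'ggplot2'), repos='https://cloud.r-project.org')"

# Copy the app into the image and expose the standard Shiny port.
COPY app/ /srv/shiny-server/genomics-app/
EXPOSE 3838
```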