Evolution of the GAIA Data Mining Platform To Near-Petabyte Scale

P10
12 Nov 2025, 11:00
15m
Synagoge

Görlitz
oral presentation
Science platforms in the big data era
Plenary Session 10

Speaker

Malcolm Illingworth (University Of Edinburgh)

Description

The GAIA Data Mining Platform provides interactive, JupyterHub-based access to the GAIA Data Release 3 dataset, which comprises 7 TB of data.

The GAIA Data Release 4 dataset is expected to be in excess of 600 TB.
We describe our progress in evolving the GAIA Data Mining Platform into a modern, Kubernetes-based, platform-independent deployment named Astroflow, which adds Dask functionality to the existing large-scale Spark analytical processing.
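
For a sense of the jump in scale, the two dataset sizes quoted above can be compared directly. The following is a rough sketch only, assuming decimal units (1 PB = 1000 TB) and treating the 600 TB figure as a lower bound; the variable names are illustrative and do not come from the platform itself.

```python
# Dataset sizes quoted in the abstract, in terabytes.
DR3_TB = 7
DR4_TB = 600  # stated as "in excess of", so this is a lower bound

# Assumption: decimal units, 1 PB = 1000 TB.
TB_PER_PB = 1000

# Minimum growth factor from DR3 to DR4.
growth_factor = DR4_TB / DR3_TB

# Minimum DR4 size expressed in petabytes.
dr4_pb = DR4_TB / TB_PER_PB

print(f"DR4 is at least {growth_factor:.0f}x the size of DR3")
print(f"DR4 is at least {dr4_pb:.1f} PB")
```

Under these assumptions the new release is roughly two orders of magnitude larger than the current one, which is the motivation for the architectural changes described here.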

In conjunction with the closely related SPACIOUS project, we report our findings and successes in deploying the platform to both on-premises (OpenStack) and commercial (Google) cloud platforms.

We outline our plans to incorporate Apache Iceberg into our architecture so that it scales efficiently to support the forthcoming GAIA DR4 release, and to use the data lake model to combine multiple future data sources for large-scale analytical processing and data mining in our interactive environment.

Affiliation of the submitter

Institute For Astronomy, University Of Edinburgh

Attendance

in-person

Primary authors

Brendan O'Brien (University Of Edinburgh)
Enrique Molina (ESA)
Malcolm Illingworth (University Of Edinburgh)
Nigel Hambly (University Of Edinburgh)
Simon Harnqvist (University Of Edinburgh)
