The training dataset can be found on the download page.

General Files

Inside both the training and test data you will find AuxillaryTable.csv and SpectralData.hdf5. The Ground Truth Package is available in the training data for supervised/semi-supervised training. A detailed description of each file is given below:


AuxillaryTable.csv

Contains auxiliary data about the planetary system. There are 9 features in total: Star Distance, Stellar Mass, Stellar Radius, Stellar Temperature, Planet Mass, Orbital Period, Semi-Major Axis, Planet Radius, and Surface Gravity. This information is unique to each planet and is sourced from various exoplanet databases. There are 91,392 examples in total.

To load the dataset:

import pandas as pd
AuxTable = pd.read_csv('AuxillaryTable.csv', index_col='planet_ID')       ### load file
AuxTable.head(1)                                                          ### preview the table
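Before training, it is common to standardise the auxiliary features so they share a comparable scale. The sketch below assumes nothing about the real file: it builds a tiny hypothetical frame with two of the columns listed above (the exact column names in AuxillaryTable.csv may differ) and standardises each feature to zero mean and unit variance.

```python
import pandas as pd

# Hypothetical miniature of AuxillaryTable.csv with two of the nine
# features; real column names and values may differ.
AuxTable = pd.DataFrame(
    {'Stellar Mass': [0.8, 1.0, 1.2], 'Planet Radius': [0.9, 1.1, 1.5]},
    index=pd.Index([0, 1, 2], name='planet_ID'),
)

# Standardise each column to zero mean and unit variance, a common
# preprocessing step before feeding tabular features to a model.
features = (AuxTable - AuxTable.mean()) / AuxTable.std()
print(features.round(3))
```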


SpectralData.hdf5

Contains the spectroscopic information about each observation, including the wavelength grid, spectrum, uncertainty, and bin width. The file is nested by planet ID:

Planet ID
   | --- instrument_wlgrid
   | --- instrument_spectrum
   | --- instrument_noise
   | --- instrument_width

To load the dataset:

import h5py
SpectralData = h5py.File('SpectralData.hdf5', 'r')    ### load file
planetlist = [p for p in SpectralData.keys()]
## access wlgrid, spectrum, noise and wlwidth of a single planet instance
wlgrid = SpectralData[planetlist[0]]['instrument_wlgrid'][:]
spectrum = SpectralData[planetlist[0]]['instrument_spectrum'][:]
noise = SpectralData[planetlist[0]]['instrument_noise'][:]
wlwidth = SpectralData[planetlist[0]]['instrument_width'][:]

Below is a small function to convert the .hdf5 format into a matrix:

import numpy as np

def to_matrix(SpectralData):
    # planet IDs are in ascending order
    num = len(SpectralData.keys())
    id_order = np.arange(num)
    # the instrument resolution (52 wavelength bins) is known beforehand
    observed_spectrum = np.zeros((num, 52, 4))
    for idx, x in enumerate(id_order):
        current_id = f'Planet_{x}'
        wlgrid = SpectralData[current_id]['instrument_wlgrid'][:]
        spectrum = SpectralData[current_id]['instrument_spectrum'][:]
        noise = SpectralData[current_id]['instrument_noise'][:]
        wlwidth = SpectralData[current_id]['instrument_width'][:]
        observed_spectrum[idx, :, :] = np.stack([wlgrid, spectrum, noise, wlwidth], axis=-1)
    return observed_spectrum
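To check that the stacking logic behaves as intended without the real download, one can exercise it on a small synthetic file. The snippet below is an illustration, not part of the official data: it builds an in-memory HDF5 file with the same layout as SpectralData.hdf5 (group per planet, four arrays of 52 bins each, matching the resolution assumed by `to_matrix`) and stacks it into a matrix.

```python
import h5py
import numpy as np

# Synthetic stand-in for SpectralData.hdf5: two planets, 52 bins each.
demo = h5py.File('demo.hdf5', 'w', driver='core', backing_store=False)
keys = ('instrument_wlgrid', 'instrument_spectrum',
        'instrument_noise', 'instrument_width')
for i in range(2):
    grp = demo.create_group(f'Planet_{i}')
    for key in keys:
        grp.create_dataset(key, data=np.random.rand(52))

# Same stacking as to_matrix: one (52, 4) slab per planet.
num = len(demo.keys())
observed = np.stack([
    np.stack([demo[f'Planet_{i}'][k][:] for k in keys], axis=-1)
    for i in range(num)
])
print(observed.shape)  # (2, 52, 4)
demo.close()
```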

Ground Truth Packages

Contains the ground truth for both tracks. Tracedata.hdf5 is designed for the Regular Track and QuartilesTable.csv is intended for the Light Track. However, only 24% of the inputs are accompanied by ground truth; the rest are unlabelled. We have also provided FM_Parameter_Table.csv for semi-supervised learning.


Tracedata.hdf5

The posterior distributions were generated using MultiNest. The HDF5 file contains two arrays per planet: Tracedata and Weights. The tracedata are the likelihood-evaluated points, L, produced by MultiNest. Weights, w, are the importance weights, see equation 10 here. To recover the posterior distribution, p, weight each trace point by its corresponding importance weight (e.g. via a weighted histogram or weighted percentiles).

Tracedata.hdf5 is nested by planet ID with Tracedata and Weights being standard numpy arrays:

Planet ID
   | --- tracedata
   | --- weights

The length of each tracedata/weights array varies with the convergence speed of the MultiNest algorithm, but each contains at least 1,500 data points.

To read the data in Python, you can use the script below:

import h5py
infile     = h5py.File('Tracedata.hdf5', 'r')      ### loading in Tracedata.hdf5
planetlist = [p for p in infile.keys()]            ### getting list of planets in file
trace      = infile[planetlist[0]]['tracedata'][:] ### accessing Nested Sampling trace data
weights    = infile[planetlist[0]]['weights'][:]   ### accessing Nested Sampling weight data
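The trace points and importance weights together define the posterior, so summary statistics must be weighted. The helper below is an illustrative sketch (not provided by the organisers) of a weighted percentile, the quantity reported in QuartilesTable.csv, demonstrated on a synthetic uniformly weighted trace.

```python
import numpy as np

def weighted_percentile(trace, weights, q):
    """Percentile of a weighted sample: sort the points, accumulate the
    normalised weights, and interpolate at the requested quantile."""
    order = np.argsort(trace)
    trace, weights = trace[order], weights[order]
    cdf = np.cumsum(weights)
    cdf /= cdf[-1]
    return np.interp(q / 100.0, cdf, trace)

# Synthetic example: equally weighted standard-normal draws.
rng = np.random.default_rng(0)
trace = rng.normal(size=5000)
weights = np.full(5000, 1.0 / 5000)

q1, q2, q3 = (weighted_percentile(trace, weights, q) for q in (16, 50, 84))
print(q1, q2, q3)  # roughly -1, 0, +1 for a standard normal
```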


QuartilesTable.csv

Contains the 16th (q1), 50th (q2) and 84th (q3) percentiles of each atmospheric target for one-quarter of the instances.

Planet ID | Parameter_q1 | Parameter_q2 | ... | Parameter_q3
--------- | ------------ | ------------ | --- | ------------
1         | Value        | Value        | ... | Value
2         | Value        | Value        | ... | Value
...       | Value        | Value        | ... | Value

Parameters: T, log_H2O, log_CO2, log_CH4, log_CO, log_NH3 (i.e. T_q1, T_q2, T_q3, etc.)
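The quartile columns can be gathered into an array per planet for scoring or sanity checks. The snippet below is a sketch on a hypothetical one-row miniature of the table; only the `T_q1`/`T_q2`/`T_q3` column-naming scheme is taken from the description above.

```python
import pandas as pd

# Hypothetical miniature of QuartilesTable.csv with one target (T);
# the real file has six targets, each with _q1/_q2/_q3 columns.
quartiles = pd.DataFrame(
    {'T_q1': [900.0], 'T_q2': [1000.0], 'T_q3': [1100.0]},
    index=pd.Index([0], name='planet_ID'),
)

# Collect the three percentiles of each target into a (planets, 3) array.
targets = ['T']
values = quartiles[[f'{t}_q{i}' for t in targets for i in (1, 2, 3)]].to_numpy()
print(values.shape)  # (1, 3): one planet, three percentiles
```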


FM_Parameter_Table.csv

Contains the input values used to generate the high-resolution spectra (forward model) before they are binned to Ariel resolution. These values are not the ground truth, even though in some cases they are close to it. Available for all instances.

Planet ID | Parameter1 | Parameter2 | ... | Parameter3
--------- | ---------- | ---------- | --- | ----------
1         | Value      | Value      | ... | Value
2         | Value      | Value      | ... | Value
...       | Value      | Value      | ... | Value

Unfortunately, the ground truth is not available for the test data.

Upload Format

To upload your surrogate distributions, please follow the same data structure as Tracedata.hdf5 above. At the end of the competition, the top 10 participants (as ranked on the leaderboard) will be asked to run their code on an additional set of planets.

We have also written a helper function to convert your output into the correct format.
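The official helper function is not reproduced here, but a minimal sketch of the expected layout follows: one group per planet containing `tracedata` and `weights` datasets, matching the keys used by the reading script above. The planet ID format, sample counts, and the equal-weight surrogate are illustrative assumptions.

```python
import h5py
import numpy as np

rng = np.random.default_rng(1)
n_planets, n_samples, n_targets = 3, 1500, 6

# Write a submission file in the Tracedata.hdf5 layout.
with h5py.File('submission.hdf5', 'w') as out:
    for i in range(n_planets):
        grp = out.create_group(f'Planet_{i}')  # assumed ID format
        samples = rng.normal(size=(n_samples, n_targets))  # surrogate draws
        weights = np.full(n_samples, 1.0 / n_samples)      # equal weights
        grp.create_dataset('tracedata', data=samples)
        grp.create_dataset('weights', data=weights)

# Read it back with the same access pattern as the reading script.
with h5py.File('submission.hdf5', 'r') as infile:
    planetlist = list(infile.keys())
    trace = infile[planetlist[0]]['tracedata'][:]
    weights = infile[planetlist[0]]['weights'][:]
print(trace.shape, weights.sum())
```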

You can upload your predictions here