The training dataset can be found on the download page.

General Files

Inside both the training and test data you will find AuxillaryTable.csv and SpectralData.hdf5. The Ground Truth Package is available only in the training data, for supervised or semi-supervised training. Each file is described in detail below:

`AuxillaryTable.csv`

Contains auxiliary data about the planetary system. There are 9 features in total: Star Distance, Stellar Mass, Stellar Radius, Stellar Temperature, Planet Mass, Orbital Period, Semi-Major Axis, Planet Radius, Surface Gravity. This information is unique to each planet and is sourced from various exoplanet databases. There are 91,392 examples in total.

To load the dataset:

import pandas as pd
AuxTable = pd.read_csv('AuxillaryTable.csv', index_col='planet_ID')       ### load file, using planet_ID as the index
AuxTable.head(1)                                                          ### preview the table

`SpectralData.hdf5`

Contains the spectroscopic information for each observation: the wavelength grid, spectrum, uncertainty and bin width.

Planet ID
   | --- instrument_wlgrid
   | --- instrument_spectrum
   | --- instrument_noise
   | --- instrument_width

To load the dataset:

import h5py
SpectralData = h5py.File('SpectralData.hdf5', 'r')    ### load file (read-only)
planetlist = [p for p in SpectralData.keys()]
## access wlgrid, spectrum, noise and wlwidth of a single planet instance
wlgrid = SpectralData[planetlist[0]]['instrument_wlgrid'][:]
spectrum = SpectralData[planetlist[0]]['instrument_spectrum'][:]
noise = SpectralData[planetlist[0]]['instrument_noise'][:]
wlwidth = SpectralData[planetlist[0]]['instrument_width'][:]

Below is a small function to convert the .hdf5 format into a matrix:

import numpy as np

def to_matrix(SpectralData):
    # planet IDs are assumed to run from Planet_0 upwards, in ascending order
    num = len(SpectralData.keys())
    id_order = np.arange(num)
    # the instrument resolution is known beforehand: 52 wavelength bins, 4 quantities per bin
    observed_spectrum = np.zeros((num,52,4))
    for idx, x in enumerate(id_order):
        current_id = f'Planet_{x}'
        wlgrid = SpectralData[current_id]['instrument_wlgrid'][:]
        spectrum = SpectralData[current_id]['instrument_spectrum'][:]
        noise = SpectralData[current_id]['instrument_noise'][:]
        wlwidth = SpectralData[current_id]['instrument_width'][:]
        # stack the four quantities along the last axis: (wlgrid, spectrum, noise, wlwidth)
        observed_spectrum[idx,:,:] = np.concatenate([wlgrid[...,np.newaxis],spectrum[...,np.newaxis],noise[...,np.newaxis],wlwidth[...,np.newaxis]],axis=-1)
    return observed_spectrum
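
The function above hard-codes the number of wavelength bins and the `Planet_{x}` naming. A minimal sketch of a variant that reads both from the data is shown here. The name `to_matrix_generic` and the `FIELDS` tuple are hypothetical, and a nested dict of NumPy arrays stands in for the HDF5 file, since only mapping-style access is needed:

```python
import numpy as np

# Dataset names as listed in the file structure above.
FIELDS = ("instrument_wlgrid", "instrument_spectrum",
          "instrument_noise", "instrument_width")

def to_matrix_generic(spectral_data):
    """Stack the four per-planet arrays into shape (n_planets, n_bins, 4).

    Returns the planet IDs in iteration order alongside the matrix,
    so each row can be matched back to its planet.
    """
    planet_ids = list(spectral_data.keys())
    # Infer the number of wavelength bins from the first planet.
    n_bins = len(spectral_data[planet_ids[0]][FIELDS[0]][:])
    out = np.zeros((len(planet_ids), n_bins, len(FIELDS)))
    for i, pid in enumerate(planet_ids):
        out[i] = np.stack([spectral_data[pid][f][:] for f in FIELDS], axis=-1)
    return planet_ids, out

# Synthetic stand-in for SpectralData.hdf5 (assumption: 52 bins, 3 planets).
fake = {f"Planet_{i}": {f: np.full(52, float(i)) for f in FIELDS}
        for i in range(3)}
ids, matrix = to_matrix_generic(fake)
```

Passing an open `h5py.File` in place of `fake` should work the same way, because groups and datasets support the same indexing used here.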

Ground Truth Packages

Contains the ground truth for both tracks. Tracedata.hdf5 is designed for the Regular Track and QuartilesTable.csv is intended for the Light Track. However, only 24% of the inputs come with ground truth; the rest are unlabelled. We have also provided FM_Parameter_Table.csv for semi-supervised learning.

`Tracedata.hdf5`

The posterior distributions were generated using MultiNest. The hdf5 file contains two arrays per planet: Tracedata and Weights. The tracedata are the points at which MultiNest evaluated the likelihood, L. The weights, w, are the importance weights (see equation 10 here). To calculate the posterior distribution, p:

$p_i = \frac{L_i w_i}{\sum_j L_j w_j}$
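
As a small numerical illustration of the normalisation above, the sketch below applies the formula to made-up values. The arrays `L` and `w` are hypothetical placeholders, not values taken from the dataset:

```python
import numpy as np

# Hypothetical likelihood values and importance weights for three points.
L = np.array([1.0, 2.0, 1.0])
w = np.array([0.5, 0.25, 0.25])

# p_i = L_i * w_i / sum_j (L_j * w_j)
unnormalised = L * w
p = unnormalised / unnormalised.sum()
```

By construction the entries of `p` are non-negative and sum to one.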

Tracedata.hdf5 is nested by planet ID with Tracedata and Weights being standard numpy arrays:

Planet ID
   | --- Tracedata
   | --- Weights

The length of each tracedata/weights array varies with the convergence speed of the MultiNest algorithm, but each array contains a minimum of 1500 data points.

To read the data in Python, you can use the script below:

import h5py
infile     = h5py.File('Tracedata.hdf5')           ### loading in tracedata.hdf5
planetlist = [p for p in infile.keys()]            ### getting list of planets in file
trace      = infile[planetlist[0]]['tracedata'][:] ### accessing Nested Sampling trace data
weights    = infile[planetlist[0]]['weights'][:]   ### accessing Nested Sampling weight data

`QuartilesTable.csv`

Contains the 16th (q1), 50th (q2) and 84th (q3) percentiles of each atmospheric target for one-quarter of the instances.

Planet ID	Parameter_q1	Parameter_q2	Parameter_q3	…
1	Value	Value	Value	Value
2	Value	Value	Value	Value
…	Value	Value	Value	Value

Parameters: T, log_H2O, log_CO2, log_CH4, log_CO, log_NH3 (i.e. T_q1, T_q2, T_q3, etc.)
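
To relate the percentiles to the weighted samples in Tracedata.hdf5, here is a minimal sketch of a weighted percentile estimate. It assumes the trace has one column per parameter and that the weights act as plain sample weights. The function name `weighted_percentile` is hypothetical, the data are synthetic, and this is not necessarily the procedure used to produce QuartilesTable.csv:

```python
import numpy as np

def weighted_percentile(samples, weights, q):
    """Return the q-th percentile (0-100) of 1-D samples under the given weights."""
    samples = np.asarray(samples, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Sort samples and carry the weights along.
    order = np.argsort(samples)
    samples, weights = samples[order], weights[order]
    # Normalised cumulative weight acts as an empirical CDF.
    cdf = np.cumsum(weights) / np.sum(weights)
    # First sample whose cumulative weight reaches q/100.
    idx = np.searchsorted(cdf, q / 100.0)
    return samples[min(idx, len(samples) - 1)]

# Synthetic stand-in for one planet (assumption: 6 parameters, 2000 points).
rng = np.random.default_rng(0)
trace = rng.normal(size=(2000, 6))
weights = np.full(2000, 1.0 / 2000)

# One row per parameter, columns are (q1, q2, q3).
quartiles = np.array([
    [weighted_percentile(trace[:, k], weights, q) for q in (16, 50, 84)]
    for k in range(trace.shape[1])
])
```

With real data, `trace` and `weights` would come from the loading script for Tracedata.hdf5 rather than from a random generator.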

`FM_Parameter_Table.csv`

Contains the input values used to generate the high-resolution spectra (Forward Model) before they are binned to Ariel resolution. These values are not the ground truth, even though in some cases they are close to it. Available for all instances.

Planet ID	Parameter1	Parameter2	Parameter3	…
1	Value	Value	Value	Value
2	Value	Value	Value	Value
…	Value	Value	Value	Value

Unfortunately, the ground truth is not available for the test data.

Upload Format

To upload your surrogate distributions, please follow the same data structure as for Tracedata.hdf5 above. At the end of the competition, the top 10 participants (as ranked on the leaderboard) will be asked to run their code on an additional set of planets.
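
For orientation, a minimal sketch of writing predictions in a planet-nested layout is given here. It is not the official helper function. The function name `write_submission`, the planet key `Planet_0` and the lowercase dataset names `tracedata` and `weights` (taken from the loading script above) are assumptions, so check them against the official helper before uploading:

```python
import os
import tempfile

import h5py
import numpy as np

def write_submission(path, predictions):
    """Write {planet_id: (tracedata, weights)} to an HDF5 file.

    Each planet becomes a group holding two datasets,
    mirroring the nesting described for Tracedata.hdf5.
    """
    with h5py.File(path, "w") as f:
        for planet_id, (trace, weights) in predictions.items():
            grp = f.create_group(planet_id)
            grp.create_dataset("tracedata", data=np.asarray(trace))
            grp.create_dataset("weights", data=np.asarray(weights))

# Synthetic example: one planet, 1500 samples of 6 parameters, uniform weights.
rng = np.random.default_rng(0)
demo = {"Planet_0": (rng.normal(size=(1500, 6)), np.full(1500, 1.0 / 1500))}
demo_path = os.path.join(tempfile.mkdtemp(), "submission_sketch.hdf5")
write_submission(demo_path, demo)

# Read the file back to confirm the structure.
with h5py.File(demo_path, "r") as f:
    read_keys = sorted(f["Planet_0"].keys())
    read_shape = f["Planet_0"]["tracedata"].shape
```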

We have also written a helper function to convert your output into the correct format.

You can upload your predictions here
