The training dataset can be found on the download page.

Inside both the training and test data you will find `AuxillaryTable.csv` and `SpectralData.hdf5`. The Ground Truth Package is available in the training data for supervised/semi-supervised training. Below is a detailed description of each file:

`AuxillaryTable.csv`

Contains auxiliary data about each planetary system. There are 9 features in total: Star Distance, Stellar Mass, Stellar Radius, Stellar Temperature, Planet Mass, Orbital Period, Semi-Major Axis, Planet Radius and Surface Gravity. These values are unique to each planet and are sourced from various exoplanet databases. There are 91,392 examples in total.

To load the dataset:

```
import pandas as pd
AuxTable = pd.read_csv('AuxillaryTable.csv', index_col='planet_ID') ### load file
AuxTable.head(1) ### preview the table
```

`SpectralData.hdf5`

Contains the spectroscopic information about each observation, including the wavelength grid, spectrum, uncertainty and bin width:

```
Planet ID
| --- instrument_wlgrid
| --- instrument_spectrum
| --- instrument_noise
| --- instrument_width
```

To load the dataset:

```
import h5py
SpectralData = h5py.File('SpectralData.hdf5', 'r') ### load file
planetlist = [p for p in SpectralData.keys()]
## access wlgrid, spectrum, noise and wlwidth of a single planet instance
wlgrid = SpectralData[planetlist[0]]['instrument_wlgrid'][:]
spectrum = SpectralData[planetlist[0]]['instrument_spectrum'][:]
noise = SpectralData[planetlist[0]]['instrument_noise'][:]
wlwidth = SpectralData[planetlist[0]]['instrument_width'][:]
```

Below is a small function to convert the .hdf5 format into a matrix:

```
import numpy as np

def to_matrix(SpectralData):
    # planet IDs are in ascending order
    num = len(SpectralData.keys())
    id_order = np.arange(num)
    # the instrument resolution (52 bins) is known beforehand
    observed_spectrum = np.zeros((num, 52, 4))
    for idx, x in enumerate(id_order):
        current_id = f'Planet_{x}'
        wlgrid = SpectralData[current_id]['instrument_wlgrid'][:]
        spectrum = SpectralData[current_id]['instrument_spectrum'][:]
        noise = SpectralData[current_id]['instrument_noise'][:]
        wlwidth = SpectralData[current_id]['instrument_width'][:]
        # stack the four arrays along a new last axis -> shape (52, 4)
        observed_spectrum[idx, :, :] = np.stack([wlgrid, spectrum, noise, wlwidth], axis=-1)
    return observed_spectrum
```
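As a quick sanity check of the stacking step, the four per-planet arrays combine into a single `(52, 4)` matrix. A minimal sketch with synthetic arrays standing in for the real hdf5 data (the values below are placeholders, not actual Ariel numbers):

```python
import numpy as np

# synthetic stand-ins for one planet's arrays (real ones come from SpectralData.hdf5)
wlgrid = np.linspace(0.5, 7.8, 52)
spectrum = np.full(52, 0.01)
noise = np.full(52, 1e-4)
wlwidth = np.diff(wlgrid, prepend=wlgrid[0] - (wlgrid[1] - wlgrid[0]))

# stack along a new last axis -> shape (52, 4): wlgrid, spectrum, noise, wlwidth
planet_matrix = np.stack([wlgrid, spectrum, noise, wlwidth], axis=-1)
```

Each planet then contributes one such `(52, 4)` slice to the `(num, 52, 4)` output matrix.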

The Ground Truth Package contains the ground truth for both tracks. `Tracedata.hdf5` is designed for the Regular Track and `QuartilesTable.csv` is intended for the Light Track. However, only 24% of the inputs are complemented with ground truth; the rest are unlabelled. We have also provided `FM_Parameter_Table.csv` for semi-supervised learning.

`Tracedata.hdf5`

The posterior distributions were generated using MultiNest. The hdf5 file contains two arrays per planet: Tracedata and Weights. The trace data are the points at which MultiNest evaluated the likelihood, L; the Weights, w, are the corresponding importance weights (see equation 10 here). To calculate the posterior distribution, p, weight each trace sample by its corresponding importance weight.

Tracedata.hdf5 is nested by planet ID with Tracedata and Weights being standard numpy arrays:

```
Planet ID
| --- Tracedata
| --- Weights
```

The length of each tracedata/weights array will be variable depending on the convergence speed of the MultiNest algorithm but will contain a minimum of 1500 data points.

To read in the data (in Python) you can use the script below:

```
import h5py
infile = h5py.File('Tracedata.hdf5', 'r') ### loading in tracedata.hdf5
planetlist = [p for p in infile.keys()] ### getting list of planets in file
trace = infile[planetlist[0]]['tracedata'][:] ### accessing Nested Sampling trace data
weights = infile[planetlist[0]]['weights'][:] ### accessing Nested Sampling weight data
```
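Given a trace and its weights, weighted summary statistics (such as the percentiles reported in `QuartilesTable.csv`) can be computed directly. A sketch with a synthetic one-parameter trace standing in for a real MultiNest output; the `weighted_quantile` helper is illustrative, not part of the challenge code:

```python
import numpy as np

def weighted_quantile(values, quantiles, weights):
    """Quantiles of `values` under importance `weights` (need not sum to one)."""
    sorter = np.argsort(values)
    values, weights = values[sorter], weights[sorter]
    cdf = np.cumsum(weights) / np.sum(weights)
    return np.interp(quantiles, cdf, values)

# synthetic stand-in for one column of a trace, e.g. temperature samples
rng = np.random.default_rng(0)
trace = rng.normal(loc=1000.0, scale=50.0, size=2000)
weights = np.full(2000, 1.0 / 2000)  # uniform weights for this demo

q1, q2, q3 = weighted_quantile(trace, [0.16, 0.50, 0.84], weights)
```

With real data, `trace` and `weights` would come from the per-planet `tracedata` and `weights` arrays read in above.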

`QuartilesTable.csv`

Contains the 16th (q1), 50th (q2) and 84th (q3) percentiles of each atmospheric target for one-quarter of the instances.

| Planet ID | Parameter_q1 | Parameter_q2 | Parameter_q3 | … |
|---|---|---|---|---|
| 1 | Value | Value | Value | … |
| 2 | Value | Value | Value | … |
| … | Value | Value | Value | … |

Parameters: T, log_H2O, log_CO2, log_CH4, log_CO, log_NH3 (i.e. T_q1, T_q2, T_q3, etc.)
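Once loaded, the three percentile columns per target give a median plus asymmetric error bars. A sketch using a tiny synthetic excerpt in the `QuartilesTable.csv` layout (the `planet_ID` index name and the values are assumptions for illustration):

```python
import io
import pandas as pd

# tiny synthetic excerpt in the QuartilesTable.csv layout (illustrative values)
csv_text = """planet_ID,T_q1,T_q2,T_q3
0,950.0,1000.0,1050.0
1,1400.0,1500.0,1600.0
"""
quartiles = pd.read_csv(io.StringIO(csv_text), index_col='planet_ID')

# median and asymmetric error bars from the 16th/50th/84th percentiles
t_median = quartiles['T_q2']
t_lower_err = quartiles['T_q2'] - quartiles['T_q1']
t_upper_err = quartiles['T_q3'] - quartiles['T_q2']
```

For the real file, replace the `io.StringIO` buffer with the path to `QuartilesTable.csv`.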

`FM_Parameter_Table.csv`

Contains the input values used to generate the high-resolution spectra (Forward Model) before they are binned to the Ariel resolution. It is *not* the ground truth, even though in some cases it is close to it. Available for all instances.

| Planet ID | Parameter1 | Parameter2 | Parameter3 | … |
|---|---|---|---|---|
| 1 | Value | Value | Value | … |
| 2 | Value | Value | Value | … |
| … | Value | Value | Value | … |

Unfortunately, the ground truth is not available for the test data.

To upload your surrogate distributions, please follow the same data structure as for `Tracedata.hdf5` above. At the end of the competition, the top 10 participants (as ranked on the leaderboard) will be asked to run their code on an additional set of planets.
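A minimal sketch of writing surrogate samples in that structure: group and dataset names follow the `Tracedata.hdf5` layout above, while the output filename, the number of planets and the random sample values are placeholders:

```python
import h5py
import numpy as np

rng = np.random.default_rng(0)
n_planets, n_samples, n_targets = 3, 1500, 6  # 6 atmospheric targets

with h5py.File('submission.hdf5', 'w') as out:
    for x in range(n_planets):
        grp = out.create_group(f'Planet_{x}')
        # your model's posterior samples, shape (n_samples, n_targets)
        grp.create_dataset('tracedata', data=rng.normal(size=(n_samples, n_targets)))
        # importance weights summing to one (uniform if samples are equally weighted)
        grp.create_dataset('weights', data=np.full(n_samples, 1.0 / n_samples))
```

Equally weighted samples (e.g. draws from a neural surrogate) can simply carry uniform weights, as in the sketch.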

We have also written a helper function to convert your output into the correct format.

**You can upload your predictions here**