The training dataset can be found on the download page.

General Files

Inside both the training and test data you will find AuxillaryTable.csv and SpectralData.hdf5. The Ground Truth Package is available in the training data for supervised/semi-supervised training. A detailed description of each file is given below:


AuxillaryTable.csv

Contains auxiliary data about the planetary system. There are 9 features in total: Star Distance, Stellar Mass, Stellar Radius, Stellar Temperature, Planet Mass, Orbital Period, Semi-Major Axis, Planet Radius, and Surface Gravity. This information is unique to each planet and is sourced from various exoplanet databases. There are 91,392 examples in total.

To load the dataset:

import pandas as pd
AuxTable = pd.read_csv('AuxillaryTable.csv', index_col='planet_ID')       ### load file
AuxTable.head(1)                                                          ### preview the table
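Before training, it is common to standardise the auxiliary features so they share a comparable scale. The sketch below assumes nothing about the real file: it builds a tiny hypothetical frame with two of the columns listed above (the exact column names in AuxillaryTable.csv may differ) and standardises each feature to zero mean and unit variance.

```python
import pandas as pd

# Hypothetical miniature of AuxillaryTable.csv with two of the nine
# features; real column names and values may differ.
AuxTable = pd.DataFrame(
    {'Stellar Mass': [0.8, 1.0, 1.2], 'Planet Radius': [0.9, 1.1, 1.5]},
    index=pd.Index([0, 1, 2], name='planet_ID'),
)

# Standardise each column to zero mean and unit variance, a common
# preprocessing step before feeding tabular features to a model.
features = (AuxTable - AuxTable.mean()) / AuxTable.std()
print(features.round(3))
```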


SpectralData.hdf5

Contains the spectroscopic information about each observation, including the wavelength grid, spectrum, uncertainty, and bin width. The file is nested by planet ID:

Planet ID
   | --- instrument_wlgrid
   | --- instrument_spectrum
   | --- instrument_noise
   | --- instrument_width

To load the dataset:

import h5py
SpectralData = h5py.File('SpectralData.hdf5', 'r')    ### load file
planetlist = [p for p in SpectralData.keys()]
## access wlgrid, spectrum, noise and wlwidth of a single planet instance
wlgrid = SpectralData[planetlist[0]]['instrument_wlgrid'][:]
spectrum = SpectralData[planetlist[0]]['instrument_spectrum'][:]
noise = SpectralData[planetlist[0]]['instrument_noise'][:]
wlwidth = SpectralData[planetlist[0]]['instrument_width'][:]

Below is a small function to convert the .hdf5 format into a matrix:

import numpy as np

def to_matrix(SpectralData):
    # planet IDs are in ascending order
    num = len(SpectralData.keys())
    id_order = np.arange(num)
    # the instrument resolution (52 wavelength bins) is known beforehand
    observed_spectrum = np.zeros((num, 52, 4))
    for idx, x in enumerate(id_order):
        current_id = f'Planet_{x}'
        wlgrid = SpectralData[current_id]['instrument_wlgrid'][:]
        spectrum = SpectralData[current_id]['instrument_spectrum'][:]
        noise = SpectralData[current_id]['instrument_noise'][:]
        wlwidth = SpectralData[current_id]['instrument_width'][:]
        observed_spectrum[idx, :, :] = np.stack([wlgrid, spectrum, noise, wlwidth], axis=-1)
    return observed_spectrum
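To check that the stacking logic behaves as intended without the real download, one can exercise it on a small synthetic file. The snippet below is an illustration, not part of the official data: it builds an in-memory HDF5 file with the same layout as SpectralData.hdf5 (group per planet, four arrays of 52 bins each, matching the resolution assumed by `to_matrix`) and stacks it into a matrix.

```python
import h5py
import numpy as np

# Synthetic stand-in for SpectralData.hdf5: two planets, 52 bins each.
demo = h5py.File('demo.hdf5', 'w', driver='core', backing_store=False)
keys = ('instrument_wlgrid', 'instrument_spectrum',
        'instrument_noise', 'instrument_width')
for i in range(2):
    grp = demo.create_group(f'Planet_{i}')
    for key in keys:
        grp.create_dataset(key, data=np.random.rand(52))

# Same stacking as to_matrix: one (52, 4) slab per planet.
num = len(demo.keys())
observed = np.stack([
    np.stack([demo[f'Planet_{i}'][k][:] for k in keys], axis=-1)
    for i in range(num)
])
print(observed.shape)  # (2, 52, 4)
demo.close()
```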

Ground Truth Packages

Contains the ground truth for both tracks. Tracedata.hdf5 is designed for the Regular Track and QuartilesTable.csv is intended for the Light Track. However, only 24% of the inputs are accompanied by ground truth; the rest are unlabelled. We have also provided FM_Parameter_Table.csv for semi-supervised learning.


Tracedata.hdf5

The posterior distributions were generated using MultiNest. The HDF5 file contains two arrays per planet: Tracedata and Weights. The tracedata are the likelihood-evaluated points, L, produced by MultiNest. Weights, w, are the importance weights, see equation 10 here. To recover the posterior distribution, p, weight each trace point by its corresponding importance weight (e.g. via a weighted histogram or weighted percentiles).

Tracedata.hdf5 is nested by planet ID with Tracedata and Weights being standard numpy arrays:

Planet ID
   | --- tracedata
   | --- weights

The length of each tracedata/weights array varies with the convergence speed of the MultiNest algorithm, but each contains at least 1,500 data points.

To read the data in Python, you can use the script below:

import h5py
infile     = h5py.File('Tracedata.hdf5', 'r')      ### loading in Tracedata.hdf5
planetlist = [p for p in infile.keys()]            ### getting list of planets in file
trace      = infile[planetlist[0]]['tracedata'][:] ### accessing Nested Sampling trace data
weights    = infile[planetlist[0]]['weights'][:]   ### accessing Nested Sampling weight data
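The trace points and importance weights together define the posterior, so summary statistics must be weighted. The helper below is an illustrative sketch (not provided by the organisers) of a weighted percentile, the quantity reported in QuartilesTable.csv, demonstrated on a synthetic uniformly weighted trace.

```python
import numpy as np

def weighted_percentile(trace, weights, q):
    """Percentile of a weighted sample: sort the points, accumulate the
    normalised weights, and interpolate at the requested quantile."""
    order = np.argsort(trace)
    trace, weights = trace[order], weights[order]
    cdf = np.cumsum(weights)
    cdf /= cdf[-1]
    return np.interp(q / 100.0, cdf, trace)

# Synthetic example: equally weighted standard-normal draws.
rng = np.random.default_rng(0)
trace = rng.normal(size=5000)
weights = np.full(5000, 1.0 / 5000)

q1, q2, q3 = (weighted_percentile(trace, weights, q) for q in (16, 50, 84))
print(q1, q2, q3)  # roughly -1, 0, +1 for a standard normal
```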


QuartilesTable.csv

Contains the 16th (q1), 50th (q2) and 84th (q3) percentiles of each atmospheric target for one-quarter of the instances.

Planet ID | Parameter_q1 | Parameter_q2 | ... | Parameter_q3
--------- | ------------ | ------------ | --- | ------------
1         | Value        | Value        | ... | Value
2         | Value        | Value        | ... | Value
...       | Value        | Value        | ... | Value

Parameters: T, log_H2O, log_CO2, log_CH4, log_CO, log_NH3 (i.e. T_q1, T_q2, T_q3, etc.)
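The quartile columns can be gathered into an array per planet for scoring or sanity checks. The snippet below is a sketch on a hypothetical one-row miniature of the table; only the `T_q1`/`T_q2`/`T_q3` column-naming scheme is taken from the description above.

```python
import pandas as pd

# Hypothetical miniature of QuartilesTable.csv with one target (T);
# the real file has six targets, each with _q1/_q2/_q3 columns.
quartiles = pd.DataFrame(
    {'T_q1': [900.0], 'T_q2': [1000.0], 'T_q3': [1100.0]},
    index=pd.Index([0], name='planet_ID'),
)

# Collect the three percentiles of each target into a (planets, 3) array.
targets = ['T']
values = quartiles[[f'{t}_q{i}' for t in targets for i in (1, 2, 3)]].to_numpy()
print(values.shape)  # (1, 3): one planet, three percentiles
```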


FM_Parameter_Table.csv

Contains the input values used to generate the high-resolution spectra (forward model) before they are binned to Ariel resolution. These values are not the ground truth, even though in some cases they are close to it. Available for all instances.

Planet ID | Parameter1 | Parameter2 | ... | Parameter3
--------- | ---------- | ---------- | --- | ----------
1         | Value      | Value      | ... | Value
2         | Value      | Value      | ... | Value
...       | Value      | Value      | ... | Value

Unfortunately, the ground truth is not available for the test data.

Upload Format

To upload your surrogate distributions, please follow the same data structure as Tracedata.hdf5 above. At the end of the competition, the top 10 participants (as ranked on the leaderboard) will be asked to run their code on an additional set of planets.

We have also written a helper function to convert your output into the correct format.
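The official helper function is not reproduced here, but a minimal sketch of the expected layout follows: one group per planet containing `tracedata` and `weights` datasets, matching the keys used by the reading script above. The planet ID format, sample counts, and the equal-weight surrogate are illustrative assumptions.

```python
import h5py
import numpy as np

rng = np.random.default_rng(1)
n_planets, n_samples, n_targets = 3, 1500, 6

# Write a submission file in the Tracedata.hdf5 layout.
with h5py.File('submission.hdf5', 'w') as out:
    for i in range(n_planets):
        grp = out.create_group(f'Planet_{i}')  # assumed ID format
        samples = rng.normal(size=(n_samples, n_targets))  # surrogate draws
        weights = np.full(n_samples, 1.0 / n_samples)      # equal weights
        grp.create_dataset('tracedata', data=samples)
        grp.create_dataset('weights', data=weights)

# Read it back with the same access pattern as the reading script.
with h5py.File('submission.hdf5', 'r') as infile:
    planetlist = list(infile.keys())
    trace = infile[planetlist[0]]['tracedata'][:]
    weights = infile[planetlist[0]]['weights'][:]
print(trace.shape, weights.sum())
```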

You can upload your predictions here