Dataset

This dataset (restructured from original competition data) is kindly provided by Luís F. Simões.

You should find the following pickle files inside the team_workspaces:

  • ADC21esac_train.pkl
  • ADC21esac_test.pkl

You should train your model with data provided in ADC21esac_train.pkl and test your model using data from ADC21esac_test.pkl. In other words, you should submit your prediction for the data from ADC21esac_test.pkl.

Inside the pickle files you will find everything you need during this hackathon. To open the file you can use the following command in Python:

import pickle as pkl

with open(PATH_TO_FILE, 'rb') as f:
    data = pkl.load(f)


You should find 8 keys inside the loaded dictionary: 'dataset_name', 'planet_idxs', 'obs_to_fname', 'planet', 'X_params', 'y_params', 'X', 'y'

A brief introduction to the keys:
- dataset_name: The name of the dataset (i.e. whether it is the train or the test set).
- planet_idxs: The index of each exoplanet. There are far fewer planet indices than observations, because 100 noise instances are simulated for each planet.
- obs_to_fname: A mapping from AAAA_BB_CC.txt filenames to tuples (A, B, C), more on this below.
- planet: Information about the observed planetary system (repeated information from X_params and y_params).
- X_params: Auxiliary information about the observation.
- y_params: Auxiliary information about the target (note that 'sma' and 'incl' are not provided in the test file).
- X: The light curve observations.
- y: The targets (the spectrum of each planet in each instance); this key is empty in the test file.
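For a quick first look at these keys, here is a minimal sketch, assuming the training file has been loaded into data as shown above and that data['X'] is keyed by the observation filenames described below:

# Minimal sketch: inspect the keys of the loaded training dictionary.
# Assumes data was loaded from ADC21esac_train.pkl as shown above.
print(data['dataset_name'])            # train/test label
print(len(data['planet_idxs']))        # number of unique planets
print(len(data['X']))                  # number of observations (100 noise instances per planet)
print(sorted(data['X'].keys())[:3])    # first few observation filenames, e.g. '0001_01_01.txt'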

‘Noisy’ observations

The observed data (i.e. the features) is contained within the keys X and X_params.
X contains a dictionary of 2D arrays of relative fluxes, each of dimension (55 x 300): every row is the time series (300 time steps, denoted t# below) of a particular wavelength channel (there are 55 channels, denoted w# below). X_params is a table containing 6 stellar and planet parameters.

X is a nested dictionary with the following structure:

|- 0001_01_01.txt
      |- observation
|- 0001_01_02.txt
      |- observation

The .txt files are named following the convention: AAAA_BB_CC.txt

The name is unique for each observation (i.e. data point): AAAA (0001 to 2097) is an index for the planet observed, BB (01 to 10) is an index for the stellar spot noise instance, and CC (01 to 10) is an index for the Gaussian photon noise instance. You can access a list of the stellar spot noise instances via data['y_params'][0]['stellar_spot'] and of the photon noise instances via data['y_params'][0]['photon'].
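If you want to go from a filename back to its indices without using the obs_to_fname mapping, a small hypothetical helper like the one below can parse the naming convention directly (parse_obs_name is not part of the dataset, just an illustration):

def parse_obs_name(fname):
    # Split 'AAAA_BB_CC.txt' into integer indices (planet, spot noise, photon noise).
    stem = fname.replace('.txt', '')
    planet, spot, photon = (int(part) for part in stem.split('_'))
    return planet, spot, photon

print(parse_obs_name('0001_01_01.txt'))   # -> (1, 1, 1)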

Each observation is organised like the following (without the column and row names):

      (t1)            (t2)            ...   (t300)
(w1)  1.00010151742   1.00010218526   ...   1.00001215251
(w2)  0.999857792623  1.00009976297   ...   1.00007764626
(...) ...             ...             ...   ...
(w55) 0.999523150082  0.999468565171  ...   0.999934661757
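Here is a minimal sketch of pulling out a single observation and checking its shape, assuming each entry of data['X'] maps the filename directly to its (55 x 300) array (adjust the lookup if your copy nests an extra level, e.g. an 'observation' key):

import numpy as np

# Grab one observation by filename and check its dimensions.
obs_name = sorted(data['X'].keys())[0]   # e.g. '0001_01_01.txt'
obs = np.asarray(data['X'][obs_name])    # assumed shape: (55 wavelength channels, 300 time steps)
print(obs_name, obs.shape)

# Row w1 is the time series of the first wavelength channel.
lightcurve_w1 = obs[0]                   # 300 relative flux values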

X_params is a table that can be read as a pandas DataFrame using pd.DataFrame.from_dict(data['X_params'][0]):

Index           planet  stellar_spot  photon  star_temp  star_logg  star_rad  star_mass  star_k_mag  period
0001_01_01.txt  1       1             1       3667.42    5.0        0.4395    0.476      9.429       5.707100

Note that the information is repeated 100 times, as each planet is simulated with 100 noise instances. To find a unique table for each planet, see data['planet'][0].
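For example, here is a sketch (assuming data['planet'][0] can be read the same way as data['X_params'][0]) that loads both tables and compares their sizes:

import pandas as pd

# Per-observation table: one row per observation, so each planet appears 100 times.
X_params_df = pd.DataFrame.from_dict(data['X_params'][0])
print(X_params_df.shape)

# Per-planet table: one unique row per planet.
planet_df = pd.DataFrame.from_dict(data['planet'][0])
print(planet_df.shape)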

Both X and X_params should be used as features for training, as the respective target parameters are given (see the 'Target' files section below).

The test file ADC21esac_test.pkl follows the same structure as the training data; its X and X_params should be used as features to predict the respective target parameters (see the upload file format below).

‘Target’ files

y contains the retrieved data (i.e. the targets): for each observation, a 1D array of relative radii (planet-to-star-radius ratios) of dimension (1 x 55), where every column corresponds to a particular wavelength channel (there are 55 channels, denoted with w# below). The targets of the regression problem are the 55 relative radii.

The file structure can be seen below (without the column and row names):

|- 0001_01_01.txt
      |- targets 
|- 0001_01_02.txt
      |- targets

Each target has the following format:

              (w1)            (w2)            (...) (w55)           
(AAAA_BB_CC)  0.0195608058653 0.019439812298  ...   0.0271040897872

y should be used as the targets for training, as they correspond to the noisy observations (see X and/or X_params above). Your task is to predict the corresponding relative radii at each wavelength using the test observation data contained within X and/or X_params.
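Here is a minimal sketch of assembling aligned feature and target arrays for training, assuming both data['X'] and data['y'] map observation filenames directly to their arrays (an illustration only, not a prescribed pipeline):

import numpy as np

# Use one sorted list of observation names so features and targets stay aligned.
obs_names = sorted(data['y'].keys())

X_train = np.stack([np.asarray(data['X'][name]).ravel() for name in obs_names])   # (N, 55*300)
y_train = np.stack([np.asarray(data['y'][name]).ravel() for name in obs_names])   # (N, 55)

print(X_train.shape, y_train.shape)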

y_params contains 2 additional planet parameters ('sma' and 'incl'), which can be used as intermediate targets or simply ignored; they may or may not help with model training. Note that these parameters are not provided in the test file. You can read them as a pandas DataFrame via pd.DataFrame.from_dict(data['y_params'][0]):

Index           planet  stellar_spot  photon  sma           incl
0001_01_01.txt  1       1             1       7.300915e+09  88.779129

Note: If you find it useful, you can use the two additional parameters (inside y_params) that are provided ONLY for the training set examples – (semimajor axis) ‘sma’ and (inclination) ‘incl’ – as intermediate targets for predicting the actual 55 targets. Otherwise you can ignore them.
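If you do want to experiment with them, here is a sketch of reading the two columns out of the training y_params table (assuming it loads as shown above):

import pandas as pd

# Auxiliary target table for the training set only.
y_params_df = pd.DataFrame.from_dict(data['y_params'][0])

# Optional intermediate targets: semimajor axis and inclination.
sma = y_params_df['sma'].values
incl = y_params_df['incl'].values
print(sma[:3], incl[:3])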

Uploading your prediction

Your uploaded prediction should contain the predicted 55 relative radii for every noisy observation X in the test file, sorted by the observation name.

The file structure can be seen below (without the column and row names):

                  (w1)              (w2)              (...) (w55)
(0003_01_01)      0.0195608058653   0.019439812298    ...   0.0271040897872
(0003_01_02)      0.0195608058653   0.019439812298    ...   0.0271040897872
(...)             ...               ...               ...   ...
(4381_10_10)      0.0195608058653   0.019439812298    ...   0.0271040897872

Once you are happy with your prediction, you should submit the prediction via the provided API. More on this in the starter notebook on ESA Datalabs.

Error codes and description

Here we provide a list of the error codes and their descriptions.

  • Wrong file type : File is not in .csv format
  • Invalid secret code : Typo in secret code or wrong code
  • Invalid data format : The file has non-numeric values
  • Wrong data format : The number of rows or columns is not as expected
  • Invalid Submission : Empty file submitted
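Before uploading, it can help to check your file against these conditions locally. The sketch below is a suggestion only (not part of the official API) and assumes your submission is a pandas DataFrame like the one returned by the helper function further down:

import pandas as pd

def basic_checks(submission_df, expected_rows):
    # Rough local checks mirroring the error conditions listed above.
    assert not submission_df.empty, 'empty submission'
    assert len(submission_df) == expected_rows, 'unexpected number of rows'
    assert submission_df.shape[1] == 55, 'expected 55 wavelength columns'
    # All values should be numeric.
    numeric = submission_df.apply(pd.to_numeric, errors='coerce')
    assert numeric.notna().all().all(), 'non-numeric values found'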

Here we have included a helper function to help you shape your submission into the right format.


import pandas as pd

def to_submission_format(file_index, matrix):
    """Shape predictions into the expected submission DataFrame.

    file_index (1D array): file names of the test set, in the form AAAA_BB_CC.txt
                           (note that it should be in the same order as the given test set!)
    matrix (2D array):     N_examples x 55 wavelength channels
    """
    # The number of file names and prediction rows should match each other.
    assert len(file_index) == len(matrix)

    column_names = [f'w{i+1}' for i in range(55)]
    submission_df = pd.DataFrame(matrix, columns=column_names)
    submission_df.insert(0, 'files', file_index)
    submission_df = submission_df.set_index('files')
    return submission_df
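For example, here is a usage sketch assuming the test file has been loaded into test_data and using placeholder predictions (substitute your model's output):

import numpy as np

# Observation names from the test set, sorted by name as requested above.
file_index = sorted(test_data['X'].keys())

# Placeholder predictions of shape (N, 55); replace with your model's output.
predictions = np.zeros((len(file_index), 55))

submission_df = to_submission_format(file_index, predictions)
submission_df.to_csv('my_submission.csv')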