Dataset

This dataset (restructured from original competition data) is kindly provided by Luís F. Simões.

You should find the following pickle files inside the team_workspaces:

  • ADC21esac_train.pkl
  • ADC21esac_test.pkl

You should train your model with data provided in ADC21esac_train.pkl and test your model using data from ADC21esac_test.pkl. In other words, you should submit your prediction for the data from ADC21esac_test.pkl.

Inside the pickle files you will find everything you need during this hackathon. To open the file you can use the following command in Python:

import pickle as pkl

with open(PATH_TO_FILE, 'rb') as f:
    data = pkl.load(f)


You should find 8 keys inside the loaded dictionary: 'dataset_name', 'planet_idxs', 'obs_to_fname', 'planet', 'X_params', 'y_params', 'X', 'y'

A brief introduction to the keys:
- dataset_name: The name of the dataset (i.e. whether it is the train or the test set).
- planet_idxs: The index of each exoplanet. There are far fewer planet indices than observations, because 100 noise instances are simulated for each planet.
- obs_to_fname: A mapping from AAAA_BB_CC.txt filenames to tuples (A, B, C), more on this below.
- planet: Information about the observed planetary system (repeated information from X_params and y_params).
- X_params: Auxiliary information about the observation.
- y_params: Auxiliary information about the target (note that 'sma' and 'incl' are not provided in the test file).
- X: The light curve observations.
- y: The targets (the spectrum of each planet in each instance); this key is empty in the test file.
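For a quick first look at these keys, here is a minimal sketch, assuming the training file has been loaded into data as shown above and that data['X'] is keyed by the observation filenames described below:

# Minimal sketch: inspect the keys of the loaded training dictionary.
# Assumes data was loaded from ADC21esac_train.pkl as shown above.
print(data['dataset_name'])            # train/test label
print(len(data['planet_idxs']))        # number of unique planets
print(len(data['X']))                  # number of observations (100 noise instances per planet)
print(sorted(data['X'].keys())[:3])    # first few observation filenames, e.g. '0001_01_01.txt'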

‘Noisy’ observations

The observed data (i.e. the features) is contained within the keys X and X_params.
X contains a dictionary of 2D arrays of relative fluxes, each of dimension (55 x 300): every row is the time series (300 time steps, denoted t# below) of a particular wavelength channel (there are 55 channels, denoted w# below). X_params is a table containing 6 stellar and planet parameters.

X is a nested dictionary with the following structure:

|- 0001_01_01.txt
      |- observation
|- 0001_01_02.txt
      |- observation

The .txt files are named following the convention: AAAA_BB_CC.txt

The name is unique for each observation (i.e. data point): AAAA (0001 to 2097) is an index for the planet observed, BB (01 to 10) is an index for the stellar spot noise instance, and CC (01 to 10) is an index for the Gaussian photon noise instance. You can access a list of the stellar spot noise instances via data['y_params'][0]['stellar_spot'] and of the photon noise instances via data['y_params'][0]['photon'].
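If you want to go from a filename back to its indices without using the obs_to_fname mapping, a small hypothetical helper like the one below can parse the naming convention directly (parse_obs_name is not part of the dataset, just an illustration):

def parse_obs_name(fname):
    # Split 'AAAA_BB_CC.txt' into integer indices (planet, spot noise, photon noise).
    stem = fname.replace('.txt', '')
    planet, spot, photon = (int(part) for part in stem.split('_'))
    return planet, spot, photon

print(parse_obs_name('0001_01_01.txt'))   # -> (1, 1, 1)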

Each observation is organised like the following (without the column and row names):

      (t1)            (t2)            ...   (t300)
(w1)  1.00010151742   1.00010218526   ...   1.00001215251
(w2)  0.999857792623  1.00009976297   ...   1.00007764626
(...) ...             ...             ...   ...
(w55) 0.999523150082  0.999468565171  ...   0.999934661757
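Here is a minimal sketch of pulling out a single observation and checking its shape, assuming each entry of data['X'] maps the filename directly to its (55 x 300) array (adjust the lookup if your copy nests an extra level, e.g. an 'observation' key):

import numpy as np

# Grab one observation by filename and check its dimensions.
obs_name = sorted(data['X'].keys())[0]   # e.g. '0001_01_01.txt'
obs = np.asarray(data['X'][obs_name])    # assumed shape: (55 wavelength channels, 300 time steps)
print(obs_name, obs.shape)

# Row w1 is the time series of the first wavelength channel.
lightcurve_w1 = obs[0]                   # 300 relative flux values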

X_params is a table that can be read as a pandas DataFrame using pd.DataFrame.from_dict(data['X_params'][0]):

Index           planet  stellar_spot  photon  star_temp  star_logg  star_rad  star_mass  star_k_mag  period
0001_01_01.txt  1       1             1       3667.42    5.0        0.4395    0.476      9.429       5.707100

Note that the information is repeated 100 times, as each planet is simulated with 100 noise instances. To find a unique table for each planet, see data['planet'][0].
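For example, here is a sketch (assuming data['planet'][0] can be read the same way as data['X_params'][0]) that loads both tables and compares their sizes:

import pandas as pd

# Per-observation table: one row per observation, so each planet appears 100 times.
X_params_df = pd.DataFrame.from_dict(data['X_params'][0])
print(X_params_df.shape)

# Per-planet table: one unique row per planet.
planet_df = pd.DataFrame.from_dict(data['planet'][0])
print(planet_df.shape)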

Both X and X_params should be used as features for training, as the respective target parameters are given (see the 'Target' files section below).

The test file ADC21esac_test.pkl follows the same structure as the training data; its X and X_params should be used as features to predict the respective target parameters (see the upload file format below).

‘Target’ files

y contains the retrieved data (i.e. the targets): for each observation, a 1D array of relative radii (planet-to-star-radius ratios) of dimension (1 x 55), where every column corresponds to a particular wavelength channel (there are 55 channels, denoted with w# below). The targets of the regression problem are the 55 relative radii.

The file structure can be seen below (without the column and row names):

|- 0001_01_01.txt
      |- targets 
|- 0001_01_02.txt
      |- targets

Each target has the following format:

              (w1)            (w2)            (...) (w55)           
(AAAA_BB_CC)  0.0195608058653 0.019439812298  ...   0.0271040897872

y should be used as the targets for training, as they correspond to the noisy observations (see X and/or X_params above). Your task is to predict the corresponding relative radii at each wavelength using the test observation data contained within X and/or X_params.
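Here is a minimal sketch of assembling aligned feature and target arrays for training, assuming both data['X'] and data['y'] map observation filenames directly to their arrays (an illustration only, not a prescribed pipeline):

import numpy as np

# Use one sorted list of observation names so features and targets stay aligned.
obs_names = sorted(data['y'].keys())

X_train = np.stack([np.asarray(data['X'][name]).ravel() for name in obs_names])   # (N, 55*300)
y_train = np.stack([np.asarray(data['y'][name]).ravel() for name in obs_names])   # (N, 55)

print(X_train.shape, y_train.shape)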

y_params contains 2 additional planet parameters ('sma' and 'incl'), which can be used as intermediate targets or simply ignored; they may or may not help with model training. Note that these parameters are not provided in the test file. You can read them as a pandas DataFrame via pd.DataFrame.from_dict(data['y_params'][0]):

Index           planet  stellar_spot  photon  sma           incl
0001_01_01.txt  1       1             1       7.300915e+09  88.779129

Note: If you find it useful, you can use the two additional parameters (inside y_params) that are provided ONLY for the training set examples – (semimajor axis) ‘sma’ and (inclination) ‘incl’ – as intermediate targets for predicting the actual 55 targets. Otherwise you can ignore them.
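If you do want to experiment with them, here is a sketch of reading the two columns out of the training y_params table (assuming it loads as shown above):

import pandas as pd

# Auxiliary target table for the training set only.
y_params_df = pd.DataFrame.from_dict(data['y_params'][0])

# Optional intermediate targets: semimajor axis and inclination.
sma = y_params_df['sma'].values
incl = y_params_df['incl'].values
print(sma[:3], incl[:3])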

Uploading your prediction

Your uploaded prediction should contain the predicted 55 relative radii for every noisy observation X in the test file, sorted by the observation name.

The file structure can be seen below (without the column and row names):

                  (w1)              (w2)              (...) (w55)
(0003_01_01)      0.0195608058653   0.019439812298    ...   0.0271040897872
(0003_01_02)      0.0195608058653   0.019439812298    ...   0.0271040897872
(...)             ...               ...               ...   ...
(4381_10_10)      0.0195608058653   0.019439812298    ...   0.0271040897872

Once you are happy with your prediction, you should submit the prediction via the provided API. More on this in the starter notebook on ESA Datalabs.

Error codes and description

Here we provide a list of the error codes and their descriptions.

  • Wrong file type : File is not in .csv format
  • Invalid secret code : Typo in secret code or wrong code
  • Invalid data format : The file has non-numeric values
  • Wrong data format : The number of rows or columns is not as expected
  • Invalid Submission : Empty file submitted
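Before uploading, it can help to check your file against these conditions locally. The sketch below is a suggestion only (not part of the official API) and assumes your submission is a pandas DataFrame like the one returned by the helper function further down:

import pandas as pd

def basic_checks(submission_df, expected_rows):
    # Rough local checks mirroring the error conditions listed above.
    assert not submission_df.empty, 'empty submission'
    assert len(submission_df) == expected_rows, 'unexpected number of rows'
    assert submission_df.shape[1] == 55, 'expected 55 wavelength columns'
    # All values should be numeric.
    numeric = submission_df.apply(pd.to_numeric, errors='coerce')
    assert numeric.notna().all().all(), 'non-numeric values found'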

Here we have included a helper function to help you shape your submission into the right format.


import pandas as pd

def to_submission_format(file_index, matrix):
    """Shape predictions into the expected submission DataFrame.

    file_index (1D array): file names of the test set, in the form AAAA_BB_CC.txt
                           (note that it should be in the same order as the given test set!)
    matrix (2D array):     N_examples x 55 wavelength channels
    """
    # The number of file names and prediction rows should match each other.
    assert len(file_index) == len(matrix)

    column_names = [f'w{i+1}' for i in range(55)]
    submission_df = pd.DataFrame(matrix, columns=column_names)
    submission_df.insert(0, 'files', file_index)
    submission_df = submission_df.set_index('files')
    return submission_df
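For example, here is a usage sketch assuming the test file has been loaded into test_data and using placeholder predictions (substitute your model's output):

import numpy as np

# Observation names from the test set, sorted by name as requested above.
file_index = sorted(test_data['X'].keys())

# Placeholder predictions of shape (N, 55); replace with your model's output.
predictions = np.zeros((len(file_index), 55))

submission_df = to_submission_format(file_index, predictions)
submission_df.to_csv('my_submission.csv')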