Dataset

The dataset can be found here. This dataset (restructured from original competition data) is kindly provided by Luís F. Simões.

Once you have uncompressed the dataset, you should find the following files inside:

  1. train_X_data.pkl
  2. train_X_params.pkl
  3. train_y_data.pkl
  4. train_y_params.pkl
  5. test_X_data.pkl
  6. test_X_params.pkl

‘Noisy’ observations

The observed data (i.e. the features) are split into two files for easier processing.
train_X_data.pkl contains a dictionary of 2D arrays of relative fluxes, each of dimension (55 x 300): every row is the time series (300 time steps, denoted with t# below) of one wavelength channel (55 channels, denoted with w# below). train_X_params.pkl is a table containing 6 stellar and planet parameters.

train_X_data.pkl is a nested Python dictionary with the following structure:

|- 0001_01_01.txt
      |- observation
|- 0001_01_02.txt
      |- observation

The .txt files are named following the convention: AAAA_BB_CC.txt

The name is unique for each observation (i.e. datapoint): AAAA (0001 to 2097) is an index for the planet observed, BB (01 to 10) is an index for the stellar spot noise instance, and CC (01 to 10) is an index for the Gaussian photon noise instance.
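Because the name encodes the planet index, it can be used to keep all noise instances of one planet on the same side of a train/validation split. The following is a minimal sketch of that idea; the function names and the 20% hold-out fraction are illustrative assumptions, not part of the challenge material.

```python
import random

def parse_observation_name(name):
    """Split 'AAAA_BB_CC.txt' into (planet, stellar_spot, photon) integers."""
    stem = name[:-4] if name.endswith(".txt") else name
    planet, spot, photon = stem.split("_")
    return int(planet), int(spot), int(photon)

def split_by_planet(names, holdout_fraction=0.2, seed=0):
    """Split names into (train, validation) so that no planet is in both."""
    planets = sorted({parse_observation_name(n)[0] for n in names})
    rng = random.Random(seed)
    rng.shuffle(planets)
    n_holdout = max(1, int(len(planets) * holdout_fraction))
    holdout = set(planets[:n_holdout])
    train = [n for n in names if parse_observation_name(n)[0] not in holdout]
    valid = [n for n in names if parse_observation_name(n)[0] in holdout]
    return train, valid

# Synthetic names: 5 planets, 2 spot instances, 2 photon instances each.
names = [f"{p:04d}_{b:02d}_{c:02d}.txt"
         for p in (1, 2, 3, 4, 5) for b in (1, 2) for c in (1, 2)]
train_names, valid_names = split_by_planet(names)
print(parse_observation_name("0001_01_02.txt"))  # (1, 1, 2)
```

Splitting at the planet level avoids scoring a model on noise realisations of a planet it has already seen during training.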

Each observation is organised as follows (the column and row names are shown for illustration only and are not part of the array):

      (t1)            (t2)            ...   (t300)
(w1)  1.00010151742   1.00010218526   ...   1.00001215251
(w2)  0.999857792623  1.00009976297   ...   1.00007764626
(...) ...             ...             ...   ...
(w55) 0.999523150082  0.999468565171  ...   0.999934661757
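As a rough sketch of how the dictionary could be turned into a single array for modelling: the helper names below are hypothetical, and the code assumes each dictionary entry is either the (55 x 300) array itself or a small dictionary wrapping it under a single key. Inspect the real file before relying on either assumption.

```python
import pickle
import numpy as np

def load_pickle(path):
    """Load one of the .pkl files (only unpickle files you trust)."""
    with open(path, "rb") as f:
        return pickle.load(f)

def stack_observations(data):
    """Return (sorted names, array of shape (N, 55, 300))."""
    names = sorted(data)
    arrays = []
    for name in names:
        entry = data[name]
        # Assumption: an entry is either the 2D array itself or a dict
        # wrapping it under a single key; adjust after inspecting the file.
        if isinstance(entry, dict):
            entry = next(iter(entry.values()))
        arrays.append(np.asarray(entry, dtype=float))
    return names, np.stack(arrays)

# Synthetic stand-in for load_pickle("train_X_data.pkl").
rng = np.random.default_rng(0)
fake = {f"0001_01_{c:02d}.txt": 1.0 + 1e-4 * rng.standard_normal((55, 300))
        for c in (2, 1)}
names, flux = stack_observations(fake)
print(names[0], flux.shape)  # 0001_01_01.txt (2, 55, 300)
```

Sorting the names gives a fixed row order that can be reused when aligning the parameter table and the targets.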

train_X_params.pkl is a table in the following format:

Index planet stellar_spot photon star_temp star_logg star_rad star_mass star_k_mag period
0001_01_01.txt 1 1 1 3667.42 5.0 0.4395 0.476 9.429 5.707100

Both train_X_data.pkl and train_X_params.pkl should be used as features for training as the respective target parameters are given (see parameters files below).
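One simple way to use both files together is to flatten each light-curve array and append the 6 parameters. The sketch below assumes the parameter table loads as a pandas DataFrame indexed by observation name with the column names shown above; `build_features` and the synthetic inputs are illustrative only.

```python
import numpy as np
import pandas as pd

PARAM_COLUMNS = ["star_temp", "star_logg", "star_rad",
                 "star_mass", "star_k_mag", "period"]

def build_features(names, flux, params):
    """Concatenate flattened fluxes (N, 55*300) with the 6 parameters (N, 6)."""
    flat = np.asarray(flux, dtype=float).reshape(len(names), -1)
    # Select rows by name so both blocks share the same row order.
    aux = params.loc[names, PARAM_COLUMNS].to_numpy(dtype=float)
    return np.hstack([flat, aux])

# Synthetic inputs standing in for the real files.
names = ["0001_01_01.txt", "0001_01_02.txt"]
flux = np.ones((2, 55, 300))
params = pd.DataFrame(
    [[3667.42, 5.0, 0.4395, 0.476, 9.429, 5.7071]] * 2,
    index=names, columns=PARAM_COLUMNS,
)
X = build_features(names, flux, params)
print(X.shape)  # (2, 16506)
```

The parameters live on very different scales (for example star_temp versus star_rad), so scaling them before training is usually worth considering.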

test_X_data.pkl and test_X_params.pkl follow the same structure as the training data, and should be used as features to predict and upload their respective target parameters (see upload file format below).

‘Target’ files

train_y_data.pkl contains the retrieved data (i.e. the targets), namely: a 1D array of relative radii (planet-to-star-radius ratios) of dimension (1 x 55), where every column corresponds to a particular wavelength channel (there are 55 channels, denoted with w# below). The targets of the regression problem are the 55 relative radii.

The file structure can be seen below (without the column and row names):

|- 0001_01_01.txt
      |- targets 
|- 0001_01_02.txt
      |- targets

Each target has the following format:

              (w1)            (w2)            (...) (w55)           
(AAAA_BB_CC)  0.0195608058653 0.019439812298  ...   0.0271040897872

The train_y_data.pkl entries should be used as targets for training, as they correspond to the noisy files (see train_X_data.pkl and/or train_X_params.pkl above). Your task is to predict the corresponding relative radii at each wavelength using test_X_data.pkl and/or test_X_params.pkl.
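For orientation, fluxes and targets are linked by the transit depth: ignoring limb darkening and noise, the fractional dip in flux during transit is approximately the square of the planet-to-star radius ratio. The sketch below turns that into a crude per-channel baseline. It assumes the transit sits in the middle of the 300 time steps and that the out-of-transit flux is normalised to 1; neither is stated above, so check both against the data. The function name and `window` value are illustrative.

```python
import numpy as np

def baseline_radius_ratio(flux, window=50):
    """Estimate Rp/Rs per channel as sqrt(1 - mean flux near mid-transit).

    flux: array of shape (..., 300) with time on the last axis.
    window: number of central time steps treated as in-transit (assumed).
    """
    flux = np.asarray(flux, dtype=float)
    mid = flux.shape[-1] // 2
    in_transit = flux[..., mid - window // 2: mid + window // 2]
    depth = 1.0 - in_transit.mean(axis=-1)
    # Noise can push the measured depth below zero; clip before the root.
    return np.sqrt(np.clip(depth, 0.0, None))

# Synthetic box-shaped transit with a true radius ratio of 0.02.
flux = np.ones((55, 300))
flux[:, 100:200] = 1.0 - 0.02 ** 2
estimate = baseline_radius_ratio(flux)
print(estimate.shape)  # (55,)
```

Stellar spots and photon noise bias this estimate, which is why the challenge asks for a learned correction rather than a direct depth measurement.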

train_y_params.pkl contains 2 planet parameters (‘sma’ and ‘incl’). They are optional: they can be used as intermediate targets or ignored, and may or may not help with model training. These parameters are not provided for the leaderboard/final evaluation data.

Index planet stellar_spot photon sma incl
0001_01_01.txt 1 1 1 7.300915e+09 88.779129

Note: If you find it useful, you can use the two additional parameters (inside train_y_params) that are provided ONLY for the training set examples – (semimajor axis) ‘sma’ and (inclination) ‘incl’ – as intermediate targets for predicting the actual 55 targets. Otherwise you can ignore them.

Upload file

The file to be uploaded should contain all the predictions of the 55 relative radii that correspond to the noisy observations in test_X_data.pkl (sorted by the observation name), but not the planetary parameters ‘sma’ and ‘incl’ (see files above).

The file structure can be seen below (without the column and row names):

                  (w1)              (w2)              (...) (w55)
(0005_01_01)      0.0195608058653   0.019439812298    ...   0.0271040897872
(0005_01_02)      0.0195608058653   0.019439812298    ...   0.0271040897872
(...)             ...               ...               ...   ...
(2096_10_10)      0.0195608058653   0.019439812298    ...   0.0271040897872

You can upload your predictions here. You should be able to create this structure with pandas DataFrames and save it as a .csv file.
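A minimal sketch of writing such a table with pandas and reading it back as a sanity check. The observation names, the constant predictions and the output path are placeholders. Whether the platform expects the header row, the index column or the .txt suffix in the names is not stated here, so treat those choices as assumptions to verify.

```python
import os
import tempfile
import numpy as np
import pandas as pd

names = ["0005_01_02.txt", "0005_01_01.txt"]  # hypothetical test names
predictions = np.full((2, 55), 0.02)          # placeholder predictions

df = pd.DataFrame(
    predictions,
    index=pd.Index(names, name="files"),
    columns=[f"w{i + 1}" for i in range(55)],
).sort_index()  # rows sorted by observation name

path = os.path.join(tempfile.mkdtemp(), "submission.csv")
df.to_csv(path)

# Read the file back to confirm the shape survived the round trip.
check = pd.read_csv(path, index_col="files")
print(check.shape)  # (2, 55)
```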

Once you are done with your presentation, you can upload it via this link.

Error codes and description

  • Wrong file type : File is not in .csv format
  • Invalid secret code : Typo in secret code or wrong code
  • Invalid data format : The file has non-numeric values
  • Wrong data format : The number of rows or columns is not as expected
  • Invalid Submission : Empty file submitted
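Some of these errors can be caught locally before uploading. The sketch below mirrors the descriptions above for a pandas DataFrame of predictions; it is a guess at the kind of checks involved, not the platform's actual validation, and `check_submission` is a hypothetical helper. The expected number of rows has to come from the test set.

```python
import numpy as np
import pandas as pd

def check_submission(df, expected_rows, expected_cols=55):
    """Return a list of likely error codes; an empty list means none found."""
    problems = []
    if df.empty:
        problems.append("Invalid Submission")
        return problems
    if df.shape != (expected_rows, expected_cols):
        problems.append("Wrong data format")
    # Non-numeric cells become NaN; missing predictions are flagged as well.
    numeric = df.apply(pd.to_numeric, errors="coerce")
    if numeric.isna().to_numpy().any():
        problems.append("Invalid data format")
    return problems

good = pd.DataFrame(np.full((3, 55), 0.02))
bad = good.astype(object)
bad.iloc[0, 0] = "abc"

print(check_submission(good, expected_rows=3))  # []
print(check_submission(bad, expected_rows=3))   # ['Invalid data format']
```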

Here we have included a helper function to help you shape your submission into the right format.


import pandas as pd

def to_submission_format(file_index, matrix):
    """Shape predictions into the submission table.

    file_index (1D array): file index of the test set; it must be in the
        same order as the given test set.
    matrix (2D array): predictions of shape N_examples x 55 wavelengths.
    """
    # There must be exactly one row of predictions per file.
    assert len(file_index) == len(matrix)

    column_names = [f'w{i+1}' for i in range(55)]
    submission_df = pd.DataFrame(matrix, columns=column_names)
    submission_df.insert(0, 'files', file_index)
    submission_df = submission_df.set_index('files')
    return submission_df