The dataset can be found here. This dataset (restructured from original competition data) is kindly provided by Luís F. Simões.
Once you have uncompressed the dataset, you should find the following files inside:
- train_X_data.pkl
- train_X_params.pkl
- train_y_data.pkl
- train_y_params.pkl
- test_X_data.pkl
- test_X_params.pkl
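A minimal sketch for loading these files, assuming they are ordinary Python pickles readable with the standard pickle module and unpacked into a single folder. The helper names load_pickle and load_split and the data_dir argument are hypothetical and only for illustration. If the params tables were saved as pandas objects, pandas must be installed for unpickling to succeed.

```python
import pickle
from pathlib import Path


def load_pickle(path):
    """Load one pickled object from disk (assumes a standard pickle file)."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def load_split(data_dir, split="train"):
    """Load every available file of one split into a dict keyed by its role.

    `data_dir` is a hypothetical folder holding the uncompressed dataset.
    The test split ships no target files, so missing files are skipped.
    """
    data_dir = Path(data_dir)
    loaded = {}
    for role in ("X_data", "X_params", "y_data", "y_params"):
        path = data_dir / f"{split}_{role}.pkl"
        if path.exists():
            loaded[role] = load_pickle(path)
    return loaded
```

For example, load_split("path/to/dataset", "test") would return a dictionary with only the "X_data" and "X_params" entries, since the test set has no target files. Only unpickle files obtained from a source you trust, because loading a pickle can execute arbitrary code.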
‘Noisy’ observations
The observed data (i.e. the features) are split into two files for easier processing. train_X_data.pkl contains a dictionary of 2D arrays of relative fluxes, each of dimension (55 x 300), where every row is the time series (300 time steps, denoted t# below) of one wavelength channel (55 channels, denoted w# below). train_X_params.pkl is a table containing 6 stellar and planet parameters.
train_X_data.pkl is a nested Python dictionary with the following structure:
|- 0001_01_01.txt
   |- observation
|- 0001_01_02.txt
   |- observation
The .txt files are named following the convention AAAA_BB_CC.txt. The name is unique for each observation (i.e. data point): AAAA (0001 to 2097) is the index of the planet observed, BB (01 to 10) is the index of the stellar spot noise instance, and CC (01 to 10) is the index of the Gaussian photon noise instance.
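The naming convention can be unpacked with a few lines of Python. This is only a sketch: the ObservationId type and the parse_observation_name function are hypothetical names introduced here, not part of the provided data or tooling.

```python
import re
from typing import NamedTuple

# AAAA_BB_CC.txt: planet index, stellar spot noise instance, photon noise instance.
_NAME_PATTERN = re.compile(r"^(\d{4})_(\d{2})_(\d{2})\.txt$")


class ObservationId(NamedTuple):
    planet: int
    stellar_spot: int
    photon: int


def parse_observation_name(name: str) -> ObservationId:
    """Split an observation file name into its three integer indices."""
    match = _NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected observation name: {name!r}")
    return ObservationId(*(int(part) for part in match.groups()))


print(parse_observation_name("0001_01_02.txt"))
# ObservationId(planet=1, stellar_spot=1, photon=2)
```

One possible use is to group observations by planet index when building a validation split, so that noise instances of the same planet do not end up on both sides of the split.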
Each observation is organised as follows (without the column and row names):
(t1) (t2) ... (t300)
(w1) 1.00010151742 1.00010218526 ... 1.00001215251
(w2) 0.999857792623 1.00009976297 ... 1.00007764626
(...) ... ... ... ...
(w55) 0.999523150082 0.999468565171 ... 0.999934661757
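For model training it is often convenient to hold all observations in a single array. The sketch below stacks the dictionary into an array of shape (N, 55, 300) using NumPy. It assumes each dictionary value converts directly to a 55 x 300 array; if the pickle has an extra nesting level, unwrap it first. The stack_observations function and the synthetic data are illustrative only.

```python
import numpy as np


def stack_observations(observations):
    """Stack a {name: 55 x 300 array} mapping into one (N, 55, 300) array.

    Assumes each value is directly convertible to a 55 x 300 array.
    Names are sorted so the row order is fixed and reproducible.
    """
    names = sorted(observations)
    cube = np.stack(
        [np.asarray(observations[name], dtype=np.float64) for name in names]
    )
    if cube.shape[1:] != (55, 300):
        raise ValueError(f"expected (N, 55, 300), got {cube.shape}")
    return names, cube


# Synthetic stand-in for train_X_data.pkl: relative fluxes close to 1.
rng = np.random.default_rng(0)
fake = {
    f"0001_01_{i:02d}.txt": 1 + 1e-4 * rng.standard_normal((55, 300))
    for i in (2, 1)
}
names, cube = stack_observations(fake)
print(names, cube.shape)
# ['0001_01_01.txt', '0001_01_02.txt'] (2, 55, 300)
```

Returning the sorted names alongside the array keeps the link between each row and its observation name, which is needed later to align targets and to build the upload file.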
train_X_params.pkl is a table in the following format:
Index | planet | stellar_spot | photon | star_temp | star_logg | star_rad | star_mass | star_k_mag | period
0001_01_01.txt | 1 | 1 | 1 | 3667.42 | 5.0 | 0.4395 | 0.476 | 9.429 | 5.707100
… | … | … | … | … | … | … | … | … | …
Both train_X_data.pkl and train_X_params.pkl should be used as features for training, as the respective target parameters are given (see the target files below).
test_X_data.pkl and test_X_params.pkl follow the same structure as the training data, and should be used as features to predict and upload their respective target parameters (see upload file format below).
‘Target’ files
train_y_data.pkl contains the retrieved data (i.e. the targets), namely a 1D array of relative radii (planet-to-star radius ratios) of dimension (1 x 55) per observation, where every column corresponds to a particular wavelength channel (there are 55 channels, denoted with w# below). The targets of the regression problem are the 55 relative radii.
The file structure can be seen below (without the column and row names):
|- 0001_01_01.txt
   |- targets
|- 0001_01_02.txt
   |- targets
Each target has the following format:
(w1) (w2) (...) (w55)
(AAAA_BB_CC) 0.0195608058653 0.019439812298 ... 0.0271040897872
The train_y_data.pkl entries should be used as targets for training, as they correspond to the noisy observations (see train_X_data.pkl and/or train_X_params.pkl above). Your task is to predict the corresponding relative radii at each wavelength using test_X_data.pkl and/or test_X_params.pkl.
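Because features and targets are keyed by the same observation names, they can be aligned by name before training. The following is a sketch assuming the targets dictionary maps each name to 55 relative radii stored as a (1 x 55) or (55,) array. The align_targets function and the synthetic values are hypothetical.

```python
import numpy as np


def align_targets(names, targets):
    """Return an (N, 55) target matrix whose rows follow the order of `names`.

    Assumes `targets` maps each observation name to 55 relative radii.
    """
    missing = [name for name in names if name not in targets]
    if missing:
        raise KeyError(
            f"{len(missing)} observations have no target, e.g. {missing[0]}"
        )
    y = np.stack(
        [np.asarray(targets[name], dtype=np.float64).reshape(-1) for name in names]
    )
    if y.shape[1] != 55:
        raise ValueError(f"expected 55 radii per observation, got {y.shape[1]}")
    return y


# Synthetic stand-in for train_y_data.pkl, deliberately out of order.
feature_names = ["0001_01_01.txt", "0001_01_02.txt"]
fake_targets = {
    "0001_01_02.txt": np.full((1, 55), 0.02),
    "0001_01_01.txt": np.full((1, 55), 0.01),
}
y = align_targets(feature_names, fake_targets)
print(y.shape)
# (2, 55)
```

Looking targets up by name, rather than relying on dictionary order, guards against features and targets silently drifting out of step.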
train_y_params.pkl contains 2 planet parameters (‘sma’ and ‘incl’). They are optional: they can be used as intermediate targets or ignored, and may or may not help with model training. These parameters are not provided for the leaderboard/final evaluation data.
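Index | planet | stellar_spot | photon | sma | incl |
0001_01_01.txt | 1 | 1 | 1 | 3667.42 | 7.300915e+09 | 88.779129
… | … | … | … | … | … |
In this example row, the value 3667.42 has no matching header and appears to be star_temp repeated from train_X_params.pkl; the last two values are sma and incl.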
Note: If you find it useful, you can use the two additional parameters in train_y_params.pkl, semimajor axis (‘sma’) and inclination (‘incl’), which are provided ONLY for the training set examples, as intermediate targets for predicting the actual 55 targets. Otherwise you can ignore them.
Upload file
The file to be uploaded should contain all the predictions of the 55 relative radii that correspond to the noisy observations in test_X_data.pkl (sorted by the observation name), but not the planetary parameters ‘sma’ and ‘incl’ (see files above).
The file structure can be seen below (without the column and row names):
(w1) (w2) (...) (w55)
(0005_01_01) 0.0195608058653 0.019439812298 ... 0.0271040897872
(0005_01_02) 0.0195608058653 0.019439812298 ... 0.0271040897872
(...) ... ... ... ...
(2096_10_10) 0.0195608058653 0.019439812298 ... 0.0271040897872
You can upload your predictions here. You should be able to create this structure with a pandas DataFrame and save it as a .csv file.
Once you are done with your presentation, you can upload it via this link.
Error codes and description
- Wrong file type : File is not in .csv format
- Invalid secret code : Typo in secret code or wrong code
- Invalid data format : The file has non-numeric values
- Wrong data format : The number of rows or columns is not as expected
- Invalid Submission : Empty file submitted
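Some of these errors can be caught locally before uploading. The sketch below is a local approximation only: the actual server-side checks are not described here and may differ. The check_submission function and its arguments are hypothetical, and expected_rows must be set to the number of test observations.

```python
import pandas as pd


def check_submission(df, expected_rows, n_channels=55):
    """Return a list of problems found locally; an empty list means none.

    Mirrors the documented error descriptions, not the real server checks.
    `df` holds one row per observation and one column per wavelength.
    """
    problems = []
    if df.empty:
        problems.append("Invalid Submission: empty table")
        return problems
    if df.shape != (expected_rows, n_channels):
        problems.append(
            f"Wrong data format: shape {df.shape}, "
            f"expected {(expected_rows, n_channels)}"
        )
    # Anything that cannot be parsed as a number becomes NaN.
    numeric = df.apply(pd.to_numeric, errors="coerce")
    if numeric.isna().to_numpy().any():
        problems.append("Invalid data format: non-numeric or missing values")
    return problems


columns = [f"w{i + 1}" for i in range(55)]
good = pd.DataFrame([[0.02] * 55] * 3, columns=columns)
print(check_submission(good, expected_rows=3))
# []
```

The file type and secret code errors depend on the upload form itself, so they are left out of this check.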
Here we have included a helper function to help you shape your submission into the right format.
import pandas as pd

def to_submission_format(file_index, matrix):
    # file_index (1D array): file index of the test set;
    # note that it should be in the same order as the given test set!
    # matrix (2D array): N_examples x 55 wavelength channels.
    # The two must have the same number of rows.
    assert len(file_index) == len(matrix)
    column_names = [f'w{i+1}' for i in range(55)]
    submission_df = pd.DataFrame(matrix, columns=column_names)
    submission_df.insert(0, 'files', file_index)
    submission_df = submission_df.set_index('files')
    return submission_df
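A short usage sketch for the helper, with placeholder predictions. The helper is repeated so that the snippet runs on its own. The file names and values are made up, and the CSV is written to an in-memory buffer rather than a real file. Whether the upload expects the header row and index column exactly as written by to_csv is an assumption to verify against the upload form.

```python
import io

import pandas as pd


def to_submission_format(file_index, matrix):
    # Copy of the helper above, repeated so this snippet is standalone.
    assert len(file_index) == len(matrix)
    column_names = [f'w{i+1}' for i in range(55)]
    submission_df = pd.DataFrame(matrix, columns=column_names)
    submission_df.insert(0, 'files', file_index)
    submission_df = submission_df.set_index('files')
    return submission_df


# Placeholder predictions: two observations, 55 relative radii each.
file_index = ["0005_01_01", "0005_01_02"]
predictions = [[0.02] * 55, [0.03] * 55]
submission = to_submission_format(file_index, predictions)

# Write to CSV and read it back to confirm the layout survives a round trip.
buffer = io.StringIO()
submission.to_csv(buffer)
buffer.seek(0)
restored = pd.read_csv(buffer, index_col="files")
print(restored.shape)
# (2, 55)
```

To write an actual file, pass a path such as "submission.csv" to to_csv instead of the buffer.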