This dataset (restructured from original competition data) is kindly provided by Luís F. Simões.
You should find the following pickle files inside the team_workspaces:
ADC21esac_train.pkl
ADC21esac_test.pkl
You should train your model with the data provided in ADC21esac_train.pkl and test your model using the data from ADC21esac_test.pkl. In other words, you should submit your predictions for the data from ADC21esac_test.pkl.
Inside the pickle files you will find everything you need during this hackathon. To open the file you can use the following command in Python:
import pickle as pkl

with open(PATH_TO_FILE, 'rb') as f:
    data = pkl.load(f)
You should find 8 keys inside the loaded dictionary: 'dataset_name', 'planet_idxs', 'obs_to_fname', 'planet', 'X_params', 'y_params', 'X', 'y'.
A brief introduction to the keys:
- dataset_name: The name of the dataset (whether it is train or test).
- planet_idxs: The index of each exoplanet. The number of entries in planet_idxs is much smaller than the number of observations, because 100 instances are simulated for each planet.
- obs_to_fname: A mapping from AAAA_BB_CC.txt to tuples (A, B, C); more on this below.
- planet: Information about the observed planetary system (repeated information from X_params and y_params).
- X_params: Auxiliary information about the observation.
- y_params: Auxiliary information about the target (note that sma and incl will not be provided in the test file).
- X: Light curve observation.
- y: Target (spectrum of each planet in each instance); this will be empty for the test file.
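As a quick sanity check after loading, it may help to confirm that all eight keys listed above are present. The sketch below assumes the pickle holds a plain dict; EXPECTED_KEYS, load_dataset and missing_keys are illustrative names, not part of the provided starter code.

```python
import pickle

# The eight keys listed in the description above.
EXPECTED_KEYS = {
    "dataset_name", "planet_idxs", "obs_to_fname", "planet",
    "X_params", "y_params", "X", "y",
}


def load_dataset(path):
    """Load one of the hackathon pickle files (assumed to hold a dict)."""
    with open(path, "rb") as f:
        return pickle.load(f)


def missing_keys(data):
    """Return the expected keys that are absent from `data`, sorted by name."""
    return sorted(EXPECTED_KEYS - set(data))
```

If the files match the description, missing_keys(load_dataset(PATH_TO_FILE)) would be expected to return an empty list for both the train and the test file.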
‘Noisy’ observations
The observed data (i.e. the features) is contained within the keys X and X_params.
X contains a dictionary of 2D arrays of relative fluxes of dimension (55 x 300), where every row corresponds to a time series (with 300 time steps, denoted with t# below) of a particular wavelength channel (there are 55 channels, denoted with w# below), and X_params is a table containing 6 stellar and planet parameters.
X is a nested dictionary with the following structure:
|- 0001_01_01.txt
|- observation
|- 0001_01_02.txt
|- observation
The .txt files are named following the convention AAAA_BB_CC.txt.
The name is unique for each observation (i.e. data point): AAAA (0001 to 2097) is an index for the planet observed, BB (01 to 10) is an index for the stellar spot noise instance observed, and CC (01 to 10) is an index for the Gaussian photon noise instance observed. You can access a list of the stellar spot noise instances via data['y_params'][0]['stellar_spot'] and of the photon noise instances via data['y_params'][0]['photon'].
Each observation is organised as follows (without the column and row names):
(t1) (t2) ... (t300)
(w1) 1.00010151742 1.00010218526 ... 1.00001215251
(w2) 0.999857792623 1.00009976297 ... 1.00007764626
(...) ... ... ... ...
(w55) 0.999523150082 0.999468565171 ... 0.999934661757
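The AAAA_BB_CC.txt naming convention described above can be unpacked with a small parser. This is a minimal sketch using only the standard library; parse_obs_name and group_by_planet are hypothetical helper names, and the provided obs_to_fname key may already cover the same need.

```python
import re
from collections import defaultdict

# AAAA_BB_CC.txt: planet index, stellar spot instance, photon noise instance.
_NAME_RE = re.compile(r"^(\d{4})_(\d{2})_(\d{2})\.txt$")


def parse_obs_name(name):
    """Split 'AAAA_BB_CC.txt' into integers (planet, stellar_spot, photon)."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected observation name: {name!r}")
    planet, spot, photon = (int(part) for part in match.groups())
    return planet, spot, photon


def group_by_planet(names):
    """Map each planet index to the sorted list of its observation names."""
    groups = defaultdict(list)
    for name in sorted(names):
        groups[parse_obs_name(name)[0]].append(name)
    return dict(groups)
```

For example, parse_obs_name("0001_01_02.txt") returns (1, 1, 2). Grouping by planet keeps all noise instances of one planet together, which can be convenient if you want a validation split in which no planet appears on both sides.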
X_params is a table that can be read as a pandas DataFrame using pd.DataFrame.from_dict(data['X_params'][0]):
Index | planet | stellar_spot | photon | star_temp | star_logg | star_rad | star_mass | star_k_mag | period
0001_01_01.txt | 1 | 1 | 1 | 3667.42 | 5.0 | 0.4395 | 0.476 | 9.429 | 5.707100
… | … | … | … | … | … | … | … | … | …
Note that the information is repeated 100 times, because each planet is simulated 100 times. To find a table with a unique entry for each planet, please see data['planet'][0].
Both X and X_params should be used as features for training, as the respective target parameters are given (see the ‘Target’ files section below).
test_X_data.pkl and test_X_params.pkl follow the same structure as the training data, and should be used as features to predict their respective target parameters (see the upload file format below).
‘Target’ files
y contains the retrieved data (i.e. the targets), namely a 1D array of relative radii (planet-to-star-radius ratios) of dimension (1 x 55), where every column corresponds to a particular wavelength channel (there are 55 channels, denoted with w# below). The targets of the regression problem are the 55 relative radii.
The file structure can be seen below (without the column and row names):
|- 0001_01_01.txt
|- targets
|- 0001_01_02.txt
|- targets
Each target has the following format:
(w1) (w2) (...) (w55)
(AAAA_BB_CC) 0.0195608058653 0.019439812298 ... 0.0271040897872
y should be used as targets for training, as they correspond to the noisy observations (see X and/or X_params above). Your task is to predict the corresponding relative radii at each wavelength using the test observation data contained within X and/or X_params.
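Because observations and targets are keyed by the same file names, one way to build training arrays is to iterate over a sorted list of names so that rows stay aligned. The sketch below assumes two flat dicts, obs_by_name and targets_by_name (hypothetical names), that map each file name directly to its array; the exact nesting inside data['X'] and data['y'] is not spelled out here and should be checked before adapting it. It also assumes NumPy is available.

```python
import numpy as np

N_CHANNELS, N_STEPS = 55, 300  # dimensions stated in the description


def stack_training_arrays(obs_by_name, targets_by_name):
    """Return (names, X, y) with X of shape (N, 55, 300) and y of shape (N, 55)."""
    names = sorted(obs_by_name)  # one fixed order shared by X and y
    X = np.stack([np.asarray(obs_by_name[n], dtype=float) for n in names])
    # Targets are described as (1 x 55); flatten each one to length 55.
    y = np.stack(
        [np.asarray(targets_by_name[n], dtype=float).reshape(-1) for n in names]
    )
    if X.shape[1:] != (N_CHANNELS, N_STEPS):
        raise ValueError(f"unexpected observation shape: {X.shape[1:]}")
    if y.shape[1:] != (N_CHANNELS,):
        raise ValueError(f"unexpected target shape: {y.shape[1:]}")
    return names, X, y
```

Keeping the returned names list is useful later, since predictions for the test set need to be written out in observation-name order.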
y_params contains 2 planet parameters (‘sma’ and ‘incl’), which can be used as intermediate targets or be ignored. They are optional parameters which may or may not help with model training. Note that these parameters are not provided for the test data. You can read them as a pandas DataFrame via pd.DataFrame.from_dict(data['y_params'][0]):
Index | planet | stellar_spot | photon | sma | incl
0001_01_01.txt | 1 | 1 | 1 | 7.300915e+09 | 88.779129
… | … | … | … | … | …
Note: If you find it useful, you can use the two additional parameters (inside y_params) that are provided ONLY for the training set examples, the semimajor axis ‘sma’ and the inclination ‘incl’, as intermediate targets for predicting the actual 55 targets. Otherwise you can ignore them.
Uploading your prediction
The prediction you upload should contain the predictions of the 55 relative radii for all the noisy observations X within the test file (sorted by the observation name).
The file structure can be seen below (without the column and row names):
(w1) (w2) (...) (w55)
(0003_01_01) 0.0195608058653 0.019439812298 ... 0.0271040897872
(0003_01_02) 0.0195608058653 0.019439812298 ... 0.0271040897872
(...) ... ... ... ...
(4381_10_10) 0.0195608058653 0.019439812298 ... 0.0271040897872
Once you are happy with your prediction, you should submit the prediction via the provided API. More on this in the starter notebook on ESA Datalabs.
Error codes and description
Here we provide a list of the error codes and their descriptions.
- Wrong file type : File is not in .csv format
- Invalid secret code : Typo in secret code or wrong code
- Invalid data format : The file has non-numeric values
- Wrong data format : The number of rows or columns is not as expected
- Invalid Submission : Empty file submitted
Here we have included a helper function to help you shape your submission into the right format.
import pandas as pd

def to_submission_format(file_index, matrix):
    # file_index (1D array): file index of the test set, in the form of AAAA_BB_CC.txt
    # note that it should be in the same order as the given test set!
    # matrix (2D array): N_examples X 55 wl
    # file_index and matrix should match each other row by row.
    assert len(file_index) == len(matrix)
    column_names = [f'w{i+1}' for i in range(55)]
    submission_df = pd.DataFrame(matrix, columns=column_names)
    submission_df.insert(0, 'files', file_index)
    submission_df = submission_df.set_index('files')
    return submission_df
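Before uploading, it can be worth running a local check that mirrors the error descriptions listed above. The sketch below is only an approximation: the actual server-side checks are not documented here, and it assumes the CSV has a header row and a leading column of file names (as you would get by writing a DataFrame indexed by file name with to_csv). check_submission_csv is a hypothetical helper, not part of the provided API.

```python
import csv
import math

N_CHANNELS = 55  # number of wavelength channels per prediction


def _is_number(text):
    """True if `text` parses as a finite float."""
    try:
        return math.isfinite(float(text))
    except ValueError:
        return False


def check_submission_csv(path, n_expected_rows):
    """Return the error labels found by this local check; [] means none.

    Assumption: first row is a header, first column holds the file names.
    """
    if not path.lower().endswith(".csv"):
        return ["Wrong file type"]
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    body = rows[1:]  # skip the assumed header row
    if not body:
        return ["Invalid Submission"]
    problems = []
    if len(body) != n_expected_rows or any(len(r) != N_CHANNELS + 1 for r in body):
        problems.append("Wrong data format")
    if any(not _is_number(v) for r in body for v in r[1:]):
        problems.append("Invalid data format")
    return problems
```

A call such as check_submission_csv('pred.csv', n_expected_rows) returning an empty list only means that this local sketch found nothing wrong; the submission API remains the authority on what is accepted.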