CSED Hackathon 2024 - Data Format

The dataset can be found here. This dataset (restructured from original competition data) is kindly provided by Luís F. Simões.

Once you have uncompressed the dataset, you should find the following items inside:

train_X_data.pkl
train_X_params.pkl
train_y_data.pkl
train_y_params.pkl
test_X_data.pkl
test_X_params.pkl

‘Noisy’ observations

The observed data (i.e. the features) is split into two files for easier processing.
train_X_data.pkl contains a dictionary of 2D arrays of relative fluxes, each of dimension (55 x 300). Every row corresponds to the timeseries (300 time steps, denoted with t# below) of a particular wavelength channel (55 channels, denoted with w# below). train_X_params.pkl is a table containing 6 stellar and planet parameters.

train_X_data.pkl is a nested Python dictionary with the following structure:

|- 0001_01_01.txt
      |- observation
|- 0001_01_02.txt
      |- observation

The .txt files are named following the convention: AAAA_BB_CC.txt

The name is unique for each observation (i.e. datapoint): AAAA (0001 to 2097) is an index for the planet observed, BB (01 to 10) is an index for the stellar spot noise instance observed, and CC (01 to 10) is an index for the Gaussian photon noise instance observed.
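The naming convention above can be unpacked with a small parser. This is a minimal sketch using only the standard library; the names parse_observation_name and ObservationId are illustrative and not part of the dataset.

```python
import re
from typing import NamedTuple

# Matches names such as "0001_01_01.txt" (AAAA_BB_CC.txt).
NAME_PATTERN = re.compile(r"^(\d{4})_(\d{2})_(\d{2})\.txt$")


class ObservationId(NamedTuple):
    planet: int        # AAAA: index of the planet observed
    stellar_spot: int  # BB: index of the stellar spot noise instance
    photon: int        # CC: index of the photon noise instance


def parse_observation_name(name: str) -> ObservationId:
    """Split an observation name into its three integer indices."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected observation name: {name!r}")
    planet, spot, photon = (int(part) for part in match.groups())
    return ObservationId(planet, spot, photon)
```

Grouping observations by the planet index can be handy when building a validation split, since the 100 noise instances of one planet share the same underlying target.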

Each observation is organised as follows (without the column and row names):

      (t1)            (t2)            ...   (t300)
(w1)  1.00010151742   1.00010218526   ...   1.00001215251
(w2)  0.999857792623  1.00009976297   ...   1.00007764626
(...) ...             ...             ...   ...
(w55) 0.999523150082  0.999468565171  ...   0.999934661757
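A quick way to confirm the layout described above is to load the file and check one entry. The sketch below uses only the standard library; load_pickle and check_observation are hypothetical helper names. Whether each dictionary entry is the array itself or holds it under an inner key is not spelled out here, so inspect one entry first and unwrap it if needed.

```python
import pickle

N_CHANNELS = 55  # rows, w1..w55
N_STEPS = 300    # columns, t1..t300


def load_pickle(path):
    """Unpickle one of the dataset files (only unpickle files you trust)."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def check_observation(obs):
    """Raise ValueError unless obs has 55 rows and its first row has 300 entries."""
    n_rows = len(obs)
    n_cols = len(obs[0]) if n_rows else 0
    if (n_rows, n_cols) != (N_CHANNELS, N_STEPS):
        raise ValueError(f"expected (55, 300), got ({n_rows}, {n_cols})")


# Example usage (path is an assumption about where you unpacked the data):
# train_X = load_pickle("train_X_data.pkl")
# first_name = next(iter(train_X))
# check_observation(train_X[first_name])
```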

train_X_params.pkl is a table in the following format:

Index	planet	stellar_spot	photon	star_temp	star_logg	star_rad	star_mass	star_k_mag	period
0001_01_01.txt	1	1	1	3667.42	5.0	0.4395	0.476	9.429	5.707100
…	…	…	…	…	…	…	…	…	…

Both train_X_data.pkl and train_X_params.pkl should be used as features for training as the respective target parameters are given (see parameters files below).
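One simple way to use both files together is to flatten each (55 x 300) array and append the 6 parameters, giving one feature row per observation. The sketch below is only one possible arrangement. It assumes that each dictionary value is the array itself and that the parameter table is a pandas DataFrame indexed by observation name, with the column names shown in the table above; build_feature_matrix is a hypothetical name.

```python
import numpy as np
import pandas as pd

# The 6 stellar and planet parameters, as named in the table header.
PARAM_COLUMNS = ["star_temp", "star_logg", "star_rad",
                 "star_mass", "star_k_mag", "period"]


def build_feature_matrix(data, params):
    """Return (names, X) where X has one row of 55*300 fluxes + 6 parameters."""
    names = sorted(data)  # fixed, reproducible row order
    flux = np.stack([np.asarray(data[n], dtype=float).ravel() for n in names])
    extra = params.loc[names, PARAM_COLUMNS].to_numpy(dtype=float)
    return names, np.hstack([flux, extra])
```

The parameters live on very different scales from the relative fluxes (which are all close to 1), so some form of normalisation is likely needed before feeding this matrix to a model.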

test_X_data.pkl and test_X_params.pkl follow the same structure as the training data, and should be used as features to predict and upload their respective target parameters (see upload file format below).

‘Target’ files

train_y_data.pkl contains the retrieved data (i.e. the targets), namely a 1D array of relative radii (planet-to-star-radius ratios) of dimension (1 x 55), where every column corresponds to a particular wavelength channel (there are 55 channels, denoted with w# below). The targets of the regression problem are the 55 relative radii.

The file structure can be seen below (without the column and row names):

|- 0001_01_01.txt
      |- targets 
|- 0001_01_02.txt
      |- targets

Each target has the following format:

              (w1)            (w2)            (...) (w55)           
(AAAA_BB_CC)  0.0195608058653 0.019439812298  ...   0.0271040897872

train_y_data.pkl should be used as targets for training, as its entries correspond to the noisy files (see train_X_data.pkl and/or train_X_params.pkl above). Your task is to predict the corresponding relative radii at each wavelength using test_X_data.pkl and/or test_X_params.pkl.
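Since features and targets are stored in separate dictionaries keyed by the same observation names, it is worth aligning them explicitly before training. A minimal sketch, with pair_by_name as a hypothetical helper and the assumption that both dictionaries use identical keys:

```python
def pair_by_name(features, targets):
    """Return (names, X_list, y_list) aligned on the sorted observation names."""
    mismatched = set(features) ^ set(targets)  # names present in only one dict
    if mismatched:
        raise KeyError(f"{len(mismatched)} names are not in both dictionaries")
    names = sorted(features)
    return names, [features[n] for n in names], [targets[n] for n in names]
```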

train_y_params.pkl contains 2 planet parameters (‘sma’ and ‘incl’), which can be used as intermediate targets or be ignored. These optional parameters may or may not help with model training. They are not provided for the leaderboard/final evaluation data.

Index	planet	stellar_spot	photon	sma	incl
0001_01_01.txt	1	1	1	7.300915e+09	88.779129
…	…	…	…	…	…

Note: If you find it useful, you can use the two additional parameters (inside train_y_params) that are provided ONLY for the training set examples – (semimajor axis) ‘sma’ and (inclination) ‘incl’ – as intermediate targets for predicting the actual 55 targets. Otherwise you can ignore them.

Upload file

The file to be uploaded should contain all the predictions of the 55 relative radii that correspond to the noisy files in test_X_data.pkl (sorted by the observation name), but not the planetary parameters ‘sma’ and ‘incl’ (see files above).

The file structure can be seen below (without the column and row names):

                  (w1)              (w2)              (...) (w55)
(0005_01_01)      0.0195608058653   0.019439812298    ...   0.0271040897872
(0005_01_02)      0.0195608058653   0.019439812298    ...   0.0271040897872
(...)             ...               ...               ...   ...
(2096_10_10)      0.0195608058653   0.019439812298    ...   0.0271040897872

You can upload your predictions here. You should be able to create this structure with a pandas DataFrame and save it as a .csv file.

Once you are done with your presentation, you can upload it via this link.

Error codes and description

Wrong file type : File is not in .csv format
Invalid secret code : Typo in secret code or wrong code
Invalid data format : The file has non-numeric values
Wrong data format : The number of rows or columns is not as expected
Invalid Submission : Empty file submitted
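Most of the errors listed above can be caught locally before uploading. The sketch below is a rough approximation, not the checks the upload server actually runs. It assumes the .csv has a header row and a leading column of observation names, and that you know the expected number of test observations; check_submission is a hypothetical name, and the secret code cannot be verified locally.

```python
import csv
import os

N_TARGETS = 55  # w1..w55


def check_submission(path, n_expected_rows):
    """Return one of the error labels listed above, or None if no problem is found."""
    if not path.lower().endswith(".csv"):
        return "Wrong file type"
    if os.path.getsize(path) == 0:
        return "Invalid Submission"
    with open(path, newline="") as handle:
        rows = list(csv.reader(handle))
    body = rows[1:]  # assumption: the first row is a header
    if len(body) != n_expected_rows or any(len(r) != N_TARGETS + 1 for r in body):
        return "Wrong data format"
    for row in body:
        for cell in row[1:]:  # assumption: the first cell is the observation name
            try:
                float(cell)
            except ValueError:
                return "Invalid data format"
    return None
```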

Here we have included a helper function to help you shape your submission into the right format.

import pandas as pd

def to_submission_format(file_index, matrix):
    # file_index (1D array): file index of the test set;
    #     note that it should be in the same order as the given test set!
    # matrix (2D array): N_examples x 55 wavelength channels

    # The two inputs should match each other in length.
    assert len(file_index) == len(matrix)

    # One column per wavelength channel: w1 ... w55.
    column_names = [f'w{i+1}' for i in range(55)]
    submission_df = pd.DataFrame(matrix, columns=column_names)
    submission_df.insert(0, 'files', file_index)
    submission_df = submission_df.set_index('files')
    return submission_df
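The frame returned by the helper can then be sorted by observation name and written to disk. The sketch below is an illustration only: save_submission is a hypothetical name, and whether the upload expects the header row and the index column to be present in the .csv is an assumption you should check against the upload page.

```python
import pandas as pd


def save_submission(submission_df, path):
    """Sort rows by observation name (the index) and write them to a CSV file."""
    ordered = submission_df.sort_index()
    ordered.to_csv(path)  # writes the header row and the index column by default
    return ordered


# Example usage, assuming `names` and `predictions` come from your model:
# submission_df = to_submission_format(names, predictions)
# save_submission(submission_df, "submission.csv")
```

Because the names are zero-padded, sorting them as strings gives the same order as sorting by planet, stellar spot, and photon index.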
