If you are looking at this page, you are probably curious about the test data, or you are trying to figure out the discrepancy between your local score (produced on the validation data) and the score we have on the leaderboard.
In short, as you may have realised, the test data is not produced by a simple train-test split. In fact, we made it deliberately different from the training distribution.
The test set is sub-divided into 4 subsets (see the figure below). In-Distribution means that participants will have seen similar data in the training set, and Out-of-Distribution means that this information is not available from the training data. For example, Set 3 is composed of planets unseen in the training set (specifically, their stellar parameters, orbital inclination, etc.), while the atmospheric model remains the same as the one used to generate the training set. The idea is that as you go from Set 1 to Set 4, the data become increasingly different from the training distribution.
We don’t create a different test set just to make your life harder. Exoplanet characterisation is a field where ground truth is difficult to come by (we can’t physically travel to a planet and take measurements). The models we build are always simplifications and, quite often, wrong descriptions of what is actually happening in the exoplanet’s atmosphere. The recent release of the WASP-96 b observation from JWST NIRISS is an excellent example (see below). While our best-fit model is able to infer the presence of water from the data, it is obvious that the model cannot fully explain the fluctuations in the fluxes (see 0.75 - 1 micron, for example).
We would like to simulate the same scenario in this competition, i.e. we want to expose the submitted solutions to distributions that differ from their training distribution. Our end goal is an ML solution that can maintain consistent performance when exposed to data that is drastically different from its training distribution. To evaluate this performance, all the test data we produced are complemented with corresponding retrieval results. The retrievals, or sampling procedures, are carried out using the same atmospheric model that produced the training data in the first place.
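To make "reproducing the posterior" concrete, here is a minimal sketch of one way a predicted posterior could be compared against a retrieval posterior, using a two-sample Kolmogorov-Smirnov statistic per parameter. This is purely illustrative: the metric, the example parameter, and the sample sets below are our own assumptions, not the competition's actual scoring code.

```python
import numpy as np

def ks_statistic(samples_a, samples_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum distance
    between the empirical CDFs of two 1-D sample sets."""
    a = np.sort(samples_a)
    b = np.sort(samples_b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return np.max(np.abs(cdf_a - cdf_b))

rng = np.random.default_rng(0)
# Hypothetical posterior samples for one atmospheric parameter
# (e.g. a log water abundance): the retrieval's and a surrogate model's.
retrieval = rng.normal(-3.0, 0.5, size=5000)
surrogate = rng.normal(-3.0, 0.5, size=5000)
# Small value here: the two sample sets describe the same distribution.
print(ks_statistic(retrieval, surrogate))
```

A statistic near 0 indicates the two posteriors are hard to distinguish; a value near 1 indicates they barely overlap. A real evaluation would aggregate such per-parameter comparisons across all targets and test examples.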
But surely, if you apply the training atmospheric model to test data drawn from a different distribution, the answer will be wrong or biased?
Yes, you are right. The result will be biased, since the atmospheric model struggles to fit the data: it may lack the necessary physics and/or fail to account for other astrophysical or instrumental effects. A key philosophy behind atmospheric retrieval is model comparison, i.e. telling, in a relatively objective way, which atmospheric model (set of assumptions) is most suitable for a given observation. We don’t care if a model is wrong, since most of them are anyway; what we do care about is whether it can faithfully reproduce the posterior distributions (even for spectra it has never seen before). A model that can accurately reproduce the posterior distributions allows us to approximate the Bayesian evidence, a frequently used tool to compare the adequacy of one model against another.
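To illustrate what comparing models via the Bayesian evidence looks like, here is a toy sketch that approximates the log-evidence for two hypothetical models of the same data and forms the log Bayes factor. The models, priors, data, and grid integration are all assumptions chosen for illustration; real retrievals estimate the evidence with nested sampling over far more complex forward models.

```python
import numpy as np

def log_evidence_grid(data, log_likelihood, prior_grid, prior_pdf):
    """Approximate log Z = log integral of L(data | theta) p(theta) dtheta
    on a 1-D parameter grid (illustrative only)."""
    dtheta = prior_grid[1] - prior_grid[0]
    log_l = np.array([log_likelihood(data, t) for t in prior_grid])
    m = log_l.max()  # log-sum-exp for numerical stability
    return m + np.log(np.sum(np.exp(log_l - m) * prior_pdf(prior_grid) * dtheta))

def gauss_loglike(data, mu, sigma=1.0):
    """Gaussian log-likelihood of the data given mean mu."""
    return -0.5 * np.sum(((data - mu) / sigma) ** 2 + np.log(2 * np.pi * sigma ** 2))

rng = np.random.default_rng(1)
data = rng.normal(2.0, 1.0, size=50)  # toy "observation" centred at 2

grid = np.linspace(-10.0, 10.0, 4001)
flat_prior = lambda t: np.full_like(t, 1.0 / 20.0)  # flat prior on [-10, 10]

# Model A: free mean, marginalised over the prior.
# Model B: mean fixed at 0, so its evidence is just its likelihood.
log_z_a = log_evidence_grid(data, gauss_loglike, grid, flat_prior)
log_z_b = gauss_loglike(data, 0.0)

# Strongly positive: the data favour the free-mean model.
print("log Bayes factor (A vs B):", log_z_a - log_z_b)
```

The log Bayes factor log Z_A - log Z_B quantifies how strongly the data prefer one set of assumptions over another, which is exactly the kind of comparison a well-calibrated posterior emulator would enable at scale.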
Why don’t you just use sampling-based retrieval like you did when producing the ground truth?
Yes and no. Yes, we can always do that; in fact, this is what we have been doing for the past decade. However, with the arrival of JWST and, soon after, Ariel, we are in desperate need of another solution. Simply put, sampling-based retrieval will become prohibitively slow.
Can you produce a training set with multiple atmospheric assumptions?
We can, but there are two obvious issues. First, retrievals are expensive to obtain: we spent 5M CPUh producing the training and test sets, which is a lot of computational resources, and we would prefer not to repeat the same exercise for other atmospheric models. Second, which models should we pick? We simply don’t know, as unfortunately we have no prior knowledge of what to expect for exoplanetary atmospheres (it is still ongoing research).
We are certain there are other ways to circumvent these limitations; unfortunately, with our limited brain power, we did not manage to make it happen. We welcome your suggestions: please feel free to raise them in the Slack channel and/or email us directly.
Can you describe what kind of atmospheric assumptions have been used to produce the test set?
Very sorry but we cannot tell you :(
In essence, they come from the same test set. The 2nd data release has a total of 800 test examples; you have already seen 500 of them, as they are copied directly from the 1st test data release. It will likely be harder, as we have included a few more examples in Set 4 (unknown distributions), which is there to help you better understand how your model will likely perform during the final evaluation phase.
Unfortunately, we will reset the scores. Your previous best score will, however, be kept as a legacy score for your reference. Entry to the final evaluation stage will depend on the new score.