# Documentation

### Light Track

Submissions will be evaluated by comparing the distance between the quantile estimates of each target $t$ given by Nested Sampling ($q_{l,t}$) and by the submission ($\hat{q}_{l,t}$), where $l \in \{1,2,3\}$ indexes the 16th, 50th and 84th percentiles. For each test case $n$ we compute the average relative RMSE over all quantiles and targets, i.e.

$$RMSE_n = \sqrt{\frac{1}{3T}\sum_{l=1}^{3}\sum_{t=1}^{T}\left(\frac{q_{l,t} - \hat{q}_{l,t}}{q_{l,t}}\right)^2},$$

where $T$ is the number of targets.

The participants’ performance on the Light Track will be measured by the average over the entire test set, $\bar{S} = \frac{1}{N}\sum_{n=1}^{N} RMSE_n$. The scoring function is computed in the following way:

$$\text{score} = 1000 \times (1 - \bar{S}).$$

Thus the higher the score, the better the performance. Winners will be selected based on the overall score of the final evaluation test set.
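As an illustration of the Light Track metric described above, here is a minimal sketch. It is not the official implementation: the function name `light_track_score`, the array layout, and the assumption that $\bar{S}$ is mapped to a score as $1000 \times (1 - \bar{S})$ (mirroring the Regular Track transform) are all assumptions.

```python
import numpy as np

def light_track_score(q_true, q_pred):
    """Sketch of the Light Track metric (assumed form, not the official code).

    q_true, q_pred: arrays of shape (N, 3, T) holding the 16th/50th/84th
    percentile estimates for N test cases and T targets.
    """
    # Relative error of each quantile estimate against the NS reference.
    rel_err = (q_pred - q_true) / q_true
    # Relative RMSE per test case, averaged over the 3 quantiles and T targets.
    rmse_n = np.sqrt(np.mean(rel_err ** 2, axis=(1, 2)))
    # Average over the N test cases, then map to a "higher is better" score
    # (assumed transform, by analogy with the Regular Track).
    s_bar = rmse_n.mean()
    return 1000.0 * (1.0 - s_bar)
```

A perfect submission (predicted quantiles equal to the NS quantiles) scores 1000 under this sketch, and the score decreases as the average relative RMSE grows.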

To help you with the model evaluation, we have included the exact metric we used for the challenge here.

### Regular Track

We use the Wasserstein-2 distance, also known as the Earth Mover's distance. It is a metric originating from the theory of optimal transport and is used here to evaluate the level of overlap between two multivariate distributions, with 0 indicating maximally similar (i.e. identical) distributions and 1 maximally dissimilar:

$$W_2(F_n, \hat{F}_n) = \left( \inf_{\gamma \in \Gamma(F_n, \hat{F}_n)} \int_{\mathbb{R}\times\mathbb{R}} \lVert x - y \rVert^2 \, \mathrm{d}\gamma(x, y) \right)^{1/2},$$

where $F_n$ and $\hat{F}_n$ represent the Nested Sampling (NS) generated approximate conditional distribution and the participant’s surrogate distribution for a single test case $n$, respectively. $\Gamma(F_n, \hat{F}_n)$ denotes the set of joint probability distributions on $\mathbb{R}\times\mathbb{R}$ whose marginal distributions are $F_n$ and $\hat{F}_n$ on the first and second factor respectively.
The overall score will be the average obtained over the entire uploaded test set,

$$\bar{W}_2 = \frac{1}{N}\sum_{n=1}^{N} W_2(F_n, \hat{F}_n).$$

We subtract $\bar{W}_2$ from unity and multiply the result by 1000 to turn the score into a monotonically increasing function of performance:

$$\text{score} = 1000 \times (1 - \bar{W}_2).$$
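To make the Regular Track scoring concrete, here is a minimal numpy sketch. The challenge compares multivariate distributions; for brevity this shows the one-dimensional special case, where the optimal coupling between two equal-sized empirical samples simply matches their order statistics. The names `w2_empirical_1d` and `regular_track_score` are hypothetical, not part of the official code.

```python
import numpy as np

def w2_empirical_1d(x, y):
    """Exact Wasserstein-2 distance between two 1-D empirical distributions
    with the same number of samples: sort both samples and take the RMS
    difference of the order statistics."""
    x, y = np.sort(np.asarray(x, float)), np.sort(np.asarray(y, float))
    return float(np.sqrt(np.mean((x - y) ** 2)))

def regular_track_score(ns_samples, surrogate_samples):
    """Sketch of the Regular Track score: average W2 over all test cases,
    then map via 1000 * (1 - W2bar), as described in the text."""
    w2s = [w2_empirical_1d(f, f_hat)
           for f, f_hat in zip(ns_samples, surrogate_samples)]
    return 1000.0 * (1.0 - float(np.mean(w2s)))
```

Identical sample sets give $W_2 = 0$ and hence the maximum score of 1000; a multivariate implementation would replace `w2_empirical_1d` with an optimal-transport solver (e.g. from the POT library).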

We would also like to note that, as of 06-Oct, we are limiting the size of the submitted distribution to at most 5000 samples (minimum 1000). This restriction affects only submissions to the Regular Track, not the Light Track. For more information please see here.

To help you with the model evaluation, we have included the exact metric we used for the challenge here.