  • You can only judge the fairness of the score if you understand the scoring criteria. It is a relative score where the human baseline is defined as 100% – i.e., a task was only included in the challenge if at least two people on the human panel were able to solve it completely, and their action counts serve as the efficiency baseline the AI is compared against.

    From the Technical Report:

    The procedure can be summarized as follows:
    • “Score the AI test taker by its per-level action efficiency” - For each level that the test taker completes, count the number of actions that it took.
    • “As compared to human baseline” - For each level that is counted, compare the AI agent’s action count to a human baseline, defined as the second-best human action count. For example, if the second-best human completed a level in only 10 actions, but the AI agent took 100 to complete it, then the AI agent scores (10/100)^2 for that level, which gets reported as 1%. Note that level scoring uses the square of the efficiency ratio.
    • “Normalized per environment” - Each level is scored in isolation. Each individual level gets a score between 0% (very inefficient) and 100% (matches or surpasses human-level efficiency). The environment score is a weighted average of the level scores across all levels of that environment.
    • “Across all environments” - The total score will be the sum of individual environment scores divided by the total number of environments. This will be a score between 0% and 100%.
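
    The procedure above can be sketched in a few lines of Python. This is just an illustration of the formulas as described, not the benchmark's actual code; in particular, the level weights within an environment are assumed equal here, since the report excerpt doesn't specify them:

    ```python
    def level_score(human_actions: int, ai_actions: int) -> float:
        """Squared efficiency ratio vs. the second-best human, capped at 1.0
        (matching or beating the baseline scores 100%)."""
        return min(human_actions / ai_actions, 1.0) ** 2

    def environment_score(levels: list[tuple[int, int]]) -> float:
        # levels: (human_baseline_actions, ai_actions) per level.
        # Assumes equal weighting; the report only says "weighted average".
        return sum(level_score(h, a) for h, a in levels) / len(levels)

    def total_score(environments: list[list[tuple[int, int]]]) -> float:
        # Plain average of environment scores, as described above.
        return sum(environment_score(env) for env in environments) / len(environments)

    # The worked example from the report: second-best human took 10 actions,
    # the AI agent took 100.
    print(level_score(10, 100))  # 0.01, reported as 1%
    ```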

    So the humans “scored 100%” because that is the baseline by definition, and the AIs are evaluated on how close they got to human correctness and efficiency. A score of 0.26% is 1/0.0026 ≈ 385 times below the human baseline score; and since level scores square the efficiency ratio, that corresponds (roughly, ignoring the averaging across levels and environments) to √385 ≈ 20 times as many actions per level as the second-best human.
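
    The arithmetic behind that interpretation can be checked directly (treating the 0.26% total as if it came from a single squared-efficiency ratio, which is a simplification):

    ```python
    import math

    score = 0.0026  # a reported total score of 0.26%

    # Factor by which the score falls below the 100% human baseline.
    print(1 / score)             # ≈ 384.6

    # Because each level score is (human/AI)^2, the implied action
    # multiple is the square root of that factor.
    print(math.sqrt(1 / score))  # ≈ 19.6 times as many actions
    ```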