TNGHK committed on
Commit 2827249 · verified · 1 Parent(s): cf3c8de

Update README.md

Files changed (1)
  1. README.md +8 -8
README.md CHANGED
@@ -61,14 +61,14 @@ R1T2 operates at a new sweet spot in intelligence vs. output token length. It ap
  Evaluation was performed using the evalchemy framework (pass@1 averaged over 10/5 runs for AIME/GPQAD, at a temperature of 0.6).
  We report measured benchmark results for our R1T2, R1T models and published benchmark results for V3-0324, R1, R1-0528.
 
- | | R1T2 | R1T | V3-0324 | R1 | R1-0528 | Comment |
- |:-----------------------------------|-----:|-----:|--------:|-----:|--------:|:--------|
- | AIME-24 | 82.3 | 74.7 | 59.4 | 79.8 | 91.4 | |
- | AIME-25 | 70.0 | 58.3 | 49.6 | 70.0 | 87.5 | V3-0324 source: AIME-25 measured by us |
- | GPQA-Diamond | 77.9 | 72.0 | 68.4 | 71.5 | 81.0 | |
- | Aider Polyglot | 64.4 | 48.4 | 44.9 | 52.0 | 71.6 | R1T2 source: Aider discord, t=0.75 |
- | EQ-Bench Longform Creative Writing | 76.4 | ./. | 78.1 | 74.6 | 78.9 | see [EQ Bench](https://eqbench.com/creative_writing_longform.html) |
- | Vectara Hallucination Rate | 5.5 | ./. | 8.0 | 14.3 | 7.7 | see [Hallucination Leaderboard](https://github.com/vectara/hallucination-leaderboard), lower hallucination rates are better |
+ | | R1T2 | R1T | V3-0324 | R1 | R1-0528 | Comment | Special source |
+ |:-----------------------------------|-----:|-----:|--------:|-----:|--------:|:--------|:--------|
+ | AIME-24 | 82.3 | 74.7 | 59.4 | 79.8 | 91.4 | | |
+ | AIME-25 | 70.0 | 58.3 | 49.6 | 70.0 | 87.5 | | V3-0324 AIME-25 measured by us |
+ | GPQA-Diamond | 77.9 | 72.0 | 68.4 | 71.5 | 81.0 | | |
+ | Aider Polyglot | 64.4 | 48.4 | 44.9 | 52.0 | 71.6 | R1T2 beats two of its parents, V3-0324 and R1, and was measured to be about 2.2 times more token efficient, i.e. faster, than its third parent, R1-0528 | R1T2 source: Aider discord, t=0.75 |
+ | EQ-Bench Longform Creative Writing | 76.4 | ./. | 78.1 | 74.6 | 78.9 | EQ Bench version before August 8th, 2025 | see [EQ Bench](https://eqbench.com/creative_writing_longform.html) |
+ | Vectara Hallucination Rate | 5.5 | ./. | 8.0 | 14.3 | 7.7 | lower hallucination rates are better, R1T2 is better than all its three parents | see [Hallucination Leaderboard](https://github.com/vectara/hallucination-leaderboard) |
 
  ## Technological background
 