Update README.md
README.md
CHANGED
@@ -61,14 +61,14 @@ R1T2 operates at a new sweet spot in intelligence vs. output token length. It ap
 Evaluation was performed using the evalchemy framework (pass@1 averaged over 10/5 runs for AIME/GPQAD, at a temperature of 0.6).
 We report measured benchmark results for our R1T2, R1T models and published benchmark results for V3-0324, R1, R1-0528.
 
-| | R1T2 | R1T | V3-0324 | R1 | R1-0528 | Comment |
-
-| AIME-24 | 82.3 | 74.7 | 59.4 | 79.8 | 91.4 | |
-| AIME-25 | 70.0 | 58.3 | 49.6 | 70.0 | 87.5 | V3-0324
-| GPQA-Diamond | 77.9 | 72.0 | 68.4 | 71.5 | 81.0 | |
-| Aider Polyglot | 64.4 | 48.4 | 44.9 | 52.0 | 71.6 | R1T2 source: Aider discord, t=0.75 |
-| EQ-Bench Longform Creative Writing | 76.4 | ./. | 78.1 | 74.6 | 78.9 | see [EQ Bench](https://eqbench.com/creative_writing_longform.html) |
-| Vectara Hallucination Rate | 5.5 | ./. | 8.0 | 14.3 | 7.7 | see [Hallucination Leaderboard](https://github.com/vectara/hallucination-leaderboard)
+| | R1T2 | R1T | V3-0324 | R1 | R1-0528 | Comment | Special source |
+|:-----------------------------------|-----:|-----:|--------:|-----:|--------:|:--------|:--------|
+| AIME-24 | 82.3 | 74.7 | 59.4 | 79.8 | 91.4 | | |
+| AIME-25 | 70.0 | 58.3 | 49.6 | 70.0 | 87.5 | | V3-0324 AIME-25 measured by us |
+| GPQA-Diamond | 77.9 | 72.0 | 68.4 | 71.5 | 81.0 | | |
+| Aider Polyglot | 64.4 | 48.4 | 44.9 | 52.0 | 71.6 | R1T2 beats two of its parents, V3-0324 and R1, and was measured to be about 2.2 times more token efficient, i.e. faster, than its third parent, R1-0528 | R1T2 source: Aider discord, t=0.75 |
+| EQ-Bench Longform Creative Writing | 76.4 | ./. | 78.1 | 74.6 | 78.9 | EQ Bench version before August 8th, 2025 | see [EQ Bench](https://eqbench.com/creative_writing_longform.html) |
+| Vectara Hallucination Rate | 5.5 | ./. | 8.0 | 14.3 | 7.7 | lower hallucination rates are better, R1T2 is better than all its three parents | see [Hallucination Leaderboard](https://github.com/vectara/hallucination-leaderboard) |
 
 ## Technological background
 
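
Note on the evaluation line in the hunk context: "pass@1 averaged over 10/5 runs" can be read as scoring each run separately and then averaging the per-run accuracies. The sketch below illustrates that reading only. It is not evalchemy's API. The function name `pass_at_1` and the sample data are hypothetical, and it assumes each run yields exactly one graded answer (correct or incorrect) per problem.

```python
from statistics import mean


def pass_at_1(run_results):
    """Average per-run accuracy, in percent.

    run_results: list of runs; each run is a list of booleans,
    one per problem (True = the single sampled answer was correct).
    """
    per_run = [100.0 * sum(run) / len(run) for run in run_results]
    return mean(per_run)


# Hypothetical data: 3 runs over 4 problems (not real benchmark output).
runs = [
    [True, True, False, True],   # 75% correct
    [True, False, False, True],  # 50% correct
    [True, True, True, True],    # 100% correct
]

score = pass_at_1(runs)
print(f"pass@1 averaged over {len(runs)} runs: {score:.1f}")
```

Averaging over several runs reduces the variance introduced by sampling at a non-zero temperature, which matters for small benchmarks such as AIME.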