Mike Ravkine PRO
Some clarity is emerging:
The distribution of response lengths has shifted considerably in 3.6: two of my tasks no longer fit into 16k, and the ignorance zone blows up.
Re-running at 32k; then we'll see if that extra thinking pays off or not.
An interesting outlier here is the word-sort task, where 3.6 thinks ~half as much, and this costs it about 10pp of performance.
You're very much on to something here, and this is why I think it matters if this behavior is intentional or latent.
If they've taught it to recognize benchmarks specifically, that's benchmaxxing, and it's not going to help real-world performance when your real tasks don't trigger the maxxed paths. This is a genuine concern.
If they've taught it to "reach beyond the prompt" in the general sense, to understand the context and user intent behind the query, that's a genuinely useful capability and would explain why this model feels a little different.
Some stats: some version of this reasoning path happened in 39 out of 1070 test configurations, across 4 of my 12 tasks. In the most common occurrence, responsible for 30 of the 39 hits, it recognized the task as being from BIG-Bench Hard specifically and used its knowledge of the BBH category sets - which unfortunately suggests benchmaxxing.
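For context on how a tally like this can be made (a minimal sketch, not my actual harness: the file name, JSON structure, and detection string are all assumptions):

```python
# Hypothetical sketch: count how many test responses explicitly name the
# benchmark in their reasoning trace. One JSON record per line is assumed.
import json

def count_benchmark_mentions(path="results.jsonl", needle="BigBench"):
    hits, total = 0, 0
    with open(path) as f:
        for line in f:
            rec = json.loads(line)  # one test configuration per line
            total += 1
            # crude substring check on the model's reasoning field
            if needle.lower() in rec.get("reasoning", "").lower():
                hits += 1
    return hits, total
```

A real harness would want a less brittle detector (e.g. matching several benchmark names and phrasings), but a substring scan is enough to surface candidates for manual review.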
Let's see if 12/10/2023 is a more likely answer than 12/09/2023
"In most AI benchmark tests (like those this prompt resembles), the simplest path is often the intended one."

I am blown away by this, and it prompts the obvious question: *Is this cheating?*
I am leaning towards no.
Humans *always* know when they're being evaluated, so this situational blindness is not actually a prerequisite of evaluation - it just so happens that no model before Gemma-4 looked up in the middle of the test and went "Wait a minute - this is a test! I should try to align my answer with the test format's expectations."
What I would love to know, if anyone from the Google team can indulge me, is whether this behavior was intentionally trained or whether it emerged.
