Issues with Dabstep v1

#16
by justinlangsethgenesis - opened

In our work on this benchmark, we've encountered several things that appear to be issues with the current blueprint, ground truth answers, and framework code. Would be great to know which of these are intentional versus things that should be fixed or enhanced in a new version.

Issues that seem like they should be addressed in a new version, and seem to invalidate the current version, unless we and other submitters are missing something:

  • scorer.py is overly permissive: its string matcher accepts negative answers for positive ones, and accepts a single letter when the expected answer is supposed to be letter:number. Many incorrect submission answers appear to be scored as correct even when they omit the numerical portion of the answer entirely or carry the wrong +/- sign.
  • one cluster of hard cases seems improperly limited in which MCCs it expects in the answers (only those in fees.json, rather than those in the MCC lookup table or the payments table). Unless there is an industry practice we are unaware of that makes this logical, it should be explained or repaired (or the MCC lookup table removed if it is not needed)
  • rounding is expected at a different precision than the answer format specifies. A possible explanation is that net fees charged monthly are rounded at the cent, making the "14 decimal precision" format instruction a confusing red herring. But this is not consistent across the hard question clusters: one accepts only 2-digit rounding while asking for 14-digit precision, while another asks for 6-digit precision and does accept 6-digit, but not 2-digit, rounding.
  • one cluster expects simply the ACI code instead of ACI:{fee} as the format instructions require (unless the expected fees are wrong, or all attempts to date are wrong and this is a side effect of the scorer's loose string matching)
  • one cluster inexplicably ignores the is_credit null wildcard in its expected answers, unlike other clusters that correctly honor it
  • a typo in one cluster's answer format (card scheme vs ACI), previously acknowledged by the benchmark authors
  • all submission and scored files are downloadable, which allows derivation of the ground truth and review of other submissions, including their reasoning traces (when present). A premise of this benchmark is that the ground truth stays hidden. We've provided updated leaderboard code in another post here that fixes this and could be used in a new version.
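To illustrate the scorer issue above, here is a minimal sketch of a stricter matcher (hypothetical, not the benchmark's actual scorer.py) that rejects a bare letter when the ground truth is in letter:number form, and treats the sign as significant:

```python
import re

# Pattern for a hypothetical letter:number answer, e.g. "C:-0.15"
PAIR_RE = re.compile(r"([a-z]+):(-?\d+(?:\.\d+)?)")

def strict_match(answer: str, truth: str) -> bool:
    """Stricter comparison: normalize case/whitespace, but require the
    full answer, including sign and numeric portion, to match."""
    a, t = answer.strip().lower(), truth.strip().lower()
    pair = PAIR_RE.fullmatch(t)
    if pair:
        m = PAIR_RE.fullmatch(a)
        # Require both components; compare numerically so "0.50" == "0.5",
        # which still distinguishes +0.15 from -0.15.
        return bool(m) and m.group(1) == pair.group(1) \
            and float(m.group(2)) == float(pair.group(2))
    return a == t

# A loose substring check would accept "C" for "C:-0.15"; this does not.
```

The point is not this particular implementation, only that exactness on the numeric portion and the sign is cheap to enforce and would prevent the false positives we observed.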

Areas that are not directly invalid, but could be made more realistic in a future version, as agents with domain experience find these things confusing as they do not emulate real world patterns:

  • the repetitiveness of the hard cases: there are only 15 actual hard case types, with small input variations. Future versions should have more variety and less rote repetition, unless this is an intentional part of the test of how repeatably agents can follow established patterns
  • the bulk of the 15 hard-problem clusters involve digging into and running relatively complex fee calculation rules which, except for the "what if" scenarios, would likely already be calculated and present in transactional data, computed by a dedicated operational system rather than on the fly by an analytical system
  • large swaths of transactions (seemingly 41%) match no fee rule and count as "free," at a €0 fee (smart domain agents flag this as a likely data quality issue)
  • fees still accrue for transactions "refused by Adyen," and they scale with volume rather than being fixed "decline" fees (this does not appear to be industry practice, and domain agents seek clarification, not provided in the manual, on whether these refused transactions should accrue fees at all)
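As a sanity check for the "free transactions" concern above, the share matching no fee rule is easy to measure once fees have been joined onto the payments table. A toy sketch, assuming a hypothetical `matched_fee_eur` column (in the real benchmark this would come from applying the fees.json rules):

```python
import pandas as pd

# Hypothetical toy data standing in for the payments table with a
# precomputed matched-fee column.
payments = pd.DataFrame({
    "psp_reference": [1, 2, 3, 4, 5],
    "matched_fee_eur": [0.12, 0.0, 0.0, 0.31, 0.0],
})

# Fraction of transactions whose matched fee is exactly zero.
free_share = (payments["matched_fee_eur"] == 0).mean()
print(f"{free_share:.0%} of transactions matched no fee rule")
```

If a figure anywhere near 41% shows up on real merchant data, it is worth either documenting as intentional or flagging as a data quality problem.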

Can we determine which of these things are intentional and "part of the test" and which are unintentional and should be fixed or enhanced in a subsequent version?

justinlangsethgenesis changed discussion title from Issues with Dabstep v1 -- perhaps time for v2? to Issues with Dabstep v1
