---
title: Competition MATH
emoji: 🤗
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
tags:
- evaluate
- metric
description: >-
  This metric is used to assess performance on the Mathematics Aptitude Test of Heuristics (MATH) dataset.
  It first canonicalizes the inputs (e.g., converting "1/2" to "\frac{1}{2}") and then computes accuracy.
---
# Metric Card for Competition MATH

## Metric description

This metric is used to assess performance on the [Mathematics Aptitude Test of Heuristics (MATH) dataset](https://huggingface.co/datasets/competition_math).
It first canonicalizes the inputs (e.g., converting `1/2` to `\\frac{1}{2}`) and then computes accuracy.
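The canonicalization idea can be illustrated with a simplified sketch. This is not the metric's actual implementation (which relies on the `math_equivalence` package); the regex below only handles bare numeric fractions and is purely illustrative:

```python
import re

def canonicalize_fraction(expr: str) -> str:
    """Illustrative, simplified canonicalizer: rewrite bare fractions
    like "1/2" as LaTeX "\\frac{1}{2}". The metric's real normalization
    covers many more cases (radicals, mixed numbers, units, etc.)."""
    return re.sub(r"(\d+)\s*/\s*(\d+)", r"\\frac{\1}{\2}", expr)
```

After both prediction and reference are brought into a canonical form like this, a simple equality check suffices to score them.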
## How to use

This metric takes two arguments:

`predictions`: a list of predictions to score. Each prediction is a string containing natural language and LaTeX.

`references`: a list of references, one for each prediction. Each reference is a string containing natural language and LaTeX.
```python
>>> from evaluate import load
>>> math = load("competition_math")
>>> references = ["\\frac{1}{2}"]
>>> predictions = ["1/2"]
>>> results = math.compute(references=references, predictions=predictions)
```

N.B. To be able to use Competition MATH, you need to install the `math_equivalence` dependency using `pip install git+https://github.com/hendrycks/math.git`.
## Output values

This metric returns a dictionary containing the [accuracy](https://huggingface.co/metrics/accuracy) after canonicalizing inputs, on a scale from 0.0 to 1.0.

### Values from popular papers

The [original MATH dataset paper](https://arxiv.org/abs/2103.03874) reported accuracies ranging from 3.0% to 6.9% for different large language models.
More recent progress on the dataset can be found on the [dataset leaderboard](https://paperswithcode.com/sota/math-word-problem-solving-on-math).
## Examples

Maximal values (full match):

```python
>>> from evaluate import load
>>> math = load("competition_math")
>>> references = ["\\frac{1}{2}"]
>>> predictions = ["1/2"]
>>> results = math.compute(references=references, predictions=predictions)
>>> print(results)
{'accuracy': 1.0}
```

Minimal values (no match):

```python
>>> from evaluate import load
>>> math = load("competition_math")
>>> references = ["\\frac{1}{2}"]
>>> predictions = ["3/4"]
>>> results = math.compute(references=references, predictions=predictions)
>>> print(results)
{'accuracy': 0.0}
```

Partial match:

```python
>>> from evaluate import load
>>> math = load("competition_math")
>>> references = ["\\frac{1}{2}", "\\frac{3}{4}"]
>>> predictions = ["1/5", "3/4"]
>>> results = math.compute(references=references, predictions=predictions)
>>> print(results)
{'accuracy': 0.5}
```
## Limitations and bias

This metric is limited to datasets with the same format as the [Mathematics Aptitude Test of Heuristics (MATH) dataset](https://huggingface.co/datasets/competition_math), and is meant to evaluate the performance of large language models at solving mathematical problems.

N.B. The MATH dataset also assigns a difficulty level to each problem, so disaggregating model performance by difficulty level (as was done in the [original paper](https://arxiv.org/abs/2103.03874)) can give a better indication of how a given model does on problems of a given difficulty than overall accuracy alone.
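Such a per-level breakdown can be sketched as follows. The grouping logic below is illustrative: it assumes each example carries a difficulty `level` field, and uses exact string match as a stand-in for the metric's equivalence check:

```python
from collections import defaultdict

def accuracy_by_level(examples):
    """Group (prediction, reference, level) triples by difficulty level
    and report per-level accuracy. Exact string match stands in for the
    metric's equivalence check, purely for illustration."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, ref, level in examples:
        total[level] += 1
        correct[level] += int(pred == ref)
    return {level: correct[level] / total[level] for level in total}
```

In practice you would replace the exact-match check with a call to the metric's `compute` on each per-level slice of the data.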
## Citation

```bibtex
@article{hendrycksmath2021,
  title={Measuring Mathematical Problem Solving With the MATH Dataset},
  author={Dan Hendrycks
    and Collin Burns
    and Saurav Kadavath
    and Akul Arora
    and Steven Basart
    and Eric Tang
    and Dawn Song
    and Jacob Steinhardt},
  journal={arXiv preprint arXiv:2103.03874},
  year={2021}
}
```
## Further References

- [MATH dataset](https://huggingface.co/datasets/competition_math)
- [MATH leaderboard](https://paperswithcode.com/sota/math-word-problem-solving-on-math)
- [MATH paper](https://arxiv.org/abs/2103.03874)