sunhill committed
Commit ecd5dc6 · 1 Parent(s): 9fa3156

add CIDEr score

Files changed (5)
  1. README.md +41 -19
  2. app.py +44 -2
  3. cider.py +60 -54
  4. cider_scorer.py +204 -0
  5. tests.py +38 -12
README.md CHANGED
@@ -3,46 +3,68 @@ title: CIDEr
  tags:
  - evaluate
  - metric
- description: "TODO: add a description here"
+ description: "CIDEr (Consensus-based Image Description Evaluation) is a metric used to evaluate the quality of image captions by measuring their similarity to human-generated reference captions."
  sdk: gradio
- sdk_version: 3.19.1
+ sdk_version: 5.45.0
  app_file: app.py
  pinned: false
  ---

  # Metric Card for CIDEr

- ***Module Card Instructions:*** *Fill out the following subsections. Feel free to take a look at existing metric cards if you'd like examples.*
+ ***Module Card Instructions:*** *This module implements the CIDEr metric for image captioning evaluation.*

  ## Metric Description
- *Give a brief overview of this metric, including what task(s) it is usually used for, if any.*
+
+ CIDEr (Consensus-based Image Description Evaluation) is a metric used to evaluate the quality of image captions by measuring their similarity to human-generated reference captions. It does this by comparing the n-grams of the candidate caption to the n-grams of the reference captions, and measuring how many n-grams are shared between the candidate and the references.

  ## How to Use
- *Give general statement of how to use the metric*

- *Provide simplest possible example for using the metric*
+ *To use this metric, you can call the `compute` method with the following parameters:*

  ### Inputs
- *List all input arguments in the format below*
- - **input_field** *(type): Definition of input, with explanation if necessary. State any default value(s).*
-
- ### Output Values
-
- *Explain what this metric outputs and provide an example of what the metric output looks like. Modules should return a dictionary with one or multiple key-value pairs, e.g. {"bleu" : 6.02}*
-
- *State the range of possible values that the metric's output can take, as well as what in that range is considered good. For example: "This metric can take on any value between 0 and 100, inclusive. Higher scores are better."*
-
- #### Values from Popular Papers
- *Give examples, preferrably with links to leaderboards or publications, to papers that have reported this metric, along with the values they have reported.*
+
+ - **predictions** *(batch of list of strings): The generated captions to evaluate.*
+ - **references** *(batch of list of strings): The reference captions for each generated caption.*
+
+ ### Output Values
+
+ - **cider_score** *(float): The CIDEr score, returned in a dictionary as {"cider_score": ...}. It ranges from 0 to 1, with higher scores indicating better-quality captions.*

  ### Examples
- *Give code examples of the metric being used. Try to include examples that clear up any potential ambiguity left from the metric description above. If possible, provide a range of examples that show both typical and atypical results, as well as examples where a variety of input parameters are passed.*
-
- ## Limitations and Bias
- *Note any known limitations or biases that the metric has, with links and references if possible.*
+
+ ```python
+ import evaluate
+
+ metric = evaluate.load("sunhill/cider")
+ results = metric.compute(
+     predictions=[["train traveling down a track in front of a road"]],
+     references=[
+         [
+             "a train traveling down tracks next to lights",
+             "a blue and silver train next to train station and trees",
+             "a blue train is next to a sidewalk on the rails",
+             "a passenger train pulls into a train station",
+             "a train coming down the tracks arriving at a station",
+         ]
+     ]
+ )
+ print(results)
+ ```

  ## Citation
- *Cite the source where this metric was introduced.*
+
+ ```bibtex
+ @InProceedings{Vedantam_2015_CVPR,
+     author = {Vedantam, Ramakrishna and Lawrence Zitnick, C. and Parikh, Devi},
+     title = {CIDEr: Consensus-Based Image Description Evaluation},
+     booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
+     month = {June},
+     year = {2015}
+ }
+ ```

  ## Further References
- *Add any useful further references.*
+
+ - [CIDEr](https://github.com/ramavedantam/cider)
+ - [Image Caption Metrics](https://github.com/EricWWWW/image-caption-metrics)
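
For intuition about the n-gram comparison described in the Metric Description above, here is a minimal sketch (not part of this commit); `ngram_counts` is a hypothetical helper that mirrors what `precook` in `cider_scorer.py` computes:

```python
# Sketch of the n-gram counting CIDEr builds on (illustrative only, not part of the commit).
from collections import Counter


def ngram_counts(sentence, max_n=4):
    """Hypothetical helper mirroring precook() in cider_scorer.py: count all 1..max_n-grams."""
    words = sentence.split()
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts


candidate = ngram_counts("train traveling down a track")
reference = ngram_counts("a train traveling down tracks next to lights")

# CIDEr rewards n-grams that the candidate shares with the references,
# weighting them by TF-IDF so that generic n-grams count less.
print(sorted(set(candidate) & set(reference)))
```

The actual metric averages a TF-IDF-weighted cosine similarity over n = 1..4 and over all references, as implemented in `cider_scorer.py` below.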
app.py CHANGED
@@ -1,6 +1,48 @@
+ import sys
+ from pathlib import Path
+
  import evaluate
- from evaluate.utils import launch_gradio_widget
+ import gradio as gr
+ from evaluate import parse_readme


  module = evaluate.load("sunhill/cider")
- launch_gradio_widget(module)
+
+
+ def compute_cider(references, predictions):
+     predictions = [[predictions]]
+     references = [[ref.strip() for ref in references.split(";") if ref.strip()]]
+     return module.compute(predictions=predictions, references=references)["cider_score"]
+
+
+ iface = gr.Interface(
+     fn=compute_cider,
+     inputs=[
+         gr.Textbox(
+             label="References",
+             placeholder="Enter reference texts here, separated by semicolons... (e.g. ref1; ref2; ref3)",
+         ),
+         gr.Textbox(
+             label="Predictions",
+             placeholder="Enter the prediction text here; only one prediction is allowed...",
+         ),
+     ],
+     outputs=gr.Number(label="CIDEr Score"),
+     title="CIDEr Score Evaluator",
+     description="Evaluate a generated caption against reference captions using the CIDEr score.",
+     examples=[
+         [
+             (
+                 "a train traveling down tracks next to lights; "
+                 "a blue and silver train next to train station and trees; "
+                 "a blue train is next to a sidewalk on the rails; "
+                 "a passenger train pulls into a train station; "
+                 "a train coming down the tracks arriving at a station;"
+             ),
+             "train traveling down a track in front of a road",
+         ]
+     ],
+     article=parse_readme(Path(sys.path[0]) / "README.md"),
+ )
+
+ iface.launch()
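
As a small illustration (not part of the commit) of how the new interface's `compute_cider` reshapes the two textbox values before calling `module.compute`:

```python
# Sketch of the input reshaping done by compute_cider() above (not part of the commit).
raw_references = (
    "a train traveling down tracks next to lights; "
    "a passenger train pulls into a train station;"
)
prediction = "train traveling down a track in front of a road"

references = [[ref.strip() for ref in raw_references.split(";") if ref.strip()]]
predictions = [[prediction]]

print(references)
# [['a train traveling down tracks next to lights', 'a passenger train pulls into a train station']]
print(predictions)
# [['train traveling down a track in front of a road']]
```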
cider.py CHANGED
@@ -1,68 +1,61 @@
- # Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
- #
- # Licensed under the Apache License, Version 2.0 (the "License");
- # you may not use this file except in compliance with the License.
- # You may obtain a copy of the License at
- #
- #     http://www.apache.org/licenses/LICENSE-2.0
- #
- # Unless required by applicable law or agreed to in writing, software
- # distributed under the License is distributed on an "AS IS" BASIS,
- # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- # See the License for the specific language governing permissions and
- # limitations under the License.
- """TODO: Add a description here."""
+ """This module implements the CIDEr metric for image captioning evaluation."""

  import evaluate
  import datasets

+ from .cider_scorer import CiderScorer

- # TODO: Add BibTeX citation
  _CITATION = """\
- @InProceedings{huggingface:module,
-     title = {A great new module},
-     authors={huggingface, Inc.},
-     year={2020}
+ @InProceedings{Vedantam_2015_CVPR,
+     author = {Vedantam, Ramakrishna and Lawrence Zitnick, C. and Parikh, Devi},
+     title = {CIDEr: Consensus-Based Image Description Evaluation},
+     booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
+     month = {June},
+     year = {2015}
  }
  """

- # TODO: Add description of the module here
  _DESCRIPTION = """\
- This new module is designed to solve this great ML task and is crafted with a lot of care.
+ This is a metric to evaluate image captioning. It is based on the idea of
+ measuring the consensus between a candidate image caption and a set of
+ reference image captions written by humans. The CIDEr score is computed by
+ comparing the n-grams of the candidate caption to the n-grams of the reference
+ captions, and measuring how many n-grams are shared between the candidate and
+ the references. The score is then normalized by the length of the candidate
+ caption and the number of reference captions.
  """


- # TODO: Add description of the arguments of the module here
  _KWARGS_DESCRIPTION = """
- Calculates how good are predictions given some references, using certain scores
+ CIDEr (Consensus-based Image Description Evaluation) is a metric for evaluating the quality of image captions.
+ It measures how similar a generated caption is to a set of reference captions written by humans.
  Args:
-     predictions: list of predictions to score. Each predictions
-         should be a string with tokens separated by spaces.
-     references: list of reference for each prediction. Each
-         reference should be a string with tokens separated by spaces.
+     predictions: list of predictions to score.
+     references: list of references for each prediction.
  Returns:
-     accuracy: description of the first score,
-     another_score: description of the second score,
+     score: CIDEr score.
  Examples:
-     Examples should be written in doctest format, and should illustrate how
-     to use the function.
-
-     >>> my_new_module = evaluate.load("my_new_module")
-     >>> results = my_new_module.compute(references=[0, 1], predictions=[0, 1])
-     >>> print(results)
-     {'accuracy': 1.0}
+     >>> metric = evaluate.load("sunhill/cider")
+     >>> results = metric.compute(
+             predictions=[['train traveling down a track in front of a road']],
+             references=[
+                 [
+                     'a train traveling down tracks next to lights',
+                     'a blue and silver train next to train station and trees',
+                     'a blue train is next to a sidewalk on the rails',
+                     'a passenger train pulls into a train station',
+                     'a train coming down the tracks arriving at a station'
+                 ]
+             ]
+         )
  """

- # TODO: Define external resources urls if needed
- BAD_WORDS_URL = "http://url/to/external/resource/bad_words.txt"
-

  @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
  class CIDEr(evaluate.Metric):
-     """TODO: Short description of my evaluation module."""
+     """CIDEr metric."""

      def _info(self):
-         # TODO: Specifies the evaluate.EvaluationModuleInfo object
          return evaluate.MetricInfo(
              # This is the description that will appear on the modules page.
              module_type="metric",
@@ -70,26 +63,39 @@ class CIDEr(evaluate.Metric):
              citation=_CITATION,
              inputs_description=_KWARGS_DESCRIPTION,
              # This defines the format of each prediction and reference
-             features=datasets.Features({
-                 'predictions': datasets.Value('int64'),
-                 'references': datasets.Value('int64'),
-             }),
+             features=datasets.Features(
+                 {
+                     "predictions": datasets.List(datasets.Value("string")),
+                     "references": datasets.List(datasets.Value("string")),
+                 }
+             ),
              # Homepage of the module for documentation
-             homepage="http://module.homepage",
+             homepage="https://huggingface.co/spaces/sunhill/cider",
              # Additional links to the codebase or references
-             codebase_urls=["http://github.com/path/to/codebase/of/new_module"],
-             reference_urls=["http://path.to.reference.url/new_module"]
+             codebase_urls=[
+                 "https://github.com/ramavedantam/cider",
+                 "https://github.com/EricWWWW/image-caption-metrics",
+             ],
+             reference_urls=[
+                 (
+                     "https://openaccess.thecvf.com/content_cvpr_2015/html/"
+                     "Vedantam_CIDEr_Consensus-Based_Image_2015_CVPR_paper.html"
+                 )
+             ],
          )

      def _download_and_prepare(self, dl_manager):
          """Optional: download external resources useful to compute the scores"""
-         # TODO: Download external resources if needed
          pass

      def _compute(self, predictions, references):
          """Returns the scores"""
-         # TODO: Compute the different scores of the module
-         accuracy = sum(i == j for i, j in zip(predictions, references)) / len(predictions)
-         return {
-             "accuracy": accuracy,
-         }
+         assert len(predictions) == len(references), (
+             "The number of predictions and references should be the same. "
+             f"Got {len(predictions)} predictions and {len(references)} references."
+         )
+         cider_scorer = CiderScorer(n=4, sigma=6.0)
+         for pred, ref in zip(predictions, references):
+             cider_scorer += (pred[0], ref)
+         score, _ = cider_scorer.compute_score()
+         return {"cider_score": score.item()}
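
A minimal local sketch of what `_compute` does with the scorer (not part of the commit; it assumes `cider_scorer.py` sits next to the script so the import resolves, and that numpy is installed):

```python
# Sketch of how _compute() drives CiderScorer (assumes cider_scorer.py is importable locally).
from cider_scorer import CiderScorer

predictions = [["train traveling down a track in front of a road"]]
references = [
    [
        "a train traveling down tracks next to lights",
        "a passenger train pulls into a train station",
    ]
]

scorer = CiderScorer(n=4, sigma=6.0)
for pred, refs in zip(predictions, references):
    # each added item is a (hypothesis string, list of reference strings) pair
    scorer += (pred[0], refs)

mean_score, per_image_scores = scorer.compute_score()
print({"cider_score": float(mean_score)})  # same shape of result as _compute() returns
```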
cider_scorer.py ADDED
@@ -0,0 +1,204 @@
+ #!/usr/bin/env python
+ # Tsung-Yi Lin <[email protected]>
+ # Ramakrishna Vedantam <[email protected]>
+
+ import math
+ import copy
+ from collections import defaultdict
+
+ import numpy as np
+
+
+ def precook(s, n=4, out=False):
+     """
+     Takes a string as input and returns an object that can be given to
+     either cook_refs or cook_test. This is optional: cook_refs and cook_test
+     can take string arguments as well.
+     :param s: string : sentence to be converted into ngrams
+     :param n: int : number of ngrams for which representation is calculated
+     :return: term frequency vector for occurring ngrams
+     """
+     words = s.split()
+     counts = defaultdict(int)
+     for k in range(1, n + 1):
+         for i in range(len(words) - k + 1):
+             ngram = tuple(words[i: i + k])
+             counts[ngram] += 1
+     return counts
+
+
+ def cook_refs(refs, n=4):
+     """Takes a list of reference sentences for a single segment
+     and returns an object that encapsulates everything that BLEU
+     needs to know about them.
+     :param refs: list of string : reference sentences for some image
+     :param n: int : number of ngrams for which (ngram) representation is calculated
+     :return: result (list of dict)
+     """
+     # lhuang: oracle will call with "average"
+     return [precook(ref, n) for ref in refs]
+
+
+ def cook_test(test, n=4):
+     """Takes a test sentence and returns an object that
+     encapsulates everything that BLEU needs to know about it.
+     :param test: list of string : hypothesis sentence for some image
+     :param n: int : number of ngrams for which (ngram) representation is calculated
+     :return: result (dict)
+     """
+     return precook(test, n, True)
+
+
+ class CiderScorer(object):
+     """CIDEr scorer."""
+
+     def copy(self):
+         """copy the refs."""
+         new = CiderScorer(n=self.n)
+         new.ctest = copy.copy(self.ctest)
+         new.crefs = copy.copy(self.crefs)
+         return new
+
+     def __init__(self, test=None, refs=None, n=4, sigma=6.0):
+         """singular instance"""
+         self.n = n
+         self.sigma = sigma
+         self.crefs = []
+         self.ctest = []
+         self.document_frequency = defaultdict(float)
+         self.cook_append(test, refs)
+         self.ref_len = None
+
+     def cook_append(self, test, refs):
+         """called by constructor and __iadd__ to avoid creating new instances."""
+
+         if refs is not None:
+             self.crefs.append(cook_refs(refs))
+             if test is not None:
+                 # N.B.: -1
+                 self.ctest.append(cook_test(test))
+             else:
+                 self.ctest.append(None)  # lens of crefs and ctest have to match
+
+     def size(self):
+         assert len(self.crefs) == len(self.ctest), "refs/test mismatch! %d<>%d" % (
+             len(self.crefs),
+             len(self.ctest),
+         )
+         return len(self.crefs)
+
+     def __iadd__(self, other):
+         """add an instance (e.g., from another sentence)."""
+
+         if type(other) is tuple:
+             # avoid creating new CiderScorer instances
+             self.cook_append(other[0], other[1])
+         else:
+             self.ctest.extend(other.ctest)
+             self.crefs.extend(other.crefs)
+
+         return self
+
+     def compute_doc_freq(self):
+         """
+         Compute term frequency for reference data.
+         This will be used to compute idf (inverse document frequency later)
+         The term frequency is stored in the object
+         :return: None
+         """
+         for refs in self.crefs:
+             # refs, k ref captions of one image
+             for ngram in set([ngram for ref in refs for (ngram, count) in ref.items()]):
+                 self.document_frequency[ngram] += 1
+             # maxcounts[ngram] = max(maxcounts.get(ngram,0), count)
+
+     def compute_cider(self):
+         def counts2vec(cnts):
+             """
+             Function maps counts of ngram to vector of tfidf weights.
+             The function returns vec, an array of dictionary that store mapping of n-gram and tf-idf weights.
+             The n-th entry of array denotes length of n-grams.
+             :param cnts:
+             :return: vec (array of dict), norm (array of float), length (int)
+             """
+             vec = [defaultdict(float) for _ in range(self.n)]
+             length = 0
+             norm = [0.0 for _ in range(self.n)]
+             for ngram, term_freq in cnts.items():
+                 # give word count 1 if it doesn't appear in reference corpus
+                 df = np.log(max(1.0, self.document_frequency[ngram]))
+                 # ngram index
+                 n = len(ngram) - 1
+                 # tf (term_freq) * idf (precomputed idf) for n-grams
+                 vec[n][ngram] = float(term_freq) * (self.ref_len - df)
+                 # compute norm for the vector. the norm will be used for computing similarity
+                 norm[n] += pow(vec[n][ngram], 2)
+
+                 if n == 1:
+                     length += term_freq
+             norm = [np.sqrt(n) for n in norm]
+             return vec, norm, length
+
+         def sim(vec_hyp, vec_ref, norm_hyp, norm_ref, length_hyp, length_ref):
+             """
+             Compute the cosine similarity of two vectors.
+             :param vec_hyp: array of dictionary for vector corresponding to hypothesis
+             :param vec_ref: array of dictionary for vector corresponding to reference
+             :param norm_hyp: array of float for vector corresponding to hypothesis
+             :param norm_ref: array of float for vector corresponding to reference
+             :param length_hyp: int containing length of hypothesis
+             :param length_ref: int containing length of reference
+             :return: array of score for each n-grams cosine similarity
+             """
+             delta = float(length_hyp - length_ref)
+             # measure cosine similarity
+             val = np.array([0.0 for _ in range(self.n)])
+             for n in range(self.n):
+                 # ngram
+                 for ngram, count in vec_hyp[n].items():
+                     # vrama91 : added clipping
+                     val[n] += (
+                         min(vec_hyp[n][ngram], vec_ref[n][ngram]) * vec_ref[n][ngram]
+                     )
+
+                 if (norm_hyp[n] != 0) and (norm_ref[n] != 0):
+                     val[n] /= norm_hyp[n] * norm_ref[n]
+
+                 assert not math.isnan(val[n])
+                 # vrama91: added a length based gaussian penalty
+                 val[n] *= np.e ** (-(delta**2) / (2 * self.sigma**2))
+             return val
+
+         # compute log reference length
+         self.ref_len = np.log(float(len(self.crefs)))
+         if len(self.crefs) == 1:
+             self.ref_len = 1
+         scores = []
+         for test, refs in zip(self.ctest, self.crefs):
+             # compute vector for test captions
+             vec, norm, length = counts2vec(test)
+             # compute vector for ref captions
+             score = np.array([0.0 for _ in range(self.n)])
+             for ref in refs:
+                 vec_ref, norm_ref, length_ref = counts2vec(ref)
+                 score += sim(vec, vec_ref, norm, norm_ref, length, length_ref)
+             # change by vrama91 - mean of ngram scores, instead of sum
+             score_avg = np.mean(score)
+             # divide by number of references
+             score_avg /= len(refs)
+             # multiply score by 10
+             # score_avg *= 10.0
+             # append score of an image to the score list
+             scores.append(score_avg)
+         return scores
+
+     def compute_score(self, option=None, verbose=0):
+         # compute idf
+         self.compute_doc_freq()
+         # assert to check document frequency
+         assert len(self.ctest) >= max(self.document_frequency.values())
+         # compute cider score
+         score = self.compute_cider()
+         # debug
+         # print score
+         return np.mean(np.array(score)), np.array(score)
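
Two details of `sim` above are easy to miss: the per-n-gram cosine similarity uses clipped TF-IDF counts, and the result is damped by a Gaussian penalty on the length difference between hypothesis and reference. A small sketch of that penalty (not part of the commit; the lengths are hypothetical):

```python
# Sketch of the Gaussian length penalty applied in sim() (not part of the commit).
import numpy as np

sigma = 6.0                       # default used by CiderScorer
length_hyp, length_ref = 8, 10    # hypothetical caption lengths

delta = float(length_hyp - length_ref)
penalty = np.e ** (-(delta ** 2) / (2 * sigma ** 2))
print(round(penalty, 3))  # 0.946 -> similar lengths are barely penalized

delta = 15.0  # a large length mismatch
print(round(np.e ** (-(delta ** 2) / (2 * sigma ** 2)), 3))  # 0.044 -> the penalty dominates
```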
tests.py CHANGED
@@ -1,17 +1,43 @@
+ import evaluate
+
+
  test_cases = [
      {
-         "predictions": [0, 0],
-         "references": [1, 1],
-         "result": {"metric_score": 0}
+         "predictions": [["train traveling down a track in front of a road"]],
+         "references": [
+             [
+                 "a train traveling down tracks next to lights",
+                 "a blue and silver train next to train station and trees",
+                 "a blue train is next to a sidewalk on the rails",
+                 "a passenger train pulls into a train station",
+                 "a train coming down the tracks arriving at a station",
+             ]
+         ]
      },
      {
-         "predictions": [1, 1],
-         "references": [1, 1],
-         "result": {"metric_score": 1}
+         "predictions": [
+             ["plane is flying through the sky"],
+             ["birthday cake sitting on top of a white plate"],
+         ],
+         "references": [
+             [
+                 "a large jetliner flying over a traffic filled street",
+                 "an airplane flies low in the sky over a city street",
+                 "an airplane flies over a street with many cars",
+                 "an airplane comes in to land over a road full of cars",
+                 "the plane is flying over top of the cars",
+             ],
+             ["a blue plate filled with marshmallows chocolate chips and banana"],
+         ]
      },
-     {
-         "predictions": [1, 0],
-         "references": [1, 1],
-         "result": {"metric_score": 0.5}
-     }
- ]
+ ]
+
+ metric = evaluate.load("sunhill/cider")
+ for i, test_case in enumerate(test_cases):
+     results = metric.compute(
+         predictions=test_case["predictions"], references=test_case["references"]
+     )
+     print(f"Test case {i+1}:")
+     print("Predictions:", test_case["predictions"])
+     print("References:", test_case["references"])
+     print(results)