BLEU (Bilingual Evaluation Understudy) is a metric for evaluating the quality of text that has been machine-translated from one language to another. It is one of the most widely used metrics in natural language processing (NLP), particularly in machine translation. The BLEU score measures the correspondence between a machine's output and that of a human, offering a quantitative assessment of how close machine-generated translations are to a set of reference translations.

# How BLEU Works:

1. **N-gram Matching**: BLEU evaluates the quality of text by considering the precision of n-grams (contiguous sequences of n items from a given sample of text) in the machine-generated text compared to reference texts. It counts how many n-grams in the machine-generated text appear in the reference texts and computes a precision score for each n-gram order (typically up to 4-grams).

2. **Brevity Penalty**: To discourage overly short translations, BLEU incorporates a brevity penalty. If the machine-generated translation is shorter than the reference texts, the BLEU score is penalized, since shorter texts are likely to achieve higher n-gram precision simply by being less verbose.

# Calculation:

The BLEU score is calculated as follows:

- Compute the n-gram precision $P_n$ for n-grams of different lengths (usually 1 to 4).
- Compute a brevity penalty (BP) to penalize translations that are too short.
- The overall BLEU score is the weighted geometric mean of the n-gram precisions, multiplied by the brevity penalty:

$\text{BLEU} = \text{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log P_n\right)$

where $w_n$ are the weights for each n-gram precision (usually uniform, i.e., $w_n = 1/N$) and BP is the brevity penalty. A minimal implementation sketch is given at the end of this article.

# Usage:

The BLEU score is widely used as a quick and inexpensive way to evaluate machine translation quality. It's useful for comparing the performance of different translation models or systems and for tracking improvements in translation quality over time. However, it's not a perfect measure and doesn't capture all aspects of translation quality, such as semantic coherence, grammaticality, or appropriateness of the translation in context.

# Limitations:

- **Lack of Semantic Understanding**: BLEU focuses on surface-form matching and does not account for the meaning or semantic content of the translation.
- **Reference Dependence**: The quality of the BLEU score depends heavily on the quality and variety of the reference translations.
- **Granularity**: It may not capture finer aspects of language such as style, tone, or idiomatic expressions.

Despite these limitations, BLEU remains a standard benchmark in machine translation due to its simplicity, ease of use, and ability to provide a quick comparative measure of translation quality.
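
# Implementation Sketch:

To make the calculation above concrete, here is a minimal, from-scratch sketch of sentence-level BLEU in Python. The function names (`ngrams`, `bleu`) and the whitespace tokenisation are illustrative assumptions, not taken from any particular library; in practice one would typically use an established implementation such as NLTK's `sentence_bleu` or sacrebleu.

```python
# A minimal sketch of sentence-level BLEU, assuming whitespace-tokenised input.
# Function names here are illustrative, not from any particular library.
import math
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=4):
    """Compute BLEU for one candidate against one or more references."""
    weights = [1.0 / max_n] * max_n          # uniform weights w_n = 1/N
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        # Clip each candidate n-gram count by its maximum count in any
        # single reference ("modified" n-gram precision).
        max_ref_counts = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        clipped = sum(min(count, max_ref_counts[gram])
                      for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or clipped == 0:
            return 0.0                       # any zero precision makes BLEU 0
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty: compare candidate length to the closest reference length.
    c = len(candidate)
    r = min((len(ref) for ref in references), key=lambda l: (abs(l - c), l))
    bp = 1.0 if c > r else math.exp(1 - r / c)

    return bp * math.exp(sum(w * p for w, p in zip(weights, log_precisions)))

# Example usage:
candidate = "the quick brown fox jumps over the lazy dog".split()
references = ["the quick brown fox jumped over the lazy dog".split()]
print(round(bleu(candidate, references), 3))  # ≈ 0.597
```

Note that this sketch applies no smoothing: if any n-gram order has zero matches (common for short sentences and 4-grams), the geometric mean collapses and the score is 0. Library implementations usually offer smoothing options to soften this behaviour for sentence-level evaluation.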