Quantifying Variance in Evaluation Benchmarks

Bibliographic Details
Title: Quantifying Variance in Evaluation Benchmarks
Authors: Madaan, Lovish; Singh, Aaditya K.; Schaeffer, Rylan; Poulton, Andrew; Koyejo, Sanmi; Stenetorp, Pontus; Narang, Sharan; Hupkes, Dieuwke
Publication Date: 14 June 2024
Collection: Computer Science
Subject Terms: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: Evaluation benchmarks are the cornerstone of measuring capabilities of large language models (LLMs), as well as driving progress in said capabilities. Originally designed to make claims about capabilities (or lack thereof) in fully pretrained models, evaluation benchmarks are now also extensively used to decide between various training choices. Despite this widespread usage, we rarely quantify the variance in our evaluation benchmarks, which dictates whether differences in performance are meaningful. Here, we define and measure a range of metrics geared towards measuring variance in evaluation benchmarks, including seed variance across initialisations and monotonicity during training. By studying a large number of models -- both openly available and pretrained from scratch -- we provide empirical estimates for a variety of variance metrics, with considerations and recommendations for practitioners. We also evaluate the utility and tradeoffs of continuous versus discrete performance measures and explore options for better understanding and reducing this variance. We find that simple changes, such as framing choice tasks (like MMLU) as completion tasks, can often reduce variance for smaller-scale ($\sim$7B) models, while more involved methods inspired by the human testing literature (such as item analysis and item response theory) struggle to meaningfully reduce variance. Overall, our work provides insights into variance in evaluation benchmarks, suggests LM-specific techniques to reduce variance, and more generally encourages practitioners to carefully factor in variance when comparing models. (An illustrative sketch of two of the variance metrics mentioned here follows the record below.)
Document Type: Working Paper
Access URL: http://arxiv.org/abs/2406.10229
Accession Number: edsarx.2406.10229
Database: arXiv
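
Note on the abstract's variance metrics: the paper discusses, among others, seed variance across pretraining initialisations and monotonicity of scores during training. The sketch below is a minimal, assumed illustration of how such quantities could be estimated from a grid of benchmark scores; the toy score array, the use of the standard deviation at the final checkpoint, and the use of Spearman rank correlation as a monotonicity proxy are choices made here for illustration, not the paper's exact definitions.

    # Minimal sketch (assumed definitions, not the paper's): scores[i, j] is the
    # benchmark score of pretraining seed i at checkpoint j (hypothetical data).
    import numpy as np
    from scipy.stats import spearmanr

    scores = np.array([
        [0.31, 0.35, 0.41, 0.44, 0.47],
        [0.29, 0.36, 0.38, 0.45, 0.46],
        [0.33, 0.34, 0.43, 0.42, 0.49],
    ])

    # Seed variance: spread across initialisation seeds at the final checkpoint.
    final_scores = scores[:, -1]
    print(f"seed std at final checkpoint: {final_scores.std(ddof=1):.4f}")

    # Monotonicity proxy: Spearman rank correlation between checkpoint index and
    # score, computed per seed (1.0 means the score improves at every checkpoint).
    checkpoints = np.arange(scores.shape[1])
    mono = [spearmanr(checkpoints, seed_scores)[0] for seed_scores in scores]
    print(f"per-seed monotonicity: {np.round(mono, 3)}")

With these toy numbers, the seed standard deviation at the final checkpoint is roughly 0.015, and the third seed's monotonicity is 0.9 rather than 1.0 because its score dips at the fourth checkpoint.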