Quantifying Variance in Evaluation Benchmarks
| Field | Value |
|---|---|
| Title | Quantifying Variance in Evaluation Benchmarks |
| Authors | Madaan, Lovish; Singh, Aaditya K.; Schaeffer, Rylan; Poulton, Andrew; Koyejo, Sanmi; Stenetorp, Pontus; Narang, Sharan; Hupkes, Dieuwke |
| Publication Year | 2024 |
| Collection | Computer Science |
| Subject Terms | Computer Science - Machine Learning; Computer Science - Artificial Intelligence |
| Document Type | Working Paper |
| Access URL | http://arxiv.org/abs/2406.10229 |
| Accession Number | edsarx.2406.10229 |
| Database | arXiv |

Abstract: Evaluation benchmarks are the cornerstone of measuring capabilities of large language models (LLMs), as well as driving progress in said capabilities. Originally designed to make claims about capabilities (or lack thereof) in fully pretrained models, evaluation benchmarks are now also extensively used to decide between various training choices. Despite this widespread usage, we rarely quantify the variance in our evaluation benchmarks, which dictates whether differences in performance are meaningful. Here, we define and measure a range of metrics geared towards measuring variance in evaluation benchmarks, including seed variance across initialisations, and monotonicity during training. By studying a large number of models -- both openly available and pretrained from scratch -- we provide empirical estimates for a variety of variance metrics, with considerations and recommendations for practitioners. We also evaluate the utility and tradeoffs of continuous versus discrete performance measures and explore options for better understanding and reducing this variance. We find that simple changes, such as framing choice tasks (like MMLU) as completion tasks, can often reduce variance for smaller scale ($\sim$7B) models, while more involved methods inspired from human testing literature (such as item analysis and item response theory) struggle to meaningfully reduce variance. Overall, our work provides insights into variance in evaluation benchmarks, suggests LM-specific techniques to reduce variance, and more generally encourages practitioners to carefully factor in variance when comparing models.
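
The abstract above refers to seed variance across initialisations and monotonicity during training. As a minimal illustrative sketch only (not code from the paper), assuming hypothetical per-seed accuracies and a made-up checkpoint trajectory, these two notions might be quantified roughly as follows:

```python
# Illustrative sketch (not from the paper): estimating seed variance and
# training monotonicity from benchmark scores. The score arrays below are
# invented examples; accuracy is assumed as the benchmark metric.
import numpy as np
from scipy.stats import spearmanr

# Final benchmark accuracy of the same model trained with different seeds.
seed_scores = np.array([0.412, 0.405, 0.398, 0.420, 0.409])

# Seed variance: spread attributable only to initialisation / data-order noise.
seed_std = seed_scores.std(ddof=1)
print(f"mean accuracy: {seed_scores.mean():.3f} +/- {seed_std:.3f} (1 sd across seeds)")

# A gap between two models that is small relative to seed noise is hard to
# distinguish from run-to-run variation.
model_a, model_b = 0.415, 0.421
print("difference exceeds 2x seed sd:", abs(model_a - model_b) > 2 * seed_std)

# Monotonicity during training: rank correlation between checkpoint index and
# score (values near 1.0 = steadily improving, near 0 = noisy trajectory).
checkpoint_scores = np.array([0.29, 0.33, 0.32, 0.36, 0.38, 0.37, 0.41])
rho, _ = spearmanr(np.arange(len(checkpoint_scores)), checkpoint_scores)
print(f"monotonicity (Spearman rho over checkpoints): {rho:.2f}")
```

Under this framing, a reported difference between two models that falls within a couple of standard deviations of seed noise cannot be ranked with confidence, which is the practical point the abstract urges practitioners to account for.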