The 1st Workshop on Data Contamination (CONDA)

Workshop@ACL 2024

Evaluation data has been compromised!
A workshop on detecting, preventing, and addressing data contamination.

Program schedule (Friday, August 16, 2024)

08:55-09:00 Opening Remarks
09:00-09:45 Invited talk by Margaret Mitchell: On the value of carefully measuring data.
09:45-10:30 Invited talk by Dieuwke Hupkes: Evaluation data contamination: how much is there, and how much does it actually matter?
10:30-11:00 Break
11:00-11:45 Invited talk by Anna Rogers: A Sanity Check on Emergent Properties
11:45-12:00 Best paper presentation
12:00-13:30 Lunch Break
13:30-15:30 Poster Session:
Evaluating Chinese Large Language Models on Discipline Knowledge Acquisition via Assessing Memorization and Robustness
Chuang Liu, Renren Jin, Mark Steedman, Deyi Xiong
Scaling Laws for Data Poisoning in LLMs
Dillon Bowen, Brendan Murphy, Will Cai, David Khachaturov, Adam Gleave, Kellin Pelrine
LLM Dataset Inference: Did you train on my dataset?
Pratyush Maini, Hengrui Jia, Nicolas Papernot, Adam Dziedzic
Rethinking LLM Memorization through the Lens of Adversarial Compression
Avi Schwarzschild, Zhili Feng, Pratyush Maini, Zachary Chase Lipton, J Zico Kolter
TOFU: A Task of Fictitious Unlearning for LLMs
Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary Chase Lipton, J Zico Kolter
Train-to-Test Contamination in Code Generation Evaluations
Alexandre Matton, Elena Tommasone, Dennis Aumiller, Milad Alizadeh, Kylie He, Tom Sherborne, Raymond Ma, Maxime Voisin, Ellen Gilsenan-Mcmahon, Matthias Gallé
Benchmark Inflation: Revealing LLM Performance Gaps Using Retro-Holdouts
Jacob Haimes, Cenny Wenner, Kunvar Thaman, Vassil Tashev, Clement Neo, Esben Kran, Jason Hoelscher-Obermaier
Confounders in Instance Variation for the Analysis of Data Contamination
Behzad Mehrbakhsh, Dario Garigliotti, Fernando Martínez-Plumed, Jose Hernandez-Orallo
Unveiling the Spectrum of Data Contamination in Language Models: A Survey from Detection to Remediation
Chunyuan Deng, Yilun Zhao, Yuzhao Heng, Yitong Li, Jiannan Cao, Xiangru Tang, Arman Cohan
Task Contamination: Language Models May Not Be Few-Shot Anymore
Changmao Li, Jeffrey Flanigan
Using Cochrane Systematic Literature Reviews to Reduce Contamination in the Evaluation of Large Language Models
Wojciech Kusa, Moritz Staudinger, Harrisen Scells, Allan Hanbury
Proving membership in LLM pretraining data via data watermarks
Johnny Wei, Ryan Yixiang Wang, Robin Jia
15:30-16:00 Break
16:00-16:45 Invited talk by Jesse Dodge: Contamination in Web-Scale Datasets and its Impact on Large Model Evaluations
17:00-17:15 Closing Remarks

Background & Scope

Data contamination, where evaluation data is inadvertently included in the pre-training corpora of large-scale models, and language models (LMs) in particular, has become a concern in recent times (Sainz et al., 2023; Jacovi et al., 2023). The growing scale of both models and data, coupled with massive web crawling, has led to the inclusion of segments from evaluation benchmarks in the pre-training data of LMs (Dodge et al., 2021; OpenAI, 2023; Google, 2023; Elazar et al., 2023). The scale of internet data makes it difficult to prevent this contamination from happening, or even to detect when it has happened (Bommasani et al., 2022; Mitchell et al., 2023). Crucially, when evaluation data becomes part of the pre-training data, it introduces biases and can artificially inflate the performance of LMs on specific tasks or benchmarks (Magar and Schwartz, 2022). This poses a challenge for the fair and unbiased evaluation of models, as their performance may not accurately reflect their generalization capabilities.
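
To make the detection problem concrete, below is a minimal sketch of one common heuristic: flagging benchmark examples that share a verbatim n-gram with the pre-training corpus. It is illustrative only; the n-gram length (13), the tokenization, and the input file names are assumptions, not a methodology endorsed by the workshop.

```python
# Minimal n-gram overlap check between a benchmark and a pre-training corpus.
# Illustrative sketch only: file names, whitespace tokenization, and the
# 13-gram window are assumptions, not a prescribed methodology.
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int = 13) -> Set[Tuple[str, ...]]:
    """Return the set of lowercased word n-grams in a string."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contaminated_examples(benchmark: Iterable[str],
                          corpus_docs: Iterable[str],
                          n: int = 13) -> list:
    """Flag benchmark examples sharing at least one n-gram with the corpus."""
    corpus_ngrams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_ngrams |= ngrams(doc, n)
    return [ex for ex in benchmark if ngrams(ex, n) & corpus_ngrams]


if __name__ == "__main__":
    # Hypothetical inputs: one benchmark example / corpus document per line.
    with open("benchmark.txt") as f:
        benchmark = [line.strip() for line in f if line.strip()]
    with open("pretraining_corpus.txt") as f:
        corpus_docs = [line.strip() for line in f if line.strip()]
    flagged = contaminated_examples(benchmark, corpus_docs)
    print(f"{len(flagged)} of {len(benchmark)} benchmark examples overlap with the corpus")
```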

Although a growing number of papers and state-of-the-art models mention issues of data contamination (Brown et al., 2020; Wei et al., 2022; Chowdhery et al., 2022; OpenAI, 2023; Google, 2023; Touvron et al., 2023), there is no agreed-upon definition or standard methodology to ensure that a model does not report results on contaminated benchmarks. Addressing data contamination is a shared responsibility among researchers, developers, and the broader community. By adopting best practices, increasing transparency, documenting vulnerabilities, and conducting thorough evaluations, we can work towards minimizing the impact of data contamination and ensuring fair and reliable evaluations.

Invited speakers

Anna Rogers

Associate Professor at the IT University of Copenhagen

A Sanity Check on Emergent Properties.

Abstract: One of the frequent points in the mainstream narrative about large language models is that they have "emergent properties", but there is a lot of disagreement about what that even means. If they are understood as a kind of generalization beyond training data, that is, something that a model does without being explicitly trained for it, I argue that we have not in fact established the existence of any such properties, and at the moment we do not even have the methodology for doing so.

Jesse Dodge

Research Scientist at the Allen Institute for AI

Contamination in Web-Scale Datasets and its Impact on Large Model Evaluations.

Abstract: We are at a pivotal moment in the history of AI. The AI research community has driven progress for decades, but over the past couple of years industry has started to make significant advances in model capabilities while purposely being closed about how. In this talk I’ll start by discussing different types of contamination and how they appear in the wild. I’ll then discuss some of our work on building massive datasets by scraping the web, including Dolma and C4. I’ll discuss What’s In My Big Data, a toolkit for documenting the contents of web-scale datasets, and some of our results on measuring contamination in different ways across a variety of popular pretraining corpora. I’ll conclude by discussing the evaluation of large models, how current evaluations have low construct validity, and how we don’t have strong evaluations for the actual use cases that users care about.

Dieuwke Hupkes

Research Scientist at Meta

Evaluation data contamination: how much is there, and how much does it actually matter?

Abstract: With many of the current "SOTA" LLMs being closed source and their training data inaccessible, more and more questions arise that relate to potential contamination of the evaluation datasets used to claim their results. Various claims can be found online, ranging from suspicions of outright training on evaluation data to inflate results, to suggestions that the definitions of contamination used may be inadequate and underestimate its impact. However, even with access to the training corpus, contamination and its impact are far from trivial to assess. In this talk, I discuss common ways of measuring contamination and provide empirical data on how much it impacts results for a range of LLMs.

Margaret Mitchell

Researcher and Chief Ethics Scientist at HuggingFace

On the value of carefully measuring data.

Abstract: Just as we evaluate models, we should measure data. Measuring data involves quantifying different aspects of its composition, such as counts of the top-represented domains, or correlations between sensitive identity terms and other concepts. In this talk, I will define the problem of measuring data and unpack how it can be applied to automatically curating distinct training and evaluation datasets for ML models.
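
As a concrete illustration of one measurement mentioned above, the sketch below counts the top-represented domains in a corpus whose documents carry source URLs. The file name, the JSONL layout, and the "url" field are illustrative assumptions and not part of the talk.

```python
# Minimal sketch of one data measurement: counting the top-represented
# domains in a corpus of documents with source URLs.
# The input file name and its "url" field are illustrative assumptions.
import json
from collections import Counter
from urllib.parse import urlparse


def top_domains(jsonl_path: str, k: int = 10) -> list:
    """Return the k most frequent domains across documents."""
    counts = Counter()
    with open(jsonl_path) as f:
        for line in f:
            record = json.loads(line)
            domain = urlparse(record["url"]).netloc.lower()
            counts[domain] += 1
    return counts.most_common(k)


if __name__ == "__main__":
    for domain, count in top_domains("corpus.jsonl"):
        print(f"{domain}\t{count}")
```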

Important Dates

May 31, 2024 (extended from May 17) Paper submission deadline
June 14, 2024 ARR pre-reviewed commitment deadline
June 17, 2024 Notification of acceptance
July 4, 2024 (extended from July 1) Camera ready deadline
August 16, 2024 Workshop day

Call for papers

We welcome paper submissions on all topics related to data contamination, including but not limited to:

  • Definitions, taxonomies and gradings of contamination
  • Contamination detection (both manual and automatic)
  • Community efforts to discover, report and organize contamination events
  • Documentation frameworks for datasets or models
  • Methods to avoid data contamination
  • Methods to forget contaminated data
  • Scaling laws and contamination
  • Memorization and contamination
  • Policies to avoid impact of contamination in publication venues and open source communities
  • Reproducing and attributing results from previous work to data contamination
  • Survey work on data contamination research
  • Data contamination in other modalities

Paper Submission Information

We welcome two types of papers: regular workshop papers and non-archival submissions. Regular workshop papers will be included in the workshop proceedings. All submissions must be in PDF format and made through OpenReview.

  • Regular workshop papers: Authors can submit papers of up to 8 pages, with unlimited pages for references. Authors may separately submit up to 100 MB of supplementary materials, as well as their code for reproducibility. All submissions undergo a double-blind, single-track review. Best Paper Award(s) will be given based on nominations by the reviewers. Accepted papers will be presented as posters, with the possibility of oral presentations.
  • Non-archival submissions: Cross-submissions are welcome. Accepted papers will be presented at the workshop but will not be included in the workshop proceedings. Papers must be in PDF format and will be reviewed in a double-blind fashion by workshop reviewers. We also welcome extended abstracts (up to 2 pages) of papers that are work in progress, under review, or to be submitted to other venues. Papers in this category must follow the ACL format.

In addition to papers submitted directly to the workshop, which will be reviewed by our Programme Committee, we also accept papers reviewed through ACL Rolling Review and committed to the workshop. Please check the relevant dates for each type of submission.


Shared Task: Data Contamination Evidence Collection

In addition to paper contributions, we are organizing a community effort on centralized data contamination evidence collection. While the problem of data contamination is prevalent and serious, the breadth and depth of this contamination are still largely unknown. Concrete evidence of contamination is scattered across papers, blog posts, and social media, and it is suspected that the true scope of data contamination in NLP is significantly larger than reported.

With this shared task, we aim to provide a structured, centralized platform for contamination evidence collection, to help the community understand the extent of the problem and to help researchers avoid repeating the same mistakes. The shared task also gathers evidence of clean, non-contaminated instances. The platform is already available for perusal at the Data Contamination Database.

Compilation Paper

As a companion to the contamination evidence platform, we will produce a paper providing a summary and overview of the evidence collected in the shared task. Participants who contribute to the shared task will be listed as co-authors of the paper, which will be published in the workshop proceedings.

Instructions for Evidence Submission

Each submission should report a case of contamination, or a lack thereof. The submission can be either about (1) contamination in the corpus used to pre-train language models, where the pre-training corpus contains a specific evaluation dataset, or about (2) contamination in a model that shows evidence of having seen a specific evaluation dataset during training. Each submission needs to mention the corpus (or model) and the evaluation dataset, in addition to some evidence of contamination. Alternatively, we also welcome evidence of a lack of contamination.
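
For illustration only, a hypothetical report might record the fields described above. The field names below are ours, not the official schema; the authoritative format is given in the Contribution Guidelines referenced below.

```python
# Hypothetical sketch of the information a single evidence report should carry,
# based on the paragraph above. Field names are illustrative, not the official
# schema; see the Contribution Guidelines for the authoritative format.
evidence_report = {
    "contaminated_source": "pre-training corpus or model name",
    "evaluation_dataset": "name of the evaluation dataset",
    "contaminated": True,  # set to False when reporting a lack of contamination
    "evidence": "description of, or pointer to, the supporting evidence",
}

print(evidence_report)
```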

Reports must be submitted through a Pull Request to the Data Contamination Database space on HuggingFace. The reports must follow the Contribution Guidelines provided in the space and will be reviewed by the organizers. If you have any questions, please contact us at conda-workshop@googlegroups.com or open a discussion in the space itself.

URL with contribution guidelines: Data Contamination Database (“Contribution Guidelines” tab).