The 1st Workshop on Data Contamination (CONDA)
Workshop@ACL 2024
Evaluation data has been compromised!
A workshop on detecting, preventing, and addressing data contamination.
Background & Scope
Data contamination, where evaluation data is inadvertently included in the pre-training corpora of large-scale models, and language models (LMs) in particular, has become a growing concern (Sainz et al., 2023; Jacovi et al., 2023). The growing scale of both models and data, coupled with massive web crawling, has led to the inclusion of segments from evaluation benchmarks in the pre-training data of LMs (Dodge et al., 2021; OpenAI, 2023; Google, 2023; Elazar et al., 2023). The scale of internet data makes it difficult to prevent this contamination from happening, or even to detect when it has happened (Bommasani et al., 2022; Mitchell et al., 2023). Crucially, when evaluation data becomes part of pre-training data, it introduces biases and can artificially inflate the performance of LMs on specific tasks or benchmarks (Magar and Schwartz, 2022). This poses a challenge for fair and unbiased evaluation of models, as their performance may not accurately reflect their generalization capabilities.
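As a minimal illustration (not a method endorsed by the workshop), automatic contamination checks often measure overlap of long n-grams between an evaluation example and the training corpus; the function names and thresholds below are hypothetical choices for the sketch:

```python
# Sketch of n-gram-overlap contamination detection, assuming whitespace
# tokenization and an in-memory corpus; real pipelines work over
# deduplicated, indexed web-scale corpora.

def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(eval_text, corpus_text, n=8):
    """Fraction of the example's n-grams that also occur in the corpus.

    A high ratio suggests the example may have leaked into training
    data; both n and any flagging threshold are illustrative.
    """
    eval_grams = ngrams(eval_text.lower().split(), n)
    if not eval_grams:
        return 0.0
    corpus_grams = ngrams(corpus_text.lower().split(), n)
    return len(eval_grams & corpus_grams) / len(eval_grams)

corpus = "the quick brown fox jumps over the lazy dog near the river bank"
example = "the quick brown fox jumps over the lazy dog"
print(overlap_ratio(example, corpus, n=5))  # prints 1.0: full overlap
```

Exact n-gram matching is only one point in the design space; it misses paraphrased or reformatted leaks, which is one reason definitions of contamination remain contested.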
Although a growing number of papers and state-of-the-art models mention issues of data contamination (Brown et al., 2020; Wei et al., 2022; Chowdhery et al., 2022; OpenAI, 2023; Google, 2023; Touvron et al., 2023), there is no agreed-upon definition or standard methodology for ensuring that a model does not report results on contaminated benchmarks. Addressing data contamination is a shared responsibility among researchers, developers, and the broader community. By adopting best practices, increasing transparency, documenting vulnerabilities, and conducting thorough evaluations, we can work towards minimizing the impact of data contamination and ensuring fair and reliable evaluations.
Invited speakers
Anna Rogers
Associate Professor at IT University of Copenhagen
Abstract: TBA
Jesse Dodge
Research Scientist at Allen Institute for AI
Abstract: TBA
Dieuwke Hupkes
Research Scientist at Meta
Evaluation data contamination: how much is there, and how much does it actually matter?
Abstract: With many of the current "SOTA" LLMs being closed-source and their training data inaccessible, more and more questions arise about potential contamination of the evaluation datasets used to claim their results. Various claims can be found online, ranging from suspicions of outright training on evaluation data to inflate results, to suggestions that the definitions of contamination in use may be inadequate and underestimate its impact. However, even with access to the training corpus, contamination and its impact are far from trivial to assess. In this talk, I discuss common ways of measuring contamination and provide empirical data on how much they impact results for a range of LLMs.
Margaret Mitchell
Researcher and Chief Ethics Scientist at HuggingFace
On the value of carefully measuring data.
Abstract: TBA
Important Dates
| May 17, 2024 | Paper submission deadline |
| --- | --- |
| TBA | ARR pre-reviewed commitment deadline |
| June 17, 2024 | Notification of acceptance |
| July 1, 2024 | Camera-ready deadline |
| August 16, 2024 | Workshop day |
Call for papers
We welcome paper submissions on all topics related to data contamination, including but not limited to:
- Definitions, taxonomies and gradings of contamination
- Contamination detection (both manual and automatic)
- Community efforts to discover, report and organize contamination events
- Documentation frameworks for datasets or models
- Methods to avoid data contamination
- Methods to forget contaminated data
- Scaling laws and contamination
- Memorization and contamination
- Policies to avoid impact of contamination in publication venues and open source communities
- Reproducing and attributing results from previous work to data contamination
- Survey work on data contamination research
- Data contamination in other modalities
Paper Submission Information
We welcome two types of papers: regular workshop papers and non-archival submissions. Regular workshop papers will be included in the workshop proceedings. All submissions must be in PDF format and made through OpenReview.
- Regular workshop papers: Authors can submit papers of up to 8 pages, with unlimited pages for references. Authors may separately submit up to 100 MB of supplementary materials, as well as their code for reproducibility. All submissions undergo a double-blind, single-track review. Best Paper Award(s) will be given based on nominations by the reviewers. Accepted papers will be presented as posters, with the possibility of oral presentations.
- Non-archival submissions: Cross-submissions are welcome. Accepted papers will be presented at the workshop, but will not be included in the workshop proceedings. Papers must be in PDF format and will be reviewed in a double-blind fashion by workshop reviewers. We also welcome extended abstracts (up to 2 pages) of papers that are work in progress, under review, or to be submitted to other venues. Papers in this category must follow the ACL format.
In addition to papers submitted directly to the workshop, which will be reviewed by our Programme Committee, we also accept papers reviewed through ACL Rolling Review and committed to the workshop. Please check the relevant dates for each type of submission.