Thursday Sep 26, 2024

arXiv preprint - Programming Every Example: Lifting Pre-training Data Quality like Experts at Scale

In this episode, we discuss Programming Every Example: Lifting Pre-training Data Quality like Experts at Scale by Fan Zhou, Zengzhi Wang, Qian Liu, Junlong Li, and Pengfei Liu. The paper introduces Programming Every Example (PROX), a framework that enables small language models to refine pre-training corpora by executing fine-grained operations on individual examples, an approach that outperforms traditional human-crafted rules. Experimental results show that models trained on PROX-curated data score more than 2% higher across various benchmarks than models trained on data chosen by other data selection methods. PROX also significantly improves domain-specific continual pre-training and reduces training FLOPs. The authors have open-sourced their data and models for further research.


