Attention Is All You Need (2017.06)
#NLP Transformer
CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data (2019.11)
Scaling Laws for Neural Language Models (2020.01)
Language Models are Few-Shot Learners (2020.05)
The Pile: An 800GB Dataset of Diverse Text for Language Modeling (2020.12)
Deep Learning on a Data Diet: Finding Important Examples Early in Training (2021.07)
Beyond neural scaling laws: beating power law scaling via data pruning (2022.06)
#CV SSL prototype
SemDeDup: Data-efficient learning at web-scale through semantic deduplication (2023.04)
Textbooks Are All You Need (2023.06)
#NLP phi-1
D4: Improving LLM Pretraining via Document De-Duplication and Diversification (2023.08)
#NLP SemDeDup + SSL prototype