Introduction
Scaling has been a key factor driving progress in AI. Models are growing in parameter count and being trained on ever-larger datasets, leading to exponential growth in training compute and dramatic increases in performance. For example, five years and four orders of magnitude of compute separate the barely coherent GPT-2 from the powerful GPT-4.
So far, AI developers have not faced major limits to scaling beyond simply procuring AI chips, which are scarce but rapidly growing in supply. If chips are the only bottleneck, then AI systems are likely to continue growing exponentially in compute and expanding the frontier of capabilities. As such, a key question in forecasting AI progress is whether inputs other than raw compute could become binding constraints.
In particular, scaling requires growing training datasets. The most powerful AI systems to date are language models that are primarily trained on trillions of words of human-generated text from the internet. However, there is only a finite amount of human-generated data out there, which raises the question of whether training data could become the main bottleneck to scaling.
In our new paper, we attempt to shed light on this question by estimating the stock of human-generated public text data, updating our 2022 analysis of this topic.
Results
We find that the total effective stock of human-generated public text data is on the order of 300 trillion tokens, with a 90% confidence interval of 100T to 1000T. This estimate includes only data that is sufficiently high-quality to be used for training, and accounts for the possibility of training models for multiple epochs.
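To make the structure of such an estimate concrete, here is a minimal Monte Carlo sketch that combines an uncertain raw web stock, a quality-filtering fraction, and an epoch multiplier into a median and 90% interval. All of the distributions and numbers below are illustrative assumptions chosen only to land in the same ballpark as the headline figures; they are not the paper's actual inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Illustrative assumptions (NOT the paper's inputs): lognormal uncertainty over
# the raw stock of public web text, the fraction that survives quality
# filtering, and the effective multiplier from training for multiple epochs.
raw_web_tokens   = rng.lognormal(mean=np.log(400e12), sigma=0.5, size=n)  # ~400T median (assumed)
quality_fraction = rng.uniform(0.1, 0.3, size=n)                          # share kept after filtering (assumed)
epoch_multiplier = rng.uniform(2.0, 5.0, size=n)                          # value of repeated epochs (assumed)

effective_stock = raw_web_tokens * quality_fraction * epoch_multiplier
lo, med, hi = np.percentile(effective_stock, [5, 50, 95])
print(f"median ~ {med:.2e} tokens, 90% CI ~ [{lo:.2e}, {hi:.2e}]")
```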
We report estimates of the stocks of several types of data in Figure 1.
Given our estimate of the data stock, we then forecast when this data would be fully utilized. We develop two models of dataset growth. One simply extrapolates the historical growth rate in dataset sizes, and the other accounts for our projection of training compute growth and derives the corresponding dataset size (more below). Our overall projection, shown in Figure 2, comes from combining these two models. Our 80% confidence interval is that the data stock will be fully utilized at some point between 2026 and 2032.
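As a rough sketch of how the two models work, the snippet below extrapolates dataset sizes directly at an assumed growth rate, and separately derives them from an assumed compute trajectory via the common approximations C ≈ 6ND and D ≈ 20N, then reports the year each projection first reaches the stock. The starting points and growth rates are illustrative assumptions, not the paper's fitted values.

```python
import numpy as np

# Sketch of the two projection approaches; all parameter values here are
# illustrative assumptions, not the paper's fitted numbers.
STOCK = 300e12            # assumed effective stock of tokens (median estimate)

# Model 1: extrapolate historical growth in training-dataset sizes.
d0, year0 = 15e12, 2024   # a frontier dataset of ~15T tokens in 2024 (assumption)
data_growth = 2.5         # assumed multiplicative growth per year

# Model 2: derive dataset size from projected training compute, assuming
# compute-optimal scaling (C ~ 6*N*D with D ~ 20*N, so D ~ sqrt(20*C/6)).
c0 = 5e25                 # assumed frontier training compute in year0 (FLOP)
compute_growth = 4.0      # assumed growth in training compute per year

def year_stock_fully_used(dataset_size):
    """Return the first year in which the projected dataset size reaches the stock."""
    for t in range(15):
        if dataset_size(t) >= STOCK:
            return year0 + t
    return None

print(year_stock_fully_used(lambda t: d0 * data_growth**t))                       # model 1
print(year_stock_fully_used(lambda t: np.sqrt(20 * c0 * compute_growth**t / 6)))  # model 2
```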
However, the exact point in time at which this data would be fully utilized depends on how models are scaled. If models are trained compute-optimally, there is enough data to train a model with 5e28 floating-point operations (FLOP), a level we expect to be reached in 2028 (see Figure 2). But recent models, like Llama 3, are often “overtrained” with fewer parameters and more data so that they are more compute-efficient during inference.
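As a rough consistency check, using the standard approximation C ≈ 6ND for training compute together with a Chinchilla-style ratio of about 20 tokens per parameter (assumptions on our part rather than the paper's exact fit), a compute-optimal run at 5e28 FLOP would call for roughly

$$
D_{\text{opt}} \approx \sqrt{\frac{20\,C}{6}} = \sqrt{\frac{20 \times 5\times 10^{28}}{6}} \approx 4\times 10^{14}\ \text{tokens},
$$

about 400T tokens, which sits within the estimated 100T to 1000T stock.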
We develop a simplified model of revenues and costs, and solve it to find the scaling policy that would maximize AI developers' profits. Depending on how demand for AI inference changes with model performance, we find it might make sense to overtrain models by up to 100x.
We then make projections of future dataset sizes using these profit-maximizing policies, as shown in Figure 3. If models are overtrained by a modest factor of 5x, the stock of data will be fully used by 2027, but if they are overtrained by 100x, the stock of data will be fully used by 2025. For context, Llama 3-70B was overtrained by 10x.
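To see how the overtraining factor maps onto data requirements, the sketch below assumes the factor is defined relative to a compute-optimal ratio of about 20 tokens per parameter (consistent with the 10x figure for Llama 3-70B, which used roughly 15T tokens for 70B parameters) and uses the approximation C ≈ 6ND. Under those assumptions, data use at a fixed compute budget grows with the square root of the overtraining factor.

```python
import math

def tokens_needed(compute_flop, overtrain_factor=1.0):
    """Tokens used when training at a given compute budget with an
    overtraining factor k, i.e. a tokens-per-parameter ratio of 20*k.
    Uses the approximation C ~ 6*N*D, so D = sqrt(20*k*C/6)."""
    return math.sqrt(20 * overtrain_factor * compute_flop / 6)

C = 5e28  # illustrative compute budget (FLOP)
for k in (1, 5, 10, 100):
    print(f"overtraining {k:>3}x -> ~{tokens_needed(C, k):.1e} tokens")
# Data use scales as sqrt(k): a 100x-overtrained run needs 10x the tokens of
# a compute-optimal run at the same compute, so it exhausts the stock sooner.
```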
Comparison with previous estimates
Our 2022 paper predicted that high-quality text data would be fully used by 2024, whereas our new results indicate that might not happen until 2028. This discrepancy is due to a difference in methodology and the incorporation of recent findings that have altered our understanding of data quality and model training.
In our previous work, we modeled high-quality data as consisting of roughly an even mix of scraped web data and human-curated corpora such as published scientific papers or books. This produced an estimate of about 10 trillion tokens of high-quality data. However, later results indicated that web data can outperform curated corpora if it is filtered carefully (Penedo et al., 2023). Since web data is much more abundant than manually curated data, this led to a 5x increase in our estimate of the stock of high-quality data.
Another recent finding that challenged our old assumptions is that models can be trained for several epochs on the same data without significant degradation (Muennighoff et al., 2023). In other words, the same dataset can be reused multiple times during training, effectively increasing the amount of data available to the model. This further expanded our estimate of the effective stock by a factor of 2-5x, contributing to the revised projection of when the data stock might be fully utilized.
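A small sketch of the kind of diminishing-returns behavior this implies: each additional epoch adds less effective data than the last, so a handful of epochs can plausibly multiply the effective stock by a few times, while many more epochs add little. The functional form below is only in the spirit of Muennighoff et al.'s fit, and the decay constant is an illustrative assumption, not their fitted value.

```python
import math

def effective_tokens(unique_tokens, epochs, r_star=15.0):
    """Effective data from repeating a dataset: repeated epochs add value
    that decays exponentially. r_star is an illustrative constant, not the
    fitted value from Muennighoff et al. (2023)."""
    repetitions = epochs - 1
    return unique_tokens * (1 + r_star * (1 - math.exp(-repetitions / r_star)))

U = 100e12  # 100T unique tokens (illustrative)
for e in (1, 2, 4, 8, 16):
    print(f"{e:>2} epochs -> effective ~ {effective_tokens(U, e) / U:.1f}x the unique data")
```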
The combination of these findings, namely the effectiveness of carefully filtered web data and the ability to train models for multiple epochs, has led to a substantial increase in our estimate of the available high-quality data stock. This, in turn, has pushed back our projected date for when this data might be fully utilized by language models.
Discussion
There are many types of data beyond human-generated, publicly available text, including images and video, private data such as instant messaging conversations, and AI-generated synthetic data. We chose to focus on public human text data for three reasons:
- Text is the main modality used to train frontier models and the one most likely to become a key bottleneck, since other modalities are either easier to generate (in the case of images and video) or have not demonstrated their usefulness for training LLMs (for example, in the case of astronomical or other scientific data).
- AI-generated synthetic data is not yet well understood, and has only been shown to reliably improve capabilities in relatively narrow domains like math and coding.
- Non-public data, like instant messages, seems unlikely to be used at scale due to legal issues, and because it is fragmented over several platforms controlled by actors with competing interests.
Even if models come to be trained on all available public human text data, this would not necessarily lead to a complete halt of progress in model capabilities. One way in which progress can continue is if models grow in terms of parameters while keeping their dataset size constant, a strategy called “undertraining”. While undertraining can deliver the equivalent of up to two additional orders of magnitude of compute-optimal scaling (see Figure 4), it is expected to eventually lead to a plateau.
But ultimately, new innovations will be required to maintain progress beyond 2030. While we cannot predict which innovations will succeed, we identify three categories that seem especially relevant: synthetic data, learning from other modalities of data, and data efficiency improvements. As data becomes more scarce relative to compute, the amount of resources dedicated to developing these and other techniques will increase substantially. Given this increased level of investment, we expect such efforts to succeed.
For more information, read the full paper on arXiv: https://arxiv.org/abs/2211.04325.