Will We Run Out of Data? Limits of LLM Scaling Based on Human-Generated Data


Introduction

Scaling has been a key factor driving progress in AI. Models are growing in parameter count and being trained on increasingly large datasets, leading to exponential growth in training compute and dramatic increases in performance. For example, five years and four orders of magnitude of compute separate the barely coherent GPT-2 from the powerful GPT-4.

So far, AI developers have not faced major limits to scaling beyond simply procuring AI chips, which are scarce but rapidly growing in supply. If chips are the only bottleneck, then AI systems are likely to continue growing exponentially in compute and expanding the frontier of capabilities. As such, a key question in forecasting AI progress is whether inputs other than raw compute could become binding constraints.

In particular, scaling requires growing training datasets. The most powerful AI systems to date are language models that are primarily trained on trillions of words of human-generated text from the internet. However, there is only a finite amount of human-generated data out there, which raises the question of whether training data could become the main bottleneck to scaling.

In our new paper, we attempt to shed light on this question by estimating the stock of human-generated public text data, updating our 2022 analysis of this topic.

Results

We find that the total effective stock of human-generated public text data is on the order of 300 trillion tokens, with a 90% confidence interval of 100T to 1000T. This estimate includes only data that is sufficiently high-quality to be used for training, and accounts for the possibility of training models for multiple epochs.
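To make the shape of such an estimate concrete, here is a minimal Monte Carlo sketch of how a raw token count, a quality-filtering fraction, and a multiple-epoch multiplier could be combined into a distribution over the effective stock. All numeric inputs below are hypothetical placeholders chosen for illustration; they are not the paper's actual inputs or methodology.

```python
import numpy as np

# Illustrative sketch only: combine (raw token stock) x (quality-filtering fraction)
# x (multiple-epoch multiplier) under lognormal uncertainty to get a distribution
# over the effective stock of training data. All numbers are hypothetical placeholders.

rng = np.random.default_rng(0)
n = 100_000

raw_tokens = rng.lognormal(mean=np.log(500e12), sigma=0.5, size=n)      # hypothetical raw public text, in tokens
quality_fraction = rng.lognormal(mean=np.log(0.3), sigma=0.3, size=n)   # hypothetical share usable for training
epoch_multiplier = rng.lognormal(mean=np.log(2.0), sigma=0.3, size=n)   # hypothetical gain from repeating data

effective_stock = raw_tokens * np.clip(quality_fraction, 0, 1) * epoch_multiplier

lo, median, hi = np.percentile(effective_stock, [5, 50, 95])
print(f"median ~{median/1e12:.0f}T tokens, 90% interval ~[{lo/1e12:.0f}T, {hi/1e12:.0f}T]")
```

With these placeholder inputs the simulation produces a median of a few hundred trillion tokens and an interval spanning roughly an order of magnitude, which is the general shape of the estimate reported above.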

We report estimates of the stocks of several types of data in Figure 1.
