AI Models Produce 50 Times More CO2 Emissions, Often With No Benefit

Researchers in Germany have quantified the carbon emissions associated with artificial intelligence (AI), focusing on large language models (LLMs). These models answer user queries by first converting words into numerical tokens, which the model then processes to generate a response. That token processing, along with the rest of the computation involved, carries a substantial carbon footprint of which many users are largely unaware.
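
To make the tokenization step concrete, here is a minimal sketch (not from the study) of how a question becomes numerical tokens before a model processes it; it assumes the Hugging Face `transformers` package and uses the Qwen 2.5 tokenizer purely as an example:

```python
# Minimal sketch: how an LLM converts text into numerical tokens.
# Assumes Hugging Face `transformers`; the Qwen 2.5 tokenizer is used
# only as an example, not because the study published this code.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

question = "Why is the sky blue?"
token_ids = tokenizer.encode(question)

print(token_ids)                       # e.g. [10234, 374, ...]
print(len(token_ids), "input tokens")  # every token processed costs energy
```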

The study, published in Frontiers in Communication, analyzed 14 LLMs with parameter counts ranging from 7 billion to 72 billion, comparing their CO2 emissions on a standardized set of 1,000 questions spanning various subjects. Maximilian Dauner, a researcher at Hochschule München University of Applied Sciences and the study’s first author, stated, “The environmental impact of questioning trained LLMs is strongly determined by their reasoning approach, with explicit reasoning processes significantly driving up energy consumption and carbon emissions.” The findings revealed that reasoning-enabled models can produce up to 50 times more CO2 emissions than their concise-response counterparts.
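
The per-model comparison can be pictured with a short sketch like the one below. It is illustrative only: it assumes the `codecarbon` package as a stand-in measurement tool (the article does not name the tooling used), and `ask_model` is a hypothetical helper that sends one question to a locally hosted LLM.

```python
# Illustrative benchmark loop for estimating emissions per model.
# codecarbon is an assumed stand-in for the measurement tooling;
# ask_model is a hypothetical helper, not part of any named library.
from codecarbon import EmissionsTracker

def benchmark(model_name, questions, ask_model):
    """Run all questions through one model and estimate total emissions."""
    tracker = EmissionsTracker(project_name=model_name)
    tracker.start()
    answers = [ask_model(model_name, q) for q in questions]
    kg_co2 = tracker.stop()  # estimated kg of CO2-equivalent for the run
    return answers, kg_co2
```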

The results showed a stark contrast in token usage: reasoning models generated an average of 543.5 “thinking” tokens per question, versus only 37.7 tokens per question for concise models. The larger token footprint translates directly into higher carbon emissions, yet it does not necessarily yield more accurate answers; the extra detail is often not needed for correctness.
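
As a back-of-the-envelope illustration (my arithmetic, not the paper’s), the token counts alone imply roughly a 14-fold gap per question, before accounting for per-token cost differences between models:

```python
# Back-of-the-envelope: the token gap implied by the study's averages.
# Both counts are taken from the article; everything else is arithmetic.
REASONING_TOKENS_PER_Q = 543.5  # avg "thinking" tokens per question
CONCISE_TOKENS_PER_Q = 37.7     # avg tokens per question, concise models

ratio = REASONING_TOKENS_PER_Q / CONCISE_TOKENS_PER_Q
print(f"~{ratio:.1f}x more tokens per question")  # prints ~14.4x
```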

Among the evaluated models, the reasoning-enabled Cogito model, which has 70 billion parameters, achieved an accuracy of 84.9%, but it also emitted three times more CO2 than models of similar size that provided concise answers. Dauner highlighted the emerging dilemma, stating, “Currently, we see a clear accuracy-sustainability trade-off inherent in LLM technologies.” Notably, none of the models that managed to keep emissions below 500 grams of CO2 equivalent achieved accuracy rates above 80% on the benchmark questions.

The subject matter of the questions also influenced CO2 emissions significantly: questions requiring extensive reasoning, for instance in abstract algebra or philosophy, produced emissions up to six times higher than those for simpler subjects, such as high school history.

The researchers hope their findings will encourage users to make more environmentally conscious decisions about AI usage. Dauner suggested that users could substantially cut emissions by opting for concise answers and reserving high-capacity models for tasks that genuinely require them. For instance, using a model like DeepSeek R1 to answer 600,000 questions could generate CO2 emissions equivalent to a round-trip flight from London to New York, whereas Qwen 2.5 can handle more than three times as many questions at similar accuracy without a corresponding rise in emissions.
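
To put the flight comparison in per-question terms, here is a rough estimate (mine, not the paper’s); the flight figure is a placeholder assumption of roughly one tonne of CO2 per passenger for a round-trip London–New York flight, a number the article itself does not state:

```python
# Rough per-question figure implied by the article's flight comparison.
# FLIGHT_CO2_KG is a placeholder assumption (~1 tonne per passenger,
# round trip London-New York); the article does not state this number.
FLIGHT_CO2_KG = 1_000
QUESTIONS = 600_000  # DeepSeek R1 question count from the article

grams_per_question = FLIGHT_CO2_KG * 1_000 / QUESTIONS
print(f"~{grams_per_question:.1f} g CO2 per question")  # ~1.7 g, under this assumption
```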

While the study presents compelling findings, it also acknowledges limitations: the results depend on the hardware used and on regional variations in the energy mix, which may limit how broadly they apply. Dauner concluded with a call to action, suggesting that greater awareness of the carbon costs of AI-generated output could lead users to engage with these powerful technologies more selectively and thoughtfully.

Reference:

  1. Maximilian Dauner, Gudrun Socher. Energy costs of communicating with AI. Frontiers in Communication, 2025; 10. DOI: 10.3389/fcomm.2025.1572947
