GENIE is a novel foundation model designed to accelerate hypothesis generation for assessing climate risks. Unlike traditional climate models that require months for hypothesis testing, GENIE integrates numerical data (measurements, climate simulations) and text data (research papers, climate reports) to generate scientifically valid hypotheses in a few-shot manner. Key innovations include:
Multi-Modal Learning: Uses Transformer architectures to represent both numerical and textual climate data.
Scientific Validity: Ensures physical consistency using Physics-Guided Deep Learning (PGDL) and Reinforcement Learning with World-Feedback (RLWF).
Uncertainty Quantification: Leverages Bayesian Active Learning to assess prediction confidence and optimize experimental design (see the sketch after this list).
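To make the uncertainty-quantification bullet concrete, the snippet below is a minimal, hypothetical sketch of Bayesian active learning via Monte-Carlo dropout, not GENIE's actual architecture; the surrogate model, feature dimensions, and acquisition budget are illustrative assumptions.

```python
# Minimal sketch (not the GENIE implementation): Monte-Carlo-dropout active
# learning over candidate simulation configurations. Model size, feature count,
# and the acquisition budget are illustrative assumptions.
import torch
import torch.nn as nn


class RiskRegressor(nn.Module):
    """Tiny surrogate mapping a numeric climate-forcing vector to a risk score."""

    def __init__(self, n_features: int = 8, hidden: int = 64, p_drop: float = 0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


@torch.no_grad()
def mc_dropout_uncertainty(model: nn.Module, x: torch.Tensor, n_samples: int = 32) -> torch.Tensor:
    """Predictive std over stochastic forward passes (dropout kept active)."""
    model.train()  # keep dropout on to approximate a posterior over functions
    preds = torch.stack([model(x) for _ in range(n_samples)])  # (n_samples, batch, 1)
    return preds.std(dim=0).squeeze(-1)


def select_next_simulations(model: nn.Module, candidate_configs: torch.Tensor, budget: int = 4) -> torch.Tensor:
    """Pick the candidate configurations the surrogate is least certain about."""
    sigma = mc_dropout_uncertainty(model, candidate_configs)
    return sigma.topk(budget).indices


if __name__ == "__main__":
    model = RiskRegressor()
    candidates = torch.randn(100, 8)  # stand-in for candidate scenario parameters
    print(select_next_simulations(model, candidates, budget=4))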
Climate scientists currently rely on running ensembles of complex climate models to test their hypotheses, a process hindered by several limitations. First, hypothesis generation takes an excessive amount of time, with a typical climate model run requiring six months from conception to analysis. This significantly slows down research and decision-making. Second, ad-hoc parameterization across multiple global and regional climate models introduces inconsistencies, as these models use different parameterizations that are related in unknown ways. There is no single framework capable of addressing a diverse set of climate-related tasks. Lastly, barriers to knowledge sharing prevent policymakers and resource-constrained communities from accessing integrated datasets and tools that can efficiently generate climate risk scenarios.
GENIE transforms climate risk assessment by reducing hypothesis testing turnaround time from six months to one week, allowing for rapid and iterative scientific exploration. The integration of Bayesian active learning further optimizes data acquisition costs, reducing the need for expensive large-scale climate simulations by at least 30%. Beyond climate science, GENIE has broad applications in forecasting, causal inference, and scenario creation, making it a versatile tool for researchers and policymakers. Additionally, the model has direct implications for national security, assisting the Department of Defense in areas such as strategic military planning, operational preparedness, and infrastructure resilience against climate-induced threats. Faster, more accurate climate risk assessments enabled by GENIE can drive more effective policy interventions, ultimately mitigating economic losses and protecting human lives.
Abstract: The use of Large Language Models (LLMs) in climate science has recently gained significant attention. However, a critical issue remains: the lack of a comprehensive evaluation framework capable of assessing the quality and scientific validity of model outputs. To address this issue, we develop ClimaGen (Climate QA Generator), an adaptive learning framework that generates question-answer pairs from graduate textbooks with climate scientists in the loop. As a result, we present ClimaQA-Gold, an expert-annotated benchmark dataset, alongside ClimaQA-Silver, a large-scale, comprehensive synthetic QA dataset for climate science. Finally, we develop evaluation strategies and compare different LLMs on our benchmarks. Our results offer novel insights into various approaches for enhancing the climate knowledge of LLMs.
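The sketch below shows how a ClimaQA-style multiple-choice split might be scored; the field names, example item, and `answer_question` stub are assumptions for illustration, not the released ClimaQA schema or evaluation code.

```python
# Minimal sketch of scoring a model on a ClimaQA-style multiple-choice split.
# Field names, the toy item, and answer_question() are assumptions, not the
# released ClimaQA schema or evaluation code.
from typing import Callable


def exact_match_accuracy(items: list[dict], answer_question: Callable[[str, list[str]], str]) -> float:
    """Fraction of questions where the model picks the annotated option."""
    correct = 0
    for item in items:
        prediction = answer_question(item["question"], item["options"])
        correct += int(prediction.strip().upper() == item["answer"])
    return correct / max(len(items), 1)


if __name__ == "__main__":
    toy_items = [  # stand-in for expert-annotated MCQ pairs
        {"question": "Which gas dominates the anthropogenic greenhouse effect?",
         "options": ["A) CO2", "B) Ne", "C) Ar", "D) He"], "answer": "A"},
    ]
    always_a = lambda question, options: "A"  # placeholder for an actual LLM call
    print(f"accuracy = {exact_match_accuracy(toy_items, always_a):.2f}")
```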
Abstract: Large Language Models (LLMs) demonstrate promising capabilities in solving simple scientific problems but, even with domain-specific fine-tuning, often produce hallucinations for complex ones. While integrating LLMs with tools can mitigate this reliability issue, models fine-tuned only for tool usage often over-rely on them, incurring unnecessary costs from resource-intensive scientific tools even for simpler problems. Inspired by how human experts assess the complexity of a problem before choosing a solution, we propose a novel two-component fine-tuning method, Adapting While Learning (AWL). In the first component, World Knowledge Learning (WKL), LLMs internalize scientific knowledge by learning from tool-generated solutions. In the second component, Tool Usage Adaptation (TUA), we classify questions as easy or hard based on the WKL-trained model's accuracy, and train it to maintain direct reasoning for simple problems while switching to tools for challenging ones. We validate our method on 6 scientific benchmark datasets in climate science, epidemiology, and mathematics. Compared to the base 8B model, our trained models achieve 28.27% higher answer accuracy and 13.76% better tool usage accuracy, even surpassing state-of-the-art models including GPT-4o and Claude-3.5 on 4 custom-created datasets.
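The easy/hard routing step in TUA can be illustrated with a short sketch; the snippet below partitions questions by how often a WKL-trained model answers them correctly, mirroring the abstract's description. The `sample_answer` stub, sample count, and threshold are assumptions, not the paper's implementation.

```python
# Minimal sketch (assumptions throughout): split questions into "easy" and
# "hard" by the empirical accuracy of a WKL-trained model, as in the TUA
# routing idea. sample_answer() and the threshold are illustrative.
import random
from typing import Callable


def split_by_accuracy(questions: list[dict],
                      sample_answer: Callable[[str], str],
                      n_samples: int = 8,
                      threshold: float = 0.5):
    """Label a question 'easy' if the model answers it correctly often enough."""
    easy, hard = [], []
    for q in questions:
        hits = sum(sample_answer(q["question"]) == q["answer"] for _ in range(n_samples))
        (easy if hits / n_samples >= threshold else hard).append(q)
    return easy, hard  # easy -> direct-reasoning targets, hard -> tool-call targets


if __name__ == "__main__":
    bank = [{"question": "2+2?", "answer": "4"},
            {"question": "ENSO phase in 2031?", "answer": "unknown"}]
    noisy_model = lambda q: "4" if "2+2" in q and random.random() > 0.2 else "?"
    easy, hard = split_by_accuracy(bank, noisy_model)
    print(len(easy), "easy /", len(hard), "hard")
```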
Abstract: Accurate uncertainty quantification of large language models (LLMs) provides a credibility measure over their outputs. However, fine-tuned LLMs often struggle with overconfidence in uncertain predictions due to limitations in the models' ability to generalize from limited data. Existing parameter-efficient fine-tuning (PEFT) uncertainty quantification methods for LLMs focus on the post-fine-tuning stage and fall short of calibrating epistemic uncertainty. To address these limitations, we propose Functional-Level Uncertainty Quantification for Calibrated Fine-Tuning (UQ4CT), which captures and calibrates epistemic uncertainty over the space of functions that map input prompts to outputs. We implement UQ4CT during the fine-tuning stage via a mixture-of-experts framework that hierarchically decomposes the functional space. We demonstrate that UQ4CT reduces Expected Calibration Error (ECE) by more than 25% while maintaining high accuracy across 5 benchmarks. Even under distribution shift, UQ4CT maintains superior ECE performance with high accuracy, showcasing improved generalizability.
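For reference, the snippet below sketches the standard binned Expected Calibration Error metric that the abstract reports improvements on; this is the generic definition, not the UQ4CT mixture-of-experts machinery, and the inputs (confidences, correctness flags, bin count) are illustrative.

```python
# Minimal sketch of the standard binned Expected Calibration Error (ECE):
# ECE = sum_b (|B_b|/N) * |acc(B_b) - conf(B_b)| over equal-width confidence bins.
# Generic metric only; not the UQ4CT framework itself.
import numpy as np


def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Weighted average gap between accuracy and mean confidence per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, n = 0.0, len(confidences)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += (mask.sum() / n) * gap
    return float(ece)


if __name__ == "__main__":
    conf = np.array([0.9, 0.8, 0.6, 0.95])  # model confidences
    hits = np.array([1, 1, 0, 0])           # whether each answer was correct
    print(f"ECE = {expected_calibration_error(conf, hits):.3f}")
```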
Coming Soon