A Missing Layer in Physical AI

There's a gap between frontier AI and the physical world. Models trained on internet text have no grounded knowledge of specific places, real-time conditions, or bioregional patterns. Filling this gap requires new infrastructure, not just better models. And while government sensors are the most accessible starting point, the same fragmentation exists across private sector data: agriculture, buildings, supply chains, anywhere the physical world gets measured.

The Gap

You can scrape a weather page. You can ask an LLM to search for current air quality and get a reasonable answer. For a one-off question, it works fine.

But try to build something real on that foundation: a flood alert system, a farm irrigation tool, an air quality monitor for a school district. You hit a wall.

Scraped data has no schema. No guarantees about freshness. No way to know if the source changed format overnight. No method to combine NOAA stream gauges with EPA air sensors with USGS groundwater readings in a single query.

The raw sensor data exists. NOAA tracks stream gauges. The EPA monitors air quality. USGS measures groundwater. But every team that needs this data (programmatically, reliably, at scale, in combination) rebuilds the same brittle pipeline from scratch.

The gap isn't access. It's infrastructure.

Why This Matters Now

Physical world intelligence is no longer hypothetical. Tesla, Waymo, Archetype AI, and DeepMind are all building systems that don't just recognize patterns but simulate futures, taking in streams from video, depth, radar, and sensor signals to predict how the world will evolve. NVIDIA's Earth-2 platform already uses AI to generate kilometer-scale weather forecasts, trained on years of NOAA climate data.

But here's what this convergence reveals: the infrastructure layer is privatized and rebuilt from scratch by every company that needs it.

Tesla built its own sensor fusion stack. Waymo built its own. NVIDIA built its own pipeline to ingest and normalize government weather data. Each solution is proprietary, expensive, and inaccessible to anyone outside those organizations.

Meanwhile, the frontier models that hundreds of millions of people actually use, such as Claude, GPT, and Gemini, remain entirely disconnected from the physical world. They can discuss climate science abstractly, but can't tell you if the creek behind your house is rising.

This matters because physical world intelligence is coming regardless. Autonomous robotics is arriving. Climate adaptation requires local environmental awareness. The question is whether this intelligence will be built on shared infrastructure or remain locked inside a handful of well-funded companies.

I've been thinking about this gap for years. Working with FieldKit.org reinforced a simple premise: cheaper sensors at higher density can be as valuable as fewer, higher-accuracy sensors. But density alone doesn't help if the data stays siloed. We need infrastructure that makes sensor data accessible not just to researchers, but to the models and products that shape how people understand the world around them.

All of the emphasis on global climate averages doesn't feel actionable. We need to understand how we're affecting our localized environments, our bioregions, to prevent them from collapsing. That requires connecting physical world data to the systems people actually use.

Why Infrastructure, Not Just Better Models

Couldn't the next generation of models simply be trained on NOAA data, EPA feeds, and USGS archives? Wouldn't that close the gap?

No. And the reason is fundamental to how these data types work.

Training can't solve real-time. A model trained on historical sensor data learns patterns, but the training is frozen at a point in time. If the Sacramento River crested last night, a model trained six months ago doesn't know. Real-time conditions require live connections, not larger training corpora.

Sensor data isn't text. Frontier models are trained predominantly on internet text and images. Sensor streams are continuous numerical feeds with timestamps, coordinates, units, and quality flags, and they don't fit this paradigm cleanly. They require normalization, interpolation, and contextual grounding that training alone doesn't provide.
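
To make that concrete, here is a minimal sketch of what a single normalized sensor observation could look like. The field names and example values are illustrative assumptions, not any agency's actual schema.

    from dataclasses import dataclass
    from datetime import datetime, timezone

    @dataclass(frozen=True)
    class SensorReading:
        """One observation from one sensor, in a shared normalized form (illustrative)."""
        station_id: str      # source-specific identifier, e.g. a USGS site number
        parameter: str       # what was measured, e.g. "discharge" or "pm2_5"
        value: float         # the numeric observation
        unit: str            # canonical unit, e.g. "m3/s" or "ug/m3"
        timestamp: datetime  # always stored in UTC
        latitude: float
        longitude: float
        quality_flag: str    # e.g. "provisional", "validated", "estimated"

    reading = SensorReading(
        station_id="usgs-00000000",  # hypothetical site number
        parameter="discharge",
        value=12.7,
        unit="m3/s",
        timestamp=datetime(2025, 3, 14, 6, 30, tzinfo=timezone.utc),
        latitude=38.58,
        longitude=-121.51,
        quality_flag="provisional",
    )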

The problem is plumbing, not parameters. Even if you fine-tuned a model on every public sensor dataset available, you'd still need infrastructure to:

  • Authenticate with dozens of different government APIs
  • Handle different update frequencies (real-time, hourly, daily, archival)
  • Normalize incompatible data formats and units
  • Manage rate limits, outages, and data quality issues
  • Cross-reference sensors across agencies (is this USGS gauge measuring the same watershed as that NOAA station?)

This is middleware work. It's not glamorous, but without it, even the most capable models remain disconnected from the physical systems they're supposed to help us understand.
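
A rough sketch of what that middleware might look like if organized as one adapter per upstream source. The class and method names below are hypothetical design choices, not an existing library.

    from abc import ABC, abstractmethod
    from datetime import datetime

    class SourceAdapter(ABC):
        """Owns one agency's quirks: credentials, cadence, rate limits, payload format."""

        update_interval_s: int = 3600  # how often the upstream source publishes new data

        @abstractmethod
        def authenticate(self) -> None:
            """Acquire whatever credentials this source's API requires."""

        @abstractmethod
        def fetch_raw(self, station_id: str, start: datetime, end: datetime) -> dict:
            """Call the upstream API, honoring its rate limits and surviving outages."""

        @abstractmethod
        def normalize(self, raw: dict) -> list[dict]:
            """Map the source-specific payload onto a shared record shape."""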

NVIDIA understood this when building Earth-2. Their CorrDiff model doesn't just predict weather; it sits on top of years of pipeline work to ingest, clean, and structure NOAA data. That pipeline is the substrate. The model is what runs on top.

The gap between frontier AI and the physical world won't be closed by making models smarter. It will be closed by building the infrastructure that enables models to access the physical world.

What Exists Today

Government agencies have been collecting environmental data for decades. The infrastructure exists; it's just fragmented in ways that make it nearly unusable for modern applications.

The Four Data Modalities

Sensor data from public sources generally falls into four categories, each requiring different infrastructure:

Real-time streaming: Continuous flows from sensors, processed as they arrive. Some NOAA weather stations stream temperature and wind data continuously. This data is dynamic and unbounded, useful for immediate decisions, but challenging to store and query.

Near-real-time batch: Sensors collect continuously but transmit at intervals (every 15 minutes, hourly, daily). Many USGS stream gauges work this way, measuring water levels constantly but reporting periodically. Good enough for most applications, but the lag matters for flood response or irrigation timing.

Static archival: Historical datasets updated infrequently or not at all. SoilGrids provides global soil property maps derived from years of field measurements and satellite imagery. Essential for baseline understanding but disconnected from current conditions.

Event-triggered: Data generated only when conditions are met, such as flood alerts, wildfire detections, and air quality warnings. Real-time when active, but sporadic and unpredictable.

Each modality requires different handling: streaming needs persistent connections, batch needs scheduled polling, archival needs efficient storage, and event-triggered needs webhook infrastructure. Building for one doesn't solve the others.
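
As a sketch of how an ingestion layer might keep those four modalities distinct, here is an illustrative mapping from modality to handling strategy; the names are assumptions, not a standard taxonomy.

    from enum import Enum, auto

    class Modality(Enum):
        STREAMING = auto()        # continuous feeds, processed as they arrive
        NEAR_REAL_TIME = auto()   # collected continuously, transmitted at intervals
        ARCHIVAL = auto()         # historical datasets, updated rarely or never
        EVENT_TRIGGERED = auto()  # alerts and detections, sporadic and unpredictable

    def ingestion_strategy(modality: Modality) -> str:
        """Each modality needs different plumbing (illustrative, not exhaustive)."""
        return {
            Modality.STREAMING: "hold a persistent connection, buffer and downsample",
            Modality.NEAR_REAL_TIME: "poll on the source's schedule, dedupe on timestamp",
            Modality.ARCHIVAL: "bulk-load once, store in a format built for range queries",
            Modality.EVENT_TRIGGERED: "register a webhook, fan alerts out on arrival",
        }[modality]

    print(ingestion_strategy(Modality.NEAR_REAL_TIME))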

The Fragmentation Problem

The deeper issue is that these data sources were never designed to work together.

NOAA, EPA, USGS, and NASA each maintain their own APIs, authentication systems, data formats, and update schedules. The data is collected for scientific research, optimized for rigor and reproducibility rather than accessibility or interoperability.

The private sector is no better. John Deere's field sensors don't talk to Climate Corp's. Building HVAC systems speak BACnet, energy meters speak Modbus, occupancy sensors speak something proprietary. Supply chain visibility requires stitching together temperature loggers, GPS trackers, and warehouse sensors from different vendors. The pattern repeats everywhere physical world data gets collected: fragmented protocols, siloed platforms, custom integration work.

How would you answer: What's the environmental status of this watershed right now?

Answering that requires:

  • Stream levels from USGS (REST API, JSON, 15-minute intervals)
  • Air quality from EPA's AirNow (different REST API, different JSON schema, hourly)
  • Precipitation from NOAA (yet another API, GeoJSON, varies by station)
  • Soil moisture from NASA's SMAP (satellite data, HDF5 files, 2-3 day latency)

Four agencies. Four authentication flows. Four data formats. Four update frequencies. Four sets of documentation. And none of them use the same identifiers for location, so correlating "this stream gauge" with "this air quality monitor" requires manual geospatial matching.
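
That last point is easy to underestimate. Because no shared location identifiers exist across agencies, pairing a stream gauge with the nearest air quality monitor comes down to distance math you write yourself. A minimal sketch, with hypothetical coordinates:

    from math import radians, sin, cos, asin, sqrt

    def haversine_km(lat1, lon1, lat2, lon2):
        """Great-circle distance in kilometers between two points."""
        dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
        a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
        return 2 * 6371.0 * asin(sqrt(a))

    gauge = (38.58, -121.51)  # hypothetical USGS gauge location
    monitors = {              # hypothetical EPA air quality monitor locations
        "monitor-a": (38.61, -121.49),
        "monitor-b": (39.10, -120.95),
    }

    nearest = min(monitors, key=lambda m: haversine_km(*gauge, *monitors[m]))
    print(nearest, round(haversine_km(*gauge, *monitors[nearest]), 1), "km away")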

This is why the data stays siloed. The barrier isn't access; most of it is public. The barrier is the engineering cost of making it usable.

The Interconnection Problem

Even when developers do the work of integrating multiple sources, they face a deeper challenge: the data doesn't speak the same language.

Units differ (Celsius vs. Fahrenheit, meters vs. feet, mg/m³ vs. AQI). Timestamps use different formats and timezones. Quality flags mean different things across agencies. Sensor metadata (calibration dates, measurement methods, known issues) is inconsistently documented.

Cleaning and normalizing this data is where most of the work happens. Not fetching it. Not storing it. Making it comparable across sources so that a soil moisture reading from USGS can be meaningfully correlated with precipitation data from NOAA and satellite imagery from NASA.

This cleaning layer is invisible but essential. Without it, you have data. With it, you have something models can actually reason about.
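
A small sketch of what that normalization work looks like in practice, assuming the substrate standardizes on Celsius, meters, and UTC (those canonical choices are assumptions for illustration):

    from datetime import datetime, timezone, timedelta

    def to_celsius(value: float, unit: str) -> float:
        """Normalize temperature readings to a single canonical unit."""
        return (value - 32.0) * 5.0 / 9.0 if unit in ("F", "degF") else value

    def to_meters(value: float, unit: str) -> float:
        """Normalize gauge heights reported in feet to meters."""
        return value * 0.3048 if unit in ("ft", "feet") else value

    def to_utc(local_iso: str, utc_offset_hours: int) -> datetime:
        """Align timestamps reported in local time onto UTC."""
        naive = datetime.fromisoformat(local_iso)
        tz = timezone(timedelta(hours=utc_offset_hours))
        return naive.replace(tzinfo=tz).astimezone(timezone.utc)

    print(to_celsius(68.0, "degF"))                       # 20.0
    print(round(to_meters(12.5, "ft"), 3))                # 3.81
    print(to_utc("2025-03-14T06:30:00", -8).isoformat())  # 2025-03-14T14:30:00+00:00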

Who Bridges This Gap Today

Right now, the gap gets bridged in three ways, all of them inefficient:

Big tech builds internal pipelines. NVIDIA's Earth-2 team spent years building infrastructure to ingest and normalize NOAA data for their weather models. That pipeline is proprietary. Google, Microsoft, and IBM have similar internal systems for their climate and agriculture products.

Startups rebuild from scratch. Every climate tech company, every precision agriculture startup, every environmental monitoring service rebuilds the same data pipeline. They solve the same auth problems, the same format conversions, the same quality issues. Then they maintain it themselves, indefinitely.

Researchers write custom scripts. Academics pull data for specific studies using whatever tools they know, often Python scripts that break when APIs change. The code rarely gets shared or maintained. Each new study starts over.

No shared infrastructure layer exists. The plumbing gets rebuilt every time someone needs it.

What's Needed

A Substrate, Not a Model

A substrate is infrastructure that other things build on. AWS is a substrate for web applications: developers don't provision their own servers. GPS is a substrate for navigation: apps don't launch their own satellites. Stripe is a substrate for payments: businesses don't build their own payment processing.

What's missing is a substrate for physical world intelligence: a normalized interface to sensor networks that models, applications, and researchers can access without rebuilding the pipeline every time.

This means:

  • Universal adapters for major data sources, starting with government feeds (NOAA, EPA, USGS, NASA) but extensible to private sensor networks, IoT platforms, and citizen science repositories
  • Normalized schemas so data from different agencies can be compared directly
  • Cleaning and quality layers that handle unit conversion, timestamp alignment, and data validation
  • Real-time and historical access through the same interface
  • Bioregional organization so queries can be scoped to meaningful ecological boundaries, not just lat/long boxes

The goal isn't to replace the agencies or duplicate their data. It's to create an interstitial layer: infrastructure that sits between raw government feeds and the applications that need them.
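
One small piece of that layer, sketched: a registry that maps source-specific parameter codes onto a shared vocabulary and canonical units. The codes and mappings shown are illustrative examples, not a complete or authoritative list.

    # Illustrative registry: resolve (source, source-specific code) to a shared name and unit.
    CANONICAL_PARAMETERS = {
        ("usgs", "00060"): ("streamflow", "m3/s"),
        ("usgs", "00065"): ("gage_height", "m"),
        ("epa_airnow", "PM2.5"): ("pm2_5", "ug/m3"),
        ("noaa", "PRCP"): ("precipitation", "mm"),
    }

    def canonical(source: str, code: str) -> tuple[str, str]:
        """Resolve a source-specific code to its shared name and unit, if known."""
        return CANONICAL_PARAMETERS.get((source, code), ("unknown", "unknown"))

    print(canonical("usgs", "00060"))  # ('streamflow', 'm3/s')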

Why Bioregional Scale

Bioregions are areas defined by watersheds, climate patterns, and species distributions rather than political boundaries. The Shasta bioregion in Northern California, for example, is bounded by mountain ranges and river systems, not county lines.

This matters for infrastructure because ecosystems operate at a bioregional scale. A drought in one part of a watershed affects stream levels downstream. Air quality in one valley is shaped by weather patterns across the region. Soil health in one area correlates with vegetation patterns miles away.

The Global Tipping Points Report, led by Professor Tim Lenton at the University of Exeter's Global Systems Institute, identified over 25 parts of the Earth system with critical thresholds that could shift irreversibly if pushed too far. Five major tipping systems are already at risk at current warming levels: the Greenland and West Antarctic ice sheets, permafrost regions, warm-water coral reefs, and the Atlantic Meridional Overturning Circulation.

Detecting early warning signals for these transitions requires correlating sensor data across bioregions in near real-time, exactly the kind of cross-source integration that fragmented APIs make nearly impossible.

Organizing sensor access around bioregions isn't just conceptually elegant. It's how the queries actually need to work.

What Becomes Possible

At planetary scale: The Amazon

Climate researchers studying Amazon dieback need to correlate satellite imagery (deforestation patterns), ground-based humidity sensors (microclimate changes), river gauges (water table shifts), and fire detection systems (burn frequency). Today, that's months of data cleaning before analysis can begin, pulling from NASA Earthdata, Brazilian national agencies, and scattered research stations, each with different formats and access protocols.

With a sensor substrate, that collapses to days. Researchers query a bioregion, specify the sensor types and time range, and get normalized data ready for analysis. The cleaning layer handles the integration. The scientists focus on science.
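
What such a query might look like as code; the client function, its parameters, and the bioregion identifier are all hypothetical, sketched only to show the intended shape of the interface (no such substrate exists today).

    def query_bioregion(bioregion: str, sensor_types: list[str], start: str, end: str) -> list[dict]:
        """Return normalized readings for a bioregion, sensor types, and time range (stubbed)."""
        # A real implementation would fan out to the relevant source adapters,
        # normalize units and timestamps, and merge results; stubbed here so the sketch runs.
        return []

    readings = query_bioregion(
        bioregion="amazon-madeira-basin",  # hypothetical bioregion identifier
        sensor_types=["humidity", "river_stage", "fire_detection"],
        start="2015-01-01",
        end="2024-12-31",
    )
    print(len(readings), "normalized readings")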

This matters because the Amazon is approaching a tipping point. Research published in Nature suggests that between 10% and 47% of Amazonian forests will be exposed to compounding disturbances by 2050 that could trigger ecosystem transitions. Detecting early warning signals (the subtle correlations between humidity, fire frequency, and vegetation health that precede large-scale dieback) requires exactly the kind of cross-source sensor integration that current infrastructure doesn't support.

At local scale: A college campus

A facilities manager at a university wants to reduce water usage across campus. The irrigation system runs on fixed schedules, ignoring actual conditions. Optimizing it requires knowing:

  • Current soil moisture (from ground sensors or satellite estimates)
  • Recent precipitation (from local weather stations)
  • Upcoming forecast (temperature, rain probability)
  • Historical usage patterns (which zones dry out fastest)

Today, this data exists but lives in separate systems: the campus weather station, the regional NOAA feed, the state water board, the irrigation controller's internal logs. Integrating them requires custom engineering that most facilities teams can't afford.

With a substrate that normalizes environmental data and exposes it through a simple interface, the same facilities manager could connect their irrigation system to real conditions. Water when the soil is dry. Skip watering when rain is coming. Adjust schedules based on what's actually happening, not what the calendar says.
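
A minimal sketch of that decision logic, assuming the substrate exposes current soil moisture, recent rainfall, and the short-term forecast; the thresholds are illustrative, not agronomic recommendations.

    def should_irrigate(soil_moisture_pct: float, rain_prob_24h: float, recent_rain_mm: float) -> bool:
        """Decide whether to run a zone tonight, based on live conditions."""
        if recent_rain_mm >= 5.0:        # measurable rain in the last 24 hours: skip
            return False
        if rain_prob_24h >= 0.6:         # rain is likely soon: let the weather do the work
            return False
        return soil_moisture_pct < 25.0  # only water zones that are actually dry

    print(should_irrigate(soil_moisture_pct=18.0, rain_prob_24h=0.1, recent_rain_mm=0.0))  # True
    print(should_irrigate(soil_moisture_pct=18.0, rain_prob_24h=0.8, recent_rain_mm=0.0))  # False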

Scale this to every campus, every municipal park system, every golf course, every farm, and the aggregate water savings become significant. But it only happens if the data is accessible without a six-month engineering project.

The Cleaning Layer is the Product

What makes this infrastructure valuable isn't fetching data; that's relatively straightforward. It's the cleaning and normalization that happens between raw feeds and usable outputs.

When a USGS stream gauge reports "discharge: 450 cfs" and a NOAA station reports "precipitation: 0.8 inches over 24 hours," those numbers mean nothing to each other without context. What's the catchment area? What's the soil saturation? What's the lag time between rainfall and stream response in this particular watershed? (Thank you to FieldKit for showing me why this matters.)

The cleaning layer encodes that knowledge. It aligns timestamps. Converts units. Flags anomalies. Interpolates gaps. Cross-references sensors to validate readings. This is the work that turns raw data into something a model can reason about, and it's the work that currently gets done ad hoc and inconsistently by everyone who needs it.
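
Two of those operations, sketched in isolation to show the kind of work involved; a real pipeline would be more careful about gap length, sensor-specific thresholds, and flag provenance.

    def interpolate_gaps(series: list[float | None]) -> list[float]:
        """Fill short gaps (None) using the nearest readings on either side."""
        filled = list(series)
        for i, v in enumerate(filled):
            if v is None:
                prev = next((filled[j] for j in range(i - 1, -1, -1) if filled[j] is not None), None)
                nxt = next((filled[j] for j in range(i + 1, len(filled)) if filled[j] is not None), None)
                filled[i] = prev if nxt is None else nxt if prev is None else (prev + nxt) / 2
        return filled

    def flag_anomalies(series: list[float], max_jump: float) -> list[bool]:
        """Flag readings that jump implausibly fast relative to the previous value."""
        return [i > 0 and abs(series[i] - series[i - 1]) > max_jump for i in range(len(series))]

    levels = interpolate_gaps([1.2, 1.3, None, 1.5, 9.8, 1.6])
    print(levels)                                # gap filled with 1.4
    print(flag_anomalies(levels, max_jump=2.0))  # the spike to 9.8 and the drop back are flagged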

Building this layer once, correctly, and making it available as infrastructure is how the gap between frontier AI and the physical world actually closes.

The Opportunity

Physical AI is arriving faster than the infrastructure to support it.

Autonomous vehicles need to understand not just their cameras and lidar but the environmental context around them: road conditions shaped by recent weather, visibility affected by air quality, hazards created by flooding or fire. Warehouse robots operating at scale need to account for humidity and temperature that affect the materials they handle. Agricultural drones need soil moisture, pest pressure, and microclimate data to make useful recommendations.

These systems will get built regardless. The question is whether they'll each rebuild the same data pipeline from scratch (proprietary, siloed, inaccessible) or whether they'll build on shared infrastructure that makes physical world data as accessible as web APIs made internet data.

Government data is the starting point: public, high-quality, and immediately useful. But the adapter pattern scales beyond it. The same infrastructure that normalizes NOAA streams can normalize agricultural IoT sensors, building management systems, and supply chain telemetry. The substrate isn't just for public data. It's for any data about the physical world.

The substrate we build now determines which future we get.

Interstitial is building the missing layer: universal adapters for government sensor networks, normalized schemas for cross-source comparison, and the cleaning infrastructure that makes raw feeds usable. Not a model. Not an application. The plumbing that lets models and applications connect to the physical world.

The physical world generates more data every day than we could ever synthesize. The challenge isn't creating more data; it's making the data that already exists accessible, standardized, and useful for the intelligence systems that are supposed to help us understand and interact with the world around us. Think of it as building a substrate for physical world intelligence.
