EMMA: Extracting Multiple physical parameters from Multimodal Data

Gallery

Real footage, simulated from recovered parameters.

For each system, the left panel is the input video; the right panel is a forward simulation driven by the physical parameters EMMA recovered from that video.

Pendulum: L = 0.86 m, τ = 0.045

Free fall: g = 9.8 m/s²

Torricelli drainage: k = 0.0132 √m/s

Sliding block: α = 24.7°, μ = 0.205

LED decay: γ = 0.91

Differential-drive rover: 5 parameters

Quadrotor: 7 parameters
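To make "a forward simulation driven by recovered parameters" concrete, here is a minimal sketch for the sliding block, using the α and μ values listed in the gallery. It assumes the textbook Coulomb-friction model a = g (sin α − μ cos α), a block released from rest, g = 9.8 m/s² (the free-fall value shown in the gallery), and a 30 fps frame rate chosen purely for illustration. The function names are hypothetical and this is not EMMA's own simulator.

```python
import math

def sliding_block_accel(alpha_deg, mu, g=9.8):
    """Acceleration along the incline for a block sliding down with kinetic friction."""
    alpha = math.radians(alpha_deg)
    return g * (math.sin(alpha) - mu * math.cos(alpha))

def simulate_sliding_block(alpha_deg, mu, duration, fps=30, g=9.8):
    """Distance travelled along the incline at each frame time, starting from rest."""
    a = sliding_block_accel(alpha_deg, mu, g)
    times = [i / fps for i in range(int(duration * fps) + 1)]
    return [(t, 0.5 * a * t * t) for t in times]

# Values from the gallery: alpha = 24.7 degrees, mu = 0.205.
accel = sliding_block_accel(24.7, 0.205)
trajectory = simulate_sliding_block(24.7, 0.205, duration=1.0)
print(f"a = {accel:.2f} m/s^2, distance after 1 s = {trajectory[-1][1]:.2f} m")
```

Overlaying positions like these on the input frames is one simple way to judge by eye whether a recovered parameter pair is plausible.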

Why EMMA

Give EMMA multimodal data (video, images). It figures out the system.

Point a camera at the system, or hand EMMA a published plot. Provide its equation of motion. EMMA recovers the dynamical parameters, the hidden actuation, and the coordinate frame, all inside one continuous-time model. When audio is available, the same pipeline fuses it with video to resolve states that video alone cannot see.

You give EMMA

A video or a chart, plus the equation of motion.

Video of the system (visual input): raw frames, no masks, no templates
Chart images (visual input): an alternative to video, for systems known only through published plots
Governing ODE form (required): the parametric structure of dx/dt = f(x, u; θ) and the parameters you want recovered
Audio (optional, multimodal): paired with video when states or actuation are hidden from the camera
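As an illustration of what "the parametric structure of dx/dt = f(x, u; θ)" could look like in code, the sketch below writes down the standard differential-drive kinematic model, with wheel speeds as the input u(t). Everything here is an assumption made for illustration: the `OdeSpec` container, the function names, the two parameters (`wheel_radius`, `track_width`), and their numeric values. The five rover parameters EMMA actually recovers are not listed on this page, and EMMA's real input interface may differ.

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class OdeSpec:
    """Hypothetical container for a user-supplied equation of motion."""
    state_names: Sequence[str]
    param_names: Sequence[str]   # the entries of theta to recover
    rhs: Callable                # f(x, u, theta) -> dx/dt

def diff_drive_rhs(x, u, theta):
    """Differential-drive kinematics: x = (px, py, heading), u = (w_left, w_right)."""
    _, _, heading = x
    w_left, w_right = u
    r, b = theta["wheel_radius"], theta["track_width"]
    v = 0.5 * r * (w_right + w_left)      # forward speed
    omega = r * (w_right - w_left) / b    # yaw rate
    return (v * math.cos(heading), v * math.sin(heading), omega)

rover = OdeSpec(
    state_names=("px", "py", "heading"),
    param_names=("wheel_radius", "track_width"),
    rhs=diff_drive_rhs,
)

def rollout(spec, x0, u_of_t, theta, dt, steps):
    """Forward-Euler rollout of dx/dt = f(x, u(t); theta)."""
    x, t, path = tuple(x0), 0.0, [tuple(x0)]
    for _ in range(steps):
        dx = spec.rhs(x, u_of_t(t), theta)
        x = tuple(xi + dt * dxi for xi, dxi in zip(x, dx))
        t += dt
        path.append(x)
    return path

theta = {"wheel_radius": 0.05, "track_width": 0.30}   # illustrative values only
straight = rollout(rover, (0.0, 0.0, 0.0), lambda t: (4.0, 4.0), theta, dt=0.01, steps=100)
spin = rollout(rover, (0.0, 0.0, 0.0), lambda t: (-2.0, 2.0), theta, dt=0.01, steps=100)
```

Equal wheel speeds drive the rover in a straight line and opposite speeds spin it in place, which is a quick sanity check that the structure of f is written correctly before any parameters are fitted.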

EMMA returns

A calibrated, physics-consistent digital twin.

Explicit parameters θ: mass, length, damping, thrust and torque coefficients, ...
Implicit dynamics: drag, friction, and other unmeasured physical effects
Hidden actuation u(t): latent forcing signals such as wheel speed or rotor rate
Calibration invariants: coordinate frame, camera alignment, initial state

Abstract

One model, every identifiable parameter.

We introduce EMMA, a physics-informed multimodal framework that recovers all identifiable dynamical parameters of a system directly from raw video, audio, and image-based time-series observations. Unlike prior video-only approaches, which struggle with occluded states and hidden actuation inputs and assume known initial conditions and coordinate frames, EMMA jointly infers explicit parameters, implicit dynamical components, and calibration invariants within a unified continuous-time model.

Across 100+ scenarios including five standard dynamical benchmarks (75 Delfys videos), real-world rover and quadrotor systems with hidden inputs, and simulation-chart case studies spanning biological and chaotic systems, EMMA delivers robust multi-parameter recovery and significantly outperforms existing single-modality and equation-discovery baselines.

The Physics

EMMA needs your equation of motion.

EMMA is not data-driven equation discovery. The user supplies the parametric structure of the governing ODE; EMMA solves the inverse problem of recovering its parameters, along with any latent forcing and invariants, from multimodal observations.

$$ \frac{d\mathbf{x}(t)}{dt} \;=\; f\!\left(\mathbf{x}(t),\,\mathbf{u}(t);\,\boldsymbol{\theta}\right) $$

For a quadrotor, $\mathbf{x}(t)$ is the drone's state (position and velocity), $\mathbf{u}(t)$ is an external input the camera can't see (wind or rotor thrust), and $\boldsymbol{\theta}$ is what EMMA recovers (mass, drag, torque).

EMMA closes the loop with a physics-informed loss: simulate forward using the recovered $\boldsymbol{\theta}$, compare to the observed trajectory, and backpropagate through the ODE solver until the two agree.
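The simulate, compare, and update loop can be sketched on the simplest gallery system, Torricelli drainage, written as dh/dt = −k√h (the form implied by the √m/s unit of k). This is a toy stand-in rather than EMMA's training code. The gradient is obtained by carrying the sensitivity dh/dk through each step of a fixed-step Euler solver, which is a forward-mode substitute for backpropagating through the solver. The initial height, step size, learning rate, and starting guess are all hypothetical, and the "observations" are synthetic heights generated at the gallery value k = 0.0132.

```python
import math

def simulate_with_sensitivity(k, h0, dt, steps):
    """Euler rollout of dh/dt = -k*sqrt(h), carrying s = dh/dk through every step."""
    h, s = h0, 0.0
    heights, sens = [h], [s]
    for _ in range(steps):
        root = math.sqrt(max(h, 1e-12))   # guard against an emptied container
        h_next = h - dt * k * root
        # Derivative of the Euler update with respect to k (chain rule through h).
        s_next = s - dt * (root + k * s / (2.0 * root))
        h, s = max(h_next, 0.0), s_next
        heights.append(h)
        sens.append(s)
    return heights, sens

def fit_k(observed, h0, dt, k_init, lr=0.005, iters=200):
    """Gradient descent on the mean squared gap between simulated and observed heights."""
    k = k_init
    steps = len(observed) - 1
    for _ in range(iters):
        heights, sens = simulate_with_sensitivity(k, h0, dt, steps)
        grad = sum(2.0 * (h - y) * s
                   for h, y, s in zip(heights, observed, sens)) / len(observed)
        k -= lr * grad
    return k

K_TRUE, H0, DT, STEPS = 0.0132, 0.30, 0.1, 400   # H0, DT, STEPS are illustrative
observed, _ = simulate_with_sensitivity(K_TRUE, H0, DT, STEPS)   # synthetic observations
k_hat = fit_k(observed, H0, DT, k_init=0.02)
print(f"recovered k = {k_hat:.4f}")
```

In a setting with many parameters and a neural network in the loop, reverse-mode automatic differentiation would normally replace the hand-written sensitivity, but the structure of the loop stays the same.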

Demo

Rover, in the lab.

Parameter recovery from a differential-drive rover using synchronized video and audio. The demo shows the rover scenario only; pendulum, quadrotor, and simulation-chart results are covered in the paper.

EMMA · rover demo

Method

Three stages, one continuous-time model.

Sense. Learn. Verify.

EMMA follows a three-step pipeline from raw data to verified parameters. Video or chart images are the visual input; audio extends the same pipeline into the multimodal setting.

Step 01 · Sense

Video · audio · charts

Raw modalities are converted into time-aligned signals through modality-specific pipelines.

Step 02 · Learn

LTC neural network

A Liquid Time-Constant network models the system's latent dynamics in continuous time.

Step 03 · Verify

Physics simulation & loss

A differentiable ODE solver simulates the recovered parameters and checks them against the observations.
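For the Sense step, "time-aligned signals" can be pictured as streams sampled at different rates and then resampled onto one shared clock. The sketch below does this with plain linear interpolation over the interval where both streams overlap. The stream contents, the 30 fps and 100 Hz rates, and the function names are hypothetical, and EMMA's modality-specific pipelines may align signals differently.

```python
from bisect import bisect_right

def resample(times, values, grid):
    """Linearly interpolate (times, values) onto grid; times must be increasing."""
    out = []
    for t in grid:
        if t <= times[0]:
            out.append(values[0])
        elif t >= times[-1]:
            out.append(values[-1])
        else:
            j = bisect_right(times, t)   # times[j-1] <= t < times[j]
            w = (t - times[j - 1]) / (times[j] - times[j - 1])
            out.append((1.0 - w) * values[j - 1] + w * values[j])
    return out

def align(video, audio, rate_hz):
    """Resample two (times, values) streams onto one uniform grid over their overlap."""
    start = max(video[0][0], audio[0][0])
    stop = min(video[0][-1], audio[0][-1])
    n = int((stop - start) * rate_hz) + 1
    grid = [start + i / rate_hz for i in range(n)]
    return grid, resample(*video, grid), resample(*audio, grid)

# Hypothetical streams: a 30 fps tracked position and a 100 Hz audio feature.
video_t = [i / 30 for i in range(61)]           # 0 s to 2 s
video_x = [2.0 * t for t in video_t]
audio_t = [0.5 + i / 100 for i in range(101)]   # 0.5 s to 1.5 s
audio_a = [1.0 - t for t in audio_t]

grid, x_on_grid, a_on_grid = align((video_t, video_x), (audio_t, audio_a), rate_hz=50)
```

Restricting the grid to the overlap avoids extrapolating either stream beyond the samples that were actually recorded.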

**Figure 1. EMMA architecture.** Multimodal inputs (video, audio, image) are processed through modality-specific pipelines into unified temporal representations. An LTC-NN learns latent dynamics `h(t)` and predicts physical parameters `θ`. A differentiable physics solver validates predictions, enabling end-to-end gradient flow.
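For readers unfamiliar with Liquid Time-Constant networks, the sketch below shows a single scalar unit following the commonly published LTC state equation dh/dt = −(1/τ + f) h + f A, advanced with the fused semi-implicit Euler step usually paired with it. It is a generic illustration, not EMMA's network: a real LTC layer is vector-valued with a learned gate f, and the weights, time constant, and input signal here are arbitrary assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def ltc_step(h, inp, dt, tau, A, w_in, w_rec, bias):
    """One fused semi-implicit Euler step of dh/dt = -(1/tau + f) * h + f * A."""
    f = sigmoid(w_in * inp + w_rec * h + bias)   # input- and state-dependent gate
    return (h + dt * f * A) / (1.0 + dt * (1.0 / tau + f))

def run_ltc(inputs, dt=0.05, tau=1.0, A=1.0, w_in=2.0, w_rec=-1.0, bias=0.0):
    """Drive one LTC unit with a sequence of inputs and return its state trace."""
    h, trace = 0.0, []
    for inp in inputs:
        h = ltc_step(h, inp, dt, tau, A, w_in, w_rec, bias)
        trace.append(h)
    return trace

signal = [math.sin(0.1 * n) for n in range(200)]   # hypothetical time-aligned input
trace = run_ltc(signal)
```

Because the gate f multiplies the state, the effective time constant changes with the input, which is what "liquid" refers to. With a sigmoid gate and A = 1, the state also stays between 0 and 1.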

Evaluation

Across benchmarks, hardware, and simulation charts.

EMMA is evaluated on five canonical video benchmarks, real-world platforms with hidden forcing, and chart-based simulation studies spanning biological and chaotic regimes.

8.8% ± 1.7% mean error: rover (5 parameters)

100+ scenarios: Delfys benchmark, rover, drone, and simulation charts

More than 10× lower implicit RMSE than PySINDy

107× smaller model than the Delfys baseline

EMMA is compared against PAIG, NIRPI, and Delfys on the video benchmarks and against PySINDy on the chart-based simulations. The rover and quadrotor results under hidden forcing are reported as first-of-their-kind. Full tables, ablations, and per-system comparisons are in the paper.

Supported systems

Delfys Benchmark

Pendulum
Torricelli drainage
Sliding block
LED decay
Free fall

Real-world platforms

Differential-drive rover
6-DoF quadrotor
Hidden audio forcing

Simulation charts

Lotka-Volterra
Chaotic Lorenz
F8 Crusader
HIV therapy
AID (Type-1 diabetes)

Citation

Cite this work.

@InProceedings{Shaikh_2026_CVPR,
  author    = {Shaikh, Farhat and Banerjee, Ayan and Gupta, Sandeep},
  title     = {EMMA: Extracting Multiple physical parameters from Multimodal Data},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision
               and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2026},
  pages     = {1716-1725}
}