CVPR 2026

EMMA: Extracting Multiple physical parameters from Multimodal Data

Recover the physics of a system from raw video or image, even when the hidden dynamics are unknown.

  1. Farhat Shaikh
  2. Ayan Banerjee
  3. Sandeep K. S. Gupta

IMPACT Lab, School of Computing & Augmented Intelligence · Arizona State University



Why EMMA

Give EMMA multimodal data (video or images). EMMA figures out the system.

Point a camera at the system, or hand EMMA a published plot. Provide its equation of motion. EMMA recovers the dynamical parameters, the hidden actuation, and the coordinate frame, all inside one continuous-time model. When audio is available, the same pipeline fuses it with video to resolve states that video alone cannot see.

You give EMMA

A video or a chart, plus the equation of motion.

  • Video of the system (visual input)
    raw frames, no masks, no templates
  • Chart images (visual input)
    an alternative to video, for systems known only through published plots
  • Governing ODE form (required)
    the parametric structure of dx/dt = f(x, u; θ) and the parameters you want recovered
  • Audio (optional, multimodal)
    paired with video when states or actuation are hidden from the camera

EMMA returns

A calibrated, physics-consistent digital twin.

  • Explicit parameters θ
    mass, length, damping, thrust and torque coefficients, ...
  • Implicit dynamics
    drag, friction, and other unmeasured physical effects
  • Hidden actuation u(t)
    latent forcing signals such as wheel speed or rotor rate
  • Calibration invariants
    coordinate frame, camera alignment, initial state

Abstract

One model, every identifiable parameter.

We introduce EMMA, a physics-informed multimodal framework that recovers all identifiable dynamical parameters of a system directly from raw video, audio, and image-based time-series observations. Prior video-only approaches struggle with occluded states and hidden actuation inputs, and rely on assumptions of known initial conditions and coordinate frames. EMMA instead jointly infers explicit parameters, implicit dynamical components, and calibration invariants within a unified continuous-time model.

Across 100+ scenarios including five standard dynamical benchmarks (75 Delfys videos), real-world rover and quadrotor systems with hidden inputs, and simulation-chart case studies spanning biological and chaotic systems, EMMA delivers robust multi-parameter recovery and significantly outperforms existing single-modality and equation-discovery baselines.


The Physics

EMMA needs your equation of motion.

EMMA is not data-driven equation discovery. The user supplies the parametric structure of the governing ODE; EMMA solves the inverse problem of recovering its parameters, along with any latent forcing and invariants, from multimodal observations.

$$ \frac{d\mathbf{x}(t)}{dt} \;=\; f\!\left(\mathbf{x}(t),\,\mathbf{u}(t);\,\boldsymbol{\theta}\right) $$
For a quadrotor, $\mathbf{x}(t)$ is the drone's state (position and velocity), $\mathbf{u}(t)$ is an external input the camera can't see (wind or rotor thrust), and $\boldsymbol{\theta}$ is what EMMA recovers (mass, drag, torque).
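
As a rough illustration of what "supplying the parametric structure" can look like, the sketch below writes f(x, u; θ) for a damped, torque-driven pendulum. The pendulum model, the names `PendulumParams`, `pendulum_rhs`, and `euler_step`, and the numeric values are all assumptions chosen for illustration; they are not EMMA's actual input interface.

```python
import math
from typing import NamedTuple, Tuple

State = Tuple[float, float]  # (angle [rad], angular velocity [rad/s])


class PendulumParams(NamedTuple):
    """Illustrative theta: the parameters a user might ask to have recovered."""
    mass: float     # kg
    length: float   # m
    damping: float  # N*m*s/rad, viscous damping at the pivot


def pendulum_rhs(x: State, u: float, theta: PendulumParams, g: float = 9.81) -> State:
    """Evaluate f(x, u; theta) for a damped pendulum driven by a torque u [N*m]."""
    angle, omega = x
    inertia = theta.mass * theta.length ** 2
    alpha = (-(g / theta.length) * math.sin(angle)
             - (theta.damping / inertia) * omega
             + u / inertia)
    return (omega, alpha)


def euler_step(x: State, u: float, theta: PendulumParams, dt: float) -> State:
    """Advance the state by one explicit Euler step of size dt."""
    dx = pendulum_rhs(x, u, theta)
    return (x[0] + dt * dx[0], x[1] + dt * dx[1])


# Hypothetical values: start horizontal and at rest, with no applied torque.
theta = PendulumParams(mass=1.0, length=1.0, damping=0.1)
x_next = euler_step((math.pi / 2, 0.0), u=0.0, theta=theta, dt=0.01)
```

Here the structure of the equation is fixed by the user, and only the entries of `theta` (and, where relevant, the input `u`) are left to be inferred.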

EMMA closes the loop with a physics-informed loss: simulate forward using the recovered $\boldsymbol{\theta}$, compare to the observed trajectory, and backpropagate through the ODE solver until the two agree.
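
The simulate, compare, update loop can be sketched on a toy problem. The example below assumes a first-order decay model dx/dt = -k·x and recovers k from a synthetic trajectory. To stay dependency-free it estimates the gradient with central finite differences, which stands in for backpropagation through the ODE solver; the function names and hyperparameters are illustrative, not EMMA's.

```python
from typing import List, Sequence


def simulate(k: float, x0: float, dt: float, steps: int) -> List[float]:
    """Integrate dx/dt = -k * x with classical RK4 and return the trajectory."""
    xs = [x0]
    x = x0
    for _ in range(steps):
        k1 = -k * x
        k2 = -k * (x + 0.5 * dt * k1)
        k3 = -k * (x + 0.5 * dt * k2)
        k4 = -k * (x + dt * k3)
        x = x + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
        xs.append(x)
    return xs


def trajectory_loss(k: float, observed: Sequence[float], x0: float, dt: float) -> float:
    """Mean squared error between the simulated and observed trajectories."""
    sim = simulate(k, x0, dt, len(observed) - 1)
    return sum((s - o) ** 2 for s, o in zip(sim, observed)) / len(observed)


def fit_decay_rate(observed: Sequence[float], x0: float, dt: float,
                   k_init: float = 0.1, lr: float = 0.5,
                   iters: int = 500, eps: float = 1e-6) -> float:
    """Gradient descent on k; the gradient is a finite-difference estimate."""
    k = k_init
    for _ in range(iters):
        grad = (trajectory_loss(k + eps, observed, x0, dt)
                - trajectory_loss(k - eps, observed, x0, dt)) / (2.0 * eps)
        k -= lr * grad
    return k


# Synthetic "observation" generated with a known rate so recovery can be checked.
true_k = 0.8
observed = simulate(true_k, x0=1.0, dt=0.05, steps=100)
k_hat = fit_decay_rate(observed, x0=1.0, dt=0.05)
```

With an automatic-differentiation framework, the finite-difference line would be replaced by a gradient taken through `simulate` itself, which is the setting the paragraph above describes.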


Demo

Rover, in the lab.

Parameter recovery from a differential-drive rover using synchronized video and audio. The demo shows the rover scenario only; pendulum, quadrotor, and simulation-chart results are covered in the paper.

EMMA · rover demo

Method

Three stages, one continuous-time model.

Sense. Learn. Verify.

EMMA follows a three-step pipeline from raw data to verified parameters. Video or chart images are the visual input; audio extends the same pipeline into the multimodal setting.

  1. Step 01 · Sense

    Video · audio · charts

    Raw modalities are converted into time-aligned signals through modality-specific pipelines.

  2. Step 02 · Learn

    LTC neural network

    A Liquid Time-Constant network models the system's latent dynamics in continuous time.

  3. Step 03 · Verify

    Physics simulation & loss

    A differentiable ODE solver simulates the recovered parameters and checks them against the observations.
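
For the Sense step, "time-aligned signals" can be pictured as resampling each modality onto one shared time grid. The sketch below uses plain linear interpolation; the page does not describe EMMA's actual alignment procedure, so the approach, the function names, and the sample values are assumptions for illustration.

```python
from bisect import bisect_right
from typing import List, Sequence


def common_grid(start: float, stop: float, dt: float) -> List[float]:
    """Uniform time grid from start to stop (inclusive) with spacing dt."""
    n = int(round((stop - start) / dt))
    return [start + i * dt for i in range(n + 1)]


def resample_linear(times: Sequence[float], values: Sequence[float],
                    grid: Sequence[float]) -> List[float]:
    """Linearly interpolate (times, values) onto grid; clamp outside the range.

    `times` must be sorted in strictly increasing order.
    """
    out = []
    for t in grid:
        if t <= times[0]:
            out.append(values[0])
        elif t >= times[-1]:
            out.append(values[-1])
        else:
            j = bisect_right(times, t)  # times[j - 1] <= t < times[j]
            t0, t1 = times[j - 1], times[j]
            w = (t - t0) / (t1 - t0)
            out.append((1.0 - w) * values[j - 1] + w * values[j])
    return out


# Hypothetical streams: a tracked position from video and a loudness envelope
# from audio, sampled at different instants.
video_t, video_x = [0.0, 1.0, 2.0], [0.0, 10.0, 20.0]
audio_t, audio_a = [0.0, 0.5, 1.0, 1.5, 2.0], [1.0, 2.0, 3.0, 2.0, 1.0]

grid = common_grid(0.0, 2.0, 0.5)
aligned = list(zip(resample_linear(video_t, video_x, grid),
                   resample_linear(audio_t, audio_a, grid)))
```

Each entry of `aligned` then pairs one video-derived value with one audio-derived value at the same timestamp, which is the form a shared downstream model can consume.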

Figure 1. EMMA architecture. Multimodal inputs (video, audio, image) are processed through modality-specific pipelines into unified temporal representations. An LTC-NN learns latent dynamics h(t) and predicts physical parameters θ. A differentiable physics solver validates predictions, enabling end-to-end gradient flow.
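
To make the latent dynamics h(t) concrete, the sketch below implements one update of a Liquid Time-Constant layer in its commonly published form, dh/dt = -(1/τ + f)·h + f·A, advanced with the fused semi-implicit Euler step h ← (h + Δt·f·A) / (1 + Δt·(1/τ + f)). The page does not specify EMMA's LTC configuration, so the sigmoid gate, the weight names, and the sizes here are assumptions.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def ltc_step(h: Sequence[float], x: Sequence[float],
             w_in: Sequence[Sequence[float]], w_rec: Sequence[Sequence[float]],
             bias: Sequence[float], tau: Sequence[float], A: Sequence[float],
             dt: float) -> List[float]:
    """One fused (semi-implicit Euler) update of an LTC layer.

    h: hidden state, x: input at this time step,
    tau: per-unit base time constants, A: per-unit bias/reversal values.
    """
    new_h = []
    for i in range(len(h)):
        pre = bias[i]
        pre += sum(w_in[i][j] * x[j] for j in range(len(x)))
        pre += sum(w_rec[i][j] * h[j] for j in range(len(h)))
        f = sigmoid(pre)  # bounded gate in (0, 1) that modulates the time constant
        new_h.append((h[i] + dt * f * A[i]) / (1.0 + dt * (1.0 / tau[i] + f)))
    return new_h


# Single hidden unit with zero weights, so the gate is constant at f = 0.5.
h = [0.0]
for _ in range(500):
    h = ltc_step(h, x=[0.0], w_in=[[0.0]], w_rec=[[0.0]],
                 bias=[0.0], tau=[1.0], A=[1.0], dt=0.1)
```

With a constant gate the state settles at f·A / (1/τ + f), which is 1/3 for these values; with learned weights the gate, and therefore the effective time constant, varies with the input.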

Evaluation

Across benchmarks, hardware, and simulation charts.

EMMA is evaluated on five canonical video benchmarks, real-world platforms with hidden forcing, and chart-based simulation studies spanning biological and chaotic regimes.

  • 8.8% ± 1.7% mean error
    rover (5 params)
  • 100+ scenarios
    Delfys benchmark, rover, drone & simulation charts
  • > 10× lower implicit RMSE
    vs PySINDy
  • 107× smaller (compact model)
    vs Delfys baseline

Compared against PAIG, NIRPI, and Delfys on the video benchmarks and PySINDy on the chart-based simulations. Rover and quadrotor under hidden forcing are reported as first-of-their-kind results. Full tables, ablations, and per-system comparisons are in the paper.

Supported systems

Delfys Benchmark
  • Pendulum
  • Torricelli drainage
  • Sliding block
  • LED decay
  • Free fall
Real-world platforms
  • Differential-drive rover
  • 6-DoF quadrotor
  • Hidden audio forcing
Simulation charts
  • Lotka-Volterra
  • Chaotic Lorenz
  • F8 Crusader
  • HIV therapy
  • AID (Type-1 diabetes)

Citation

Cite this work.

@InProceedings{Shaikh_2026_CVPR,
  author    = {Shaikh, Farhat and Banerjee, Ayan and Gupta, Sandeep},
  title     = {EMMA: Extracting Multiple physical parameters from Multimodal Data},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision
               and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2026},
  pages     = {1716-1725}
}