March 10, 2026

FDM-1: Datacenter-Scale Robot Inference with 91x Fewer Parameters

Traditional vision encoders allocate hundreds of millions of parameters to processing camera frames — capacity largely spent encoding information irrelevant to action prediction. FDM-1 replaces that machinery with a compact alternative that delivers comparable performance at a fraction of the cost.

  • Encoder size: 4.4M params (91x smaller than SigLIP)
  • Tokens per frame: 5 (51x compression vs 256)
  • Inference speed: 7.3x faster (109 Hz on A100)
  • Action quality: 1.26x MSE (near-parity with teacher)

The problem with traditional vision encoders

Robot action models typically rely on large vision encoders to interpret camera frames. The Pi0.5 model, for example, allocates 400 million parameters to SigLIP — a vision encoder that processes images into 256 tokens per frame. While this provides detailed spatial information, much of it is irrelevant for action prediction. Tokens often encode background elements like table surfaces or empty space, which don't contribute meaningfully to deciding the robot's next move.

This "dead weight" leads to inflated model sizes, higher token counts, slower inference times, and increased computational demands.

FDM-1: a smarter way to encode vision

FDM-1 addresses these inefficiencies by introducing a lightweight encoder with just 4.4 million parameters — 91x smaller than SigLIP. Instead of generating 256 tokens per frame, it outputs only 5 carefully crafted tokens, achieving 51x compression. This results in 7.3x faster inference while maintaining action quality with only a 1.26x increase in MSE compared to the full SigLIP baseline.

Rather than uniformly pooling image patches (which can lose critical details), FDM-1 employs 5 learned query tokens. These queries cross-attend to all 196 patches in the image, selectively extracting the most relevant information. Each token specializes in different aspects:

  • Gripper state
  • Object pose
  • Arm configuration
  • Contact geometry
  • Scene context
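The pooling mechanism above can be sketched in a few lines of PyTorch. This is a minimal illustration of learned query tokens cross-attending to a 196-patch grid, not FDM-1's exact implementation; the embedding dimension, head count, and `QueryPooler` name are assumptions for the example.

```python
import torch
import torch.nn as nn

class QueryPooler(nn.Module):
    """Compress 196 ViT-style patch embeddings into 5 learned query tokens.

    Illustrative sketch of cross-attention pooling: dimensions and layer
    choices are placeholders, not FDM-1's actual configuration.
    """
    def __init__(self, dim=256, num_queries=5, num_heads=4):
        super().__init__()
        # 5 learned queries, each free to specialize (gripper state, object
        # pose, arm configuration, contact geometry, scene context).
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, patches):  # patches: (B, 196, dim)
        b = patches.shape[0]
        q = self.queries.unsqueeze(0).expand(b, -1, -1)  # (B, 5, dim)
        # Each query attends over all 196 patches and extracts what it needs.
        out, _ = self.attn(q, patches, patches)
        return self.norm(out)  # (B, 5, dim): 51x fewer tokens than 256

pooler = QueryPooler()
tokens = pooler(torch.randn(2, 196, 256))
print(tokens.shape)  # torch.Size([2, 5, 256])
```

Because the queries are trained end to end rather than fixed pooling windows, each one can learn to attend to a different subset of patches, which is what allows the specialization listed above to emerge.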

To ensure these tokens carry unique and complementary information, the model is trained using masked latent prediction. During training, 3 of the 5 tokens are masked, and the remaining 2 must predict them. This forces each token to encode information that is both non-redundant with the others and essential to the action decoder, rather than repeating the same superfluous detail five times.
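The training objective can be sketched as follows. This is a hedged illustration of masked latent prediction under the 3-of-5 masking described above; the `predictor` MLP, the stop-gradient on targets, and all dimensions are assumptions standing in for whatever FDM-1 actually uses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def masked_latent_loss(tokens, predictor, num_masked=3):
    """Mask 3 of the 5 tokens; the 2 visible tokens must predict them.

    Sketch only: `predictor` stands in for whatever small network maps
    visible tokens to the masked ones in the real training setup.
    """
    b, n, d = tokens.shape                    # (B, 5, D)
    perm = torch.randperm(n)
    masked_idx = perm[:num_masked]            # 3 hidden tokens
    visible_idx = perm[num_masked:]           # 2 visible tokens
    visible = tokens[:, visible_idx]          # (B, 2, D)
    targets = tokens[:, masked_idx].detach()  # reconstruction targets
    preds = predictor(visible.flatten(1))     # predict the hidden tokens
    return F.mse_loss(preds.view(b, num_masked, d), targets)

# Illustrative predictor: MLP from 2 visible tokens to 3 masked tokens.
D = 256
predictor = nn.Sequential(
    nn.Linear(2 * D, 512), nn.GELU(), nn.Linear(512, 3 * D)
)
loss = masked_latent_loss(torch.randn(4, 5, D), predictor)
```

If any two tokens carried redundant information, masking one of them would cost the model nothing, so minimizing this loss pushes the five tokens toward complementary content.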

Results from limited data

The current prototype was trained on just 4,300 robot frames in about 2 hours on a single A100 GPU. Despite this modest dataset, it achieves near-parity performance with the 256-token teacher model, with only a 1.26x MSE gap. We attribute this minor shortfall to data limitations rather than architectural flaws — SigLIP benefited from 400 million image-text pairs during pretraining.

With access to more diverse data and additional compute, we anticipate closing this gap entirely. We project that 5–7 tokens could match or exceed the performance of 256 tokens, paving the way for even more efficient robot control systems.

Breaking down the metrics

  • Encoder size: 4.4M parameters (vs. 400M for SigLIP), 91x smaller
  • Token count: 5 tokens per frame (vs. 256), 51x compression
  • Inference speed: 7.3x faster, 109 Hz on A100 hardware
  • Action quality: 1.26x higher MSE, minimal per-joint error differences

From research to production

We're already working to productionize this technology. We're developing an inference API with approximately 40ms roundtrip latency, incorporating techniques like DRTC (Dynamic Runtime Token Compression) for further optimizations.

The entire training process was completed on Modal for under $100, showcasing how accessible this approach is for smaller teams or individual researchers.

Why this matters

FDM-1 represents a shift toward task-specific efficiency in AI for robotics. By focusing on what truly matters for action prediction, it reduces the barriers to deploying advanced models on edge devices or in resource-constrained environments. As data scales up, we could see widespread adoption in autonomous manufacturing, healthcare robotics, and home assistants.

Real-world deployment videos are in progress. For technical details or collaboration opportunities, reach out at team@tryreflex.ai.