FDM-1: Datacenter-Scale Robot Inference with 91x Fewer Parameters
Traditional vision encoders allocate hundreds of millions of parameters to processing camera frames, and much of their output encodes information irrelevant to action prediction. FDM-1 replaces that encoder with a compact alternative that delivers comparable performance at a fraction of the cost.
The problem with traditional vision encoders
Robot action models typically rely on large vision encoders to interpret camera frames. The Pi0.5 model, for example, allocates 400 million parameters to SigLIP — a vision encoder that processes images into 256 tokens per frame. While this provides detailed spatial information, much of it is irrelevant for action prediction. Tokens often encode background elements like table surfaces or empty space, which don't contribute meaningfully to deciding the robot's next move.
This "dead weight" leads to inflated model sizes, higher token counts, slower inference times, and increased computational demands.
FDM-1: a smarter way to encode vision
FDM-1 addresses these inefficiencies with a lightweight encoder of just 4.4 million parameters — 91x smaller than SigLIP. Instead of generating 256 tokens per frame, it outputs only 5 learned tokens, a 51x compression. The result is 7.3x faster inference while maintaining action quality, with only a 1.26x increase in MSE relative to the full SigLIP baseline.
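The headline ratios follow directly from the parameter and token counts quoted above; a quick arithmetic check:

```python
# Sanity-check the headline ratios from the figures stated in the post.
siglip_params = 400_000_000   # SigLIP encoder in Pi0.5
fdm1_params = 4_400_000       # FDM-1 encoder
siglip_tokens = 256           # tokens per frame, SigLIP
fdm1_tokens = 5               # tokens per frame, FDM-1

param_ratio = siglip_params / fdm1_params   # ~90.9, rounded to 91x
token_ratio = siglip_tokens / fdm1_tokens   # 51.2, rounded to 51x
print(f"{param_ratio:.1f}x fewer parameters, {token_ratio:.1f}x token compression")
```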
Rather than uniformly pooling image patches (which can lose critical details), FDM-1 employs 5 learned query tokens. These queries cross-attend to all 196 patches in the image, selectively extracting the most relevant information. Each token specializes in different aspects:
- Gripper state
- Object pose
- Arm configuration
- Contact geometry
- Scene context
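The cross-attention pooling described above can be sketched in a few lines of PyTorch. This is a minimal illustration of the mechanism, not FDM-1's actual implementation — the embedding dimension, head count, and single attention layer are assumptions for the example:

```python
import torch
import torch.nn as nn

class LearnedQueryPooler(nn.Module):
    """Compress patch tokens into a few learned query tokens via cross-attention.

    Illustrative sketch: dim/num_heads and the single attention layer are
    assumed values, not FDM-1's exact configuration.
    """
    def __init__(self, dim: int = 256, num_queries: int = 5, num_heads: int = 4):
        super().__init__()
        # One learned query vector per specialized token (gripper, pose, ...)
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, 196, dim) — one token per image patch
        batch = patches.shape[0]
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)  # (batch, 5, dim)
        # Each query cross-attends over all 196 patches and pulls out
        # whatever information it has learned to specialize in.
        out, _ = self.attn(q, patches, patches)
        return out  # (batch, 5, dim): 5 compressed tokens per frame

pooler = LearnedQueryPooler()
tokens = pooler(torch.randn(2, 196, 256))
print(tokens.shape)  # (2, 5, 256)
```

Because the queries are parameters rather than pooled averages, gradients from the action loss can reshape what each token attends to, which is how the specialization above emerges.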
To ensure these tokens carry unique and complementary information, the model is trained using masked latent prediction. During training, 3 out of the 5 tokens are masked, forcing the remaining 2 to reconstruct them. This compels the encoder to focus on what's essential for the action decoder, eliminating superfluous data.
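The masked-latent objective can be sketched as follows. The `predictor` module here is a hypothetical stand-in for FDM-1's actual prediction head, and the token dimension is an assumed toy value:

```python
import torch
import torch.nn as nn

def masked_latent_loss(tokens: torch.Tensor, predictor: nn.Module,
                       num_masked: int = 3) -> torch.Tensor:
    """Mask 3 of the 5 tokens and predict them from the 2 visible ones.

    tokens: (batch, 5, dim) output of the encoder.
    predictor: any module mapping (batch, 2, dim) -> (batch, 3, dim);
               a hypothetical stand-in, not FDM-1's real head.
    """
    _, n, _ = tokens.shape                       # n == 5
    perm = torch.randperm(n)
    masked_idx, visible_idx = perm[:num_masked], perm[num_masked:]
    visible = tokens[:, visible_idx]             # (batch, 2, dim)
    target = tokens[:, masked_idx].detach()      # no gradient through targets
    pred = predictor(visible)                    # reconstruct the masked tokens
    return nn.functional.mse_loss(pred, target)

# Minimal usage with a toy predictor (dim=64 assumed for the example)
predictor = nn.Sequential(nn.Flatten(),
                          nn.Linear(2 * 64, 3 * 64),
                          nn.Unflatten(1, (3, 64)))
loss = masked_latent_loss(torch.randn(8, 5, 64), predictor)
loss.backward()  # gradients flow into both encoder tokens and predictor
```

If any two tokens carried redundant information, masking would be easy to invert and the objective would add little; penalizing reconstruction error therefore pushes the five tokens toward complementary content.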
Results from limited data
The current prototype was trained on just 4,300 robot frames in about 2 hours on a single A100 GPU. Despite this modest dataset, it achieves near-parity performance with the 256-token teacher model, with only a 1.26x MSE gap. We attribute this minor shortfall to data limitations rather than architectural flaws — SigLIP benefited from 400 million image-text pairs during pretraining.
With access to more diverse data and additional compute, we anticipate closing this gap entirely. We project that 5–7 tokens could match or exceed the performance of 256 tokens, paving the way for even more efficient robot control systems.
Breaking down the metrics

| Metric | SigLIP baseline | FDM-1 | Ratio |
| --- | --- | --- | --- |
| Encoder parameters | 400M | 4.4M | 91x smaller |
| Tokens per frame | 256 | 5 | 51x compression |
| Inference speed | 1x | 7.3x | 7.3x faster |
| Action MSE | 1x | 1.26x | 1.26x gap |
From research to production
We're already working to productionize this technology, developing an inference API with approximately 40ms roundtrip latency and incorporating techniques like DRTC (Dynamic Runtime Token Compression) for further optimization.
The entire training process was completed on Modal for under $100, showcasing how accessible this approach is for smaller teams or individual researchers.
Why this matters
FDM-1 represents a shift toward task-specific efficiency in AI for robotics. By focusing on what truly matters for action prediction, it reduces the barriers to deploying advanced models on edge devices or in resource-constrained environments. As data scales up, we could see widespread adoption in autonomous manufacturing, healthcare robotics, and home assistants.
Real-world deployment videos are in progress. For technical details or collaboration opportunities, reach out at team@tryreflex.ai.