General-Purpose Physical AI

[Architecture diagram: a long-term planning multimodal LLM (1-9 Hz) connected to a speech-to-text encoder, a text-to-speech decoder, a database of context and tasks, a retrieval model, and a 3D mapper; a short-term multimodal LLM (10 Hz), a diffusion transformer (120 Hz), state and action encoders, and a vision tower (10 Hz) process images, state, and actions. The two groups run on separate GPUs and exchange responses and embodiment-specific commands.]

Long-Term Planner

The long-term planning system is responsible for assigning tasks to the short-term module and for communicating with humans and other robots. It takes as input the spatial context from the database, the vision tower's output, and the short-term multimodal LLM's output; if needed, it queries the database and the retrieval model for additional information, and then issues a command to the short-term module.
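As a rough illustration of that flow, the sketch below shows one possible planning loop in Python. All class, method, and attribute names (LongTermPlanner, plan.needs_more_info, retrieval_model.query, and so on) are assumptions made for illustration, not the system's actual API.

```python
# Hypothetical sketch of the long-term planning loop described above.
# Every name here is illustrative, not part of the real system.
from dataclasses import dataclass


@dataclass
class Command:
    task: str       # natural-language task description
    context: dict   # spatial and task context handed to the short-term module


class LongTermPlanner:
    def __init__(self, llm, database, retrieval_model, short_term_module):
        self.llm = llm
        self.database = database
        self.retrieval_model = retrieval_model
        self.short_term_module = short_term_module

    def step(self, spatial_context, vision_features, short_term_feedback):
        # Draft a plan from the three inputs named in the text above.
        plan = self.llm.plan(spatial_context, vision_features, short_term_feedback)

        # Query the database / retrieval model only when the plan needs more info.
        if plan.needs_more_info:
            extra = self.retrieval_model.query(self.database, plan.open_questions)
            plan = self.llm.refine(plan, extra)

        # Hand the resulting command to the short-term action module.
        self.short_term_module.execute(Command(task=plan.task, context=plan.context))
```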

Database

The database stores contextual and spatial information about all sufficiently important tasks, objects, and people. If social memory is enabled, it can remember people's names and preferences. For each task it records the conditions for performing the task correctly, when the task should be done, and the commander's preferences for handling variations that arise during execution.
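A minimal sketch of how such records could be structured, assuming Python dataclasses; every field name below is an illustrative guess, not the database's real schema.

```python
# Illustrative record types for the context-and-tasks database.
from dataclasses import dataclass, field


@dataclass
class TaskRecord:
    name: str                   # e.g. "water the plants"
    conditions: list[str]       # conditions for performing the task correctly
    schedule: str               # when the task should be done
    commander_preferences: dict = field(default_factory=dict)  # fallbacks for variations


@dataclass
class PersonRecord:
    name: str                   # only stored when social memory is enabled
    preferences: dict = field(default_factory=dict)


@dataclass
class ObjectRecord:
    label: str                  # classified object type
    position: tuple             # last known location in the 3D map
```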

Short-Term Action Module

The short-term action module controls the robot's low-level actions. It is trained on over 1,000,000 trajectories to adapt to a wide range of environments, commands, and situations.
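A minimal sketch of how the two rates from the diagram could be combined in a single control loop, assuming a 10 Hz short-term multimodal LLM and a 120 Hz diffusion transformer; the llm, diffusion_transformer, and robot objects and their methods are placeholders.

```python
# Hypothetical short-term control loop: slow goal updates, fast action sampling.
import time


def control_loop(llm, diffusion_transformer, robot, llm_hz=10, action_hz=120):
    latent_goal = None
    last_llm_tick = 0.0
    while True:
        now = time.monotonic()

        # Refresh the high-level latent goal at the slower LLM rate.
        if latent_goal is None or now - last_llm_tick >= 1.0 / llm_hz:
            latent_goal = llm.encode_goal(robot.camera_image(), robot.state())
            last_llm_tick = now

        # Sample an action from the diffusion transformer at the fast control rate.
        action = diffusion_transformer.sample(latent_goal, robot.state())
        robot.apply(action)

        # Sleep off the remainder of the control period, if any.
        time.sleep(max(0.0, 1.0 / action_hz - (time.monotonic() - now)))
```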

3D Mapping

The 3D Mapper continuously scans the environment and compiles the surroundings into a voxel grid in which each voxel belongs to a classified object. It can take the depth map from a single depth camera as input, or fuse the output of multiple depth cameras and LiDARs.
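A minimal sketch of the single-depth-camera case, assuming a pinhole camera model; the function name, intrinsics, and voxel size below are illustrative, not the mapper's actual interface.

```python
# Back-project one depth map into occupied voxel indices (pinhole model).
import numpy as np


def depth_to_voxels(depth, fx, fy, cx, cy, voxel_size=0.05):
    """Convert a depth map (H x W, metres) into integer voxel coordinates."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    points = points[points[:, 2] > 0]          # drop invalid (zero-depth) pixels

    # Quantise points to the voxel grid and deduplicate occupied cells.
    return np.unique(np.floor(points / voxel_size).astype(int), axis=0)


# Example: a synthetic 4x4 depth map one metre in front of the camera.
if __name__ == "__main__":
    depth = np.full((4, 4), 1.0)
    print(depth_to_voxels(depth, fx=500, fy=500, cx=2, cy=2, voxel_size=0.1))
```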
