Humanoid Visual SLAM

Real-time humanoid state estimation fusing IMU attitude, kinematic odometry, and monocular visual SLAM for robust drift-free locomotion.

Real-Time Humanoid State Estimation with IMU, Kinematic Odometry, and Visual SLAM

We present a real-time humanoid state estimation method that loosely fuses inertial attitude (Mahony filter), kinematic odometry from legged locomotion, and a visual SLAM frontend based on the open-source Stella system. Humanoid locomotion induces rapid viewpoint changes, motion blur, and contact-driven dynamics that challenge conventional visual and visual-inertial pipelines.

Problem Statement

We consider a world-fixed inertial frame $\mathcal{W}$ and a body-fixed frame $\mathcal{B}$ rigidly attached to the humanoid robot midway between the hip yaw joints. The aim is to provide a real-time estimate of the floating base pose:

$$\mathbf{H}_b^w = \begin{bmatrix} \mathbf{R}_b^w & \mathbf{r}_{B/W}^w \\ \mathbf{0} & 1 \end{bmatrix}$$

where $\mathbf{R}_b^w \in SO(3)$ is the rotation from $\mathcal{B}$ to $\mathcal{W}$, and $\mathbf{r}_{B/W}^w \in \mathbb{R}^3$ is the body position in $\mathcal{W}$. The six degrees of freedom $(x, y, z, \phi, \theta, \psi)$ form the basis of evaluation.
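For reference, the pose matrix above can be assembled and decomposed as follows. This is a minimal sketch; the ZYX (yaw-pitch-roll) Euler convention is our assumption, as the text does not specify one:

```python
import numpy as np

def pose_to_matrix(R, r):
    """Assemble the 4x4 homogeneous transform H_b^w from R_b^w and r_B/W^w."""
    H = np.eye(4)
    H[:3, :3] = R
    H[:3, 3] = r
    return H

def matrix_to_6dof(H):
    """Extract (x, y, z, roll, pitch, yaw) using the ZYX Euler convention (assumed)."""
    R = H[:3, :3]
    roll = np.arctan2(R[2, 1], R[2, 2])
    pitch = -np.arcsin(R[2, 0])
    yaw = np.arctan2(R[1, 0], R[0, 0])
    return np.array([H[0, 3], H[1, 3], H[2, 3], roll, pitch, yaw])
```

These six numbers are exactly the $(x, y, z, \phi, \theta, \psi)$ evaluated in the results.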

Our Approach

Our approach targets real-time humanoid state estimation by combining three sensor inputs: an IMU (gyroscope and accelerometer), servo position encoders, and a monocular RGB camera. For each degree of freedom we select the estimation pipeline best suited to it.

Mahony Filter Attitude Estimation

We estimate the floating base orientation using the Mahony filter, a nonlinear complementary filter on $SO(3)$. It fuses gyroscope and accelerometer measurements with integral correction terms, providing reliable roll and pitch estimates. In the absence of a magnetometer, yaw gradually drifts over time and requires correction from other sources.
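A minimal version of such a filter can be sketched as below. The gains `kp`/`ki` and the SVD re-orthonormalisation step are illustrative choices, not the tuning used on the robot:

```python
import numpy as np

def skew(w):
    """Skew-symmetric matrix so that skew(w) @ v == np.cross(w, v)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

class MahonyFilter:
    """Sketch of a Mahony complementary filter on SO(3) with gyro bias estimation."""
    def __init__(self, kp=1.0, ki=0.1):
        self.R = np.eye(3)        # estimated attitude R_b^w
        self.bias = np.zeros(3)   # estimated gyroscope bias
        self.kp, self.ki = kp, ki

    def update(self, gyro, accel, dt):
        # Measured vs. predicted gravity direction in the body frame.
        v_meas = accel / np.linalg.norm(accel)
        v_pred = self.R.T @ np.array([0.0, 0.0, 1.0])
        # Innovation: misalignment between the two directions (zero when aligned).
        e = np.cross(v_meas, v_pred)
        # Integral term drives the bias estimate; proportional term corrects attitude.
        self.bias -= self.ki * e * dt
        omega = gyro - self.bias + self.kp * e
        # First-order integration on SO(3), then project back to a rotation matrix.
        self.R = self.R @ (np.eye(3) + skew(omega) * dt)
        U, _, Vt = np.linalg.svd(self.R)
        self.R = U @ Vt
        return self.R
```

Only roll and pitch are observable from the accelerometer, which is why yaw must be corrected elsewhere, as noted above.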

Anchor Point Kinematic Odometry

Leveraging servo position sensors, we employ the anchor point strategy: when a foot is in contact, an anchor point $A$ on its sole is assumed fixed in $\mathcal{W}$. The body frame is then computed as:

$$\mathbf{H}_b^w = \mathbf{H}_a^w (\mathbf{H}_a^b)^{-1}$$

Each time a support foot switch occurs, the anchor point is updated to the new support foot using relative kinematics between feet. While simple and computationally light, this approach accumulates drift over time from foot slippage, sole compliance, and encoder quantisation.
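The anchor-point update reduces to homogeneous-transform composition. A sketch, with `switch_anchor` showing the re-anchoring step at a support-foot switch (function names are ours):

```python
import numpy as np

def inv_h(H):
    """Inverse of a 4x4 homogeneous transform without a general matrix inverse."""
    R, t = H[:3, :3], H[:3, 3]
    Hi = np.eye(4)
    Hi[:3, :3] = R.T
    Hi[:3, 3] = -R.T @ t
    return Hi

def body_from_anchor(H_a_w, H_a_b):
    """H_b^w = H_a^w (H_a^b)^{-1}: body pose from the fixed anchor and forward kinematics."""
    return H_a_w @ inv_h(H_a_b)

def switch_anchor(H_a_w, H_a_b, H_anew_b):
    """On support-foot switch, fix the new anchor in the world: H_anew^w = H_b^w H_anew^b."""
    H_b_w = body_from_anchor(H_a_w, H_a_b)
    return H_b_w @ H_anew_b
```

Here `H_a_b` is the current anchor pose in the body frame from leg kinematics, and `H_anew_b` the new support foot's anchor pose; both come from encoder readings, which is why encoder quantisation and foot slip accumulate as drift.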

Visual SLAM with Online Scale Estimation

We adopt Stella, an open-source fork of the OpenVSLAM framework, which combines an ORB feature-based frontend with keyframe-based pose graph optimisation, loop closure, and relocalisation.

Since we employ monocular SLAM, camera poses are recovered only up to an arbitrary scale. We recover metric scale online by exploiting ground-plane geometry: given the camera position $\mathbf{r}_{C/W}^w$ at height $h$ (known from kinematics) and the unit ray direction $\mathbf{u}_i$ to a map point on the ground, the ray is intersected with the $z=0$ plane:

$$d_i = \frac{|h|}{|(\mathbf{u}_i)_z|}, \qquad \mathbf{r}_{P_i/W,g}^w = \mathbf{r}_{C/W}^w + d_i \mathbf{u}_i$$

The per-point scales are averaged and smoothed with exponential filtering to produce a stable global scale factor.
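The scale recovery can be sketched as below; the smoothing coefficient `alpha` and the near-horizontal ray threshold are illustrative assumptions:

```python
import numpy as np

def ground_plane_scale(r_c_w, dirs, slam_depths, prev_scale=None, alpha=0.2):
    """Estimate the metric scale factor from presumed ground points.

    r_c_w: metric camera position from kinematics (its z is the height h).
    dirs: Nx3 unit ray directions in the world frame.
    slam_depths: corresponding unscaled SLAM depths along those rays.
    """
    h = r_c_w[2]
    uz = dirs[:, 2]
    valid = np.abs(uz) > 1e-6                     # discard near-horizontal rays
    d_metric = np.abs(h) / np.abs(uz[valid])      # metric distance to the z=0 plane
    scales = d_metric / slam_depths[valid]        # per-point scale factors
    s = scales.mean()
    if prev_scale is not None:
        s = alpha * s + (1.0 - alpha) * prev_scale  # exponential smoothing
    return s
```

Multiplying SLAM translations by this factor puts them in metres, so they can be fused with the kinematic odometry.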

Filtering and Fusion

To address the limitations of each component, we adopt a loosely coupled architecture:

  • Yaw filter: A lightweight complementary filter combining gyroscope integration with kinematic yaw updates and visual SLAM corrections, adaptively estimating and compensating for gyroscope bias.

  • Sliding-window smoothing: A weighted least-squares problem over the most recent $N$ body states balances smoothness with agreement with the visual and kinematic measurements:

$$J(\mathbf{X}) = \sum_{i} w_{sm} \|\mathbf{x}_{i+1} - \mathbf{x}_i\|^2 + \sum_{i \in \mathcal{M}_s} w_{s,i} \|\hat{\mathbf{x}}^s_i - \mathbf{x}_i\|^2 + \sum_{i \in \mathcal{M}_a} w_{a,i} \|\hat{\mathbf{x}}^a_i - \mathbf{x}_i\|^2$$

The visual weights $w_{s,i}$ are adaptive and reflect the SLAM frontend status (high during stable tracking, reduced during relocalisation, near-zero when lost).
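Because every term in $J(\mathbf{X})$ is linear in the states, the window can be solved as one linear least-squares problem. A one-dimensional sketch (one state component), with illustrative weights and `np.nan` marking steps without a measurement:

```python
import numpy as np

def smooth_window(meas_slam, meas_kin, w_sm=10.0, w_s=1.0, w_a=1.0):
    """Solve the sliding-window WLS for one scalar state dimension.

    meas_slam / meas_kin: length-N arrays of visual and kinematic
    measurements, with np.nan where no measurement is available.
    Weights are illustrative; in the paper the visual weights adapt
    to the SLAM frontend status.
    """
    N = len(meas_kin)
    rows, rhs = [], []
    for i in range(N - 1):                       # smoothness terms
        r = np.zeros(N)
        r[i], r[i + 1] = -np.sqrt(w_sm), np.sqrt(w_sm)
        rows.append(r); rhs.append(0.0)
    for meas, w in ((meas_slam, w_s), (meas_kin, w_a)):
        for i, m in enumerate(meas):             # measurement terms
            if not np.isnan(m):
                r = np.zeros(N)
                r[i] = np.sqrt(w)
                rows.append(r); rhs.append(np.sqrt(w) * m)
    x, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return x
```

Weighting each row by the square root of its weight makes the ordinary least-squares residual equal to $J$; dropping the rows for missing measurements is what lets the smoother coast through a visual dropout on kinematics alone.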

Results

Experiments were conducted on the NUgus humanoid robot across four scenarios, comparing IMU+kinematics, vision-only (Stella), and the fused system.

| Scenario | Method | X [m] | Y [m] | Z [m] | Roll [°] | Pitch [°] | Yaw [°] |
|---|---|---|---|---|---|---|---|
| Stella alone (dropout) | Stella | 0.479 | 0.035 | 0.004 | 52.55 | 51.16 | 7.63 |
| | IMU+Kinematics | 0.070 | 0.090 | 0.004 | 3.22 | 15.35 | 6.21 |
| Pickup | Fused System | 0.063 | 0.032 | 0.026 | 1.28 | 1.59 | 1.48 |
| | IMU+Kinematics | N/A | N/A | 0.026 | 1.28 | 1.59 | 4.08 |
| Forward Walking | Fused System | 0.021 | 0.041 | 0.004 | 1.27 | 1.75 | 1.50 |
| | IMU+Kinematics | 0.162 | 0.113 | 0.004 | 1.27 | 1.75 | 6.59 |
| Visual Dropout | Fused System | 0.112 | 0.101 | 0.005 | 1.43 | 2.21 | 4.00 |
| | IMU+Kinematics | 0.236 | 0.181 | 0.005 | 1.43 | 2.21 | 11.11 |

Across all scenarios, the fused estimator consistently outperforms the individual modalities. Yaw errors were reduced by a factor of 3–5 compared to IMU+kinematics, with significantly lower translational drift. The system degrades gracefully during visual dropout and reconverges once tracking resumes.