Humanoid State Estimation

Floating-base state estimation for a humanoid robot using Mahony filter and kinematic odometry.

Humanoid State Estimation in RoboCup

This work presents a comprehensive pipeline for kinematic state estimation of humanoid robots in the RoboCup competition. The dynamic and sensor-limited environment of RoboCup poses significant challenges for accurate state estimation, including unstable walking surfaces, frequent collisions, and restrictions on external sensing modalities like LiDAR and GPS.

Introduction

For a humanoid robot, locomotion involves controlling the unactuated floating base to a desired location in the world. Before a control action can be applied, an accurate estimate of the position and orientation of the floating base is required. In the context of RoboCup, the artificial grass surface and collisions with other robots complicate stable walking, making state estimation challenging due to falls, sensor noise, and drift over time.

Kinematic state estimation in RoboCup can be broken down into two major areas:

  • Odometry: Estimation of the robot's pose with respect to an inertial world frame
  • Localization: Estimation of the robot's pose with respect to the soccer field frame

Odometry

Since the contact configuration of a robot during walking is constantly changing, we construct a representation of the system in a general world-fixed inertial frame. We consider two reference frames: a world-fixed inertial frame attached to the ground, and a body-fixed frame rigidly attached to the robot midway between its hip yaw joints.

The homogeneous transformation matrix capturing the relationship between these frames is given by:

H_b^w = \begin{bmatrix} R_b^w & r_{B/W}^w \\ 0_{1 \times 3} & 1 \end{bmatrix}

where R_b^w is the rotation matrix from the body frame to the world frame, and r_{B/W}^w is the position vector of the body frame with respect to the world frame.

To estimate the orientation of the floating base body frame, we use the Mahony filter, a simple and efficient approach for real-time attitude estimation. The Mahony filter has only two tuning parameters, the PI compensator gains K_p and K_i, making the tuning process straightforward.
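A minimal sketch of one Mahony filter update is shown below; it assumes a [w, x, y, z] quaternion mapping body to world and an accelerometer used purely as a gravity reference. Variable names and gain values are illustrative, not taken from the actual implementation.

```python
import numpy as np

def quat_mul(q, p):
    """Hamilton product of quaternions given as [w, x, y, z]."""
    w1, x1, y1, z1 = q
    w2, x2, y2, z2 = p
    return np.array([
        w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
        w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
        w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
        w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2,
    ])

def mahony_update(q, bias, gyro, accel, dt, Kp=1.0, Ki=0.1):
    """One Mahony filter step: PI correction of the gyro rates using gravity."""
    w, x, y, z = q
    # Estimated gravity direction in the body frame from the current attitude.
    v = np.array([2 * (x * z - w * y), 2 * (y * z + w * x), w * w - x * x - y * y + z * z])
    # Innovation: misalignment between measured and estimated gravity.
    a = accel / np.linalg.norm(accel)
    e = np.cross(a, v)
    # PI compensator with the two tuning gains mentioned above.
    bias = bias - Ki * e * dt
    omega = gyro - bias + Kp * e
    # Integrate the attitude quaternion and renormalize.
    q = q + 0.5 * quat_mul(q, np.array([0.0, *omega])) * dt
    return q / np.linalg.norm(q), bias
```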

For floating base translation estimation, we use the anchor point strategy. We select an anchor point A, located on the robot's foot sole, and assume that this point is grounded at position r_{A/W}^W in the world frame whenever it serves as the support foot. In the floating-base frame, the position r_{A/B}^B of this anchor point is known through forward kinematics, allowing continuous tracking of the floating base translation relative to the world frame.
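A rough sketch of the anchor point update, under the assumption that the support foot does not slip; function and variable names are illustrative.

```python
import numpy as np

def base_translation(r_AW_w, r_AB_b, R_b_w):
    """Floating-base position in the world frame, assuming the anchor A is grounded.

    r_AW_w : latched world position of the anchor point on the support foot
    r_AB_b : anchor position in the body frame from forward kinematics
    R_b_w  : body-to-world rotation, e.g. from the Mahony filter
    """
    # r_{B/W}^w = r_{A/W}^w - R_b^w r_{A/B}^b
    return r_AW_w - R_b_w @ r_AB_b

def latch_new_anchor(r_BW_w, r_AB_b_new, R_b_w):
    """When support switches feet, latch the new anchor's world position."""
    # r_{A/W}^w = r_{B/W}^w + R_b^w r_{A/B}^b
    return r_BW_w + R_b_w @ r_AB_b_new
```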

Visual Landmark Detection

Our localization approach relies on visual landmarks detected using two computer vision methods:

  • YOLOv8n: State-of-the-art real-time object detection for identifying objects and key landmarks
  • Visual Mesh: Highly efficient semantic segmentation network specifically tuned for detecting field lines

The landmarks include YOLOv8n-detected goal posts, T, L, and X intersections, and field line points detected by the Visual Mesh.

Without loss of generality, through a combination of our camera model and the extrinsic matrix H_{body}^c, the pixel-based detections can be projected onto the field plane. A detection in world space \hat{r}^w_{O/W} is given by:

\hat{r}^w_{O/W} = \frac{e_3^T r^w_{C/W}}{e_3^T \left(R^w_c u^c_{O/C}\right)} R^w_c u^c_{O/C} + r^w_{C/W}

where u^c_{O/C} is the unit vector associated with a pixel obtained through our camera model, r^w_{C/W} is the position of the camera in the world frame, R^w_c is the rotation matrix from the camera frame to the world frame, and e_3 is the basis vector [0, 0, 1]^T.
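The projection above amounts to scaling the camera ray so that it intersects the field plane. A minimal sketch, keeping the same sign convention as the equation (function and variable names are illustrative):

```python
import numpy as np

E3 = np.array([0.0, 0.0, 1.0])

def project_to_field_plane(u_OC_c, R_c_w, r_CW_w):
    """Project a unit pixel ray onto the field plane (world-frame detection)."""
    ray_w = R_c_w @ u_OC_c                # ray direction in the world frame
    scale = (E3 @ r_CW_w) / (E3 @ ray_w)  # distance along the ray to the plane
    return scale * ray_w + r_CW_w
```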

Performance Benchmarks

Method      | Simulation (i7-11850H) | Robot (i7-1260P)
YOLOv8n     | 47 FPS                 | 66 FPS
Visual Mesh | 152 FPS                | 259 FPS

Localization

The localization problem can be formulated as estimating the transformation between the world frame and the field frame. Due to the flat nature of the soccer field, this can be fully described by the transformation matrix:

H_w^f(\mathbf{x}) = \begin{bmatrix} \cos(\theta) & -\sin(\theta) & 0 & x \\ \sin(\theta) & \cos(\theta) & 0 & y \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}

where \mathbf{x} = [x, y, \theta]^T is a vector containing the x-y translation and yaw rotation.
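A small helper sketch for assembling this matrix from the state vector (illustrative only):

```python
import numpy as np

def field_transform(state):
    """Build H_w^f from the state x = [x, y, theta]."""
    x, y, theta = state
    c, s = np.cos(theta), np.sin(theta)
    return np.array([
        [c,   -s,   0.0, x],
        [s,    c,   0.0, y],
        [0.0,  0.0, 1.0, 0.0],
        [0.0,  0.0, 0.0, 1.0],
    ])
```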

We propose a localization method leveraging nonlinear optimization to compute the optimal state \mathbf{x} in real time. Our framework employs the derivative-free algorithm COBYLA (Constrained Optimization BY Linear Approximations), integrating multiple cost components and constraints.

The optimization problem is given by:

\begin{aligned} \mathbf{x}^* &= \underset{\mathbf{x}}{\arg\min} \, J(\mathbf{x}) \\ \text{s.t.} \quad &\mathbf{x}_{\min} \leq \mathbf{x} \leq \mathbf{x}_{\max} \end{aligned}

where \mathbf{x}_{\min}, \mathbf{x}_{\max} are the lower and upper bounds on the state vector \mathbf{x}.
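Setting up such a bound-constrained COBYLA solve with NLopt's Python bindings might look like the sketch below; the placeholder cost, bounds, and tolerance are illustrative assumptions, and the actual cost terms are described in the next subsection.

```python
import numpy as np
import nlopt

def J(x):
    """Placeholder cost; the real J(x) combines the terms described below."""
    return float(np.sum(x ** 2))

x0 = np.array([0.0, 0.0, 0.0])          # prior state estimate [x, y, theta]
x_min = np.array([-5.0, -3.5, -np.pi])  # illustrative lower bounds
x_max = np.array([5.0, 3.5, np.pi])     # illustrative upper bounds

opt = nlopt.opt(nlopt.LN_COBYLA, 3)
opt.set_min_objective(lambda x, grad: J(x))  # COBYLA is derivative-free; grad unused
opt.set_lower_bounds(x_min)
opt.set_upper_bounds(x_max)
opt.set_xtol_rel(1e-4)
x_star = opt.optimize(x0)
```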

Cost Function Components

The overall cost function J(\mathbf{x}) is defined as:

J(\mathbf{x}) = w_{fl} J_{fl}(\mathbf{x}) + w_{lm} J_{lm}(\mathbf{x}) + w_{sc} J_{sc}(\mathbf{x})

  1. Field Line Alignment Cost J_{fl}(\mathbf{x}): Measures how well the observed field line points align with the actual field lines:
J_{fl}(\mathbf{x}) = \sum_{i=1}^{N_{fl}} d_{map}\left(H_w^f(\mathbf{x}) \hat{\mathbf{r}}^w_i\right)^2

where N_{fl} is the number of observed field line points, \hat{r}^w_i is the i-th field line point in the world frame, transformed into the field frame via H_w^f(\mathbf{x}), and d_{map} is a function which provides the distance to the nearest field line using a precomputed distance map.

  2. Landmark Cost J_{lm}(\mathbf{x}): Assesses the alignment of observed field line intersections and goal posts with known positions:
J_{lm}(\mathbf{x}) = \sum_{i=1}^{N_{lm}} \left\| \mathbf{r}^f_i - H_w^f(\mathbf{x}) \hat{\mathbf{r}}^w_i \right\|^2

where N_{lm} is the number of associated landmarks, r^f_i is the known position of the i-th landmark in the field frame, and \hat{r}^w_i is the observed position of the i-th landmark in the world frame.

  3. State Change Cost J_{sc}(\mathbf{x}): Penalizes significant deviations from the prior state estimate:
J_{sc}(\mathbf{x}) = \left\| \mathbf{x} - \mathbf{x}_0 \right\|^2

where \mathbf{x}_0 is the prior state estimate (initial guess).
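To make the combination of the three terms concrete, a rough sketch is given below. The distance_map lookup, array shapes, and weights are assumptions for illustration, not the actual implementation.

```python
import numpy as np

def total_cost(x, field_pts_w, lm_obs_w, lm_known_f, x0, distance_map,
               w_fl=1.0, w_lm=1.0, w_sc=0.1):
    """Evaluate J(x) = w_fl*J_fl + w_lm*J_lm + w_sc*J_sc for a candidate state."""
    px, py, theta = x
    c, s = np.cos(theta), np.sin(theta)
    R, t = np.array([[c, -s], [s, c]]), np.array([px, py])

    def to_field(pts_w):
        # Apply H_w^f(x) to world-frame points (x-y only, since the field is planar).
        return pts_w[:, :2] @ R.T + t

    # 1. Field line alignment: squared distance of each point to the nearest line.
    J_fl = sum(distance_map(p) ** 2 for p in to_field(field_pts_w))
    # 2. Landmarks: squared error between known and observed landmark positions.
    J_lm = np.sum(np.linalg.norm(lm_known_f[:, :2] - to_field(lm_obs_w), axis=1) ** 2)
    # 3. State change: deviation from the prior estimate.
    J_sc = np.sum((np.asarray(x) - np.asarray(x0)) ** 2)
    return w_fl * J_fl + w_lm * J_lm + w_sc * J_sc
```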

Results

After each optimization step, the solution \mathbf{x} is filtered using a standard Kalman filter to smooth state estimates over time. Our method achieves the lowest RMSE compared to the other approaches evaluated:

Method                    | x [m]  | y [m]  | yaw [deg]
Particle Filter           | 0.0563 | 0.0890 | 1.6180
NLopt (field lines only)  | 0.0503 | 0.0563 | 0.8389
NLopt (all cost terms)    | 0.0500 | 0.0559 | 0.8273
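The Kalman smoothing step mentioned above could be as simple as the following sketch, assuming an identity process model on [x, y, theta] and illustrative noise values (yaw wrap-around omitted for brevity):

```python
import numpy as np

class PoseSmoother:
    """Smooth the optimizer output with a linear Kalman filter on [x, y, theta]."""

    def __init__(self, x0, p0=1.0, q=1e-3, r=1e-2):
        self.x = np.asarray(x0, dtype=float)
        self.P = np.eye(3) * p0
        self.Q = np.eye(3) * q   # process noise (how fast the pose may drift)
        self.R = np.eye(3) * r   # measurement noise on the optimizer solution

    def update(self, x_star):
        # Predict: identity dynamics, inflate covariance.
        P_pred = self.P + self.Q
        # Correct: treat the optimizer solution as a direct pose measurement.
        K = P_pred @ np.linalg.inv(P_pred + self.R)
        self.x = self.x + K @ (np.asarray(x_star) - self.x)
        self.P = (np.eye(3) - K) @ P_pred
        return self.x
```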

On average, the optimization routine and filtering step take only 2 milliseconds to complete, making it suitable for real-time applications on resource-constrained humanoid robot hardware.

Key Contributions

  • Integrated odometry approach combining Mahony filter with anchor point strategy
  • Real-time visual landmark detection using YOLOv8n and Visual Mesh
  • Novel nonlinear optimization framework for localization with multiple cost terms
  • Efficient implementation achieving sub-5ms computation time
  • Robust performance in challenging RoboCup environments with limited sensors