3. Perception & Sensing

The Goal: From Raw Data to Understanding

A humanoid robot's ability to act meaningfully in the world is fundamentally limited by its ability to perceive the world. Perception is not merely about collecting data; it's about transforming raw signals from a suite of sensors into a coherent, actionable understanding of the environment and the robot's own state within it.

In this chapter, we explore the primary sensors that give a robot its "senses" and the foundational algorithms that begin the process of turning data into knowledge.

Exteroceptive Sensors: Seeing the World

These sensors gather information about the external environment.

Vision (Cameras)

Cameras are the richest source of data for a robot, providing dense information about color, texture, and shape. They are passive, low-power, and inexpensive.

Monocular Cameras: A single camera, like a human eye. It provides a 2D projection of the 3D world. Inferring depth is challenging but can be done with advanced AI models.
Stereo Cameras: Two cameras spaced a known distance apart. By comparing the two images (finding a disparity map), the robot can calculate depth through triangulation, mimicking human stereoscopic vision.
RGB-D (Depth) Cameras: These cameras directly provide a depth map alongside a standard color image. Common techniques include Time-of-Flight (ToF), which measures the time it takes for light to bounce back to the sensor.

The raw images from these cameras are processed by computer vision algorithms, which are now dominated by Deep Learning, specifically Convolutional Neural Networks (CNNs). Key tasks include:

Object Detection: Identifying and drawing bounding boxes around objects (e.g., "this is a cup").
Semantic Segmentation: Classifying every single pixel in an image (e.g., "these pixels are 'road'," "these pixels are 'sky'").
Pose Estimation: Determining an object's precise 3D position and orientation relative to the camera.

LiDAR (Light Detection and Ranging)

LiDAR sensors work by emitting rapid pulses of laser light and measuring the time it takes for the reflections to return.

Output: This process generates a point cloud, which is a rich, 3D "map" of the surrounding environment.
Strengths: LiDAR is highly accurate for measuring distance and is unaffected by lighting conditions, making it excellent for mapping, localization, and detecting obstacles.
Weaknesses: It typically doesn't provide color information and can have trouble with reflective or transparent surfaces.

Proprioceptive Sensors: Understanding Self

These sensors provide information about the robot's own state and movement.

Inertial Measurement Unit (IMU)

The IMU is the core of the robot's sense of balance, located near its center of mass.

Components: It contains an accelerometer (measures linear acceleration) and a gyroscope (measures angular velocity).
Function: By integrating the data from the IMU, the robot can estimate its orientation (roll, pitch, yaw) and track its motion. It is absolutely critical for maintaining balance.

Joint Encoders

These are the most fundamental proprioceptive sensors. An encoder is attached to every joint (motor) on the robot and provides a precise measurement of the joint's angle. Without encoders, the robot would have no reliable knowledge of its own posture.

Force-Torque (F/T) Sensors

Located in critical areas like the wrists and ankles, F/T sensors measure the forces and torques resulting from interaction with the environment. They allow the robot to feel when its foot is firmly on the ground or how much force it's exerting while grasping an object. This is essential for compliant motion and safe physical interaction.

The Power of Fusion

No single sensor is perfect. GPS can be inaccurate, IMUs drift over time, and camera-based estimates can be noisy. Sensor fusion is the process of intelligently combining data from multiple sensors to produce a state estimate that is more accurate, complete, and reliable than any individual sensor could provide.

The most common tool for this is the Kalman Filter (and its variants like the Extended Kalman Filter for non-linear systems). It's a two-step process:

Predict: The filter uses a motion model to predict the robot's state at the next point in time.
Update: It then uses a new sensor measurement to correct and update this prediction.

By fusing high-frequency data from an IMU with lower-frequency but more stable data from a camera or joint encoders, the robot can maintain a highly accurate and smooth estimate of its state.

Code Example: Basic Obstacle Detection

Let's simulate a basic perception task. Imagine we have a depth image from an RGB-D camera, represented as a 2D NumPy array where each value is the distance in meters. Our goal is to detect if there is an imminent obstacle directly in front of the robot.

Python

import numpy as np

def check_for_imminent_obstacle(depth_image, danger_zone_m, obstacle_threshold_m):
  """
  Checks a central region of a depth image for obstacles.
  
  Args:
      depth_image (np.array): 2D array representing the depth image.
      danger_zone_m (float): How close an object must be to be an obstacle.
      obstacle_threshold_m (float): Percentage of pixels in the danger zone
                                   that must be obstacles to trigger a warning.
                                   
  Returns:
      bool: True if an obstacle is detected, False otherwise.
  """
  height, width = depth_image.shape
  
  # Define a central region of interest (e.g., middle 20% of the width)
  roi_start = int(width // 2 - width * 0.1)
  roi_end = int(width // 2 + width * 0.1)
  center_region = depth_image[:, roi_start:roi_end]
  
  # Find all pixels where the distance is less than our danger zone
  # and greater than zero (ignoring invalid depth readings)
  obstacle_pixels = center_region[(center_region < danger_zone_m) & (center_region > 0)]
  
  # Calculate the percentage of the region that is considered an obstacle
  obstacle_percentage = len(obstacle_pixels) / center_region.size
  
  print(f"Obstacle percentage in center region: {obstacle_percentage:.2%}")
  
  if obstacle_percentage > obstacle_threshold_m:
      print("WARNING: Imminent obstacle detected!")
      return True
  else:
      print("Path ahead is clear.")
      return False

# --- Simulation ---
# Create a simulated 480x640 depth image, mostly clear (5 meters away)
simulated_depth = np.full((480, 640), 5.0)

# Add a close obstacle (0.5 meters away) in the center
simulated_depth[180:300, 280:360] = 0.5

# --- Run the check ---
# An object is an obstacle if it's closer than 1.0 meter
# and if more than 30% of the center region is blocked.
check_for_imminent_obstacle(simulated_depth, danger_zone_m=1.0, obstacle_threshold_m=0.3)

The Goal: From Raw Data to Understanding​

Exteroceptive Sensors: Seeing the World​

Vision (Cameras)​

LiDAR (Light Detection and Ranging)​

Proprioceptive Sensors: Understanding Self​

Inertial Measurement Unit (IMU)​

Joint Encoders​

Force-Torque (F/T) Sensors​

The Power of Fusion​

Code Example: Basic Obstacle Detection​