12. Safety and Failure Analysis
The First Principle: Do No Harm
A humanoid robot is a physically powerful machine designed to operate in unstructured, human-centric environments. This combination makes safety the single most important design consideration, overriding all other performance metrics. A robot that is 99% reliable is a failure if the remaining 1% involves a risk of injury to a person.
Safety is not a feature to be added at the end; it is a principle that must be woven into every layer of the hardware and software stack. This chapter explores the methodologies engineers use to build and validate safe robotic systems.
Understanding Failure
To build a safe system, we must first understand the ways it can fail. Failures can be broadly categorized:
- Hardware Failures: A motor seizes, a power supply fails, a sensor starts returning garbage data due to a broken wire.
- Software Failures: A bug in the control loop causes an oscillation, a perception model misidentifies a human as an inanimate object, a planner generates a path that intersects with a forbidden zone.
- Unexpected Scenarios: The robot encounters a situation its designers never anticipated, like a slippery, freshly waxed floor or an object with unusual visual properties, such as a mirrored or transparent surface.
A robust safety architecture must be prepared to handle all of these.
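In software, these categories often become an explicit type that a safety supervisor can dispatch on, so every detected fault maps to a deliberate response rather than an unhandled exception. A minimal sketch (the `FaultClass` enum and its default responses are illustrative assumptions, not a standard API):

```python
from enum import Enum, auto

class FaultClass(Enum):
    """Broad failure categories a safety supervisor might distinguish."""
    HARDWARE = auto()    # e.g. seized motor, broken sensor wire
    SOFTWARE = auto()    # e.g. control-loop bug, planner error
    UNEXPECTED = auto()  # scenario outside the design envelope

def default_response(fault: FaultClass) -> str:
    """Maps each fault class to a conservative default reaction."""
    responses = {
        FaultClass.HARDWARE: "disable affected subsystem, degrade gracefully",
        FaultClass.SOFTWARE: "halt motion, restart offending module",
        FaultClass.UNEXPECTED: "stop and request human assistance",
    }
    return responses[fault]

print(default_response(FaultClass.UNEXPECTED))
```

The point of the enum is that the dispatch table is exhaustive: adding a new fault class without a response is an immediate, visible gap rather than a silent one.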
Level 1: Proactive Safety Design
The best way to handle a failure is to prevent it from happening in the first place. This is proactive safety.
Hardware-Level Safety
- Compliant Actuators: As discussed in Chapter 11, using actuators with low gear ratios and good force sensing makes the robot inherently "softer" and safer during physical contact.
- Lightweight Materials: Using lightweight alloys and composites reduces the robot's overall inertia, lessening the potential force of any impact.
- Redundancy: Critical components like power systems, communication buses, and key sensors (like the IMU) should be redundant.
- Emergency Stops (E-Stops): A large, physical red button that, when pressed, cuts power to the robot's motors at the hardware level, bypassing all software.
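Redundancy is usually paired with a voting scheme: with three independent copies of a sensor, taking the median reading masks any single faulty unit. A minimal sketch (the triple-IMU setup and the specific readings are illustrative assumptions):

```python
def vote_median(readings):
    """Returns the median of three redundant sensor readings.

    With triple redundancy, the median is unaffected by any single
    sensor failing high or low -- no fault detection logic needed.
    """
    low, mid, high = sorted(readings)
    return mid

# One IMU returns garbage (e.g. a broken wire); the median masks it.
pitch_estimates = [0.021, 0.019, 97.3]  # rad; third sensor is faulty
print(vote_median(pitch_estimates))  # -> 0.021
```

Median voting degrades safely only up to one fault; detecting *which* sensor disagrees (so it can be flagged for repair) requires an additional cross-check against the other two.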
Software-Level Safety
- Watchdogs: A "watchdog timer" is a simple, independent process that monitors the "heartbeat" of the main control loop. If the main loop freezes (due to a software bug), it stops sending signals to the watchdog. After a short timeout, the watchdog forces the robot into a safe shutdown state.
- Sanity Checks: These are assertions and checks embedded throughout the code. They validate that sensor values and commands are within a physically plausible range. For example, if a bug causes the planner to command a joint to move at 1000 radians/second, a sanity check should catch this impossible command and halt the system.
- Virtual Fences (Geofencing): A virtual boundary defined in the robot's world model that it is forbidden to cross, preventing it from entering unsafe areas.
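A virtual fence check is simple enough to sit directly between the planner and the motor commands. A minimal sketch, assuming an axis-aligned rectangular fence in the robot's world frame (the `clamp_waypoint` helper and workspace dimensions are illustrative):

```python
def inside_fence(x, y, fence):
    """Checks whether (x, y) lies inside an axis-aligned virtual fence.

    `fence` is (x_min, y_min, x_max, y_max) in the world frame.
    """
    x_min, y_min, x_max, y_max = fence
    return x_min <= x <= x_max and y_min <= y <= y_max

def clamp_waypoint(x, y, fence):
    """Rejects any planner waypoint that would cross the fence."""
    if not inside_fence(x, y, fence):
        print(f"GEOFENCE VIOLATION: waypoint ({x}, {y}) rejected.")
        return False
    return True

WORKSPACE = (0.0, 0.0, 5.0, 4.0)  # a 5 m x 4 m permitted area
clamp_waypoint(2.0, 1.5, WORKSPACE)  # accepted
clamp_waypoint(6.2, 1.5, WORKSPACE)  # rejected: outside the fence
```

In practice the fence test runs on every planned waypoint, not just the goal, so a path that merely passes through a forbidden zone is also caught.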
Level 2: Reactive Safety Protocols
When a failure does occur, the robot must react in a way that minimizes harm.
- Safe Fall Strategies: For an unstable bipedal robot, falling is inevitable. Instead of falling like a rigid statue, the robot should have a pre-programmed "fall" behavior. This involves quickly folding its limbs and curling up to absorb the impact, protecting its most valuable components (like its head) and presenting a smaller, smoother profile to whatever it might be falling on.
- Graceful Degradation: If a non-critical sensor fails (e.g., one of its many cameras), the robot shouldn't just shut down. A safe system will detect the failure and switch to a lower-performance mode that relies on its remaining sensors, perhaps moving more slowly or disabling manipulation tasks until the sensor is fixed.
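Graceful degradation can be expressed as a mode-selection function over sensor health. A minimal sketch (the two-sensor setup and mode names are illustrative assumptions):

```python
def select_mode(camera_ok, lidar_ok):
    """Picks an operating mode based on which sensors are healthy.

    Degraded modes trade performance for safety instead of shutting down.
    """
    if camera_ok and lidar_ok:
        return "FULL: normal speed, manipulation enabled"
    if lidar_ok:
        return "DEGRADED: slow speed, manipulation disabled"
    return "SAFE_STOP: insufficient perception, halt and wait for service"

print(select_mode(True, True))
print(select_mode(False, True))   # camera failed -> degraded mode
print(select_mode(False, False))  # both failed -> safe stop
```

The key design choice is that the fallback ordering is decided offline, during safety analysis, rather than improvised by the planner at runtime.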
Level 3: Formal Analysis
To ensure all possibilities are considered, robotics engineers use structured methods to analyze their systems.
- FMEA (Failure Mode and Effects Analysis): This is a bottom-up, preventative tool. Engineers create a spreadsheet of every component in the system (e.g., "ankle motor encoder"). They then brainstorm all the ways it could fail (e.g., "wire breaks," "signal becomes noisy") and trace the potential effects on the entire system. Each failure is ranked by its Severity, Occurrence, and Detectability, which gives a Risk Priority Number (RPN) to guide mitigation efforts.
- Hazard Analysis: This is a top-down approach. Engineers identify system-level hazards (e.g., "Robot swings arm uncontrollably") and work backward to identify all the potential root causes, both in hardware and software.
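The RPN arithmetic from an FMEA spreadsheet is just the product of the three scores. A minimal sketch (the 1-10 scoring scale is the common convention; the example scores for the encoder failure are illustrative):

```python
def risk_priority_number(severity, occurrence, detectability):
    """Computes the FMEA Risk Priority Number: RPN = S x O x D.

    Each factor is conventionally scored 1-10. For detectability,
    a HIGHER score means the failure is HARDER to detect, so a
    higher RPN always means the failure mode deserves attention sooner.
    """
    for score in (severity, occurrence, detectability):
        if not 1 <= score <= 10:
            raise ValueError("FMEA scores must be in the range 1-10")
    return severity * occurrence * detectability

# "Ankle encoder wire breaks": severe (8), rare (2), hard to detect (6).
print(risk_priority_number(8, 2, 6))  # -> 96
```

Teams then sort the spreadsheet by RPN and work down the list, adding mitigations (redundancy, sanity checks, inspections) until each failure mode's residual risk is acceptable.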
Code Example: Sanity Check and Watchdog Logic
This conceptual code shows how two core software safety patterns might be implemented.
Python

import time

# --- Sanity Check Example ---
MAX_JOINT_VELOCITY = 10.0  # rad/s

def apply_joint_command(velocity):
    """Applies a velocity command to a joint after a sanity check."""
    # This is the sanity check.
    if abs(velocity) > MAX_JOINT_VELOCITY:
        print(f"UNSAFE COMMAND: Velocity {velocity:.2f} exceeds max {MAX_JOINT_VELOCITY}. Ignoring command.")
        # In a real system, this would also trigger a high-level fault.
        return False
    print(f"SAFE COMMAND: Applying velocity {velocity:.2f}.")
    # Code to send command to motor controller would go here.
    return True

# --- Watchdog Example ---
last_heartbeat_time = time.time()
WATCHDOG_TIMEOUT = 0.1  # seconds. Main loop must "pet" the watchdog faster than this.

def pet_watchdog():
    """The main loop calls this function to show it's still alive."""
    global last_heartbeat_time
    last_heartbeat_time = time.time()

def check_watchdog():
    """A separate, high-priority thread would run this function."""
    if time.time() - last_heartbeat_time > WATCHDOG_TIMEOUT:
        # The main control loop has frozen!
        print("WATCHDOG TIMEOUT: Main control loop is unresponsive! Triggering E-Stop.")
        # Code to cut motor power would go here.
        return False
    return True

# --- Simulation ---
print("--- Sanity Check Demo ---")
apply_joint_command(5.0)    # Safe
apply_joint_command(-15.0)  # Unsafe

print("\n--- Watchdog Demo ---")
print("Simulating a healthy loop...")
pet_watchdog()
time.sleep(0.05)
if check_watchdog():
    print("Watchdog check passed.")

print("\nSimulating a frozen loop...")
# We forget to call pet_watchdog() here.
time.sleep(0.2)
if not check_watchdog():
    print("Watchdog correctly identified the frozen loop.")