9. Foundation Models in Robotics

Giving Robots "Common Sense"

One of the greatest challenges in robotics is bridging the gap between specific, low-level actions and abstract, high-level human goals. How does a robot translate the command "clean up the desk" into a series of concrete motions? This requires a degree of "common sense" reasoning that has historically been missing from robotic systems.

Foundation Models are a new class of AI models that are changing the game. These are massive neural networks trained on internet-scale datasets (e.g., billions of images and trillions of words). This vast training allows them to learn rich, general-purpose representations of the world that can be adapted to many different tasks.

In robotics, these models are not used for low-level control (like balancing). Instead, they are used at the very top of the planning stack to provide the semantic reasoning and "common sense" that was previously lacking.

Vision-Language Models (VLMs): Understanding "What"

A Vision-Language Model (VLM), such as OpenAI's CLIP or Google's PaLI, is trained on a massive dataset of image-text pairs. As a result, it learns to connect visual information with natural language concepts.

For a robot, this is a superpower. A traditional vision system can only detect the object categories it was explicitly trained on (e.g., 'cup', 'bottle'). A VLM allows for open-vocabulary scene understanding.

You can give the robot a command like:

  • "Pick up the blue mug."
  • "Find the can of sparkling water next to the microwave."
  • "Hand me the largest red block."

The VLM can understand these free-form descriptions and locate the corresponding objects in its camera feed, even if it has never seen that specific object before.
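To make this concrete, here is a minimal sketch of how open-vocabulary grounding works under the hood: a CLIP-style VLM embeds the text query and each candidate image region into the same vector space, and the robot picks the region most similar to the query. The tiny three-dimensional embeddings below are invented for illustration (real VLM embeddings have hundreds of dimensions), and the region names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hand-made stand-ins for a VLM's image-region embeddings.
region_embeddings = {
    "blue_mug":  [0.9, 0.1, 0.2],
    "red_block": [0.1, 0.8, 0.3],
    "water_can": [0.2, 0.3, 0.9],
}

# Stand-in for the VLM's text embedding of "the blue mug".
text_embedding = [0.85, 0.15, 0.25]

# Open-vocabulary grounding: the best match is whichever region's
# embedding is closest to the query's embedding.
best = max(
    region_embeddings,
    key=lambda name: cosine_similarity(region_embeddings[name], text_embedding),
)
print(f"Best match for the query: {best}")
```

Because the matching happens in embedding space rather than against a fixed label set, the same mechanism works for any noun phrase the text encoder can represent.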

Large Language Models (LLMs): Understanding "How"

A Large Language Model (LLM), such as those in the GPT family, is an expert in sequential reasoning, trained on a vast corpus of human text. In robotics, LLMs are used as high-level task planners. They act as the "common sense brain" that decomposes abstract goals into concrete steps.

If a human gives the command, "I'm thirsty," an LLM can reason about the world and infer a plausible plan:

  1. Search for a bottle.
  2. If bottle is found, pick it up.
  3. Bring the bottle to the human.

This list of sub-tasks is then passed to the robot's mid-level task executor, such as a Behavior Tree, which knows how to perform each of those individual actions.
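The three-step plan above can be sketched as code. This is a toy stand-in for a real mid-level executor (a production system would use a Behavior Tree or similar); the skill functions are hypothetical primitives invented for illustration.

```python
# Hypothetical primitive skills the robot already knows how to perform.
def search_for(obj):
    print(f"Scanning the room for a {obj}...")
    return True  # pretend the object was found

def pick_up(obj):
    print(f"Picking up the {obj}.")

def bring_to(obj, target):
    print(f"Bringing the {obj} to the {target}.")

def execute_plan():
    """Run the three-step plan from the text, with the conditional
    in step 2 made explicit."""
    # 1. Search for a bottle.
    if search_for("bottle"):
        # 2. If the bottle is found, pick it up.
        pick_up("bottle")
        # 3. Bring the bottle to the human.
        bring_to("bottle", "human")
        return "done"
    return "failed"

result = execute_plan()
```

The key point is the division of labor: the LLM decides *which* sub-tasks to run and in what order, while each primitive skill hides the low-level perception and control needed to carry it out.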

The New Architecture: VLM + LLM

The most powerful approach combines these two models.

  1. The robot looks at a scene. The VLM processes the camera image and outputs a textual description: "There is a red can on the table and a blue bottle on the counter."
  2. This text, along with a human's command, is fed into the LLM. For example: World State: "A red can is on the table..." Human Command: "Get me the red can."
  3. The LLM acts as the reasoning engine. It combines the world state and the command to generate a high-level plan: [go_to_table, pick_up_red_can, bring_to_human].
  4. This plan is then executed by the robot's lower-level control systems.

This architecture allows for unprecedented flexibility and intelligence, enabling robots to respond to novel commands and reason about complex, unstructured environments.
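In practice, step 2 of the pipeline is just string assembly: the VLM's scene description and the human's command are composed into a single text prompt for the LLM. Here is a minimal sketch of that glue code; the exact prompt wording is an assumption, not a fixed API.

```python
def build_planner_prompt(world_state, command):
    """Compose the text prompt the LLM receives: the VLM's scene
    description plus the human's command, with an instruction to
    answer as a list of primitive actions."""
    return (
        "You are a robot task planner. Reply only with a list of "
        "primitive actions.\n"
        f"World State: {world_state}\n"
        f"Human Command: {command}\n"
        "Plan:"
    )

prompt = build_planner_prompt(
    "A red can is on the table and a blue bottle is on the counter.",
    "Get me the red can.",
)
print(prompt)
```

In a real system this prompt would be sent to an LLM API, and the reply would be parsed into the structured plan shown in step 3.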

Code Example: LLM for Task Decomposition

We can't run a real LLM here, but we can simulate how a developer would interact with one. You send a text prompt and get back a structured plan.

def call_simulated_llm_planner(human_command, world_description):
    """
    Simulates making a call to an LLM to get a task plan.
    In a real system, this would be an API call to a service like
    OpenAI or Google.
    """
    print("--- Calling Simulated LLM ---")
    print(f"Human Command: '{human_command}'")
    print(f"World State: '{world_description}'")

    # The magic of the LLM is its ability to generate this structured
    # output from the unstructured text prompt.
    if "tidy up" in human_command and "soda can" in world_description:
        plan = [
            {"action": "find", "object": "soda_can"},
            {"action": "go_to", "object": "soda_can"},
            {"action": "pick_up", "object": "soda_can"},
            {"action": "find", "object": "trash_bin"},
            {"action": "go_to", "object": "trash_bin"},
            {"action": "drop_in", "object": "trash_bin"},
        ]
        print("LLM Response: Generated plan to throw away the can.")
        return plan
    else:
        print("LLM Response: Could not determine a valid plan.")
        return []

# --- Simulation ---
# 1. A VLM first scans the room and outputs a text description.
world_state = "A person is at the desk. A soda can is on the floor. A trash bin is in the corner."

# 2. The human gives a high-level command.
command = "Could you please tidy up the room?"

# 3. We call the LLM to get a plan.
task_plan = call_simulated_llm_planner(command, world_state)

# 4. The robot's task executor would now run this plan.
print("\n--- Robot Executing Plan ---")
if task_plan:
    for i, step in enumerate(task_plan):
        print(f"Step {i + 1}: Executing {step['action']} on {step['object']}")
else:
    print("No plan to execute.")