Research into how artificial agents can make decisions has evolved rapidly through advances in deep reinforcement learning. Compared to generative ML models like GPT-3 and Imagen, artificial agents can directly influence their environment through actions, such as moving a robot arm based on camera inputs or clicking a button in a web browser. While artificial agents have the potential to be increasingly helpful to people, current methods are held back by the need to receive detailed feedback in the form of frequently provided rewards to learn successful strategies. For example, despite large computational budgets, even powerful programs such as AlphaGo are limited to a few hundred moves until receiving their next reward.

In contrast, complex tasks like making a meal require decision making at all levels, from planning the menu, navigating to the store to pick up groceries, and following the recipe in the kitchen to properly executing the fine motor skills needed at each step along the way based on high-dimensional sensory inputs. Hierarchical reinforcement learning (HRL) promises to automatically break down such complex tasks into manageable subgoals, enabling artificial agents to solve tasks more autonomously from fewer rewards, also known as sparse rewards. However, research progress on HRL has proven to be challenging; current methods rely on manually specified goal spaces or subtasks, and no general solution exists.

To spur progress on this research challenge and in collaboration with the University of California, Berkeley, we present the Director agent, which learns practical, general, and interpretable hierarchical behaviors from raw pixels. Director trains a manager policy to propose subgoals within the latent space of a learned world model and trains a worker policy to achieve these goals. Despite operating on latent representations, we can decode Director’s internal subgoals into images to inspect and interpret its decisions. We evaluate Director across several benchmarks, showing that it learns diverse hierarchical strategies and enables solving tasks with very sparse rewards where previous approaches fail, such as exploring 3D mazes with quadruped robots directly from first-person pixel inputs.

Director learns to solve complex long-horizon tasks by automatically breaking them down into subgoals. Each panel shows the environment interaction on the left and the decoded internal goals on the right.

How Director Works
Director learns a world model from pixels that enables efficient planning in a latent space. The world model maps images to model states and then predicts future model states given potential actions. From predicted trajectories of model states, Director optimizes two policies: The manager chooses a new goal every fixed number of steps, and the worker learns to achieve the goals through low-level actions. However, choosing goals directly in the high-dimensional continuous representation space of the world model would be a challenging control problem for the manager. Instead, we learn a goal autoencoder to compress the model states into smaller discrete codes. The manager then selects discrete codes and the goal autoencoder turns them into model states before passing them as goals to the worker.

Left: The goal autoencoder (blue) compresses the world model (green) state (st) into discrete codes (z). Right: The manager policy (orange) selects a code that the goal decoder (blue) turns into a feature space goal (g). The worker policy (red) learns to achieve the goal from future trajectories (s1, …, s4) predicted by the world model.

All components of Director are optimized concurrently, so the manager learns to select goals that are achievable by the worker. The manager learns to select goals to maximize both the task reward and an exploration bonus, leading the agent to explore and steer towards remote parts of the environment. We found that preferring model states where the goal autoencoder incurs high prediction error is a simple and effective exploration bonus. Unlike prior methods, such as Feudal Networks, our worker receives no task reward and learns purely from maximizing the feature space similarity between the current model state and the goal. This means the worker has no knowledge of the task and instead concentrates all its capacity on achieving goals.

Benchmark Results
Whereas prior work in HRL often resorted to custom evaluation protocols — such as assuming diverse practice goals, access to the agents’ global position on a 2D map, or ground-truth distance rewards — Director operates in the end-to-end RL setting. To test the ability to explore and solve long-horizon tasks, we propose the challenging Egocentric Ant Maze benchmark. This challenging suite of tasks requires finding and reaching goals in 3D mazes by controlling the joints of a quadruped robot, given only proprioceptive and first-person camera inputs. The sparse reward is given when the robot reaches the goal, so the agents have to autonomously explore in the absence of task rewards throughout most of their learning.

The Egocentric Ant Maze benchmark measures the ability of agents to explore in a temporally-abstract manner to find the sparse reward at the end of the maze.

We evaluate Director against two state-of-the-art algorithms that are also based on world models: Plan2Explore, which maximizes both task reward and an exploration bonus based on ensemble disagreement, and Dreamer, which simply maximizes the task reward. Both baselines learn non-hierarchical policies from imagined trajectories of the world model. We find that Plan2Explore results in noisy movements that flip the robot onto its back, preventing it from reaching the goal. Dreamer reaches the goal in the smallest maze but fails to explore the larger mazes. In these larger mazes, Director is the only method to find and reliably reach the goal.

To study the ability of agents to discover very sparse rewards in isolation and separately from the challenge of representation learning of 3D environments, we propose the Visual Pin Pad suite. In these tasks, the agent controls a black square, moving it around to step on differently colored pads. At the bottom of the screen, the history of previously activated pads is shown, removing the need for long-term memory. The task is to discover the correct sequence for activating all the pads, at which point the agent receives the sparse reward. Again, Director outperforms previous methods by a large margin.

The Visual Pin Pad benchmark allows researchers to evaluate agents under very sparse rewards and without confounding challenges such as perceiving 3D scenes or long-term memory.

In addition to solving tasks with sparse rewards, we study Director’s performance on a wide range of tasks common in the literature that typically require no long-term exploration. Our experiment includes 12 tasks that cover Atari games, Control Suite tasks, DMLab maze environments, and the research platform Crafter. We find that Director succeeds across all these tasks with the same hyperparameters, demonstrating the robustness of the hierarchy learning process. Additionally, providing the task reward to the worker enables Director to learn precise movements for the task, fully matching or exceeding the performance of the state-of-the-art Dreamer algorithm.

Director solves a wide range of standard tasks with dense rewards with the same hyperparameters, demonstrating the robustness of the hierarchy learning process.

Goal Visualizations
While Director uses latent model states as goals, the learned world model allows us to decode these goals into images for human interpretation. We visualize the internal goals of Director for multiple environments to gain insights into its decision making and find that Director learns diverse strategies for breaking down long-horizon tasks. For example, on the Walker and Humanoid tasks, the manager requests a forward leaning pose and shifting floor patterns, with the worker filling in the details of how the legs need to move. In the Egocentric Ant Maze, the manager steers the ant robot by requesting a sequence of different wall colors. In the 2D research platform Crafter, the manager requests resource collection and tools via the inventory display at the bottom of the screen, and in DMLab mazes, the manager encourages the worker via the teleport animation that occurs right after collecting the desired object.

Left: In Egocentric Ant Maze XL, the manager directs the worker through the maze by targeting walls of different colors. Right: In Visual Pin Pad Six, the manager specifies subgoals via the history display at the bottom and by highlighting different pads.
Left: In Walker, the manager requests a forward leaning pose with both feet off the ground and a shifting floor pattern, with the worker filling in the details of leg movement. Right: In the challenging Humanoid task, Director learns to stand up and walk reliably from pixels and without early episode terminations.
Left: In Crafter, the manager requests resource collection via the inventory display at the bottom of the screen. Right: In DMLab Goals Small, the manager requests the teleport animation that occurs when receiving a reward as a way to communicate the task to the worker.

Future Directions
We see Director as a step forward in HRL research and are preparing its code to be released in the future. Director is a practical, interpretable, and generally applicable algorithm that provides an effective starting point for the future development of hierarchical artificial agents by the research community, such as allowing goals to only correspond to subsets of the full representation vectors, dynamically learning the duration of the goals, and building hierarchical agents with three or more levels of temporal abstraction. We are optimistic that future algorithmic advances in HRL will unlock new levels of performance and autonomy of intelligent agents.