Computational Rationalization: The Inverse Equilibrium Problem

Kevin Waugh, Brian D. Ziebart, J. Andrew Bagnell
Carnegie Mellon University

In submission to the 28th International Conference on
Machine Learning (ICML), 2011

Link to Paper (arXiv)

Abstract: Modeling the behavior of imperfect agents from a small number of observations is a difficult, but important task. In the single-agent decision-theoretic setting, inverse optimal control has been successfully employed. It assumes that observed behavior is an approximately optimal solution to an unknown decision problem, and learns the problem’s parameters that best explain the examples. The inferred parameters can be used to accurately predict future behavior, describe the agent’s preferences, or imitate the agent’s behavior in similar unobserved situations.

In this work, we consider similar tasks in competitive and cooperative multi-agent domains. Here, unlike single-agent settings, a player cannot myopically maximize its reward — it must speculate on how the other agents may act to influence the game’s outcome. Employing the game-theoretic notion of regret and the principle of maximum entropy, we introduce a technique for predicting and generalizing behavior, as well as recovering a reward function in these domains.

A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning

Stéphane Ross, Geoffrey J. Gordon, J. Andrew Bagnell
Carnegie Mellon University

To Appear in Proceedings of the 14th International Conference on
Artificial Intelligence and Statistics (AISTATS), 2011

Link to Paper

Abstract: Sequential prediction problems such as imitation learning, where
future observations depend on previous predictions (actions), violate the
common i.i.d. assumptions made in statistical learning. This leads to poor
performance in theory and often in practice. Some recent approaches
provide stronger guarantees in this setting, but remain somewhat
unsatisfactory as they train either non-stationary or stochastic policies
and require a large number of iterations. In this paper, we propose a new
iterative algorithm, which trains a stationary deterministic policy, that
can be seen as a no regret algorithm in an online learning setting. We
show that any such no regret algorithm, combined with additional reduction
assumptions, must find a policy with good performance under the
distribution of observations it induces in such sequential settings. We
demonstrate that this new approach outperforms previous approaches on two challenging imitation learning problems and a benchmark sequence labeling problem.
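The iterative scheme described in the abstract can be sketched roughly as follows. This is a minimal illustration of the data-aggregation loop only; the function names and the toy environment interface are assumptions for the sketch, not the paper's API:

```python
def dagger(expert_action, train_classifier, env_step, initial_obs,
           n_iters=5, horizon=100):
    """Iteratively roll out the current policy, label every visited
    observation with the expert's action, aggregate the data, and
    retrain a single stationary deterministic policy on the aggregate."""
    dataset = []              # aggregated (observation, expert action) pairs
    policy = expert_action    # the first rollout follows the expert
    for _ in range(n_iters):
        obs = initial_obs
        for _ in range(horizon):
            dataset.append((obs, expert_action(obs)))  # expert labels the state
            obs = env_step(obs, policy(obs))           # learner's policy drives
        policy = train_classifier(dataset)             # plain supervised learning
    return policy
```

Because the learner's own policy generates the visited states, the supervised learner is trained on the distribution of observations it will actually induce, which is the source of the no-regret guarantee.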

Maximum Causal Entropy Correlated Equilibria for Markov Games
Brian D. Ziebart, J. Andrew Bagnell, Anind K. Dey
Carnegie Mellon University
To appear at International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2011).

Link to Paper

Motivated by a machine learning perspective, that game-theoretic equilibria constraints should serve as guidelines for predicting agents’ strategies, we introduce maximum causal entropy correlated equilibria (MCECE), a novel solution concept for general-sum Markov games. In line with this perspective, an MCECE strategy profile is a uniquely-defined joint probability distribution over actions for each game state that minimizes the worst-case prediction of agents’ actions under log-loss. Equivalently, it maximizes the worst-case growth rate for gambling on the sequences of agents’ joint actions under uniform odds. We present a convex optimization technique for obtaining MCECE strategy profiles that resembles value iteration in finite-horizon games. We assess the predictive benefits of our approach by predicting the strategies generated by previously proposed correlated equilibria solution concepts, and compare against those previous approaches on that same prediction task.

3-D Scene Analysis via Sequenced Predictions over Points and Regions
Xuehan Xiong, Daniel Munoz, J. Andrew Bagnell, Martial Hebert

To appear: ICRA 2011.
Preprint (pdf)

We address the problem of understanding scenes from 3-D laser scans via per-point assignment of semantic labels. In order to mitigate the difficulties of using a graphical model for modeling the contextual relationships among the 3-D points, we instead propose a multi-stage inference procedure to capture these relationships. More specifically, we train this procedure to use point cloud statistics and learn relational information (e.g., tree-trunks are below vegetation) over fine (point-wise) and coarse (region-wise) scales. We evaluate our approach on three different datasets obtained from different sensors, and demonstrate improved performance.
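A multi-stage inference procedure of this flavor can be sketched minimally: each stage re-predicts a point's label from its raw features plus the predictions its neighbors received in the previous stage. All names and the interface here are illustrative stand-ins, not the paper's implementation:

```python
def sequenced_inference(features, neighbors, stages):
    """Apply a sequence of trained predictors in order. Stage k labels
    every point from its own features plus the stage-(k-1) predictions
    of its neighbors, so contextual relations propagate without a
    global graphical model."""
    n = len(features)
    preds = [None] * n    # no contextual predictions exist before stage 1
    for stage in stages:
        context = [[preds[j] for j in neighbors[i]] for i in range(n)]
        preds = [stage(features[i], context[i]) for i in range(n)]
    return preds
```

In the paper's setting the stages are trained first-to-last on held-out predictions, so each later stage learns to correct the characteristic errors of the earlier ones.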


Stacked Hierarchical Labeling
Daniel Munoz, J. Andrew Bagnell, Martial Hebert

To appear: ECCV 2010.
Preprint (pdf)

In this work we propose a hierarchical approach for labeling semantic objects and regions in scenes. Our approach is reminiscent of early vision literature in that we use a decomposition of the image in order to encode relational and spatial information. In contrast to much existing work on structured prediction for scene understanding, we bypass a global probabilistic model and instead directly train a hierarchical inference procedure inspired by the message passing mechanics of some approximate inference procedures in graphical models. This approach mitigates both the theoretical and empirical difficulties of learning probabilistic models when exact inference is intractable. In particular, we draw from recent work in machine learning and break the complex inference process into a hierarchical series of simple machine learning subproblems. Each subproblem in the hierarchy is designed to capture the image and contextual statistics in the scene. This hierarchy spans coarse-to-fine regions and explicitly models the mixtures of semantic labels that may be present due to imperfect segmentation. To avoid cascading of errors and overfitting, we train the learning problems in sequence to ensure robustness to likely errors earlier in the inference sequence and leverage the stacking approach developed by Cohen et al.

Modeling Interaction via the Principle of Maximum Causal Entropy
Brian D. Ziebart, J. Andrew Bagnell, Anind K. Dey

To appear: ICML 2010.
Preprint (pdf)

The principle of maximum entropy provides a powerful framework for statistical models of joint, conditional, and marginal distributions. However, there are many important distributions with elements of interaction and feedback where its applicability has not been established. This work presents the principle of maximum causal entropy — an approach based on causally conditioned probabilities that can appropriately model the availability and influence of sequentially revealed side information. Using this principle, we derive Maximum Causal Entropy Influence Diagrams, a new probabilistic graphical framework for modeling decision making in settings with latent information, sequential interaction, and feedback. We describe the theoretical advantages of this model and demonstrate its applicability for statistically framing inverse optimal control and decision prediction tasks.
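The central objects can be written out briefly. In the causally conditioned probability notation this line of work builds on, each action may depend only on the side information revealed up to that time step:

```latex
% Causally conditioned probability of actions A_{1:T} given states S_{1:T}:
% action A_t may depend only on side information revealed through time t.
P(A_{1:T} \,\|\, S_{1:T}) = \prod_{t=1}^{T} P(A_t \mid A_{1:t-1},\, S_{1:t})

% The causal entropy that the principle maximizes subject to
% feature-matching constraints:
H(A_{1:T} \,\|\, S_{1:T})
  = \mathbb{E}\!\left[-\log P(A_{1:T} \,\|\, S_{1:T})\right]
  = \sum_{t=1}^{T} H(A_t \mid A_{1:t-1},\, S_{1:t})
```

Contrast this with ordinary conditional entropy $H(A_{1:T} \mid S_{1:T})$, which would condition every action on the entire state sequence, including information not yet revealed.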

An Optimization Approach to Rough Terrain Locomotion


We present a novel approach to legged locomotion over rough terrain that is thoroughly rooted in optimization. This approach relies on a hierarchy of fast, anytime algorithms to plan a set of footholds, along with the dynamic body motions required to execute them. Components within the planning framework coordinate to exchange plans, cost-to-go estimates, and “certificates” that ensure the output of an abstract high-level planner can be realized by deeper layers of the hierarchy. The burden of careful engineering of cost functions to achieve desired performance is substantially mitigated by a simple inverse optimal control technique. Robustness is achieved by real-time re-planning of the full trajectory, augmented by reflexes and feedback control. We demonstrate the successful application of our approach in guiding the LittleDog quadruped robot over a variety of rough terrains.


Imitation learning in Mario Bros from observed image features and actions taken by a near-optimal planner.

Efficient Reductions for Imitation Learning, by Stephane Ross and J. A. Bagnell, to appear in Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.

We present two new algorithms for learning a policy/controller from expert demonstrations that have better performance guarantees than the typical supervised learning approach. Supervised learning approaches train the policy only under the situations encountered by the expert policy. This leads to poor performance guarantees, as the learner encounters new situations it hasn’t been trained on whenever it makes mistakes. The approaches we present alleviate this problem by training the policy over several iterations, gradually allowing the learner to influence more and more of the situations encountered during training. This allows the learner to learn how to recover from mistakes made by previous policies. We show that this leads to improved performance guarantees on large classes of problems. All proofs are available in the Supplementary Material.
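One of the paper's algorithms, SMILe, executes a stochastic mixture of the expert and the policies learned so far, mixing in each new policy with a small geometric weight. A rough sketch of such a mixture follows; the interface and details are illustrative, not the paper's exact construction:

```python
import random

def smile_mixture(expert, learned, alpha, rng=random):
    """Stochastic policy that follows the expert with probability
    (1 - alpha)**n, where n = len(learned), and otherwise follows one
    of the n learned policies, geometrically weighted toward the
    earlier ones. The weights sum to one."""
    n = len(learned)

    def policy(obs):
        if rng.random() < (1 - alpha) ** n:
            return expert(obs)
        weights = [alpha * (1 - alpha) ** i for i in range(n)]
        r = rng.random() * sum(weights)
        for w, pi in zip(weights, learned):
            r -= w
            if r <= 0:
                return pi(obs)
        return learned[-1](obs)   # guard against floating-point residue

    return policy
```

As more policies are learned, the expert's probability (1 - alpha)**n shrinks, which is the sense in which the learner slowly takes over the state distribution it is trained on.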

A short video of our approach’s performance at Mario Bros can be seen by clicking on the image to the left. The computer learns to play the game from image features and corresponding actions taken by a near-optimal planner having full access to the game state.

Policy Gradient Methods, by Jan Peters and J. A. Bagnell, to appear in the Springer Encyclopedia of Machine Learning

Online modeling of experience base for potential environmental hazards


Anytime Online Novelty Detection for Vehicle Safeguarding, by Boris Sofman, J. A. Bagnell, and Anthony Stentz. Currently under review for ICRA 2010.

We present an online novelty detection algorithm that allows a mobile robot to identify stimuli that are outside its experience base, avoiding potentially hazardous situations. This algorithm addresses many of the limitations of existing novelty detection approaches, including sensitivity to high-dimensional and noisy feature spaces and the inability to efficiently update their models online. Additionally, this algorithm has anytime properties that make it highly suitable for mobile robot use. The included appendix provides a proof that the run-time of our algorithm for maintaining previously seen examples is constant-competitive with respect to any other algorithm.
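For intuition only, here is a toy online detector with the same observe-then-update flavor: stimuli far from everything in the experience base are flagged as novel and then added, so similar future stimuli are no longer novel. This is not the paper's algorithm (which handles high-dimensional noisy features with anytime guarantees); the names and the Euclidean threshold are invented for the sketch:

```python
def make_novelty_detector(threshold):
    """Toy online novelty detector: keep an experience base of feature
    vectors; a stimulus is novel when its nearest stored example lies
    farther than `threshold`, and novel stimuli join the base."""
    base = []   # previously seen feature vectors

    def observe(x):
        nearest = min(
            (sum((a - b) ** 2 for a, b in zip(x, e)) ** 0.5 for e in base),
            default=float("inf"),   # an empty base makes everything novel
        )
        novel = nearest > threshold
        if novel:
            base.append(x)          # similar future stimuli are no longer novel
        return novel

    return observe
```

The naive nearest-neighbor scan here grows linearly in the number of stored examples; a key contribution of the paper is keeping the maintained example set efficient enough for real-time mobile robot use.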

A short video of this system’s performance can be seen by clicking on the image to the left. The novelty model is initialized to be empty and as the environment is perceived by the robot, the model is adjusted online so that future similar stimuli are no longer novel. Novel examples are shown with a red shade.

UPI rough terrain navigation system.


Learning Rough-Terrain Autonomous Navigation, by J. A. Bagnell, D. M. Bradley, D. Silver, B. Sofman, A. Stentz. Currently under review for Robotics and Automation Magazine.

Autonomous navigation by a mobile robot through natural, unstructured terrain is one of the premier challenges in field robotics. The DARPA UPI program was tasked with advancing the state of the art in robust autonomous performance through challenging and widely varying environments. In order to accomplish this goal, machine learning techniques were heavily utilized to provide robust and adaptive performance, while simultaneously reducing the required development and deployment time. This paper describes the autonomous system, Crusher, developed for the UPI program, and the learning approaches that aided in its successful performance.

A short video of the UPI system’s performance leveraging the learning techniques discussed can be seen by clicking on the image to the left.

Learning from Demonstration for Autonomous Navigation in Complex Unstructured Terrain by David Silver, J. A. Bagnell, and Anthony Stentz. International Journal of Robotics Research.

Rough terrain autonomous navigation continues to pose a challenge to the robotics community. Robust navigation by a mobile robot depends not only on the individual performance of perception and planning systems, but on how well these systems are coupled. When traversing complex unstructured terrain, this coupling (in the form of a cost function) has a large impact on robot behavior and performance, necessitating a robust design. This paper explores the application of Learning from Demonstration to this task for the Crusher autonomous navigation platform. Using expert examples of desired navigation behavior, mappings from both online and offline perceptual data to planning costs are learned. Challenges in adapting existing techniques to complex online planning systems and imperfect demonstration are addressed, along with additional practical considerations. The benefits to autonomous performance of this approach are examined, as well as the decrease in necessary designer effort. Experimental results are presented from autonomous traverses through complex natural environments.
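Learning a cost function from expert paths is commonly posed in the maximum-margin-planning family that this line of work draws on: each update raises the cost weights along the planner's currently cheapest path and lowers them along the demonstrated path. A toy subgradient step in that spirit (an illustrative sketch with invented names, not the paper's exact functional-gradient method):

```python
def mmp_style_update(w, feat_expert, plan_cheapest, lr=0.1):
    """One subgradient step on linear cost weights w. feat_expert holds
    the feature counts of the demonstrated path; plan_cheapest(w)
    returns the feature counts of the cheapest path under the current
    weights. The step makes the expert's path relatively cheaper while
    keeping costs nonnegative."""
    feat_plan = plan_cheapest(w)   # feature counts of the planner's choice
    return [max(0.0, wi - lr * (fe - fp))
            for wi, fe, fp in zip(w, feat_expert, feat_plan)]
```

Iterating this until the planner's cheapest path matches the demonstration recovers a cost function under which the expert's behavior is (near-)optimal, replacing hand-tuning of the perception-to-cost coupling.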