Event

PhD defence of Amit Sinha – Planning and learning for agent-state based policies in POMDPs

Tuesday, February 24, 2026, 11:00 to 13:00
McConnell Engineering Building Room 603, 3480 rue University, Montreal, QC, H3A 0E9, CA

Abstract

Several real-world applications in autonomous systems (robotics, autonomous driving, energy grids, etc.) involve a decision maker, or agent, that selects actions based on limited available information. Such applications, where the decision maker does not observe the global state of the system, can be modeled as partially observable Markov decision processes (POMDPs). Although it is possible to obtain a history-dependent policy by treating the entire history of observations as a state, the complexity of such an approach increases exponentially with time. An alternative approach is to use the belief state, which is the posterior distribution of the environment state given the history. Belief states can be updated recursively and provide a dynamic programming decomposition whose complexity scales linearly with time.
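The recursive belief update mentioned above can be illustrated with a minimal sketch of the standard Bayes filter for a finite POMDP. The function name `belief_update`, the nested-list layout of the transition model `T` and observation model `O`, and the two-state numbers are all hypothetical and chosen only for illustration; they are not taken from the thesis.

```python
def belief_update(belief, action, observation, T, O):
    """One step of the recursive Bayes filter for a finite POMDP.

    belief:      list of probabilities over environment states
    T[a][s][s2]: probability of moving from state s to s2 under action a
    O[a][s2][y]: probability of observing y after reaching s2 under action a
    (This indexing convention is an assumption made for the example.)
    """
    n = len(belief)
    # Predict: push the current belief through the transition model.
    predicted = [
        sum(belief[s] * T[action][s][s2] for s in range(n)) for s2 in range(n)
    ]
    # Correct: weight each next state by the likelihood of the observation.
    unnormalized = [O[action][s2][observation] * predicted[s2] for s2 in range(n)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    # Normalize so the new belief is again a probability distribution.
    return [p / total for p in unnormalized]


# Hypothetical model: two states, one action, two observations.
T = [[[0.9, 0.1], [0.2, 0.8]]]
O = [[[0.7, 0.3], [0.4, 0.6]]]
b0 = [0.5, 0.5]
b1 = belief_update(b0, action=0, observation=1, T=T, O=O)
```

Each update touches only the previous belief, the latest action, and the latest observation, which is why the per-step cost does not grow with the length of the history.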

However, it may be challenging to work with the belief state in problems with large state spaces. In practice, it is often more convenient to work with a more general representation, often referred to as an agent state, which is essentially the agent's internal representation of the information available to it for decision making. Since such an agent state may not be a sufficient statistic in the way that a belief state is, the resulting problem has a non-classical information structure. As a result, the standard dynamic programming techniques that are applied to POMDPs with belief states are not applicable to POMDPs with agent states.

We first analyze the finite horizon POMDP setting and consider the use of model information to develop a planning-based approach to optimize for agent-state based policies. We achieve this by introducing a policy search method that guarantees monotonic performance improvements at every step and also guarantees convergence. Based on this policy search method, we develop a simple planning-based policy search algorithm called partially observable conservative policy iteration (POCPI). Although such an algorithm only guarantees convergence to locally optimal solutions, we show empirically that it often converges to the globally optimal solution.

Second, we analyze infinite horizon POMDPs without the use of model information to develop a learning-based approach to optimize for agent-state based policies. We consider the use of Q-learning since it is a popular learning algorithm and has a strong theoretical basis with provable guarantees. One of the noteworthy features of Q-learning is that it yields stationary and deterministic policies. When considering belief states, this is not an issue because the optimal value can be achieved by stationary deterministic policies. For agent-state policies, however, the optimal value may be achieved by a non-stationary deterministic policy. Because a non-stationary deterministic policy is difficult to realize in practice in the infinite horizon case, we propose using periodic policies instead. Periodic policies are not only realizable in practice but also offer some degree of non-stationarity in contrast to stationary policies.

We provide a learning-based algorithm called periodic agent-state based Q-learning (PASQL), which combines the standard Q-learning approach with the idea of periodicity. In addition, since Q-learning only gives us deterministic policies, we investigate the use of regularization with PASQL to obtain stochastic policies. We rigorously prove the convergence of such periodic forms of Q-learning and precisely characterize the solutions. We also show through empirical studies that such periodic policies are capable of outperforming stationary policies.
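As a rough illustration of how periodicity can be combined with tabular Q-learning, the sketch below keeps one Q-table per phase of the period and bootstraps each update from the table of the next phase. This is only one plausible reading of the idea described above, not the thesis's algorithm: the function names, the table layout `Q[phase][agent_state][action]`, the step size, the discount factor, and the absence of exploration and regularization are all assumptions made for the example.

```python
def pasql_style_update(Q, phase, z, a, reward, z_next, alpha=0.1, gamma=0.9):
    """One periodic Q-learning style update (illustrative assumption).

    Q[phase][z][a] holds the value estimate for agent state z and action a
    at the given phase of the period. The bootstrap target uses the table
    of the following phase, wrapping around at the end of the period.
    """
    period = len(Q)
    next_phase = (phase + 1) % period
    target = reward + gamma * max(Q[next_phase][z_next])
    Q[phase][z][a] += alpha * (target - Q[phase][z][a])
    return next_phase


def greedy_action(Q, phase, z):
    """Deterministic periodic policy: act greedily on the current phase's table."""
    values = Q[phase][z]
    return max(range(len(values)), key=lambda a: values[a])


# Hypothetical sizes: period 2, two agent states, two actions.
Q = [[[0.0, 0.0] for _ in range(2)] for _ in range(2)]
phase = 0
phase = pasql_style_update(Q, phase, z=0, a=1, reward=1.0, z_next=1)
phase = pasql_style_update(Q, phase, z=1, a=0, reward=0.0, z_next=0)
best = greedy_action(Q, phase=0, z=0)
```

With a period of 1 the sketch reduces to ordinary stationary Q-learning on agent states, so the period length controls how much non-stationarity the resulting policy can express.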
