Authors: Francesco Belardinelli (1, 2); Borja G. León (1) and Vadim Malvone (3)
        
        
Affiliations:
1: Department of Computing, Imperial College London, London, U.K.
2: Département d'Informatique, Université d'Evry, Evry, France
3: INFRES, Télécom Paris, Paris, France

Keyword(s): Markov Decision Processes, Partial Observability, Extended Partially Observable Decision Process, Non-Markovian Rewards.

Abstract: Markovian models are widely used in reinforcement learning (RL) when the successful completion of a task depends exclusively on the most recent interaction between an autonomous agent and its environment. Unfortunately, real-world instructions are typically complex and are often better described as non-Markovian. In this paper we present an extension method that allows partially observable non-Markovian reward decision processes (PONMRDPs) to be solved via equivalent Markovian models. This potentially enables state-of-the-art Markovian techniques, including RL, to find optimal behaviours for problems best described as PONMRDPs. We provide formal optimality guarantees for our extension method, together with a counterexample illustrating that naive extensions of existing techniques for fully observable environments cannot provide such guarantees.
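To give a rough intuition for the extension idea (the paper's exact construction is not reproduced here), a standard way to make a non-Markovian reward Markovian is to take the product of the decision process with a finite-state machine that tracks task progress. The minimal Python sketch below illustrates this generic product construction on a toy "visit A, then B" task; all names (ExtendedState, step_task, etc.) are hypothetical illustrations, not the paper's API.

    from dataclasses import dataclass

    # Hypothetical sketch: making a non-Markovian reward Markovian by
    # augmenting the state with the state of a finite automaton that
    # summarises the relevant history. This shows the generic product-
    # construction idea only; it is NOT the paper's exact method.

    @dataclass(frozen=True)
    class ExtendedState:
        env_state: str   # original environment state (hidden under partial observability)
        task_state: int  # automaton state summarising the history so far

    # Toy task "visit A, then visit B": non-Markovian, because the reward
    # for reaching B depends on whether A was visited earlier.
    TASK_TRANSITIONS = {
        (0, "A"): 1,  # after seeing A, move to task state 1
        (1, "B"): 2,  # after A, seeing B completes the task
    }
    ACCEPTING = {2}

    def step_task(task_state: int, label: str) -> int:
        """Advance the task automaton on the label of the new state."""
        return TASK_TRANSITIONS.get((task_state, label), task_state)

    def extended_reward(ext: ExtendedState) -> float:
        """Markovian reward on the extended state space: it depends only
        on the current extended state, not on the whole history."""
        return 1.0 if ext.task_state in ACCEPTING else 0.0

    def extended_step(ext: ExtendedState, next_env_state: str, label: str):
        """One transition of the extended (product) process."""
        next_ext = ExtendedState(next_env_state, step_task(ext.task_state, label))
        return next_ext, extended_reward(next_ext)

    # Usage: the trajectory s0 -> A -> B earns reward only once the
    # automaton accepts, yet the reward function itself is Markovian.
    ext = ExtendedState("s0", 0)
    for env_state, label in [("sA", "A"), ("sB", "B")]:
        ext, r = extended_step(ext, env_state, label)
        print(ext, r)

Since the automaton state is a function of observable labels, it can in principle be appended to the agent's observation as well; however, per the abstract's counterexample, naively carrying such constructions over from fully observable settings does not preserve optimality guarantees under partial observability.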