PARTIALLY-CONTROLLED MARKOV DECISION PROCESSES
FOR COLLISION AVOIDANCE SYSTEMS
Mykel J. Kochenderfer and James P. Chryssanthacopoulos
Lincoln Laboratory, Massachusetts Institute of Technology, 244 Wood Street, Lexington, MA, U.S.A.
Keywords:
Markov decision processes, Dynamic programming, Collision avoidance.
Abstract:
Deciding when and how to avoid collision in stochastic environments requires accounting for the likelihood
and relative costs of future sequences of outcomes in response to different sequences of actions. Prior work has
investigated formulating the problem as a Markov decision process, discretizing the state space, and solving
for the optimal strategy using dynamic programming. Experiments have shown that such an approach can be
very effective, but scaling to higher-dimensional problems can be challenging due to the exponential growth of
the discrete state space. This paper presents an approach that can greatly reduce the complexity of computing
the optimal strategy in problems where only some of the dimensions of the problem are controllable. The
approach is demonstrated on an airborne collision avoidance problem where the system must recommend
maneuvers to an imperfect pilot.
1 INTRODUCTION
Manually constructing a robust collision avoidance
system, whether it be for an autonomous or human-
controlled vehicle, is challenging because the future
effects of the system cannot be known exactly. Due
to their safety-critical nature, collision avoidance sys-
tems must maintain a high degree of reliability while
minimizing unnecessary path deviation. Recent work
has investigated formulating the problem of colli-
sion avoidance as a Markov decision process (MDP)
and solving for the optimal strategy using dynamic
programming (DP) (Kochenderfer and Chryssantha-
copoulos, 2010; Kochenderfer et al., 2010a; Temizer
et al., 2010). One limitation of this approach is that
the computation and memory requirements grow ex-
ponentially with the dimensionality of the state space.
Hence, these studies focused on MDP formulations
that capture only a subset of the relevant state vari-
ables at the expense of impaired performance.
This paper presents a new approach for significantly reducing the computation and memory requirements for partially-controlled collision avoidance problems. The approach involves decomposing the problem into two separate subproblems, one controlled and one uncontrolled, that can be solved independently offline using dynamic programming. During execution, the results from the offline computation are combined to determine the approximately optimal action from the current state.
(This work is sponsored by the Federal Aviation Administration under Air Force Contract #FA8721-05-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States Government.)
The approach is demonstrated on an airborne col-
lision avoidance system that recommends vertical ma-
neuvers to an imperfect pilot. Although the pilot may
maneuver horizontally, it is assumed that the collision
avoidance system does not influence the horizontal
motion. The problem is naturally represented using
seven state variables, which is impractical to solve
with a reasonable level of discretization. By carefully
decomposing the problem into two lower-dimensional
problems, a solution can be obtained quickly and
stored in primary physical memory. The performance
of the optimized system is compared in simulation
against the Traffic Alert and Collision Avoidance Sys-
tem (TCAS), which is currently mandated worldwide
on all large transport aircraft (RTCA, 2008).
The next section briefly summarizes related work
on collision avoidance. Section 3 reviews Markov
decision processes. Section 4 describes the solu-
tion method and outlines the required assumptions.
Section 5 applies the method to an airborne colli-
sion avoidance problem. Section 6 evaluates the suc-
cess of the method in simulation and presents results
that show that the method significantly outperforms
TCAS. Section 7 concludes and outlines further work.
2 RELATED WORK
A common technique for collision avoidance in au-
tonomous and semi-autonomous vehicles is to define
conflict zones for each obstacle and then use a deter-
ministic model, such as linear extrapolation, to pre-
dict whether a conflict will occur (Bilimoria, 2000;
Dowek et al., 2001; Chamlou, 2009). If a conflict
is anticipated, the collision avoidance system selects
the maneuver that provides the minimal path devia-
tion while preventing the conflict. Such an approach
requires very little computation and can prevent col-
lision much of the time, but it lacks robustness be-
cause the deterministic model ignores the stochastic
nature of the environment. Although one may mit-
igate collision risk to some extent by artificially en-
larging the conflict zones to accommodate uncertainty
in the future behavior of the vehicles, this approach
frequently results in unnecessary path deviation. The
TCAS collision avoidance logic adopts an approach
along these lines but incorporates a large collection of
hand-crafted, heuristic rules to enhance robustness to
unexpected behavior.
Several other approaches to collision avoidance
can be found in the literature that do not use a prob-
abilistic model of vehicle behavior, including poten-
tial field methods (Khatib and Maitre, 1978; Duong
and Zeghal, 1997; Eby and Kelly, 1999) and rapidly-
exploring random trees (LaValle, 1998; Kuwata
et al., 2008; Saunders et al., 2009). However, avoid-
ing collision with a high degree of reliability while
keeping the rate of path deviation low requires the use
of a probabilistic model that accounts for future state
uncertainty. Several methods have been suggested
that involve using a probabilistic model to estimate
the probability of conflict and to choose the maneu-
ver that keeps the probability of conflict below some
set threshold (Yang and Kuchar, 1997; Carpenter and
Kuchar, 1997; Kochenderfer et al., 2010a). One limi-
tation of these threshold-based approaches is that they
do not model the effects of delaying the avoidance
maneuver. In many cases, it can be beneficial to see
how the encounter develops before committing to a
particular maneuver. The dynamic programming ap-
proach pursued in this work, in contrast, takes into ac-
count every possible future sequence of actions taken
by the collision avoidance system and their outcomes
when making a decision.
3 MARKOV DECISION
PROCESSES
An MDP is defined by a transition function T and cost function C. The probability of transitioning from state s to state s′ after executing action a is given by T(s, a, s′). The immediate cost when executing action a from state s is given by C(s, a). In this paper, the state space S and action space A are assumed to be finite (Puterman, 1994; Bertsekas, 2005).
A policy is a function π that maps states to actions. The expected sum of immediate costs when following a policy π for K steps starting from state s is denoted J^π_K(s) and is often called the cost-to-go function. The optimal solution to an MDP with a receding horizon of K is a policy π_K that minimizes the cost to go from every state.
One way to compute π_K is to first compute J_K, the cost-to-go function for the optimal policy, using a dynamic programming algorithm known as value iteration. The function J_0(s) is set to zero for all states s. If the state space includes terminal states with immediate cost C(s), then J_0(s) = C(s) for those terminal states. The function J_k(s) is computed from J_{k−1} as follows:
$$J_k(s) = \min_a \Big[ C(s,a) + \sum_{s'} T(s,a,s')\, J_{k-1}(s') \Big]. \qquad (1)$$
The iteration continues until horizon K.
The expected cost to go when executing a from s and then continuing with an optimal policy for K − 1 steps is given by
$$J_K(s,a) = C(s,a) + \sum_{s'} T(s,a,s')\, J_{K-1}(s'). \qquad (2)$$
An optimal policy may be obtained directly from J_K(s,a):
$$\pi_K(s) = \arg\min_a J_K(s,a). \qquad (3)$$
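To make the recursion concrete, the following is a minimal sketch of equations (1)–(3) for a generic finite MDP, assuming the model is supplied as arrays T[s, a, s′] and C[s, a]; the array layout and function name are illustrative and not part of the paper.

```python
import numpy as np

def value_iteration(T, C, K):
    """Finite-horizon value iteration per equations (1)-(3), K >= 1.

    T: (S, A, S) array, T[s, a, s2] = Pr(s2 | s, a)
    C: (S, A) array of immediate costs C(s, a)
    Returns J_K(s, a) and the greedy policy pi_K(s).
    """
    J = np.zeros(T.shape[0])        # J_0(s) = 0 for all states
    for _ in range(K):
        Q = C + T @ J               # equation (2): C(s,a) + sum_s' T(s,a,s') J(s')
        J = Q.min(axis=1)           # equation (1): minimize over actions
    return Q, Q.argmin(axis=1)      # equation (3): pi_K(s) = argmin_a J_K(s, a)
```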
If the state space contains continuous variables, which is common for collision avoidance problems, the state space can be discretized using a multi-dimensional grid or simplex scheme (Davies, 1997). The transition function T(x, a, x′) in continuous space can be translated into a discrete transition function T(s, a, s′) using a variety of sampling and interpolation methods (Kochenderfer et al., 2010a). Once the state space, transition function, and cost function have been discretized, J*(s, a) may be computed for each discrete state s and action a. For a continuous state x and action a, J*(x, a) may be approximated using, for example, multilinear interpolation. The best action to execute from continuous state x is simply argmin_a J*(x, a).
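As an illustration of the interpolation step, the sketch below evaluates a tabulated cost-to-go surface at a continuous state using SciPy's multilinear grid interpolator; the grid edges and array shapes are hypothetical.

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

# Hypothetical grid over two continuous state variables.
grid = (np.linspace(-1000.0, 1000.0, 21),    # e.g., relative altitude (ft)
        np.linspace(-2500.0, 2500.0, 21))    # e.g., vertical rate (ft/min)

def best_action(J, x):
    """Return argmin_a J*(x, a), multilinearly interpolating each action's
    cost-to-go table (shape (21, 21, A)) at the continuous state x."""
    costs = [RegularGridInterpolator(grid, J[..., a])([x])[0]
             for a in range(J.shape[-1])]
    return int(np.argmin(costs))
```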
Discretizing the full state space can result in a large number of discrete states, exponential in the number of dimensions, which makes computing J* infeasible for many problems. This “curse of dimensionality” (Bellman, 1961) has led to a variety of different approximation methods (Powell, 2007).
4 PARTIAL CONTROL
This paper explores a new solution technique for
partially-controlled MDPs that is applicable to cer-
tain collision avoidance problems. It may be ap-
plied to interception-seeking or goal-oriented prob-
lems as well by incorporating negative costs. So long
as the problem satisfies a set of assumptions, this so-
lution method will provide a finite-horizon solution.
The approach involves independently solving a con-
trolled subproblem and an uncontrolled subproblem
and combining the results online to identify the ap-
proximately optimal action.
4.1 Assumptions
It is assumed that the state is represented by a set of variables, some controlled and some uncontrolled. The state space of the controlled variables is denoted S_c, and the state space of the uncontrolled variables is denoted S_u. The state of the controlled variables at time t is denoted s_c(t), and the state of the uncontrolled variables at time t is denoted s_u(t). The solution technique may be applied when the following three assumptions hold:
1. The state s_u(t + 1) depends only upon s_u(t). The probability of transitioning from s_u to s′_u is given by T(s_u, s′_u).
2. The episode terminates when s_u ∈ G ⊆ S_u, with immediate cost C(s_c).
3. In nonterminal states, the immediate cost c(t + 1) depends only upon s_c(t) and a(t). If the controlled state is s_c and action a is executed, the immediate cost is denoted C(s_c, a).
Figure 1 shows the influence diagram for this model.
4.2 Controlled Subproblem
Solving the controlled subproblem involves computing the optimal policy for the controlled variables under the assumption that the time until s_u enters G, denoted τ, is known. In an airborne collision avoidance context, τ may be the number of steps until another aircraft comes within 500 ft horizontally. Of course, τ cannot be determined exactly from s_u(t) because it
Figure 1: Influence diagram illustrating partial control in a Markov decision process.
depends upon an event that occurs in the future, but
this will be addressed by the uncontrolled subproblem
(Section 4.3).
The cost to go from s_c given τ is denoted J_τ(s_c). The series J_0, ..., J_K is computed recursively, starting with J_0(s_c) = C(s_c) and iterating as follows:
$$J_k(s_c) = \min_a \Big[ C(s_c,a) + \sum_{s'_c} T(s_c,a,s'_c)\, J_{k-1}(s'_c) \Big]. \qquad (4)$$
The expected cost to go from s_c when executing a for one step and then following the optimal policy is given by
$$J_k(s_c,a) = C(s_c,a) + \sum_{s'_c} T(s_c,a,s'_c)\, J_{k-1}(s'_c). \qquad (5)$$
The K-step expected cost to go when τ > K is denoted J_K̄. It is computed by initializing J_0(s_c) = 0 for all states and iterating equation (4) to horizon K. The series J_0, ..., J_K, J_K̄ is saved in a table in memory, requiring O(K|A||S_c|) entries.
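The controlled-subproblem tables can be produced by a small variation of the value iteration sketch above; the following is a sketch under the same illustrative array conventions, storing every intermediate horizon because the online step needs the full series.

```python
import numpy as np

def solve_controlled(Tc, Cc, Cterm, K):
    """Compute J_0, ..., J_K and J_bar per equations (4)-(5), K >= 1.

    Tc:    (Sc, A, Sc) transitions of the controlled variables
    Cc:    (Sc, A) immediate action costs C(s_c, a)
    Cterm: (Sc,) terminal cost C(s_c) paid when s_u enters G
    """
    A = Cc.shape[1]
    Js = [np.repeat(Cterm[:, None], A, axis=1)]  # J_0(s_c, a) = C(s_c)
    J = Cterm.copy()
    for _ in range(K):
        Q = Cc + Tc @ J            # equation (5)
        Js.append(Q)
        J = Q.min(axis=1)          # equation (4)
    J = np.zeros_like(Cterm)       # J_bar: restart from zero terminal cost,
    for _ in range(K):             # i.e., tau beyond the horizon
        Q = Cc + Tc @ J
        J = Q.min(axis=1)
    return Js, Q                   # Js[k] = J_k(s_c, a); Q = J_bar(s_c, a)
```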
4.3 Uncontrolled Subproblem
Solving the uncontrolled subproblem involves using the probabilistic model of the uncontrolled dynamics to infer a distribution over τ for each uncontrolled state s_u. This distribution is referred to as the entry time distribution because it represents the distribution over the time for s_u to enter G. The probability that s_u enters G in τ steps is denoted D_τ(s_u) and may be computed using dynamic programming. The probability that τ = 0 is given by
$$D_0(s_u) = \begin{cases} 1 & \text{if } s_u \in G, \\ 0 & \text{otherwise.} \end{cases} \qquad (6)$$
The probability that τ = k, for k > 0, is computed from D_{k−1} as follows:
$$D_k(s_u) = \begin{cases} 0 & \text{if } s_u \in G, \\ \sum_{s'_u} T(s_u,s'_u)\, D_{k-1}(s'_u) & \text{otherwise.} \end{cases} \qquad (7)$$
Depending on s_u, there may be some probability that s_u does not enter G within K steps. This probability is denoted D_K̄(s_u) and may be computed from D_0, ..., D_K:
$$D_{\bar K}(s_u) = 1 - \sum_{k=0}^{K} D_k(s_u). \qquad (8)$$
The sequence D_0, ..., D_K, D_K̄ is stored in a table with O(K|S_u|) entries. Multilinear interpolation of the distributions may be used to determine D_τ(x_u) at an arbitrary continuous state x_u.
4.4 Online Solution
After J_0, ..., J_K, J_K̄ and D_0, ..., D_K, D_K̄ have been computed offline, they are used together online to determine the approximately optimal action to execute from the current state. For any discrete state s in the original state space, J_K(s,a) may be computed as follows:
$$J_K(s,a) = D_{\bar K}(s_u)\, J_{\bar K}(s_c,a) + \sum_{k=0}^{K} D_k(s_u)\, J_k(s_c,a), \qquad (9)$$
where s_u is the discrete uncontrolled state and s_c is the discrete controlled state associated with s. Combining the controlled and uncontrolled solutions online in this way requires time linear in the size of the horizon. Multilinear interpolation can be used to estimate J_K(x,a) for an arbitrary state x, and from this the optimal action may be obtained.
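Continuing the sketches above, the online combination of equation (9) reduces to a weighted sum of the stored tables:

```python
import numpy as np

def online_cost(Js, J_bar, Ds, D_bar, s_c, s_u):
    """Equation (9): expected cost to go of each action from state (s_c, s_u).

    Js, J_bar come from solve_controlled; Ds, D_bar come from
    entry_time_distribution. Returns a length-A array of costs."""
    JK = D_bar[s_u] * J_bar[s_c]          # tau > K term
    for k, Dk in enumerate(Ds):           # k = 0, ..., K
        JK = JK + Dk[s_u] * Js[k][s_c]
    return JK

# The approximately optimal action is then:
# a = int(np.argmin(online_cost(Js, J_bar, Ds, D_bar, s_c, s_u)))
```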
The memory requirement for directly storing J_K(s,a) is O(|A||S_c||S_u|). However, the method presented in this section allows the solution to be represented using O(K|A||S_c| + K|S_u|) storage, which can be a tremendous savings when |S_c| and |S_u| are large. For the collision avoidance problem discussed in the next section, this method allows the cost table to be stored in 500 MB instead of over 1 TB. The offline computational savings are even more significant.
An alternative to using dynamic programming for computing the entry time distribution offline is to use Monte Carlo to estimate the entry time distribution online. A Monte Carlo approach does not require the uncontrolled variables to be discretized and does not require D_0, ..., D_K, D_K̄ to be stored in memory. However, using Monte Carlo increases the amount of computation required online. In addition, for problems where the conflict region is small, the number of samples required to produce an adequate distribution may be large. Importance sampling and other sampling methods may be used to help improve the quality of the estimates of the entry time distribution (Chryssanthacopoulos et al., 2010).
5 AIRBORNE COLLISION
AVOIDANCE SYSTEM
This section demonstrates the approach suggested in
the previous section on an MDP representing an air-
borne collision avoidance problem. In this problem,
the collision avoidance system issues resolution ad-
visories to pilots who then adjust their vertical rate
to avoid coming within 500 ft horizontally and 100 ft
vertically of an intruding aircraft. This section con-
siders a simplified version of the collision avoidance
problem in which one aircraft equipped with a col-
lision avoidance system, called the own aircraft, en-
counters only one other unequipped aircraft, called
the intruder aircraft. The remainder of the section
outlines the assumptions and decomposes the prob-
lem into controlled and uncontrolled subproblems.
5.1 Assumptions
In this problem, s_c represents the state of the vertical motion variables, and s_u represents the state of the horizontal motion variables. This problem defines coming within 500 ft horizontally and 100 ft vertically of an intruder as a conflict. This definition matches the near mid-air collision (NMAC) definition used in prior TCAS studies (RTCA, 2005).
The first assumption in Section 4.1 requires that s_u(t + 1) depend only upon s_u(t). In this collision avoidance problem, it is assumed that pilots randomly maneuver horizontally, and that the advisories issued by the collision avoidance system do not influence the horizontal motion.
The second assumption requires the episode to terminate when s_u enters G. In this problem, G is the set of states where there is a horizontal conflict, defined to be when an intruder comes within 500 ft horizontally. The immediate cost when this occurs is given by C(s_c), which is one when the intruder is within 100 ft vertically and zero otherwise. In simulation, the episode does not terminate when s_u enters G, since entering G does not necessarily imply that there has been a conflict (e.g., the two aircraft may have safely missed each other by 1000 ft vertically). However, it is generally sufficient to plan up to the moment where s_u enters G because adequate separation at that moment usually indicates that the encounter is resolved.
The third assumption requires that for states where s_u ∉ G the immediate cost function depends on the controlled state variables and the action. As outlined in Section 5.2.3, the nonterminal cost function only depends on the advisory state and the advisory being issued.
5.2 Controlled Subproblem
The controlled subproblem, formulated as an MDP,
is defined by the available actions, the dynamics, and
the cost function. The dynamics are determined by
the pilot response model and aircraft dynamic model.
The cost function takes into account both safety and
operational considerations. In addition to describing
these components of the MDP, this section discusses
the resulting optimal policy.
5.2.1 Resolution Advisories
The airborne collision avoidance system may choose
to issue one of two different initial advisories: climb
at least 1500 ft/min or descend at least 1500 ft/min.
Following the initial advisory, the system may choose
to either terminate, reverse, or strengthen the advi-
sory. An advisory that has been reversed requires a
vertical rate of 1500 ft/min in the opposite direction
of the original advisory. An advisory that has been
strengthened requires a vertical rate of 2500 ft/min in
the direction of the original advisory. After an advi-
sory has been strengthened, it can then be weakened
to reduce the required vertical rate to 1500 ft/min in
the direction of the original advisory.
5.2.2 Dynamic Model
The state is represented using four variables:
- h: altitude of the intruder relative to the own aircraft,
- ḣ_0: vertical rate of the own aircraft,
- ḣ_1: vertical rate of the intruder aircraft, and
- s_RA: the state of the resolution advisory.
The discrete variable s_RA contains the necessary information to model the pilot response, which includes the active advisory and the time to execution by the pilot. Five seconds are required for the pilot to begin responding to an initial advisory. The pilot then applies a 1/4 g acceleration to comply with the advisory. Subsequent advisories are followed with a 1/3 g acceleration after a three second delay. When an advisory is not active, the pilot applies an acceleration selected at every step from a zero-mean Gaussian with 3 ft/s² standard deviation. At each step, the intruder pilot independently applies a random acceleration from a zero-mean Gaussian with 3 ft/s² standard deviation.
The continuous state variables are discretized ac-
cording to the scheme in Table 1. The discrete
state transition probabilities were computed using
sigma-point sampling and multilinear interpolation
(Kochenderfer et al., 2010a). This discretization
scheme produces a discrete model with 213 thousand
discrete states.
Table 1: Controlled Variable Discretization.

Variable   Grid Edges
h          −1000, −900, ..., 1000 ft
ḣ_0        −2500, −2250, ..., 2500 ft/min
ḣ_1        −2500, −2250, ..., 2500 ft/min
5.2.3 Cost Function
An effective collision avoidance system must satisfy
a number of competing objectives, including maxi-
mizing safety and minimizing the rate of unnecessary
alerts and path deviation. These objectives are en-
coded in the cost function. In addition to incurring
a cost for conflict, it may be desirable to incur a cost
for other events such as alerting or strengthening the
advisory. The costs of various events are summarized
in Table 2. A small negative cost is awarded at every
time step the system is not alerting to provide some
incentive to discontinue alerting after the encounter
has been resolved.
Table 2: Event Costs.

Event               Cost
Conflict            1
Alert               0.001
Strengthening       0.009
Reversal            0.01
Clear of Conflict   −1 · 10⁻⁴
5.2.4 Optimal Policy
The optimal cost-to-go tables J_0, ..., J_K, J_K̄ were computed offline in less than two minutes on a single 3 GHz Intel Xeon core using a horizon of K = 39 steps. Storing only the values for the valid state-action pairs requires 263 MB using a 64-bit floating point representation. Figure 2 shows plots of the optimal policy through different slices of the state space where the own aircraft is initially climbing at 1500 ft/min and the intruder is level.
Figure 2(a) shows the policy when an alert has
not been issued. The blue region indicates where the
logic will issue a descend advisory, and the green re-
gion indicates where the logic will issue a climb advi-
sory. The optimal policy will sometimes issue a climb
even when the intruder is above. This occurs when
the aircraft are closely separated in altitude and little
time remains until potential conflict. Because the own
aircraft is already climbing, there is insufficient time
to accelerate downward to avoid conflict. Climbing
above the intruder is more effective. Another notable
feature of the plot is that no advisory is issued when
τ ≤ 5 s. Because an advisory has no effect until five
seconds after it is issued in this model, alerting less
than five seconds prior to conflict is no more effective
than not alerting.
In Figure 2(b), a strengthened climb advisory was
issued two seconds previously. Should the logic con-
tinue issuing the strengthened climb advisory, the pi-
lot will begin climbing to 2500 ft/min in one second.
The white region indicates where the logic will dis-
continue the advisory. The advisory is typically dis-
continued when the vertical separation is large and
there is much time until conflict. The gold region
indicates where the logic will continue to issue the
strengthened advisory. In the red region, the logic
will reverse the climb to a descend when the intruder
is above the own aircraft and maintaining the climb
might induce conflict. In the teal region of the plot,
the logic will weaken the advisory when the intruder
is far enough above the own aircraft that reversing the
advisory is unnecessary.
5.3 Uncontrolled Subproblem
The uncontrolled subproblem involves estimating the distribution over τ (i.e., the time until the aircraft are separated by less than 500 ft horizontally) given the current state. This section describes the horizontal dynamics and three methods for estimating the entry time distribution.
5.3.1 Dynamic Model
The aircraft move in the horizontal plane in response to independent random accelerations generated from a zero-mean Gaussian with a standard deviation of 3 ft/s². The motion can be described by a three-dimensional model, instead of the typical four-dimensional (relative positions and velocities) model, due to rotational symmetry in the dynamics. The three state variables are as follows:
- r: horizontal range to the intruder,
- r_v: relative horizontal speed, and
- θ_v: difference in the direction of the relative horizontal velocity and the bearing of the intruder.
These variables are illustrated in Figure 3.
Figure 2: Optimal action plots for ḣ_0 = 1500 ft/min, ḣ_1 = 0 ft/min; each panel plots h (ft) versus τ (s). (a) s_RA = “no advisory”: regions for climb, descend, and no advisory. (b) s_RA = “increase climb in one second”: regions for continue strengthened climb, reverse, weaken, and discontinue advisory.
Figure 3: Three-variable model of horizontal dynamics, showing the range r between the own and intruder aircraft, the relative velocity vector with magnitude r_v, and the angle θ_v.
5.3.2 Dynamic Programming Entry Time
Distribution
The entry time distribution can be estimated offline
using dynamic programming as discussed in Sec-
tion 4.3. The state space was discretized using the
scheme in Table 3, resulting in 730 thousand dis-
crete states. The offline computation required 92 seconds on a single 3 GHz Intel Xeon core. Storing D_0, ..., D_39 in memory using a 64-bit floating point representation requires 222 MB.
Table 3: Uncontrolled Variable Discretization.

Variable   Grid Edges
r          0, 50, ..., 1000, 1500, ..., 40000 ft
r_v        0, 10, ..., 1000 ft/s
θ_v        −180°, −175°, ..., 180°
5.3.3 Monte Carlo Entry Time Distribution
Monte Carlo estimation can be used online to estimate
the entry time distribution as explained at the end of
Section 4.3. The experiments in this paper use 100
Monte Carlo samples to estimate τ.
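As a rough sketch of this online procedure, the function below estimates the entry time distribution by forward-simulating the uncontrolled dynamics; the step and in_G callables stand in for the horizontal dynamic model and the 500 ft conflict test and are assumptions of this sketch.

```python
import numpy as np

def mc_entry_time_distribution(x_u, step, in_G, K=39, n_samples=100, rng=None):
    """Monte Carlo estimate of D_0, ..., D_K and D_bar from state x_u.

    step(x, rng) -> next uncontrolled state after one random transition
    in_G(x)      -> True when the horizontal conflict region is entered
    """
    rng = rng or np.random.default_rng()
    counts = np.zeros(K + 2)          # slots for tau = 0..K, plus tau > K
    for _ in range(n_samples):
        x = x_u
        for k in range(K + 1):
            if in_G(x):
                counts[k] += 1        # entered G after k steps
                break
            x = step(x, rng)
        else:
            counts[K + 1] += 1        # never entered G within the horizon
    D = counts / n_samples
    return D[:-1], D[-1]
```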
5.3.4 Simple Entry Time Distribution
A point estimate of τ can be obtained online as follows. The range rate is given by
$$\dot{r} = -r_v \cos(\theta_v). \qquad (10)$$
If the aircraft are converging in range, then τ can be approximated by r/|ṙ|. Otherwise, τ is set beyond the horizon.
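A sketch of this point estimate, under the sign convention of equation (10) as reconstructed above (converging flight gives a negative range rate):

```python
import math

def simple_tau_estimate(r, r_v, theta_v, K=39):
    """Point estimate of tau from equation (10); theta_v in radians.
    Returns a value beyond the horizon when the range is diverging."""
    r_dot = -r_v * math.cos(theta_v)   # equation (10)
    if r_dot < 0:                      # aircraft converging in range
        return r / abs(r_dot)
    return K + 1                       # tau set beyond the horizon
```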
6 RESULTS
This section evaluates the performance of the col-
lision avoidance system using simulated encounters
and compares it against the current version of TCAS,
Version 7.1.
6.1 Encounter Initialization
Encounters are initialized in the horizontal plane by randomly and independently generating the initial ground speeds of both aircraft, s_0 and s_1, from a uniform distribution between 100 and 500 ft/s. The horizontal range between the aircraft is initialized to r = t_target(s_0 + s_1) + u_r, where u_r is a zero-mean Gaussian with 500 ft standard deviation. The parameter t_target, nominally set to 40 s, controls how long until the aircraft come into conflict.
The bearing of the intruder aircraft with respect to the own aircraft, χ, is sampled from a zero-mean Gaussian distribution with a standard deviation of 2°. The heading of the intruder with respect to the heading of the own aircraft, β, is sampled from a Gaussian distribution with a mean of 180° and a standard deviation of 2°. When β = 180°, the intruder is heading directly toward the own aircraft.
The initial vertical rates ḣ_0 and ḣ_1 are drawn independently from a uniform distribution spanning −1000 to 1000 ft/min. The initial altitude of the own aircraft, h_0, is set to 43,000 ft. The initial altitude of the intruder is h_0 + t_target(ḣ_0 − ḣ_1) + u_h, where u_h is a zero-mean Gaussian with 25 ft standard deviation.
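A sketch of this initialization, with the unit bookkeeping made explicit (vertical rates are converted from ft/min to ft/s before extrapolating over t_target, an assumption of this sketch; variable names are illustrative):

```python
import numpy as np

def init_encounter(rng, t_target=40.0):
    """Sample the initial encounter geometry described above."""
    s0, s1 = rng.uniform(100.0, 500.0, size=2)         # ground speeds (ft/s)
    r = t_target * (s0 + s1) + rng.normal(0.0, 500.0)  # horizontal range (ft)
    chi = rng.normal(0.0, 2.0)                         # intruder bearing (deg)
    beta = rng.normal(180.0, 2.0)                      # relative heading (deg)
    h0_dot, h1_dot = rng.uniform(-1000.0, 1000.0, 2)   # vertical rates (ft/min)
    h0 = 43000.0                                       # own altitude (ft)
    h1 = h0 + t_target * (h0_dot - h1_dot) / 60.0 + rng.normal(0.0, 25.0)
    return dict(s0=s0, s1=s1, r=r, chi=chi, beta=beta,
                h0_dot=h0_dot, h1_dot=h1_dot, h0=h0, h1=h1)
```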
6.2 Example Encounter
Figure 4 shows an example encounter comparing the
behavior of the system using the DP entry time dis-
tribution against the TCAS logic. Figure 5 shows
the entry time distribution computed using the three
methods of Section 5.3 at the points in time when the
system issues alerts.
Seventeen seconds into the encounter, the DP logic issues a descend to pass below the intruder. The expected cost to go for issuing a descend advisory is approximately 0.00928, lower than the expected cost to go for issuing a climb advisory (0.0113) or for not issuing an advisory (0.00972). The DP entry time distribution at this time has a conditional mean E[τ | τ < 40 s] of approximately 12.01 s, and a considerable portion of the probability mass (40%) is assigned to τ ≥ 40 s. The Monte Carlo entry time distribution, in comparison, has less support but a comparable conditional mean of 17.12 s. Only 15% of the probability mass is concentrated on τ ≥ 40 s. The point estimate of τ using the simple method is 21.65 s.
After the descend advisory is issued, the intruder
begins to increase its descent, causing the DP logic
to reverse the descend to a climb 20 seconds into the
encounter. The pilot begins the climb maneuver three
seconds later. Once the aircraft are safely separated,
the DP logic discontinues the advisory at t = 31 s.
The minimum horizontal separation is 342 ft, at which
time the vertical separation is 595 ft. No conflict oc-
curs.
TCAS initially issues a climb advisory four sec-
onds into the encounter because it anticipates, using
straight-line projection, that by climbing it can safely
pass above the intruder. Nine seconds later, when the
own aircraft is executing its climb advisory, TCAS re-
verses the climb to a descend because it projects that
maintaining the climb will not provide the required
separation. TCAS strengthens the advisory three sec-
onds later, but fails to resolve the conflict. The air-
craft miss each other by 342 ft horizontally and 44 ft
vertically. Although the TCAS logic alerts earlier and
more often, the DP logic still outperforms it in this
example encounter.
Figure 4: Example encounter comparing the system with the DP entry time distribution against TCAS. (a) Vertical profile: altitude (ft) versus time (s) for the DP logic, TCAS logic, no logic, and the intruder, with advisories annotated. (b) Horizontal profile: north versus east position (ft) of the own and intruder aircraft.
6.3 Performance Evaluation
Table 4 summarizes the results of simulating the DP
logic and the TCAS logic on one million encounters
generated by the model of Section 5. The table sum-
marizes the number of conflicts, alerts, strengthen-
ings, and reversals.
As the table shows, the DP logic can provide a
much lower conflict rate while significantly reducing
the alert rate. The Monte Carlo entry time distribu-
tion results in a greater number of conflicts, but it
alerts less frequently than the other methods. Increas-
ing the number of samples used generally improves
performance but increases online computation time.
The DP logic using the simple point estimate of τ re-
solves all but one conflict while rarely reversing or
strengthening the advisory, but alerts more frequently
than Monte Carlo.
Figure 5: Entry time distribution computed using dynamic programming (DP), Monte Carlo (MC), and the simple (Simple) methods at the two alerting points of the DP logic in the example encounter of Figure 4: (a) t = 17 s; (b) t = 20 s. Each panel plots probability versus τ (s).
Table 4: Performance Evaluation.

                 DP Logic    DP Logic    DP Logic        TCAS
                 (DP Entry)  (MC Entry)  (Simple Entry)  Logic
Conflicts        2           11          1               101
Alerts           540,113     400,457     939,745         994,640
Strengthenings   39,549      37,975      26,485          45,969
Reversals        1242        747         129             193,582
6.4 Safety Curve
The results of Section 6.3 considered the performance
of the system optimized using the fixed event costs of
Table 2. Figure 6 shows the safety curves for the DP
logic and TCAS when different parameters are varied.
The DP logic safety curves were produced by
varying the cost of alerting from zero to one while
keeping the other event costs fixed. Separate curves
were produced for the three methods for estimating
the entry time distribution. The upper-right region of
the plot corresponds to costs of alerting near zero and
the lower-left region corresponds to costs near one.
The safety curve for TCAS was generated by vary-
ing the sensitivity level of TCAS. The sensitivity level
of TCAS is a system parameter of the logic that in-
creases with altitude. At higher sensitivity levels,
TCAS will generally alert earlier and more aggres-
sively to prevent conflict.
The safety curves show that the DP logic can ex-
ceed or meet the level of safety provided by TCAS
while alerting far less frequently. The safety curves
can aid in choosing an appropriate value for the cost
of alerting that satisfies a required safety threshold.
Figure 6 also reveals that the DP and Monte Carlo
methods for estimating τ offer similar performance
and that they both outperform the simple method, es-
pecially when the cost of alerting is high and the logic
can only alert sparingly to prevent conflict. In the
upper-right region of the plot, the three methods are
nearly indistinguishable.
Figure 6: Safety curves, plotting Pr(safe) versus Pr(alert) for the DP, MC, and Simple entry time methods and for TCAS. Each point on the curves was estimated from 10,000 simulations.
7 CONCLUSIONS AND FURTHER
WORK
This paper presented a method for solving large
MDPs that satisfy certain assumptions by decompos-
ing the problem into controlled and uncontrolled sub-
problems that can be solved independently offline and
recombined online. The method was applied to air-
borne collision avoidance and was compared against
TCAS, a system that was under development for sev-
eral decades and has a proven safety record.
The experiments demonstrate that the collision
avoidance logic that results from solving the MDP
using the method presented in this paper reduces the
risk of collision by a factor of 50 while issuing fewer
alerts than TCAS in the simulated encounters. The
system reverses less than 1% of the time that TCAS
reverses, and the system strengthens less frequently
as well. It should be emphasized that further simula-
tion studies using more realistic encounter models are
required to quantify the expected performance of the
DP logic (Kochenderfer et al., 2010b).
Real collision avoidance systems have imperfect
sensors, which results in state uncertainty. TCAS cur-
rently relies on radar beacon surveillance, which re-
sults in somewhat significant uncertainty in the in-
truder bearing. When state uncertainty is signifi-
cant, the uncertainty must be taken into account when
choosing actions. With a sensor model, the prob-
lem may be transformed into a partially-observable
Markov decision process (POMDP) and solved ap-
proximately using various methods (Smith and Sim-
mons, 2005; Kurniawati et al., 2008).
ACKNOWLEDGEMENTS
This work is the result of research sponsored by the
TCAS Program Office at the Federal Aviation Admin-
istration. The authors appreciate the support provided
by the TCAS Program Manager, Neal Suchy. This
work has benefited from discussions with Leslie Kael-
bling and Tomas Lozano-Perez from the MIT Com-
puter Science and Artificial Intelligence Laboratory.
REFERENCES
Bellman, R. E. (1961). Adaptive control processes: A
guided tour. Princeton University Press.
Bertsekas, D. P. (2005). Dynamic Programming and Opti-
mal Control, volume 1. Athena Scientific, Belmont,
Mass., 3rd edition.
Bilimoria, K. D. (2000). A geometric optimization ap-
proach to aircraft conflict resolution. In AIAA Guid-
ance, Navigation, and Control Conference and Ex-
hibit, Denver, Colo.
Carpenter, B. D. and Kuchar, J. K. (1997). Probability-
based collision alerting logic for closely-spaced paral-
lel approach. In AIAA 35th Aerospace Sciences Meet-
ing, Reno, NV.
Chamlou, R. (2009). Future airborne collision avoidance—
design principles, analysis plan and algorithm devel-
opment. In Digital Avionics Systems Conference.
Chryssanthacopoulos, J. P., Kochenderfer, M. J., and
Williams, R. E. (2010). Improved Monte Carlo sam-
pling for conflict probability estimation. In AIAA
Non-Deterministic Approaches Conference, Orlando,
Florida.
Davies, S. (1997). Multidimensional triangulation and in-
terpolation for reinforcement learning. In Mozer,
M. C., Jordan, M. I., and Petsche, T., editors, Ad-
vances in Neural Information Processing Systems,
volume 9, pages 1005–1011. MIT Press, Cambridge,
Mass.
Dowek, G., Geser, A., and Muñoz, C. (2001). Tactical conflict detection and resolution in a 3-D airspace. In 4th USA/Europe Air Traffic Management R&D Seminar, Santa Fe, New Mexico.
Duong, V. N. and Zeghal, K. (1997). Conflict resolution
advisory for autonomous airborne separation in low-
density airspace. In IEEE Conference on Decision and
Control, volume 3, pages 2429–2434.
Eby, M. S. and Kelly, W. E. (1999). Free flight separa-
tion assurance using distributed algorithms. In IEEE
Aerospace Conference, volume 2, pages 429–441.
Khatib, O. and Maitre, J.-F. L. (1978). Dynamic control
of manipulators operating in a complex environment.
In Symposium on Theory and Practice of Robots and
Manipulators, pages 267–282, Udine, Italy. Elsevier.
Kochenderfer, M. J. and Chryssanthacopoulos, J. P. (2010).
A decision-theoretic approach to developing robust
collision avoidance logic. In IEEE International
Conference on Intelligent Transportation Systems,
Madeira Island, Portugal.
Kochenderfer, M. J., Chryssanthacopoulos, J. P., Kaelbling,
L. P., and Lozano-Perez, T. (2010a). Model-based
optimization of airborne collision avoidance logic.
Project Report ATC-360, Massachusetts Institute of
Technology, Lincoln Laboratory.
Kochenderfer, M. J., Edwards, M. W. M., Espindle, L. P.,
Kuchar, J. K., and Griffith, J. D. (2010b). Airspace en-
counter models for estimating collision risk. Journal
of Guidance, Control, and Dynamics, 33(2):487–499.
Kurniawati, H., Hsu, D., and Lee, W. (2008). SARSOP: Ef-
ficient point-based POMDP planning by approximat-
ing optimally reachable belief spaces. In Robotics:
Science and Systems.
Kuwata, Y., Fiore, G. A., Teo, J., Frazzoli, E., and How, J. P.
(2008). Motion planning for urban driving using RRT.
In IEEE/RSJ International Conference on Intelligent
Robots and Systems, pages 1681–1686.
LaValle, S. M. (1998). Rapidly-exploring random trees: A
new tool for path planning. Technical Report 98-11,
Computer Science Department, Iowa State University.
Powell, W. B. (2007). Approximate Dynamic Program-
ming: Solving the Curses of Dimensionality. Wiley,
Hoboken, NJ.
Puterman, M. L. (1994). Markov Decision Processes: Dis-
crete Stochastic Dynamic Programming. Wiley series
in probability and mathematical statistics. Wiley, New
York.
RTCA (2005). Safety analysis of proposed change to TCAS
RA reversal logic, DO-298. RTCA, Inc., Washington,
D.C.
RTCA (2008). Minimum operational performance stan-
dards for Traffic Alert and Collision Avoidance Sys-
tem II (TCAS II), DO-185b. RTCA, Inc., Washington,
D.C.
Saunders, J., Beard, R., and Byrne, J. (2009). Vision-based
reactive multiple obstacle avoidance for micro air ve-
hicles. In American Control Conference, pages 5253–
5258.
Smith, T. and Simmons, R. G. (2005). Point-based POMDP
algorithms: Improved analysis and implementation.
In Uncertainty in Artificial Intelligence.
Temizer, S., Kochenderfer, M. J., Kaelbling, L. P., Lozano-Pérez, T., and Kuchar, J. K. (2010). Collision avoidance for unmanned aircraft using Markov decision processes. In AIAA Guidance, Navigation, and Control Conference, Toronto, Canada.
Yang, L. C. and Kuchar, J. K. (1997). Prototype conflict
alerting system for free flight. Journal of Guidance,
Control, and Dynamics, 20(4):768–773.