TYPE-FLOW ANALYSIS FOR LEGACY COBOL CODE

Alvise Spanò, Michele Bugliesi and Agostino Cortesi

Dipartimento di Scienze Ambientali, Informatica e Statistica, Università Ca’ Foscari Venezia, Venice, Italy

Keywords:

Static analysis, Analyzer, Type system, Type flow, Flow Type, Type rule, Storage, Picture, Record, Cobol, Label, Variable, Branch, Termination, Status, Convergence, Abstract interpretation, Coercion, Coerce, Environment, Judgement, Substitution, Grammar, Island grammar, Parser, Island parsing, Lexer, Parsing, LALR, Yacc, Lex, F#, .NET, IBM, z/OS, COBOL, COBOL85.

Abstract:

Many business applications today still rely on COBOL programs written decades ago that are difficult to maintain and upgrade due to technological limitations and a lack of experts in the language. Several companies have been trying to migrate their software base to modern platforms, but code translation is problematic because most of the business processes implemented are no longer documented or even known. Applying existing Program Understanding techniques to COBOL could help the IT specialists in charge of a port, but useful raw information must first be extracted from the source code for these techniques to yield meaningful results. We believe that the types of the variables used in programs are an important part of such raw information, and we present an approach based on the static analysis of types rather than data. Our system is capable of reconstructing the type-flow of a COBOL program throughout branches, jumps and loops in finite time, and of tracking type information on reused variables occurring in the code. It also detects a number of error-prone situations, type mismatches or misuses, and reports them by means of messages annotated in the code along with the types inferred for each variable occurrence.

1 INTRODUCTION

Analyzing COBOL code using type inference techniques has been proposed many times in the last decade and before. From the system first described in (van Deursen and Moonen, 1998) to its later refinement in (Moonen, 2003), giving informative types to COBOL variables seems to be a good way of automatically generating a basic tier of documentation for legacy software (van Deursen and Moonen, 2006), and it is also a reliable starting point for further Program Understanding approaches (Kuipers and Moonen, 2000). These systems are quite sophisticated and rely on a number of complex side models and tools aimed at extracting properties and information from COBOL programs at a high level of abstraction, thus inevitably omitting several details at a lower language and type level - e.g. how exactly to deal with the many picture formats supported by COBOL and with control constructs that alter the program flow.

In this paper, we propose a light-weight system for typing COBOL with rich yet simple types that pursues a number of goals:

1. model the COBOL picture system without approximating storage format information such as computational fields or the number of digits in a numeric, in order to reconstruct the exact in-memory representation of datatypes and perform precise comparisons among the many formats COBOL supports;

2. deal with what in (van Deursen and Moonen, 2001) is called pollution in such a way that no complex relational property system among types is needed, by tracking the type alterations that variables are subject to in the following scenarios:

   (a) when data is reused for different purposes in a program: many COBOL programmers are used to this practice in order to save memory, and the result is often poor maintainability and error-proneness;

   (b) when the language performs an implicit datatype cast, reformatting values to fit target variables, either at compile-time or at run-time;

3. deal with branches in the program flow that are not statically decidable (i.e. conditional statements) by embedding into the type itself the multiple types a variable may possibly assume during the execution.

Spanò A., Bugliesi M. and Cortesi A..

TYPE-FLOW ANALYSIS FOR LEGACY COBOL CODE.

DOI: 10.5220/0003506700640075

In Proceedings of the 6th International Conference on Software and Data Technologies (ICSOFT-2011), pages 64-75

ISBN: 978-989-8425-77-5

© 2011 SCITEPRESS (Science and Technology Publications, Lda.)

We introduce a kind of storage type for variables declared as pictures in COBOL and a special flow-type for collecting the storage types resulting from conditional branches in the program.

On top of that, having to deal with a language where GOTO and other low-level commands altering the control-flow are frequently used, our system cannot behave like an ordinary type-checker or type-inference algorithm: it is a type analyzer able to follow jumps and branches in the code, detect cycles and avoid endless loops by checking for convergence in the status of the typing function - in much the same way as many basic techniques of Abstract Interpretation for Static Analysis do (F. Nielson, 1999). The status here consists of a special topological environment mapping variable occurrences to their flow-type at that point in the program. Overall, this approach resembles a sort of data-flow analysis where the data are types rather than values.

A prototype of the system described in this article has been developed in F# for the .NET 4.0 platform and implements a Lex & Yacc tweak for reproducing the behavior of Island Grammars (Moonen, 2001) with the benefit of efficient LALR parsing.

It is able to parse large COBOL source programs (up to many thousands of lines) and to type them, generating as output the flow-types annotated at every variable occurrence (i.e. the topological environment mentioned above). Additionally, it produces useful information about type usage in the form of error messages, warnings and hints. Again as opposed to a compiler, here errors do not imply a failure: the system adopts a keep-going approach and is tolerant of most recoverable error scenarios. All type mismatches or misuses are reported and further hints about possibly error-prone situations are signaled; an undefined variable, though, would make the system fail. Thus, we assume that the input is production code that compiles successfully and works.

1.1 Overview

Our system does not manipulate COBOL code directly: as other remarkable systems do (van Deursen and Moonen, 1999), we translate COBOL into a more comfortable intermediate language (from now on referred to as IL) resembling modern imperative languages, without altering COBOL semantics and principles. Notably, what in COBOL parlance is referred to as a program (i.e. a compilation unit) is here translated into a procedure, with its own static variable declarations. A COBOL application made up of many units becomes a single large IL program, where the main code shows up as the bottom unnamed block.

Before performing the type analysis, the system must also label all variables occurring in the program with a unique identifier - simply a fresh integer tag. The type analyzer then explores the code, statement by statement and recursively descending into expressions, basically performing two operations that affect either the type or the topological environment:

1. keeping track of the current type(s) of variables by updating flow-types in the type environment;

2. annotating variable occurrences with their flow-type at that point in the program, i.e. creating new bindings in the topological environment.

Assignments and call-by-reference argument applications are two scenarios where variables could be subject to an implicit cast, hence the flow-type of a variable appearing, for example, at the left-hand side of an assignment must be updated. Conditional constructs, instead, lead to branches in the code exploration, thus the analyzer would produce two parallel results for the two sub-blocks of an if-then-else statement: the resulting environments must therefore be merged to reflect that the same variables may possibly have different types after the if block, and these multiple choices are collected in the flow-type itself.

Look at the following example code, written directly in IL:

{
    x := x + 1;
    if x > 0 then
    {
        x := "foo";
    }
    x := x + 23;
}
where x : num[2] := 11

What we want to achieve is reconstructing the types of the program and producing annotations for each occurrence of variable x with its type at that point of the code, as well as outputting error and warning messages. To do that, the system has to follow all branches in the control flow and keep the type of x updated: by the end of the conditional block we want to show somehow that x might have become a string. And where there is an ambiguous operation, we want the system to fall back to a default decision and add a comment about it.

{
    (x : num[2]) := (x : num[2]) + 1;
    // [WARNING] possible truncation
    // detected in assignment:
    // num[3] :> num[2]
    if (x : num[2]) > 0 then
    {
        (x : alpha[2]) := "foo";
        // [ERROR] truncation detected in
        // assignment:
        // alpha[3] :> num[2]
    }
    (x : num[2]) := (x : num[3]|alpha[3]) + 23;
    // [HINT] type of 'x' is ambiguous in
    // expression at right-hand of
    // assignment: assuming
    // initialization type num[2]
    // [WARNING] possible truncation
    // detected in assignment:
    // num[3] :> num[2]
}
where x : num[2] := 11

In the first statement, where x is incremented by 1, the type of the variable is annotated both in its usage as an expression term and as the target on the left side of the assignment. In the right-hand case its type is the initialization type num[2] that appears in the global declaration, which happens to be its current type at the beginning of the program; in the left-hand case x should be given a wider numeric type, because the sum of a num[2] and a literal whose type is num[1] would lead to num[3], but the result gets truncated in order to fit the initialization type, as the COBOL run-time would do. Since the resulting storage class is still num, its final type happens to be equivalent to its initialization type.

The system tracks the type that variables are supposed to have from a type-flow point of view, i.e. as if data movements were tracked across expressions and statements and the type of what variables are supposed to contain were recorded.

Encountering the if statement makes the analyzer descend into its then block: a truncation is detected therein, alpha[3] being surely wider than the target type num[2], and the truncated type alpha[2], which fits the initialization type, is given to x. Such information must then be merged with that collected before branching: this is why the type of x in the expression at the right-hand side of the assignment after the if block is not a simple type. The flow-type has grown here due to the merge and it now consists of all the types x might possibly have at that moment. That leads to an ambiguous choice when typing the sum operation, and so the system needs to fall back to the initial type declaration. This might seem odd, but it is in fact a viable solution: in COBOL every variable strictly adheres to its picture declaration, so falling back to it is not an unsafe decision when better information cannot be reconstructed.

1.2 Comparisons and Motivation

Note on the widening from num[2] to num[3] discussed in section 1.1: in general, a number made of 2 digits plus a number made of 1 digit could possibly lead to a number made of 3 digits, as in 99 + 9 = 108. See the type rules for expressions in table 6 for details on how arithmetic operations formally affect numeric type formats.

As already mentioned, the legacy software analysis system thoroughly presented in (Moonen, 2003) relies on mechanisms for producing information over types that mainly serve Program Understanding techniques, Concept Analysis (Kuipers and Moonen, 2000) and other high-level elaborations. (That work is a Ph.D. thesis collecting previous works on the same subject and anticipating some that had yet to come. In general, that system has been proposed several times in several articles with some additions; we might therefore refer to either (van Deursen and Moonen, 1998), (van Deursen and Moonen, 2001), (van Deursen and Moonen, 2000), (Kuipers and Moonen, 2000) or (Moonen, 2003).) In general, its scope is wider than ours and not entirely overlapping. Nonetheless there is something in common, namely giving somehow interesting types to COBOL variables, that can be taken into consideration for making a comparison with what we believe is the most advanced system for COBOL analysis based on types available to date.

• We translate COBOL into a simpler intermediate language as (van Deursen and Moonen, 1998) does, though without leaving out important language constructs whose behavior is relevant to typing real-world programs, such as goto, perform and perform-thru jump statements, call-by-reference procedure calls and statements.

• Our type syntax is more complete, clearer and open to more orthodox type manipulation, as it doesn't provide just a plain AST-ization of COBOL picture declarations. (The syntax of types in (van Deursen and Moonen, 1998) oddly carries along the variable identifiers and picture format strings as is, leaving unclear how the type environment and type comparisons formally relate to them.)

• The type inference rules given in (van Deursen and Moonen, 2001) are sometimes trivial. (That system uses the word inference, with a clear reference to the world of ML and functional languages, though we'd prefer reconstruction, as there is actually no use of type variables and unification for resolving a set of constraints over type equations.) We define a type-system that reconstructs more detailed type information, e.g. our type rules for arithmetic operators in table 6 recalculate the resulting type format length in order to include within the type itself as much information as possible about changes in value ranges.

• We don't infer a type equivalence when two or more types are expected to be the same (as would happen in ML in a homogeneous binary application, for example). Our system rather falls back to a variable's initialization type in case a type mismatch or ambiguity is detected. This trade-off makes type derivations simpler, does not necessarily imply a loss of information and reflects COBOL run-time semantics better.


• The system in (Kuipers and Moonen, 2000) represents the inferred set of type relations via a Relational Algebra and resolves them by applying an algorithm written in Grok (Holt, 2008): the resolution is actually a simplification process performing iterative unification. This approach is rather inefficient and does not take into account type dynamics due to control-flow jumps. Our system performs a code analysis at typing-time by following jumps, and thus detects a wider range of possible type anomalies and variable reuses.

• According to (van Deursen and Moonen, 2001), pollution occurs whenever a type-equivalence involves types that are not equivalent or subtypes: we do not handle this as a special case, but it automatically comes from non-singleton choices within flow-types, which are natively supported by our type-system and do not require any further processing.

• Our subtype relation deals with the in-memory representation of a wider range of type formats and qualifiers that are very common in COBOL programs, such as all COMP fields (translated into native integer, floating point and binary-coded-decimal types), signed/unsigned numeric formats and mixed alphabetic/alphanumeric strings.

• In (van Deursen and Moonen, 2001) there is no mention of how COBOL references are handled, nor of how COBOL run-time data conversions affect the type rules of commands that manipulate different picture formats and computational fields (e.g. the COMPUTE instruction). A major feature of our system is reproducing such behaviors at typing-time by giving temporary types to R-value expressions and eventually promoting them to storage types when a type coercion is invoked (see definitions 3.1 and 3.6).

Let’s now apply our system to a COBOL code

fragment mentioned in (van Deursen and Moonen,

1998) and other papers of the series:

DATA DIVISION.
WORKING-STORAGE SECTION.
01 N000.
   05 N100-N PIC S9(03) COMP-3.
01 TAB000.
   05 TAB100-NAME-PART.
      10 TAB100-POS PIC X(01) OCCURS 40.
   05 TAB100-MAX PIC S9(03) COMP-3 VALUE 40.
   05 TAB100-FILLED PIC S9(03) COMP-3 VALUE ZERO.
01 RAR001-RECORD.
   03 RAR001-VAST.
      05 RAR001-INITIALS PIC X(05).
PROCEDURE DIVISION.
R210-INITIALS.
    MOVE RAR001-INITIALS TO TAB100-NAME-PART
    PERFORM R300-COMPOSE-NAME
    EXIT.
R300-COMPOSE-NAME.
    MOVE TAB100-MAX TO N100.
    MOVE ZERO TO TAB100-FILLED.
    PERFORM UNTIL N100 EQUAL ZERO
        IF TAB100-POS (N100) EQUAL SPACE
            SUBTRACT 1 FROM N100
        ELSE
            MOVE N100 TO TAB100-FILLED
            MOVE ZERO TO N100
        END-IF
    END-PERFORM.

Note (on following jumps at typing-time): jumps are followed until a convergence in the topological environment is detected (see section 2.4).

Note (on COBOL references): according to the COBOL syntax specification in (IBM, 2009), accessible memory cells are called references. We renamed them as Left Values in our intermediate language for the sake of symmetry with imperative languages such as C, which define them as a sub-class of expressions that can appear at the left side of an assignment and refer to an in-memory value (Kernighan and Ritchie, 1988).

Note (on R-value expressions): symmetrically, right values are expressions that can stand on the right side of an assignment, hence evaluate to a temporary value (Stroustrup, 2000).

The whole code above is translated into one single annotated IL program showing operations among types and warning messages:

{
    R210-INITIALS: // main code
    {
        (TAB000 : { NAME-PART : alphanum[40]; .. })
            .NAME-PART :=
                RAR001-RECORD.VAST.INITIALS;
        // [WARNING] reverse subsumption
        // detected in assignment: right-hand
        // type is smaller than left-hand type
        perform R300-COMPOSE-NAME;
        return;
    }
    R300-COMPOSE-NAME:
    {
        N000.N := TAB000.MAX;
        TAB000.FILLED := 000;
        __loop0:
        {
            if N000.N = 000 then goto __loop0_exit;
            else
            {
                if (TAB000 : { NAME-PART :
                        { POS : alphanum[1] array[40] };..})
                    .NAME-PART.POS[N000.N] = ' '
                // [WARNING] possible access to
                // corrupted data: accessing TAB000
                // with its initialization type
                // but its type had changed to
                // { NAME-PART : alphanum[40]; .. }
                // [WARNING] possible error in array
                // subscript: type 'num.bcd[S3]' has
                // signed format
                then {
                    (N000 : { N : num.bcd[S3] }).N :=
                        N000.N - 1;
                    // [WARNING] possible truncation
                    // detected in assignment:
                    // num.bcd[S4] :> num.bcd[S3]
                }
                else
                {
                    TAB000.FILLED := N000.N;
                    (N000 : { N : num.bcd[S3] }).N :=
                        000;
                }
                goto __loop0;
            }
        }
        __loop0_exit: {}
    }
}
where N000 : { N : num.bcd[S3] };
      TAB000 : {
          NAME-PART : {
              POS : alphanum[1] array[40] };
          MAX : num.bcd[S3] := 40;
          FILLED : num.bcd[S3] := 0 };
      RAR001-RECORD : {
          VAST : { INITIALS : alphanum[5] } }

Note: we omit type annotations where flow-types do not differ from the previous variable occurrence or from its initialization type.

(Moonen, 2003) unfortunately does not contain practical code samples of pollution or other anomalies, thus we can't compare how the two systems behave in that regard. As a matter of fact, though, our system seems ultimately more concerned with accurate typing and detecting error-proneness than with program reasoning and collecting statistics.

2 TYPE SYSTEM

2.1 Storage Types and Flow-types

COBOL picture declarations in the Working Storage section of the Data Division define data instances along with their own storage format: they are not type declarations used for instantiating data elsewhere, as in most modern languages. Our system must of course reproduce this design, while mapping COBOL picture format strings into types. For example, consider the following picture declarations:

DATA DIVISION.
WORKING-STORAGE SECTION.
01 A PIC 9(3) COMP-3 OCCURS 10.
01 N PIC COMP 9(8).
01 R1.
   02 R1-S PIC A(2).
   02 R1-B PIC X(3)9(2)A(3).
01 R2 OCCURS 7.
   02 X PIC S99V9 COMP-2.

We translate it into more orthodox type bindings that are quite self-explanatory:

A : num_{bcd}[3] array[10]
N : num_{int}[8]
R1 : {S : alpha[2]; B : alphanum[8]}
R2 : {X : num_{float}[S2.1]} array[7]

Picture format strings are mapped into either numeric, pure alphabetic or alphanumeric types according to their structure; arrays and records are also first-class citizens of the type language in our system and can therefore be nested at will, yielding types that resemble those of modern functional languages. Moreover, numeric types carry along detailed information on their in-memory representation at machine level, sign and length of both integral and fractional parts, while arrays and alphabetic/alphanumeric strings simply carry their length.
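The mapping from picture strings to storage classes described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the prototype's F# code: the function name `classify_picture` is hypothetical, only the picture symbols 9, A, X, S and V (with an optional repetition count) are handled, and usage qualifiers such as COMP-3 are ignored.

```python
import re

def classify_picture(pic: str) -> str:
    """Map a COBOL picture string to a storage-type string (illustrative only)."""
    signed = False
    counts = {"9": 0, "A": 0, "X": 0}      # characters per picture symbol
    int_digits = frac_digits = 0
    after_point = False                     # True once the implied point V is seen
    # Each match is a symbol with an optional "(n)" repetition count.
    for sym, rep in re.findall(r"([9AXSV])(?:\((\d+)\))?", pic.upper()):
        n = int(rep) if rep else 1
        if sym == "S":
            signed = True
        elif sym == "V":
            after_point = True
        else:
            counts[sym] += n
            if sym == "9":
                if after_point:
                    frac_digits += n
                else:
                    int_digits += n
    total = sum(counts.values())
    if counts["A"] == 0 and counts["X"] == 0:
        # Only digits: numeric type with format [S]n.d
        fmt = ("S" if signed else "") + str(int_digits)
        if frac_digits:
            fmt += "." + str(frac_digits)
        return f"num[{fmt}]"
    if counts["9"] == 0 and counts["X"] == 0:
        return f"alpha[{total}]"            # only letters
    return f"alphanum[{total}]"             # any mix of symbols

print(classify_picture("X(3)9(2)A(3)"))
```

Under these assumptions the mixed picture `X(3)9(2)A(3)` is classified as an alphanumeric string of length 8, matching the binding given for field B of R1.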

The full syntax of the type-system follows:

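\[
\begin{array}{rcll}
\tau & := & & \text{storage types} \\
 & & \mathtt{num}_q[\rho] & \text{numeric} \\
 & | & \mathtt{alpha}[n] & \text{alphabetic string} \\
 & | & \mathtt{alphanum}[n] & \text{alphanumeric string} \\
 & | & \tau\ \mathtt{array}[n] & \text{array} \\
 & | & \{x_1 : \tau_1 \,..\, x_k : \tau_k\} & \text{record} \\
\sigma & := & & \text{temporary types} \\
 & & \tau & \text{storage type} \\
 & | & \mathtt{bool} & \text{boolean} \\
 & | & \mathtt{num}[\rho] & \text{abstract numeric} \\
q & := & & \text{numeric storage qualifier} \\
 & & \mathtt{ascii} & \text{display or ASCII} \\
 & | & \mathtt{bcd} & \text{binary-packed decimal} \\
 & | & \mathtt{int}_{16|32|64} & \text{native integer} \\
 & | & \mathtt{float}_{32|64} & \text{native float} \\
\rho & := & [\mathtt{S}]n.d & \text{numeric format} \\
\varphi & := & \{\tau_1 \,..\, \tau_k\} & \text{flow-item or choice} \\
\Phi & := & \langle \varphi ; \tau \rangle & \text{flow-type}
\end{array}
\]

where $k \ge 1$, $n \in \mathbb{N}^{*}$ and $d \in \mathbb{N}$.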

There are two distinct classes of types:

• τ is the type of storage variables and L-values in general, i.e. the type of data that stands in memory and has some representation;

• σ, where σ ⊃ τ, is the type given to expression terms only and is never produced by picture translation, serving just as a temporary light-weight type whose in-memory representation is yet to be known in that context.

As typing rules will show, such temporary types are eventually promoted to ordinary τ types as soon as the storage type of an actual variable becomes known, for example when an expression that is given a temporary type is then assigned to an L-value or passed as a call-by-ref argument in a procedure call.

Finally, a flow-type is simply a pair made of possibly multiple storage types (those a variable may concurrently have following statically undecidable conditional branches in the program flow, as stated in section 1) and an additional single storage type, which is the type initially declared for the variable in the global environment. We'll often refer to the first component of a flow-type as flow-item or choice.

Note: ASCII is the default qualifier for numeric types: whenever unspecified this one holds, as in num[3] for example.


2.2 Environments

Type judgements operate over a number of environments.

Type Environment Γ maps variable identifiers to flow-types: this environment is initially populated with global type declarations and its bindings are then updated when the current flow-type changes during typing. It contains bindings of form $x : \Phi$.

Topological Environment Θ collects all annotations produced by the type analyzer by mapping each labeled variable occurrence $x^\kappa$ to its flow-type at that program point. It also represents the status of the typing function in the detection of loop termination. It contains bindings of form $x^\kappa : \Phi$.

Procedure Environment Π maps procedure names to signatures (see definition 3.8). It contains bindings of form $p \mapsto \langle y_1 : \tau_1 \,..\, y_n : \tau_n ; \Gamma \rangle$.

Block Environment Σ maps label identifiers to blocks of statements. It contains bindings of form $l \mapsto \{st_1 \,..\, st_n\}$.

Type environments also support a binary function merge, used by rules IF and IF-ELSE, which recombines the bindings collected in separate environments by the typing of program branches, as informally introduced in section 1. The merge function alone is responsible for the growth of the flow-item component within a flow-type.
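As an illustration of the merge just described, the following Python sketch treats a flow-type as a pair (flow-item, initialization type) and merges two branch environments by taking the union of flow-items variable by variable. The representation (types as strings, environments as dictionaries) and the handling of a variable bound in only one branch are assumptions made for this sketch; the formal definition is the one referred to as definition 3.9.

```python
from typing import Dict, FrozenSet, Tuple

FlowType = Tuple[FrozenSet[str], str]   # (flow-item, initialization type)
TypeEnv = Dict[str, FlowType]           # Γ: variable identifier -> flow-type

def merge(env_a: TypeEnv, env_b: TypeEnv) -> TypeEnv:
    """Merge the type environments produced by two program branches."""
    merged: TypeEnv = {}
    for var in env_a.keys() | env_b.keys():
        if var in env_a and var in env_b:
            item_a, init = env_a[var]
            item_b, _ = env_b[var]
            # The flow-item grows by set union; the initialization type is kept.
            merged[var] = (item_a | item_b, init)
        else:
            # Assumption of this sketch: a one-sided binding is kept unchanged.
            merged[var] = env_a[var] if var in env_a else env_b[var]
    return merged

# A variable v declared num[2] that one branch leaves untouched
# and the other branch turns into a string (illustrative values).
before = {"v": (frozenset({"num[2]"}), "num[2]")}
after_then = {"v": (frozenset({"alpha[2]"}), "num[2]")}
after_if = merge(before, after_then)
print(sorted(after_if["v"][0]))
```

Because the operation is a set union, merging an environment with itself changes nothing, which is what makes repeated passes over the same code harmless.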

2.3 Coercion of L-Values

Take the following example:

{
    a[0].l := "boo";
}
where a : { l : num[2]; m : alpha[10] } array[5]

And its annotated form resulting from the type analysis:

{
    (a : { l : alpha[2]; m : alpha[10] } array[5])[0].l
        := "boo";
}
where a : { l : num[2]; m : alpha[10] } array[5]

The literal "boo", having type alpha[3], is assigned to field l of a record within a cell of an array. The flow-type of variable a needs to be updated here with the type of the right-hand side of the assignment - and of course it is not to a that such type must naively be given, but to the record field l nested within. Nonetheless the environment binds variable identifiers to flow-types, thus there is no way to update the type of a record label (as l in our case) or of an array cell alone. Therefore the whole type of a variable must be updated, keeping the original structure layout and replacing the appropriate bit nested within it. Hence, the whole type of a in the example becomes { l : alpha[2]; m : alpha[10] } array[5].

This also shows that the expected type alpha[3] of the literal "boo" has been adapted to fit into the initialization type num[2]: coercion in assignments therefore needs both to replace a piece of a type and to resize it accordingly, keeping the storage class of the assigned value (alpha in our example, as the annotation alpha[2] shows) and recalculating the format in such a way that the overall size of the new resulting type fits the initialization one.

For these reasons, judgements for L-value terms are slightly different: $\Pi;\Sigma;\Gamma;\Theta \vdash lv : \tau \backslash \theta_{x^\kappa} \lhd \Theta$ means that the L-value lv has a storage type τ coercible by the substitution $\theta_{x^\kappa}$, where x is the root variable of lv (formally $x = \Re(lv)$ as of definition 3.7) and $x^\kappa$ is its labeled occurrence. θ is a function from storage types to storage types; typing rules that need to update the type of the root variable of an L-value pass it to the coerce function C (see definition 3.6), which performs the proper fit operation among other things.
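The idea of rebuilding the whole type of the root variable while replacing only the nested piece can be sketched as follows. This is a hedged illustration, not the paper's formal substitution: types are encoded as nested Python tuples, the L-value is given as a path of steps ("[]" for an array subscript, a field name for a record access), and the helper name `replace_at` is hypothetical. The new leaf type is assumed to have already been fitted to the size of the old one.

```python
def replace_at(root, path, new_leaf):
    """Return a copy of `root` where the type reached by `path` is `new_leaf`."""
    if not path:
        return new_leaf
    step, rest = path[0], path[1:]
    kind = root[0]
    if kind == "array" and step == "[]":
        _, elem, length = root
        # Every cell shares one element type, so the element type is rewritten.
        return ("array", replace_at(elem, rest, new_leaf), length)
    if kind == "record":
        fields = dict(root[1])              # preserves the declared field order
        if step not in fields:
            raise KeyError(step)
        fields[step] = replace_at(fields[step], rest, new_leaf)
        return ("record", tuple(fields.items()))
    raise TypeError(f"cannot apply step {step!r} to {kind}")

# a : { l : num[2]; m : alpha[10] } array[5]
a_type = ("array",
          ("record", (("l", ("num", 2)), ("m", ("alpha", 10)))),
          5)

# a[0].l := "boo", where alpha[3] has already been fitted to alpha[2]
updated = replace_at(a_type, ("[]", "l"), ("alpha", 2))
print(updated)
```

The result keeps the array and record layout of `a` and differs from the declared type only in field `l`, which mirrors the annotated form shown in the example of this section.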

2.4 Loops and Convergence

As informally stated in section 1, the type analyzer follows goto and perform statements unless they have already been visited and a convergence in the status of the typing function is detected. In subsection 2.2 we said that this status actually consists of the topological environment Θ. The typing function at step i of the analysis can be defined as a function T taking the statement fetched at that step and the topological environment:

$T(st_{B,p}, \Theta_i) = \Theta_{i+1}$

where $st_{B,p}$ is the statement located within block B at position p.

Each time the typing function encounters a jump statement, it performs a number of operations. Say a jump statement $st_{A,q} \equiv \mathtt{goto}\ l$ is encountered by T at step i while typing block $A = \{st_{A,1} \,..\, st_{A,n}\}$ (with $q \in [1,n]$):

1. it saves the topological environment $\Theta_i$ built up so far, binding it to the current program location;

2. it looks up the destination block of statements from the block environment, hence $B = \{st_{B,1} \,..\, st_{B,m}\} = \Sigma(l)$;

3. it continues the analysis from there, i.e. from statement $st_{B,1}$.

Let's consider that later, at step j (obviously j > i), T reaches the jump statement $st_{A,q}$ again: then the new current topological environment $\Theta_j$ is compared against $\Theta_i$, which had formerly been saved at that program location. If $\Theta_j \sqsubseteq \Theta_i$ (see definition 3.11) then it means that no further type information has been collected during the second pass and we can therefore assume that the analysis can safely skip the jump statement $st_{A,q}$ and continue from $st_{A,q+1}$. Otherwise, the new topological environment $\Theta_j$ is saved (replacing the old $\Theta_i$ previously stored) and the analysis continues from the jump statement destination $st_{B,1}$ again.

We observed that even in complex spaghetti-like scenarios with several goto statements within nested conditional blocks the system detects a convergence pretty soon: on average after 1, and in any case within 3, re-iterations of the same piece of code. The reason is twofold:

• the topological environment cannot by definition be subject to binding removal, hence $\forall x^\kappa \in \Theta_i \,.\, x^\kappa \in \Theta_{i+1}$ at any given step i;

• flow-types bound to variable occurrences in the topological environment can only grow - they can never diminish in width. Given that we're dealing with types and not values, stability is certain: the storage types of variables do not change from pass to pass, and the only thing that could change and modify the status Θ of the typing function T is the flow-item ϕ part of the flow-types bound to variable occurrences. ϕ is defined as a set of storage types τ in table 2.1 and it is subject to a single operation: the merge function of definition 3.9, which basically consists of a set-union between flow-items. Duplicate types can therefore never occur and no element can be removed.
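The jump-handling procedure and its termination argument can be made concrete with a small Python sketch. It is a deliberately reduced model, not the actual analyzer: the statement forms `use`, `cond_assign` and `goto`, the function names, and the reading of the comparison ⊑ as "every binding is already present with a flow-item at least as large" are all assumptions of this sketch (definition 3.11 is not reproduced here).

```python
def leq(new, old):
    """new ⊑ old: `new` carries no binding or type that `old` lacks (assumed reading)."""
    return all(k in old and v <= old[k] for k, v in new.items())

def analyze(blocks, entry, gamma):
    """Type a toy program; returns (Θ, number of jumps actually followed)."""
    gamma = dict(gamma)        # variable -> current flow-item
    theta = {}                 # occurrence label -> flow-item (Θ)
    saved = {}                 # jump location -> Θ snapshot taken there
    jumps = 0
    label, pos = entry, 0
    while pos < len(blocks[label]):
        stmt = blocks[label][pos]
        kind = stmt[0]
        if kind == "goto":
            here = (label, pos)
            if here in saved and leq(theta, saved[here]):
                pos += 1                      # converged: skip the jump
                continue
            saved[here] = dict(theta)         # save (or replace) the snapshot
            label, pos = stmt[1], 0           # follow the jump
            jumps += 1
            continue
        _, occ, var = stmt[:3]
        if kind == "cond_assign":
            # Models `if c then var := e`: the flow-item can only grow.
            gamma[var] = gamma[var] | {stmt[3]}
            theta[occ] = theta.get(occ, frozenset()) | {stmt[3]}
        elif kind == "use":
            theta[occ] = theta.get(occ, frozenset()) | gamma[var]
        pos += 1
    return theta, jumps

blocks = {"loop": [("use", "k1", "x"),
                   ("cond_assign", "k2", "x", "alpha[2]"),
                   ("goto", "loop")]}
theta, jumps = analyze(blocks, "loop", {"x": frozenset({"num[2]"})})
print(jumps, sorted(theta["k1"]))
```

In this toy run the jump is followed on its first visit and once more after the first pass has enlarged the annotation of occurrence `k1`; the next comparison finds nothing new and the jump is skipped. Since annotations are only ever enlarged by set union, the comparison must eventually succeed.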

2.5 Ambiguity

Having non-singleton flow-items within flow-types is indeed a central feature of this system, signaling that the programmer reused a variable in different ways along the program. Nonetheless, that makes judgments for L-values problematic: how are we supposed to type an L-value appearing in an expression, for instance, if its current flow-type says that it could have many storage types at the same time? In fact, we can't - and that is exactly what flow-types are for: detecting anomalous scenarios that may lead to unwanted results at run-time.

In our code example in section 1, imagine the system had output another hint message for the ambiguous statement, claiming that among the possible choices num[3] would have been suitable, and that typing then proceeded selecting num[3] as candidate, leading to a different type for x - not the one shown in the original example.

{
    (x : num[3]) := (x : num[2]) + 1;
    // [WARNING] possible truncation detected
    // in assignment:
    // num[3] :> num[2]
    if (x : num[3]) > 0 then
    {
        (x : alpha[3]) := "foo";
        // [ERROR] truncation detected in
        // assignment:
        // alpha[3] :> num[2]
    }
    (x : num[6]) := (x : num[3]|alpha[3]) + 23;
    // [HINT] type of 'x' is ambiguous in
    // expression at right-hand of
    // assignment: choice num[3]
    // would fit
    // [WARNING] possible truncation detected
    // in assignment:
    // num[6] :> num[2]
}
where x : num[2] := 11

What if more than one type was suitable, though? The flow-type would literally explode in order to track several implications among possible typing paths, and in the end it would hardly be useful.

Our proposal in such situations is to do the simplest thing: falling back to the initial type of the variable and, of course, notifying the choice with a hint message. However, this leads to a duplication of the type rule for variables, as table 4 shows.
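The two cases of the variable rule can be illustrated by a small Python sketch. It assumes the flow-type encoding (flow-item, initialization type) with types written as strings; the function name `resolve` and the wording of the hint are illustrative only and do not reproduce the rules of table 4.

```python
def resolve(flow_type):
    """Pick the storage type used to type a variable occurrence in an expression."""
    item, init = flow_type
    if len(item) == 1:
        # Singleton choice: the current type is unambiguous.
        return next(iter(item)), None
    # Ambiguous choice: fall back to the initialization type and report it.
    choices = "|".join(sorted(item))
    hint = f"type is ambiguous ({choices}): assuming initialization type {init}"
    return init, hint

chosen, hint = resolve((frozenset({"num[3]", "alpha[3]"}), "num[2]"))
print(chosen)
print(hint)
```

The fallback never needs to compare the competing choices with each other, which is what keeps the rule simple.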

3 FORMAL SPECIFICATION

In this section we give the full speciﬁcation of the

type-system described in section 2. A number of def-

initions is given below that will be used by type rules.

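Definition 3.1 (Promote). The promotion $\llbracket\sigma\rrbracket_\tau$ of a temporary type σ to a storage type τ produces a storage type that transforms σ into a storable type inheriting the characteristics of τ. The promotion function is defined as follows (top-down closest-match rule on the left hand holds):

\[
\begin{array}{lcl}
\llbracket \mathtt{num}[\rho_1] \rrbracket_{\mathtt{num}_q[\rho_2]} & = & \mathtt{num}_q[\rho_1] \\
\llbracket \mathtt{num}[\rho] \rrbracket_{\tau} & = & \mathtt{num}_{\mathtt{ascii}}[\rho] \\
\llbracket \mathtt{bool} \rrbracket_{\tau} & = & \llbracket \mathtt{num}[1.0] \rrbracket_{\tau} \\
\llbracket \tau_1 \rrbracket_{\tau_2} & = & \tau_1
\end{array}
\]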
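A possible reading of the promotion function, written as a Python sketch. The encoding is an assumption: abstract numerics are tagged `"absnum"` to tell them apart from storage numerics `("num", qualifier, format)`, and formats are (n, d) pairs. The first case assumes that the promoted type takes its qualifier from the target and keeps its own format; the rules are tried top-down, as the definition requires.

```python
def promote(sigma, tau):
    """Promote a temporary type `sigma` to a storage type, guided by `tau`."""
    if sigma[0] == "absnum":
        rho = sigma[1]
        if tau[0] == "num":
            # Abstract numeric against a storage numeric: inherit its qualifier.
            return ("num", tau[1], rho)
        # Abstract numeric against anything else: default ASCII qualifier.
        return ("num", "ascii", rho)
    if sigma[0] == "bool":
        # A boolean is promoted like the abstract numeric num[1.0].
        return promote(("absnum", (1, 0)), tau)
    # Already a storage type: promotion leaves it untouched.
    return sigma

print(promote(("absnum", (3, 0)), ("num", "bcd", (2, 0))))
```

Promotion only decides the representation; resizing the result to the target is a separate step performed by the fit function.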

Definition 3.2 (Representation). We define a function $rep : \tau \to \mathbb{N}$ for calculating the in-memory byte size of a storage type:

\[
\begin{array}{lcl}
rep(\mathtt{num}_{\mathtt{ascii}}[n.d]) & = & n + d \\
rep(\mathtt{num}_{\mathtt{bcd}}[n.d]) & = & \lceil \frac{n+d+1}{2} \rceil \\
rep(\mathtt{num}_{\mathtt{int}_b}[\rho]) & = & b/8 \\
rep(\mathtt{num}_{\mathtt{float}_b}[\rho]) & = & b/8 \\
rep(\mathtt{alpha}[n]) & = & n \\
rep(\mathtt{alphanum}[n]) & = & n \\
rep(\tau\ \mathtt{array}[n]) & = & rep(\tau) * n \\
rep(\{x_1 : \tau_1 \,..\, x_k : \tau_k\}) & = & \sum_{i=1}^{k} rep(\tau_i)
\end{array}
\]

Definition 3.3 (Subtype). We define a total order between storage types such that the relation $\tau_1 \preceq \tau_2$ holds when $\mathrm{rep}(\tau_1) \le \mathrm{rep}(\tau_2)$.
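As a concrete reading of the representation function and of the size-based subtype relation, the following Python sketch computes byte sizes for a small encoding of storage types. The class names (`Num`, `Alpha`, `Alphanum`, `Array`, `Record`) and the helper `is_subtype` are assumptions of this sketch, not the prototype's data structures; the binary `int` and `float` formats are left out because their bit width is not modelled here.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Num:
    fmt: str    # "ascii" or "bcd" (binary formats are omitted in this sketch)
    n: int      # number of integer digits
    d: int = 0  # number of fractional digits


@dataclass(frozen=True)
class Alpha:
    n: int


@dataclass(frozen=True)
class Alphanum:
    n: int


@dataclass(frozen=True)
class Array:
    elem: object
    n: int


@dataclass(frozen=True)
class Record:
    fields: Tuple[Tuple[str, object], ...]  # (label, type) pairs, in order


def rep(t) -> int:
    """In-memory byte size of a storage type."""
    if isinstance(t, Num):
        if t.fmt == "ascii":
            return t.n + t.d                       # one byte per digit
        if t.fmt == "bcd":
            return math.ceil((t.n + t.d + 1) / 2)  # two digits per byte, plus sign
        raise ValueError(f"format not modelled in this sketch: {t.fmt}")
    if isinstance(t, (Alpha, Alphanum)):
        return t.n
    if isinstance(t, Array):
        return rep(t.elem) * t.n
    if isinstance(t, Record):
        return sum(rep(ft) for _, ft in t.fields)
    raise TypeError(f"not a storage type: {t!r}")


def is_subtype(t1, t2) -> bool:
    """t1 is below t2 in the size ordering when it needs no more bytes than t2."""
    return rep(t1) <= rep(t2)


a = Array(Alphanum(2), 4)
r = Record((("x", Num("ascii", 3)), ("y", Alphanum(2)), ("z", Num("ascii", 2))))
print(rep(a), rep(r), is_subtype(r, a))
```

Since the ordering only looks at sizes, two structurally different types of equal size compare as equivalent, which is what lets the analyzer talk about truncation independently of the storage class.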

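Definition 3.4 (Var-Bound Substitution). A substitution $\theta_{x,\kappa}$ is a function from storage types to storage types that carries along a labeled identifier $x_\kappa$, which stands for the variable occurrence whose type the substitution has been built from and is supposed to replace. Substitution functions are recursively defined by the type rules for L-Values, as shown in table 4. They are meant for generically replacing a term nested within a storage type of arbitrary complexity, by reproducing its original structure of recursive type terms and changing the innermost part only.

Definition 3.5 (Fit). The fit $\lfloor \tau_1 \rfloor_{\tau}$ of a storage type $\tau_1$ to a storage type $\tau$ produces a storage type whose storage class is equivalent to that of $\tau_1$ and whose size fits into that of $\tau$. The fit function is defined as follows:

\[
\begin{aligned}
\lfloor \mathtt{num}_q[\rho] \rfloor_{\tau} &= \mathtt{num}_q[\rho']
  && \text{for some } \rho' \text{ such that } \mathrm{rep}(\mathtt{num}_q[\rho']) = \mathrm{rep}(\tau) \\
\lfloor \mathtt{alpha}[n] \rfloor_{\tau} &= \mathtt{alpha}[n']
  && \text{for some } n' \text{ such that } \mathrm{rep}(\mathtt{alpha}[n']) = \mathrm{rep}(\tau) \\
\lfloor \mathtt{alphanum}[n] \rfloor_{\tau} &= \mathtt{alphanum}[n']
  && \text{for some } n' \text{ such that } \mathrm{rep}(\mathtt{alphanum}[n']) = \mathrm{rep}(\tau) \\
\lfloor \tau_1\ \mathtt{array}[n] \rfloor_{\tau} &= \tau_1'\ \mathtt{array}[n']
  && \text{for some } \tau_1' \text{ and } n' \text{ such that } \mathrm{rep}(\tau_1'\ \mathtt{array}[n']) = \mathrm{rep}(\tau) \\
\lfloor \{l_1 : \tau_1 \,..\, l_n : \tau_n\} \rfloor_{\tau} &= \{l_1 : \tau_1' \,..\, l_n : \tau_n'\}
  && \text{for some } \tau_1' \,..\, \tau_n' \text{ such that } \mathrm{rep}(\{l_1 : \tau_1' \,..\, l_n : \tau_n'\}) = \mathrm{rep}(\tau)
\end{aligned}
\]

Definition 3.6 (Coerce). The coerce function $\mathcal{C}$ updates the given type and topological environments by applying a given substitution function $\theta_{x,\kappa}$ to the types a given flow-item $\varphi$ consists of; it produces a new pair of form $\langle\Gamma;\Theta\rangle$ consisting of the type and topological environments endowed with updated bindings for the variable $x$ and the occurrence label $\kappa$ annotated on the substitution function $\theta_{x,\kappa}$ itself:

\[
\mathcal{C}(\varphi, \theta_{x,\kappa}, \Gamma, \Theta) = \langle \Gamma, x : \Phi' ;\ \Theta, \kappa : \Phi' \rangle
\]

where

\[
\langle \varphi_0; \tau_0 \rangle = \Gamma(x)
\qquad
\Phi' = \langle \{ \tau_i' \mid \forall \tau_i \in \varphi.\ \tau_i' = \lfloor \theta_{x,\kappa}(\tau_i) \rfloor_{\tau_0} \};\ \tau_0 \rangle
\]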
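To make the role of substitutions more tangible, here is a minimal Python sketch of how a substitution could be built while descending into an L-value, following the shape of the L-value rules of table 4 (identity for a variable, re-wrapping into the array for a subscript, re-building the record for a field selection). The tuple encoding of storage types and the function names are assumptions of this sketch; the fit step of the coerce function is deliberately left out.

```python
# Storage types are encoded as plain tuples (an assumption of this sketch):
#   ("num", n)   ("alphanum", n)   ("array", elem, n)   ("record", ((label, type), ...))


def var_subst():
    """Substitution for a plain variable: the new type replaces the whole type."""
    return lambda t: t


def subscript_subst(theta, n):
    """Substitution for lv[e]: wrap the new element type back into the array of size n."""
    return lambda t: theta(("array", t, n))


def select_subst(theta, fields, z):
    """Substitution for lv.z: rebuild the record, changing only the type of field z."""
    def theta2(t):
        new_fields = tuple((label, t if label == z else ft) for label, ft in fields)
        return theta(("record", new_fields))
    return theta2


# Hypothetical variable r : { x : num[3]; y : alphanum[2] array[4] }
r_fields = (("x", ("num", 3)), ("y", ("array", ("alphanum", 2), 4)))

theta_r = var_subst()                             # substitution for r
theta_ry = select_subst(theta_r, r_fields, "y")   # substitution for r.y
theta_rye = subscript_subst(theta_ry, 4)          # substitution for r.y[e]

# Assigning a num[2] value to r.y[e] changes only the innermost part of r's type.
print(theta_rye(("num", 2)))
```

Each substitution is a closure over the one built for the enclosing L-value, so the original nesting of the type is reproduced automatically when the innermost term is replaced.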

Definition 3.7 (Root Variable). Given an L-value lv, its root variable is the identifier x computed by the recursive function ℜ defined as:

ℜ(x) = x

ℜ(lv[e]) = ℜ(lv)

ℜ(lv.l) = ℜ(lv)
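The three equations translate directly into a recursive function over an L-value syntax tree. The sketch below is in Python; the node classes `Var`, `Subscript` and `Select` are hypothetical names chosen for this illustration.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Subscript:
    lv: "LValue"   # the indexed L-value
    index: object  # the subscript expression (not inspected here)


@dataclass(frozen=True)
class Select:
    lv: "LValue"   # the record L-value
    label: str     # the selected field


LValue = Union[Var, Subscript, Select]


def root(lv: LValue) -> str:
    """Return the identifier at the root of an L-value."""
    if isinstance(lv, Var):
        return lv.name
    if isinstance(lv, (Subscript, Select)):
        return root(lv.lv)
    raise TypeError(f"not an L-value: {lv!r}")


print(root(Select(Subscript(Var("r"), 3), "z")))  # L-value r[3].z
```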

Definition 3.8 (Signature). A signature is a pair $\langle Y_p; \Gamma_p \rangle$ where $p$ is a procedure name, $Y_p$ are its formal parameters $y_1 : \tau_1 \,..\, y_n : \tau_n$ and $\Gamma_p$ is the output type environment returned by typing the body of $p$.

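Definition 3.9 (Type Environment Merge). The binary function $\oplus$ merges two given type environments into one as follows:

\[
\Gamma_1 \oplus \Gamma_2 = \Gamma^{*} \cup (\Gamma_1 \setminus \Gamma_2) \cup (\Gamma_2 \setminus \Gamma_1)
\]

where

\[
\Gamma^{*} = \{ x : \langle \varphi_1 \cup \varphi_2; \tau_1 \rangle \mid \Gamma_1(x) = \langle \varphi_1; \tau_1 \rangle \wedge \Gamma_2(x) = \langle \varphi_2; \tau_2 \rangle \wedge \tau_1 = \tau_2 \}
\]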
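A possible reading of the merge operator in executable form is sketched below in Python. Environments are modelled as dictionaries from variable names to pairs (set of current types, initial type); this encoding, the name `merge`, and the interpretation of the set differences as "variables bound in one environment only" are assumptions of the sketch.

```python
def merge(g1: dict, g2: dict) -> dict:
    """Merge two type environments of the form {name: (frozenset_of_types, initial_type)}."""
    merged = {}
    for x in g1.keys() | g2.keys():
        if x in g1 and x in g2:
            (phi1, t1), (phi2, t2) = g1[x], g2[x]
            if t1 == t2:
                # Bound in both with the same initial type: join the current types.
                merged[x] = (phi1 | phi2, t1)
            # Bindings whose initial types differ are in none of the three sets
            # of the definition, so they are dropped here as well.
        else:
            # Bound in one environment only: keep the binding as it is.
            merged[x] = g1[x] if x in g1 else g2[x]
    return merged


# Environments before and after the 'then' branch of the running example.
before_if = {"x": (frozenset({"num[3]"}), "num[2]")}
after_then = {"x": (frozenset({"alpha[3]"}), "num[2]")}

print(sorted(merge(before_if, after_then)["x"][0]))
```

Merging the environment that enters the conditional with the one leaving its branch yields the flow-type `num[3]|alpha[3]` shown for x right after the `if` in the example of section 2.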

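Definition 3.10 (Partial Ordering of Flow-Types). We define a partial order between flow-types such that $\Phi_1 \sqsubseteq \Phi_2$ holds when, letting $\Phi_1 = \langle \varphi_1; \tau_1 \rangle$ and $\Phi_2 = \langle \varphi_2; \tau_2 \rangle$, we have $\varphi_1 \subseteq \varphi_2 \wedge \tau_1 = \tau_2$.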

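Definition 3.11 (Partial Ordering of Topological Environments). We define a partial order between topological environments such that $\Theta_1 \sqsubseteq \Theta_2$ holds when $\forall x : \Phi_1 \in \Theta_1.\ x \in \mathrm{dom}(\Theta_2) \wedge \Phi_1 \sqsubseteq \Phi_2$, where $\Phi_2 = \Theta_2(x)$.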
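The two orderings are easy to check mechanically, and they give a natural stopping criterion when a portion of code has to be visited more than once. The Python sketch below shows both checks and a simple re-visit loop; the loop itself, the function names and the toy `step` function are assumptions made for illustration and are not taken from the rules above.

```python
def flow_leq(f1, f2) -> bool:
    """Flow-type ordering: fewer (or equal) current types, same initial type."""
    (phi1, t1), (phi2, t2) = f1, f2
    return phi1 <= phi2 and t1 == t2


def env_leq(th1: dict, th2: dict) -> bool:
    """Topological environment ordering, checked binding by binding."""
    return all(k in th2 and flow_leq(f, th2[k]) for k, f in th1.items())


def visit_until_stable(step, theta: dict, max_visits: int = 10):
    """Re-apply `step` until its output adds nothing new; return (result, visits)."""
    for visits in range(1, max_visits + 1):
        new_theta = step(theta)
        if env_leq(new_theta, theta):
            return theta, visits
        theta = new_theta
    raise RuntimeError("no convergence within the allowed number of visits")


def step(theta: dict) -> dict:
    """Toy loop body: the occurrence labelled k1 may also hold a num[3]."""
    out = dict(theta)
    phi, init = out["k1"]
    out["k1"] = (phi | {"num[3]"}, init)
    return out


start = {"k1": (frozenset({"num[2]"}), "num[2]")}
final, visits = visit_until_stable(step, start)
print(sorted(final["k1"][0]), visits)
```

In this toy run the first visit adds `num[3]` and the second one adds nothing, so the loop stops after two visits, i.e. one re-visit.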

3.1 Type Rules

Syntax-directed type rules are divided by category. Rules for Programs are shown in table 1, for Statements in table 2, for Expressions in table 6, for Arguments in table 3 and for Literals in table 5.

Most judgements give a type to a term of the language in a context consisting of a tuple of environments, and output the updated Γ and Θ; the exceptions are the judgements for Statements and Programs, which give no type and simply update the environments. As a general rule, the topological environment Θ is always forwarded to and returned by all judgements (except those for literals), because flow-types must be annotated recursively on each variable occurring in any subterm of the program. The type environment Γ, instead, is output only by the rules that actually update it: when it is not mentioned among the outputs, consider it returned untouched.

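Judgements are of a number of forms, each syntactic category having its own, though most of them are quite self-explanatory. For example, $\Pi;\Sigma;\Gamma;\Theta_0 \vdash e : \sigma \lhd \Theta_1$ denotes that, in the given environments, expression $e$ is given a temporary type $\sigma$ and the topological environment $\Theta_1$ is output.

Table 1: Type Rules for Programs and Body.

\[
\text{MAIN}\quad
\dfrac{\Pi;\emptyset;\Theta_0 \vdash B \lhd \Gamma;\Theta_1}
      {\Pi;\Theta_0 \vdash B \lhd \Theta_1}
\]

\[
\text{PROC}\quad
\dfrac{\begin{array}{c}
\Gamma_0 = \emptyset, y_1 : \langle\{\tau_1\};\tau_1\rangle \,..\, y_n : \langle\{\tau_n\};\tau_n\rangle \\
\Pi;\Gamma_0;\Theta_0 \vdash B \lhd \Gamma';\Theta_1 \\
\Pi, p \mapsto \langle y_1 : \tau_1 \,..\, y_n : \tau_n;\Gamma'\rangle;\Theta_1 \vdash P \lhd \Theta_2
\end{array}}
{\Pi;\Theta_0 \vdash \mathtt{proc}\ p(y_1 : \tau_1 \,..\, y_n : \tau_n)\ B\ \mathtt{in}\ P \lhd \Theta_2}
\]

\[
\text{BODY}\quad
\dfrac{\begin{array}{c}
\forall i \in [1,n].\ \Pi;\Sigma;\Gamma_0;\Theta_0 \vdash_{lit} lit_i : \sigma_i \ \wedge\ [\![\sigma_i]\!]_{\tau_i} \preceq \tau_i \\
\Pi;\emptyset;\Gamma_0, x_1 : \langle\{\tau_1\};\tau_1\rangle \,..\, x_n : \langle\{\tau_n\};\tau_n\rangle;\Theta_0 \vdash st \lhd \Gamma_1;\Theta_1
\end{array}}
{\Pi;\Gamma_0;\Theta_0 \vdash st\ \mathtt{where}\ x_1 : \tau_1 := lit_1 \,..\, x_n : \tau_n := lit_n \lhd \Gamma_1;\Theta_1}
\]

Table 2: Type Rules for Statements.

\[
\text{ASSIGN}\quad
\dfrac{\begin{array}{c}
\Pi;\Sigma;\Gamma_0;\Theta_0 \vdash e : \sigma \lhd \Theta_1 \qquad
\Pi;\Sigma;\Gamma_0;\Theta_1 \vdash lv : \tau \backslash \theta_{x,\kappa} \lhd \Theta_2 \\
x = \Re(lv) \qquad
\langle\Gamma_1;\Theta_3\rangle = \mathcal{C}([\![\sigma]\!]_{\tau}, \theta_{x,\kappa}, \Gamma_0, \Theta_2)
\end{array}}
{\Pi;\Sigma;\Gamma_0;\Theta_0 \vdash lv := e \lhd \Gamma_1;\Theta_3}
\]

\[
\text{IF}\quad
\dfrac{\begin{array}{c}
\Pi;\Sigma;\Gamma_0;\Theta_0 \vdash e : \mathtt{bool} \lhd \Theta_1 \qquad
\Pi;\Sigma;\Gamma_0;\Theta_1 \vdash st \lhd \Gamma_1;\Theta_2 \\
\Gamma_2 = \Gamma_0 \oplus \Gamma_1
\end{array}}
{\Pi;\Sigma;\Gamma_0;\Theta_0 \vdash \mathtt{if}\ e\ \mathtt{then}\ st \lhd \Gamma_2;\Theta_2}
\]

\[
\text{IF-ELSE}\quad
\dfrac{\begin{array}{c}
\Pi;\Sigma;\Gamma_0;\Theta_0 \vdash e : \mathtt{bool} \lhd \Theta_1 \\
\Pi;\Sigma;\Gamma_0;\Theta_1 \vdash st_1 \lhd \Gamma_1;\Theta_2 \qquad
\Pi;\Sigma;\Gamma_0;\Theta_2 \vdash st_2 \lhd \Gamma_2;\Theta_3 \\
\Gamma_3 = \Gamma_1 \oplus \Gamma_2
\end{array}}
{\Pi;\Sigma;\Gamma_0;\Theta_0 \vdash \mathtt{if}\ e\ \mathtt{then}\ st_1\ \mathtt{else}\ st_2 \lhd \Gamma_3;\Theta_3}
\]

\[
\text{PERFORM}\quad
\dfrac{\Pi;\Sigma;\Gamma_0;\Theta_0 \vdash \Sigma(l) \lhd \Gamma_1;\Theta_1}
      {\Pi;\Sigma;\Gamma_0;\Theta_0 \vdash \mathtt{perform}\ l \lhd \Gamma_1;\Theta_1}
\]

\[
\text{PERFORM-THRU}\quad
\dfrac{\forall i \in [a,b).\ \Pi;\Sigma;\Gamma_{i-a};\Theta_{i-a} \vdash \Sigma(l_i) \lhd \Gamma_{i-a+1};\Theta_{i-a+1}}
      {\Pi;\Sigma;\Gamma_0;\Theta_0 \vdash \mathtt{perform}\ l_{a..b} \lhd \Gamma_{b-a-1};\Theta_{b-a-1}}
\]

\[
\text{GOTO}\quad
\dfrac{\begin{array}{c}
l_n \in \mathrm{dom}(\Sigma) \mid \nexists\, l_m.\ m > n \\
\forall i \in [k,n].\ \Pi;\Sigma;\Gamma_{i-k};\Theta_{i-k} \vdash \Sigma(l_i) \lhd \Gamma_{i-k+1};\Theta_{i-k+1}
\end{array}}
{\Pi;\Sigma;\Gamma_0;\Theta_0 \vdash \mathtt{goto}\ l_k \lhd \Gamma_{n-k};\Theta_{n-k}}
\]

\[
\text{CALL}\quad
\dfrac{\begin{array}{c}
\langle y_1 : \tau_1 \,..\, y_n : \tau_n;\Gamma_p\rangle = \Pi(p) \\
\forall i \in [1,n].\ \Pi;\Sigma;\Gamma_{i-1};\Theta_{i-1} \vdash a_i : \tau_i \lhd \Gamma_i;\Theta_i
\end{array}}
{\Pi;\Sigma;\Gamma_0;\Theta_0 \vdash p(a_1 \,..\, a_n) \lhd \Gamma_n;\Theta_n}
\]

\[
\text{BLOCK}\quad
\dfrac{\begin{array}{c}
\Sigma' = \Sigma, l_j \mapsto \{st_{j,1} \,..\, st_{j,n}\} \,..\ (\forall j \mid st_{0,j} \equiv l_j : \{st_{j,1} \,..\, st_{j,n}\}) \\
\forall i \in [1,n].\ \Pi;\Sigma';\Gamma_{i-1};\Theta_{i-1} \vdash st_{0,i} \lhd \Gamma_i;\Theta_i
\end{array}}
{\Pi;\Sigma;\Gamma_0;\Theta_0 \vdash [l:]\ \{\, st_{0,1} \,..\, st_{0,n} \,\} \lhd \Gamma_n;\Theta_n}
\]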

Judgements for Arguments need some extra words. Call-by-reference calls need to update the type environment of the caller, because the flow-type of an argument might be modified by the invoked procedure. The procedure environment $\Pi$ stores the type environment $\Gamma_p$ for each procedure $p$ of the program, thus the flow-type of a variable passed by reference to $p$ can be updated according to the flow-type of the corresponding formal parameter bound in $\Gamma_p$. This update is carried out by the coerce function $\mathcal{C}$, as shown by rule BYREF in table 3. The mechanism resembles that of rule ASSIGN in table 2: call-by-reference argument application indeed behaves like an assignment (call-by-value does not).

Rules for Arguments have the form $\Pi;\Sigma;\Gamma_0;\Theta_0 \vdash a : \tau_i \lhd \Gamma_1;\Theta_1$, meaning that, in the given environments, the actual argument $a$ has type $\tau_i$, which is the type of the $i$-th formal parameter of procedure $p$.

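As a final notice, for the sake of simplicity we assume that all labels in the program are named in order of occurrence: if $l_m$ and $l_n$ are two labels and $m > n$, then $l_m$ appears below $l_n$ in the program. That makes the type rules for jump statements simpler.

Table 3: Type Rules for Arguments.

\[
\text{BYVAL}\quad
\dfrac{\Pi;\Sigma;\Gamma_0;\Theta_0 \vdash e : \sigma \lhd \Theta_1 \qquad [\![\sigma]\!]_{\tau_i} \preceq \tau_i}
      {\Pi;\Sigma;\Gamma_0;\Theta_0 \vdash \mathtt{val}\ e : \tau_i \lhd \Gamma_0;\Theta_1}
\]

\[
\text{BYREF}\quad
\dfrac{\begin{array}{c}
\Pi;\Sigma;\Gamma_0;\Theta_0 \vdash lv : \tau' \backslash \theta_{x,\kappa} \lhd \Theta_1 \qquad
x = \Re(lv) \qquad \tau' \preceq \tau_i \\
\langle y_1 : \tau_1 \,..\, y_n : \tau_n;\Gamma_p\rangle = \Pi(p) \qquad
\langle \varphi_i;\tau_i\rangle = \Gamma_p(y_i) \\
\langle\Gamma_1;\Theta_2\rangle = \mathcal{C}(\varphi_i, \theta_{x,\kappa}, \Gamma_0, \Theta_1)
\end{array}}
{\Pi;\Sigma;\Gamma_0;\Theta_0 \vdash \mathtt{ref}\ lv : \tau_i \lhd \Gamma_1;\Theta_2}
\]

Table 4: Type Rules for L-Values.

\[
\text{VAR-INIT}\quad
\dfrac{\Gamma(x) = \Phi = \langle\{\tau_1 \,..\, \tau_n\};\tau_0\rangle \qquad
       \Theta_1 = \Theta_0, \kappa : \Phi \qquad
       \theta_{x,\kappa}(\tau) = \tau}
      {\Pi;\Sigma;\Gamma;\Theta_0 \vdash x_\kappa : \tau_0 \backslash \theta_{x,\kappa} \lhd \Theta_1}
\]

\[
\text{VAR-CURR}\quad
\dfrac{\Gamma(x) = \Phi = \langle\{\tau_1\};\tau_0\rangle \qquad
       \Theta_1 = \Theta_0, \kappa : \Phi \qquad
       \theta_{x,\kappa}(\tau) = \tau}
      {\Pi;\Sigma;\Gamma;\Theta_0 \vdash x_\kappa : \tau_1 \backslash \theta_{x,\kappa} \lhd \Theta_1}
\]

\[
\text{SUBSCRIPT}\quad
\dfrac{\begin{array}{c}
\Pi;\Sigma;\Gamma;\Theta_0 \vdash e : \mathtt{num}[\rho] \lhd \Theta_1 \qquad
\Pi;\Sigma;\Gamma;\Theta_1 \vdash lv : \tau\ \mathtt{array}[n] \backslash \theta_{x,\kappa} \lhd \Theta_2 \\
x = \Re(lv) \qquad
\theta'_{x,\kappa}(\tau) = \theta_{x,\kappa}(\tau\ \mathtt{array}[n])
\end{array}}
{\Pi;\Sigma;\Gamma;\Theta_0 \vdash lv[e] : \tau \backslash \theta'_{x,\kappa} \lhd \Theta_2}
\]

\[
\text{SELECT}\quad
\dfrac{\begin{array}{c}
\Pi;\Sigma;\Gamma;\Theta_0 \vdash lv : \{z_1 : \tau_1 \,..\, z : \tau \,..\, z_n : \tau_n\} \backslash \theta_{x,\kappa} \lhd \Theta_1 \\
x = \Re(lv) \qquad
\theta'_{x,\kappa}(\tau) = \theta_{x,\kappa}(\{z_1 : \tau_1 \,..\, z : \tau \,..\, z_n : \tau_n\})
\end{array}}
{\Pi;\Sigma;\Gamma;\Theta_0 \vdash lv.z : \tau \backslash \theta'_{x,\kappa} \lhd \Theta_1}
\]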

4 RESULTS AND CONCLUSIONS

Our implementation of the system, as already said, can detect a number of type misuses and mismatches, besides producing flow-type annotations for each variable occurrence. At the time of writing, several tests have been run over real-world legacy business code, mainly written in COBOL85 for z/OS during the 1990s and owned by a big local company within the mechanical vehicle industry. The following considerations and evidence have emerged:

• variable reuse involves up to 30% of overall variable usage in COBOL programs
  – nearly 90% of these, though, accumulate fewer than 5 storage types simultaneously within their flow-type; 3 on average
  – the remaining 10%, however, are unlikely to grow wider than 8
  – 75% of non-singleton flow-types indicate reuse of numeric types
    ∗ 80% of these come from in-place arithmetic operations possibly exceeding the target variable space, such as the typical scenario (x : num[3]) := (x : num[2]) + 1
    ∗ probably few of such operations are actually risky at run-time, because programmers typically declare pictures wider than actually needed for their numerics
    ∗ the remaining 20% are re-assignments or data movements, i.e. assignments where the variables on the right-hand side do not appear on the left-hand side
• 25% of non-singleton flow-types indicate reuse of non-numeric types
  – 70% of these are alphanumeric-string-to-array type switches and vice versa
  – 10% involve complex data types, such as nested records overlapping arrays
  – only 2% occur between incompatible types, thus probably leading to data corruption and bugs
  – the remaining 18% involve data movements implying no truncation, which might be bad code but does not lead to unwanted run-time behaviors

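• 80% of jump statements require up to 3 visits (including the first one, hence 2 re-visits) to reach a convergence in the typing function status; 2 on average, hence 1 re-visit
  – 98% of those are actually pretty ordinary loops coming from COBOL iterative constructs; just 2% are unusual custom cycles created by the programmer
  – the remaining 20% of jump statements need up to 5 visits before a convergence occurs
  – 70% of the latter are actually just nested conditional loops that COBOL iterative constructs cannot express and that are explicitly written by programmers via IF and GOTO statements.

Table 5: Type Rules for Literals.

\[
\text{NUM-U}\quad
\dfrac{n = \mathrm{len}(n_1) \qquad d = \mathrm{len}(n_2)}
      {\Pi;\Sigma;\Gamma;\Theta \vdash_{lit} n_1[.n_2] : \mathtt{num}[n.d]}
\qquad
\text{NUM}\quad
\dfrac{n = \mathrm{len}(n_1) \qquad d = \mathrm{len}(n_2)}
      {\Pi;\Sigma;\Gamma;\Theta \vdash_{lit} -n_1[.n_2] : \mathtt{num}[Sn.d]}
\]

\[
\text{STRING-ALPHANUM}\quad
\dfrac{\{0 \,..\, 9\} \cap \texttt{"str.."} \neq \emptyset \qquad n = \mathrm{len}(\texttt{"str.."})}
      {\Pi;\Sigma;\Gamma;\Theta \vdash_{lit} \texttt{"str.."} : \mathtt{alphanum}[n]}
\qquad
\text{STRING-ALPHA}\quad
\dfrac{n = \mathrm{len}(\texttt{"str.."})}
      {\Pi;\Sigma;\Gamma;\Theta \vdash_{lit} \texttt{"str.."} : \mathtt{alpha}[n]}
\]

\[
\text{TRUE}\quad
\dfrac{}{\Pi;\Sigma;\Gamma;\Theta \vdash_{lit} \mathtt{true} : \mathtt{bool}}
\qquad
\text{FALSE}\quad
\dfrac{}{\Pi;\Sigma;\Gamma;\Theta \vdash_{lit} \mathtt{false} : \mathtt{bool}}
\]

Table 6: Type Rules for Expressions.

\[
\text{DEMOTE-NUM}\quad
\dfrac{\Pi;\Sigma;\Gamma;\Theta_0 \vdash e : \mathtt{num}_q[\rho] \lhd \Theta_1}
      {\Pi;\Sigma;\Gamma;\Theta_0 \vdash e : \mathtt{num}[\rho] \lhd \Theta_1}
\qquad
\dfrac{\Pi;\Sigma;\Gamma;\Theta_0 \vdash lv : \tau \backslash \theta_{x,\kappa} \lhd \Theta_1}
      {\Pi;\Sigma;\Gamma;\Theta_0 \vdash lv : \tau \lhd \Theta_1}
\qquad
\text{LIT}\quad
\dfrac{\Pi;\Sigma;\Gamma;\Theta \vdash_{lit} lit : \sigma}
      {\Pi;\Sigma;\Gamma;\Theta \vdash lit : \sigma \lhd \Theta}
\]

\[
\text{NEG-S}\quad
\dfrac{\Pi;\Sigma;\Gamma;\Theta_0 \vdash e : \mathtt{num}[Sn.d] \lhd \Theta_1}
      {\Pi;\Sigma;\Gamma;\Theta_0 \vdash -e : \mathtt{num}[Sn.d] \lhd \Theta_1}
\qquad
\text{NEG-U}\quad
\dfrac{\Pi;\Sigma;\Gamma;\Theta_0 \vdash e : \mathtt{num}[n.d] \lhd \Theta_1}
      {\Pi;\Sigma;\Gamma;\Theta_0 \vdash -e : \mathtt{num}[Sn.d] \lhd \Theta_1}
\qquad
\text{NOT}\quad
\dfrac{\Pi;\Sigma;\Gamma;\Theta_0 \vdash e : \mathtt{bool} \lhd \Theta_1}
      {\Pi;\Sigma;\Gamma;\Theta_0 \vdash \mathtt{not}\ e : \mathtt{bool} \lhd \Theta_1}
\]

\[
\text{PLUS-U}\quad
\dfrac{\begin{array}{c}
\Pi;\Sigma;\Gamma;\Theta_0 \vdash e_1 : \mathtt{num}[n_1.d_1] \lhd \Theta_1 \qquad
\Pi;\Sigma;\Gamma;\Theta_1 \vdash e_2 : \mathtt{num}[n_2.d_2] \lhd \Theta_2 \\
n = \max(n_1, n_2) \qquad d = \max(d_1, d_2)
\end{array}}
{\Pi;\Sigma;\Gamma;\Theta_0 \vdash e_1 + e_2 : \mathtt{num}[Sn.d] \lhd \Theta_2}
\]

\[
\text{PLUS-MINUS-S}\quad
\dfrac{\begin{array}{c}
\Pi;\Sigma;\Gamma;\Theta_0 \vdash e_1 : \mathtt{num}[S_1 n_1.d_1] \lhd \Theta_1 \qquad
\Pi;\Sigma;\Gamma;\Theta_1 \vdash e_2 : \mathtt{num}[S_2 n_2.d_2] \lhd \Theta_2 \\
S = S_1 \vee S_2 \qquad n = \max(n_1, n_2) + 1 \qquad d = \max(d_1, d_2)
\end{array}}
{\Pi;\Sigma;\Gamma;\Theta_0 \vdash e_1\ (+ \mid -)\ e_2 : \mathtt{num}[Sn.d] \lhd \Theta_2}
\]

\[
\text{MULT}\quad
\dfrac{\begin{array}{c}
\Pi;\Sigma;\Gamma;\Theta_0 \vdash e_1 : \mathtt{num}[S_1 n_1.d_1] \lhd \Theta_1 \qquad
\Pi;\Sigma;\Gamma;\Theta_1 \vdash e_2 : \mathtt{num}[S_2 n_2.d_2] \lhd \Theta_2 \\
S = S_1 \vee S_2 \qquad n = n_1 + n_2 \qquad d = d_1 + d_2
\end{array}}
{\Pi;\Sigma;\Gamma;\Theta_0 \vdash e_1 * e_2 : \mathtt{num}[Sn.d] \lhd \Theta_2}
\]

\[
\text{DIV}\quad
\dfrac{\begin{array}{c}
\Pi;\Sigma;\Gamma;\Theta_0 \vdash e_1 : \mathtt{num}[S_1 n_1.d_1] \lhd \Theta_1 \qquad
\Pi;\Sigma;\Gamma;\Theta_1 \vdash e_2 : \mathtt{num}[S_2 n_2.d_2] \lhd \Theta_2 \\
S = S_1 \vee S_2 \qquad n = n_1 + d_2 \qquad d = d_1 + n_2
\end{array}}
{\Pi;\Sigma;\Gamma;\Theta_0 \vdash e_1 / e_2 : \mathtt{num}[Sn.d] \lhd \Theta_2}
\]

\[
\text{BIN-REL-NUM}\quad
\dfrac{\Pi;\Sigma;\Gamma;\Theta_0 \vdash e_1 : \mathtt{num}[S_1 n_1.d_1] \lhd \Theta_1 \qquad
       \Pi;\Sigma;\Gamma;\Theta_1 \vdash e_2 : \mathtt{num}[S_2 n_2.d_2] \lhd \Theta_2}
      {\Pi;\Sigma;\Gamma;\Theta_0 \vdash e_1\ \mathit{op}\ e_2 : \mathtt{bool} \lhd \Theta_2}
\]

\[
\text{BIN-REL-ALPHANUM}\quad
\dfrac{\Pi;\Sigma;\Gamma;\Theta_0 \vdash e_1 : \mathtt{alphanum}[n_1] \lhd \Theta_1 \qquad
       \Pi;\Sigma;\Gamma;\Theta_1 \vdash e_2 : \mathtt{alphanum}[n_2] \lhd \Theta_2}
      {\Pi;\Sigma;\Gamma;\Theta_0 \vdash e_1\ \mathit{op}\ e_2 : \mathtt{bool} \lhd \Theta_2}
\]

\[
\text{BIN-LOGIC}\quad
\dfrac{\Pi;\Sigma;\Gamma;\Theta_0 \vdash e_1 : \mathtt{bool} \lhd \Theta_1 \qquad
       \Pi;\Sigma;\Gamma;\Theta_1 \vdash e_2 : \mathtt{bool} \lhd \Theta_2}
      {\Pi;\Sigma;\Gamma;\Theta_0 \vdash e_1\ \mathit{op}\ e_2 : \mathtt{bool} \lhd \Theta_2}
\]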
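The size bookkeeping performed by the numeric rules of table 6 can be illustrated with a short Python sketch. It follows the widening of rule PLUS-MINUS-S (one extra integer digit) and the digit sums of rules MULT and DIV; the class `NumT`, the function names and, in particular, the pairing of operands in the division formula are assumptions of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NumT:
    n: int                # integer digits
    d: int = 0            # fractional digits
    signed: bool = False  # the S flag of num[Sn.d]


def plus_minus(a: NumT, b: NumT) -> NumT:
    """Addition or subtraction: widest operand plus one digit for the carry."""
    return NumT(max(a.n, b.n) + 1, max(a.d, b.d), a.signed or b.signed)


def mult(a: NumT, b: NumT) -> NumT:
    """Multiplication: integer and fractional digits add up."""
    return NumT(a.n + b.n, a.d + b.d, a.signed or b.signed)


def div(a: NumT, b: NumT) -> NumT:
    """Division, assuming n = n1 + d2 and d = d1 + n2."""
    return NumT(a.n + b.d, a.d + b.n, a.signed or b.signed)


def may_truncate(source: NumT, target: NumT) -> bool:
    """True when the computed value may need more digits than the target offers."""
    return source.n > target.n or source.d > target.d


x = NumT(2)     # x : num[2]
one = NumT(1)   # the literal 1 : num[1]
result = plus_minus(x, one)
print(result, may_truncate(result, x))
```

Applying the same widening to the unsigned operands of the recurring scenario `x := x + 1` with `x : num[2]` gives `num[3]`, one digit more than the declared picture, which is the situation the truncation warnings point at.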

All this suggests that type-flow analysis is actually able to detect a number of possible errors in COBOL programs coming from bad reuse of variables or incompatible data movements. Both lead to data truncation or corruption, which are major sources of run-time bugs. Moreover, the statistics above do not differ much from those collected and shown by (Moonen, 2003).

The following example shows how a data move from a smaller type to a larger one might lead to unwanted scenarios in which previous data is not entirely replaced by the new data:

    {
        a := r;
        // [WARNING] reverse subsumption detected in assignment: right-hand type
        //           is smaller than left-hand type
        n := a[3];
        // [WARNING] possible access to corrupted data: accessing 'a' with its
        //           initialization type 'alphanum[2] array[4]' but its
        //           content and type have changed
    }
    where a : alphanum[2] array[4];
          r : { x : num[3];
                y : alphanum[2];
                z : num[2] };
          n : alphanum[2]

Record r is 7 bytes long and array a is 8 bytes long; therefore, once r is copied into a, accessing the latter with its initialization array type would yield unwanted data whenever the last byte is accessed. Although the example uses a literal in the subscript, in general the analyzer cannot know which element is accessed, and therefore the warning is output.
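The byte arithmetic behind this warning can be worked out with a small Python sketch. It assumes that the copy fills the target starting from its first byte and that subscripts are 0-based, as the contrast between a[3] and a[1] in the text suggests; the function name `stale_elements` is hypothetical.

```python
def stale_elements(elem_size: int, count: int, copied_bytes: int) -> list:
    """0-based indices of the array elements that still contain at least one
    byte not overwritten by a copy of `copied_bytes` bytes."""
    return [i for i in range(count) if (i + 1) * elem_size > copied_bytes]


# a : alphanum[2] array[4] occupies 2 * 4 = 8 bytes,
# r : { x : num[3]; y : alphanum[2]; z : num[2] } occupies 3 + 2 + 2 = 7 bytes.
size_of_a = 2 * 4
size_of_r = 3 + 2 + 2
print(size_of_a, size_of_r, stale_elements(2, 4, size_of_r))
```

Under these assumptions only the last element of a overlaps the byte left untouched by the copy, so a statically known subscript such as a[1] could be accepted without a warning.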

For this reason, static evaluation of constant expressions has been implemented in our prototype, even though we have not considered it in this article. It would avoid the warning if the assignment were n := a[1], and we observed that, overall, it slightly reduces the number of messages logged by the analyzer. Also, a GUI front-end is under development to let users browse annotated source programs and understand complex flow-types more easily.

Finally, we are considering extending the system with the following features:

• dealing with unknown statements in some interesting way, type-wise, such as adding weak types to the type system indicating that type assumptions might get broken whenever a variable is used by a COBOL command whose semantics are unknown
• support for COBOL language extensions such as SQL, introducing the notion of cursor and table types within the system for detecting possible inconsistencies between declared records and the actual row layout in the database
• adding some form of data-flow analysis over value domains and ranges
• designing some custom Program Understanding approaches, such as pattern recognition over identifier names or code snippets, for making the system aware of typical COBOL programming trends, styles, practices and design patterns

REFERENCES

Nielson, F., Nielson, H. R., and Hankin, C. (1999). Principles of Program Analysis. Springer-Verlag.

Holt, R. C. (2008). WCRE 1998 most influential paper: Grokking software architecture. In WCRE (Working Conference on Reverse Engineering), pages 5–14. IEEE.

IBM (2009). COBOL z/OS language reference. Website. http://publib.boulder.ibm.com/ infocenter/ pdthelp/ v1r1/ index.jsp?topic=/ com.ibm.debugtool.doc 7.1/ eqa7rm0293.htm.

Kernighan, B. W. and Ritchie, D. (1988). The C Programming Language, Second Edition. Prentice-Hall.

Kuipers, T. and Moonen, L. (2000). Types and concept analysis for legacy systems. In IWPC, pages 221–230. IEEE Computer Society.

Moonen, L. (2001). Generating robust parsers using island grammars. In WCRE (Working Conference on Reverse Engineering).

Moonen, L. (2003). Exploring software systems. In ICSM, pages 276–280. IEEE Computer Society.

Stroustrup, B. (2000). The C++ Programming Language. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 3rd edition.

van Deursen, A. and Moonen, L. (1998). Type inference for COBOL systems. In WCRE (Working Conference on Reverse Engineering), pages 220–230.

van Deursen, A. and Moonen, L. (1999). Understanding COBOL systems using inferred types. In IWPC. IEEE Computer Society.

van Deursen, A. and Moonen, L. (2000). Exploring legacy systems using types. In WCRE (Working Conference on Reverse Engineering), pages 32–41.

van Deursen, A. and Moonen, L. (2001). An empirical study into COBOL type inferencing. Sci. Comput. Program., 40(2-3):189–211.

van Deursen, A. and Moonen, L. (2006). Documenting software systems using types. Sci. Comput. Program., 60(2):205–220.
