A COMPARATIVE STUDY OF DOCUMENT CORRELATION
TECHNIQUES FOR TRACEABILITY ANALYSIS
Anju G. Parvathy, Bintu G. Vasudevan and Rajesh Balakrishnan
SETLabs, Infosys Technologies Ltd., Bangalore
Keywords:
Traceability matrix, impact analysis, document correlation, TFIDF, LSI, LDA, CTM.
Abstract:
One of the important aspects of software engineering is to ensure traceability across the development lifecycle.
The traceability matrix is widely used to check for completeness and to aid impact analysis. We propose that this computation of traceability can be automated by looking at the correlation between the documents. This paper describes and compares four novel approaches to traceability computation based on text similarity, term structure and inter-document correlation algorithms. These algorithms draw on different information retrieval techniques for establishing document correlation. Observations from our experiments are
also presented. The advantages and disadvantages of each of these approaches are discussed in detail. Various
scenarios where these approaches would be applicable and the future course of action are also discussed.
1 INTRODUCTION
Requirements traceability can be defined as "An explicit tracing of requirements to other requirements, models, test requirements and other traceability items such as design and user documentation". A traceability item in turn is "Any textual or model item, which needs to be explicitly traced from another textual, or model item, in order to keep track of the dependencies between them" (Spence and Probasco, 1998). A general definition of a traceability item would thus be a "project artifact". In the life cycle of large software
projects several artifacts like Requirements specifica-
tion, Architecture, Design specification, Test Plans,
Use Case documents etc. get created. Any change
in the requirements or design would imply change in
many other artifacts as well. The ability to assess the
impact of a change, in any one of the artifacts, on the
rest of the project artifacts and on the project execu-
tion itself is a critical aspect of software engineering.
Traceability links are one popular mechanism used to perform impact analysis. They allow a user to
follow the life of a requirement both forwards and
backwards, from origin through implementation (Go-
tel and Finkelstein, 1994). Thus the user can keep
track of the requirement’s development, specification,
its subsequent deployment and use.
We start with a discussion on related works and
proceed to present our hypotheses. Following this
we explain the four different implementation schemes
we experimented with. The first scheme involves Co-
sine Similarity (CosSim), the second employs Latent
Semantic Indexing (LSI) (Deerwester et al., 1990),
the third uses Latent Dirichlet Allocation (LDA) (Blei
et al., 2003) and the fourth is based on topic modeling
using Correlated Topic Models (CTM) (Blei and Laf-
ferty, 2007). We also explain how we classified the
dependent artifacts into high, medium and low match
categories. This is followed by a comparative analysis
of the recall and precision inferred from their respec-
tive traceability matrices.
2 RELATED WORKS
Traceability is essential for many purposes, including
assuring that systems conform to their requirements,
that terms are defined and used consistently, and that
the structures of models correspond to requirements
(Alexander, 2002). A variety of software engineer-
ing tasks require tools and techniques to recover inter
document traceability links, particularly the ones be-
tween documentation and source code. These tools
enable general maintenance tasks, impact analysis,
program comprehension and reverse engineering for
redevelopment and systematic reuse. Several such
tools are currently available, some of which are dis-
cussed here. There are different types of traceability
techniques like cross reference centered, document
centered and structure centered (Gotel and Finkel-
stein, 1994). We use document correlation to build
the traceability links.
Different methods to improve the overall qual-
ity of dynamic candidate link generation for require-
ments tracing have recently been studied (Dekhtyar et al., 2006). A novel approach to automate traceability when one artifact is derived from the other has been illustrated (Richardson and Green, 2004). 'QuaTrace' is yet another tool to aid impact analysis that
allows traces to be established, analyzed, and main-
tained effectively and efficiently (Knethen and Grund,
2003). Some other tools that support traceability be-
tween artifacts are (RDD-100, 2006), (RequisitePro,
2006), (DOORS, 2006), RETH (Ebner and Kaindl, 2002), CORE (Vitech Corp.), icCONCEPT RTM (Integrated Chipware), SLATE (TD Technologies, Inc.) and XTie-RT (Teledyne Brown Engineering).
The user interfaces supported by most of the commercial tools detailed above introduce overhead on the software development process. This is pronounced because the tracing tools add a lot of extra work for users such as Business Analysts, Architects, Designers, Developers and Testers, work that is distinctly different from what their job functions demand. These over-
heads include maintenance of the database, storing
the concept definitions and relations, maintaining ac-
cess rights and permissions to different kinds of users
etc. Further the user is supposed to mark require-
ments, build the requirement hierarchy and maintain
it. Moreover, it is necessary to start using the tool
from project initiation as it would be virtually impos-
sible to capture the required trace information from
different project artifacts otherwise. These tools also
provide limited support for impact analysis. Most of
the tools do not give a quantitative relationship be-
tween two concepts, either. Further the links estab-
lished by such tools are rarely based on the semantic
structure of the documents. Hence some sort of se-
mantic analysis should be incorporated into these tools to eliminate the 'false positive links' they establish.
Some approaches to semantic analysis can be used in information retrieval for document correlation. Probabilistic Latent Semantic Analysis is
a probabilistic variant of Latent Semantic Analy-
sis which defines a proper generative model of data
(Hoffman, 1999). Aspect model (Hoffmann et al.,
1999) and semi discrete matrix decomposition (Kolda
and O’Leary, 1998) are other models for semantic
analysis. An algorithm used for classification of
words into semantic classes (or concepts) is ‘Un-
supervised Induction of Concepts’ (Lin and Pantel,
2001).
The four approaches for requirements tracing dis-
cussed in this paper reduce the burden on the user by
automatically identifying concepts and trace links be-
tween these concepts. Moreover, approaches based
on LSI, LDA and CTM aid in semantic analysis.
3 AUTOMATING TRACEABILITY
In this section, we look at the details of the ap-
proaches towards automating requirements tracing.
They are based on two key hypotheses. The first is the usage of common terminology: we believe that the application-domain knowledge that programmers possess when writing code is often captured by the mnemonics for identifiers; therefore, the analysis of these mnemonics can help associate high-level concepts (e.g., a use case) with program concepts (e.g., a class design specification) and vice versa. The second hypothesis is the usage of connected words, as in class/design specification documents. A connected
word contains two or more words in mixed case. The
beginning of a new token within a word is marked by
an upper case character. Often prefixes are also at-
tached to these words. The prefixes which begin with
a lower case character are neglected as they do not
contribute to the meaning of the word. Appropriate rules are set to tokenize the connected word depending on the document type in use (e.g., for the connected word 'getCostParameter', 'get' is the prefix and the tokens are 'Cost' and 'Parameter'). Accordingly, if a particular document has the words 'cost' and 'parameter' and another has the connected word 'getCostParameter', then the two documents are related.
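As an illustration, the tokenization rule just described might be sketched as follows; the regular expression and the function name are ours, given purely for illustration, and not taken from any tool discussed in this paper:

    import re

    def split_connected_word(word):
        """Split a mixed-case connected word into its tokens: a new
        token starts at each upper-case character, and a leading
        lower-case prefix (e.g. 'get') is dropped, as it does not
        contribute to the meaning of the word."""
        return re.findall(r'[A-Z][a-z0-9]*', word)

    print(split_connected_word('getCostParameter'))  # ['Cost', 'Parameter']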
Formally, a project consists of a set of artifacts
like Business Process Model, Requirements Specifi-
cation, Class/Design Specification, Test Case docu-
ment etc. Each Artifact could contain multiple con-
cepts having different concept types. A concept type
can be a requirement, use case, test case, class de-
sign or some other relevant topic. A set of "Concept Extraction Algorithms" (CEAL) can be used to specify how to extract a particular concept type from an artifact. A simple CEAL for extracting the requirement concept type from a requirement document could be to look for a keyword like "Requirement" in the different headings of the document. In some extreme cases,
we may have to apply a combination of text segmen-
tation and text abstraction algorithms to extract con-
cepts from artifacts (Goldin and Berry, 1997). Our ap-
proaches propose to derive the relationships, between
concepts, through different ”Relationship Extraction
Algorithms” (REAL) that specify how to derive re-
lationships between two different types of concepts.
For example, if a project has a use case artifact and a class design artifact, and the template used for documenting use cases has a section called "Related Classes", a REAL to extract relationships can look for the items under the "Related Classes" section. Another REAL could be based on text similarity between the use case description and the class design specifications. If the "Related Classes" section is well maintained, the first algorithm is the more efficient one.
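As a minimal sketch of the first kind of REAL, assuming the use case text is plain text with a 'Related Classes' heading followed by one class name per line (the heading text and layout here are illustrative assumptions, not a prescribed template):

    def extract_related_classes(use_case_text):
        """A toy REAL: collect the items listed under a
        'Related Classes' heading, stopping at the first blank
        line after the section."""
        lines = iter(use_case_text.splitlines())
        related = []
        for line in lines:
            if line.strip().lower() == 'related classes':
                for item in lines:
                    if not item.strip():
                        break
                    related.append(item.strip())
        return related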
The approaches may not help in deriving all the
concepts and relations in the documents. However,
even if an automated mechanism based on the above
approach is able to derive 60-70% of the concepts and relations, we believe that there would be an overall increase in productivity throughout the lifecycle
of a software project. The approaches themselves are
not tied to any specific set of project artifacts or con-
cept types or any specific document format. Further,
we can use any of the four approaches in any stage of
software development lifecycle for inter artifact trac-
ing.
We used sample documents of “Use Case” (UC)s
and related “Class Specification” (CS)s as input cor-
pora for all the four approaches. The outputs are
quantified relationship matrices and classified trace-
ability matrices. The input corpora are preprocessed
and freed from 'stop words', very common words that would adversely affect the relationship matrix computation. Every new word is also checked
for being a connected word. The four approaches
based on cosine similarity, LSI, LDA and CTM are
discussed subsequently.
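A minimal sketch of this preprocessing step is given below; it reuses split_connected_word from the earlier sketch, and the stop-word list is a tiny illustrative subset, not the one actually used in our experiments:

    import re

    STOP_WORDS = {'the', 'a', 'an', 'is', 'of', 'to', 'and', 'in'}

    def preprocess(text):
        """Lower-case the text, drop stop words, and expand every
        connected word into its constituent tokens."""
        terms = []
        for word in re.findall(r'\w+', text):
            for piece in split_connected_word(word) or [word]:
                piece = piece.lower()
                if piece not in STOP_WORDS:
                    terms.append(piece)
        return terms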
Cosine Similarity. Here the documents are preprocessed and represented in the vector space using the TFIDF of terms as weights. The popular cosine measure of text similarity is applied to every pair of document vectors to get the relationship index RI_ij, which quantifies the relationship between them.
The greater the cosine value the higher is the corre-
lation. Further, we classify the dependencies based
on their degree of match into three categories: high, medium and low matches respectively. For classification, the RIs with CSs are first sorted in descending order for each UC. Then the average values of the highest three RIs are used as threshold values for classification. Thus, corresponding to each UC, we have
three classes of CSs, one or more of which could be
empty, as per the degree of correlation between the
documents.
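A compact sketch of this scheme, using scikit-learn for the TF-IDF weighting and cosine measure; this is an illustrative reimplementation, and the 'medium' cut-off at half the top-three average below is one plausible reading of the thresholding, not a definitive one:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def cosine_ri_matrix(uc_texts, cs_texts):
        """RI_ij = cosine of the angle between the TF-IDF vectors of
        use case i and class specification j."""
        vectors = TfidfVectorizer(stop_words='english').fit_transform(uc_texts + cs_texts)
        uc_vecs, cs_vecs = vectors[:len(uc_texts)], vectors[len(uc_texts):]
        return cosine_similarity(uc_vecs, cs_vecs)  # shape |UC| x |CS|

    def classify(ri_row):
        """Bucket the CSs for one UC into high/medium/low matches,
        using the average of the three largest RIs as the 'high'
        threshold (the 'medium' boundary at half that value is an
        assumption for illustration)."""
        ranked = sorted(range(len(ri_row)), key=lambda j: -ri_row[j])
        t_high = sum(ri_row[j] for j in ranked[:3]) / 3.0
        high = [j for j in ranked if ri_row[j] >= t_high]
        medium = [j for j in ranked if t_high / 2 <= ri_row[j] < t_high]
        low = [j for j in ranked if 0 < ri_row[j] < t_high / 2]
        return high, medium, low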
Latent Semantic Indexing. Documents are prepro-
cessed as before and subsequently represented in vec-
tor space. The term by document matrix is then sub-
ject to singular value decomposition for LSI ((Deer-
wester et al., 1990) and (Dumais, 1991)). Choosing an optimal value of the rank k for dimensionality reduction produces a reduced term-by-document matrix. Different values of k result in different links being recovered. The RI here is the cosine of the angle between a pair of document vectors (of the documents to be correlated). These RIs are used for the classification of CSs as in the earlier approach.
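A sketch of the LSI variant, again with scikit-learn (TruncatedSVD performs the rank-k SVD); this too is an illustrative reimplementation rather than the code used in our experiments:

    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.preprocessing import Normalizer

    def lsi_ri_matrix(uc_texts, cs_texts, k=10):
        """Project the term-by-document matrix onto a rank-k LSI
        subspace, then take cosines between UC and CS vectors in
        that subspace."""
        vectors = TfidfVectorizer(stop_words='english').fit_transform(uc_texts + cs_texts)
        reduced = TruncatedSVD(n_components=k).fit_transform(vectors)
        reduced = Normalizer(copy=False).fit_transform(reduced)  # unit-length rows
        uc_vecs, cs_vecs = reduced[:len(uc_texts)], reduced[len(uc_texts):]
        return uc_vecs @ cs_vecs.T  # cosines, since the rows are unit vectors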
Latent Dirichlet Allocation. LDA is a general prob-
abilistic model for collections of discrete data such as
text corpora (Blei et al., 2003). The input corpus size
is normalized. The maximum document size amongst
the documents in the corpora is obtained. Then ev-
ery document with size less than 50% of the maxi-
mum document size is appended with copies of itself
until the size exceeds the maximum document size. This normalization is done to ensure uniform document lengths so that LDA captures important words from smaller documents and quantifies their relationships.
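The normalization step can be sketched as below, measuring document size simply as character length (the unit of 'size' is not fixed by the description above, so this is an assumption):

    def normalize_lengths(docs):
        """Pad each document shorter than 50% of the maximum size
        with copies of itself until its size exceeds the maximum."""
        max_len = max(len(d) for d in docs)
        out = []
        for d in docs:
            padded = d
            if len(d) < 0.5 * max_len:
                while len(padded) <= max_len:
                    padded += ' ' + d
            out.append(padded)
        return out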
The documents are preprocessed as in the earlier cases. The vector form of every document d is determined by the unique words w and their frequencies f in it. The j-th document vector is d_j = ⟨n⟩ ⟨w_i : f_i⟩ ..., where n is the total number of unique words in d_j and i = 1, 2, ..., n.
We assign the number of topics for which LDA pools the terms to be equal to the total number of dependent documents in the entire corpora. LDA processes the document vectors with predefined settings and produces the bag of words and the log beta file, which is a matrix of word-to-topic probabilities. We define a text similarity algorithm where RI_ij is computed as:

RI_ij = Σ_{k=1}^{N} p_ki × f_ki × f_kj × β_k

where N is the total number of unique words in the LDA bag of words, p_ki is the weight with respect to the maximum number of tokens in connected words, f_ki and f_kj are the frequencies of word k in document i and document j respectively, and β_k is the sum of the beta probabilities of word k across all topics as given by LDA. The classification of the CSs is done as in the previous cases, the only difference being a different form of RI.
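A sketch of this RI computation is shown below. It substitutes scikit-learn's LatentDirichletAllocation for the LDA tooling described above, and sets the connected-word weight p_ki to 1 for every word, since its exact definition is implementation specific; both substitutions are assumptions for illustration:

    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    def lda_ri_matrix(uc_texts, cs_texts, n_topics=21):
        """RI_ij = sum_k p_ki * f_ki * f_kj * beta_k, with p_ki = 1.
        beta_k is the sum over topics of word k's topic probabilities."""
        vectorizer = CountVectorizer(stop_words='english')
        counts = vectorizer.fit_transform(uc_texts + cs_texts)
        lda = LatentDirichletAllocation(n_components=n_topics, random_state=0).fit(counts)
        topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
        beta = topic_word.sum(axis=0)          # beta_k, one value per word k
        f = counts.toarray().astype(float)     # f_ki: raw word frequencies
        f_uc, f_cs = f[:len(uc_texts)], f[len(uc_texts):]
        return (f_uc * beta) @ f_cs.T          # RI matrix of shape |UC| x |CS|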
Correlated Topic Model. CTM is a fast variational
inference procedure for carrying out approximate in-
ference that can be used for semantic analysis (Blei
and Lafferty, 2007). The procedure for traceability
is exactly the same as the one which involves LDA,
except that the bag of words is obtained using CTM
techniques.
4 EXPERIMENTAL RESULTS
Based on the above-mentioned approaches, the experiments were carried out on sets of two distinct corpora, each one containing documents belonging to a particular concept type in the artifact, specifically 22 UCs and 21 CSs¹. Similar experiments were conducted on a project of 26 UCs and 235 related CSs for actions dealing with waste water management, the recall and precision of which are discussed later.
4.1 Quantified Relationship Matrices
In this section, we provide the quantified relationship matrices (of dimension 22 × 21) for the four different approaches mentioned.
Table 1 shows a portion of the matrix of normalized RI computed using cosine similarity. In the example, RI_14 is greater than RI_11, RI_12, RI_13, RI_15, RI_16 and RI_17. This means that the highest correlation is between CS_4 and UC_1, amongst the seven retrieved relations for UC_1 as in the table.
Table 1: Relationship matrix of RI for CosSim.

UC   CS_1   CS_2   CS_3   CS_4   CS_5   CS_6   CS_7
1    0.0    0.0    0.01   0.40   0.03   0.02   0.20
2    0.0    0.02   0.0    0.01   0.02   0.003  0.26
3    0.0    0.0    0.0    0.01   0.004  0.0    0.21
4    0.03   0.009  0.02   0.02   0.01   0.001  0.29
5    0.002  0.002  0.002  0.31   0.04   0.03   0.05
Table 2 shows the normalized RI computed after applying LSI with an optimal latent parameter k = 10. The higher the value of k, the lesser is the reduction in dimension. The highest possible value of k is equal to the total number of documents in the corpus, in which case the results are the same as in the case of simple cosine similarity. In the table, RI_27 is the largest amongst all RIs for UC_2. Hence UC_2 is highly related to CS_7.

¹The results elaborated were generated from experiments on artifacts (UCs and CSs) pertaining to a shopping portal project. UC_1 elaborates the steps involved in making replacement costs available to regions, UC_2 elucidates the actions related to posting prices, UC_3 publishing these posted prices, UC_4 getting these posted prices, and UC_5 explains the action sequence for entering parameters for quotes. CS_4 is a CS on cost parameter control within the shopping portal, which also deals with quote prices, replacement costs and fixed price deals. CS_10 sequences the process involved in searching cost parameters and replacement costs, and CS_7 describes posted price control.
Table 2: Relationship matrix of RI for LSI.

UC   CS_1   CS_2   CS_3   CS_4   CS_5   CS_6   CS_7
1    0.0    0.02   0.004  0.46   0.36   0.01   0.20
2    0.007  0.02   0.007  0.09   0.17   0.003  0.17
3    0.004  0.02   0.005  0.04   0.13   0.0    0.15
4    0.005  0.02   0.008  0.10   0.18   0.009  0.17
5    0.001  0.01   0.003  0.53   0.30   0.03   0.09
Table 3 presents a part of the normalized matrix generated using the bag of words from LDA, with the number of latent topics equal to 21 (the number of CSs). Here again, RI_14 of UC_1 with CS_4 is the highest in the list of RIs for UC_1 and quantifies a true positive 'traceability link'. The RIs of UC_2, UC_3 and UC_4 with CSs do not help in recovering true relationships, as many dominant words from these UCs and CSs fail to be present in the LDA-generated bag of words, which consequently weakens the traceability process. As for UC_5, it was found that there was no CS in the CS corpora that was directly dependent on UC_5. Hence the RIs represent false positive relations for UC_5.
Table 3: Relationship matrix of RI for LDA.

UC   CS_1   CS_2   CS_3   CS_4   CS_5   CS_6   CS_7
1    0.02   0.02   0.01   0.16   0.02   0.02   0.03
2    0.01   0.01   0.01   0.01   0.01   0.004  0.01
3    0.005  0.006  0.006  0.006  0.005  0.002  0.006
4    0.04   0.05   0.03   0.04   0.02   0.02   0.03
5    0.01   0.01   0.007  0.11   0.01   0.01   0.01
The normalized matrix computed using the CTM-generated bag of words is shown in Table 4. The number of latent topics is once again chosen to be 21. The highest RIs denote the strongest relationships of UCs with CSs for UC_1, UC_2 and UC_3.
Table 4: Relationship matrix of RI for CTM.

UC   CS_1   CS_2   CS_3   CS_4   CS_5   CS_6   CS_7
1    0.01   0.01   0.006  0.16   0.02   0.01   0.02
2    0.0    0.0    0.004  0.0    0.006  0.0    0.005
3    0.0    0.0    0.002  0.0    0.003  0.0    0.003
4    0.007  0.01   0.007  0.01   0.01   0.008  0.01
5    0.005  0.003  0.002  0.09   0.01   0.004  0.01
4.2 Classified Traceability Matrices
The quantified relationship matrices generated previ-
ously are used to rank the CSs for every UC, in de-
scending order of RI. After applying threshold mea-
sures on the RIs, the CSs that are dependent on the
UCs are classified into the high, medium or low match
categories to generate classified traceability matrices.
Table 5 (A) shows the matrices computed using the measure of cosine similarity and the LSI technique. For the cosine similarity method, it is observed that for UC_1 the high and medium matches agree with the experts' trace. The relationships of UC_2, UC_3 and UC_4 with CS_7 fall in the medium, low and medium match categories respectively, but the experts' trace suggests that these are the strongest relations for these UCs. Relations for UC_5 can be ignored, as it does not have significantly matching CSs as per the experts' trace. The matches indicated by the traceability matrix for the LSI approach are valid for UC_1, UC_2 and UC_3. Relations for UC_4 and UC_5 can be ignored owing to the reasons cited earlier. The classified traceability matrices generated using LDA and CTM are shown in Table 5 (B). The trace measure indicated by the LDA approach validates the assertions of the experts' trace for UC_1 only. The fallacy of the other links retrieved by LDA is due to inappropriate representation in the bag of words. As for the CTM approach, UC_1, UC_2 and UC_3 are observed to have approvable matches.
Table 5: Classified traceability matrices for CosSim, LSI, LDA and CTM approaches.

A: CosSim and LSI
          CosSim                               LSI
UC  High  Medium   Low              High     Medium  Low
1   4,10  14,17    7,21,13,9,12     4,10     12,5    11,13,14,9,17
2         7        13,11,9,16,2                      5,7,13,9,19
3                  7,11,13,17,16                     7,13,5,19,9
4         7        9,13,11,20,16                     13,12,9,11,5
5   14    4,10,17  21,12,9,7,13     4,10,14  17      21,12,11,5,13

B: LDA and CTM
          LDA                                  CTM
UC  High  Medium      Low              High  Medium      Low
1   10    4,14,17,21  19,12,15,11,7    10,4  14,17,21    19,12,15,5,11
2                     15,3,2,12,11                       15,5,7,3,12
3                     10,19,2,3,7                        7,5,15,11,3
4                     19,2,11,4,10                       15,19,10,4,11
5         10,4,14,17  21,19,12,15,11         10,4,14,17  21,12,15,19,5
4.3 Analysis of Experimental Results
Experiments with the four different approaches were conducted on two case studies. The precision and recall shown in
Table 6 were calculated without applying a threshold
on RI values.
Table 6: The precision and recall table.

           Precision %         Recall %
Approach   Case I   Case II    Case I   Case II
CosSim     73       72         94       90
LSI        55       48         71       63
LDA        50       40         65       50
CTM        50       40         65       50
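For reference, precision and recall over traceability links are computed in the standard way: retrieved links come from the RI matrices, relevant links from the experts' trace. A minimal sketch with hypothetical (UC, CS) pairs:

    def precision_recall(retrieved, relevant):
        """precision = |retrieved & relevant| / |retrieved|;
        recall = |retrieved & relevant| / |relevant|."""
        tp = len(retrieved & relevant)
        precision = tp / len(retrieved) if retrieved else 0.0
        recall = tp / len(relevant) if relevant else 0.0
        return precision, recall

    # e.g. precision_recall({(1, 4), (1, 10), (2, 7)}, {(1, 4), (2, 7), (3, 7)})
    # -> (0.666..., 0.666...)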
For a given pair of documents, the RI computation using the cosine measure involves all unique words in both documents, but it provides a relatively small reduction in description length and reveals little in the way of inter- or intra-document statistical structure. LSI performs "noise reduction", precluding the term combinations which occur less frequently in the given document collection from the LSI subspace used to calculate the RI. The approaches that use LDA and CTM for computing the RI are confined to the bag of words that they generate after semantic analysis. The recall offered by the latter two methods is poorer than that of the cosine similarity and LSI based approaches because of the inability of dominant words from certain documents to figure in the bag of words. This could be the reason why the strengths of the traceability links differ when the different approaches are used.
In general, the CTM approach scores over the LDA approach in that the words collected under one bag in CTM are not confined to a particular document, so the inter-document relationship is delivered by the bag of words as well. The LDA approach was found to extract words in a more document-specific manner, and hence words with low frequency but of high importance in some documents did not figure in the bag. However, the two approaches yielded the same precision and recall in the experiments conducted.

The best traceability scheme as suggested by the experimental results is thus 'document correlation using cosine similarity considering connected words'. However, a great deal of its accuracy is attributed to the emphasis on connected words: when the same procedure was carried out ignoring such words, the recall was very poor and the results very erratic.
4.4 Advantages and Disadvantages
One of the most notable advantages of these approaches is that they partially automate the task of concept identification and relationship extraction, reducing the burden on the user of building and maintaining trace information.
Further all of these approaches can be adopted during
any phase of the project’s lifecycle. Also, as the rela-
tionships are quantified, we can differentiate between
strongly related and weakly related documents. This
aids impact analysis and helps find the minimal set of
design specifications that cover a given set of require-
ments. Moreover, the emphasis given to connected
words by all four approaches is extremely advanta-
geous for use in many technical domains of projects.
Also, the techniques are programming language and
paradigm independent, thus offering more flexibility
and automation capability. Further, the relationship
extraction using the LDA and CTM bags of words greatly reduces the description length of a document and also
reveals inter- and intra-document statistical structure.
However, these approaches are not without their share of inadequacies. Their success depends upon the acceptance of standard templates for information capture by the project teams. Further, concepts and relations will, at least sometimes, have to be extracted using text segmentation or text similarity algorithms, most of which work on large volumes of text. Hence the success of our approaches will depend on whether these algorithms can be tuned to work with small amounts of text. Also, there is a chance of the system coming up with wrong relations; the onus of maintaining the correctness of the trace data still rests on the user.
To improve the effectiveness of the traceability, we will consider the following improvements. A glossary or a thesaurus which aids in resolving the usage of similar words/terms can be devised. An ontology can be used to capture domain-specific concepts. We can provide the user with a mechanism to control keyword importance and document granularity. Further, improved text normalization methods can be adopted so that the LDA and CTM bags of words uniformly contain dominant words from all documents in the corpora.
5 CONCLUSIONS
Document correlation is an important step towards establishing traceability. However, inter-artifact traceability is often ignored due to the large overhead added by current tracing mechanisms. Even partial automation of trace building and maintenance would help the user to seriously consider this aspect throughout the lifecycle of the project. The four different approaches for tracing discussed in this paper can at least partially automate this process. The approaches can also be incorporated into the project at an advanced stage of its lifecycle. To improve the efficiency of traceability, 'tokenization of connected words' is done by all four approaches. Such words are extremely important for correlating technical documents like design specifications and actual code with other documents. The results of the experiments we conducted suggest that the best approach for traceability amongst the ones discussed is 'document correlation using cosine similarity considering connected words'. However, the LSI, LDA and CTM based approaches emphasize extracting semantic information from the text corpora. We also aspire to explore alternate methods to compute the index of document similarity to improve the recall and precision.
REFERENCES
Alexander, I. (2002). Towards automatic traceability in in-
dustrial practice. In Proc. of Workshop on Traceabil-
ity, pages 26–31.
Blei, D. M. and Lafferty, J. D. (2007). A correlated topic
model of science. Appl. Stat., 1(1):17–35.
Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent
dirichlet allocation. J. of MLR3, pages 993–1022.
Deerwester, S. C., Dumais, S. T., Landauer, T. K., Furnas,
G. W., and Harshman, R. A. (1990). Indexing by latent
semantic analysis. J. of ASIS, 41(6):391–407.
Dekhtyar, A., Hayes, J. H., and Sundaram, S. K. (2006).
Advancing candidate link generation for requirements
tracing: The study of methods. In IEEE Trans. on SE,
volume 32, pages 4–19.
DOORS (2006). Telelogic, http://www.telelogic.com.
Dumais, S. (1991). Improving the retrieval of information
from external sources. Behavior Research Methods,
Instruments, and Computers, 23(2):229–236.
Ebner, G. and Kaindl, H. (2002). Tracing all around in
reengineering. IEEE Software, 19(3):70–77.
Goldin, L. and Berry, D. M. (1997). Abstfinder, a proto-
type natural language text abstraction finder for use in
requirements elicitation. ASE, 4(4):375–412.
Gotel, O. and Finkelstein, A. (1994). An analysis of the re-
quirements traceability problem. In Proc. of the IEEE
Int’l. Conf. on Req. Engg., pages 94–101.
Hoffman, T. (1999). Probabilistic latent semantic analysis.
In Proc. of UAI, pages 289–296.
Hoffmann, T., Puzicha, J., and Jordan, M. I. (1999). Learn-
ing from dyadic data. ANIPS 11.
Knethen, A. V. and Grund, M. (2003). QuaTrace: A tool environment for (semi-)automatic impact analysis based on traces. In Proc. of the Int'l. Conf. on Software Maintenance, pages 246–255.
Kolda, T. G. and O'Leary, D. P. (1998). A semi-discrete matrix decomposition for latent semantic indexing in information retrieval. ACM Trans. on Info. Sys., 16(4):322–346.
Lin, D. and Pantel, P. (2001). Induction of semantic classes
from natural language text. In Proc. of the seventh
Int’l. Conf. on KDDM, pages 317–322, California.
RDD-100 (2006). Holagent corporation,
http://www.holagent.com/products/product1.html.
RequisitePro, R. (2006). Rational software,
http://www.rational.com/products/reqpro/index.jsp.
Richardson, J. and Green, J. (2004). Automating traceabil-
ity for generated software artifacts. In Proc. of the
19th IEEE Int’l. Conf. on ASE, pages 24–33.
Spence, I. and Probasco, L. (1998). Traceability strategies
for managing requirements with use cases. W. Paper.