Tracker Text Segmentation Approach: Integrating Complex Lexical and Conversation Cue Features

C. Chibelushi and B. Sharp

School of Computing and IT, University of Wolverhampton, Technology Centre (MI Building), Room MI 317, City Campus - South, Wulfruna Street, Wolverhampton, WV1 1SB, U.K.

Faculty of Computing Engineering and Technology, Staffordshire University, U.K.

Abstract. While text segmentation has received a great deal of attention since 9/11, most current research projects remain focused on expository texts, stories and broadcast news. Current segmentation methods are well suited to written and structured texts because they make use of the distinctive macro-level structures of such texts. Segmenting transcribed multi-party conversation presents a different challenge, given the lack of linguistic features such as headings, paragraphs, and well-formed sentences. This paper describes an algorithm suited to transcribed meeting conversations that combines semantically complex lexical relations with conversational cue phrases to build lexical chains for determining topic boundaries.

1 Introduction

The problem of text segmentation has recently been the focus of many researchers, as more and more applications require the tracking of topics, whether for summarization, classification and/or retrieval of textual documents. Since 9/11 text segmentation has become a common technique for detecting the threads contained in instant messaging and internet chat forums for various applications, including information retrieval, expert recognition and even crime prevention [3]. Text segmentation can be carried out on audio, video, and textual data. The aim of segmentation is to partition a text into contiguous segments related to different topics. The increasing interest in segmenting conversations is mainly explained by the number of its applications, as outlined in Table 1. The granularity of the segmentation depends on the size of the chosen unit of analysis, which varies from words to phrases to utterances.

In this paper we present an algorithm for text segmentation of transcribed meetings involving multi-party conversation. While previous research has focused mostly on structured texts, broadcast news, and monologues, which consist of cohesive stories, our corpus consists of 17 manually transcribed meeting conversations. It includes incomplete sentences, sentences related to social chatting, interruptions, and references made by participants to the visual context. Consequently, the analysis of our transcripts poses additional complexity due to their informal style, the use of visual cues, the lack of macro-level text units such as headings and paragraphs, and their spontaneous and often argumentative nature.

Table 1. A review of the language processing applications to transcripts [5].

Chibelushi C. and Sharp B. (2008). Tracker Text Segmentation Approach: Integrating Complex Lexical and Conversation Cue Features. In Proceedings of the 5th International Workshop on Natural Language Processing and Cognitive Science, pages 104-113. DOI: 10.5220/0001740501040113. SciTePress.

The motivation for our research project stems from the need to analyse a set of transcribed meetings in order to track the decisions, and their associated issues and actions, discussed in the meetings in relation to software development. These elements are then fed into a database to provide a tracking system that supports software development by identifying the decisions made at these meetings and by giving an understanding of the issues and decisions that may have led to any unnecessary rework.

In this paper we begin by reviewing the methodologies associated with text segmentation, and then describe our Tracker Text Segmentation (TTS) approach to segmenting transcribed meeting conversations. Finally we discuss the results and the limitations of our algorithm, and conclude by outlining future research directions.


2 Previous Work

A review of the literature on text segmentation techniques reveals two distinct approaches: statistically based and linguistically driven methods [5, 14]. Some statistical approaches are based on probability distributions [2] or on machine learning techniques ranging from neural networks [4] to support vector machines [18] and Bayesian networks [21], while others treat text as an unlabelled sequence of topics using a hidden Markov model [24]. [8] developed a text segmentation tool called C99, which uses a divisive clustering algorithm developed by [20] to identify topic boundaries.

The other text segmentation approach is derived from the lexical cohesion theory of [9] and detects topic changes using term repetition [25, 19, 10], n-gram words or phrases [12], or word frequency [20, 1]. Some methods use lexical chains to identify topic changes [10, 22], or prosodic clues to mark shifts to new topics [13, 19]. However, most lexical cohesion-based segmentation approaches use lexical repetition as the only form of cohesion and ignore the other types of lexical cohesion, such as synonymy, hypernymy, hyponymy and meronymy [23]. A different approach is adopted by [16], who combine decision trees with linguistic features extracted from spoken texts.

The above segmentation methods are well suited to written and structured texts, since they make use of distinctive macro-level structures which are lacking in transcribed texts. In our transcripts the topic boundaries are often fuzzy, some topics are revisited at different stages of the meeting, and the discussion does not always follow the intended agenda, which makes segmentation a very challenging task. We therefore needed to develop a segmentation method that could handle this complexity and lack of structure while building on the structures pertinent to transcribed texts, such as the notion of utterance, the cue phrases of spontaneous speech, and domain-specific knowledge, to build effective semantic lexical chains.

3 The Corpus

In our research project we used 17 transcripts recorded in three diverse meeting environments: industrial, organizational and educational. Each involves a multi-party conversation and contains an accurate and unedited record of the meeting and the corresponding speakers. The transcripts varied in size, ranging from 2,479 to 25,670 words, and some meetings had no pre-set agenda. Consequently the analysis of these transcripts posed additional complexity due to their informal style, their lack of structure, their argumentative nature, and the use of common colloquial words. The transcripts also contain incomplete sentences, sentences related to social chatting, interruptions, and references made by participants to the visual context. For confidentiality reasons, a corpus with a total of 247,238 words is used in this paper to illustrate our algorithm.


Fig. 1. A Three-Stage Segmentation process.

4 Tracker Text Segmentation Algorithm (TTS)

Our TTS algorithm, which builds on the concept of a sliding window, uses the utterance as the base unit of analysis. It is a context-driven segmentation algorithm that combines the lexical chaining method with semantically more complex types of lexical cohesion between words in the transcripts in order to capture their sense relations, such as synonymy, hypernymy (ISA relation), hyponymy (kind-of relation), meronymy (part-of relation) and coordinate terms (e.g. computer and PC). Using WordNet and our extended version of WordNet, these sense relations allow us to capture the hierarchical as well as the transitive relationships among the words in the transcripts and to enhance the formation of lexical chains.
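The way sense relations can link words into a chain may be illustrated with a small sketch. It is not the authors' chaining algorithm (which is detailed in [5]) and it does not query WordNet: the `RELATIONS` table, the `related` and `build_chains` functions, and all word pairs are hypothetical, chosen only to show how relations other than repetition, applied transitively, can place words such as laptop and PC in the same chain.

```python
# Toy sense-relation lexicon. Every entry is a hypothetical example and is
# not taken from WordNet or from the authors' extended WordNet.
RELATIONS = {
    ("computer", "pc"): "coordinate",
    ("laptop", "computer"): "hypernymy",  # a laptop IS-A computer
    ("screen", "laptop"): "meronymy",     # a screen is PART-OF a laptop
    ("display", "screen"): "synonymy",
}


def related(a, b):
    """Return the name of the relation linking two nouns, or None."""
    if a == b:
        return "repetition"
    return RELATIONS.get((a, b)) or RELATIONS.get((b, a))


def build_chains(nouns):
    """Greedily add each noun to the first chain holding a related member."""
    chains = []
    for noun in nouns:
        for chain in chains:
            if any(related(noun, member) for member in chain):
                if noun not in chain:
                    chain.append(noun)
                break
        else:
            # No existing chain is related to this noun: start a new chain.
            chains.append([noun])
    return chains


chains = build_chains(
    ["laptop", "agenda", "computer", "pc", "screen", "display", "laptop"]
)
```

In this toy run "pc" joins the chain through "computer" and "display" through "screen", although neither is directly related to "laptop"; a repetition-only method would have produced a separate chain for each distinct word.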

The study of these transcripts has led to the identification of the major speech cue phrases used by the speakers to introduce new topics or to highlight new issues or problems; examples are given in Table 2. Prior to segmentation the transcripts are pre-processed, which involves tokenisation, POS tagging using WMATRIX, case folding, identification of compound concepts and removal of stop words.
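A minimal sketch of such a pre-processing pass is given below. It covers tokenisation, case folding, compound identification and stop word removal only; POS tagging is omitted because WMATRIX is a separate tool. The stop list, the compound list and the function name `preprocess` are hypothetical and serve purely as illustration.

```python
import re

# Hypothetical resources: a tiny stop list and compound list for illustration.
STOP_WORDS = {"the", "a", "an", "is", "to", "of", "we", "and", "on", "that"}
COMPOUNDS = {("text", "segmentation"), ("white", "board")}


def preprocess(utterance):
    """Case-fold, tokenise, join known compounds, and drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", utterance.lower())
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in COMPOUNDS:
            # Treat a known two-word compound as a single concept.
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return [t for t in merged if t not in STOP_WORDS]


tokens = preprocess("We need to move the White Board and the laptop")
```

Compounds are merged before stop words are removed so that a compound containing a stop word would still be recognised.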

TTS performs three main stages: (i) initial segmentation, (ii) intermediate segmentation, and (iii) final segmentation (shown in Fig. 1).

4.1 Initial Segmentation

This stage segments the stream of the transcribed meeting into topically cohesive items of discussion. It is based on the sliding window approach developed by [10] and later adopted by [19], which divides the text into multi-paragraph blocks and then, using a vector space model, calculates the similarity of two consecutive blocks using the cosine measure, which has been widely used in Information Retrieval (IR) systems. Instead of using paragraphs as the base unit for segmentation, our algorithm computes the similarity between utterances, referred to hereafter as the Utterance Cosine Similarity (UCS). Thus, instead of measuring the similarity between a query and a document as in IR systems, UCS measures the similarity between two utterances.

An utterance $U_i$ is defined as $U_i = \{W_1, \ldots, W_n\}$, where $W_k$ is a noun or compound noun as it appears in the utterance. A term frequency vector $f_i$ is constructed for each utterance $U_i$ by recording the frequency of occurrence of its terms within the transcript. Suppose a transcript consists of 33 distinct noun concepts, and one of its utterances is $U_i$, which includes the four distinct concept nouns size, board, laptop, and edge. Its frequency vector is then represented as follows.

$f_i = \{1, 3, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0\}$
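The construction of such a vector can be sketched as follows. The function name `frequency_vector`, the filler vocabulary entries and the example utterance are hypothetical. The sketch counts occurrences inside the utterance itself, which is the usual vector-space convention; this is an assumption, since the text above speaks of frequency within the transcript.

```python
from collections import Counter


def frequency_vector(utterance_nouns, vocabulary):
    """Map an utterance (a list of nouns) onto a fixed-order count vector."""
    counts = Counter(utterance_nouns)
    return [counts[term] for term in vocabulary]


# Hypothetical vocabulary: the four nouns from the text plus 29 filler
# concepts, so that the vector has 33 positions.
vocabulary = ["size", "board", "laptop", "edge"] + [
    f"concept{n}" for n in range(5, 34)
]
utterance = ["size", "board", "board", "laptop", "board", "edge"]
f_i = frequency_vector(utterance, vocabulary)
```

With this input the first four positions hold 1, 3, 1, 1 and the remaining 29 positions are zero, matching the shape of the example vector.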

Two utterances $U_i$ and $U_j$ are regarded as similar when the cosine of their frequency vectors is close or equal to 1. The UCS measure, denoted $\mathrm{sim}(U_i, U_j)$, is defined as follows:

$$\mathrm{sim}(U_i, U_j) = \cos(f_i, f_j) = \frac{\sum_k f_{ik} f_{jk}}{\sqrt{\sum_k (f_{ik})^2 \sum_k (f_{jk})^2}}$$

where $0 \le \cos(f_i, f_j) \le 1$. The numerator $\sum_k f_{ik} f_{jk}$ is the inner product of $f_i$ and $f_j$, which measures how much the two vectors have in common. The denominator $\sqrt{\sum_k (f_{ik})^2 \sum_k (f_{jk})^2}$ is the product of the two vector lengths, which is used to normalise the vectors.
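The measure translates directly into a few lines of code. In the sketch below the function name `ucs` and the example vectors are hypothetical, and returning 0 for an all-zero vector is an assumption made here to avoid division by zero; the text does not say how such utterances are treated.

```python
import math


def ucs(f_i, f_j):
    """Cosine of two equal-length term-frequency vectors.

    Returns 0.0 if either vector is all zeros (an assumption of this sketch).
    """
    dot = sum(a * b for a, b in zip(f_i, f_j))
    norm = math.sqrt(sum(a * a for a in f_i) * sum(b * b for b in f_j))
    return dot / norm if norm else 0.0


# Hypothetical 4-term vectors.
same_topic = ucs([1, 3, 1, 1], [0, 2, 1, 0])  # shared terms, value near 0.9
no_overlap = ucs([1, 3, 0, 0], [0, 0, 2, 1])  # no shared terms, value 0
```

For the first pair the inner product is 7 and the vector lengths are the square roots of 12 and 5, so the similarity is 7 divided by the square root of 60.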

The cosine similarity measure assumes that similar terms tend to occur in similar segments. In such instances the angle between the vectors will be small, and so the cosine similarity measure will be close to 1. Utterances with little in common will have dissimilar terms, so the angle between their vectors will be close to π/2 and the UCS measure will be close to zero. A UCS matrix can then be prepared by comparing each utterance with every other utterance in the transcript. An example of this matrix is shown in Fig. 2. The blank lines in Fig. 2 contain zero vectors; these zeros are removed for clarity. In our study, after experimentation with our corpus, the threshold value was set to five.
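A UCS matrix of this kind can be sketched as below. The function name `ucs_matrix` and the three example vectors are hypothetical, and the threshold mentioned above is not applied because its exact role is not specified here. Rows belonging to all-zero vectors are left at zero, in the spirit of the blank lines of Fig. 2.

```python
import math


def ucs_matrix(vectors):
    """Return the symmetric matrix of pairwise cosine similarities."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    n = len(vectors)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # Pairs involving an all-zero vector keep the default 0.0.
            if norms[i] and norms[j]:
                dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
                matrix[i][j] = matrix[j][i] = dot / (norms[i] * norms[j])
    return matrix


# Three hypothetical utterance vectors over a 4-term vocabulary.
m = ucs_matrix([[1, 3, 1, 1], [0, 2, 1, 0], [0, 0, 0, 0]])
```

Only the upper triangle is computed and then mirrored, since the cosine is symmetric in its two arguments.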


Fig. 2. A typical UCS Matrix.

Fig. 3. Distribution of Lexical Chains within Transcript 120902TR.

4.2 Intermediate Segmentation

This stage builds the lexical chains, which are generated from selected features and grouped according to their semantic sense relations as they appear in the transcript. Details of the algorithm that generates these chains are found in [5]. The frequency of each chain is computed from the occurrences of its members in the window (Fig. 3). The lexical chain with the highest frequency is then identified and used to extend or slide the window, following the distribution of the topic chain members in the transcript (Fig. 4). As the window expands, it reaches a stage at which the members of that particular lexical chain no longer appear. This is the point at which the intermediate topic boundary is identified. This step is based on the algorithm of [15], who state that ‘a high concentration of chain-begin and end points between the two adjacent textual units is a good indication of a boundary point between two distinct news stories’.

Fig. 4. Sliding Window Effect.
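The window expansion can be pictured with the following sketch, which is an interpretation rather than the authors' procedure (see [5] for that). The function names, the initial window size `window`, and the tolerance `max_gap` (the number of consecutive utterances without a chain member that is still tolerated) are hypothetical parameters introduced for illustration.

```python
def dominant_chain(utterances, chains, start, end):
    """Pick the chain whose members occur most often in utterances[start:end]."""
    def frequency(chain):
        return sum(1 for u in utterances[start:end] for w in u if w in chain)
    return max(chains, key=frequency)


def intermediate_boundary(utterances, chains, start, window=2, max_gap=1):
    """Extend the window from `start` until the dominant chain fades away.

    Returns the index of the first utterance after the segment.
    """
    end = min(start + window, len(utterances))
    chain = dominant_chain(utterances, chains, start, end)
    gap, last_hit = 0, end
    for idx in range(end, len(utterances)):
        if chain & set(utterances[idx]):
            gap, last_hit = 0, idx + 1  # chain still active: grow the segment
        else:
            gap += 1
            if gap > max_gap:           # chain has faded: stop expanding
                break
    return last_hit


# Hypothetical pre-processed utterances (lists of nouns) and two chains.
utterances = [
    ["laptop", "screen"], ["laptop"], ["budget"], ["screen", "size"],
    ["budget", "cost"], ["cost"], ["budget"],
]
chains = [{"laptop", "screen", "size"}, {"budget", "cost"}]
first = intermediate_boundary(utterances, chains, 0)
second = intermediate_boundary(utterances, chains, first)
```

In this toy transcript the single off-topic utterance at index 2 is tolerated, so the first segment covers utterances 0 to 3 and the second covers the remainder.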

4.3 Final Segmentation

The final segmentation further refines the new segments by searching for speech cue phrases that either confirm the topic boundary or prompt a re-examination of the boundary of the segment. Unlike the domain-independent cues used by [11] and the domain-specific cues used by [19], our speech cue phrases were manually extracted from the corpus. Examples of these cues are shown in Table 2.

Table 2. Speech Cue Phrases Extracted from our Corpus.
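One possible reading of this refinement step is sketched below. The cue phrases listed are hypothetical placeholders and are not the entries of Table 2; the function name `refine_boundary` and the `search` radius are likewise assumptions. The sketch looks for an utterance near the proposed boundary that opens with a cue phrase and moves the boundary there; if none is found, the boundary is left unchanged.

```python
# Hypothetical cue phrases for illustration; not the entries of Table 2.
CUE_PHRASES = ("moving on", "next item", "another issue")


def refine_boundary(utterances, boundary, search=1):
    """Move `boundary` to a nearby utterance that opens with a cue phrase.

    `utterances` are raw strings; `search` is how many utterances on either
    side of the proposed boundary are inspected, nearest first.
    """
    lo = max(boundary - search, 0)
    hi = min(boundary + search, len(utterances) - 1)
    candidates = sorted(range(lo, hi + 1), key=lambda i: abs(i - boundary))
    for i in candidates:
        if utterances[i].lower().startswith(CUE_PHRASES):
            return i
    return boundary


transcript = [
    "the laptop screen is small",
    "yes the size is an issue",
    "Moving on, what about the budget",
    "the cost is too high",
    "agreed",
]
refined = refine_boundary(transcript, 3)
```

Here the boundary proposed at utterance 3 is moved back to utterance 2, where the speaker signals the topic change explicitly.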

5 Evaluation and Results

The segmentation was evaluated by comparing TTS against the TextTiling and C99 methods. Three evaluation metrics were used: $P_k$ [1], $P'_k$, and WindowDiff [17]. The results were very encouraging and showed that TTS outperformed both algorithms (Fig. 5).
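For readers unfamiliar with these metrics, the sketch below follows the commonly used formulations of the first metric of [1] and of WindowDiff [17]; the third metric is not covered. The representation of a segmentation as a list of 0/1 flags, one per gap between adjacent utterances, the function names, and the default window size (half the mean reference segment length) are assumptions of this sketch and not details of the evaluation reported here. Lower values indicate better agreement with the reference.

```python
def _window_counts(boundaries, k):
    """Number of boundaries inside each window spanning k consecutive gaps."""
    return [sum(boundaries[i:i + k]) for i in range(len(boundaries) - k + 1)]


def default_k(reference):
    """Half the mean reference segment length, at least 1."""
    segments = sum(reference) + 1
    units = len(reference) + 1
    return max(1, round(units / (2 * segments)))


def pk(reference, hypothesis, k=None):
    """Share of windows where only one segmentation separates the two ends."""
    k = k or default_k(reference)
    ref, hyp = _window_counts(reference, k), _window_counts(hypothesis, k)
    return sum((r == 0) != (h == 0) for r, h in zip(ref, hyp)) / len(ref)


def window_diff(reference, hypothesis, k=None):
    """Share of windows where the two boundary counts differ."""
    k = k or default_k(reference)
    ref, hyp = _window_counts(reference, k), _window_counts(hypothesis, k)
    return sum(r != h for r, h in zip(ref, hyp)) / len(ref)


# Seven utterances, hence six gaps; 1 marks a topic boundary in that gap.
reference = [0, 0, 1, 0, 0, 0]
near_miss = [0, 0, 0, 1, 0, 0]
extra = [0, 1, 1, 0, 0, 0]
scores = (pk(reference, near_miss, k=2), window_diff(reference, near_miss, k=2))
```

With the `extra` hypothesis, which adds a spurious boundary next to the correct one, WindowDiff counts two erroneous windows where the first metric counts one, which illustrates why [17] proposed it as a stricter measure.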

TextTiling was the weakest-performing algorithm on this corpus, possibly due to:

1. its lexical cohesion-based algorithm, which depends mainly on repetition. There are many cases in our transcripts where a few consecutive utterances include no word repetitions, and consequently TextTiling identified them as four different topics;

2. its dependence on sentence-based rather than utterance-based structure. The similarity measure used in TextTiling compares pairs of sentences, and is consequently relevant to structured and well-punctuated texts but unsuitable for our ill-structured corpus;

3. the unsuitability of using a fixed window size.

Fig. 5. Evaluation of TTS.


6 Conclusions

The TTS algorithm described in this paper is an iterative process that offers great potential for analysing transcribed meetings involving multi-party conversation. The study has extended the use of the cosine similarity measure to transcribed texts and improved the performance of lexical chaining methods and text segmentation algorithms by including complex semantic relations and speech-specific cue phrases.

Although the evaluation results highlighted the effectiveness of TTS compared to TextTiling and C99, there are a few limitations related to the handling of compound words and the POS tagging system used. The compound word identification algorithm developed in this study has, in some situations, given unsatisfactory results, as not all compound words are combinations of nouns. In addition, some compound words in the corpus, such as ‘high voltage line’ and ‘natural language processing’, were not automatically identified, partly due to the limitations of WMATRIX. Future work will attempt to resolve these problems.

References

1. Beeferman, D., Berger, A. and Lafferty, J.: Text Segmentation Using Exponential Models, Proceedings of EMNLP-2 (1997).
2. Beeferman, D., Berger, A. and Lafferty, J.: Statistical Models for Text Segmentation, Machine Learning, Special Issue on Natural Language Processing, Vol. 34, No. 1-3, (1999) 177-210.
3. Bengel, J., Gauch, S., Mittur, E. and Vijayaraghavan, R.: Chattrack: Chat Room Topic Detection Using Classification, Proceedings of the 2nd Symposium on Intelligence and Security Informatics (ISI-2004), Tucson, Arizona, (2004) 266-277.
4. Bilan, Z. and Nakagawa, M.: Segmentation of On-line Handwritten Japanese Text of Arbitrary Line Direction by a Neural Network for Improving Text Recognition, Proceedings of the Eighth International Conference on Document Analysis and Recognition, (2005) 157-161.
5. Chibelushi, C.: Text Mining for Meeting Transcripts Analysis to Support Decision Management, PhD thesis, Staffordshire University (2008).
6. Chibelushi, C., Sharp, B. and Salter, A.: Transcripts Segmentation Using Cosine Similarity Measure, In: B. Sharp (ed.), Proceedings of the 2nd International Workshop on Natural Language Understanding and Cognitive Science (NLUCS2005), Collocated with ICEIS-2005, Miami, USA (2005).
7. Choi, F., Wiemer-Hastings, P. and Moore, J.: Latent Semantic Analysis for Text Segmentation, Proceedings of the 6th Conference on Empirical Methods in Natural Language Processing, (2001) 109-117.
8. Choi, F. Y. Y.: Advances in Domain Independent Linear Text Segmentation, Proceedings of NAACL00, Seattle (2000).
9. Halliday, M. and Hasan, R.: Cohesion in English, Longman, London (1976).
10. Hearst, M.: Multi-paragraph Segmentation of Expository Text, Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, Las Cruces, New Mexico, (1994) 9-16.
11. Hirschberg, J. and Litman, D.: Empirical Studies on the Disambiguation of Cue Phrases, Computational Linguistics, Vol. 19, No. 3, (1993) 501-530.
12. Kan, M., Klavans, J. L. and McKeown, K. R.: Linear Segmentation and Segment Relevance, Proceedings of the Sixth Workshop on Very Large Corpora, (1998).
13. Levow, G.: Prosodic Cues to Discourse Segment Boundaries in Human-Computer Dialogue, Proceedings of the 5th SIGdial Workshop on Discourse and Dialogue, M. Strube and C. Sidner (eds.), ACL Publisher, USA, (2004) 93-96.
14. Manning, C.: Rethinking Text Segmentation Models: An Information Extraction Case Study, University of Sydney (1998).
15. Okumura, M. and Honda, T.: Word Sense Disambiguation and Text Segmentation Based on Lexical Cohesion, Proceedings of the 15th International Conference on Computational Linguistics (COLING-94), (1994) 755-761.
16. Passonneau, R. and Litman, D.: Discourse Segmentation by Human and Automated Means, Computational Linguistics, Vol. 23, No. 1, (1997) 103-139.
17. Pevzner, L. and Hearst, M.: A Critique and Improvement of an Evaluation Metric for Text Segmentation, Computational Linguistics, Vol. 28, No. 1, (2002) 19-36.
18. Renjie, J., Feihu, Q., Xu, L. and Wu, G.: Detecting and Segmenting Text from Natural Scenes with 2-Stage Classification, Proceedings of the Sixth International Conference on Intelligent Systems Design and Applications (ISDA'06), (2006) 819-824.
19. Reynar, J.: Statistical Models for Topic Segmentation, Proceedings of the Association for Computational Linguistics, ACL, College Park, USA, (1999) 357-364.
20. Reynar, J.: Topic Segmentation: Algorithms and Applications, PhD thesis, University of Pennsylvania (1998).
21. Senda, S. and Yamada, K.: A Maximum-likelihood Approach to Segmentation-based Recognition of Unconstrained Handwriting Text, Proceedings of the Sixth International Conference on Document Analysis and Recognition, (2001) 184-188.
22. Stokes, N.: Spoken and Written News Story Segmentation using Lexical Chains, Proceedings of HLT-NAACL, Student Research Workshop, Edmonton, (2003) 49-54.
23. Stokes, N.: Applications of Lexical Cohesion Analysis in the Topic Detection and Tracking Domain, PhD thesis, University College Dublin (2004).
24. Yamron, J., Carp, I., Gillick, L., Lowe, S. and Mulbregt, P. V.: A Hidden Markov Model Approach to Text Segmentation and Event Tracking, Proceedings of ICASSP'98, IEEE, Seattle, WA, (1998) 333-336.
25. Youmans, G.: A New Tool for Discourse Analysis: The Vocabulary Management Profile, In: Language, (1991) 763-789.
