Recommendation System for Product Test Failures Using BERT
Xiaolong Sun (1,2), Henrik Holm (2), Sina Molavipour (2,a), Fitsum Gaim Gebre (2), Yash Pawar (2,b), Kamiar Radnosrati (2,c) and Serveh Shalmashi (2)
1 KTH Royal Institute of Technology, Stockholm, Sweden
2 Ericsson AB, Stockholm, Sweden
{sina.molavipour, fitsum.gaim.gebre, henrik.holm, yash.pawar, kamiar.radnosrati, serveh.shalmashi}@ericsson.com
a https://orcid.org/0000-0002-6247-1217
b https://orcid.org/0000-0002-6523-1403
c https://orcid.org/0000-0001-9913-3652
* Contributed equally to this paper.
Keywords: Recommendation System, Information Retrieval, Language Models.
Abstract:
Historical failure records can provide insight into whether a similar situation has occurred before during software
troubleshooting. However, in the era of information explosion, massive amounts of data make
it unrealistic to rely solely on manual inspection of root causes, let alone mapping similar records.
With the ongoing development and breakthroughs of Natural Language Processing (NLP), we propose an
end-to-end recommendation system that can instantly generate a list of similar records given a new raw failure
record. The system consists of three stages: 1) general and tailored pre-processing of raw failure records;
2) information retrieval; 3) information re-ranking. In the process of model selection, we undertake a thor-
ough exploration of both frequency-based models and language models. To mitigate issues stemming from
imbalances in the available labeled data, we propose an updated Recall@K metric that utilizes an adaptive K.
We also develop a multi-stage training pipeline to deal with limited labeled data and investigate how different
strategies affect performance. Our comprehensive experiments demonstrate that our two-stage BERT model,
fine-tuned on extra domain data, outperforms all baseline models.
1 INTRODUCTION
A failure record is a document that outlines the de-
tails of software tests that have failed during the test-
ing phase. Usually, it contains out-of-range Key Per-
formance Indicators (KPIs), malfunctioning modules,
and configuration attributes. To ensure product qual-
ity, new features and upgrades are continuously eval-
uated in test loops at different levels. In this process,
it is inevitable to see failures of test cases that require
detailed investigations and troubleshooting. However,
the number of tests and the complexity of the data
pose a big challenge to manual inspections. There
has been an increasing demand for improving the ef-
ficiency and effectiveness of the testing process.
In light of advances in machine learning, espe-
cially in Natural Language Processing (NLP), tech-
niques can be leveraged to identify patterns and trends
in failure records. Previous papers (Grimalt et al.,
2022; Bosch et al., 2022) proposed solutions to handle duplicate identification in trouble reports, which
are more general compared with failure records. In
this paper, our goal is to prevent unnecessary replica-
tion of test cases and provide better insight into fail-
ures for testers. We develop an end-to-end system that
can investigate the historical data and recommend a
list of similar items given a new failure record. The
steps can be broken down as follows:
1. Pre-process the input data with both general
cleaning strategies and tailored cleaning strategies
adapted to the telecommunication domain;
2. Retrieve a set of top relevant candidates from a
large corpus of documents;
3. Reorder the candidates in the previous stage to get
a final list of ranked documents.
In order to find a suitable model for our system,
we explore both traditional frequency-based mod-
els and the state-of-the-art language model, Bidirectional
Encoder Representations from Transformers (BERT)
(Devlin et al., 2018). The frequency-based models
we use include Term Frequency–Inverse Document
Frequency (TF-IDF) (Salton and Buckley, 1988) and
BM25 (Sanderson, 2010). The baseline BERT-based
model we use is TeleRoBERTa (Holm, 2021), a pre-
trained RoBERTa (Liu et al., 2019) language model
adapted to the telecommunication domain.
TeleRoBERTa is not directly applicable to the
text ranking task. Therefore, we build a multi-stage
training strategy that incorporates the idea of trans-
fer learning (Torrey and Shavlik, 2010) to improve
the performance of our recommendation system. This
includes fine-tuning our baseline model on the MS MARCO document ranking dataset (Bajaj et al., 2016), which provides a corpus of 3.2 million documents and 300 thousand training queries, each mapped to a positive passage ID corresponding to its respective document ID within the corpus; the trouble reports dataset (Grimalt et al.,
2022), and our failure records dataset. For this pur-
pose, we use 23.5k trouble reports consisting of in-
ternal and external support tickets and incidents with
information about fault tags, fault headers, fault de-
scriptions, etc. Failure records, which are the main
focus of this work, contain more detailed information
about failure incidents during the testing process.
To enhance the interpretability of results, we pro-
pose a new metric called Adaptive Recall@K. The fi-
nal evaluation of our system includes the following
aspects:
• The performance of both frequency-based models and BERT-based models in the Information Retrieval (IR) stage.
• The result of models with and without re-ranking in the Re-ranking (RR) stage.
• The effectiveness of different one-stage and multi-stage training strategies.
The rest of the paper is structured as follows. We
provide preliminary knowledge of frequency-based
models and BERT-based models in Section 2. Sec-
tion 3 describes our end-to-end recommendation sys-
tem, including the data, the system architecture, the
multi-stage training pipeline, and our adaptive met-
rics. Then we present our experiments, results, and
analysis in Section 4, followed by the conclusion of
our paper in Section 5.
2 BACKGROUND
The task of building a recommendation system can
be viewed as a text ranking problem. In this paper,
we mainly focus on two types of models: frequency-
based models and BERT-based neural ranking mod-
els.
Frequency-based models utilize classic informa-
tion retrieval techniques and rely on statistical measurement to assign weights to each word, whereas
neural ranking methods leverage deep neural net-
works, such as the Transformer-based BERT lan-
guage model, to produce embeddings of given input
sequences. BERT, as a popular state-of-the-art model,
offers more precise and nuanced analysis compared to
traditional frequency-based models. However, when
faced with limited computational resources or train-
ing data, frequency-based models like TF-IDF and
BM25 still provide a practical and effective approach
for text analysis and retrieval tasks. This section pro-
vides detailed background information and existing
literature for the two models.
2.1 Frequency-Based Models
Frequency-based models determine a term’s impor-
tance within a corpus by its frequency of occur-
rence. This is achieved by representing the query and
document as vectors, using frequency-based metrics,
and assessing their similarity based on shared terms.
While these models are valued for their simplicity,
efficiency, and interpretability, they typically fail to
adequately capture the semantic relationship between
words and are sensitive to noise and outliers in the
data (El-Din, 2016). Considering these features, it
is conventional to use frequency-based models as a
baseline for further comparison (Grimalt et al., 2022).
TF-IDF is a widely used algorithm in information
retrieval that measures Term Frequency (TF) denoting
the importance of each term in a document, and In-
verse Document Frequency (IDF) indicating the rarity
of the term across the entire corpus of documents.
BM25 is a more sophisticated algorithm that in-
corporates additional parameters into its ranking for-
mula. Apart from TF and IDF, it also takes other fac-
tors into account, such as document length and the
average document length in the corpus. BM25 over-
comes certain drawbacks of TF-IDF, which include
its vulnerability to the influence of rare terms, and
its failure to account for variation in the document
length. Therefore, BM25 is a more potent method for
generating ranking results for specific types of queries
and documents.
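To make the contrast concrete, the following minimal sketch scores a toy corpus with both approaches. It assumes the scikit-learn and rank_bm25 Python packages purely for illustration; it is not the configuration used in our system.

# Minimal sketch of frequency-based retrieval with TF-IDF and BM25.
# Assumes scikit-learn and rank_bm25 are installed; not the exact setup used in this work.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from rank_bm25 import BM25Okapi

corpus = [
    "kpi out of range in throughput test",
    "configuration attribute missing for module x",
    "throughput kpi degraded after upgrade",
]
query = "throughput kpi failure"

# TF-IDF: weight terms by frequency in the document and rarity in the corpus,
# then rank documents by cosine similarity to the query vector.
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(corpus)
query_vector = vectorizer.transform([query])
tfidf_scores = cosine_similarity(query_vector, doc_vectors)[0]

# BM25: adds term-frequency saturation and document-length normalization.
bm25 = BM25Okapi([doc.split() for doc in corpus])
bm25_scores = bm25.get_scores(query.split())

print("TF-IDF:", tfidf_scores.round(3))
print("BM25:  ", bm25_scores.round(3))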
2.2 BERT-Based Neural Ranking
Models
Unlike traditional frequency-based models, neural
models learn the relevance of documents to a given
query. BERT is one of the most popular models used
for neural ranking. It uses a self-attention mechanism
(Vaswani et al., 2017) to capture the relationships be-
tween words in a sentence or document with two un-
Figure 1: End-to-end recommendation system architecture.
supervised learning tasks: Masked Language Model-
ing (MLM) and Next Sentence Prediction (NSP).
In general, a BERT framework consists of two
steps: pre-training and fine-tuning (Devlin et al.,
2018). A pre-trained model obtained from this process can then serve as a starting point and be fine-tuned for a new task or domain via transfer learning (Torrey and Shavlik, 2010).
In this paper, we fine-tune our model on a telecommu-
nication dataset using domain adaptation (Farahani
et al., 2021) and on a document ranking dataset us-
ing sequential learning (Aljundi et al., 2018) to adapt
to our use case.
To leverage BERT models in information retrieval
systems and semantic search, two architectures have
been suggested: the bi-encoder and the cross-encoder.
A bi-encoder consists of two encoders that share the
same weights and are trained jointly. The first en-
coder encodes the query sequence, while the second
one encodes the candidate sequence in the corpus. Af-
ter the output embeddings pass the aggregator, usu-
ally a pooling layer, each encoder produces a fixed-
length vector representation of the corresponding se-
quence as output. The similarity between the query
and the candidate is then calculated by the distance
between these two vectors using a similarity function,
such as cosine similarity. Sentence-BERT (Reimers
and Gurevych, 2019) is one of the most popular bi-
encoder models. A cross-encoder encodes both input
sentences jointly rather than separately. Mono-BERT
and Duo-BERT are examples of the cross-encoder
BERT model, where special tokens are added be-
tween the two sentences to form a single input. In addition to the BERT backbone, the cross-encoder includes a final classification layer
that takes the output representations and produces a
value ranging from 0 to 1 indicating how similar the
input sentence pairs are.
Joint encoding allows cross-encoders to capture
semantic and syntactic relationships between the two
input sentences more accurately (Lin et al., 2021).
Therefore, the cross-encoder usually has a better per-
formance than the bi-encoder. However, due to the
larger number of parameters and the requirement to
encode both input sentences simultaneously, cross-
encoders are typically more computationally expen-
sive than bi-encoders (Choi et al., 2021).
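To illustrate how the two architectures consume a query-candidate pair, the sketch below uses the Hugging Face transformers library with a generic bert-base-uncased checkpoint as a stand-in for TeleRoBERTa; mean pooling is shown as one possible aggregator, not necessarily the one used in our models.

# How bi- and cross-encoders consume a query-candidate pair (tokenizer-level sketch).
# 'bert-base-uncased' is a generic stand-in; mean pooling is one possible aggregator.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

query = "kpi out of range in throughput test"
candidate = "throughput kpi degraded after upgrade"

def embed(text: str) -> torch.Tensor:
    """Bi-encoder style: encode one sequence alone and mean-pool its token embeddings."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, hidden)
    return hidden.mean(dim=1).squeeze(0)              # fixed-length vector

# Bi-encoder: two independent passes, similarity via cosine distance.
sim = torch.cosine_similarity(embed(query), embed(candidate), dim=0)
print("bi-encoder cosine similarity:", sim.item())

# Cross-encoder: a single joint input "[CLS] query [SEP] candidate [SEP]";
# a classification head on top (omitted here) would map it to a relevance score.
joint = tokenizer(query, candidate, return_tensors="pt", truncation=True, max_length=512)
print("joint input tokens:", tokenizer.convert_ids_to_tokens(joint["input_ids"][0].tolist()))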
3 END-TO-END
RECOMMENDATION SYSTEM
Since both the bi-encoder and the cross-encoder have
their disadvantages, a traditional one-stage ranking
system usually fails to fully exploit their potential.
Therefore, a two-stage ranking system can be used to
make the best use of their strengths and avoid their
weaknesses. Figure 1 presents the architecture of
our end-to-end recommendation system consisting of
three parts: Pre-processing, IR, and RR. We also in-
troduce a multi-stage training strategy and some im-
proved metrics to enhance the performance and model
interpretability. Detailed information is provided in
this section.
3.1 Data Pre-Processing
In this paper, we analyze failure records generated
during the execution of tests in a Continuous Inte-
gration/Continuous Delivery (CI/CD) process. These
records contain stored information, including a test
case identifier, and logs detailing the faults, activity
types, and configurations. We prepare the failure record data by concatenating these fields and collect a database of 28k records.
Since the messages in failure records are machine-generated, all details are stored verbatim, which makes the records verbose and full of redundant or repeated information. However, BERT has
a limit on the input sequence length of 512 tokens.
Without pre-processing, we risk losing important in-
formation as the part exceeding the upper limit is
truncated directly. Therefore, our proposed data pre-
processing strategy aims to clean the text while pre-
serving the original information as much as possible.
Previous work has examined the heterogeneity of log files, which makes it difficult to extract useful information in a consistent format (Saneifar et al., 2009). To mitigate this problem, we develop both
Figure 2: Number of tokens before and after pre-processing.
The median values are 4840 and 183 respectively.
general cleaning algorithms and tailored cleaning al-
gorithms:
• In general cleaning, we remove duplicate lines, format whitespace, round decimals, and split different case styles to make them uniform.
• In tailored cleaning, the basic idea is to simplify technical term combinations in the telecommunication domain with the help of experts, but we apply different strategies to different fields to ensure only useful and important information is kept.
For instance, the raw log:
[ br />[(SUCCESS), Value of KPI is
5.46459972189E-6 (Criteria $measure >= 1)
[time 2022-10-04 07:22:58], state = 0
[ br />[(SUCCESS), Value of KPI is
5.46459972189E-6 (Criteria $measure >= 1)
[time 2022-10-04 07:22:58], state = 0
after pre-processing is converted to:
SUCCESS Value of KPI is 5.46E-6;
Criteria: >=1.
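The sketch below reproduces a few of the general cleaning steps (duplicate-line removal, whitespace normalization, decimal rounding) on a raw log like the one above; the regular expressions are illustrative choices, and the tailored per-field rules are not shown.

# Simplified sketch of the general cleaning steps on a raw failure-record log.
# The regular expressions here are illustrative; the tailored, per-field rules are omitted.
import re

def clean_log(raw: str) -> str:
    seen, lines = set(), []
    for line in raw.splitlines():
        line = line.strip()
        if not line or line in seen:        # drop empty and duplicate lines
            continue
        seen.add(line)
        # round long scientific-notation numbers, e.g. 5.46459972189E-6 -> 5.46E-6
        line = re.sub(
            r"(\d+\.\d+)([eE][+-]?\d+)",
            lambda m: f"{float(m.group(1)):.2f}{m.group(2)}",
            line,
        )
        line = re.sub(r"\s+", " ", line)     # normalize whitespace
        lines.append(line)
    return "\n".join(lines)

raw = """[ br />[(SUCCESS), Value of KPI is 5.46459972189E-6 (Criteria $measure >= 1)
[time 2022-10-04 07:22:58], state = 0
[ br />[(SUCCESS), Value of KPI is 5.46459972189E-6 (Criteria $measure >= 1)
[time 2022-10-04 07:22:58], state = 0"""
print(clean_log(raw))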
Figure 2 provides a visual representation of the number of tokens before and after data pre-processing; note that the token counts are plotted on a logarithmic scale. The number of tokens per record decreases significantly: the median value drops by over 95% after pre-processing, which means our strategy effectively removes many redundant tokens.
3.2 Information Retrieval
In the IR stage, given a search space of size M, an
initial set of N documents is retrieved, where N ≪ M.
The primary goal of this stage is to identify the top-
N relevant documents related to a user’s query while
maintaining efficiency and scalability. In other words,
the order of the relevant documents is not the main
concern; in fact, the most relevant documents do not
necessarily need to be at the top of the retrieved list.
Instead, the focus is on quickly processing large vol-
umes of data without significantly sacrificing accu-
racy. To achieve this, frequency-based and bi-encoder
models are employed for the IR stage, given their ef-
ficiency. Cosine similarity is used to compare the em-
beddings and retrieve the top-N relevant records.
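A minimal sketch of this stage is given below: corpus embeddings are computed once offline, and each incoming query needs only one encoding pass followed by a cosine-similarity ranking. The sentence-transformers library and the checkpoint name are placeholders, not our fine-tuned bi-encoder.

# Sketch of the IR stage: embed the corpus once, then retrieve the top-N records
# for a query by cosine similarity. Placeholder model name; not our fine-tuned bi-encoder.
from sentence_transformers import SentenceTransformer, util

N = 3  # number of candidates passed on to the re-ranking stage
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [
    "kpi out of range in throughput test",
    "configuration attribute missing for module x",
    "throughput kpi degraded after upgrade",
    "latency kpi within limits, test passed",
]
corpus_embeddings = bi_encoder.encode(corpus, convert_to_tensor=True)  # computed offline

query = "throughput kpi failure after software upgrade"
query_embedding = bi_encoder.encode(query, convert_to_tensor=True)

# semantic_search returns the top-N corpus entries ranked by cosine similarity
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=N)[0]
for hit in hits:
    print(f"{hit['score']:.3f}  {corpus[hit['corpus_id']]}")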
3.3 Re-Ranking
In the RR stage, the initially retrieved N documents
are re-ordered according to their relevance to the
user’s query, producing a final list of top-K results,
where K ≤ N (Grimalt et al., 2022). In contrast to the
IR stage, the RR stage places great emphasis on ac-
curately ordering the documents as the model will be
evaluated based on both the overall accuracy and spe-
cific accuracy, such as top-1. It is worth mentioning
that the IR stage successfully narrows down the pool
of candidates to the relevant documents, thereby re-
ducing the number of options and facilitating accept-
able ranking time.
As explained earlier, cross-encoders are good can-
didates for re-ranking due to their higher accuracy
than bi-encoders. Comparing all pairs of query-
candidate for all records is computationally expensive
as the inference process requires passing a query and
candidate simultaneously through the BERT model.
However, as the number of candidates is limited to N,
the complexity of the end-to-end retrieval can be con-
trolled by choosing a smaller N. The cross-encoder
model is trained with a binary classification objective and re-
turns a score between 0 and 1, reflecting the similar-
ity between the two documents. Finally, candidates
are sorted according to scores computed in this stage.
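Continuing the sketch from the IR stage, the re-ranking step below scores each retrieved candidate jointly with the query and sorts by that score; the cross-encoder checkpoint name is again a public placeholder rather than our fine-tuned model.

# Sketch of the RR stage: score the N retrieved candidates with a cross-encoder
# and sort them to produce the final top-K list. Placeholder checkpoint name.
from sentence_transformers import CrossEncoder

K = 2
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "throughput kpi failure after software upgrade"
candidates = [
    "throughput kpi degraded after upgrade",
    "kpi out of range in throughput test",
    "configuration attribute missing for module x",
]

# Each (query, candidate) pair is encoded jointly; predict returns one score per pair.
scores = cross_encoder.predict([(query, cand) for cand in candidates])
reranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)[:K]
for cand, score in reranked:
    print(f"{score:.3f}  {cand}")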
3.4 Multi-Stage Training
To prepare models, we employ domain adaptation
and sequential learning. Domain adaptation transfers a model trained on a source domain to a target domain with different character-
istics. Sequential learning focuses on learning and
adapting to new tasks without forgetting previously
learned information. Figure 3 shows the multi-stage
training pipeline, where the plain boxes represent
bi-encoders and the dashed boxes represent cross-
encoders. For bi-encoders, the proposed approach
involves a sequential fine-tuning process for TeleR-
oBERTa, a pre-trained telecommunications-specific
language model that is integrated into both bi-encoder
and cross-encoder structures. To further fine-tune the
Figure 3: Multi-stage training pipeline. The plain boxes represent bi-encoders and the dashed boxes represent cross-encoders.
bi-encoder model, we use the MS MARCO docu-
ment ranking dataset for sequential learning. Then,
we explore two domain adaptation strategies: one is
to directly fine-tune the model on the failure records
dataset; the other is to fine-tune the model first on the trouble reports dataset and then on the failure records dataset. For cross-encoders, we investigate do-
main adaptation of directly fine-tuning TeleRoBERTa
on the failure records dataset.
3.5 Model Evaluation Metrics and
Interpretability
3.5.1 Adaptive Recall@K
Recall@K is a commonly used evaluation metric in
information retrieval and machine learning that mea-
sures the fraction of relevant items retrieved in the top
K results of a recommendation algorithm. The math-
ematical formula is:
Recall@K =
N
K
N
total
, (1)
where N
K
represents the number of relevant items re-
trieved in top K and N
total
is the total number of rele-
vant items.
However, standard Recall@K is not insightful in our setting, as the number of relevant records in the
ground truth varies widely between samples. As can
be seen in Figure 4, certain failure records have more
than 100 relevant documents, while other records
are linked to only 1 document. To clarify the chal-
lenges this creates, consider a sample with 100 related
records. Calculating the Recall@10 for this sample
can at its maximum return a recall rate of 0.1, even
if all 10 retrieved records are included in the ground
truth. To mitigate this issue, we propose an adaptive
score called Adaptive Recall@K (AR@K). The for-
mula is presented in the next section.
Figure 4: Number of relevant documents for each record.
3.5.2 Improved Model Interpretability
To evaluate our recommendations comprehensively and improve the interpretability of our model, we introduce the following two matching metrics:
• Strict Matching: We want to evaluate the overall
accuracy of the model, i.e., what is the percent-
age of relevant items among the top-K retrieved
records? Thus, all K items must be relevant to
get the full score. For example, if we have Re-
call@10 = 0.5, it shows that 50 percent of the top-
10 retrieved items are relevant. Mathematically,
the recall rate is computed as:
AR@K_Strict = N_K / min(K, N_total),  (2)

where the denominator saturates when there are many relevant items, avoiding the underestimation described above.
• Single Matching: We evaluate the accuracy of
retrieving from another perspective. Instead of
matching all the records, we focus on the possi-
bility of getting at least one relevant item among
the top-K retrieved records. Here, only one item
needs to be relevant to get the full score. For ex-
ample, if we have Recall@10 = 0.5, it shows that
we have a 50 percent chance of finding at least one
relevant item in the retrieved 10 candidates. Since
the matching criterion is a little different, the for-
mula of the recall rate also changes:
AR@K_Single = 1 if there is at least one relevant item in the top-K, and 0 otherwise,  (3)

where we count the recall rate as 1 as long as at least one relevant record is found in the retrieved top-K documents.
For both cases, we calculate the average recall rate
over K to get the final result.
Since queries and related target samples are inter-
changeable, a query can retrieve itself and inflate the
scores. Therefore, we detect self-retrieval during evaluation and do not reward models for such matches.
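The sketch below shows how AR@K can be computed under both matching criteria and averaged over test queries, assuming the ranked retrieval lists and ground-truth sets are already available and self-retrieval has been filtered out.

# Sketch of Adaptive Recall@K under strict and single matching.
# Assumes retrieved lists are already ranked and self-retrieval has been removed.
from typing import Dict, List, Set

def ar_at_k_strict(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k that is relevant, with the denominator capped at |relevant|."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / min(k, len(relevant))

def ar_at_k_single(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """1.0 if at least one relevant item appears in the top-k, else 0.0."""
    return 1.0 if any(doc in relevant for doc in retrieved[:k]) else 0.0

def average_ar_at_k(results: Dict[str, List[str]], ground_truth: Dict[str, Set[str]],
                    k: int, strict: bool = True) -> float:
    """Average AR@K over all test queries."""
    score = ar_at_k_strict if strict else ar_at_k_single
    return sum(score(results[q], ground_truth[q], k) for q in results) / len(results)

# Toy example: one query with two relevant records.
results = {"q1": ["r7", "r2", "r9"]}
ground_truth = {"q1": {"r2", "r5"}}
print(average_ar_at_k(results, ground_truth, k=3, strict=True))   # 0.5
print(average_ar_at_k(results, ground_truth, k=3, strict=False))  # 1.0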
4 EVALUATION
In this section, detailed information on the data split
and training setups is provided. Then we present and
analyze the results under two testing scenarios using
our adaptive metrics.
4.1 Dataset Split
In our dataset of about 28k failure records, only
670 records have been labeled by domain experts,
meaning that they have been linked to other (simi-
lar) records. From these 670 labeled samples, a 50/50
train-test split is created. The number of records in the
train and test splits is displayed in Table 1. Moreover,
models are evaluated in the following scenarios:
1. Standard Testing: In this setting, both query and
search records are taken from the test set. In ad-
dition to the test set records, we randomly select
additional (unlabelled) 1,670 records from the full
dataset. These 1,670 records are included in nei-
ther the train nor the test split. They are added
to the search corpus of the test split to increase
its search space, thereby better mimicking a real-
world information-retrieval situation. These sam-
ples are denoted as distractor records, as they
serve to make the retrieval task more challenging.
2. Extended Testing: In this setting, the query
records are taken from the pre-defined test set. In
this scenario, however, the search records are ex-
tended with all unlabelled records in the dataset
(i.e., not only the 1,670 distractor records men-
tioned in the Standard Testing scenario). This in-
troduces a greater challenge, as the test corpus would
Table 1: Number of records in train and test dataset under standard and extended scenarios.

Dataset         Standard Testing    Extended Testing
Train           336                 336
Test (query)    334                 334
Test (corpus)   2,004               27,955
incorporate all available data except for the sam-
ples found in the training dataset.
4.2 BERT Training Details
In the IR stage, the bi-encoder TeleRoBERTa model
is first fine-tuned on 45k query-document pairs randomly selected from the MS MARCO dataset for 4 epochs. For both the trouble reports dataset and the failure records dataset, the model is fine-tuned for 10 epochs. All fine-tunings use a learning rate of 10^-5
and the Multiple Negatives Ranking Loss (MNRL)
function (Henderson et al., 2017) which is designed
for the case when only positive (relevant) pairs are
available.
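A minimal sketch of one such fine-tuning round with MNRL, using the sentence-transformers training API, is shown below; the checkpoint name and the query-document pairs are placeholders, and only the stated learning rate and epoch count are carried over.

# Sketch of one bi-encoder fine-tuning round with Multiple Negatives Ranking Loss.
# Placeholder checkpoint and data; in MNRL, the other pairs in a batch act as negatives.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for TeleRoBERTa

# Positive (query, relevant document) pairs only; MNRL needs no explicit negatives.
train_examples = [
    InputExample(texts=["throughput kpi failure", "throughput kpi degraded after upgrade"]),
    InputExample(texts=["missing configuration attribute", "configuration attribute missing for module x"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=4,
    optimizer_params={"lr": 1e-5},
    warmup_steps=10,
)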
In the RR stage, the cross-encoder TeleRoBERTa
model is fine-tuned on the failure records dataset for 5 epochs with a learning rate of 2 × 10^-5. For each record, the model is provided with one positive document sample and three negative document samples. The loss function used is Binary Cross
Entropy loss.
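The cross-encoder fine-tuning can be sketched analogously: each query is paired with one positive and several negative documents labeled 1 and 0, and trained with a binary objective (with num_labels=1, sentence-transformers defaults to a binary cross-entropy loss). The checkpoint name and samples below are placeholders.

# Sketch of cross-encoder fine-tuning with binary labels (1 = relevant, 0 = not relevant).
# Placeholder checkpoint and data; with num_labels=1 a binary cross-entropy objective is used.
from torch.utils.data import DataLoader
from sentence_transformers import InputExample
from sentence_transformers.cross_encoder import CrossEncoder

cross_encoder = CrossEncoder("distilroberta-base", num_labels=1)  # stand-in for TeleRoBERTa

query = "throughput kpi failure after software upgrade"
train_examples = [
    InputExample(texts=[query, "throughput kpi degraded after upgrade"], label=1.0),   # positive
    InputExample(texts=[query, "configuration attribute missing for module x"], label=0.0),
    InputExample(texts=[query, "latency kpi within limits, test passed"], label=0.0),
    InputExample(texts=[query, "memory leak detected in logging module"], label=0.0),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=4)

cross_encoder.fit(
    train_dataloader=train_dataloader,
    epochs=5,
    optimizer_params={"lr": 2e-5},
    warmup_steps=10,
)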
4.3 Results and Analysis
The results of the experiments are presented in Table
2, depicting the one-stage and two-stage results using
standard testing and extended testing. In general, with
an increasing K, the recall rate using single matching
increases owing to a wider pool of candidates enhanc-
ing the likelihood of retrieving at least one relevant
item. Conversely, the recall rate using strict matching
decreases since a larger candidate pool makes it more
demanding to hit all the relevant items.
4.3.1 Frequency-Based Models vs. One-Stage
BERT-Based Models
In the standard testing, the overall recall rates do not show a significant difference between the two kinds of models, while BERT_MS+TR+FR has the highest accuracy under both matching scenarios (68.26%). However, as K increases, the difference becomes small, especially when we use single matching. For almost all models, the probability of retrieving at least one relevant record among the top 10 exceeds 80%.
Table 2: Results of frequency-based models and BERT-based models using standard and extended testing scenarios.

Model                      Matching    Standard Testing                 Extended Testing
                                       AR@1(1)   AR@5      AR@10        AR@1      AR@5      AR@10
TF-IDF                     Strict      66.47%    61.63%    58.84%       35.63%    27.87%    27.05%
                           Single      66.47%    87.72%    91.02%       35.63%    63.47%    76.65%
BM25                       Strict      67.66%    61.54%    59.72%       40.12%    34.84%    33.03%
                           Single      67.66%    88.02%    91.02%       40.12%    71.86%    82.93%
BERT(2)                    Strict      57.49%    48.26%    45.24%       35.03%    28.60%    27.84%
                           Single      57.49%    73.95%    79.94%       35.03%    61.98%    70.06%
BERT_MS                    Strict      65.57%    58.36%    56.23%       42.22%    33.75%    31.78%
                           Single      65.57%    85.33%    88.02%       42.22%    68.86%    76.35%
BERT_MS+FR                 Strict      64.67%    54.47%    49.02%       39.82%    29.30%    26.42%
                           Single      64.67%    85.03%    87.13%       39.82%    63.17%    72.46%
BERT_MS+TR+FR(3)           Strict      68.26%    61.10%    58.06%       39.82%    33.79%    30.85%
                           Single      68.26%    87.43%    89.52%       39.82%    69.16%    77.84%
BERT_MS+TR+FR + BERT_FR    Strict      71.26%    69.28%    64.18%       54.49%    43.54%    39.24%
                           Single      71.26%    88.02%    90.72%       54.49%    77.84%    83.23%

(1) AR@K in both strict and single matching modes is averaged over all test queries.
(2) All BERT models in the table are TeleRoBERTa.
(3) MS, TR, and FR are respectively short for MS MARCO, Trouble Report, and Failure Record.
We can also observe that the baseline BERT model
performs worse than frequency-based models. Simi-
lar behavior has been reported in (Çekiç et al., 2022).
In the extended testing, when we try to find the
most relevant record, BERT_MS outputs the best re-
sult on both matching metrics (42.22%). BM25 pro-
duces competitive results and surpasses all the one-
stage BERT models when we start evaluating the top-
5. However, recommending too many documents can
be a distraction during the execution of testing, which
means we should carefully choose the value for K.
4.3.2 One-Stage vs. Two-Stage BERT-Based
Models
We use the cross-encoder TeleRoBERTa fine-tuned
on failure records dataset as our second-stage model.
The improvement in the recall rate is especially re-
markable for a larger search corpus. Compared with
the one-stage model, the recall rate of our two-stage
model increases by about 3%-8% in the standard testing and 8%-14% in the extended testing using strict matching. Even though BERT_MS+TR+FR has already
achieved satisfactory results in both cases using sin-
gle matching, we can still observe a gain after the RR
stage. All these results show that the RR stage is able
to reorder the retrieved list more accurately.
4.3.3 Comparison of Multi-Stage Models
In the standard testing, the effect of our multi-stage
training shows an approximate 11% gain over the
baseline BERT model. To be specific, sequential
learning makes the greatest contribution with an 8%
increase. In the extended testing, however, BERT_MS
has the best performance.
We can also observe that BERT_MS+FR has a com-
paratively low increase in the recall rate in both test-
ing cases, indicating that domain adaptation is not as
effective as sequential learning. There are two pos-
sible reasons accounting for this: 1) high overlap
between failure records and telecommunication data
makes it difficult to achieve a significant outcome; 2)
models tend to forget the previous task and domain
when learning the new ones due to catastrophic for-
getting (Bosch et al., 2022).
5 CONCLUSION
In this paper, we presented an end-to-end recommen-
dation system for failed test records using a two-stage
BERT model, including data pre-processing, IR, and
RR. We proposed a multi-stage training pipeline for a
pre-trained BERT model to learn different tasks and
domains. To enhance interpretability, we introduced
an adaptive metric and two test cases. In the IR stage,
our BERT model, which went through the complete training pipeline, outperformed frequency-based mod-
els like TF-IDF and BM25, demonstrating the effec-
tiveness of our multi-stage learning strategy. In the
RR stage, the two-stage model showed significant im-
provement over one-stage models, providing highly
accurate failure record recommendations. In practice,
while BM25-based systems are cost-effective and re-
main strong candidates, our end-to-end recommenda-
tion system offers a more accurate alternative solution
and is worth pursuing in the long run.
Recent approaches in text ranking attempt to
leverage the in-context learning capability of large
language models (LLMs). Soft Prompting (Peng
et al., 2023) addresses the challenge of insufficient
domain-specific training data for dense retrieval by
using soft prompt-tuning to generate weak queries
and subsequently training task-specific dense retriev-
ers. Pairwise Ranking Prompting (Qin et al., 2023)
aims at enhancing the ranking performance of LLMs
by reducing prompt complexity. Exploring
the practical implications and potential challenges of
these techniques when faced with real-world data re-
mains a promising avenue for future research. Addi-
tionally, further benchmark testing can shed light on
the comparison between large language models and
conventional methods and is considered an extension
of this study.
REFERENCES
Aljundi, R., Rohrbach, M., and Tuytelaars, T. (2018).
Selfless sequential learning. arXiv preprint
arXiv:1806.05421.
Bajaj, P., Campos, D., Craswell, N., Deng, L., Gao, J., Liu,
X., Majumder, R., McNamara, A., Mitra, B., Nguyen,
T., et al. (2016). Ms marco: A human generated ma-
chine reading comprehension dataset. arXiv preprint
arXiv:1611.09268.
Bosch, N., Shalmashi, S., Yaghoubi, F., Holm, H., Gaim, F.,
and Payberah, A. H. (2022). Fine-tuning bert-based
language models for duplicate trouble report retrieval.
In 2022 IEEE International Conference on Big Data
(Big Data), pages 4737–4745. IEEE.
Çekiç, T., Manav, Y., Helvacıoğlu, B., Dündar, E. B., Deniz, O., and Eryiğit, G. (2022). Long form question an-
swering dataset creation for business use cases using
noise-added siamese-bert.
Choi, J., Jung, E., Suh, J., and Rhee, W. (2021). Improv-
ing bi-encoder document ranking models with two
rankers and multi-teacher distillation. In Proceedings
of the 44th International ACM SIGIR Conference on
Research and Development in Information Retrieval,
pages 2192–2196.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.
(2018). Bert: Pre-training of deep bidirectional trans-
formers for language understanding. arXiv preprint
arXiv:1810.04805.
El-Din, D. M. (2016). Enhancement bag-of-words model
for solving the challenges of sentiment analysis. Inter-
national Journal of Advanced Computer Science and
Applications, 7(1).
Farahani, A., Voghoei, S., Rasheed, K., and Arabnia, H. R.
(2021). A brief review of domain adaptation. Ad-
vances in Data Science and Information Engineering:
Proceedings from ICDATA 2020 and IKE 2020, pages
877–894.
Grimalt, N. M. I., Shalmashi, S., Yaghoubi, F., Jonsson, L.,
and Payberah, A. H. (2022). Berticsson: A recom-
mender system for troubleshooting.
Henderson, M., Al-Rfou, R., Strope, B., Sung, Y.-H., Lukács, L., Guo, R., Kumar, S., Miklos, B., and
Kurzweil, R. (2017). Efficient natural language re-
sponse suggestion for smart reply. arXiv preprint
arXiv:1705.00652.
Holm, H. (2021). Bidirectional encoder representations
from transformers (bert) for question answering in
the telecom domain: Adapting a bert-like language
model to the telecom domain using the electra pre-
training approach.
Lin, J., Nogueira, R., and Yates, A. (2021). Pretrained trans-
formers for text ranking: Bert and beyond. Synthesis
Lectures on Human Language Technologies, 14(4):1–
325.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D.,
Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov,
V. (2019). Roberta: A robustly optimized bert pre-
training approach. arXiv preprint arXiv:1907.11692.
Peng, Z., Wu, X., and Fang, Y. (2023). Soft prompt tun-
ing for augmenting dense retrieval with large language
models. arXiv preprint arXiv:2307.08303.
Qin, Z., Jagerman, R., Hui, K., Zhuang, H., Wu, J.,
Shen, J., Liu, T., Liu, J., Metzler, D., Wang, X.,
et al. (2023). Large language models are effective
text rankers with pairwise ranking prompting. arXiv
preprint arXiv:2306.17563.
Reimers, N. and Gurevych, I. (2019). Sentence-bert: Sen-
tence embeddings using siamese bert-networks. arXiv
preprint arXiv:1908.10084.
Salton, G. and Buckley, C. (1988). Term-weighting ap-
proaches in automatic text retrieval. Information pro-
cessing & management, 24(5):513–523.
Sanderson, M. (2010). Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press. 2008. ISBN-13 978-0-521-86571-5, xxi + 482 pages. Natural Language Engineering, 16(1):100–103.
Saneifar, H., Bonniol, S., Laurent, A., Poncelet, P., and
Roche, M. (2009). Mining for relevant terms from
log files. In KDIR’09: International Conference
on Knowledge Discovery and Information Retrieval,
pages 77–84.
Torrey, L. and Shavlik, J. (2010). Transfer learning. In
Handbook of research on machine learning appli-
cations and trends: algorithms, methods, and tech-
niques, pages 242–264. IGI global.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I.
(2017). Attention is all you need. Advances in neural
information processing systems, 30.