
(Q henceforth) and user query (q henceforth). But in recent times, researchers have leaned towards more robust modeling that also involves FAQ answers (referred to as A henceforth) and similarity comparison (Mass et al., 2020). This method tends to give better results because one comparison can compensate for lexical gaps in the other (q-Q or q-A).
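As a minimal, hedged sketch of this idea, the snippet below scores an FAQ pair by interpolating q-Q and q-A cosine similarities computed from sentence embeddings; the embedding model name and the weight alpha are illustrative assumptions rather than settings used in this paper.

from sentence_transformers import SentenceTransformer, util

# Illustrative embedding model; any sentence encoder could be substituted.
model = SentenceTransformer("all-MiniLM-L6-v2")

def faq_score(query, faq_question, faq_answer, alpha=0.5):
    # Encode the user query (q), FAQ question (Q), and FAQ answer (A).
    q, Q, A = model.encode([query, faq_question, faq_answer], convert_to_tensor=True)
    sim_qQ = util.cos_sim(q, Q).item()  # q-Q similarity
    sim_qA = util.cos_sim(q, A).item()  # q-A similarity
    # A weighted combination lets one comparison cover lexical gaps in the other.
    return alpha * sim_qQ + (1 - alpha) * sim_qA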
Labeled data is essential for training a model to predict the relationship between user queries and FAQ questions. A dataset of this kind is often created manually or collected through query logs (Mass et al., 2020). The FAQIR dataset, in particular, contains question-answer pairs that are not labeled against queries; instead, a relevance score is assigned to each query.
The goal of the proposed system is to train different models for the task of retrieving frequently asked questions (FAQs) and to assess their performance in order to provide the most accurate model. This comprehensive approach aims to enhance the effectiveness and precision of the system in handling user queries and retrieving relevant FAQs. The significant contributions of the paper are as follows:
1. We developed a FAQ retrieval model and experimented with various ranking techniques [weighted measures, re-ranking after initial retrieval] to rank the top FAQ pairs. We also explored and implemented various techniques for generating embeddings in order to compute query-question and query-answer similarity.
2. We trained and evaluated various models on the FAQ retrieval task in order to provide the most accurate model.
3. We built a website using HTML, CSS, JavaScript, and Flask and integrated our final model [BM25 q(Q+A) + BERT qQ training] with it (a minimal sketch of this pipeline is shown after this list). The resulting end-to-end website returns the top answers from the FAQIR dataset for a user query, along with 5 FAQ pairs similar to that category.
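For illustration only, the sketch below shows a two-stage pipeline of this kind: BM25 retrieval over the concatenated question and answer text (q-(Q+A)), followed by BERT-based re-ranking on (query, FAQ question) pairs. The rank_bm25 and sentence-transformers libraries, the pretrained cross-encoder checkpoint, and the toy FAQ pairs are assumptions made for the example, not the exact components of our system.

from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

# Toy FAQ pairs (question, answer); placeholders, not taken from FAQIR.
faq_pairs = [
    ("How do I reset the thermostat?", "Hold the reset button for five seconds."),
    ("Why is my dryer not heating?", "Check the thermal fuse and the heating element."),
]

# Stage 1: BM25 index over the concatenation of each FAQ question and answer (q-(Q+A)).
corpus = [f"{q} {a}".lower().split() for q, a in faq_pairs]
bm25 = BM25Okapi(corpus)

def retrieve(query, top_k=10):
    scores = bm25.get_scores(query.lower().split())
    ranked = sorted(range(len(faq_pairs)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]

# Stage 2: re-rank candidates with a BERT cross-encoder over (query, FAQ question)
# pairs, standing in for the fine-tuned "BERT qQ" model.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidate_ids):
    pairs = [(query, faq_pairs[i][0]) for i in candidate_ids]
    scores = reranker.predict(pairs)
    order = sorted(zip(candidate_ids, scores), key=lambda p: p[1], reverse=True)
    return [i for i, _ in order]

query = "dryer runs but clothes stay cold"
best = rerank(query, retrieve(query))[0]
print(faq_pairs[best])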
The paper is structured as follows: Section 2 reviews work related to FAQ question answering systems, Section 3 gives an overview of the techniques used for the task, Section 4 presents experiments and results, and Section 5 concludes the paper.
2 RELATED WORK
FAQ models only need to retrieve the relevant FAQ pairs instead of producing a complete context-specific answer. These FAQ pairs consist of a question and an answer. The correspondence between the query and the FAQ pairs is determined by comparing the query to the questions, the answers, or the concatenation of both. For supervised approaches, appropriate class labels must be available to learn to rank the FAQ pairs. Recent approaches, shown in Table 1, use both supervised and unsupervised techniques for the FAQ retrieval task.
Unsupervised methods can be more practical, as they require no labeling of the data. (Sakata et al., 2019) proposes a supervised technique for FAQ retrieval that leverages the TSUBAKI model for q-Q similarity and BERT for q-A matching. (Mass et al., 2020) compensates for the lack of query-question matching training data with a novel technique that generates question paraphrases. For re-ranking, it uses Elasticsearch and passage re-ranking, and finally ranks on the basis of query-answer and query-question similarity, using BERT to train the query-question and query-answer similarity models.
(Piwowarski et al., 2019) uses an attention mechanism for FAQ retrieval, comparing various aggregation methods for effectively representing query, question, and answer information. Attention mechanisms are observed to be consistently the most effective way to aggregate the inputs for ranking, and attentive matching eliminates the need for feature engineering to combine query-answer and query-question embeddings. (Jeon et al., 2005) assumed that if answers demonstrate semantic resemblance, their associated questions will also possess a comparable level of similarity. The authors employed different similarity metrics, including cosine similarity with TF-IDF weights, the LM-Score, and a symmetric version of the LM-Score. The LM-Score measures semantic similarity by converting answers into queries and using query-likelihood language modeling for retrieval, with higher scores indicating stronger semantic connections between answers. However, the resulting scores are not symmetric. To address this, a modification known as the Symmetric LM-Score is introduced, which uses ranks instead of scores: the similarity between answers A and B is determined by the reverse harmonic mean of their respective ranks, giving a balanced assessment for question-answering systems.
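As a hedged illustration, one formulation consistent with this description (the exact normalization in (Jeon et al., 2005) may differ) writes the symmetric score as the reciprocal of the harmonic mean of the two ranks:
\[
\mathrm{SymLM}(A, B) \;=\; \frac{1}{2}\left(\frac{1}{r_A(B)} + \frac{1}{r_B(A)}\right),
\]
where $r_A(B)$ is the rank of answer $B$ in the list retrieved when $A$ is converted into a query, and $r_B(A)$ is defined analogously. Because it depends only on ranks, the measure is symmetric and insensitive to the absolute scale of the language-model scores.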
3 DATASETS USED
3.1 FAQIR Dataset
We used the FAQIR (Karan and Šnajder, 2016) dataset for evaluation, which is derived from the “maintenance & repair” domain of the Yahoo! An-