Utilization of Clustering Techniques and Markov Chains for Long-Tail

Item Recommendation Systems

Diogo Vin

ıcius de Sousa Silva

1,2 a

, Davi Silva da Cruz

1 b

, Diego Corr

ea da Silva

2 c

ao Paulo Dias de Almeida

2 d

and Frederico Ara

ujo Dur

2 e

Federal Institute of Maranh

ao - IFMA, Coelho Neto, Brazil

Department of Computing, Federal University of Bahia - UFBA, Salvador, Brazil

Keywords:

Recommender Systems, Clustering, Markov Chains, Assessment Questionnaire, Long-Tail.

Abstract:

The primary goal of this paper is to develop recommendation models that guide users to niche but highly

relevant items in the long tail. Two major clustering techniques and representing matrices through graphs

are explored for this. The ﬁrst technique adopts Markov chains to calculate similarities of the nodes of a

user-item graph. The second technique applies clustering to the set of items in a dataset. The results show

that it is possible to improve the accuracy of the recommendations even by focusing on less popular items,

in this case, niche products that form the long tail. The recall in some cases improved by about 27.9%,

while the popularity of recommended items has declined. In addition, the recommendations to contain more

diversiﬁed items indicate better exploitation of the long tail. Finally, an online experiment was conducted

using an evaluation questionnaire with the employees of the HomeCenter store, providing the dataset. The

aim is to analyze the performance of the proposed algorithms directly with the users. The results showed that

the evaluators preferred the proposed algorithms, demonstrating the proposed approaches’ effectiveness.

1 INTRODUCTION

Recommendation strategies are used in various con-

texts to bring potential users closer to products with

a high probability of interest. In e-commerce, for

example, the global market is expected to be worth

6.3 trillion dollars by the end of 2024 - up from 5.8

trillion dollars in 2023, (Snyder, 2024). However, a

Recommender System (RS) can negatively impact the

diversity of items recommended to users, leading to

most recommendations being based on a small diver-

sity of products that have signiﬁcant success among

other system users. Since most users show interest in

a particular item, there is a high probability that a new

user will also be interested in that item. Following this

logic, it is natural for less popular items to be recom-

mended less and consumed less by users (Bobadilla

et al., 2013).

Considering the importance of niche groups,

https://orcid.org/0000-0002-5683-5133

https://orcid.org/0009-0008-3965-5764

https://orcid.org/0000-0001-7132-1977

https://orcid.org/0000-0002-6617-6696

https://orcid.org/0000-0002-7766-6666

which have particular interests, the issue of long-

tail recommendations opens up possibilities for study-

ing techniques that improve the performance of such

recommendations, not only in recommendation rele-

vance but also in the other aspects mentioned earlier,

such as popularity and diversity. Because long-tail

items are unpopular products, the accuracy of these

recommendations tends to be lower than for items that

are consumed by the majority of users (Qin, 2021).

Since long-tail items are less popular, they have fewer

ratings from other users and consequently provide

less input for calculating a more accurate recommen-

dation. In addition, low popularity can lead to more

unsold products, resulting in ﬁnancial losses and the

risk of obsolescence or expiry, especially for items

with a limited shelf life, such as specialty foods. In

the context of online stores, the term “inﬁnite inven-

tory” has emerged, which refers to the set of products

that are little consumed but have loyal customers.

With the constant growth of the area, major tech-

nology companies also started to invest resources in

developing technologies for RS. Research by compa-

nies like Google, Netﬂix, and Amazon showed that

such investments could bring good returns for com-

panies and their customers (Gomez-Uribe and Hunt,

Silva, D. V. S., Silva da Cruz, D., Corrêa da Silva, D., Dias de Almeida, J. P. and Durão, F. A.

Utilization of Clustering Techniques and Markov Chains for Long-Tail Item Recommendation Systems.

DOI: 10.5220/0012936700003825

In Proceedings of the 20th International Conference on Web Information Systems and Technologies (WEBIST 2024), pages 47-58

ISBN: 978-989-758-718-4; ISSN: 2184-3252

2015). New services and business models were con-

solidated based on the technologies developed, help-

ing companies to grow in the market, speciﬁcally e-

commerce companies like Amazon.com, since recom-

mendations facilitated product sales. At the same

time, customers and users of these companies also

beneﬁted from the advantages and conveniences that

such technologies offered, as the generated recom-

mendations saved time and money through sugges-

tions of relevant items.

According to (Anderson, 2006), companies gen-

erally focus their sales on the most popular products

because it makes logistics easier. If we imagine a

traditional store with a physical structure to receive

customers, it’s easy to understand that it’s cheaper

to put the best-selling products on the most apparent

shelves. However, with the advent of online stores,

the cost of organizing products on shelves is non-

existent. In the context of online stores, the term “inﬁ-

nite inventory” has emerged, where featured products

can be selected according to each user’s preference

online. This virtual user may not necessarily have the

same preference as most other system users.

Alongside “inﬁnite inventory”, the phenomenon

of the “long tail” emerges. This term refers to prod-

ucts that are rarely consumed but have their niche

clientele, meaning niche items. Typically, these prod-

ucts make up the majority of the store’s inventory.

This large set of products (in the long tail) is respon-

sible for only a minority of sales. On the other hand,

only a few products are responsible for most sales

made (Brynjolfsson, 2011). An analogy is Pareto’s

rule, which states that 80% of consequences come

from 20% of causes, meaning 80% of sales would be

concentrated in only 20% of the inventory.

This paper investigates the lack of long-tail rec-

ommendation approaches that prioritize relevance, di-

versity, and popularity of recommended items. The

approach in this paper prioritizes the discovery of

niche items while directing them to their target audi-

ence. For this, a hybrid approach based on two tech-

niques is used. The ﬁrst is clustering with dynamic

parameters that adapt according to the dataset used,

and the second is a type of Markov chain to calculate

the interest distance between the user and an item.

The rest of the paper is organized as follows. Sec-

tion 2 provides a related work for this paper. Sec-

tion 3 presents two hybrid approaches to the rec-

ommendation in long-tail contexts, focusing on the

diversity and relevance of recommended items. In

Section 4, the experiments conducted to evaluate the

proposed approaches will be presented. Section 5

presents the results of the experimental evaluations of

the HTCL and P-HTCL approaches. The assessment

of the questionnaire applied to business domain users

is shown in Section 6. In Section 7, we conclude this

paper by presenting the effectiveness of the proposed

techniques and discussing future work.

2 RELATED WORK

2.1 Long Tail Recommendation

(Abdollahpouri et al., 2019) researched the problem

of popular items bias being frequently ranked highly

in RS. The study focused on ﬁnding ways to address

this issue by increasing the number of recommended

long-tail items. The authors approached the problem

from the users’ perspective, analyzing how popular-

ity bias causes recommendations to deviate from what

the user expects to obtain from the RS. Experimen-

tal results show that, in many recommendation algo-

rithms, the recommendations users receive are highly

concentrated on popular items, even if a user is inter-

ested in long-tail items.

(Qin et al., 2020) summarizes in a survey the

progress of research on long-tail item recommenda-

tion methods, which began with clustering in 2008

and evolved into methods based on deep learning

by 2020. Their work reviews the problem of long-

tail item recommendation. It describes the main

techniques used in this area, such as clustering,

graphs, multi-objective optimization, and deep learn-

ing, among others. Additionally, it points out future

directions associated with this theme by discussing

trends in long-tail item recommendation research.

(Qin, 2021) provides an update of the work by (Qin

et al., 2020), expanding it with a deeper understanding

of existing techniques for long-tail item recommenda-

tions. (Wang et al., 2006) present an optimized solu-

tion for recommending long-tail items. They make

recommendations using graph structures to represent

the relationships between users and items. The scores

of the items are also compared to determine their sim-

ilarity. The authors assume that a user should evaluate

similar items with similar scores. Our proposal also

uses graphs; however, the weighted edges will be used

to calculate the path from an item node to a user node,

whereas (Wang et al., 2006) uses edges to calculate

only the similarity between items.

(Yin et al., 2012) developed four variations of an

algorithm for long-tail item recommendations. The

basic algorithm of the proposal is the Hitting Time,

where users and items are represented in a disjoint,

indirect, and bipartite graph. An adjacency matrix

is obtained from this graph. The graph’s edges are

weighted and illustrate the relevance of the connec-

WEBIST 2024 - 20th International Conference on Web Information Systems and Technologies

tion between a user and an item (i.e., the user’s rat-

ing for that item). To calculate the proximity of un-

rated items, the author computes the hitting time us-

ing a type of Markov chain called “random walk”

(Bolch et al., 1998). The random walk calculates the

probability of a user reaching an unrated item. The

higher the probability, the lower the hitting time, and

therefore, the item should have higher priority in the

recommendation. Transition matrices are computed

from a probability matrix. The probability matrix is

obtained from the adjacency matrix. The study by

(Luke et al., 2018) proposes a new recommendation

system based on tripartite graphs, which aims to sug-

gest long-tail items. The proposed system, the Ex-

tended Tripartite Graph System, improves the per-

formance of existing long tail recommendation ap-

proaches, measured by widely used performance met-

rics: recall and diversity. The study highlights the

importance of algorithms such as Hitting Time and

collaborative ﬁltering approaches to improve long tail

recommendations. The results indicate that the pro-

posed algorithm improved long-tail item recommen-

dations, with improvements in Recall@N and diver-

sity.

2.2 Clusterization in Recommendation

Systems

(Pang et al., 2019) proposes an improved algorithm

for an RS where not only accuracy but also the cov-

erage of recommendation items are considered. For

this, the authors use a weighted similarity measure

based on genetic algorithms and achieve good results

in the context of long-tail item recommendations. In

the work of (Yang et al., 2021), a study has been

carried out to correct a common problem in recom-

mendations that use clustering and scalability. To this

end, co-clustering techniques and clustering models

are proposed to make the clustering latent. A score

used for each user-item pair is calculated based on

their afﬁnities with the clusters. Thus, using the mini-

mum afﬁnity (between the user and the item) for each

cluster, the authors show that the method improves

recommendation performance on large-scale datasets

with millions of users and items with considerably

smaller model sizes. (Yadav et al., 2022) addresses

a recommendation model that aims to increase the

probability of retrieving unusual and new items in rec-

ommendation lists while maintaining user relevance.

The proposed methodology, Clus-DR (Cluster-based

Diversity Recommendation), uses the individual di-

versity of users and a pre-trained model to generate

diverse recommendations. Instead of relying on a re-

classiﬁcation approach, the model uses different clus-

tering techniques to group users with similar diversi-

ties. Experimental results with data sets from various

domains show that the proposed methodology main-

tains an acceptable level of accuracy.

The clustering works mentioned above are based

on user clustering, which is used to apply a second

technique and generate recommendations. Generally,

the focus is on improving the accuracy of recommen-

dations or system performance. Our proposal differs

in its focus on clustering, which is not performed on

the dataset users but rather on the items. Additionally,

our focus is not on increasing scalability and perfor-

mance, as in some of the works presented above, but

on suggesting less popular items with more diversity

and accuracy.

2.3 Techniques for Long Tail

Recommendations

(Lin et al., 2022) propose a method that uses graphs

and neural networks to build dynamic and static rep-

resentations for social recommendation. To do this,

they consider dynamic and static representations of

users and items and incorporate their relational in-

ﬂuence to model the user’s interest in a given item.

The work of (Abdelkhalek et al., 2022) presents a

hybrid proposal for collaborative ﬁltering using be-

lief function theory (BFT), where from the cluster of

each item, the K neighboring items are selected as

similar and contribute to the recommendation. The

overall similarity between the neighbors is chosen

for future user recommendations. (Sreepada and Pa-

tra, 2021) propose two approaches inspired by econo-

physics, based on selective injection of ratings into

long-tail items using existing rating information. The

data is then used to provide recommendations. Test

results with real-world data show that the proposed

approaches outperform existing techniques in mitigat-

ing the long-tail effect and show no signiﬁcant drop in

accuracy.

The technique employed in most of the work pre-

sented above is Collaborative Filtering. However, as

long-tail items receive minimal ratings from users, the

tendency is for these items to be inadequately recom-

mended to users. This problem can be mitigated by

identifying the long-tail items and incorporating them

into the recommendations generated by Collaborative

Filtering. Therefore, our work emphasizes these char-

acteristics. We recognize that recommending long-

tail items involves identifying them and relevant niche

items and directing them to the appropriate users.

Utilization of Clustering Techniques and Markov Chains for Long-Tail Item Recommendation Systems

3 HYBRID PROPOSALS FOR

LONG TAIL

RECOMMENDATION

This section will present two approaches that combine

strategies aimed at solving the recommendation prob-

lem in long-tail contexts A recommendation strategy

focused on the long tail should direct suggestions to-

ward items with high diversity and low popularity yet

highly relevant to the user. To address these char-

acteristics and based on the analyses and investiga-

tions conducted, two hybrid RS approaches were de-

veloped in this paper using the Hitting Time algorithm

as a basis. The ﬁrst approach is called Hitting Time

Clustered (HTCL), and the second is Personalized -

Hitting Time Clustered (P-HTCL), with the latter be-

ing an evolution of the former.

3.1 Hitting Time Clustered

In the Hitting Time algorithm (Yin et al., 2012), the

set of users U and items M are represented in a bi-

partite graph, which has its corresponding adjacency

matrix. In Figure 1, we have an example representa-

tion of the graph with nodes that can be items or users.

The edges of these nodes have a weight that reﬂects

the user’s score for the item to which the edge con-

nects it. When there is no edge connecting two nodes,

it means that the weight is 0 (zero). For example, user

node U

is connected to item node M

. In the matrix

in Figure 1, we can see that the weight of this edge is

5, meaning that this user’s rating for item M

is 5. The

weight of an edge is represented by a(i, j) with some

predeﬁned settings. The variables i and j represent

the nodes of the graph in a matrix A = (a(i, j)

, j∈

The variable V represents the set of graph vertices.

Figure 1: Representation of users and items through a bi-

partite graph and its respective adjacency matrix (Yin et al.,

2012).

To calculate the proximity of two nodes in the

graph, the algorithm uses a type of Markov chain

called Random Walk (Bolch et al., 1998). A random

walk has a current state (a node in the graph), and

with each time step, the state changes, i.e., visiting

other nodes in the graph. In Hitting Time, this path

is guided by the edge weights, and the random walk

terminates its journey when it reaches the destination

node. The algorithm relies on a probability matrix

calculated from the adjacency matrix. The formula 1

shows how the edge weights are determined:

i, j

= P(s(t + 1) = j|s(t) = i) =

a(i, j)

, (1)

where d

∑

j=1

a(i, j). (2)

The ﬁrst proposed approach, HTCL

(de Sousa Silva et al., 2020), comprises two distinct

yet complementary solutions. This approach falls

under the hybridization type called castata (Burke,

2002). Next, we present the combination of the

Hitting Time algorithm with a clustering technique,

optimizing it for long-tail item recommendation.

With the Hitting Time algorithm being used to

generate recommendations, we apply an extension to

enhance the targeting towards long-tail items. We

divide the set of items into short-head and long-tail

items. At this point, we used Pareto’s rule (Yamashita

et al., 2015) as a parameter to separate the more pop-

ular items from the less popular ones. All long-tail

items are clustered, considering the average score of

each item.

The average ratings of long-tail items are taken

into account. Those with higher scores will have

higher priority in recommendation and carry more

weight. These items impact the decrease in the value

of the Hitting Time. On the other hand, items with

lower scores are grouped into clusters that will be

used to weigh the value of the Hitting Time, making

it slightly higher. The score of an item is the average

of the ratings given by all users who rated it and will

be represented in this work by the letter S. Since the

possible rating values range from 1 to 5, there will be

4 clusters, one for each score range. The similarity

of items within the cluster is calculated based on their

score, as shown in the ﬁrst column of Table 1.

3.2 Personalized-Hitting Time

Clustered

The approach of this algorithm involves calculating

dynamically the optimal values to be used as Adjust-

WEBIST 2024 - 20th International Conference on Web Information Systems and Technologies

Table 1: Clustering of dataset items about mean ratings

(score) and its respective adjustment factor(AF).

Score ClusterFit AF (%)

1 ≤ S < 2 A +20

2 ≤ S < 3 B +10

3 ≤ S < 4 C -10

4 ≤ S ≤ 5 D -20

ment Factor (AF). The algorithm selects a set of ideal

values based on some dynamic tests. Subsequently,

the set of AF values that achieves the best results in

recommendations for long-tail items is chosen. Thus,

depending on the dataset and user evaluations, this

same model can use different values for the AF of

clusters. That is, there will be customization of the

AF for each domain, hence the name of the approach:

Personalized Hitting Time Clustered (P-HTCL).

To ﬁnd the ideal values for the AF, the algorithm

evaluates several combinations and compares them

with each other. As the intention is to explore recom-

mendations for long-tail items, the comparisons are

based on the diversity of the recommendations gener-

ated with the combination of the AF. Two sequences

are considered for the clusters’ AF values to test the

combinations. Table 2 shows the sequence with the

algorithm’s values to ﬁnd the optimal FA value for

each of the 4 clusters. The ﬁrst is an exponential se-

quence S based on 2, that is, S =[2

, 2

....]. Thus, the sequence of values for the AF to

be tested in cluster A is S = [1, 2, 4, 8, 16, 32, 64, 128

...]. The second sequence is applied to cluster B. In

this case, the sequence is the natural numbers N = [0,

1, 2, 3, 4, 5, 6, 7, 8 ...]. For the values of clusters C

and D, the same sequences are used as clusters B and

A, respectively. However, in these clusters, the per-

centages are negative. Thus, the sequence for cluster

C, for example, is N = [0, -1, -2, -3, -4, -5, -6, -7, -8

...]. Similarly, cluster D is N = [-1, -2, -4, -8, -16, -32,

-64, -128 ...].

A diversity measurement is made for each test us-

ing the sequences of values shown in Table 2. Each

measurement is compared with the results obtained

in the previous tests. It is natural for the ﬁrst tests

to reach values closer to the Hitting Time algorithm.

This is due to the low inﬂuence of the AF, as the AF

values are still small. Figure 2 shows that a diversity

peak occurs at some point. In this example, the peak

happens at test number 4. The following values tend

to decrease. The peak represents the ideal value the

algorithm will use to generate the recommendations.

Table 2: Sequence of values analyzed by the P-HTCL algo-

rithm in search of the optimal AF set.

Cluster T1 T2 T3 T4 T5 T6 T7

A 1 2 4 8 16 32 64

B 0 1 2 3 4 5 6

C 0 -1 -2 -3 -4 -5 -6

D -1 -2 -4 -8 -16 -32 -64

Figure 2: Graph relating the level of diversity of the rec-

ommendations to the tests carried out with a set of different

values for the AF.

4 EXPERIMENTAL EVALUATION

In this section, the experiments conducted to evaluate

the proposed approaches will be presented. The ap-

proaches aim to generate recommendations focusing

on long-tail items, leading users to a greater diversity

of products while maintaining high relevance. The

following sections will detail the experimental evalu-

ation setup.

4.1 Methodology

This section shows how the evaluation of the pro-

posed models was conﬁgured and the organization of

the ﬁve experiments. To avoid inﬂuencing the results,

the averages for each baseline were calculated indi-

vidually and used the same computational environ-

ment.

Several experiments were carried out during this

research. However, to avoid excessive length in this

paper, we will present only 3 experiments. Addition-

ally, each of these experiments was executed using 3

different metrics. However, for the same reason, we

will only show the results with the recall metric here.

The third experiment we will present consists of an

online experiment where an evaluation questionnaire

is administered to users in the business domain. Two

datasets were used, one for each ofﬂine experiment.

Utilization of Clustering Techniques and Markov Chains for Long-Tail Item Recommendation Systems

4.2 Baseline

To analyze the effectiveness of the Hitting Time Clus-

tered (HTCL) approach, three comparison methods

(baselines) were deﬁned, namely:

• Hitting Time (HT) - The Hitting Time algorithm

proposed by (Yin et al., 2012);

• Hitting Time + Clustering All Dataset (HTCA)

- The Hitting Time Algorithm plus clustering is

similar to our approach. The difference here is the

lack of dataset splitting. In other words, there was

no separation of the items between long tail and

short head, so the clustering occurred across the

entire dataset;

• Hitting Time + Clustering Short Tail Dataset

(HTCS) - Also similar to the approach used in this

work, but the dataset was split inversely. Instead

of clustering the items located in the long tail, in

this baseline, we cluster only the items present in

the short head of the dataset.

To evaluate the performance of the P-HTCL ap-

proach, we compare it with Hitting Time (HT) and

Hitting Time Clustered (HTCL).

4.3 Datasets

In this section, details of the datasets used in this

study, including MovieLens and HomeCenter, will be

presented. The MovieLens dataset is widely recog-

nized and used in recommendation research, while

the HomeCenter dataset was specially organized for

this study based on data from an actual construction

store. The analysis of these datasets will provide valu-

able insights for the evaluation of the proposed ap-

proaches.

4.3.1 MovieLens

This dataset was organized by the GroupLens Re-

search group (Harper and Konstan, 2015) and was the

ﬁrst to be used in the experiments in this paper. The

MovieLens dataset is used in this research to measure

metrics and analyze the results of all baselines in the

experiments. It’s important to note that this dataset al-

ready contains information about user ratings on the

items of interest, which includes the relationships be-

tween users and items through ratings.

The domain of this dataset is related to movies,

with a quantity of 100,000 ratings and a density of

6.3%. Thus, we have a sparse matrix where most

users have not rated most movies. However, only

users rated at least 20 ﬁlms from the dataset were

selected. This dataset contains approximately 1,683

movies rated by around 943 users. More details about

MovieLens are provided in Table 3.

MovieLens also includes metadata such as user

age, gender, occupation, and movie category, which

were not used in this paper but could be utilized in

future work, as presented in Section 7.1.

4.3.2 HomeCenter

This paper’s HomeCenter dataset was organized

based on data from an actual store. This store is a ma-

jor retailer in the construction industry, where sales

are made physically. The dataset contains approxi-

mately 9,138 users and 2,955 purchase items, produc-

ing 71,296 sales. Table 3 compares the two datasets

presented here.

In addition to being from another domain, this

dataset differs from MovieLens in that it does not con-

tain user ratings for items. Therefore, an inference

model for ratings was deﬁned based on the quantity

of repeated items purchased by customers. In other

words, to determine the ratings in this dataset, we con-

sidered the number of purchases made by the same

customer for a particular product. We then list all the

products purchased by each customer, sorting them by

the number of times each one was bought. Then, we

normalized the quantity of items on a scale of 1 to 5,

thus arriving at the same scale used in the MovieLens

dataset. Therefore, the more purchases of a particular

item, the greater the user’s preference. This heuris-

tic was used as a guide for preparing the dataset and

inferring the ratings.

Table 3: Comparison between MovieLens and HomeCenter

datasets.

MovieLens HomeCenter

Number of users 943 9.138

Number of items 1683 2.955

Number of

ratings

100.000 0

Domain Films

Building

materials

Sparsity 6,3% 0,7%

4.4 Metrics

This section presents the evaluation metrics used to

verify the performance of the proposed approaches

in the algorithms. Each dataset was divided into two

subsets, one for training and the other for testing and

evaluating the recommendations generated. The re-

call is a fundamental metric for evaluating recommen-

dation systems, indicating the number of items of in-

WEBIST 2024 - 20th International Conference on Web Information Systems and Technologies

terest to the user present in the list of recommenda-

tions. While popularity helps assess the performance

of approaches for less in-demand items.

4.4.1 Recall

The Recall@N metric was used to evaluate the accu-

racy of the proposed approaches (Yin et al., 2012).

Recall is an index that indicates the number of items

of interest to the user in the list of recommendations,

ranging from 0 to 1, with values closer to 1 indicating

a better recommendation.

Calculating Recall@N involves counting how

many times an item M appears within the top@N re-

sults, as shown in Equation 3:

Recall@N =

∑

hit@N

|L|

, (3)

where |L| is the number of test cases. The Recall met-

ric represents the relevance of the recommendations

and is calculated by the proportion between the rele-

vant and recommended items and the total number of

recommended items.

4.4.2 Diversity

The diversity metric was used to assess the distri-

bution of recommended long-tail items and reduce

the inﬂuence of popular items on recommendations.

With high diversity, less popular items are expected

to be recommended to users, allowing the discovery

of new items in the long tail. This metric determines

the top@N items recommended to a speciﬁc group of

users. To calculate diversity, we check how many re-

peated items appear once and then calculate the ratio

to the total, according to the equation 4:

Diversity =

|UI

∈ I|

|U|top@N

(4)

where I

is the set of unique items recommended for

all users, the element I is the data set, U represents the

set of users, and top@N is the recommended number

of items for each user (N represents the number of

items that the algorithm will return as a recommenda-

tion to the user, always from the most relevant to the

least appropriate).

4.4.3 Popularity

The popularity metric evaluated the number of long-

tail and short-tail items recommended to users. Ana-

lyzing the popularity of recommended items in con-

junction with other metrics provides a more detailed

analysis of the performance of the recommendation

algorithm. This metric calculates the frequency of a

given item and is based on the ratio of the number of

ratings compared to the other ratings in the dataset.

The diversity and popularity metrics can measure the

extent to which the recommendation is aimed at long-

tail items.

For each user, the average popularity is calculated

according to the equation 5:

Popularity =

∑

(5)

where,

∑

|U|top@N

(6)

The ranking deﬁned for the entire dataset is repre-

sented by R

. In contrast, the set U represents the set

of users chosen for calculating popularity and top@N,

as well as the number of items suggested for each user

in the set U. Equation 6 R

shows a normalized index

using the number of users and the items suggested for

each one.

5 RESULTS

Figure 3 shows the evolution of P-HTCL tests com-

pared to HT and HTCL. In all top@N, the P-HTCL

approach achieves better results than the other two

approaches. The best performance is achieved in

top@25 when P-HTCL exceeds HTCL by 148%.

Figure 3: Recall of top@N items from the MovieLens

dataset in 500 test cases.

Table 5 shows the results of the recall metrics for

each top@N and baseline. As highlighted in bold,

note that in the top@25 and top@30, the recall of the

P-HTCL approach is twice as good as the HTCL ap-

proach.

Figure 4 shows that P-HTCL performed best in

all top@N of the three approaches compared. In sec-

ond place, with a signiﬁcant distance from P-HTCL,

Utilization of Clustering Techniques and Markov Chains for Long-Tail Item Recommendation Systems

Table 4: Results of the recall technique from top@5 to

top@50 from MovieLens dataset run on all baselines: HT

(Hitting Time), HTCL (Hitting Time Clustered) and P-

HTCL (Personalized-Hitting Time Clustered).

Baselines

HT HTCL P-HTCL

top@05 0,0401 0,0484 0,1007

top@10 0,0656 0,0740 0,1794

top@15 0,0901 0,1019 0,2716

top@20 0,1087 0,1241 0,3495

top@25 0,1350 0,1658 0,4117

top@30 0,1658 0,2085 0,4588

top@35 0,2020 0,2596 0,4913

top@40 0,2473 0,3113 0,5226

top@45 0,2981 0,3722 0,5501

top@50 0,3634 0,4479 0,5762

is HTCL, which appears very close to the last-place

HT.

Table 5 shows the absolute values that help us bet-

ter analyze the results. P-HTCL stands out in top@5

and top@20, performing best compared to the other

baselines. At top@5, P-HTCL obtained a recall met-

ric value of 0.40656. Comparing it with the second-

best, HTCL, with 0.40010, we see an improvement

of 1.61%. Regarding the last-place HT, which only

reached 0.39216, P-HTCL surpassed it by 3.67%.

Looking at top@20, P-HTCL reached 0.48186 in

the recall metric against HTCL’s 0.47502, which is

1.44% better than the second-best approach. Com-

pared to the baseline with the worst performance, HT

with 0.47436, P-HTCL obtained an improvement of

1.58%

Table 5: Results of the recall from top@5 to top@50

from HomeCenter dataset run on all baselines: HT (Hit-

ting Time), HTCL (Hitting Time Clustered) and P-HTCL

(Personalized-Hitting Time Clustered).

Baselines

HT HTCL P-HTCL

top@05 0,39216 0,40010 0,40656

top@10 0,44174 0,44116 0,44766

top@15 0,45948 0,45948 0,46510

top@20 0,47436 0,47502 0,48186

top@25 0,48720 0,48852 0,49312

top@30 0,50098 0,50330 0,50696

top@35 0,51886 0,52066 0,52416

top@40 0,52756 0,53068 0,53332

top@45 0,54530 0,54426 0,54850

top@50 0,56632 0,56788 0,57130

As for the diversity and popularity metrics, the P-

Figure 4: Recall of top@N items from the HomeCenter

dataset in 500 test cases.

HTCL algorithm showed better results than the oth-

ers, HT and HTCL. Using the MovieLens dataset, the

P-HTCL approach stood out at all top@N levels, es-

pecially top@10. P-HTCL performed best in diver-

sity at this level, as shown in Table 6, the diversity

metric results for each approach.

Table 6: Results of the execution diversity metric in all

baselines: HT (Hitting Time), HTCL (Hitting Time Clus-

tered) and P-HTCL (Personalized-Hitting Time Clustered).

Baselines

HT HTCL P-HTCL

top@10 0,0535 0,0530 0,0552

top@20 0,0409 0,0408 0,0429

top@30 0,0344 0,0348 0,0369

top@40 0,0305 0,0308 0,0332

top@50 0,0273 0,0277 0,0309

Figure 5 illustrates the evolution of the three ap-

proaches, highlighting the superiority of P-HTCL in

diversity, especially as the number of recommenda-

tions increases.

Figure 5: Diversity metric results in Movielens 100k using

200 random users.

In addition, analysis of the results revealed that

WEBIST 2024 - 20th International Conference on Web Information Systems and Technologies

in the top@50, P-HTCL’s diversity was the greatest,

which indicates the algorithm’s greater ability to of-

fer more varied and less biased recommendations. In

terms of popularity, P-HTCL managed to reduce the

popularity of recommendations, reaching a value of

1.7095 in top@20, which represents a decrease of

4.16% compared to HTCL, as shown in Table 7.

Table 7: Results of the popularity metric executed in all

baselines: HT (HittingTime), HTCL (HittingTimeClus-

tered), and P-HTCL (Personalized-HittingTime Clustered).

Baselines

HT HTCL P-HTCL

top@10 1,7546 1,7533 1,7095

top@20 1,2494 1,2429 1,1912

top@30 0,9157 0,9098 0,8737

top@40 0,7250 0,7206 0,6947

top@50 0,6135 0,6075 0,5829

6 complements this analysis by showing the evo-

lution of the popularity metric. P-HTCL not only im-

proves diversity but also focuses on recommending

less popular items, which is desirable in long-tail rec-

ommendation systems.

Figure 6: Popularity metric results on Movielens 100k using

500 random users.

Using the HomeCenter dataset, P-HTCL contin-

ued to excel in diversity, performing best at all top@N

levels, outperforming the other approaches. P-HTCL

not only offered diverse recommendations but also

managed to reduce the popularity of recommended

items. At top@50, the approach achieved a popular-

ity of 0.5829, decreasing by 4.99% compared to HT,

which obtained the worst result, as shown in Table 7.

It appears that P-HTCL had better results in diversity

and popularity, which improves the recommendation

of long-tail items. The decrease in popularity shows

that recommendations are more effective with niche

items. Thus, the P-HTCL approach is a favorable so-

lution for improving the quality of recommendations.

The tests are also conducted 100 times in the ex-

Figure 7: Flow of steps for applying the evaluation ques-

tionnaire.

periments to provide statistical conﬁdence. From this

data, it is statistically possible to determine if there

is a signiﬁcant difference between the means found.

The experiments used the AD (Anderson Darling) test

to analyze adherence to normality. After ensuring the

means have a normal distribution, we use a parametric

test called the T-Test. For the comparison of the aver-

ages is considered the p-value < 0.05. All results of

the evaluated approaches are compared with the other

baselines. In all comparisons, the obtained p-value is

less than 0.001. Thus, all results of the metrics pre-

sented were statistically different.

6 EVALUATION WITH

QUESTIONNAIRE

After performing ofﬂine experiments to evaluate the

HTCL and P-HTCL approaches, it is also essential to

analyze the performance of these algorithms directly

with users. In this experiment, the two approaches

proposed in this paper (HTCL and P-HTCL) and the

HT algorithm were evaluated. An online experiment,

through an evaluation questionnaire, was applied to

store employees that provide the HomeCenter dataset,

i.e., business domain users.

6.1 Questionnaire Execution Steps

To make it easier to understand the procedures carried

out during the execution of this experiment, Figure 7

shows a schematic of the steps involved in the ques-

tionnaire application process.

Utilization of Clustering Techniques and Markov Chains for Long-Tail Item Recommendation Systems

Figure 8: The ﬁrst part of the evaluation questionnaire

presents the client’s proﬁle.

6.2 Questionnaire Conﬁguration

The questionnaire consists of a spreadsheet contain-

ing questions about the recommendations generated

for the customers selected in the experiment. To make

it easier to understand, we’ll divide this spreadsheet

into three sections and explain each. Figure 8 shows

the ﬁrst section of the spreadsheet, which includes

a list of 10 previously purchased items. To do this,

10 customers who had previously purchased from the

store were randomly selected from the HomeCenter

dataset; their personal data was anonymized for this

experiment. Each of these customers received 9 prod-

uct recommendations, generated by running the al-

gorithms (HT, HTCL, and P-HTCL) for each of the

10 customers, resulting in 3 recommendations per ap-

proach.

Figure 9 shows the second part of the question-

naire, in which the evaluators express their opinions

on the recommendations generated by each technique

evaluated. In the questionnaire, the three techniques

compared (HT, HTCL, and P-HTCL) are identiﬁed by

the letters A, B, and C. At the top of Figure 9, there is

a brief explanation instructing the evaluator, followed

by a statement that must be answered according to the

set of answers listed at the top. Thus, for each recom-

mendation generated by the techniques, the evaluator

selects an answer based on the following statement:

The recommendation presented is in line with the

customer’s interests, and the store would undoubtedly

use it to make offers to the customer during a visit. In

Figure 9, the places where the evaluators provide their

answers are highlighted in green. Exemplary answers

have been inserted in these places in the illustration

for educational purposes only. The Likert scale (Lik-

ert, 1932) was used to parameterize and assign val-

ues to the evaluation responses. On this scale, evalu-

ators assign graded (ordered) answers to each evalu-

ated question. The questionnaire was administered to

two evaluators, and for each of them, ten customers

were randomly selected so that the recommendations

generated for them could be analyzed.

Figure 9: The second part of the evaluation questionnaire

presents the form with the recommendations generated by

each technique.

Figure 10: The third part of the evaluation questionnaire

presents the form with the recommendation techniques or-

dered by the evaluator.

Finally, based on the recommendations generated

for clients, the evaluator establishes an order of per-

formance for the techniques, as illustrated in Figure

10. This creates a ranking in which each method

is positioned according to the evaluator’s assessment

of its effectiveness and suitability. The values high-

lighted in green in Figure 10 are random examples

for illustrative purposes.

The questionnaire was subjected to an evaluation

of the degree of agreement and reliability using the

general and categorized 5-point Kappa.

WEBIST 2024 - 20th International Conference on Web Information Systems and Technologies

6.3 Questionnaire Results

After applying the questionnaire, the evaluators’ an-

swers were analyzed, and the scores of the 10 cus-

tomers were added together to obtain an overall value.

The Likert scale (Likert, 1932) was used for this as-

sessment. This assesses the suitability of the rec-

ommendations to the customer proﬁle. The P-HTCL

technique obtained the best rating with 46.75 points,

followed by HTCL with 46.50 and HT with 39.25.

Evaluator 1 preferred HTCL, while Evaluator 2 high-

lighted P-HTCL, the latter being considered the best

overall.

Table 8: Summarization of the points obtained in the ques-

tionnaire for the evaluator’s preference concerning the rec-

ommendation generated.

Reviewer 1 Reviewer 2 Total

HT 17,00 22,25 39,25

HTCL 23,75 22,75 46,50

P-HTCL 22,50 24,25 46,75

Table 9: Summarization of the points obtained in the ques-

tionnaire for the evaluator’s preference for the technique

used.

Reviewer 1 Reviewer 2 Total

HT 1,0 4,0 5,0

HTCL 7,0 3,5 10,5

P-HTCL 7,0 7,5 14,5

Table 9 shows the evaluators’ general perception

of the techniques, as shown in Figure 10. The eval-

uator ranks the methods, giving 1.0 points for ﬁrst

place, 0.5 for second, and none for third. P-HTCL

led the way with 14.5 points, followed by HTCL with

10.5 and HT with 5.0 points, indicating the evalua-

tors’ preference.

7 CONCLUSIONS AND FUTURE

WORK

In this paper, we found that the combination of dif-

ferent techniques proved to be effective in addressing

the recommendation problem in the long tail, as there

was an improvement in recommendations in various

aspects, not limited to just one or two evaluation met-

rics. The results showed that the techniques used

exhibit better relevance indices as recommendations

become more diverse and less popular, thus direct-

ing users to greater diversity and relevance of prod-

ucts. Therefore, the proposed approaches better guide

users to niche items in the long tail. The proposed ap-

proaches are hybrid strategies that exploit clustering

techniques and representation matrices.

The promising results presented possibilities for

retail companies to increase their business proﬁts.

Since the proﬁt from the sale of long-tail items tends

to be higher than short-head items, focusing part of

the sales on these products will bring greater ﬁnancial

returns. The lower competition for these products is

evidence of this situation. Additionally, customers of

niche products are usually more loyal and are more

willing to pay a higher price to acquire them, increas-

ing the proﬁt margin for these companies.

7.1 Future Work

In the experiments with the MovieLens dataset, only

one variable, the score, was used. In addition to the

score, other variables can be considered in the simi-

larity calculation, such as movie category, producer,

cast, or even user clustering using proﬁle data such

as age and gender, among others. These variables are

already present in the MovieLens dataset and can be

the subject of new experiments.

Other techniques can be used with a base al-

gorithm to improve recommendations to exploit a

dataset’s long tail further. Our approach used Hit-

ting Time as the base algorithm, but other algorithms

can be experimented with along with various cluster-

ing techniques. Another possibility is to use differ-

ent methods, such as the probabilistic CF algorithm

(IRM2), multimodal similarity, and multi-objective

evolutionary algorithm (MORS), to name a few, in

conjunction with clustering.

ACKNOWLEDGEMENTS

The authors would like to thank FAPESB and

CAPES for their ﬁnancial support. Grant Term:

PPF0001/2021. Technical Cooperation Agreement

45/2021 and CAPES – Grant number 001. This ma-

terial is partially based upon work supported by the

FAPESB INCITE PIE0002/2022 grant.

REFERENCES

Abdelkhalek, R., Boukhris, I., and Elouedi, Z. (2022). To-

wards more trustworthy predictions: A hybrid eviden-

tial movie recommender system. Journal of Universal

Computer Science.

Abdollahpouri, H., Mansoury, M., Burke, R., and

Mobasher, B. (2019). The unfairness of popularity

bias in recommendation. https://arxiv.org/abs/1907.

13286.

Utilization of Clustering Techniques and Markov Chains for Long-Tail Item Recommendation Systems

Anderson, C. (2006). The Long Tail: Why the Future of

Business Is Selling Less of More. Hyperion.

Bobadilla, J., Ortega, F., Hernando, A., and Guti

errez,

A. (2013). Recommender systems sur-

vey. Knowledge-Based Systems, 46:109–132.

https://www.sciencedirect.com/science/article/abs/

pii/S0950705113001044.

Bolch, G., Greiner, S., de Meer, H., and Trivedi, K. S.

(1998). Queueing Networks and Markov Chains:

Modeling and Performance Evaluation with Com-

puter Science Applications. Wiley-Interscience, New

York, NY, USA.

Brynjolfsson, E. (2011). Goodbye pareto principle, hello

long tail: The effect of search costs on the concentra-

tion of product sales. Management Science, 57(8).

Burke, R. (2002). Sistemas de recomendac¸

ao h

ıbridos: Lev-

antamento e experimentos. User Modeling and User-

Adapted Interaction, 12(4):331–370.

de Sousa Silva, D. V., de Oliveira, A. C., Almeida,

F., and Dur

ao, F. A. (2020). Explorando similar-

idades em grafos com agrupamento para melhorar

recomendac¸

oes de itens de cauda longa. In de Salles

Soares Neto, C., editor, WebMedia ’20: Simp

osio

Brasileiro sobre Multim

ıdia e a Web, pages 193–200,

ao Lu

ıs, Brasil. ACM.

Gomez-Uribe, C. A. and Hunt, N. (2015). The netﬂix rec-

ommender system: Algorithms, business value, and

innovation. ACM Transactions on Management Infor-

mation Systems, 6(4):13:1–13:19.

Harper, F. M. and Konstan, J. A. (2015). The movielens

datasets: History and context. ACM Trans. Interact.

Intell. Syst., 5(4):19:1–19:19.

Likert, R. (1932). A technique for the measurement of atti-

tudes. Archives of Psychology.

Lin, J., Chen, S., and Wang, J. (2022). Graph neural net-

works with dynamic and static representations for so-

cial recommendation. In et al., A. B., editor, Database

Systems for Advanced Applications, pages 264–271,

Cham. Springer International Publishing.

Luke, A., Johnson, J., and Ng, Y.-K. (2018). Recommend-

ing long-tail items using extended tripartite graphs. In

Proceedings of the 2018 IEEE International Confer-

ence on Big Knowledge (ICBK), pages 123–130, Sin-

gapore.

Pang, J., Guo, J., and Zhang, W. (2019). Using multi-

objective optimization to solve the long tail problem

in recommender system. In Yang, Q., Zhou, Z.-H.,

Gong, Z., Zhang, M.-L., and Huang, S.-J., editors,

Advances in Knowledge Discovery and Data Mining,

pages 302–313, Cham. Springer International Pub-

lishing.

Qin, J. (2021). A survey of long-tail item recommendation

methods. Wireless Communications and Mobile Com-

puting. [Online]. Available: https://doi.org/10.1155/

2021/7536316.

Qin, J., Zhang, Q., and Wang, B. (2020). Recommenda-

tion method with focus on long tail items. Journal of

Computer Applications, 40(2):454–458.

Snyder, K. (2024). 35 e-commerce statistics of

2024. https://www.forbes.com/advisor/business/

ecommerce-statistics/#sources section. Accessed:

2024-06-06.

Sreepada, R. S. and Patra, B. K. (2021). Enhancing long

tail item recommendation in collaborative ﬁltering:

An econophysics-inspired approach. Electronic Com-

merce Research and Applications, 49:101089.

Wang, F., Ma, S., Yang, L., and Li, T. (2006). Recommen-

dation on item graphs. In Proceedings of the Sixth In-

ternational Conference on Data Mining (ICDM’06),

pages 1119–1123.

Yadav, N., Pal, S., Singh, A. K., and Singh, K. (2022). Clus-

dr: Cluster-based pre-trained model for diverse rec-

ommendation generation. Journal of King Saud Uni-

versity - Computer and Information Sciences, 34(8,

Part B):6385–6399.

Yamashita, K., McIntosh, S., Kamei, Y., Hassan, A. E.,

and Ubayashi, N. (2015). Revisiting the applicabil-

ity of the pareto principle to core development teams

in open source software projects. In Proceedings of

the 14th International Workshop on Principles of Soft-

ware Evolution (IWPSE 2015), pages 46–55, Berg-

amo, Italy. ACM.

Yang, L., Schnabel, T., Bennett, P. N., and Dumais, S.

(2021). Local factor models for large-scale induc-

tive recommendation. In Fifteenth ACM Conference

on Recommender Systems (RecSys ’21), pages 252–

262, New York, NY, USA. Association for Computing

Machinery.

Yin, H., Cui, B., Li, J., Yao, J., and Chen, C. (2012). Chal-

lenging the long tail recommendation. In Proc. VLDB

Endow., volume 5, pages 896–907.

WEBIST 2024 - 20th International Conference on Web Information Systems and Technologies