Uzbek Texts Sentiment Analysis: Database Development
Saboxat Allanazarova and Dilrabo Elova
Tashkent State University of Uzbek Language and Literature, Tashkent, Uzbekistan
Keywords: Sentiment Analysis, Social Network, Reviews, Customer’s Opinion, Emotional Coloring.
Abstract: In the digital era, research centered on exploring customer sentiments towards service quality, brands,
products, and events predominates. Social networks are pivotal platforms for this investigation, leveraging
online reviews to gauge consumer opinions across diverse domains. Among the myriad approaches to
sentiment analysis, this article focuses on the lexicon-based method, employing a dictionary of emotive words
tailored to the Uzbek language. This method involves parsing text to identify words with emotional
connotations, enabling a nuanced understanding of customer sentiment. By harnessing lexicon-based
sentiment analysis, researchers can dissect and interpret consumer perceptions, providing invaluable insights
for businesses and organisations seeking to enhance customer satisfaction and refine their offerings.
1 INTRODUCTION
As a result of the rapid advancements in science and
technology, the influx of information via social media
platforms has soared exponentially. One of the
primary challenges confronting users is the scarcity
of time. Given the sheer volume of available data, it
becomes impractical to digest every piece of
information. Consequently, there's an escalating
demand for methods to sift through this deluge of
data.
This surge in interest has fuelled exploration in
fields such as Natural Language Processing (NLP),
Machine Learning, Data Science, and Artificial
Intelligence. In the 21st century, marked by
technological prowess and intellectual evolution, a
novel approach to learning has emerged (Liu 2012,
Medhat et al 2014).
The Uzbek language, considered one of the
world's developed languages, suffers from
underutilization within the realm of information
technology. Regrettably, our language's inherent
potential remains largely untapped. Particularly in
areas like Sentiment Analysis (commonly known as
opinion mining), the process of data collection and
processing proves to be excessively time-consuming
for Uzbek. This highlights the pressing need for
further development and integration of Uzbek
language capabilities into modern technological
systems. Efforts to enhance the efficiency and
effectiveness of language processing in Uzbek are
vital to harnessing its full potential in the digital age
(Divyapushpalakshmi et al 2021, Ahmedova et al
2021).
2 RESEARCH METHODOLOGY
The research methodology for sentiment analysis
outlined above integrates both machine learning and
lexicon-based approaches.
The lexicon-based method relies on semantic analysis
of lexical units to gauge sentiment polarity within text,
considering factors such as the speaker's attitude,
emotional state, and the context of speech.
By assessing the semantic orientation of words and
sentences, this approach quantifies subjectivity and
opinion. Emotionally charged language is pivotal in
sentiment analysis as it conveys the speaker's feelings
and attitudes towards a given subject.
The methodology recognises the significance of
informal comments prevalent in social media
platforms like Twitter and Facebook, as well as
review sites such as the Google Play Store, where
users express their opinions about products and
services.
However, the methodology acknowledges the
noise inherent in such opinionated data sources and
emphasizes the need for robust techniques to distil
meaningful insights.
744
Allanazarova, S. and Elova, D.
Uzbek Texts Sentiment Analysis: Database Development.
DOI: 10.5220/0012915100003882
Paper published under CC license (CC BY-NC-ND 4.0)
In Proceedings of the 2nd Pamir Transboundary Conference for Sustainable Societies (PAMIR-2 2023), pages 744-747
ISBN: 978-989-758-723-8
Proceedings Copyright © 2024 by SCITEPRESS Science and Technology Publications, Lda.
Figure 1: Sentiment Analysis Essentials: From Data
Collection to Processing.
Emphasis is placed on the importance of
extracting valuable information from opinionated
sources despite the inherent noise.
The methodology underscores the relevance of
sentiments expressed by users in social media
comments and reviews, as they offer authentic
reactions, personal experiences, and recognitions.
These sentiments play a pivotal role in influencing
decisions of potential consumers when evaluating
businesses or services.
Therefore, the methodology highlights the
necessity of employing advanced sentiment analysis
techniques to sift through noisy data sources
effectively. By leveraging both machine learning and
lexicon-based approaches, the methodology aims to
provide accurate sentiment classification, thereby
facilitating informed decision-making processes
based on user-generated content.
3 RESULTS AND DISCUSSION
Our goal is to build the first labeled dataset for
sentiment analysis in Uzbek language texts, obtained
from the "Annotated dictionary of the Uzbek
language." This dictionary is a five-volume annotated
dictionary of the Uzbek language and one of the most
comprehensive Uzbek annotated dictionaries
published to date. The dictionary is based on a two-
volume annotated dictionary published in 1981.
Furthermore, we have built a larger manually dataset
of emotional coloring words and idioms. This
annotated dataset contains three categories: positive,
negative, and neutral. As in all languages, the Uzbek
language also has methods and tools that create
expressiveness and emotionality, and ways of using it.
Phonetic, lexical-phraseological, morphological, and
syntactic methods of expressive-emotional
expression in the Uzbek language are given. The
lexicon of the language has great potential for
conveying information with the subtlest meaning and
stylistic coloring. The stylistic features of units in the
vocabulary of the language are studied in the course.
Among them, especially synonyms and antonyms are
tools rich in stylistic possibilities. Words with
stylistic color are especially expressed in the
synonymous line. A synonym string can consist of
only two words or several words.
Figure 2: Synonymous Synsets in Uzbek: A Multi-line
Example.
From this synonymous group, Yuz is the main,
dominant word, which expresses a neutral assessment
and does not choose speech styles. The words Aft,
Turq, Bashara, Bet are negative words. Words such
as Chehra, Diydor, Oraz, Siymo have a positive
connotation. Each word in this synonymous line is
used selectively in different speech situations. Most
of the words expressing an opinion are adjectives. For
example: He is knowledgeable. His conversations are
interesting. The adjectives "knowledgeable" and
"interesting" in this sentence are positive words, and
the nouns traffic, sad, and unemployment express a
negative meaning. Phrases show the strong influence
of events, symbols on the human mind, the full
expression of the result of this strong influence in
speech, in general, expresses the influence of thought
at a high level.
The phenomenon of gradation, which is present in
expressions, allows expressing expressiveness to the
extent necessary. For example, if the phrase "Sochi
tikka bo‘lmoq" is considered the lowest level in terms
of expressiveness, it reinforces as follows using
certain phonetic or grammatical means and creates a
unique gradation.
Figure 3: Graduonymic Hierarchy: Phrases in Uzbek
Language.
Uzbek Texts Sentiment Analysis: Database Development
745
Lexical synonyms are distinguished by the
gradation of the meaning of the term, stylistic
characteristics, sometimes having a negative-positive
color. In the following years, graduonyms were
studied in depth by the scientists of our republic.
Graduonymy is a ranking of synonyms based on a
certain difference, based on a chain connection.
Figure 4: Lexical Poles: Synonyms and Antonyms in
Perspective.
Although sentiment analysis is a very active area
of research today, a number of complex problems
remain in this field. First, the problem of sarcasm;
Sarcasm is a complex form of speech units in which
the opposite of what the commenter or writer
intended is said or written. In sentiment analysis,
sarcasm and ironic sentences are very difficult to
analyze because the idea is not clearly and directly
expressed.
Figure 5: Irony in Uzbek: Examples of Sarcastic
Expression.
Even though positive words are included in such
sentences, the negative meaning is understood
through the tone of speech. Sarcasm and ironic words
are used with specific, different intonation in oral
speech, but in writing, they are often used with
quotation marks[9]. The computer does not
understand the opposite meaning in this sentence. As
a result, it evaluates the opinion as positive. Often,
words that express a positive assessment contain a
negative meaning. For example: Today, the
geography teacher fell and broke his leg. After this
"good news," the whole class was empty.
Polysemantisms are neutral with one meaning
(nominative meanings) and serve to express an
expressive-emotional thought with another meaning
(figuratively meaning):
1. Hayvon (animal) – objective word – neutral
meaning;
2. Hayvon (beast) – abusive word – negative
meaning.
Analysis factors and mathematical models are
necessary to create a system for semantic analysis of
polysemantic words [10]. Polysemantic and
homonymous words are close words. That is why the
issue of studying the phenomena of homonymy and
polysemy has not lost its relevance until now.
According to world experience, polysemantic words
are semantically analyzed based on statistical
calculations and probability theory on a large amount
of contextual data [11]. Researcher Sh.
Gulyamova[12] expressed her opinion on the
semantic analysis of polysemous words in the Uzbek
language. It is recommended to use Markov models
in the semantic analysis of polysemantic words in the
Uzbek language.
In linguistics, expressions such as extremely
negative attitude, discrimination, disdain, and insult
are very clearly visible in insulting words called
vulgarisms[13]. Such words are expressed in speech
not according to their nominative meaning but
according to their connotative meaning. Insulting
words are used in the speech of heroes in literary
works and in comments on social networks[14].
Vulgarisms also differ in terms of gender; that is, they
are used differently in the speech of men and women.
1. Vulgarisms used for women: g’ar, megajin,
manjalaqi… «Otasidan hovliyu mashinadan
boshqa yana nimalar qoldi ekan?» – deb
atrofingda hid olib yurgan bitta shu megajin
emasdir!
2. Vulgarisms used for men: takasaltang,
landovur, nomard… – Qo‘lga tushding-ku,
qani endi xo‘jayinga javob berib ko‘r, ha
nomard! As we have seen in the above
examples, the accuracy of sentiment
analysis is increased by including insults in
the linguistic database.
There are sentences that are often used in the
Uzbek people's culture of communication, in which
the persons who have entered into a relationship wish
each other well[15]. Such sentences are said when
praying, congratulating on a relationship, farewell,
greeting, condolence (in native language: duo, olqish,
qarg‘ish) [16]. Such sentences have stabilized
semantically and formally in the Uzbek language,
most of them have become speech etiquette, and all
of them have a positive meaning.
PAMIR-2 2023 - The Second Pamir Transboundary Conference for Sustainable Societies- | PAMIR
746
Table 1: Uzbek Language Dynamics: Communication
Styles.
Subjectivity Example: Classification:
Olqish Aylanay, o‘rgulay, jonim
tasadduq bo‘lsin!
positive
Qarg‘ish Yonimga hech kelmasin,
b
alo
g
a uchrasin!
negative
Duo “Qo‘l-ko‘zing dard
ko‘rmasin iloyim”, - dedi
xola mehr bilan.
positive
Uzbek, a Turkic language, stands as the primary
official and sole designated national language of
Uzbekistan. This language, spoken by Uzbeks,
follows a null-subject, agglutinative structure and
boasts a multitude of dialects, each varying
significantly across regions, thus presenting complex
challenges. While languages within the Turkic
family, such as Turkish and Kazakh, have made
notable strides in sentiment analysis, determining the
number of emotional words in Uzbek proves to be a
formidable task. This endeavor demands robust
linguistic expertise and extensive vocabulary
research. One initial hurdle lies in the absence of a
linguistic database tailored for Uzbek, hindering
sentiment analysis efforts.
4 CONCLUSION
In conclusion, the development of a linguistic
database for sentiment analysis in Uzbek texts
necessitates meticulous consideration of the
language's unique features. While the chosen
approach promises insight into emotional expression,
challenges arise regarding the time-consuming task of
compiling synonyms and antonyms. Moreover, the
prevalence of informal lexicon in platforms like
Twitter underscores the need to accommodate
informal language within the database.
Addressing these challenges, the decision to
incorporate informal lexicon, including insults and
emoticons, proves crucial. This expansion not only
enables the classification of formal texts but also
empowers the database to analyze informal
communication effectively. By bridging the gap
between formal and informal language usage, the
database enhances its utility in understanding
sentiment across various linguistic contexts.
REFERENCES
Liu, B. (2012). Sentiment analysis and opinion mining.
Synthesis lectures on human language technologies,
5(1), 1-167.
Medhat, W., Hassan, A., & Korashy, H. (2014). Sentiment
analysis algorithms and applications: A survey. Ain
Shams Engineering Journal, 5(4), 1093-1113.
Abdullayev A. (1976). The expressiveness characteristics
of phraseological units. Uzbek language and literature,
5, 36-39.
Makhmudova N. (2021). Functional-semantic field of
graduation category in the English and Uzbek
languages. Philology Matters, 2021(4), Article 4.
Ahmedova X. (2021). Mathematical models that
distinguish homonymy in the framework of a word
series. Electronic journal of actual problems of modern
science, Education and training, October, 2021-10/1.
ISSN 2181-9750.
Divyapushpalakshmi M, Ramalakshmi R. (2021). An
efficient sentimental analysis using hybrid deep
learning and optimization technique for Twitter using
parts of speech (POS) tagging. International Journal of
Speech Technology, 24(2), 329–339.
Uzbek Texts Sentiment Analysis: Database Development
747