CriX: Intersection of Crime, Demographics and Explainable AI

Muhammad Ashar Reza, Aaditya Bisaria, Advaitha S., Alekhya Ponnekanti and Arti Arya

Department of Computer Science Engineering, PES University, Bengaluru, India

Keywords:

Crime, Demographic Factors, Large Language Model, RAG, Zero-Shot Learning, Explainable AI.

Abstract:

Crime prediction and analysis often rely on crime statistics but neglect the potential influence of demographic factors. Each locality possesses unique characteristics, indicating that a 'one-size-fits-all' methodology is inadequate. This research presents a framework, CriX, that incorporates demographic factors to help understand and address localised crime. At the root level, identifying and predicting crime hotspots is essential for providing context in training the language model; therefore, ST-DBSCAN and LSTM models are used, respectively, on a custom-made dataset. InLegalBERT (Paul et al., 2023), which is pre-trained on Indian legal data, generates embeddings for the large corpus of crime hotspot, demographic and legal data. These embeddings are stored in a FAISS vector store, allowing for dynamic retrieval using RAG techniques. The generated embeddings are then fed into MistralAI, which offers a textual solution. These outputs are further refined using zero-shot learning, improving model performance. The proposed framework achieved a validation accuracy of over 82% for crime hotspot predictions. The LLM also achieved substantial scores for Compactness, Fidelity and Completeness, giving an average score of 4.18 out of 5 and outperforming baseline models. This approach enhances the interpretability of legal models by incorporating the concepts of Explainable AI (XAI).

1 INTRODUCTION

Crime analysis and prediction have become pivotal in

improving public safety and law enforcement strate-

gies, particularly in regions with complex demo-

graphic dynamics like Karnataka, India. Traditional

crime prediction models often focus on spatial and

temporal crime patterns but tend to overlook the un-

derlying factors that inﬂuence criminal behaviour.

Demographic factors such as Gross District Domestic Product (GDDP), Net District Domestic Product (NDDP), Per Capita Income (PCI) and Human Development Index (HDI) provide essential insights into the societal conditions that might trigger a surge in criminal activity.

This research aims to develop a framework for identifying and predicting crime hotspots in Karnataka by integrating crime data with demographic factors. The study utilises two primary datasets. First,

the FIR data from 2020 to 2022, covering the 31

districts of Karnataka and 1060 police stations, was

scraped from the ofﬁcial Karnataka Police website

(Karnataka State Police, ). In addition, various de-

mographic indicators (Government of Karnataka, )

for these districts were collected, providing a com-

prehensive proﬁle of the region. The various districts

of Karnataka can be seen in Figure 1.

To identify current crime hotspots, the re-

search employs the ST-DBSCAN clustering algo-

rithm, which allows for clustering crimes based on

both spatial and temporal features. These clusters are

passed to a Long Short-Term Memory (LSTM) network to predict future hotspots where criminal activity is likely

to intensify within a speciﬁed time frame. The LSTM

model is particularly well suited for this task, as it

captures temporal dependencies and can predict se-

quences of data over time.

Therefore, a cascaded forecasting approach was adopted by integrating the output of the ST-DBSCAN clustering algorithm with the LSTM model. By combining the strengths of spatial-temporal clustering and sequence prediction, this approach enables a seamless transition from identifying current crime hotspots to forecasting future ones with improved accuracy.

Reza, M. A., Bisaria, A., Advaitha, S., Ponnekanti, A. and Arya, A.
CriX: Intersection of Crime, Demographics and Explainable AI.
DOI: 10.5220/0013316200003890
Paper published under CC license (CC BY-NC-ND 4.0)
In Proceedings of the 17th International Conference on Agents and Artificial Intelligence (ICAART 2025) - Volume 2, pages 714-725
ISBN: 978-989-758-737-5; ISSN: 2184-433X

The identified and predicted crime hotspots are processed through InLegalBERT to generate embeddings, which are then stored in a FAISS vector store for faster retrieval. InLegalBERT, a BERT-based model pre-trained on Indian legal texts (Paul et al., 2023), serves as the foundation for generating these embeddings. Once a user queries the LLM by supplying a crime location, IPC section and district name, the input values are converted into embeddings using InLegalBERT for further processing.

The retrieval-augmented LLM (raLLM) retriever then compares the query embeddings with the stored embeddings in the vector store, retrieving the most relevant chunks based on cosine similarity. These retrieved chunks are passed to Mixtral-8x22B-Instruct-v0.1, a customised version of the Mistral AI LLM, which transforms the embeddings into comprehensible natural language outputs.

By using InLegalBERT's embeddings along with crime and demographic data, this research adds a significant interpretative layer, contextualising crime hotspots with meaningful demographic insights. In addition, the witness and victim descriptions in the FIRs are provided in Kannada, the local language of Karnataka. The LLM integrates a Kannada-to-English translation API to enhance the accessibility and comprehension of these FIRs. The output generated by Mistral AI is enhanced using zero-shot learning, which operates without requiring examples for understanding or improving the model's performance.

Thus Explainable AI (XAI) plays a pivotal role

in enhancing the interpretability and transparency of

crime prediction models. This is achieved through its

four key concepts: justify, control, discover, and im-

prove. These elements provide a structured frame-

work for making the model’s insights accessible and

actionable.

The motivation for this work stems from the need

to tackle crime at its root by identifying and analysing

demographic factors that contribute to criminal activ-

ity. By prioritising these aspects, the study aims to

improve societal conditions in Karnataka, India, re-

ducing crime rates and promoting safer, more secure

communities.

This paper explores the integration of demographic factors into crime prediction models and emphasises crime explainability using a large language model (LLM). The Related Work section reviews existing methodologies and identifies research gaps in the field. The Methodology section outlines the data collection process, including FIR scraping, and details the application of ST-DBSCAN and LSTM models for identifying and predicting crime hotspots. It also delves into the role of InLegalBERT and the FAISS vector store in embedding generation and storage.

Figure 1: The 31 Districts of Karnataka, India.

The retriever mechanism is explained, demon-

strating how it retrieves relevant chunks based on co-

sine similarity, which are then passed to MistralAI for

generating comprehensible outputs. This output is en-

hanced using zero-shot learning to improve its rele-

vance and quality.

The Results section presents key ﬁndings and vi-

sualisations of crime patterns across Karnataka. The

Discussion interprets these results, highlighting their

connection to demographic factors and offering in-

sights into potential strategies for crime prevention

and reduction.

2 RELATED WORK

2.1 Model Analysis

Understanding criminal patterns across different areas is extremely important for crime prediction. (Wheeler and Steenbeek, 2021) used Random Forests to predict long-term crime patterns in Dallas, Texas. Random Forest was chosen for its strength in selecting feature variables and capturing non-linear correlations in crime occurrences. The model performed better than previous methods, including Risk Terrain Modelling and Kernel Density Estimation. They also established that the impacts of predictors on crime rates were in fact non-linear and spatial.

(Mandalapu et al., 2023) conducted an extensive

review of various papers and found that traditional

clustering algorithms, such as K-Means, often fell

short in addressing the dynamic and noisy nature of

real-world crime data, particularly when temporal di-

mensions were involved. The maximum accuracy

these models could achieve was about 80%. Crime

incidents are inﬂuenced by both spatial proximity and

time intervals and thus deep learning models like

CNN or LSTM that simultaneously handle these di-

mensions performed signiﬁcantly better. They found

the accuracy of such models reaching almost 95% de-

pending on the quality of the datasets.

(Marchant et al., 2018) noted that a Bayesian framework improved criminal data analysis by using a probabilistic model to capture the dependencies between crime rates and socio-environmental factors, while also incorporating the uncertainty associated with predictions. It covered parametric and non-parametric approaches, making it possible to model spatial dependencies adequately when forecasting crime rates. The authors investigated property crimes, including theft, assault and drug-related offences, and established that crime rates depend critically on demographic traits and environmental features such as population density. In conclusion, the Bayesian approach was beneficial for comprehensive and diverse crime analysis.

(Birant and Kut, 2007) proposed the ST-DBSCAN

algorithm which had the ability of discovering clus-

ters according to non-spatial, spatial and temporal val-

ues of the objects and was particularly effective for

processing very large datasets. They introduced the

novel concept of density factor which enabled the

algorithm to handle noisy data even when clusters

of different densities were present. It had a much

faster runtime (factor between 1.5 and 3 times) than

other clustering algorithms such as CLARANS (Ng

and Han, 1994) and DBCLASD (Xu et al., 1998) and

the factor only increased with the size of the datasets

used. Thus ST-DBSCAN became a strong candidate

for clustering using spatial–temporal data.

(Ramirez-Alcocer et al., 2019) demonstrated that

the use of Long Short-Term Memory (LSTM) net-

works delivered strong results for predicting future

crime hotspots as it was adept at handling sequential

data. The study showed the feasibility of employing

LSTM models trained on extensive datasets of his-

torical crime records. Their deep learning approach

achieved a high performance in the ﬁnal model with

a validation accuracy of 87.84% and an average loss of 0.0376.

(Rai et al., 2022) demonstrated an effective ap-

proach by utilising LSTM in tandem with BERT, a

language model, to extract deeper contextual and lin-

guistic insights. The authors developed a model that

automatically classiﬁed news articles as either fake or

real based on their titles. This combination not only

enhanced the predictive accuracy to 88.75%, but also

enabled a more nuanced understanding of the textual

elements in the datasets.

Crime prediction incorporating legal language models has become more popular recently, with several studies investigating it. (Paul et al., 2023) proposed InLegalBERT, inspired by SciBERT (Beltagy et al., 2019), which was pre-trained on scientific publications. InLegalBERT is a legal-domain BERT model pre-trained on Indian legal documents. This study showed that the proposed model could understand legal terms and their context for tasks relevant to the Indian legal system, such as the categorisation of crimes as per the Indian Penal Code (IPC). The authors also noted that pre-training on domain-specific texts substantially improved fine-tuning results in legal NLP tasks.

(Bogomolov et al., 2014) examined the correlation

between crime and demographic characteristics using

aggregated human behavioural data captured from the

mobile network infrastructure in combination with

basic demographic information. They achieved an ac-

curacy of almost 70% when predicting hotspots for

real crime data in London. This suggested that demographic factors have the potential to help predict urban crime effectively.

(Fan et al., 2024) highlights the signiﬁcance of

RAG in enhancing the capabilities of generative AI

by supplying reliable and up-to-date external knowl-

edge, which is particularly beneﬁcial in the context of

AI-Generated Content (AIGC). The paper emphasises

the potential of raLLMs to mitigate common issues

faced by traditional LLMs, such as hallucinations and

outdated internal knowledge, by leveraging retrieval

mechanisms.

(VM et al., 2024) divide the process of fine-tuning into several stages. First, training data in the target domain is gathered; the text is then broken into chunks and tokenised with a suitable tokenizer so that it can be converted into embeddings. Training follows a next-token prediction strategy and optimises the model weights on a task-oriented dataset. The authors highlighted that although the fine-tuning process helped to improve the model, it raised a number of issues, including the availability and quality of the data, costs and ethical concerns, all of which are critical and should be discussed in detail.


Figure 2: Proposed Architecture Diagram for CriX (Crime Explainer).

2.2 Research Gaps

Together, these works offer a snapshot of the emerging state of the art in crime prediction, which combines legal language models with other data. However, there remain important gaps in the evidence, as stated below:

• The absence of readily available data for the Indian state of Karnataka led us to develop our own dataset.

• Most approaches focused on identifying crime

patterns and hotspots without addressing the root

causes of criminal behaviour, often treating the

symptoms rather than the underlying factors.

• Each district possesses unique characteristics in-

dicating that a one-size-ﬁts-all methodology is in-

adequate. Therefore, a tailored approach is essen-

tial for effective crime prevention and interven-

tion.

3 METHODOLOGY

This work leverages InLegalBERT (Paul et al., 2023), a BERT-based model trained on Indian legal texts, to generate embeddings, which the Mistral model uses together with district-level demographic data from Karnataka. The primary aim is to explain why specific crimes occur in different districts by integrating legal and demographic factors. This section explains

the architecture of the model, the data collection and

preprocessing steps, the features used, and the ﬁne-

tuning process with demographic data. The proposed

architecture is visualised in Figure 2.

3.1 Data Collection and Preprocessing

The study utilizes two key datasets in the form of

JSON ﬁles: one capturing crime data over the years

2020-2022 for all 31 districts of Karnataka and an-

other that includes various demographic indicators

across these years. The scraping of the crime dataset,

detailing incidents across 1,060 police stations, from

the Karnataka Police website was automated using

Selenium, facilitating large-scale data collection. The

extracted data is formatted as JSON ﬁles, captur-

ing essential attributes such as crime location, IPC

sections, police station proximity, and time details.

Meanwhile, the demographic dataset encompasses

yearly indicators such as GDDP, NDDP, per capita in-

come, literacy rate, health index, and HDI, providing

a comprehensive proﬁle for each district. Together,

these datasets enable an integrated analysis of how

demographic factors correlate with crime across Kar-

nataka.

The translation process utilises the 'Translator' class from the Gemini library, configured to translate from Kannada ('kn') to English ('en'). The system initialises the converter with source and target languages. FIR content recorded in Kannada is then passed to the converter, which outputs the English translation, preserving the original meaning. This method facilitates consistent and interpretable crime description data for further analysis, aligning local linguistic data with broader law enforcement frameworks.

The success of crime prediction models depends heavily on the quality of both legal and demographic data. The economic and development data was sourced from data.opencity.in to reflect district-specific characteristics that might influence crime.

3.1.1 Feature Engineering

To effectively incorporate the demographic data into

the model, we transformed these demographic indica-

tors into numeric features. Some key steps in feature

engineering included:

3.1.2 Normalization

All continuous variables, such as GDDP, NDDP, and

Per Capita Income, were normalized to ensure they

were on the same scale, preventing one feature from

dominating others during model training.

3.1.3 Encoding Categorical Features

IPC sections were encoded as categorical variables to

make them interpretable by the model. Each IPC sec-

tion corresponds to a unique integer representation,

enabling the model to differentiate between types of

crimes.

3.1.4 Handling Missing Data

For districts where some demographic data was miss-

ing or unavailable, we used interpolation techniques

and, in some cases, district averages to ﬁll in the gaps.

This pre-processing allowed us to create a rich feature

set that paired each crime type (based on its IPC sec-

tion) with the district’s demographic proﬁle.
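The three preprocessing steps above can be illustrated with a minimal sketch. It assumes min-max scaling for normalisation and linear interpolation along the yearly series for missing values; the paper does not state which variants were used, and the function names and toy values below are hypothetical.

```python
def interpolate_missing(values):
    """Linearly interpolate None gaps; copy the nearest known value at the edges."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("no known values to interpolate from")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            w = (i - left) / (right - left)
            filled[i] = values[left] + w * (values[right] - values[left])
    return filled


def min_max_normalise(values):
    """Scale values to [0, 1] (assumed normalisation scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def encode_categories(labels):
    """Map each distinct label (e.g. an IPC section) to a unique integer."""
    mapping = {}
    for label in labels:
        mapping.setdefault(label, len(mapping))
    return [mapping[label] for label in labels], mapping


# Hypothetical toy data: one district's per capita income over three years.
pci_by_year = [100.0, None, 140.0]
pci_filled = interpolate_missing(pci_by_year)
pci_scaled = min_max_normalise(pci_filled)

# Hypothetical IPC section labels attached to three FIRs.
ipc_codes, ipc_mapping = encode_categories(["379", "420", "379"])
```

Integer encoding keeps the sketch close to the description in Section 3.1.3; a one-hot encoding would be an alternative if the model should not infer an ordering between IPC sections.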

3.2 Spatio-Temporal Density-Based Spatial Clustering of Applications with Noise (ST-DBSCAN)

3.2.1 Structure

• Parameter Setting:

– best_spatial_threshold: Determines the spatial distance for clustering,

– best_temporal_threshold: Specifies the temporal distance,

– best_min_samples: Sets the minimum number of samples required to form a cluster.

Let density_distance_max of a point p denote the maximum distance between the point and its neighbour objects within the neighbourhood radius Eps centred on p. Similarly, let density_distance_min of point p denote the minimum distance between the point and its neighbour objects within the same radius.

The density_factor of a cluster C captures the degree of density of the cluster. If C is a "loose" cluster, density_distance_min increases, so the density_distance (Equation 4) is quite small, forcing the density_factor of C to be close to 1. Otherwise, if C is a "tight" cluster, density_distance_min decreases, so the density_distance is quite big, forcing the density_factor of C to be close to 0. Following (Birant and Kut, 2007), the density_factor is therefore the reciprocal of the mean density_distance over the cluster:

$$\mathrm{density\_factor}(C) = \frac{1}{\left[\sum_{p \in C} \mathrm{density\_distance}(p)\right] / |C|} \tag{1}$$

$$\mathrm{density\_distance\_min}(p) = \min\{\mathrm{dist}(p, q) \mid q \in D \wedge \mathrm{dist}(p, q) \le Eps\} \tag{2}$$

$$\mathrm{density\_distance\_max}(p) = \max\{\mathrm{dist}(p, q) \mid q \in D \wedge \mathrm{dist}(p, q) \le Eps\} \tag{3}$$

• Model Initialisation and Fitting: An instance

of ST_DBSCAN is created with the best parameters. The model is then fitted to the scaled features

extracted from the crime data, including latitude,

longitude, crime occurrence time, and various de-

mographic factors.

• Label Extraction and Data Preparation: The

clustering labels generated by the model are re-

trieved and added to the original DataFrame, al-

lowing for identiﬁcation of clusters within the

data.

• Data Visualisation: A 3D scatter plot is created

to visualize the clustered crime data over time,

with latitude on the x-axis, longitude on the y-

axis, and the normalized time on the z-axis as can

be seen in Figure 3. The points are coloured based

on their cluster membership, and a colour bar is

included for reference.


Table 1: Features Key Utilised in Table 2 for ST-DBSCAN Clustering.

Key   Feature
A     Crime Latitude
B     Crime Longitude
C     Crime Time Number
D     Encoded Crime Count
E     NDDP Current
F     GDDP Current
G     Per Capita Current
H     Average Crime Time

Table 2: ST-DBSCAN Clustering Results for Mysuru District, Karnataka.

Feature Set (Keys)        Number of Features   Number of Clusters   Number of Outliers   Cluster IDs              Points in Each Cluster
A, B, C, D                4                    1                    1                    0                        3127
A, B, C, D, E             5                    2                    4                    0, 1                     2501, 623
A, B, C, D, F             5                    2                    3                    0, 1                     2067, 1058
A, B, C, D, G             5                    2                    3                    0, 1                     2663, 462
A, B, C, D, H             5                    2                    17                   0, 1                     3108, 3
A, B, C, D, E, F          6                    3                    5                    0, 1, 2                  2067, 433, 623
A, B, C, D, E, G          6                    4                    5                    0, 1, 2, 3               1605, 433, 623, 462
A, B, C, D, F, G          6                    3                    3                    0, 1, 2                  1605, 1058, 462
A, B, C, D, E, F, G, H    8                    8                    35                   0, 1, 2, 3, 4, 5, 6, 7   1756, 710, 417, 166, 28, 6, 4, 3

The density_distance of an object p is defined as

$$\mathrm{density\_distance}(p) = \frac{\mathrm{density\_distance\_max}(p)}{\mathrm{density\_distance\_min}(p)} \tag{4}$$

• Cluster and Outlier Reporting: Finally, the

code calculates and prints the number of clusters

and outliers (noise points) detected by the algo-

rithm.
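The density measures used in the list above can be made concrete with a short sketch. It assumes the reciprocal-of-mean form of the density factor from (Birant and Kut, 2007), which matches the behaviour described in the text (loose clusters close to 1, tight clusters close to 0), and it assumes that a point is not counted as its own neighbour, since a zero distance would make the ratio undefined. The function names and sample points are hypothetical.

```python
import math


def density_distance(p, data, eps):
    """Ratio of the farthest to the nearest neighbour distance of p within eps.

    Zero distances (p itself or exact duplicates) are skipped, which is an
    assumption made here to keep the ratio well defined.
    """
    dists = [math.dist(p, q) for q in data]
    dists = [d for d in dists if 0.0 < d <= eps]
    if not dists:
        return None  # p has no neighbours inside the radius
    return max(dists) / min(dists)


def density_factor(cluster, data, eps):
    """Reciprocal of the mean density_distance over the points of a cluster."""
    ratios = [density_distance(p, data, eps) for p in cluster]
    ratios = [r for r in ratios if r is not None]
    if not ratios:
        return None
    mean_ratio = sum(ratios) / len(ratios)
    return 1.0 / mean_ratio


# Three evenly spaced points: the end points see distances 1 and 2 (ratio 2),
# the middle point sees distances 1 and 1 (ratio 1).
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
factor = density_factor(points, points, eps=2.5)
```

Here the mean density distance is (2 + 1 + 2) / 3, so the density factor is 3 / 5 = 0.6.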

3.2.2 Input and Output

• Input: The model takes as input a DataFrame

containing scaled features related to crime occur-

rences, including geographical coordinates, tem-

poral data, and demographic factors as speciﬁed

in Table 1.

• Output: The output consists of clustered labels assigned to each data point and a 3D visualisation of the crime hotspots over time. Table 2 illustrates the resulting reduction in noise: hotspots are identified more clearly rather than being diffused by the inclusion of additional factors.

Figure 3: 3D Visualisation of Crime Hotspots.

3.2.3 Model Architecture

The clustering performed by ST-DBSCAN acts as a foundational analysis tool that segments the data into meaningful clusters based on the provided parameters. The tuning parameters for the ST-DBSCAN model include the spatial threshold (eps1), temporal threshold (eps2) and minimum samples required (min_samples) for forming clusters. These parameters can be adjusted based on previous grid search results to optimise clustering performance. Thus, it effectively identifies and visualises crime hotspots and helps facilitate a deeper understanding of crime patterns in Karnataka.
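To show how eps1, eps2 and min_samples interact, the following is a simplified, self-contained sketch of spatio-temporal density clustering. It is not the implementation used in this work: it treats two points as neighbours only when they are within eps1 in space and within eps2 in time, and it omits the additional check on non-spatial values that the full ST-DBSCAN algorithm performs. The function name and toy points are hypothetical.

```python
import math
from collections import deque

NOISE = -1


def st_dbscan(points, eps1, eps2, min_samples):
    """Cluster (x, y, t) points; returns one label per point, -1 for noise."""

    def neighbours(i):
        xi, yi, ti = points[i]
        return [
            j
            for j, (x, y, t) in enumerate(points)
            if math.hypot(x - xi, y - yi) <= eps1 and abs(t - ti) <= eps2
        ]

    labels = [None] * len(points)  # None means "not visited yet"
    next_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)  # includes i itself
        if len(seeds) < min_samples:
            labels[i] = NOISE
            continue
        labels[i] = next_id
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = next_id  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = next_id
            reachable = neighbours(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # j is a core point, keep expanding
        next_id += 1
    return labels


# Two compact groups plus one point that is close in space but far in time.
toy_points = [
    (0.0, 0.0, 0), (0.1, 0.0, 1), (0.0, 0.1, 2),
    (5.0, 5.0, 0), (5.1, 5.0, 1), (5.0, 5.1, 1),
    (0.0, 0.0, 50),
]
labels = st_dbscan(toy_points, eps1=0.5, eps2=3, min_samples=3)
n_clusters = len({label for label in labels if label != NOISE})
n_outliers = labels.count(NOISE)
```

The last point shares coordinates with the first group but is separated by 50 time units, so it is reported as noise, which is the behaviour the temporal threshold is meant to provide.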

3.3 Long Short-Term Memory

3.3.1 Structure

The LSTM layer is initialised with parameters and

processes sequential data. It also considers a dropout

parameter which is used to prevent overﬁtting. A fully

connected layer follows the LSTM layer, which maps

the LSTM’s output to the desired number of output

classes, being the hotspot clusters in the research.

3.3.2 Input and Output

• Input: The model expects input in the shape of (batch_size, sequence_length, input_dim), where sequence_length is the number of time steps in each input sequence, and input_dim is the number of features (demographic factors and other relevant indicators).

• Output: The output of the model is a tensor representing the predicted class scores (logits) for each cluster, with the shape (batch_size, output_dim). The model applies softmax activation implicitly in the loss function during training to interpret these outputs as probabilities for multi-class classification. The predicted hotspots for the state of Karnataka and the district of Mysuru are plotted in Figure 4 and Figure 5 respectively.

Figure 4: Predicted Crime Hotspots in the State of Kar-

nataka.

Figure 5: Predicted Crime Hotspots in the District of My-

suru, Karnataka.

3.3.3 Model Architecture

• LSTM Layer: This core layer is responsible for

learning sequential patterns, conﬁgured with the

speciﬁed number of input features, hidden units,

and layers.

• Fully Connected Layer: A linear transforma-

tion that reduces the hidden state output from the

LSTM to the desired number of clusters.

3.3.4 Parameter Optimisation

The model includes several ﬁne-tuning parameters:

• input_dim: Number of features in the input data, which is dynamically determined based on the shape of the training dataset X_train.

• hidden_dim: Set to 128, which specifies the size of the hidden state in the LSTM, allowing the model to capture complex patterns in the data.

• output_dim: Determined by the number of unique clusters in the dataset, ensuring the model outputs a prediction for each cluster.

• num_layers: Defaulted to 2, which indicates the model will stack two LSTM layers for deeper learning.

• dropout: Set to 0.3, providing regularization to mitigate overfitting during training.

• Optimizer and Loss Function: The Adam optimizer is utilised with a learning rate of 0.001, and the loss function is defined as CrossEntropyLoss, which is suitable for multi-class classification tasks and is configured to ignore the index (-1) assigned to outlier points.
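The configuration listed above could be expressed in PyTorch roughly as follows. This is a minimal sketch under assumptions: the class name HotspotLSTM, the use of the last time step as the sequence summary, and the random toy tensors are illustrative and are not taken from the paper; only the hyperparameter values come from the list above.

```python
import torch
import torch.nn as nn


class HotspotLSTM(nn.Module):
    """LSTM followed by a fully connected layer that scores each cluster."""

    def __init__(self, input_dim, output_dim, hidden_dim=128, num_layers=2, dropout=0.3):
        super().__init__()
        self.lstm = nn.LSTM(
            input_dim,
            hidden_dim,
            num_layers=num_layers,
            dropout=dropout,
            batch_first=True,  # input is (batch_size, sequence_length, input_dim)
        )
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        out, _ = self.lstm(x)          # (batch_size, sequence_length, hidden_dim)
        return self.fc(out[:, -1, :])  # logits: (batch_size, output_dim)


torch.manual_seed(0)
# Hypothetical toy batch: 16 sequences, 10 time steps, 8 features, 4 clusters.
X_train = torch.randn(16, 10, 8)
y_train = torch.randint(0, 4, (16,))
y_train[0] = -1  # an outlier label, skipped by the loss below

model = HotspotLSTM(input_dim=X_train.shape[2], output_dim=4)
criterion = nn.CrossEntropyLoss(ignore_index=-1)  # applies log-softmax internally
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# One training step.
logits = model(X_train)
loss = criterion(logits, y_train)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Because CrossEntropyLoss applies log-softmax itself, the model returns raw logits; calling softmax on them is only needed when probabilities are reported at inference time.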


Once the legal and demographic data was pre-

processed, we ﬁne-tuned InLegalBERT (Paul et al.,

2023) using district-level demographic data. The goal

of ﬁne-tuning was to enable the model to associate

certain IPC sections with speciﬁc demographic con-

ditions in different districts.

3.4 CriX - raLLM Enhanced InLegalBERT with MistralAI

3.4.1 Structure

CriX (Crime Explainer) is a framework modelled to

provide an easily comprehensible text-based analysis

for crime occurrence based on demographic parame-

ters of the location. For example, the crime tenden-

cies in a speciﬁc area may be correlated with income

difference or low literacy levels, thus providing prac-

tical insight for policymakers to implement potential

socio-economic interventions.

In the proposed framework, embeddings are generated using InLegalBERT (Paul et al., 2023), a model pre-trained on Indian legal texts, to create context-aware representations of the input data. These embeddings are stored and indexed in a FAISS vector store, enabling efficient retrieval using cosine similarity. A retrieval augmentation framework processes queries and retrieves the relevant set of embeddings, which are then passed to the Mixtral-8x22B-Instruct-v0.1 generative model. This model, with its zero-shot learning capabilities, converts embeddings into coherent and contextually accurate natural language output.

3.4.2 Input and Output

• Input: The input consists of:

– Predicted crime hotspots: Output clusters from

the LSTM model representing high-risk areas

of crime based on spatial-temporal and demo-

graphic factors.

– Demographic data: Socio-economic indica-

tors such as Gross District Domestic Prod-

uct (GDDP), Net District Domestic Product

(NDDP), per capita income, literacy rate, health

index, etc., speciﬁc to each region within Kar-

nataka.

• Output: For every predicted location of crime activity, the model provides an easily understandable text-based analysis. For instance, it might suggest that crime tendencies in a specific area may be correlated with income difference or low literacy levels. In addition, the human-understandable output offers practical insight into the potential link between crime occurrence and demographic parameters, which could help policymakers select potential socio-economic interventions.

3.4.3 Model Architecture

InLegalBERT is a BERT-based transformer model pre-trained on Indian legal texts, making it particularly suited for generating embeddings related to crime data within the context of Indian law. Crime and demographic information is embedded using InLegalBERT and stored in a FAISS vector store. This knowledge base stores grouped crime and demographic embeddings for contextual relevance. At query time, the top three most relevant chunks are retrieved from this store using cosine similarity.
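The top-three retrieval step can be sketched without any external library. The example below ranks stored chunks by cosine similarity to a query vector; the three-dimensional vectors and chunk names are hypothetical stand-ins for InLegalBERT embeddings. In FAISS, the same ranking is commonly obtained with an inner-product index over L2-normalised vectors, although the index type used in this work is not stated.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_vec, store, k=3):
    """Return the k (score, chunk) pairs most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), chunk) for chunk, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical vector store: (chunk text, embedding) pairs.
store = [
    ("chunk A", [1.0, 0.0, 0.0]),
    ("chunk B", [0.9, 0.1, 0.0]),
    ("chunk C", [0.0, 1.0, 0.0]),
    ("chunk D", [0.5, 0.5, 0.0]),
]
top = retrieve_top_k([1.0, 0.0, 0.0], store, k=3)
top_chunks = [chunk for _, chunk in top]
```

With this query, chunk C is orthogonal to the query vector and is the one left out of the top three.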

The Mixtral-8x22B-Instruct-v0.1 LLM is used to perform the text generation and comprehension tasks. The retrieval augmentation framework searches the FAISS vector store and identifies the top three relevant chunks of embeddings. These retrieved embeddings are used by Mixtral as knowledge sources to perform zero-shot learning. Mixtral maps these embeddings to natural language sentences that are coherent and grounded without utilising any labelled examples.
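A zero-shot prompt of the kind described here contains instructions and retrieved context but no worked examples. The sketch below assumes that the retrieved chunks are handed to the LLM as text; the template wording, the function name and the placeholder values are hypothetical and are not the prompt used in this work.

```python
def build_zero_shot_prompt(district, ipc_section, location, chunks):
    """Assemble an instruction-only prompt from the query fields and retrieved chunks."""
    context = "\n".join(f"- {chunk}" for chunk in chunks)
    return (
        "You are assisting with crime analysis for Karnataka, India.\n"
        "Context retrieved from the knowledge base:\n"
        f"{context}\n\n"
        "Question: Explain which demographic factors may be associated with "
        f"offences under IPC Section {ipc_section} near {location}, "
        f"{district} district.\n"
        "Answer using only the context above."
    )


# Placeholder inputs; real chunks would come from the retriever.
prompt = build_zero_shot_prompt(
    district="Mysuru",
    ipc_section="379",
    location="<crime location>",
    chunks=["<retrieved chunk 1>", "<retrieved chunk 2>", "<retrieved chunk 3>"],
)
```

Restricting the answer to the supplied context is one common way to keep the generated explanation grounded in the retrieved data.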

The retrieval augmentation framework ensures that data unseen by the LLM are dynamically fed into the model, allowing the responses to be accurate and context-bound, thus minimising the hallucinations that are typical of LLMs. The LLM cache stores recent responses that can be readily accessed for similar queries, thus improving efficiency and reducing redundant computations.
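A response cache of this kind can be sketched as a small least-recently-used store. The paper does not describe how query similarity is judged, so this example makes a simplifying assumption: two queries are treated as the same when they match after lower-casing and whitespace normalisation. The class name and the stand-in fake_llm function are hypothetical.

```python
from collections import OrderedDict


class ResponseCache:
    """Keep the most recent responses, evicting the least recently used one."""

    def __init__(self, max_entries=128):
        self.max_entries = max_entries
        self._entries = OrderedDict()

    @staticmethod
    def _key(query):
        # Assumed notion of "similar": same text ignoring case and spacing.
        return " ".join(query.lower().split())

    def get_or_generate(self, query, generate):
        key = self._key(query)
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as recently used
            return self._entries[key]
        response = generate(query)
        self._entries[key] = response
        if len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # drop the oldest entry
        return response


calls = []


def fake_llm(query):
    """Stand-in for the LLM call; records how often it is invoked."""
    calls.append(query)
    return f"analysis for: {query}"


cache = ResponseCache(max_entries=2)
first = cache.get_or_generate("Theft in Mysuru", fake_llm)
second = cache.get_or_generate("  theft in  mysuru ", fake_llm)
```

The second call is served from the cache, so the stand-in LLM is invoked only once.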

4 RESULTS AND DISCUSSION

4.1 Hotspot Cluster Identiﬁcation

The ST-DBSCAN algorithm clusters crime hotspots

in Karnataka where areas of denser crime are de-

picted. This clustering also takes into account the var-

ious demographic factors for these districts. Multiple

clusters are identiﬁed in this process. Each cluster

type provides insights about spatial and temporal pat-

terns of crime, assisting in identifying areas that may

beneﬁt from increased surveillance or targeted inter-

ventions. The distribution of crimes across the clus-

ters identiﬁed is plotted in Figure 6.

Some regions exhibit unique crime patterns due to


factors such as ﬂuctuations in crime unrelated to the

general state trends, or speciﬁc socio-economic con-

ditions like high population densities and varying in-

come levels. Moderate to large hotspots tend to have

signiﬁcant socio-economic differences, where crime

is more prevalent due to economic activities, popu-

lation standards and social status. Districts under-

going urbanisation or experiencing changes in eco-

nomic conditions or migration rates may see an in-

crease in crime as well. Smaller, localised hotspots

may emerge in areas with unique demographic pro-

ﬁles, such as high-income but low-population-density

areas, where the crime dynamics differ from those of

more densely populated, lower-income regions.

Training of the LSTM model showed significant improvement in evaluation metrics over the scheduled 50 epochs. The training loss fell steadily to 0.3475 by the last epoch, while the validation loss declined gently, reaching its minimum of 0.4928 at epoch 33. The validation accuracy reached its highest value of 0.8211 at epoch 35, as illustrated in Figure 7, demonstrating the model's progressive ability to use demographic factors to forecast crime intensity. For training control, early stopping was applied when validation performance stagnated, which halted training at epoch 42 to guard against overfitting.
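The early stopping rule can be sketched as a patience counter on a validation metric. The patience value and the monitored metric used in this work are not reported, so both are assumptions here, and the score sequence is invented purely for illustration.

```python
def early_stopping_epoch(val_scores, patience):
    """Return (stop_epoch, best_epoch, best_score) for a higher-is-better metric.

    Training stops at the first epoch that is `patience` epochs past the best
    one without improvement; epochs are numbered from 1.
    """
    best_score = float("-inf")
    best_epoch = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch, best_score
    return len(val_scores), best_epoch, best_score


# Hypothetical validation accuracies: the best value occurs at epoch 3.
scores = [0.60, 0.70, 0.75, 0.74, 0.73, 0.74]
stopped_at, best_epoch, best_score = early_stopping_epoch(scores, patience=3)
```

In practice the weights from the best epoch are usually restored after stopping, so the reported model corresponds to the best validation score rather than the final epoch.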

The effectiveness of this approach was evaluated by comparing the results with existing models implemented in the study by (Zhuang et al., 2017), which included STNN-LSTM (81%), multilayer perceptron (76.75%), random forest (76.25%) and decision tree (76%). The proposed cascaded forecasting model, consisting of ST-DBSCAN and LSTM, achieved a higher accuracy than all the models used in that study. These results suggest that our model is well-suited for accurately capturing spatial and temporal dependencies in crime hotspot prediction.

Figure 6: Identiﬁed hotspot cluster IDs visualised in in-

creasing order of density of crimes.

Figure 7: Results of Cascaded Forecasting Model: ST-

DBSCAN Infused LSTM.

4.2 Large Language Model Performance Comparison

CriX's performance is compared with other BERT variants used in combination with the Mixtral LLM. CriX yielded much more reliable results when comprehending complex legal language and context, because it is trained on demographic and hotspot data and therefore accounts for the underlying demographic factors influencing crime rates. The models implemented are a BERT-based LLM (average score of 2.27), InLegalBERT (average score of 2.67) and CriX (the highest average score, 4.18). These scores highlight the distinctions in model performance in terms of Compactness, Fidelity and Completeness. Each metric is scored out of 5.

• Compactness: CriX demonstrates superior compactness, with a score of 4.03, surpassing the BERT-based LLM (3.73) and InLegalBERT (2.55), by producing concise and targeted explanations that highlight only the most relevant demographic and crime-specific factors. This is particularly advantageous over baseline models, which tend to generate less focused outputs. The use of zero-shot learning enables it to avoid redundant or overly general information, resulting in outputs that are easier for stakeholders to interpret and apply effectively.

• Fidelity: The model also performs well in terms of accuracy, offering explanations that resemble the structure of the original data. CriX achieves a fidelity score of 3.51, whereas the BERT-based LLM scores 2.45 and InLegalBERT 2.47. As a result, the demographic features of the crime are preserved with minimal distortion, and the outputs correspond closely to the real factors behind the crime. CriX also performs better than more general descriptive models, which may offer a wider but less specific picture, lending credence to the importance of CriX for policy and intervention.

• Completeness: CriX provides a more extensive list of inputs, taking into account the socio-economic and educational background that may lead to crime. CriX attains a completeness score of 5, whereas the BERT-based LLM scores only 0.62 and InLegalBERT also scores 5. In contrast, general LLMs fail to capture such broader demographic characteristics.

Table 3: Performance Metrics of Few Language Models.

| Large Language Model | Compactness | Fidelity | Completeness | Average Score |
|---|---|---|---|---|
| BERT + MistralAI | 3.73 | 2.45 | 0.62 | 2.27 |
| InLegalBERT + MistralAI | 2.55 | 2.47 | 5 | 2.67 |
| CriX (with zero shot learning) | 4.03 | 3.51 | 5 | 4.18 |

Each model provided textual explanations in response to the prompt: “Give details how crime committed with IPC section IPC 1860 Section 378 happened in Mysuru on 02-02-2020.” The metrics calculated on the explanation given by each model are summarised in Table 3 for comparison. The table indicates that, across the three criteria of Compactness, Fidelity and Completeness, CriX provides what users may require with the greatest precision and depth. This makes it the most suitable of the three models for crime pattern analysis, as it shows how the model utilises demographic factors.
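The average scores in Table 3 can be checked with a short calculation. The sketch below assumes the Average Score is the unweighted mean of Compactness, Fidelity and Completeness, which the paper does not state explicitly. Under that assumption the BERT and CriX rows reproduce the listed averages, while the InLegalBERT row yields about 3.34 rather than the listed 2.67, so that row may rest on a different completeness value or a different averaging rule.

```python
# Scores as listed in Table 3: (Compactness, Fidelity, Completeness).
scores = {
    "BERT + MistralAI": (3.73, 2.45, 0.62),
    "InLegalBERT + MistralAI": (2.55, 2.47, 5.0),
    "CriX (with zero shot learning)": (4.03, 3.51, 5.0),
}

# Average scores as listed in Table 3.
reported = {
    "BERT + MistralAI": 2.27,
    "InLegalBERT + MistralAI": 2.67,
    "CriX (with zero shot learning)": 4.18,
}


def mean_score(values):
    """Unweighted mean of the metric scores, rounded to two decimals."""
    return round(sum(values) / len(values), 2)


for model, values in scores.items():
    computed = mean_score(values)
    print(f"{model}: computed {computed:.2f}, reported {reported[model]:.2f}")
```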

4.3 Explainable AI

Through the Explainable AI approach, the model produces easily comprehensible summaries. The demographic factors that most strongly affect the model are identified to give a better understanding of the areas with high crime rates. This interpretative approach would enable law enforcement agencies to grasp why some regions may be most vulnerable to crime given their demographic conditions. In this paper, Explainable AI (XAI) is integrated through four fundamental concepts, namely justification, control, discovery and improvement, to strengthen the predictive models for crime prediction.

• Justify: XAI is employed to ensure that crime

hotspot predictions are transparent and under-

standable. By using demographic factors such

as literacy rates, GDDP, and per capita income,

the model provides valid justification for its outputs. This helps stakeholders, such as law enforcement and policymakers, make data-driven decisions backed by clear insights into influential factors.

• Control: CriX ensures a high level of control over

the predictive process by enabling authorities to

manage and inﬂuence model interpretability and

decision-making. It allows parameters to be up-

dated dynamically, reﬂecting new policy initia-

tives or demographic shifts. This ensures that

the model’s explanations and predictive capabil-

ities remain relevant and adaptable to real-world

changes, empowering authorities to proactively

respond to emerging crime patterns.

• Discover: The discovery aspect of XAI uncov-

ers hidden correlations between demographic in-

dicators and crime occurrences. For example,

the analysis might reveal an unexpected link be-

tween declining HDI and increased crime rates

in speciﬁc districts. This insight-driven approach

helps expand the understanding of the multi-

faceted causes of crime and enables authorities

to design targeted social programs that address

these underlying issues, thus contributing to holis-

tic crime prevention.

• Improve: The iterative nature of XAI supports

continuous improvement in the model’s predictive

capabilities. By regularly reviewing which de-

mographic features most inﬂuence outcomes and

assessing the model’s interpretability, researchers

can reﬁne the model’s training and enhance its

feature set. The iterative zero-shot learning feed-

back loop not only improves prediction accu-

racy but also helps guide future data collection to

strengthen the model’s overall effectiveness and

maintain its robustness over time.
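To make the Discover step concrete, the following sketch measures the linear association between one demographic indicator and crime rates across districts. The paper does not specify how such correlations are surfaced, so the Pearson coefficient is used here purely as a simple stand-in, and the HDI and crime-rate values are invented for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-district values (not taken from the study's dataset).
hdi = [0.58, 0.61, 0.66, 0.70, 0.74]
crime_rate = [410, 395, 350, 300, 280]

r = pearson(hdi, crime_rate)
print(f"correlation between HDI and crime rate: {r:.3f}")
```

A negative coefficient in such a check would be consistent with the example above of declining HDI accompanying higher crime rates, although correlation alone does not establish a cause.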

5 CONCLUSIONS

This research successfully integrates spatial, tempo-

ral, and demographic factors to model and predict

crime hotspots in Karnataka, India, offering a novel

approach to crime analysis and prevention. By em-

ploying ST-DBSCAN for clustering and LSTM for

prediction, CriX demonstrates promising accuracy

in identifying crime-prone areas and time periods,

achieving a validation accuracy of over 82%. These

ﬁndings emphasise the impact of demographic factors

on criminal patterns, underscoring the potential for

targeted interventions. Furthermore, our LLM-based approach provides interpretability, explaining key demographic influences on crime and offering valuable information for developing mitigation strategies.

By incorporating Explainable AI (XAI), CriX sets

a foundation for future advancements in predictive

policing and social policy, promoting data-driven so-

lutions to improve community safety. It enhances the

practical utility of the model by clarifying the inﬂu-

ence of key demographic factors on crime. This trans-

parency empowers stakeholders to adopt evidence-

based policies and adapt crime prevention strategies

based on clearly interpretable results, making the

model’s ﬁndings actionable and trustworthy. Through

this multifaceted approach, the research aims to en-

hance crime prevention by identifying hotspots with

both spatial-temporal and demographic dimensions.

The proposed framework informs law enforcement

and policymakers about the conditions under which

crime is likely to occur, thus contributing to a more

holistic and effective approach to public safety in Kar-

nataka.

6 LIMITATIONS AND FUTURE DIRECTIONS

6.1 Limitations

• Data Availability and Quality: The model’s ac-

curacy relies on the availability and quality of de-

mographic, spatial, and crime data. Incomplete or

biased data can limit its predictive capabilities.

• Generalisability: While the model is optimized

for Karnataka, applying it to other regions may

require extensive recalibration due to unique de-

mographic and spatial characteristics.

6.2 Future Directions

• Integration of Additional Socioeconomic Indicators: Future models can incorporate more socioeconomic variables to deepen insights into the relationship between demographic factors and crime.

• Real-Time Crime Prediction: The model can be extended to handle real-time data streams, enabling live monitoring and dynamic hotspot predictions for proactive policing.

• Community-Centric Crime Prevention Strategies: Actionable recommendations based on model findings can be developed to inform community-level interventions and policy decisions aimed at reducing crime rates, which may also inform real estate and residential purchase decisions.

In future research, the LLM could be ﬁne-tuned with

more comprehensive data from other states or even at

the national level to create a broader crime prediction

model applicable across India. This could facilitate

comparative studies across different regions and of-

fer valuable insights into the varying factors that con-

tribute to crime in different cultural and demographic

contexts. The inferences made highlight the potential

of LLMs to serve as valuable tools in crime preven-

tion, legal analysis, and policymaking. As we con-

tinue to reﬁne and expand such models, their role in

advancing evidence-based solutions to societal chal-

lenges will become increasingly important.

Incorporating Graph RAG might improve retrieval by structuring crime and demographic data as a graph, capturing relationships between crime hotspots, demographic factors and temporal data. This approach could improve context relevance by leveraging graph-based embeddings, enabling retrieval of interconnected insights that traditional vector-based methods might miss.
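The graph-structured retrieval idea can be illustrated with a small traversal sketch. It covers only the step of collecting nodes connected to a query node within a fixed number of hops; the graph embeddings and the language model are omitted, and the node names and edges are hypothetical.

```python
from collections import deque


def related_nodes(graph, start, max_hops):
    """Breadth-first search returning {node: hop_distance} for every node
    reachable from `start` within `max_hops` edges (start excluded)."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distances[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(node, ()):
            if neighbour not in distances:
                distances[neighbour] = distances[node] + 1
                queue.append(neighbour)
    del distances[start]
    return distances


# Hypothetical graph linking hotspots, demographic factors and time periods.
graph = {
    "hotspot:Mysuru": ["factor:literacy_rate", "period:2020-02"],
    "factor:literacy_rate": ["hotspot:Mysuru", "hotspot:Bengaluru"],
    "period:2020-02": ["hotspot:Mysuru"],
    "hotspot:Bengaluru": ["factor:literacy_rate"],
}

print(related_nodes(graph, "hotspot:Mysuru", max_hops=2))
```

With a two-hop limit, a query on one hotspot also reaches a second hotspot that shares a demographic factor with it, which is the kind of interconnected context a flat nearest-neighbour lookup over embeddings may not surface.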

REFERENCES

Beltagy, I., Lo, K., and Cohan, A. (2019). SciBERT: A pretrained language model for scientific text. arXiv preprint arXiv:1903.10676.

Birant, D. and Kut, A. (2007). ST-DBSCAN: An algorithm for clustering spatial–temporal data. Data & Knowledge Engineering, 60(1):208–221.

Bogomolov, A., Lepri, B., Staiano, J., Oliver, N., Pianesi, F., and Pentland, A. (2014). Once upon a crime: towards crime prediction from demographics and mobile data. In Proceedings of the 16th International Conference on Multimodal Interaction, pages 427–434.

Fan, W., Ding, Y., Ning, L., Wang, S., Li, H., Yin, D., Chua, T.-S., and Li, Q. (2024). A survey on RAG meeting LLMs: Towards retrieval-augmented large language models. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 6491–6501.

Government of Karnataka. https://data.opencity.in/organization/government-of-karnataka.

Karnataka State Police. https://ksp.karnataka.gov.in/firsearch/en.

Mandalapu, V., Elluri, L., Vyas, P., and Roy, N. (2023). Crime prediction using machine learning and deep learning: A systematic review and future directions. IEEE Access, 11:60153–60170.

Marchant, R., Haan, S., Clancey, G., and Cripps, S. (2018). Applying machine learning to criminology: semi-parametric spatial-demographic Bayesian regression. Security Informatics, 7:1–19.


Ng, R. T. and Han, J. (1994). Efficient and effective clustering methods for spatial data mining. In Proceedings of VLDB, pages 144–155. Citeseer.

Paul, S., Mandal, A., Goyal, P., and Ghosh, S. (2023). Pre-trained language models for the legal domain: a case study on Indian law. In Proceedings of the Nineteenth International Conference on Artificial Intelligence and Law, pages 187–196.

Rai, N., Kumar, D., Kaushik, N., Raj, C., and Ali, A. (2022). Fake news classification using transformer based enhanced LSTM and BERT. International Journal of Cognitive Computing in Engineering, 3:98–105.

Ramirez-Alcocer, U. M., Tello-Leal, E., and Mata-Torres, J. A. (2019). Predicting incidents of crime through LSTM neural networks in smart city domain. In The Eighth International Conference on Smart Cities, Systems, Devices and Technologies, pages 32–37.

VM, K., Warrier, H., Gupta, Y., et al. (2024). Fine tuning LLM for enterprise: Practical guidelines and recommendations. arXiv preprint arXiv:2404.10779.

Wheeler, A. P. and Steenbeek, W. (2021). Mapping the risk terrain for crime using machine learning. Journal of Quantitative Criminology, 37:445–480.

Xu, X., Ester, M., Kriegel, H.-P., and Sander, J. (1998). A distribution-based clustering algorithm for mining in large spatial databases. In Proceedings 14th International Conference on Data Engineering, pages 324–331. IEEE.

Zhuang, Y., Almeida, M., Morabito, M., and Ding, W. (2017). Crime hot spot forecasting: A recurrent model with spatial and temporal information. In 2017 IEEE International Conference on Big Knowledge (ICBK), pages 143–150. IEEE.
