Application of Data Mining for the Selection Process of Prospective Students at ITTelkom Surabaya by Using the SPSS Modeler

Rokhmatul Insani, Muhammad Nasrullah, Anfazul Faridatul Azizah and Edriand Imens Raygrandi

Information System Department, Telkom Surabaya Institute of Technology, Surabaya, Indonesia

edriand.imens.19@student.is.ittelkom-sby.ac.id

Keywords:

Application of Data Mining, Selection Process, Prospective Students.

Abstract:

In a tertiary institution, student resignation occurs frequently and can stem from financial factors or from students' academic ability. Early detection applied to prospective students at admission can reduce this risk. A system is therefore needed to support the New Student Admissions (PMB) process by predicting whether a student will persist until graduation or withdraw during the current semester. Data mining techniques can produce such predictions, and among the available methods this study uses the CHAID algorithm. The data mining work follows the CRISP-DM (Cross-Industry Standard Process for Data Mining) method, and the data are processed with IBM SPSS Modeler 18.0.

1 INTRODUCTION

Advances in information technology are growing rapidly in all areas of life, including education. Telkom Surabaya Institute of Technology (ITTelkom Surabaya) is a private tertiary institution in Indonesia, and its students are the main asset on which the institution's sustainability depends. In reality, however, many students resign each semester, which affects the campus's finances. Resignation can stem from financial factors or from students' academic ability. To reduce the risk of students dropping out, early detection can be applied to prospective students at admission. The challenge is how to process the available data so that it yields the knowledge needed. One technique for processing large amounts of data is data mining (Insani et al., 2022).

Data mining is a technique for extracting patterns from data in order to gain insight from it (Han et al., 2012). Data mining techniques can be used to make predictions from past data, and classification is one technique suited to this kind of problem. Classification is widely used in predictions about new students. Khoirunnisa predicted which Al-Hidayah Vocational High School students would enter higher education using the classification method (Khoirunnisa et al., 2021). Nadiya Hijriana applied the C4.5 decision tree algorithm to the selection of prospective university-level scholarship recipients (Hijriana and Rasyidan, 2017; Utami, 2020; Atma and Setyanto, 2018). Saifudin used the classification method to select prospective students for new student admissions at Pamulang University (Saifudin, 2018). Sherlyn built a web-based prediction of potential student enrollment at Taman Siswa Teluk Betung Vocational School using the classification method (Putri, 2021).

The task of classification is to predict the value of an output variable (class) that has categorical or polynomial values [7]. Several methods are commonly used for classification, one of which is CHAID. Ardiyansyah found that CHAID was among the best-performing algorithms on the dataset used in that research (Ardiyansyah et al., 2018). CHAID stands for Chi-squared Automatic Interaction Detector. It estimates a single variable, known as the dependent variable, from several independent variables. CHAID is an iterative technique that tests the independent variables used in classification one by one and orders them by the statistical significance of their chi-square association with the dependent variable (Ardiyansyah et al., 2018).
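The chi-square ranking described above can be illustrated with a minimal Python sketch. It is not the SPSS Modeler implementation: it covers only the step of testing each independent variable against the dependent variable and choosing the most significant one, and it omits the category merging and Bonferroni adjustment that full CHAID performs. The function names and the toy records are hypothetical.

```python
import math
from collections import Counter


def chi_square_statistic(xs, ys):
    """Pearson chi-square statistic and degrees of freedom for two categorical lists."""
    n = len(xs)
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    observed = Counter(zip(xs, ys))
    stat = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            expected = r_total * c_total / n
            stat += (observed[(r, c)] - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


def chi_square_p_value(stat, dof):
    """Upper-tail chi-square probability via the incomplete gamma series."""
    if stat <= 0 or dof <= 0:
        return 1.0
    a, x = dof / 2.0, stat / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-15 * total:
        term *= x / (a + n)
        total += term
        n += 1
    lower = total * math.exp(-x + a * math.log(x) - math.lgamma(a))
    return max(0.0, 1.0 - lower)


def best_split(rows, predictors, target):
    """Return the predictor with the smallest p-value, together with that p-value."""
    ys = [row[target] for row in rows]
    p_values = {}
    for name in predictors:
        stat, dof = chi_square_statistic([row[name] for row in rows], ys)
        p_values[name] = chi_square_p_value(stat, dof)
    best = min(p_values, key=p_values.get)
    return best, p_values[best]


# Hypothetical toy records, not the study data.
rows = (
    [{"major": "IPA", "gender": "M", "status": "active"}] * 40
    + [{"major": "IPA", "gender": "F", "status": "active"}] * 40
    + [{"major": "IPS", "gender": "M", "status": "active"}] * 5
    + [{"major": "IPS", "gender": "F", "status": "active"}] * 5
    + [{"major": "IPS", "gender": "M", "status": "resigned"}] * 5
    + [{"major": "IPS", "gender": "F", "status": "resigned"}] * 5
)
print(best_split(rows, ["major", "gender"], "status"))
```

In this toy data the status is distributed identically across genders, so only the major carries a significant association and is chosen as the split variable.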

Insani, R., Nasrullah, M., Azizah, A. and Raygrandi, E.

Application of Data Mining for the Selection Process of Prospective Students at ITTelkom Surabaya by Using the SPSS Modeler.

DOI: 10.5220/0012448600003848

Paper published under CC license (CC BY-NC-ND 4.0)

In Proceedings of the 3rd International Conference on Advanced Information Scientiﬁc Development (ICAISD 2023), pages 289-294

ISBN: 978-989-758-678-1


In this study, a classification algorithm is applied to identify students who withdraw, based on data on prospective new students and on enrolled students at ITTelkom Surabaya. The data are taken from academic applications (Salal et al., 2019). They include personal, education, and family data as well as student status, namely active or resigned. The two data sets are then combined and used as input to the data mining process. The data mining process uses IBM SPSS Modeler 18.0, which has a visual interface that lets users apply statistical and data mining algorithms without programming (Insani and Soemitro, 2016). Data processing tools are chosen on the basis of several characteristics, such as user-friendliness, cost, maintenance, availability of skills, and the presence of help files (Belokurova and Piazza, 2018; Wang et al., 2019). It is hoped that this research can be used to predict whether prospective students will persist until graduation or resign in the current semester.

2 RESEARCH METHODS

Data mining work commonly follows CRISP-DM (Cross-Industry Standard Process for Data Mining). CRISP-DM was developed by a consortium of companies established in 1996 with the support of the European Commission (Abbott, 2014). It provides a standard process for data mining that can be applied as a general problem-solving strategy in business or research units (Fadillah, 2015). The stages of CRISP-DM are shown in Figure 1.

Figure 1: CRISP-DM.

2.1 Business Understanding

The first stage is to understand the goals and needs from a business point of view and to explain the benefits of data mining for data that fits the case study. The boundaries of the case are then defined so that it can be formulated as a data mining problem.

2.2 Data Understanding

In this second stage, data are collected and then examined to gain in-depth knowledge of the data and to identify data quality. The data can be understood by checking the Data Summary, which shows the distribution of the data and any oddities that must be resolved at a later stage. Common problems such as missing values, outliers, spikes, and high cardinality must be identified here before being resolved in the next stage. Data visualization can give a clearer picture than the Data Summary alone.
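The kind of check a data summary performs can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the audit that SPSS Modeler runs: the function name `audit_column`, the three-standard-deviation outlier rule, the cardinality threshold of 20, and the sample values are all hypothetical choices.

```python
from statistics import mean, stdev


def audit_column(values, max_categories=20, z_limit=3.0):
    """Summarize one column: missing count, distinct count, and simple quality flags."""
    present = [v for v in values if v is not None and v != ""]
    summary = {
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
    }
    is_numeric = bool(present) and all(isinstance(v, (int, float)) for v in present)
    if is_numeric:
        summary["min"], summary["max"] = min(present), max(present)
        center = mean(present)
        spread = stdev(present) if len(present) > 1 else 0.0
        # Flag values further than z_limit standard deviations from the mean.
        summary["outliers"] = sum(
            1 for v in present if spread > 0 and abs(v - center) > z_limit * spread
        )
    else:
        # Many distinct labels make a categorical field hard to model directly.
        summary["high_cardinality"] = summary["distinct"] > max_categories
    return summary


# Hypothetical grade values: one missing entry and one implausible value (800).
grades = [78, 80, 82, 79, 81, 80, 77, 83, 80, 79, 81, 800, None]
cities = ["Surabaya", "Malang", "", "Surabaya"]

print(audit_column(grades))
print(audit_column(cities))
```

A z-score rule is only one possible outlier definition. It needs a reasonably large sample to flag anything, and a quartile-based rule may suit skewed fields such as salary better.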

2.3 Data Preparation

This stage produces the dataset used as input to the modeling stage. Guided by the results of the data understanding stage, the data are cleaned so that the resulting model is of good quality. This stage includes attribute selection, data cleaning, and data transformation.

2.4 Modeling

This stage selects and applies various modeling techniques and adjusts their parameters to obtain optimal results. It covers the selection of the modeling technique or algorithm, model development, and assessment of the resulting model.

2.5 Evaluation

At this stage, the model has been built and its quality on the data is evaluated, along with whether it achieves the goals set at the outset. The results of the data mining model are also interpreted at this stage.

2.6 Deployment

At this stage, the knowledge obtained by applying the method is put to use and presented to the user.

3 RESULTS AND DISCUSSION

Based on the method used there are several stages in

data mining processing, namely:


3.1 Business Understanding

In this study, the data to be processed are student data held by the academic section. Students are the most important asset of a tertiary institution. Besides recruiting suitable prospective students, every tertiary institution faces the challenge of retaining existing students so that they do not withdraw. Withdrawal can have various causes, such as financial or academic problems faced by the student (Kristanto et al., 2020). A system is therefore needed to detect at-risk prospective students early in the PMB process so that admitted students can continue attending lectures until they graduate.

Data mining is an information extraction process that finds important patterns in large volumes of data and turns them into knowledge. With data mining, universities can see the characteristics of the prospective students they will admit and can make predictions that help avoid the risk of students withdrawing in the current semester.

Given this need, the purpose of data mining at ITTelkom Surabaya is to classify student data in order to determine the characteristics of students who withdraw in the current semester. These characteristics can then be used for early detection among prospective new students entering the institution. If a prospective student is found to share the characteristics of students who withdrew, the institution can provide targeted treatment for that student.

3.2 Data Understanding

The data understanding stage in this study was carried out with SPSS Modeler 18. The input data came from the academic data of ITTelkom Surabaya students and consist of 21 variables, listed in Table 1.

Table 1: Input Data Variables.

No.  Variable          Explanation
1    PRODI             Student Major
2    NIM               Student ID Number
3    NAMA              Student Name
4    ANGKATAN          Student Intake
5    TAHUN MASUK       Student Entry Year
6    PEND AYAH         Father's Last Education
7    KERJA AYAH        Father's Occupation
8    GAJI AYAH         Father's Salary
9    PEND IBU          Mother's Last Education
10   KERJA IBU         Mother's Occupation
11   GAJI IBU          Mother's Salary
12   UMUR              Age of Student at Admission
13   GENDER            Student Gender
14   ASAL KOTA         City of Origin of the Student
15   ASAL PROVINSI     Province of Origin of the Student
16   NILAI SMA         Total High School Grades
17   STATUS SMA        High School Status (Public or Private)
18   JURUSAN SMA       High School Major
19   TAHUN LULUS SMA   High School Graduation Year
20   PEBAYARAN PMB     Payment Status (Full/Credit)
21   STATUS MHS        Student Status (Active/Inactive)

The input data comprise 1831 records, of which 1749 are active students and 82 are inactive students. An example of the input data to be analyzed is shown in Figure 2. The data were examined with SPSS Modeler 18.0, and the process used in this study (Wendler and Gröttrup, 2016) is shown in Figure 3. The process uses 3 nodes: node 1 (Source Excel) loads the research data whose quality will be checked, node 2 (Filter) selects which variables will be audited, and node 3 (Data Audit) checks data quality. The results of checking the quality of the input data to be used in modeling are shown in Figure 4.

Figure 2: Example of Input Data.

Figure 3: Process of Understanding Data with SPSS Modeler.

Figure 4: Data Understanding Results.

The results above show that some values are still empty, some are extreme, and some records are incomplete. In the next stage, the input data are transformed and cleaned to form a dataset that is ready for modeling.

3.3 Data Preparation

This stage cleans the data and forms the dataset. The analysis carried out at the data understanding stage found that some data were still incomplete. Transformations are also carried out at this stage according to the data requirements of the modeling. The transformations were performed with SPSS Modeler 18, as shown in Figure 5.

Figure 5: Data preparation process with SPSS Modeler.

The data preparation process involves 17 nodes. Node 1 (Source Excel) loads the research data from which the dataset is formed. Node 2 (Derive) adds a new variable based on existing ones, here the sum of the father's and mother's salaries. Node 3 (Derive) calculates the difference between the year of college admission and the year of senior high school graduation. Node 4 (Binning) simplifies the high school (SMA) grades into a range of 1 to 10 so that the data are easier to read. Node 5 (Select) keeps only the records whose parental salary is greater than 0. Node 6 (Binning) simplifies the parental salary data. Nodes 7-10 (Filler) change variable values, here recoding the SMA major into 4 categories: IPA, IPS, Engineering, and Non-Engineering. Node 11 (Derive) categorizes the province of origin as inside or outside East Java. Node 12 categorizes the city of origin as inside or outside Surabaya. Node 13 categorizes age as under 20 or over 20 years. Node 14 recodes student status as 0 (withdrawn students) or 1 (active students). Node 15 (Filter) selects which variables are included in the dataset. Node 16 (Type) sets the type of each variable that will be input to the next stage. Node 17 (Export Excel) writes the dataset in Excel format for processing at a later stage. The dataset generated by the data preparation process is shown in Figure 6.

Figure 6: Results of the Data Preparation Process.

Based on the results above, the following variables will be processed using data mining: Study Program, Gender, High School Status, High School Major, High School College Difference, Total Parent Salary, Province Category, City Category, Age Category, and Student Status Flag.
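The derive, select, and binning steps of the preparation stream can be approximated in plain Python as follows. This is a rough sketch, not a reproduction of the SPSS Modeler nodes. The field names come from Table 1, but several details are assumptions made for illustration: the binning is equal-width (the paper does not state which binning method was used), the category labels "Jawa Timur", "Surabaya", and "Active" are guessed encodings, and a student aged exactly 20 is placed in the "not under 20" group.

```python
def equal_width_bin(value, low, high, bins=10):
    """Map a value in [low, high] to an integer bin from 1 to `bins`."""
    if high <= low:
        return 1
    index = int((value - low) / (high - low) * bins) + 1
    return min(max(index, 1), bins)


def prepare(records):
    """Turn raw student records into a simplified modeling dataset."""
    if not records:
        return []
    # Grade bounds use all records because binning precedes the Select step.
    grade_low = min(r["NILAI SMA"] for r in records)
    grade_high = max(r["NILAI SMA"] for r in records)
    # Derive total parental salary, then keep only records where it is above 0.
    with_salary = [(r, r["GAJI AYAH"] + r["GAJI IBU"]) for r in records]
    kept = [(r, s) for r, s in with_salary if s > 0]
    if not kept:
        return []
    salary_low = min(s for _, s in kept)
    salary_high = max(s for _, s in kept)
    dataset = []
    for r, salary in kept:
        dataset.append({
            "gap_years": r["TAHUN MASUK"] - r["TAHUN LULUS SMA"],
            "grade_bin": equal_width_bin(r["NILAI SMA"], grade_low, grade_high),
            "salary_bin": equal_width_bin(salary, salary_low, salary_high),
            "in_east_java": r["ASAL PROVINSI"] == "Jawa Timur",
            "in_surabaya": r["ASAL KOTA"] == "Surabaya",
            "age_under_20": r["UMUR"] < 20,
            "status_flag": 1 if r["STATUS MHS"] == "Active" else 0,
        })
    return dataset


# Hypothetical records, not the study data.
records = [
    {"GAJI AYAH": 5_000_000, "GAJI IBU": 3_000_000, "NILAI SMA": 70,
     "TAHUN MASUK": 2019, "TAHUN LULUS SMA": 2019, "ASAL PROVINSI": "Jawa Timur",
     "ASAL KOTA": "Surabaya", "UMUR": 18, "STATUS MHS": "Active"},
    {"GAJI AYAH": 0, "GAJI IBU": 0, "NILAI SMA": 60,
     "TAHUN MASUK": 2019, "TAHUN LULUS SMA": 2019, "ASAL PROVINSI": "Jawa Timur",
     "ASAL KOTA": "Malang", "UMUR": 19, "STATUS MHS": "Active"},
    {"GAJI AYAH": 12_000_000, "GAJI IBU": 8_000_000, "NILAI SMA": 90,
     "TAHUN MASUK": 2020, "TAHUN LULUS SMA": 2018, "ASAL PROVINSI": "Jawa Barat",
     "ASAL KOTA": "Bandung", "UMUR": 21, "STATUS MHS": "Inactive"},
]
data = prepare(records)
for row in data:
    print(row)
```

The second record is dropped by the salary filter, which mirrors the Select node described above.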

3.4 Modeling

This stage is the core of the data mining process, namely the application of an algorithm suited to the case study. It takes as input the dataset generated in the previous stage, which is then modeled with a data mining algorithm, as shown in Figure 7.

Figure 7: Process Modeling with SPSS Modeler.

This modeling process involves 3 nodes. Node 1 (Source Excel) loads the research dataset obtained from the previous stage. Node 2 (Type) sets which variables are inputs and which is the output for the next node. Node 3 (CHAID) is the algorithm used to classify the characteristics of prospective students who will register at the campus. A fourth node, shown in gold, holds the result of the algorithm in the form of a tree diagram.

3.5 Evaluation

The CHAID algorithm identifies several variables that are important for classification. The modeling process therefore uses only the 4 most important predictors, namely SMA major, SMA grades, SMA status, and parents' salary, as shown in Figure 8.

Figure 8: The Most Important Predictor of the Modeling Process.

Once the 4 most important predictor variables are known, the input variables for the data mining process are reset to these 4 and the model is rebuilt. The resulting tree diagram is shown in Figure 9.

Figure 9: Tree Diagram of CHAID Algorithm Results.

Based on the resulting tree diagram, the classification of the characteristics of ITTelkom Surabaya students is given in Table 2.

Table 2: Classification Rules Derived from the CHAID Tree.

Classification  Node      Characteristics
Rule 1          1         High school grades ≤ 2 (from 1-10 range)
Rule 2          2         High school grades between 2-3 (from 1-10 range)
Rule 3          3, 7      High school grades between 3-7 (from 1-10 range) and majoring in Science or Engineering
Rule 4          3, 8, 11  High school grades between 3-7 (from 1-10 range), majoring in Social Sciences or Non-Engineering, and Public High School status
Rule 5          3, 8, 12  High school grades between 3-7 (from 1-10 range), majoring in Social Sciences or Non-Engineering, and Private High School status
Rule 6          4         High school grades between 7-8 (from 1-10 range)
Rule 7          5         High school grades between 8-9 (from 1-10 range)
Rule 8          6, 9      High school grades > 9 (from 1-10 range) and parents' total salary ≤ 6 (from 1-10 range)
Rule 9          6, 10     High school grades > 9 (from 1-10 range) and parents' total salary > 6 (from 1-10 range)
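The rules in Table 2 can be written as a small lookup function, which is one way such a tree could be applied to a new applicant. The sketch below makes several assumptions that the paper does not state: each "between a-b" range is read as greater than a and at most b, the major labels follow the four categories from the data preparation stage (IPA for Science, IPS for Social Sciences), and the strings "Public" and "Private" are hypothetical encodings of high school status.

```python
def classify(grade_bin, major, school_status, salary_bin):
    """Return the Table 2 rule number (1-9) for one student's binned features."""
    if grade_bin <= 2:
        return 1
    if grade_bin <= 3:
        return 2
    if grade_bin <= 7:
        # Node 3 splits on major, then on high school status.
        if major in ("IPA", "Engineering"):
            return 3
        return 4 if school_status == "Public" else 5
    if grade_bin <= 8:
        return 6
    if grade_bin <= 9:
        return 7
    # Node 6 (grades above 9) splits on total parental salary.
    return 8 if salary_bin <= 6 else 9


# Hypothetical applicants.
print(classify(5, "IPS", "Public", 3))
print(classify(10, "IPA", "Private", 7))
```

A student with grade bin 5 from a Social Sciences major at a public high school falls under Rule 4, whatever the parental salary.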

Based on the classification in Table 2, the percentage of students who withdrew under each rule is given in Table 3. Table 3 shows that the highest withdrawal rate occurs in Rule 4, at 18.367%. Students in this group have high school grades between 3-7 (from 1-10 range), come from Social Sciences or Non-Engineering majors, and attended a public (state) high school.

Table 3: Percentage of Resigned and Active Students by Classification Rule.

Classification  Resign    Active
Rule 1          4.037%    95.963%
Rule 2          12.575%   87.425%
Rule 3          3.285%    96.715%
Rule 4          18.367%   81.633%
Rule 5          2.703%    97.297%
Rule 6          0%        100%
Rule 7          2.959%    97.041%
Rule 8          13.861%   86.139%
Rule 9          0%        100%
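To put the figures in Table 3 in context, they can be compared with the overall withdrawal rate implied by the input data (82 inactive students out of 1831 records). The short Python sketch below uses only numbers reported in this paper. The comparison itself is an illustrative addition, and it assumes the per-rule percentages and the overall counts refer to the same population, which may not hold exactly because records were filtered during data preparation.

```python
# Withdrawal percentage per rule, copied from Table 3.
resign_rate = {
    1: 4.037, 2: 12.575, 3: 3.285, 4: 18.367, 5: 2.703,
    6: 0.0, 7: 2.959, 8: 13.861, 9: 0.0,
}

# Overall withdrawal rate from the input data: 82 inactive of 1831 records.
baseline = 82 / 1831 * 100

# Rank the rules from highest to lowest withdrawal rate.
ranked = sorted(resign_rate.items(), key=lambda item: item[1], reverse=True)
above_baseline = [rule for rule, rate in ranked if rate > baseline]

print(f"Baseline withdrawal rate: {baseline:.3f}%")
for rule, rate in ranked:
    marker = "above baseline" if rate > baseline else ""
    print(f"Rule {rule}: {rate:.3f}% {marker}")
```

Under that assumption, Rules 4, 8, and 2 have withdrawal rates above the overall rate of roughly 4.5%, while the remaining rules sit at or below it.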

4 CONCLUSIONS

Based on the research conducted using the CHAID algorithm to classify prospective students entering ITTelkom Surabaya, it can be concluded that the highest withdrawal rate, 18.367%, occurs among students with high school grades between 3-7 (from 1-10 range) who come from Social Sciences or Non-Engineering majors and public (state) senior high schools. Meanwhile, parents' salary did not have much effect on the number of students who withdrew. It is hoped that the institution can pay more attention to students with these characteristics so that they persist until they graduate.

REFERENCES

Abbott, D. (2014). Applied predictive analytics: Principles and techniques for the professional data analyst. John Wiley & Sons.

Ardiyansyah, A., Rahayuningsih, P., and Maulana, R. (2018). Analisis perbandingan algoritma klasifikasi data mining untuk dataset blogger dengan Rapid Miner. Jurnal Khatulistiwa Informatika, 6.

Atma, Y. and Setyanto, A. (2018). Perbandingan algoritma C4.5 dan k-NN berbasis fitur seleksi forward selection dalam identifikasi mahasiswa berpotensi drop out. Metik Jurnal, 2:31–37.

Belokurova, G. and Piazza, C. (2018). Case study: Using SPSS Modeler and Statistica to predict student success at high-stakes nursing examinations (NCLEX).

Fadillah, A. (2015). Penerapan metode CRISP-DM untuk prediksi kelulusan studi mahasiswa menempuh mata kuliah (studi kasus Universitas XYZ). Jurnal Teknik Informatika dan Sistem Informasi, 1.

Han, J., Kamber, M., and Pei, J. (2012). Data mining: Concepts and techniques (3rd ed.). Morgan Kaufmann.

Hijriana, N. and Rasyidan, M. (2017). Penerapan metode decision tree algoritma C4.5 untuk seleksi calon penerima beasiswa tingkat universitas. Al-Ulum: Jurnal Sains dan Teknologi, 3:9–13.

Insani, R., Asfari, U., Armando, R., and Melyantara, I. (2022). Perancangan data warehouse untuk menunjang akreditasi program studi. Jurnal Teknologi Informasi dan Ilmu Komputer, 9:985–992.

Insani, R. and Soemitro, H. (2016). Data mining for marketing in telecommunication industry. In 2016 IEEE Region 10 Symposium (TENSYMP), pages 179–183. IEEE.

Khoirunnisa, K., Susanti, L., Rokhmah, I., and Stianingsih, L. (2021). Prediksi siswa SMK Al-Hidayah yang masuk perguruan tinggi dengan metode klasifikasi. Jurnal Informatika, 8:26–33.

Kristanto, T., Hadiansyah, W., and Nasrullah, M. (2020). Analysis of higher education performance measurement using academic scorecard and analytical hierarchy process. In 2020 Fifth International Conference on Informatics and Computing (ICIC), pages 1–6. IEEE.

Putri, S. (2021). Penerapan model naive bayes untuk memprediksi potensi pendaftaran siswa di SMK Taman Siswa Teluk Betung berbasis web. Journal of Engineering, Computer Science and Information Technology (JECSIT), 1.

Saifudin, A. (2018). Metode data mining untuk seleksi calon mahasiswa pada penerimaan mahasiswa baru di Universitas Pamulang. Jurnal Teknologi, 10:25–36.

Salal, Y., Abdullaev, S., and Kumar, M. (2019). Educational data mining: Student performance prediction in academic. International Journal of Engineering and Advanced Technology, 8:54–59.

Utami, S. (2020). Penerapan data mining algoritma decision tree berbasis PSO. In Seminar Nasional Teknologi Komputer & Sains (SAINTEKS), vol. 1, pages 677–681.

Wang, V., Sukamto, A., and Pratama, E. (2019). Sistem pendukung keputusan seleksi mahasiswa penerima beasiswa BBP-PPA dengan metode TOPSIS pada Fakultas Teknik UNTAN. Jurnal Sistem dan Teknologi Informasi (JUSTIN), 7:105.

Wendler, T. and Gröttrup, S. (2016). Data mining with SPSS Modeler: Theory, exercises and solutions. Springer.
