Predicting the Efficiency with Knowledge Discovery of a Budgeted

Company: A Cuban University

Validation through Three Semesters

Libia I. García

1,2

, Isel Grau

and Ricardo Grau

Centro de Estudios de Informática, Universidad Central “Marta Abreu” de Las Villas (UCLV), Santa Clara, Cuba

Dirección de Informatización y Comunicaciones, UCLV, Carretera a Camajuaní km 5½, Santa Clara, Villa Clara, Cuba

Keywords: Knowledge Discovery, Effectiveness of a Budgeted Company, University Efficiency.

Abstract: The efficiency analysis of a company cannot be reduced to a great number of statistical tables, despite the

reliability of them. It has been shown to be a better idea to seek the “essence” using Knowledge Discovery

(KD) techniques. In this paper, a simple methodology to apply KD in the efficiency analysis of a budgeted

company is presented. These analyses complement those from classical OLAP and Interactive Graphics.

Specifically, it is shown how to use in three steps: univariate analysis, non-supervised and supervised multi-

variate machine learning techniques in order to support the decision making. All these procedures are illus-

trated using the SIGENU database (in Spanish: Sistema de Gestión de la Nueva Universidad) of UCLV stu-

dents and the efficiency measure was the student graduation on time or not. The presented methodology was

elaborated in 2009 and it has been preliminary validated during three semesters of the years 2010 to 2012.

1 INTRODUCTION

Business Informatics is a discipline which combines

Information Technology and Informatics and man-

agement concepts. It integrates core elements from

the disciplines business administration, information

systems and computer science into one field. Also, it

has been considered that Business Informatics in-

cludes significant construction and implementation

oriented elements, i.e. one major focus lies in the

development of solutions for business problems

rather than the ex post investigation of their impact.

We do not agree and we will try to show that it is not

a contradiction, at least, in some type of research.

Business Informatics conceptualization is been re-

freshed today and its perspectives are continuously

updated (Forbrig and Günther, 2010).

The efficiency analyses are normally based on

transversal or longitudinal statistical studies of cer-

tain indicators such as profits, costs, productivity,

etc. When the subject is a public sector company,

e.g. state budgeted universities, cost-profits could

not be considered. These indicators should be al-

ways important, but the priority will be the effec-

tiveness on time. In addition, the analysis about the

efficiency of a company cannot be reduced to a great

number of tables, since the results are difficult to

interpret as a whole. Then, an alternative to classical

methods consists in looking for the essential infor-

mation using Knowledge Discovery techniques.

In this paper we show how to use several of these

supervised or unsupervised techniques for efficiency

analysis of a university. Specially, we explain how

could be used univariate and multivariate analyses to

predict a dichotomic efficiency measure (the com-

pany achieves its goal or not) from data that repre-

sent risk factors to failure.

2 METHODOLOGICAL ASPECTS

AND COMPUTATIONAL

TOOLS

The Central University of Las Villas was selected in

this study for two main reasons: firstly, it is one of

the best Cuban universities, every year it is first to

third in the national ranking; and secondly, it is the

most multidisciplinary university in the country. In

this sense, we can say that this university is a good

selection for a sampling by clusters in the Cuban

universities (Grau et al., 2004). From the design

315

García L., Grau I. and Grau R..

Predicting the Efﬁciency with Knowledge Discovery of a Budgeted Company: A Cuban University - Validation through Three Semesters.

DOI: 10.5220/0004106603150318

In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (KDIR-2012), pages 315-318

ISBN: 978-989-8565-29-7

 2012 SCITEPRESS (Science and Technology Publications, Lda.)

point of view, the experiment is a longitudinal study,

particularly a panel-cohort study (Grau et al., 2004).

Actually, we consider all the students that are in the

last year of their careers in the academic course

2009-2010, but we added all the students that began

its studies in the course 2005-2006 (cohort) although

some of them are not in the last year now because

they have not success during all the years, and so

they will not finish degree on time. Finally, the gen-

eral sampling has 1007 students from 12 faculties

and 25 careers. The cohort is composed with 803

students that began in the academic course 2005-

2006.

The efficiency measure or objective function to

predict efficiency is the dichotomy variable “Finish

degree on time” (Yes/Not). Predictive variables are

some individual student data, previous to the begin-

ning of their university studies, and additionally, the

Faculty and the Career where they are now. It is

important to notice that we used only “epidemiolog-

ical” predictive data and we do not use “clinical”

data, for example, the student performance in the

first/second year. It is a good idea to use only previ-

ous data because it will allow us to obtain a predic-

tion for every student before the first class in his first

course.

In the first step of this methodology it is used the

Association Univariate Analysis based on Crosstabs

Tables. There have been used, particularly, classical

measures as Chi-square, Φ and Cramer’s V (Cramer,

2011) and the classical epidemiological measure of

Relative Risk (Prentice and Farewell, 1986). We try

to build an “integral score of risk to not-finish on

time”. In order to do that, we split each variable

using Decision Trees with the CHAID growing

method: Chi-squared Automatic Interaction Detec-

tion (Decision Trees, 2011). We calculated a statistic

that represents the presence of relative risk weighted

by its relevance, quantified by Cramer’s V.

The multivariate unsupervised Analysis is devel-

oped in two directions. First of all the work is fo-

cused on the predictive data in order to form clusters

of students which have similar previous characteris-

tics. There are used Two-step clusters to determine

the optimal cluster number (with Schwarz’s a Bayes-

ian Information Criterion) and the clusters them-

selves (Bacher et al., 2004). Afterwards is employed

the Categorical Principal Component Analysis

(CATPCA) with the variables to reduce them to two

dimensions (Meulman et al., 2002). In order to vali-

date the clusters and the dimensions found, the re-

sults of both techniques are cross tabulated with the

efficiency measure. But it is important to notice that

unsupervised results are independent from the objec-

tive function and so they could be used if another

objective function is selected. Finally, Decision

Trees in a supervised analysis are used again with

our efficiency measure, but this time we use CRT:

Classification and Regression Trees.

3 KNOWLEDGE DISCOVERING

3.1 Univariate Analysis

The Univariate procedure can be illustrated with the

variable Faculty. The Decision Tree forms three

groups of Faculties that exhibit a Chi-square statistic

with high significance. Students that finish on time

are predominant but in different proportions. We

constructed three dichotomic variables respectively

associated with these nodes and calculate: the Chi-

square statistic, the Cramer’s V (Φ), the Relative

Risk and its 95% confidence interval in their cross

tabulation with the efficiency measure. This proce-

dure is repeated for every possible predicting varia-

ble. From the KD point of view, it is already inter-

esting to know which variables show risk or protec-

tion (negative risk) in some of their categories.

An integral-univariate risk score in any student is

calculated. Essentially, we should sum the present

risk Φ weighted by Cramer’s, standardized and

compare it with a threshold that it is optimized with

a Curve ROC (Fawcett, 2004). The accuracy results

65%. It is not a spectacular result but at least prove

that we can obtain a classification with this Univari-

ate analysis.

3.2 Multivariate Unsupervised Analysis

We separated the unsupervised analysis (clusters)

from data and variables (principal components).

3.2.1 Discovering Clusters

The Bayesian Information Criterion in the Two-Step

Clustering technique found that two is the optimal

number of clusters. Cluster 1 has 309 students

(47.7% of the total analyzed). Cluster 2 has 339

students (52.3%). Due to the missing values (specif-

ically Academic Index in the previous school and

Scale to get admission) 359 students were excluded

in the clustering procedure. It determined the varia-

bles that essentially distinguish the clusters.

So, this cluster analysis allows us to classify the

students according to the predictive variables. This

analysis was repeated also within the cohort and the

results were similar: 2 clusters were formed; the

KDIR2012-InternationalConferenceonKnowledgeDiscoveryandInformationRetrieval

316

significant variables were the same, with the inter-

esting exception that appears the mother occupation

with significance. On the other hand the cross tabu-

lation between the clusters with the efficiency meas-

ure is similarly significant (accuracies 88,5% and

84,5%).

3.2.2 Discovering Dimensions

We look for two dimensions with CATPCA and we

obtained them with high reliability (Cronbach’s

Alpha 0.702 for the first dimension, 0.458 for the

second and 0.866 for the total). The variables more

strongly associated with each dimension were de-

termined.

It is interesting that dimension 1 includes varia-

bles relatives to the students whereas dimension 2

includes variables relatives to the university (faculty

and career). For instance, Birth_Year, Academic

index in the previous school, and Academic scale to

get admission are variables with high negative corre-

lation with dimension 1, and so it reduces its value

when these variables increase; whereas Sex and

Scholastic origin increase the dimension 1 propor-

tionally. It is important to explain that Sex was codi-

fied with the greater value for males and Scholas-

tic_Origin with the greater values for origins that are

not the normal High school institutes.

It was clear that dimension 1 is more important

than dimension 2 regarding to the efficiency meas-

ure, because it has more separated extreme values.

These are good news for faculties and university

directives, and help them in decision making about

what variables explore and which students support.

Similar results are obtained with this procedure

applied to the cohort students. Finally we could

combine the results of these two unsupervised analy-

sis. If we do this, we obtain, as it was expected, that

dimension 1 is more strongly marked in the cluster 1

where the efficiency measure is the worst and the

inverse with dimension 2 (ANOVA is significant).

Of course, it is possible to combine the clustering

and a CATPCA technique in other forms, for in-

stance, as it was elaborated in (Nuñez et al., 2006),

but the information will be redundant, and we prefer

to show the results of the supervised analysis.

3.3 Multivariate Supervised Analysis

In the Fig. 1 we show the decision tree obtained with

the CRT technique for the general sampling. This

technique is capable to work with missing values

replacing them with values of the central tendencies

corresponding variables. The detected interactions

are interesting and the different paths lead to 12

terminal nodes. Associated with every one of them

and travelling throughout the correspondent path, it

can be defined a “simplified rule” (actually saved in

several formats, in a more complex form) for the

prediction. The confusion matrix confirmed 70.3%

accuracy. It is enough if we take into account that

we try to predict the success only with epidemiolog-

ical data. And it was ratified with the cohort with

classification accuracy greater than 0.79. Definitive-

ly, this supervised form to predict the efficiency is

the best for the general and the cohort sampling and

can be easily incorporated to SIGENU.

4 VALIDATIONS

The rules obtained were validated with the students

that began in Sept. 2010. They finalized their first

semester in April 2011 and their first year in July

2011. All the prognostics were concordant in these

moments. Recently we made a validation with stu-

dents that began in Sept. 2011. It is interesting that

prediction is already exact at least in 39% of cases.

Actually these are not good news because the prog-

nosis is for a long time (5 years) and the student that

could fail must increase. But faculties are advised

with specific names of students in risk, in order to

make decisions.

5 CONCLUSIONS

It is shown a methodology about how to use, in three

steps, univariate and multivariate non-supervised

and supervised Machine Learning Techniques, in

order to predict an efficiency measure of a budgeted

company and finally, to support the decision making

related with it. Specifically we tried a university

efficiency measure in terms of the possibility of

student’s graduation on time.

 Univariate analysis was significant with appro-

priated risks. It was shown how to use optimally the

Cramer’s V to predict -in an original form- weighted

sum of risks.

 Unsupervised analysis was also significant. It

showed the student classification in two clusters, and

variables combined in two dimensions. Both separa-

tions were essentially capable to show their associa-

tion with the graduation on time or not.

 Supervised analysis was more significant be-

cause it obtains a higher accuracy (almost 80%) in

the efficiency prediction.

This methodology is applied and validated through

PredictingtheEfficiencywithKnowledgeDiscoveryofaBudgetedCompany:ACubanUniversity-Validationthrough

ThreeSemesters

317

Figure 1: Shows the decision tree obtained with the CRT technique. It describes how the predictive variables can interact to

obtain a better prediction of “Finish degree on time”.

three consecutive semesters and the predictions are

used to improve special attention to students at risk

of failing in the UCLV.

REFERENCES

Bacher, J., et al., 2004. SPSS Two Step Cluster. A First

Evaluation. In RC33 6th International Conference on

Social Science Methodology, Amsterdam.

Cramer, 2011. http://changingminds.org/explanations

/research/analysis/cramers_v.htm, consulted Oct.

2011.

Decision Trees, 2011. Learning to Classify with

Branching Test, http://www.aaai.org/AITopics/pmwiki

/pmwiki.php/AITopics/, consulted Oct. 2011

Fawcett, T., 2004. ROC Graphs: Notes and Practical

Considerations for Data Mining Researchers. Kluwer

Academic Publishers. Netherland.

Forbrig, P., Günther, H., 2010. Perspectives in Business

Informatics Research. In Referred proceedings of BIR

2010. Lecture Notes on Business Information

Processing, 64, Rostock, Germany.

Grau, R. et al., 2004. Metodología de la Investigación.

Editorial EL POIRA S.A. Ibagué, Colombia. 2nd Edi-

tion.

Meulman, J. J. et al., 2002. New features of Categorical

Principal Components Analysis for Complicated Date

Sets, Including Data Mining. In Gaul, W., Ritter, G.

(eds.) Proceedings of the 24 annual conference of the

Geselschaft für Classifikation V. 24. Part 3: Data

Mining, Information Processing and Automation,

Springer, pp. 207-218.

Nuñez, I. et al., 2006. Two-Step Cluster Procedure after

Principal Component Analysis Identifies Sperm

Subpopulations in Canine Ejaculates and Its Relation

to Cryoresistance. In J. Androl., 27 (4) p. 596

Prentice, R. L, Farewell, V. T., 1986. Relative Risk and

Odd Ratio Regression. In Annual review of public

health. 7, 35-58.

KDIR2012-InternationalConferenceonKnowledgeDiscoveryandInformationRetrieval

318