A Novel Method for Improving Productivity
in Software Administration and Maintenance
Andrei Panu
Faculty of Computer Science, Alexandru Ioan Cuza University of Iasi, 11 Carol I Blvd., Iasi, Romania
Keywords:
Information Extraction, Named Entity Recognition, Knowledge Engineering, Machine Learning, Web Mining.
Abstract:
After the initial development and deployment, keeping software applications and their execution environments
up to date comes with some challenges. System administrators have no insight into the internals of the applica-
tions running on their infrastructure, thus if an update is available for the interpreter or for a library packaged
separately on which an application depends, they do not know if the new release will bring some changes
that will break parts of the application. It is up to the development team to assess the changes and to support
the new version. Such tasks take time to accomplish. In this paper we propose an approach consisting of
automatic analysis of the applications and automatic verification if the changes in a new version of a software
dependency affect them. With this solution, system administrators will have an insight and will not depend on
the developers of the applications in such situations, and the latter will be able to determine faster the
impact of the new release on their applications.
1 INTRODUCTION
The increasing adoption of cloud computing services
has changed how software is developed, how it is
deployed and executed, and how we use and interact
with it (as developers or as ordinary users). We now
have applications that do not need to be locally in-
stalled, are accessible from everywhere using the In-
ternet, and are provided as a service (SaaS, Software
as a Service). These applications are usually com-
posed of multiple services that run "in the cloud", in
various environments. They can also be monolithic
applications of different sizes. Even mobile applica-
tions that are installed locally use backend services
hosted on servers. All these are executed in some
pre-configured environments. After the initial de-
velopment and deployment, some challenges appear
regarding the maintenance of the execution environ-
ment and of the application. For example, system ad-
ministrators face a dilemma when an update is avail-
able for the interpreter of a certain language (PHP,
Python, etc.), especially if it is a major one. Typically,
they are not the developers of the hosted applications
or services that rely on the interpreter, thus they do not
know if the update will bring changes that will break
some parts of the software. It is also not their re-
sponsibility to know any details about the internals of
the applications. If the developers are faced with the
task of supporting a new version of the interpreter, they
must make an assessment of the changes brought by
the update and the changes to be made in the appli-
cation. The same problem appears when updating a
library on which the software depends. These tasks
require some effort and time.
In this paper we propose a novel approach that
gives administrators insight into the mentioned
problem, and gives developers information about the
changes to be made, all in an automatic manner,
improving their efficiency for such tasks. Our
solution uses machine learning and natural language
processing techniques. It is independent of the lan-
guage in which the software was developed.
The most similar existing approaches, which verify
the differences between various versions of code are
those that treat the backward compatibility problem. A
lot of research is conducted on automating the assess-
ment of backward compatibility of software compo-
nents, but those solutions do not offer the same func-
tionality as our proposal; they provide a complementary
one (Ponomarenko and Rubanov, 2012; Welsch and
Poetzsch-Heffter, 2012; Ahmed et al., 2016; Tsan-
tilis, 2009). Their source of information is software
repositories, which they monitor and analyze.
The approaches are based on different techniques that
evaluate the differences between the old source code
and the new one (entire code or only the interfaces).
The focus is on assessing if a new version of a soft-
ware component is compatible with its old version,
from a functional point of view. They do not mention
anything about the capability to provide information
about the change of support status of functionalities
in different versions of the libraries/interpreters.
There are some approaches that address this com-
patibility problem with the same purpose as our pro-
posal, but they are not that general and require a lot of
manual work for creating their source of knowledge.
For example, for PHP, there is a tool, PHP CodeSnif-
fer (Sherwood, 2017), that tokenizes the source code
and detects and fixes violations of code formatting
standards. This is used by another tool, PHPCompat-
ibility (Godden, 2017), that contains a set of sniffs for
the former and is able to check for PHP version com-
patibility. Various similar custom tools were devel-
oped for different scopes and languages and libraries.
The major disadvantages of such tools are
that they require a tremendous manual effort to cre-
ate their knowledge base and that they are developed
specifically for a certain language. To the best of our
knowledge, a system similar to our approach does
not exist.
Our proposal also makes a contribution regarding
the identification and extraction of entities in the pro-
gramming domain. Known solutions that are capable
of identifying various terms, like AlchemyAPI, Open-
Calais, TextRazor, and MeaningCloud, recognize entities
from categories like people, places, and dates, not enti-
ties in the programming context. To the best of our
knowledge, a similar system does not exist.
In the next section we present our approach and
give details about how it solves the presented prob-
lem. In section 3 we describe a prototype platform
that implements our ideas. Section 4 presents some
experimental results. Our approach may also be ap-
plicable in other fields, use cases which are briefly
described in section 5. Finally, section 6 contains fu-
ture development and research ideas for the evolution
of the technology.
2 OUR PROPOSAL
In this vast landscape of software applications of dif-
ferent sizes that run in execution environments con-
figured directly on bare-metal servers, on virtual ma-
chines, or in containers, the management of the appli-
cations and the environments after the initial devel-
opment and deployment is a challenge. Our proposal
addresses a specific management problem, regarding
keeping the software up to date.
System administrators maintain the infrastructure
for running different applications. One of their major
tasks is to maintain the systems up to date and this
generates some difficulties. When they are faced with
the situation of updating the execution environment
for the deployed applications, for example updating
an interpreter (for PHP, Python, Perl, etc.) to a new
version, they have to answer questions like: will the
existing applications run on the new version of the
interpreter? Are there any parts of the applications
that will not run because of the changes?
These are not questions that can be answered eas-
ily, because the administrator is usually not the de-
veloper of the application(s). The same thing applies
in case of libraries which are packaged and installed
independently and on which the applications depend.
Given that the sysadmin does not have any
knowledge about the application (except the versions
of the interpreter/libraries required when the ap-
plication was developed and deployed), he/she can
only base the decision to update on assumptions.
One intuitive assumption is that if there is a mi-
nor version update for the interpreter or library, ev-
erything should be fine, as no major changes in func-
tionality occurred. In the majority of cases, this holds
true, but it is still an assumption, not a certainty. The
problem arises when there is a major update and the
development team does not plan to update the appli-
cation to support the new release. The problem is very
serious in the case of interpreted languages, because
errors appear at run-time, only when certain blocks
of code get executed, not initially when everything
gets compiled, as is the case with compiled languages.
Thus, some parts of the application may work, while
other parts may not. The system administrator simply
will not know if there will be parts of the applica-
tion that will not execute; he/she is neither the developer
nor the tester. This problem scales, because
a single administrator can have multiple applications
running on his/her infrastructure. Our proposal offers
a solution for this situation, by automatically analyz-
ing and providing information about whether the new
version will bring some changes that will break the
application or not.
Nowadays there is also a shift regarding the man-
agement of the execution environment, from the ded-
icated system administrators to the teams develop-
ing the applications, by using containers like Docker.
This does not solve the problem, it only shifts it to the
developers; moreover, it does not mean that adminis-
trators do not care about the state of the environ-
ments of the containers running on their infrastruc-
ture. Thus, when the development team wants to up-
date the interpreter or some libraries used, it faces the
same problem described above. Even though the team
knows the application, it does not know exactly which
blocks of code will execute and which will not. The
solution is to analyze the changelog/migration guide
and to assess the changes that need to be done. This
manual procedure is time consuming. Our solution
helps reduce this time.
The approach we propose for solving this
problem is a technology that is able to automatically
scan the code and verify if the used functionalities
are still supported in a targeted version of the inter-
preter/library. This technology is independent of the
used programming language. As we have stated ear-
lier, the focus is on interpreted languages. This tech-
nology is comprised of three parts:
1. A tool that automatically analyzes the code, ex-
tracts the used functionalities, and queries a
knowledge base that helps to answer the follow-
ing questions: is the functionality X supported in
the new version N? If not, what are the changes
that were made?
2. An automatically created knowledge base (Noy
et al., 2001) that contains information about the
functionalities supported in every version of the
interpreter/library;
3. A platform that extracts specific entities from on-
line or offline manuals, independently of the pro-
gramming language, and populates the knowledge
base.
The tool is the part that must have access to the
code and which uses the (remote) knowledge base to
verify if there are functionalities used that are not sup-
ported anymore (or are marked to be removed in the
future) in the targeted version. It generates a report
based on the findings. The key enabler of our tech-
nology is the knowledge base. The most important
aspect is the contained information and the way it is
obtained. For the current version of the platform that
populates the knowledge base, which is presented in
section 3, the functionalities taken into consideration
are the supported functions in all versions of the inter-
preter/library (the provided APIs). Thus, the analysis
tool is a fairly basic component: all it needs to
do is extract all the functions in the code, eliminate
those declared locally, and query the knowledge base
to check for support in the targeted version. The only
functionality that is more complex is the filtering of
the functions provided by a certain used library or by
the interpreter.
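As an illustration, a minimal sketch of such an analysis step is given below. The regular expressions target PHP-style code and the kb.lookup() helper stands in for the actual knowledge-base query; both are assumptions made for the example, not the implementation itself.

```python
import re

# Hypothetical sketch of the analysis step: collect the function names
# invoked in a source file, drop those defined locally, and look the rest
# up in the knowledge base. The regexes target PHP-style code; in practice
# control-flow keywords (if, while, ...) would also have to be filtered.
CALL_RE = re.compile(r'\b([A-Za-z_]\w*)\s*\(')
DEF_RE = re.compile(r'\bfunction\s+([A-Za-z_]\w*)\s*\(')

def used_api_functions(source: str) -> set:
    called = set(CALL_RE.findall(source))
    local = set(DEF_RE.findall(source))
    return called - local

def support_report(source: str, target_version: str, kb) -> list:
    findings = []
    for name in sorted(used_api_functions(source)):
        # kb.lookup() stands in for the query against the knowledge base,
        # returning e.g. 'supported', 'deprecated', or 'removed'.
        status = kb.lookup(name, target_version)
        if status != 'supported':
            findings.append((name, status))
    return findings
```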
In this paper, the focus of our work is on the
platform that creates the knowledge base. As we
have mentioned, the data contained consists of de-
tails regarding all the supported functions in the in-
terpreter/library. For each function, we have different
attributes, like its signature’s components (the func-
tion’s name, the number of arguments, the types of ar-
guments, the order of the arguments), the return type,
its short description, its availability (supported, dep-
recated, obsolete), and the version number of the in-
terpreter/library in which it is supported. All the in-
formation is extracted from online manuals available
on the Web or offline ones. The platform described
in section 3 is capable of automatically extracting the de-
sired data, independently of the content (certain man-
uals for certain languages/libraries) or the structure of
the web pages. The extraction technology does not
implement any adapters for specific manuals.
It supports any manual for any language. The only
restriction is regarding the syntax used for writing the
functions in the manual. This capability is achieved
using machine learning algorithms and different nat-
ural language processing (NLP) techniques.
Having such knowledge and being able
to automatically analyze the code and obtain a status
report regarding unsupported changes offers some
major advantages. The administrators can now have
an insight into what will happen with the execution
of the application if the update is made. All this can
be done without him/her knowing any details about
the implementation of the application. The devel-
opment team, although it knows the internals of the
application, does not know exactly which blocks of
code will execute and which will not when some
functions have changed. The solution for this
is to read the changelog/migration guide, to extract all
the functions that were modified and to search for all
occurrences of the functions that need to be modified
in the application. Another approach is to do a thor-
ough test of the application and check if everything
works or not, and if there are problems, the changelog
must be consulted. This manual procedure requires
a lot of time. A tool that is capable of
scanning the code, interrogating our knowledge base,
and reporting which functions are no longer supported
or have changed, pointing the developers exactly
to the blocks of code that need to be modified,
saves a lot of time by eliminating the manual
steps described above. The report also provides
details about the differences between the old and the
new function(s), easing the job of the developers to
make the changes, by not requiring them to manually
search for that information. All these aspects will im-
prove the delivery time for the updates. Such a report
that can be created automatically, containing all the
changes that need to be done, is also very useful in
case of project estimation. It can be used by all the
persons involved in estimating a project (e.g. project
managers, business analysts, etc.) to find out from
the start the implications of making a cer-
tain update (in the context described above), without
requiring some developers to make an assessment of
the changes that need to be done. It eliminates all that
manual labor.
Because our solution needs access to
the source code, the development teams and the owners
of the applications may have legitimate privacy con-
cerns. Privacy is a very important issue nowadays,
and there are many legal and technological aspects
regarding this (Pearson, 2009; Alboaie et al., 2015).
Our approach and design are in line with principles like
Privacy by Design and Privacy by Default (Ferretti,
2015). The functionality of the analysis tool that runs
on the users’ machines and scans the code is trans-
parent, and is easily verifiable by the user. This tool
extracts only the functions used in the code, which
represent public information (they are provided by
the interpreter/library), ignoring everything else in the
source code. Further, only those functions are trans-
mitted into the requests made to the knowledge base.
Thus, effectively no proprietary code is extracted and
transmitted.
3 SYSTEM ARCHITECTURE
AND DESIGN
In this section we present the architecture of a proto-
type platform that is capable of automatically access-
ing the manuals of interpreters/libraries available on
the Web or offline, identifying and extracting the spe-
cific entities, and populating the knowledge base. Its
components are designed to be context independent
and decoupled. Each component offers its function-
ality as an independent service. The designed archi-
tecture is based on SOA (Service Oriented Architec-
ture) principles. The platform is fully implemented
in Python. Figure 1 depicts the system’s architecture.
The platform has four main components (CorpusTrain
SigDetection, CorpusTrain VerDetection, WebMiner,
and OntoManager), and its functionality is split into two
phases.
The first, the training phase, uses a machine learn-
ing algorithm to generate two models used for the de-
tection of function signatures and version numbers of
the interpreter/library in which the function is sup-
ported. The training data contains manually anno-
tated information downloaded from PHP and Python
online manuals. The components involved in this
step are CorpusDownloader, CorpusReader, Corpus-
Reader TrainingData, CorpusTrain SigDetection and
CorpusTrain VerDetection. Classifier SigDetection
and Classifier VerDetection are the classifiers that re-
sult from this step.
The second, the extraction phase, consists
of accessing the manuals, analyzing them, and ex-
tracting specific knowledge. The main orchestrator
of all these operations is the WebMiner component. The
other components used are CorpusDownloader, Cor-
pusReader, SigContext, NER Version Number, NER
SigComponents, NER SigDescription. The identified
data is then saved in the knowledge base, an operation
managed by OntoManager. Below we present imple-
mentation details about each component.
3.1 CorpusDownloader
This module is a specialized Web crawler that ac-
cesses online manuals. It accomplishes this through
the use of the lxml library (lxml, 2017). For navigating
through the links of the manual we use XPath expres-
sions, which are custom built for each site, guiding
the crawler to the desired web pages. The content of
each page is then downloaded and saved to a local
storage using Lynx (Dickey, 2017). This is a text web
browser, but we use it as a headless browser, without
requiring any user input, to save the rendered content,
exactly as a user sees it. This is an important op-
timization that improves the extraction process. Li-
braries like lxml or BeautifulSoup are able to download
web pages, but they download them with the HTML
code included, and when we want to clean the tags
and extract only the content, there are situations when
a sentence (e.g. a function signature) is split across mul-
tiple rows, making it much harder to detect. This de-
pends on how the developers wrote the HTML code.
By using a headless browser, our component is
independent of the structure and styling of the page,
and thus is not affected by future layout changes.
It gives us the content exactly as a human being
sees it. The only dependency is on reaching the
desired pages.
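A condensed sketch of this crawling step is shown below, assuming Lynx is installed; the XPath expression passed in stands for one of the site-specific expressions mentioned above.

```python
import subprocess
from lxml import html

def crawl(start_url: str, link_xpath: str, out_dir: str) -> None:
    # lxml can parse directly from a URL; the XPath expression guides the
    # crawler to the desired manual pages.
    tree = html.parse(start_url)
    tree.getroot().make_links_absolute(start_url)
    for url in tree.xpath(link_xpath):
        # Lynx renders the page and dumps the text exactly as a user sees
        # it, so signatures are not split across HTML elements.
        rendered = subprocess.run(
            ['lynx', '-dump', '-nolist', url],
            capture_output=True, text=True, check=True).stdout
        name = url.rstrip('/').rsplit('/', 1)[-1] or 'index'
        with open(f'{out_dir}/{name}.txt', 'w', encoding='utf-8') as f:
            f.write(rendered)
```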
3.2 CorpusReader
This component loads downloaded content, splits it
into a list of sentences, then each sentence is split into
a list of words. The order of sentences and words
is maintained. The sentence segmentation is done
using Punkt sentence segmenter from NLTK plat-
form (NLTK, 2017), which provides an already trained
model for the English language. For word segmentation,
Penn Treebank Tokenizer is used, which uses regular
expressions to tokenize text. After that, all punctua-
tion signs are removed because, in the current version
of the implementation, they do not bring any value to our
extraction process; removing them also saves some memory.
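The pipeline described above can be sketched as follows (a simplified illustration, not the exact implementation):

```python
import nltk
from nltk.tokenize import TreebankWordTokenizer

_word_tokenizer = TreebankWordTokenizer()

def read_corpus(text: str) -> list:
    sentences = []
    for sent in nltk.sent_tokenize(text):       # Punkt, pre-trained for English
        words = _word_tokenizer.tokenize(sent)  # Penn Treebank regex tokenizer
        # Drop pure punctuation tokens; the order of sentences and words is kept.
        words = [w for w in words if any(c.isalnum() for c in w)]
        if words:
            sentences.append(words)
    return sentences
```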
Figure 1: System Architecture.
3.3 CorpusReader TrainingData
This module loads the training data that is used for
generating classifiers that can identify signatures and
version numbers. It uses all the functionalities pro-
vided by CorpusReader and, additionally, categorizes
each instance based on its label, which can be one of
the following two: pos or neg.
3.4 CorpusTrain SigDetection
The purpose of this component is to train a binary
classifier to detect signatures. The assigned labels are
pos and neg. CorpusReader TrainingData provides
the labeled data, from which the feature sets are ob-
tained. For a signature, we use the following features:
whether the character before last is a closing parenthesis;
whether there is an equals sign before the first occurrence
of the '(' character; whether there are special keywords
(like if, for, foreach, while, etc.) before the first '(';
the number of colons before the first '('; whether there
are more than three words before the first '('; whether
there are letters before the first '('; whether there are
unusual words (like ones that do not contain letters or
contain only one character) before the first '('; whether
the last character is in a predefined list of characters
(like !, @, &, *, etc.); whether there is more than one
character after the last ')'; whether there is a single pair
of top-level brackets; and whether there is more than one
top-level bracket pair.
The training data contains examples from PHP
and Python manuals. The generated feature sets are
used to train a Naive Bayes classifier. Although it is
one of the simplest models, we have chosen Naive
Bayes because it is known to give good results for
various use cases and because it does not need a large
number of training instances. The generated classifier
is used further by Classifier SigDetection component,
which offers a single service: it receives a sentence
and establishes if it represents a signature or not.
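A condensed sketch of this training step is given below, using only a subset of the features listed above and the 70%/30% split described in section 4:

```python
import nltk

def sig_features(sentence: str) -> dict:
    head = sentence.split('(', 1)[0]  # text before the first '(' character
    return {
        'paren_before_last': len(sentence) > 1 and sentence[-2] == ')',
        'equals_before_paren': '=' in head,
        'keyword_before_paren': any(k in head.split()
                                    for k in ('if', 'for', 'foreach', 'while')),
        'colons_before_paren': head.count(':'),
        'many_words_before_paren': len(head.split()) > 3,
    }

def train_sig_classifier(labeled):  # labeled: list of (sentence, 'pos' | 'neg')
    feats = [(sig_features(s), lab) for s, lab in labeled]
    split = int(len(feats) * 0.7)   # first 70% for training, the rest for testing
    classifier = nltk.NaiveBayesClassifier.train(feats[:split])
    print('accuracy:', nltk.classify.accuracy(classifier, feats[split:]))
    return classifier
```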
3.5 CorpusTrain VerDetection
This module trains a Naive Bayes binary classifier
for detecting version numbers. The assigned labels
are also pos and neg. The feature sets are obtained
from the data provided by CorpusReader Training-
Data. We analyze each word from each sentence, tak-
ing the word's context also into consideration. We use
the following features for learning: whether the first
character is 'v', the number of dots, whether the majority
of its characters are digits, the preceding word, and
whether the preceding word is in a gazetteer list (which
contains names of programming languages).
The training data contains examples from PHP,
Python and jQuery manuals. The generated classifier
is used by Classifier VerDetection component, which
offers a single service: it receives a sentence and a
word in that sentence, and establishes if the word rep-
resents a version number.
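For illustration, the word-level feature extraction could be sketched as follows; the gazetteer contents shown are assumptions:

```python
GAZETTEER = {'php', 'python', 'jquery', 'perl', 'ruby'}  # illustrative contents

def version_features(words: list, i: int) -> dict:
    word = words[i]
    prev = words[i - 1].lower() if i > 0 else '<START>'
    return {
        'starts_with_v': word[:1] == 'v',
        'dot_count': word.count('.'),
        'mostly_digits': sum(c.isdigit() for c in word) > len(word) / 2,
        'prev_word': prev,
        'prev_in_gazetteer': prev in GAZETTEER,
    }
```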
3.6 SigContext
This component receives all the sentences of the en-
tire document, identifies the signature(s) using Classi-
fier SigDetection, and establishes the descriptive con-
text of each signature. The algorithm for identifying
the context is intuitive and models the way humans
analyze this: usually all the descriptive details are af-
ter the line containing the signature or there are situ-
ations when there is a line containing only the func-
tion's name (as is the case with the PHP manual). This
marks the beginning. The context ends when a line
containing another signature is identified or the end
of document is reached. SigContext builds a model of
the document containing indexes that represent the
positions of the signatures, the start, and the end of
their contexts.
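A sketch of this context-building pass is given below:

```python
def build_document_model(sentences: list, is_signature) -> list:
    # is_signature() wraps the Classifier SigDetection service.
    sig_idx = [i for i, s in enumerate(sentences) if is_signature(s)]
    model = []
    for n, i in enumerate(sig_idx):
        # The context starts right after the signature line and ends at the
        # next signature or at the end of the document.
        end = sig_idx[n + 1] if n + 1 < len(sig_idx) else len(sentences)
        model.append({'signature': i, 'context_start': i + 1, 'context_end': end})
    return model
```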
3.7 NER Version Number
This component receives the signatures and their de-
scriptive contexts and searches for the version num-
bers of the interpreter or library in which the function
is supported/unsupported. First it checks if there is
a version number outside the context, usually mean-
ing that the version applies to all the signatures. After
that, it extracts the number(s) detected inside the con-
text (which have higher priority over the one outside
the context). It uses Penn Treebank Tokenizer from
NLTK for tokenizing each sentence into words, and
our trained classifier provided by Classifier VerDetec-
tion for detecting version numbers. Beside the ver-
sion, it verifies its context for identifying the func-
tion’s availability state: supported, deprecated, re-
moved. For that, we have built a dictionary containing
stems of different keywords (like changed (in), added
(in), removed (from), etc.) that denote the state. Then
each word in the context of the version is stemmed
and compared with the entries in the dictionary in or-
der to detect the state. Stemming is the technique that
removes affixes from a word. We use it as an op-
timization, allowing us not to store all the forms of
words in the dictionary. For this procedure, the Porter
stemming algorithm from NLTK is used.
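The state-detection step can be sketched as follows; the keyword-to-state mapping shown is an assumption for the example:

```python
from nltk.stem import PorterStemmer

_stemmer = PorterStemmer()
# Keyword stems mapped to availability states; the mapping is illustrative.
STATE_STEMS = {
    _stemmer.stem('added'): 'supported',
    _stemmer.stem('changed'): 'changed',
    _stemmer.stem('deprecated'): 'deprecated',
    _stemmer.stem('removed'): 'removed',
}

def availability_state(context_words: list) -> str:
    for word in context_words:
        state = STATE_STEMS.get(_stemmer.stem(word.lower()))
        if state:
            return state
    return 'supported'  # assumed default when no state keyword is found
```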
3.8 NER SigComponents
This module receives a sentence representing a sig-
nature and identifies its components (e.g. function’s
name, return type, parameters, etc.). We defined sev-
eral regular expression rules for this extraction.
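One illustrative rule is sketched below; the actual rule set is richer:

```python
import re

SIG_RE = re.compile(
    r'(?:(?P<rtype>[\w\\]+)\s+)?'   # optional return type
    r'(?P<name>[\w.$]+)\s*'         # function name
    r'\((?P<params>[^)]*)\)')       # parameter list

def signature_components(signature: str) -> dict:
    m = SIG_RE.search(signature)
    if not m:
        return {}
    params = [p.strip() for p in m.group('params').split(',') if p.strip()]
    return {'name': m.group('name'),
            'return_type': m.group('rtype'),
            'parameters': params}
```

For example, signature_components('bool chmod ( string $filename , int $mode )') yields the name chmod, the return type bool, and the two parameters.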
3.9 NER SigDescription
This component is capable of identifying the short de-
scription of a signature, which contains details about
its functionality. It searches for it in the signature’s
descriptive context. We are using a dependency parser
provided by spaCy library (ExplosionAI, 2017) to an-
alyze the grammatical structure of each sentence. The
textacy library (DeWilde, 2017) is used for manipu-
lating SVO triples (subject-verb-object) identified by
spaCy. The detected lexical items and their gram-
matical functions represent the source of information
for our identification algorithm. This algorithm im-
plements some observed patterns commonly used to
express the description and selects the first sentence
that meets its rules. This represents a good choice
because the description is usually near the signature.
We first look at the subject linked to the root verb. If
it contains the function’s name, then the sentence is
very probable to be the description. If the root verb
does not have a subject, we look at its tense. If it is
labeled as VB (base form) or VBZ (3rd person sin-
gular present), then it is very likely to be the descrip-
tion, this being a very common way used to express it.
Also, long sentences are split and each part is analyzed
individually, because in such situations the dependency
parser makes errors in identifying the root verb,
its subject, etc.
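A sketch of this pattern check is given below, using spaCy directly (the platform also uses textacy's SVO helpers); the model name is an assumption for current spaCy releases, and the sentence-splitting workaround is left out:

```python
import spacy

# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')

def looks_like_description(sentence: str, function_name: str) -> bool:
    doc = nlp(sentence)
    root = next(tok for tok in doc if tok.dep_ == 'ROOT')
    subjects = [t for t in root.children if t.dep_ in ('nsubj', 'nsubjpass')]
    if subjects:
        # A subject containing the function's name strongly suggests a description.
        return any(function_name in t.text for t in subjects)
    # No subject: a root verb tagged VB (base form) or VBZ (3rd person
    # singular present) is a common way descriptions are phrased.
    return root.tag_ in ('VB', 'VBZ')
```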
3.10 WebMiner
This is the orchestrator of the entire information ex-
traction process. Through CorpusReader it obtains
the content of the page(s) that contain the signatures.
Then it provides this list of sentences to SigContext,
which returns a model containing indexes that repre-
sent which sentence is a signature and which ones are
the start and the end of the signature's descriptive con-
text. After that, it sends the signature to NER Sig-
Components to obtain its components. NER SigDe-
scription receives the context and returns the signa-
ture’s description. NER Version Number receives the
signature’s context and also the general context, as we
named it, and returns the version numbers. All data is
put together and is provided to OntoManager.
3.11 OntoManager
This component receives the extracted information
and generates instances for the knowledge base. The
data is expressed using RDF (Resource Description
Framework) model. We use RDFLib (RDFLib, 2017)
to work with the RDF triples. These triples contain in-
formation structured according to concepts illustrated
by our software ontology. This ontology was created
using Protégé. The existing ontologies that are spe-
cific to this domain, SEON Code Ontology (Würsch
et al., 2012), Core Ontology of Programs and Soft-
ware (COPS) (Lando et al., 2007) and Core Software
Ontology (CSO), with its extension, Core Ontology of
Software Components (COSC) (Oberle et al., 2009),
do not contain all the conceptual descriptions that are
needed in our case. Thus, we have extended them
with some new concepts, attributes and relations that
model accurately the extracted information.
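A sketch of the instance-generation step with RDFLib is given below; the namespace and property names are placeholders, not the actual terms of the extended ontology:

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

# The namespace and property names are placeholders for the terms of our
# extended software ontology.
SW = Namespace('http://example.org/software-ontology#')

def add_function(g: Graph, name: str, description: str,
                 version: str, state: str) -> None:
    fn = SW[name]
    g.add((fn, RDF.type, SW.Function))
    g.add((fn, SW.shortDescription, Literal(description)))
    g.add((fn, SW.inVersion, Literal(version)))
    g.add((fn, SW.availability, Literal(state)))

g = Graph()
add_function(g, 'chmod', 'Changes file mode.', '4.0', 'supported')
g.serialize(destination='kb.ttl', format='turtle')
```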
4 EXPERIMENTAL RESULTS
The performance of the platform for the current tar-
geted use cases is given by the capability of the clas-
sifiers used to detect the signatures and the version
numbers. For the classifier that detects signatures,
the training set contains 280 positive examples and
290 negative ones, and was created from PHP and
Python manuals. The first 70% of the labeled in-
stances for each label are used for training, and the
rest for testing. The trained model has an accuracy
of around 98%. For the classifier that detects version
numbers, the training instances are created from ex-
amples from PHP, Python, and jQuery manuals. The
positive set contains 48 rows with 201 words in total,
within which there are 62 words representing a ver-
sion number. The negative set contains 43 rows with
341 words in total. The feature sets are split: the first
70% of the labeled instances for each label is used for
training, and the rest for testing. The model has an ac-
curacy of around 95%. We say that the models have
an accuracy value around a certain percentage because
at training time we randomize the instances, so the
value differs slightly between executions.
As a validation test, we pointed the platform
to extract data from various pages selected ran-
domly from Node.JS (version 7.7.0) (Node.JS, 2017),
Ruby (Ruby, 2017), PHP (PHP, 2017), Python (ver-
sion 3.6.0) (Python, 2017), and Laravel (Laravel,
2017) online manuals.
Table 1 summarizes the performance of the plat-
form in detecting the signatures for each case. The
second column contains the total number of functions
that exist in each page, and the last column the per-
centage of the detected signatures. For Node.JS, PHP,
and Python, it extracted all the functions. In case of
Laravel, it missed one function whose name contains
only one character.

Table 1: Signature detection performance.

Man page    No. of functions    Detection rate
Node.JS     26                  100%
Ruby        59                  64.4%
PHP         1                   100%
Python      30                  100%
Laravel     80                  98.75%

In case of Ruby, it detected only 38 functions from a total of 59. This result is not very
good because the page contains many functions
that are not written with parentheses, parentheses being
a very important feature when searching for the pat-
tern. In case of PHP, the platform identified 5 more
functions (false positives): the page contains a lot of
comments with code examples, and the platform was
unsuccessful in filtering out all of the functions mentioned
there. In the rest of the cases we did not have any
false positives or negatives.
Regarding the identified versions, we obtained the
following results:
- for Node.JS: the page contains 26 functions, 5 of them having a single version specified (representing when the function was added), the rest containing two versions (when it was added and since when it is deprecated). The platform correctly identified all of them, with their status. We do not have any false positives or negatives;
- for Ruby: the page does not contain any version numbers, thus the system correctly did not identify any;
- for PHP: the page contains a single function with 3 version numbers mentioned. The platform successfully identified all of them;
- for Python: the page contains 30 functions, 4 of them having two versions specified, 1 of them having 3 versions specified, and the rest only 1 version. The system correctly identified all of them, without any false positives or negatives;
- for Laravel: the page does not mention any details about versions inside the context of the functions, thus the system correctly did not identify any.
Regarding the additional information extracted,
for each of the identified functions, the component
NER SigComponents successfully extracted all of its
parameters and its return type. For the detection of
the functions’ descriptions, the system obtained the
results presented in Table 2.
The second column represents the total number of
signatures that were detected (each having a single de-
scription), and the last one the percentage of the de-
tected descriptions.

Table 2: Description detection performance.

Man page    No. of det. functions    Detection rate
Node.JS     26                       76.9%
Ruby        38                       68.4%
PHP         6                        100%
Python      30                       93.3%
Laravel     79                       96.2%

In case of Node.JS, it failed to
identify 6 descriptions. For Ruby, it missed 12 de-
scriptions. Regarding PHP, it successfully identified
the description of the single function. For the other 5
false positives, it did not detect anything because they
are examples of code, thus they do not have descrip-
tions. In case of Python, it did not identify the de-
scriptions of 2 functions. Finally, for Laravel, it failed
to extract 3 descriptions.
5 ADDITIONAL APPLICABILITY
Our solution directly targets the persons involved in
assuring software administration and software main-
tenance, for the use cases described previously. The
features offered by the platform can also be used in
the context of the Semantic Web. The purpose of the
Semantic Web is to make the content available on the
Web understandable not only by humans, but also
by computers, without using artificial intelligence.
This is mainly done by enriching the Web documents
with semantic markup in order to add meaning to
the content. Thus, a machine will be able to pro-
cess knowledge about text. There are many seman-
tic annotation formats that can be used in HTML
documents, like Microformats, RDFa, and Micro-
data. In order to facilitate the annotation process,
many systems were developed (Reeve and Han, 2005;
Uren et al., 2006; Laclavik et al., 2012; Sánchez
et al., 2011; Charton et al., 2011). Considering the
annotation process, there are tools that allow users
to manually create annotations, and tools that cre-
ate them semi-automatically or automatically. In the
latter case, considering the methods used, they can be
classified into two categories: pattern-based and ma-
chine learning-based (using supervised or unsuper-
vised learning techniques). The tools that are based
on supervised learning require training in order to
identify our specific types of entities. Also, the major-
ity of them require the existence of an ontology that
defines the semantics of the domain. To the best of
our knowledge, an ontology for the kind of informa-
tion we deal with does not exist. Also, the existing ap-
proaches typically cover real-world entities. Our plat-
form provides such an ontology and also the informa-
tion extraction techniques needed to identify and ex-
tract the entities that must be annotated (e.g. the func-
tion, return type, parameters, parameters type, etc.).
Thus, the capabilities of the platform can enable auto-
matic generation of semantic markup for specific Web
documents. A special tool must be developed that is
able to match each entity extracted by our platform
with the information on a page, adding the appropri-
ate annotations.
Another use case is more advanced and represents
a vision. It refers to assisting developers in the pro-
cess of building new applications, especially for the
Internet of Things (IoT). In the context of the IoT,
there is the need for a single software application to
have support for different devices from different man-
ufacturers. This capability is usually obtained through
the implementation of adapters, which are dedicated
software modules for each device. Each module uses
the specific API (Application Programming Interface)
of the vendor to interact with its device. In the case
of supporting devices from different manufacturers
that have the same functionality (e.g. smart plugs,
air quality monitors, dimmer switches, thermostats,
smoke detectors, etc.), each API is learned and sim-
ilar adapters are built, because the functions that in-
teract with the devices are mostly the same, like set-
ting a certain value for a threshold or getting a cer-
tain value. The differences are in their name and pos-
sibly parameters, but their functionality is the same.
Our idea implies the development of a single adapter
and the use of a specially built tool that is able to an-
alyze the functions used in that adapter and propose
the equivalent functions in the APIs of other similar
devices for which adapters must be developed. Our
knowledge base, which can contain all the functions in
each API, is the source of information. Based on
that information, we envision the development of a
new technology that is able to suggest similar func-
tions in the targeted APIs, by analyzing a function’s
signature, and most importantly, the meaning of its
description and, eventually, each parameter’s descrip-
tion. This capability can be achieved with our plat-
form by using NLP resources, like WordNet, a se-
mantically oriented dictionary of English, and tech-
niques for analyzing the structure of a sentence (by
using context free grammars, dependency grammars,
feature based grammars, etc.) and its meaning (by
using first order logic, λ calculus, etc.) (Bird et al.,
2017). Accomplishing semantic analysis is very hard;
it represents the most complex phase of natural lan-
guage processing. There is a lot of ongoing research
towards this goal and on the task of computing sen-
tence similarity (Atoum et al., 2016; Dao and Simp-
son, 2005; Crockett et al., 2006; Miura and Takagi,
2015; He et al., 2015; Liu and Wang, 2013; Sanborn
and Skryzalin, 2015; Erk, 2012). There are also many
tools available capable of comparing the meaning of
two sentences, with different success rates (SEMI-
LAR, 2017; DKPro, 2017; RxNLP, 2017; Pilehvar
et al., 2013; Linguatools, 2017; Cortical.io, 2017).
At the moment, we think that existing solutions can
provide acceptable results, but much further research
must still be done to accomplish our vision. To the
best of our knowledge, there is no available system
that is using this kind of techniques and is capable of
generating code in this context.
6 CONCLUSIONS AND FURTHER
WORK
The proposed solution is based on many components,
each with varying degrees of complexity. Refine-
ments can be made to any of them for further im-
proving the platform (e.g., using an improved head-
less browser that is aligned with the latest Web stan-
dards, improving the classifiers, implementation of
new patterns for the detection of signatures' descrip-
tions, etc.). Regarding the system’s architecture, we
plan on adding a REST API to each of the compo-
nents, thus making the platform easily deployable and
accessible in cloud environments and highly scalable.
It could also be easily integrated in environments
that make use of Enterprise Service Buses (Chappell,
2004), or newer approaches like Swarm Communica-
tion (Alboaie et al., 2014; Alboaie et al., 2013). Be-
sides the improvements that will be made to the plat-
form, much work will also be directed towards the
development of the tools that enable the use cases.
First, there is the tool that is able to scan code written
in different languages, extract all the used functions,
eliminate those that were locally defined, interrogate
our knowledge base, and build the report regarding
support status for the targeted version of the inter-
preter/library. Then, we have the semantic annotator
tool, which is able to mark up HTML code referencing
the specific entities.
In this paper we have addressed the problem of
keeping software applications and their execution en-
vironments up to date. As we have seen, this task
comes with some difficulties. System administrators
have no insight into the internals of the applications
that run on their infrastructure, so if they are faced
with updating the execution environment, they do not
know if the applications will be fully functional after
the update. It is up to the development team to do
the necessary verifications and any required changes. The
same applies in case of software libraries on which
the applications depend. The tests require a lot of
manual work and take time to accomplish. We have
proposed a novel approach that improves productiv-
ity in accomplishing such tasks, by automating the
assessment of the changes that were made in a new
version and their impact on the functionality of the
application. Our solution is based on one major com-
ponent, the platform that is able to automatically cre-
ate a knowledge base containing details about sup-
ported functionalities in each version of the targeted
interpreter/library, independent of the programming
language used. It has applicability in many fields,
like software administration, software maintenance,
and even project management. In this paper we pre-
sented the prototype of the platform. To the best of
our knowledge, similar solutions do not exist. The ex-
isting tools that are used require manual specification
of the changes that were made and are also language
dependent.
REFERENCES
Ahmed, H., Elgamal, A., Elshishiny, H., and Ibrahim, M.
(2016). Verification of backward compatibility of soft-
ware components. US Patent 9,424,025.
Alboaie, L., Alboaie, S., and Barbu, T. (2014). Extend-
ing Swarm Communication to Unify Choreography
and Long-lived Processes. In Proceedings of the 23rd
International Conference on Information Systems De-
velopment (ISD 2014).
Alboaie, L., Alboaie, S., and Panu, A. (2013). Swarm
Communication - A Messaging Pattern Proposal for
Dynamic Scalability in Cloud. In 10th IEEE Inter-
national Conference on High Performance Comput-
ing and Communications & 2013 IEEE International
Conference on Embedded and Ubiquitous Computing,
HPCC/EUC 2013, Zhangjiajie, China, November 13-
15, 2013, pages 1930–1937.
Alboaie, S., Alboaie, L., and Panu, A. (2015). Levels of
Privacy for eHealth Systems in the Cloud Era. In Pro-
ceedings of the 24th International Conference on In-
formation Systems Development, pages 243–252.
Atoum, I., Otoom, A., and Kulathuramaiyer, N. (2016). A
Comprehensive Comparative Study of Word and Sen-
tence Similarity Measures. International Journal of
Computer Applications, 135(1):10–17.
Bird, S., Klein, E., and Loper, E. (2017). Natural Language
Processing with Python - Analyzing Text with the
Natural Language Toolkit. http://www.nltk.org/book/.
Chappell, D. (2004). Enterprise Service Bus. O’Reilly Me-
dia, Inc.
Charton, E., Gagnon, M., and Ozell, B. (2011). Automatic
Semantic Web Annotation of Named Entities. Springer
Berlin Heidelberg, Berlin, Heidelberg.
Cortical.io (2017). Similarity Explorer. http://www.cortical.io/similarity-explorer.html.
Crockett, K., McLean, D., O’Shea, J. D., Bandar, Z. A.,
and Li, Y. (2006). Sentence Similarity Based
on Semantic Nets and Corpus Statistics. IEEE
Transactions on Knowledge and Data Engineering,
18(8):1138–1150.
Dao, T. N. and Simpson, T. (2005). Measuring similarity
between sentences. WordNet.Net, Tech. Rep.
DeWilde, B. (2017). textacy NLP library. http://textacy.readthedocs.io/en/latest/.
Dickey, T. E. (2017). Lynx Text Web Browser. http://lynx.browser.org/.
DKPro (2017). DKPro Similarity Framework. https://dkpro.github.io/dkpro-similarity/.
Erk, K. (2012). Vector space models of word meaning and
phrase meaning: A survey. Language and Linguistics
Compass, 6(10):635–653.
ExplosionAI (2017). spacy NLP library. https://spacy.io/.
Ferretti, E. (2015). Privacy by design and privacy by default. http://europrivacy.info/2015/06/09/privacy-design-privacy-default/.
Godden, W. (2017). PHPCompatibility. https://github.com/wimg/PHPCompatibility.
He, H., Gimpel, K., and Lin, J. (2015). Multi-perspective
sentence similarity modeling with convolutional neu-
ral networks. In Proceedings of the 2015 Conference
on Empirical Methods in Natural Language Process-
ing, pages 1576–1586.
Laclavik, M., Šeleng, M., Ciglan, M., and Hluchý, L.
(2012). Ontea: Platform for pattern based automated
semantic annotation. Computing and Informatics,
28(4):555–579.
Lando, P., Lapujade, A., Kassel, G., and Fürst, F. (2007).
Towards a general ontology of computer programs. In
Proceedings of the International Conference on Soft-
ware and Data Technologies, pages 163–170.
Laravel (2017). Manual page. https://laravel.com/docs/5.4/helpers.
Linguatools (2017). DISCO. http://www.linguatools.de/disco/disco_en.html.
Liu, H. and Wang, P. (2013). Assessing Sentence Simi-
larity Using WordNet based Word Similarity. JSW,
8(6):1451–1458.
lxml (2017). lxml XML toolkit. http://lxml.de/.
Miura, N. and Takagi, T. (2015). WSL: Sentence Similar-
ity Using Semantic Distance Between Words. In Pro-
ceedings of the 9th International Workshop on Seman-
tic Evaluation (SemEval 2015), pages 128–131, Den-
ver, Colorado. Association for Computational Lin-
guistics.
NLTK (2017). Natural Language Toolkit. http://www.nltk.org/.
Node.JS (2017). Manual page. https://nodejs.org/api/util.html.
Noy, N. F., McGuinness, D. L., et al. (2001). Ontology
development 101: A guide to creating your first ontol-
ogy.
Oberle, D., Grimm, S., and Staab, S. (2009). An ontology
for software. In Handbook on ontologies, pages 383–
402. Springer.
Pearson, S. (2009). Taking account of privacy when design-
ing cloud computing services. In Proceedings of the
2009 ICSE Workshop on Software Engineering Chal-
lenges of Cloud Computing, pages 44–52. IEEE Com-
puter Society.
PHP (2017). Manual page. http://php.net/manual/en/function.chmod.php.
Pilehvar, M. T., Jurgens, D., and Navigli, R. (2013). Align,
Disambiguate and Walk: A Unified Approach for
Measuring Semantic Similarity. In Proceedings of the
51st Annual Meeting of the Association for Computa-
tional Linguistics, pages 1341–1351. ACL.
Ponomarenko, A. and Rubanov, V. (2012). Backward com-
patibility of software interfaces: Steps towards auto-
matic verification. Programming and Computer Soft-
ware, 38(5):257–267.
Python (2017). Manual page. https://docs.python.org/3/library/os.path.html.
RDFLib (2017). RDFLib RDF library. https://rdflib.readthedocs.io/en/stable/.
Reeve, L. and Han, H. (2005). Survey of Semantic An-
notation Platforms. In Proceedings of the 2005 ACM
Symposium on Applied Computing, SAC ’05, pages
1634–1638, New York, NY, USA. ACM.
Ruby (2017). Manual page. https://ruby-doc.org/docs/ruby-doc-bundle/Manual/man-1.4/function.html.
RxNLP (2017). Text Similarity API. http://www.rxnlp.com/api-reference/text-similarity-api-reference/.
Sanborn, A. and Skryzalin, J. (2015). Deep learning for
semantic similarity.
Sánchez, D., Isern, D., and Millan, M. (2011). Content
annotation for the semantic web: an automatic web-
based approach. Knowledge and Information Systems,
27(3):393–418.
SEMILAR (2017). SEMILAR: A Semantic Similarity Toolkit. http://deeptutor2.memphis.edu/Semilar-Web/public/contact.html.
Sherwood, G. (2017). PHP CodeSniffer. http://pear.php.net/package/PHP_CodeSniffer/.
Tsantilis, E. (2009). Method and system to monitor soft-
ware interface updates and assess backward compati-
bility. US Patent 7,600,219.
Uren, V., Cimiano, P., Iria, J., Handschuh, S., Vargas-Vera,
M., Motta, E., and Ciravegna, F. (2006). Semantic An-
notation for Knowledge Management: Requirements
and a Survey of the State of the Art. Web Semant.,
4(1):14–28.
Welsch, Y. and Poetzsch-Heffter, A. (2012). Verifying
Backwards Compatibility of Object-oriented Libraries
Using Boogie. In Proceedings of the 14th Workshop
on Formal Techniques for Java-like Programs, FTfJP
’12, pages 35–41. ACM.
Würsch, M., Ghezzi, G., Hert, M., Reif, G., and Gall,
H. C. (2012). SEON: a pyramid of ontologies for
software evolution and its applications. Computing,
94(11):857–885.