Towards Synthesis of Code for Calculations Using Their Speciﬁcations

Advaita Datar, Amey Zare, Venkatesh R and Asia A

TCS Research, Tata Research Development and Design Centre (TRDDC), Pune, India

Keywords:

Program Synthesis, Programming by Examples, Formal Speciﬁcations.

Abstract:

Banking, Financial Services, and Insurance (BFSI) software are calculation intensive. In general, these cal-

culations are formally speciﬁed in spreadsheets, known as Calculation Speciﬁcation (CS) sheets. CS sheets

describe the calculation inputs and the business logic applied on these inputs to compute calculation output(s).

Additionally, an illustration of the calculation is provided with at least one valid value for each calculation in-

put. However, manual implementation of code corresponding to such CS sheets remains to be effort-intensive

and tedious. This includes writing database queries to retrieve values for calculation inputs from the enterprise

database and converting these queries and corresponding business logic to code. We propose a novel idea to

synthesize code corresponding to CS sheets that will i) automatically identify the calculation inputs ii) formu-

late a Programming By Example (PBE) speciﬁcation for each calculation input where, PBE input is the textual

description of the calculation input, PBE output is the valid value provided in the calculation’s illustration, iii)

then for each PBE speciﬁcation a) synthesize a set of possible database queries, b) manually review them and

mark the intended query, and ﬁnally iv) generate code, in desired target language, for all intended queries and

the business logic speciﬁed in CS sheets.

1 INTRODUCTION

Program synthesis is the task of automatically ﬁnd-

ing a program, in the underlying programming lan-

guage, that satisﬁes the user intent expressed in the

form of speciﬁcations (Gulwani et al., 2017). Var-

ious formats for expressing user intent include, but

are not limited to, formal logical speciﬁcations, in-

formal natural language descriptions and illustrative

input-output examples. Writing formal speciﬁcations

is a tedious and time-consuming task, and in some

cases, it may be as complex as writing the program it-

self. On the other hand, informal speciﬁcations are of-

ten criticized for being ambiguous and verbose, even

for simple programs. As an alternative, illustrative

input-output examples, in the form of Programming

By Example (PBE) speciﬁcations (Gulwani, 2016),

are used to express user intent. While PBE speciﬁ-

cations work well for string transformation examples

(Gulwani, 2011), they aren’t suitable for real-world

applications. This is because it is extremely difﬁ-

cult to specify all the requirements of such large and

complex applications using a limited number of input-

output examples. Moreover, sometimes only a subset

of the requirements of such applications need to be

implemented as the code for business rules for such

Figure 1: A sample Calculation Speciﬁcations (CS) sheet.

applications is already in place. Hence, accurately

expressing and interpreting user intent in the form of

speciﬁcations continues to be a major challenge for

program synthesis approaches.

Large enterprise applications, especially in the

Banking Financial Services and Insurance (BFSI) in-

dustry, rely heavily on calculations for day-to-day

operations. BFSI software, the largest market for

Information Technology (IT) services (Arc, 2022),

contains numerous types and instances of such cal-

culations. These calculations represent the business

Datar, A., Zare, A., R, V. and A, A.

Towards Synthesis of Code for Calculations Using Their Speciﬁcations.

DOI: 10.5220/0011940900003464

In Proceedings of the 18th International Conference on Evaluation of Novel Approaches to Software Engineering (ENASE 2023), pages 497-504

ISBN: 978-989-758-647-7; ISSN: 2184-4895

 2023 by SCITEPRESS – Science and Technology Publications, Lda. Under CC license (CC BY-NC-ND 4.0)

497

Figure 2: Overview of the proposed approach for code generation from Calculation Speciﬁcations (CS) sheet.

rules that are adhered to by BFSI organizations, and

they are developed by domain experts following rigor-

ous review. Such calculations are typically speciﬁed

in spreadsheets, known as Calculation Speciﬁcation

(CS) sheets. Figure 1 shows a CS sheet from insur-

ance domain that speciﬁes the calculation for com-

puting admin fee for a policy holder who may have

exceeded the limit for premium payments in a year.

As evident from the sample CS sheet, calculations

are formally speciﬁed in terms of calculation inputs

(cells A3-A8) and the business rules for computing

the value of calculation output Admin Fee (cell A11)

using the values of calculation inputs (cells B3-B8).

The business rules (cells B11, B14) are implemented

by domain experts in the form of mathematical and

logical formulae. For ease of understanding, an il-

lustration of the calculation is provided that contains

at least one valid value for each calculation input.

For example, the given CS sheet (Figure 1) illustrates

how the admin fee is calculated for policy ID 123456

by considering the valid values (cell B3-B8) for each

calculation input. Here, valid values are the output

of queries corresponding to the calculation input text

(A3-A11) for the policy ID 123456. As soon as val-

ues for calculation inputs (cells B3-B8) are entered in

the CS sheet, all formulae (cell B11, B14) are trig-

gered and the respective calculation output is com-

puted (cell B11).

In real-world scenarios from the BFSI domain,

manually implementing code corresponding to given

calculation speciﬁcations, such as CS sheets, is an

effort-intensive and tedious task, because:

• To trigger/execute such calculations in code, val-

ues of calculation inputs (cells B3-B8 in Figure 1)

need to be retrieved from the enterprise database.

This requires inferring database queries that incor-

porate the table relationships (primary/foreign keys

for joining tables) in the database schema. Real-

world enterprise databases contain hundreds of ta-

bles that are part of complicated database schema

and hence creating the intended query manually is

extremely difﬁcult.

• These intended queries need to be written in a pa-

rameterized way to ensure data for multiple poli-

cies/users can be retrieved on demand.

• The business rules (cell B11, B14 in Figure 1),

referred henceforth as ‘business formulae’, in the

BFSI domain are complex and may contain long

sequence of mathematical and logical expressions

that refer to multiple calculation inputs. Hence, in

real-world scenarios, it is very difﬁcult to manu-

ally translate such business formulae into the de-

sired programming language.

Incorporating all these factors while writing the

code manually is a challenge. Hence, there is a clear

need for automating code generation for such calcula-

tions as per their formal speciﬁcations i.e., CS sheets.

In this paper, we propose an idea to synthesize the

code corresponding to the calculation speciﬁcations

in CS sheets. Figure 2 shows the overview of the ap-

proach implementing our proposed idea. We believe

CS sheets are the perfect combination of text and busi-

ness formulae, and hence CS sheets offer a novel and

simple means of specifying user intent for synthesiz-

ing code for calculations. Unlike ambiguous natural

language speciﬁcations, the contents of CS sheets are

precise and machine-interpretable. Additionally, the

illustration of the underlying calculations, provided

along-with CS sheets, contains a valid value for each

calculation input. These valid values play a key role

in our proposed approach that is described as follows.

As a ﬁrst step of our approach, we plan to parse

the input CS sheet to automatically identify all cells

ENASE 2023 - 18th International Conference on Evaluation of Novel Approaches to Software Engineering

498

containing calculation inputs and business formulae

(more details in Section 2.1). Next, we plan to for-

mulate a custom PBE speciﬁcation for each calcula-

tion input. The PBE input will contain all available

textual information corresponding to the calculation

input and the PBE output will be its valid value men-

tioned in the illustration. We refer to PBE speciﬁca-

tion as PBE spec henceforth.

As a next step, the PBE spec will be used to syn-

thesize database queries corresponding to each calcu-

lation input. Firstly, we will guess the query structure

and database entities needed in the various clauses

(SELECT, WHERE, etc.) of the intended query. This

will be done by applying Natural Language Process-

ing (NLP) techniques on the formulated PBE input.

Next, the guessed query structure and entities will be

traversed, as per the grammar and algorithm shown in

Section 2.3.3, to enumerate all possible queries and

these queries will then be executed on the underly-

ing database. The queries that return same value as

PBE output will then be presented for the end-user

review for marking the intended query. Once the in-

tended queries for all calculation inputs are synthe-

sized, these queries, along-with the speciﬁed busi-

ness formulae, will be translated to the programming

language desired by the end-user. Initially, we plan

to generate the ﬁnal code in Python using available

open-source code translation tools (Richter, 2022;

C.W., 2022). However, our approach can be easily

extended to generate code in other programming lan-

guages through corresponding state-of-the-art code

translation techniques.

The key contributions of our paper are:

• A user-friendly format for specifying calculations

i.e. CS sheets.

• An end-to-end approach for synthesis of code cor-

responding to CS sheets.

• A novel PBE-based query synthesis technique.

2 PROPOSED APPROACH

Figure 2 shows the overview of our proposed ap-

proach. The key steps of the same are explained as

follows, using the example CS sheet in Figure 1.

2.1 Identiﬁcation of Calculation Inputs

and Business Formulae

As a ﬁrst step, we identify all calculation inputs and

business formulae by parsing the input CS sheet. For

this, the contents of all non-empty cells will be read

to identify referred cells. These are the cells that are

Figure 3: Cell References Graph (CRG) for the example CS

sheet shown in Figure 1.

used or referred in at least one other cell in the CS

sheet. For example, in Figure 1 cell B14 has a for-

mula, “=MAX(B3,(B4-B5))”, and hence B3, B4 and

B5 will be identiﬁed as referred cells for B14. The re-

lationship between a cell and it’s referred cell(s) will

be maintained in the form of a directed graph, termed

as Cell References Graph (CRG). Figure 3 shows the

CRG for the example CS sheet (Figure 1). In CRG,

each cell with references to other cells (e.g. B14 in

Figure 1) will be tagged as ‘destination node’ and an

edge will be drawn from each of it’s referred cells

(B3, B4, B5 in Figure 1) by tagging them as ‘source

nodes’. Once all non-empty cells are parsed, the fol-

lowing three artifacts will be identiﬁed though CRG:

• Cells that are never tagged as a ‘destination node’

but are tagged as a ‘source node’ at least once will

be marked as input cells. (Example: cells B3-B8 in

Figure 1)

• Each cell immediately to the left of each input cell

will be tagged as calculation input. (Example: cells

A3-A8 in Figure 1)

• All non-empty cells, apart from input cells, that are

part of CRG will be marked as business formulae

cells. (Example: cells B11-B14 in Figure 1)

2.2 Formulation of Programming by

Example (PBE) Speciﬁcation

As a next step, a PBE spec will be formulated for each

identiﬁed calculation input. We formally deﬁne PBE

spec as:

Input Text [Supporting Information]

{Domain Keywords} = Value, where

• Input Text contains the name of the calculation in-

put. In the example CS sheet (Figure 1), contents of

calculation input cells A3 to A8 i.e. Policy value,

Total premium in current year etc. will be consid-

ered as Input Text for respective PBE spec.

• [Supporting Information] captures additional in-

formation, apart from the Input Text, that can be

Towards Synthesis of Code for Calculations Using Their Speciﬁcations

499

used in synthesizing queries for calculation inputs.

In the BFSI domain, this typically includes infor-

mation that is unique to each illustration, such as

policy ID, customer ID, etc. In the example CS

sheet (Figure 1), the illustrated values for all cal-

culation inputs are taken from the policy 123456.

Hence, Policy ID = 123456 will be considered as

Supporting Information in the PBE spec of each

calculation input.

• {Domain Keywords} is an optional ﬁeld of PBE

spec that contains all domain-speciﬁc information

that denotes the context and/or the calculation type

in which the calculation input is being used. These

keywords are included in order to assist query syn-

thesis by narrowing down the search for relevant

database entities that are needed in query clauses.

The example CS sheet (Figure 1) belongs to insur-

ance domain and the calculation name, i.e. Admin

Fee, will be considered as Domain Keywords in the

PBE spec of each calculation input.

• Value contains the valid value(s) of the calculation

input that is(are) provided in the illustration. It is

determined by extracting the contents inside input

cells that are identiﬁed in Section 2.1. In the ex-

ample CS sheet (Figure 1), the input cells are B3

to B8 and hence, their contents i.e. 99000, 90000

etc. will be considered as Value for PBE spec for-

mulation. Please note that we use valid value i.e.

Value and PBE output interchangeably throughout

the paper.

Using the formal deﬁnition described above, the

PBE spec for each calculation input in the example

CS sheet (Figure 1) will be as follows.

1. Policy value [Policy ID = 123456] {Admin Fee} =

99000

2. Total premium in current year [Policy ID =

123456] {Admin Fee} = 90000

...................

Formulation of such a PBE spec from calculation

speciﬁcations (CS sheet) and using it to synthesize

queries is one of the novel contributions of our ap-

proach. To the best of our knowledge, none of the ex-

isting techniques create such a speciﬁcation for syn-

thesis and/or make use of valid values i.e. PBE output

to synthesize the query.

2.3 Query Synthesis Based on PBE Spec

Once the PBE spec is formulated, a query synthe-

sis apparatus will be used to synthesize the desired

database query. The approach for the same is de-

scribed as follows.

Figure 4: Phrases and relationships for ‘Total premium in

current year’ from the CS sheet in Figure 1.

2.3.1 Generation of Query Structure

The structure of the desired query can be guessed

by parsing the Input Text using standard natural lan-

guage parser. Figure 4 shows the parsed phrases

and their relationships, identiﬁed through CoreNLP

parser(Manning et al., 2014), for Total premium in

current year. The query structure for this calculation

input will be constructed by a rule-based apparatus

on the basis of identiﬁed phrases and their relation-

ships. This will be accomplished by mapping related

phrases to respective query clauses using natural lan-

guage semantics, as explained below.

In Total premium in current year, the subject noun

is premium and hence it is considered in SELECT

clause. As per natural language semantics, the sub-

ject noun’s adjectival modiﬁer is Total which indi-

cates addition or cumulative sum, and hence the SE-

LECT clause of the desired query should be updated

to include aggregate function i.e. SUM(premium).

Next, the main noun’s nominal modiﬁer current year

(linked via preposition in) will be considered in

WHERE clause as it indicates the scope or condi-

tion of the desired query. As per natural language

semantics, the phrase current year is a named-entity

and intuitively indicates the ongoing year i.e. 2023,

and hence the WHERE clause should contain year =

2023. However, date-related information is generally

mentioned in columns with date data type, and hence

the phrase current year can also relate to a date on

or between 1-Jan-2023 and 31-Dec-2023. Therefore,

the WHERE clause will also contain 1-Jan-2023 <=

date <= 31-Dec-2023, leading to two query struc-

tures. Lastly, the Supporting Information in PBE spec

i.e., Policy ID = 123456 becomes part of the WHERE

clause by default as the given illustration contains val-

ues from policy ID 123456. Hence, based on the

aforementioned information for the calculation input

Total premium in current year, the identiﬁed query

structures are shown in Table 1.

2.3.2 Identiﬁcation of Database Entities Needed

in the Query

Once the query structure is ready, the database en-

tities i.e., table and column names corresponding to

each query clause need to be identiﬁed. As shown

in Table 1, every query clause will contain either text

ENASE 2023 - 18th International Conference on Evaluation of Novel Approaches to Software Engineering

500

Figure 5: Database details corresponding to the example CS sheet in Figure 1.

Table 1: Query structures for ‘Total premium in current

year’ from the CS sheet in Figure 1.

Stru- Clause LHS Oper- RHS

cture Aggr- Variable ator Aggr- Variable

ID egate egate

SELECT SUM premium

WHERE 1 Policy ID = 123456

WHERE 2 Date >= 1-Jan-2023

WHERE 3 Date <= 31-Dec-2023

SELECT SUM premium

WHERE 1 Policy ID = 123456

WHERE 2 year = 2023

or a text-operator-value triplet. To infer the intended

column(s) for the query clause text, we will compare

it with all table and column names, and their descrip-

tion provided in the database schema. The identiﬁed

columns will then be assigned a text matching score

by using standard text matching utilities (Foundation,

2022) to quantify how similar these columns are to

the query clause text. In case a query clause also con-

tains a value and operator, a clause predicate will be

created.

For example, the clause predicate for clause

WHERE 2 belonging to query structure ID 2 in Table

1 will be ‘year = 2023’. Columns that contain at least

one value mentioned in such a clause predicate will

be considered as candidate columns for the respective

query clause. All such candidate columns will then be

given a representative value matching score, and later

they will be sorted based on their combined value

and text matching scores to get a ranked list of can-

didate columns. Table 2 shows the ranked list of can-

didate columns, as per database schema in Figure 5,

for the PBE spec shown in Table 1. We have observed

that the accuracy of identifying database columns im-

proves if value matching is done along with conven-

tional text matching.

Apart from the Input Text, the Domain Keywords

Table 2: Ranked columns for the query structures of ‘Total

premium in current year’ from the CS sheet in Figure 1.

Clause Rank Table Column Value Text

Score Score

SUM 1 POLICY PREM AMT 0 20

(premium) 2 PREM RECEIPT PREM 0 15

PAID DT

... ... ... ... ...

Policy ID 1 POLICY POLICY ID 25 25

= 123456 DETAILS

2 POLICY POLICY ID 25 20

PREM RECEIPT

... ... ... ... ...

included in PBE spec also play a vital role in

the identiﬁcation of database entities for query

clauses. We explain this through the example of

the calculation input ‘Max Rate of Policy Value’

(cell A8 in Figure 1) for the database schema

in Figure 5. The SELECT clause for this cal-

culation input, as per the described query struc-

ture generation approach (refer Step 2.3.1), will

be ‘MAX(Rate)’. As a result, the columns ‘FI-

NAL RATE’ from PRODUCT DETAILS, ‘RATE’

from PROD ADMIN FEE, and ‘MAX RATE’ from

PROD SERVICE CHARGE DIM will get signiﬁ-

cantly higher text matching score due to the sub-string

‘rate’ in their name. Assuming that all three columns

contain the value ’0.02’ or ‘2%’, the column ‘RATE’

from PROD ADMIN FEE will be ranked highest as

it’s table name is very similar to the provided Do-

main Keywords ‘Admin Fee’.

2.3.3 Query Enumeration and Execution

To generate possible queries, all identiﬁed informa-

tion i.e., ranked columns, query clauses and query

structures, will be traversed as per the algorithm de-

scribed in Algorithm 1. Firstly, high-scoring tables

will be identiﬁed by considering the sum of text and

Towards Synthesis of Code for Calculations Using Their Speciﬁcations

501

Algorithm 1: Generation of Candidate Queries.

Inputs: Database (DB), list of query structures

(Structs), ordered list of ranked columns (RC),

expected query output (Value).

Output: Ordered list of candidate queries

(CandQueries).

1: procedure GENCANDQUERIES(RC,Value)

2: CandQueries ← [..]

3: RT ← getRankedTables(RC)

4: for s ∈ Structs do

5: for i ← 1..MaxJoins do

6: // MaxJoins will be conﬁgurable

7: SelTabs ← P(RT [0..i − 1])

8: // P(A) indicates power set of set A

9: for {t} ∈ SelTabs do

10: q ← queryInSyntax(s, {t}, RC)

11: result ← executeQuery(q, DB)

12: if result == Value then

13: Add q to CandQueries

14: end if

15: end for

16: end for

17: end for

18: return CandQueries

19: end procedure

20: procedure GETRANKEDTABLES(RankedCols)

21: TabMap ← {..}

22: for c ∈ RankedCols do

23: tab ← table o f c

24: score ← text + value matching score o f c

25: TabMap[tab] ← TabMap[tab] + score

26: end for

27: sorted ← sort(TabMap, TabMap.values())

28: return sorted.keys()

29: end procedure

value matching scores (computed in Step 2.3.2) of

ranked columns from each table. Next, queries will

be enumerated as per the grammar shown in Figure

6. As evident from this grammar, we plan to focus

on the key constructs without loss of generality, and

hence we restrict our query synthesis approach to con-

sider only a subset of query language constructs. To

form a query through enumeration, ranked columns

will be picked, as per the order of aforementioned

high-scoring tables, and inserted in identiﬁed query

clauses as per the desired query language syntax. For

explanation purpose we have used SQL as the desired

query language (Figure 6), but our idea can easily be

extended to work for other query languages by simply

using corresponding grammar for query enumeration.

Once an enumerated query is ready, it will be ex-

ecuted on the input database and the value returned

Figure 6: Grammar used for enumeration of queries.

by the query will be compared with the PBE output

identiﬁed in Step 2.2. If the values match, the query

will be marked as a candidate query and added to

an ordered list of candidate queries CandQueries (as

shown in Algorithm 1). The order in which the can-

didate queries are added in CandQueries represents

their rank. Once all enumerated queries are executed

on the database, CandQueries will be presented to the

end-user for review and verdict of intended query.

2.3.4 Query Review by the End-User

We aim to provide only a few queries for end-user

review, hence we execute all enumerated queries on

the database to get a reduced set of candidate queries.

In case all enumerated queries are presented to the

end-user, then he/she may prefer spending time man-

ually writing the intended query, instead of reviewing

a large number of queries that may not even result in

the expected query output. Hence, only top N candi-

date queries will be given to the end-user to mark the

intended query. The value of N will be conﬁgurable

and will get ﬁne-tuned after the end-user reviews.

There is a possibility that none of the candidate

queries is the end-user’s intended query, or our ap-

proach doesn’t generate even a single candidate query.

In such scenarios, we plan to ask the end-user to cre-

ate the intended query manually and use the query for

online learning (Hoi et al., 2021) to improve our ap-

proach to enable generation of queries for such sce-

narios in the future.

2.4 Translation to Target Programming

Language

Once the database queries are synthesized and ﬁnal-

ized for all calculation inputs in the CS sheet, the code

ENASE 2023 - 18th International Conference on Evaluation of Novel Approaches to Software Engineering

502

corresponding to these queries and the business for-

mulae (identiﬁed in Section 2.1) will be generated in

the programming language desired by the end-user.

In the initial prototype of our approach, we gener-

ate the ﬁnal code in Python using well-tested openly

available techniques (Richter, 2022; C.W., 2022).

These techniques automatically identify the partici-

pating formulae in the input CS sheet and convert

them into Python. Using these techniques, the ﬁnal

code for the input CS sheet in Figure 1, as per the

CRG in Figure 3, will be as shown below:

import os, sys

# Function synthesized for Total premium in

current year

def getTotalPremium(policyID):

cursor = database.cursor()

cursor.execute("SELECT SUM(PREM_AMT)

FROM POLICY_PREM_RECEIPT WHERE ...")

result = cursor.fetchall()

return result

# Functions for V, W, B, T, R will be

synthesized similar to getTotalPremium

# Function synthesized for Admin Fee

def getAdminFee(policyID):

V = getPolicyValue(policyID)

P = getTotalPremium(policyID)

W = getTotalWithdrawals(policyID)

AFThresh = MAX(V,(P-W))

B = getAdminFeeBase(policyID)

T = getWaiverThreshold(policyID)

R = getMaxRatePolicyValue(policyID)

AF = B if AFThresh < T else V * R

return AF

The constructed CRG (Figure 3) plays a vital role

in generation of ﬁnal code. The example CS sheet

(Figure 1) contains two business formulae in cells

B11, B14, but as evident in the CRG, the formula in

cell B14 needs to be executed before the formula in

cell B11. This sequential dependency in business for-

mulae is maintained by the directed edge from cell

B14 to B11 in CRG, and hence the code for the for-

mula in cell B14 appears before the same for cell B11.

3 RELATED WORK

The key challenge for program synthesis is the di-

versity of user intent and the inability to express the

intent precisely. Natural language descriptions (Li

and Jagadish, 2014; Desai et al., 2016) are the most

common way of stating user intent, but they are am-

biguous and result in incorrect programs. Specifying

user intent through input-output examples (Gulwani,

2016) works for a certain class of string manipulation

programs, but it hasn’t been used effectively in query

synthesis paradigm. State-of-the-art query synthesis

techniques either require the user to provide partial

queries (Bastani et al., 2019) or create large input-

output tables (Wang et al., 2017; Takenouchi et al.,

2020). Creating such complicated artifacts is as com-

plex and effort-intensive as writing the query manu-

ally. To address these drawbacks, we have proposed

a novel concise and precise way of specifying user

intent in the form of Calculation Speciﬁcation (CS)

sheets. A CS sheet represents the mathematical and

logical relationship between the inputs and outputs

of a program. Furthermore, the end-users, who are

not programming experts, may ﬁnd providing valid

value(s) of variables participating in the calculations

to be more approachable and natural as compared to

creating complicated input-output tables and partial

queries.

Another challenge for program synthesis is enu-

merating and searching programs in a large and

intractable program space. In query synthesis

paradigm, existing techniques (Desai et al., 2016;

Yaghmazadeh et al., 2017) haven’t overcome this

challenge of program space explosion. A few tech-

niques (Li and Jagadish, 2014; Wang et al., 2017;

Takenouchi et al., 2020) indeed reduce the search

space but require the end-user to provide additional

hints like the selection of aggregate function, con-

stants, etc., which can not be provided by the end-

users who typically don’t have technical knowledge

of query language and syntax. On the contrary, in our

proposed approach we plan to reduce the search space

by ﬁltering the enumerated queries based on expected

query output and other heuristics described in detail

in Section 2.3. Hence, the end-user doesn’t need to

give technical guidance, like the selection of aggre-

gate functions, constants, conditions, etc., for reduc-

ing search space for queries.

4 CONCLUSION AND FUTURE

WORK

In this paper we have presented a novel idea to synthe-

size code corresponding to CS sheets. As described

in the paper, business rules can be easily expressed

in CS sheets through a combination of text and for-

mulae. This makes CS sheet a precise and machine-

interpretable way to specify calculations. We have

also proposed a novel query synthesis approach that

ﬁrstly formulates a custom PBE spec for each input

in CS sheet, and later uses text and value inference to

synthesize database query corresponding to each PBE

spec. Lastly, the synthesized database queries and the

business formulae speciﬁed in the CS sheet are trans-

Towards Synthesis of Code for Calculations Using Their Speciﬁcations

503

lated into the desired programming language.

Going ahead, we plan to improve our approach

by incorporating machine learning to ensure that the

user-intended query is inferred with a high rank. Our

goal is to develop a technique that is self-adaptive and

capable of continuously improving the underlying

steps, such as identifying query structures, database

entities, query enumeration etc., based on feedback

from the end-user.

REFERENCES

Arc, I. (2022). It bfsi market overview.

https://www.industryarc.com/Research/

It-Bfsi-Market-Research-500664. Accessed:

2023-01-26.

Bastani, O., Zhang, X., and Solar-Lezama, A. (2019). Syn-

thesizing queries via interactive sketching. arXiv

preprint arXiv:1912.12659.

C.W. (2022). pyexcel - lets you focus on data, instead of ﬁle

formats. https://docs.pyexcel.org/en/latest/. Accessed:

2023-01-26.

Desai, A., Gulwani, S., Hingorani, V., Jain, N., Karkare,

A., Marron, M., and Roy, S. (2016). Program syn-

thesis using natural language. In Proceedings of the

38th International Conference on Software Engineer-

ing, pages 345–356.

Foundation, P. S. (2022). difﬂib - helpers for computing

deltas. https://docs.python.org/3/library/difﬂib.html.

Accessed: 2023-01-26.

Gulwani, S. (2011). Automating string processing in

spreadsheets using input-output examples. ACM Sig-

plan Notices, 46(1):317–330.

Gulwani, S. (2016). Programming by examples. Depend-

able Software Systems Engineering, 45(137):3–15.

Gulwani, S., Polozov, O., Singh, R., et al. (2017). Program

synthesis. Foundations and Trends® in Programming

Languages, 4(1-2):1–119.

Hoi, S. C., Sahoo, D., Lu, J., and Zhao, P. (2021). Online

learning: A comprehensive survey. Neurocomputing,

459:249–289.

Li, F. and Jagadish, H. V. (2014). Nalir: an interac-

tive natural language interface for querying relational

databases. In Proceedings of the 2014 ACM SIG-

MOD international conference on Management of

data, pages 709–712.

Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J. R.,

Bethard, S., and McClosky, D. (2014). The stanford

corenlp natural language processing toolkit. In Pro-

ceedings of 52nd annual meeting of the association

for computational linguistics: system demonstrations,

pages 55–60.

Richter, S. (2022). Xlcalculator: Ms excel formulas to

python. https://github.com/bradbase/xlcalculator. Ac-

cessed: 2023-01-26.

Takenouchi, K., Ishio, T., Okada, J., and Sakata, Y. (2020).

Patsql: efﬁcient synthesis of sql queries from exam-

ple tables with quick inference of projected columns.

arXiv preprint arXiv:2010.05807.

Wang, C., Cheung, A., and Bodik, R. (2017). Synthesizing

highly expressive sql queries from input-output exam-

ples. In Proceedings of the 38th ACM SIGPLAN Con-

ference on Programming Language Design and Im-

plementation, pages 452–466.

Yaghmazadeh, N., Wang, Y., Dillig, I., and Dillig, T. (2017).

Sqlizer: query synthesis from natural language. Pro-

ceedings of the ACM on Programming Languages,

1(OOPSLA):1–26.

ENASE 2023 - 18th International Conference on Evaluation of Novel Approaches to Software Engineering

504