Neural Sequence Modeling in Physical Language Understanding
Avi Bleiweiss
BShalem Research, Sunnyvale, U.S.A.
Keywords: Kinematics, Recurrent Neural Networks, Long Short-term Memory, Sequence Model, Attention.
Abstract:
Automating the tasks of generating test questions and analyzing content for assessment of written student responses has been one of the more sought-after applications to support classroom educators. However, a major impediment to algorithm advances in developing such tools is the lack of large and publicly available domain corpora. In this paper, we explore deep learning of physics word problems performed at scale using the transformer, a state-of-the-art self-attention neural architecture. Our study proposes an intuitive novel approach to tree-based data generation that relies mainly on physical knowledge structure and defers compositionality of natural language clauses to the terminal nodes. Applying our method to the simpler kinematics domain, which describes motion properties of an object at a uniform acceleration rate, and using our neural sequence model pretrained on a dataset of ten thousand machine-produced problems, we achieved BLEU scores of 0.54 and 0.81 for predicting derivation expressions on real-world and synthetic test sets, respectively. Notably, increasing the number of trained problems resulted in diminishing returns on performance.
1 INTRODUCTION
Science language understanding is a form of text comprehension and commonly involves an expression-rich natural language that reduces to concise computational semantics and grammar. Typically, a human learner interprets the text describing a problem, follows up by deriving a compact number-sentence representation, and only then performs a calculation to answer the problem query. Vilenius-Tuohimaa et al. (2008) showed that performance on math word problems at the primary school level strongly relates to the child's reading comprehension skills.
Machine solving of science problems remains a long-standing challenge to natural language processing (NLP) practitioners. The pilot STUDENT system (Bobrow, 1964), which dates back over fifty years, found a solution to a large class of algebra problems, but had no machine learning component. Only in recent years has automatically solving math word problems attracted considerable attention in the research community. The models proposed use both feature-engineered machine learning (Hosseini et al., 2014; Kushman et al., 2014; Roy and Roth, 2015; Shi et al., 2015; Zhou et al., 2015; Huang et al., 2016; Huang et al., 2017; Matsuzaki et al., 2017) and deep learning (Ling et al., 2017; Wang et al., 2017; Huang et al., 2018; Wong, 2018; Wang et al., 2018).
Table 1: Key kinematic variables with their corresponding notation symbols and SI units of measurement (m for meters and s for seconds).

Variable               Symbol     Units
Time                   t          s
Acceleration           a or g     m/s/s
Initial Displacement   x0 or y0   m
Final Displacement     x or y     m
Initial Velocity       v0         m/s
Final Velocity         v          m/s
In practice, the broad output space of math solvers is often constrained; our study, for instance, maps a problem to a single equation, either linear or non-linear, in one unknown variable.
Despite the extensive interest in physical language understanding among artificial intelligence (AI) scientists, sustained research on tools purposed to read physical word problems and output algebraic expressions has been fairly sparse. The earliest known machine-comprehension based work, by Suppes et al. (1998), uses an equational language of well-defined grammar and semantics to formulate one-dimensional kinematic problems. Their model associates terms occurring in the natural language statement of the problem with an equation, but notes that learning a fixed list of words and phrases that refer to qualitative semantics of physical concepts is more effective when drawn directly from the structure of natural language.
Recently, Leszczynski and Moreira (2016) introduced a machine solver of word problems that describe a free-falling object under standard gravity in two-dimensional space.
Table 2: Derivation templates with a {4,5,5} split across the missing variables {x, v, t}, respectively.

Missing x (v = v0 + a t):
  v = v0 + a t
  v0 = v - a t
  a = (v - v0)/t
  t = (v - v0)/a

Missing v (x = x0 + v0 t + 0.5 a t^2):
  x = x0 + v0 t + 0.5 a t^2
  x0 = x - v0 t - 0.5 a t^2
  v0 = (x - x0 - 0.5 a t^2)/t
  a = (x - x0 - v0 t)/(0.5 t^2)
  t = (sqrt(v0^2 + 2 a (x - x0)) - v0)/a

Missing t (v^2 = v0^2 + 2 a (x - x0)):
  v = sqrt(v0^2 + 2 a (x - x0))
  v0 = sqrt(v^2 - 2 a (x - x0))
  a = (v^2 - v0^2)/(2 (x - x0))
  x = (v^2 - v0^2 + 2 a x0)/(2 a)
  x0 = x - (v^2 - v0^2)/(2 a)
Their model consists of two long short-term memory (LSTM) networks (Hochreiter and Schmidhuber, 1997) that are tasked with extracting from the word problem the key physical parameters and the type of question asked, which are then fed to a numerical integrator for calculating the answer. Given that the natural language question is sampled from a fixed grammar, their relatively small train set might limit coverage of the query space.
On the other hand, sequence-to-sequence neural networks (Sutskever et al., 2014; Cho et al., 2014) have enjoyed great success in a variety of NLP tasks such as machine translation, speech recognition, and text summarization. Motivated by the results of Wang et al. (2017) for solving math word problems, our work uses an encoder-decoder architecture to translate natural language queries into kinematic equation derivations in plain text, which subsequently feed a neural expression encoder rather than a discrete evaluator. Moreover, instead of word embeddings, we use character embeddings (Pennington et al., 2014) end-to-end to achieve a compact vocabulary that effectively expresses abbreviated units of measurement, math operator tokens, and positive and negative decimal numbers with variable digits assigned to symbolic variables.
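To make the character-level vocabulary concrete, the sketch below (not the paper's code; the helper names are ours) builds a character inventory over a query and a derivation string, so that units, operator tokens, and signed decimal digits all reduce to a few dozen symbols.

```python
# Minimal sketch (not the paper's code): build a character-level vocabulary
# over problem queries and derivation strings, so that abbreviated units,
# math operators, and signed decimal numbers reduce to a few dozen tokens.
def build_char_vocab(texts, specials=("<pad>", "<sos>", "<eos>")):
    chars = sorted({ch for text in texts for ch in text})
    itos = list(specials) + chars                 # index -> token
    stoi = {ch: i for i, ch in enumerate(itos)}   # token -> index
    return stoi, itos

def encode(text, stoi):
    return [stoi["<sos>"]] + [stoi[ch] for ch in text] + [stoi["<eos>"]]

query = "a car has a final velocity of 7830 m/s . what is the initial displacement ?"
derivation = "x0 = (5530^2 - 7830^2 - 2*5297*7892)/(2*5297)"
stoi, itos = build_char_vocab([query, derivation])
print(len(itos), encode(derivation, stoi)[:8])
```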
The best performing sequence models connect the encoder and decoder through an attention mechanism. Attention allows the model, at each timestep, to capture dependencies between variable-length input and output sequences irrespective of their positions. The standard and most widespread attention technique is additive, used in conjunction with a recurrent neural network (RNN) (Bahdanau et al., 2015). However, recurrent models maintain a hidden state over the entire past, and the inherently sequential nature of their computation limits scalability. Attempting to address this shortcoming, ConvS2S (Elbayad et al., 2018) uses convolutional neural networks (CNN) as a basic building block to compute hidden representations in parallel for all input and output positions. Aimed toward very long sequences, the transformer (Vaswani et al., 2017) uses a network with no recurrence and relies entirely on a self-attention mechanism to draw global dependencies between input and output sequences. Attention is finely distributed in independent layers for each of the encoder and decoder stack modules. Outperforming ConvS2S, the transformer considerably improves language translation quality and significantly reduces computation runtime by exploiting concurrency. In our work, we chose the additive attention form as a baseline to contrast performance with the self-attended transformer while iterating over our kinematics corpus size.
Our main contributions are (1) a tool that constructs kinematic word problems from a high level of abstraction and emits summarized queries that are mostly stripped of reasoning, letting us focus on the task of translation, and (2) a neural expression evaluator that computes rough problem results, along with quantifiable results on how scalable the neural sequence models we chose are as we increase the train dataset, using both synthetic and real-world test sets. We hypothesize that dissimilarities between word problems originate primarily from the randomized numerical values assigned to kinematic variables, and hence expect translation performance to tail off as the number of problems exceeds a reasonable threshold. Our first dataset version has 50,000 contrived problems to train and evaluate our sequence models (https://github.com/bshalem/kwp).
The rest of this paper is structured as follows. In Section 2, we overview our two-step process for constructing machine-synthesized kinematic word problems. In Section 3, we introduce the architecture of our neural model, which involves an attention-based encoder-decoder for natural-to-mathematical language translation, followed by a character-level expression encoder that calculates a rough problem solution. Section 4 highlights the field definitions of a word-problem object, on which our synthesized corpus is built and which we contrast with baseline real-world test data, and provides initial data analyses and details of our training methodology. In Section 5, we report extensive quantitative results from our scalability studies. Discussion, a summary, and identified avenues for prospective work are provided in Section 6.
Table 3: Field definitions of a kinematics word-problem object with corresponding sample values.

Field        Value
problem id   5
missing      t
equation     v^2 = v0^2 + 2 a (x - x0)
asked        x0
known        v, v0, a, x
values       7830 m/s, 5530 m/s, 5297 m/s/s, 7892 m
derivation   x0 = (5530^2 - 7830^2 - 2*5297*7892)/(2*5297)
solution     10792.51 m
question     a car has a final velocity of 7830 m/s, initial velocity of 5530 m/s, acceleration of 5297 m/s/s, final displacement of 7892 m . what is the initial displacement ?
2 DATA
There are three major equations used to describe the motion of an object under constant acceleration. They determine the following mathematical relationships between the formal kinematic parameters and involve addition, subtraction, multiplication, and square operators:

  v = v0 + a t                     (x)
  x = x0 + v0 t + 0.5 a t^2        (v)
  v^2 = v0^2 + 2 a (x - x0),       (t)

where the variable symbols are linked to their names and quantity units as shown in Table 1. Our word-problem grammar uses the tokens identified with two out of the seven base units outlined in the modern International System of Units (SI) for physical measurements, namely m for displacement in meters and s for time in seconds. The motion equations have the attractive quality that each is identified by a distinct missing variable, listed as a tag to the right of the formulas.
Figure 1: Visualization of our tree-based process for the first stage of machine-generating kinematic word problems: a missing variable leads to an equation, then to an asked variable and the known variables, and finally to a derivation template and a list of values.
Our dataset comprises a list of problem objects, each in a data structure we synthesize following a two-stage task. In the first step, a tree-based process progresses through several levels, as illustrated graphically in Figure 1. At the root node, we randomly select a missing variable {x, v, t}, which leads to the equation node, where the scope of problem generation is narrowed to one of the three kinematic formulations. Each equation defines a closed set of n kinematic variables, from which we randomly choose an asked variable and let the remaining (n - 1) variables become the known variables. At the derivation leaf node, we then choose one of fourteen template classes (Table 2), each representing an assignment statement that has the asked variable on its left-hand side (lhs), while the right-hand side (rhs) is an expression composed of the known variables linked together by arithmetic operation symbols. At the sibling leaf node, we build an (n - 1)-element list of randomly chosen numeric values, ordered by the sequence of known physical parameters.
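As a concrete illustration, the following sketch walks the same tree for the first generation stage. It is not the released generator: the TEMPLATES registry is a hypothetical stand-in that mirrors Table 2 (only the x-equation templates are spelled out, written with explicit '*' so the derivation rhs stays directly evaluable), and value sampling follows the 1 to 10,000 range described in Section 4.

```python
import random

# Hypothetical template registry mirroring Table 2: each missing variable maps
# to the variables of its governing equation and one derivation template per
# asked variable (only the x-equation shown in full; '*' keeps the rhs evaluable).
TEMPLATES = {
    "x": {"vars": ["v", "v0", "a", "t"],
          "v":  "v = {v0} + {a}*{t}",
          "v0": "v0 = {v} - {a}*{t}",
          "a":  "a = ({v} - {v0})/{t}",
          "t":  "t = ({v} - {v0})/{a}"},
    # "v": {...five templates...}, "t": {...five templates...}
}

def generate_stage1(rng=random):
    missing = rng.choice(list(TEMPLATES))                  # root: missing variable
    spec = TEMPLATES[missing]                              # equation node
    asked = rng.choice([k for k in spec if k != "vars"])   # asked variable
    known = [v for v in spec["vars"] if v != asked]        # remaining (n - 1) knowns
    values = {v: rng.randint(1, 10000) for v in known}     # random number list
    derivation = spec[asked].format(**values)              # value-substituted template
    return {"missing": missing, "asked": asked, "known": known,
            "values": values, "derivation": derivation}

print(generate_stage1())
```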
Figure 2: Architecture overview of our neural model: a sequence-to-sequence or transformer translation stage feeds an expression evaluator. The model input is the natural language query and the output is a numeric solution computed off an intermediate arithmetic statement.
In the second stage, we use both the template and the number list produced in the former phase to generate the natural language query of the kinematics problem. Our easily interpreted query comprises a brief preamble stating the moving object, a list of short-text phrases, each expanding on a known physical quantity with its value and SI unit, and concludes with a 'what is asked for' question clause, as illustrated in Table 3.
Figure 3: End-to-end architecture of our recurrent neural model: character embeddings representing the natural-language problem question feed the bottom of our query encoder, which connects to the derivation decoder via an additive attention mechanism. The derivation decoder outputs an assignment statement comprised of the asked symbolic variable, math operator tokens, and decimal numbers. The statement rhs then enters a character-based expression encoder that calculates a rough numerical solution to the problem. Model components are built of unidirectional LSTM units.
All tokens of the query are uniformly lowercased. Additionally, we compute a gold problem solution, used as a reference label in training, by providing a value-substituted template rhs to a string-based expression evaluator (R Core Team, 2013).
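A minimal sketch of the second stage follows. The phrase wording and the unit lookup are illustrative, Python's eval stands in for the R string-based expression evaluator, and the sample problem object is hand-built from the x-equation family rather than produced by the generator.

```python
# Illustrative second-stage rendering: expand each known quantity into a short
# phrase, close with the question clause, and evaluate the derivation rhs for
# the gold solution (Python's eval stands in for the R string evaluator).
UNITS = {"t": "s", "a": "m/s/s", "x": "m", "x0": "m", "v": "m/s", "v0": "m/s"}
NAMES = {"t": "time", "a": "acceleration", "x": "final displacement",
         "x0": "initial displacement", "v": "final velocity", "v0": "initial velocity"}

def render_query(problem, subject="a car"):
    phrases = [f"{NAMES[k]} of {problem['values'][k]} {UNITS[k]}"
               for k in problem["known"]]
    return (f"{subject} has a " + ", ".join(phrases) +
            f" . what is the {NAMES[problem['asked']]} ?").lower()

def gold_solution(problem):
    rhs = problem["derivation"].split("=", 1)[1]   # rhs of the assignment statement
    return eval(rhs.replace("^", "**"))            # gold reference label

problem = {"asked": "a", "known": ["v", "v0", "t"],
           "values": {"v": 950, "v0": 120, "t": 83},
           "derivation": "a = (950 - 120)/83"}
print(render_query(problem))
print(round(gold_solution(problem), 2), UNITS[problem["asked"]])
```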
3 MODEL
Our proposed model for evaluating physical language understanding is shown in Figure 2. The model comprises a pair of neural network components: a translation frontend that feeds forward into an arithmetic expression evaluator. In our experiments, the translation module is configured with either additive attention in the context of a recurrent LSTM-based sequence-to-sequence network, or the self-attended transformer architecture (Vaswani et al., 2017).
Our end-to-end LSTM-based neural model is outlined in Figure 3. In succession, the model consists of a query encoder, a derivation decoder, and an expression encoder. The first two stages make up the standard sequence-to-sequence network, which is fed the problem query represented in character embeddings and outputs a derivation statement whose kinematic variables are already assigned numerical values. The statement rhs, a blend of numbers and math operator tokens, is then passed to the neural expression evaluator that calculates a rough problem answer. Our recurrent translation module aligns sequence positions to steps in computation time, and uses additive attention (Bahdanau et al., 2015) that computes the context vector c_i over a feed-forward neural network with a single hidden layer using the weighted sum

  c_i = Σ_{j=1}^{T} α_{ij} h_j,

where T is the input sequence length, to score how well the inputs around position j and the output at position i match using a probability weight α_{ij} and hidden state h_j.
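The sketch below shows one way the additive scoring and the weighted-sum context c_i can be realized over LSTM hidden states, using PyTorch; the layer shapes and the 171-step sequence length are illustrative rather than the exact configuration of our implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    """Bahdanau-style attention: score each encoder state h_j against the
    current decoder state through a single hidden layer, softmax the scores
    into weights, and return the weighted-sum context c_i."""
    def __init__(self, hidden=256):
        super().__init__()
        self.w_enc = nn.Linear(hidden, hidden, bias=False)
        self.w_dec = nn.Linear(hidden, hidden, bias=False)
        self.v = nn.Linear(hidden, 1, bias=False)

    def forward(self, dec_state, enc_states):
        # dec_state: (batch, hidden); enc_states: (batch, T, hidden)
        scores = self.v(torch.tanh(self.w_enc(enc_states) +
                                   self.w_dec(dec_state).unsqueeze(1)))
        alpha = F.softmax(scores, dim=1)            # probability weights alpha_ij
        context = (alpha * enc_states).sum(dim=1)   # c_i = sum_j alpha_ij h_j
        return context, alpha.squeeze(-1)

attn = AdditiveAttention()
c_i, alpha = attn(torch.randn(2, 256), torch.randn(2, 171, 256))
print(c_i.shape, alpha.shape)   # torch.Size([2, 256]) torch.Size([2, 171])
```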
Figure 4: Transformer multi-head attention: the queries Q, keys K, and values V are each linearly projected, passed through scaled dot-product attention, concatenated, and projected to the output O.
To ameliorate the limitations rooted in a serialized recurrent layer, the transformer introduces multi-head attention (Figure 4). Restating the definition of an attention function, the transformer maps a query combined with a set of key-value pairs to an output. Letting the query and each key be a d_k-dimensional vector, the transformer computes the dot products of a query with all the keys, divides each by sqrt(d_k), and applies a softmax function to obtain the weights on the value vectors of dimensionality d_v. To compute a set of outputs simultaneously, the transformer uses an efficient matrix representation of queries Q, keys K, and values V in the formulation of scaled dot-product attention

  attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

Instead of performing a single attention function, the transformer linearly projects the queries, keys, and values h times, with h the number of heads, each with distinctly learned parameters in the form of projection matrices. Scaled dot-product attention is then computed for each version of the projected inputs, and the results are further concatenated and projected once more to produce the final values (Figure 4). Multi-head attention thus benefits from attending jointly to relevant information at non-aligned sequence positions.
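The two operations reduce to a handful of tensor expressions. The following sketch (illustrative, not the annotated implementation we adopted) computes scaled dot-product attention and a multi-head variant with h = 8 heads over a d_model = 512 representation, with randomly drawn stand-ins for the learned projection matrices.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = q.size(-1)
    weights = F.softmax(q @ k.transpose(-2, -1) / math.sqrt(d_k), dim=-1)
    return weights @ v

def multi_head_attention(q, k, v, w_q, w_k, w_v, w_o, h=8):
    # Project h times, attend to every head in parallel, concatenate the heads,
    # and project once more; the w_* matrices stand in for learned parameters.
    batch, n, d_model = q.shape
    d_head = d_model // h
    def split(x, w):   # (batch, n, d_model) -> (batch, h, n, d_head)
        return (x @ w).view(batch, -1, h, d_head).transpose(1, 2)
    heads = scaled_dot_product_attention(split(q, w_q), split(k, w_k), split(v, w_v))
    concat = heads.transpose(1, 2).contiguous().view(batch, n, d_model)
    return concat @ w_o

d_model, n = 512, 40
x = torch.randn(2, n, d_model)
w_q, w_k, w_v, w_o = (torch.randn(d_model, d_model) / math.sqrt(d_model)
                      for _ in range(4))
out = multi_head_attention(x, x, x, w_q, w_k, w_v, w_o)   # self-attention
print(out.shape)   # torch.Size([2, 40, 512])
```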
Given n the sequence length and d the representation dimension, the theoretical computational complexity per layer of the transformer is O(n^2 · d), compared to O(n · d^2) for the recurrent model. In our task n < d (for instance, the longest synthesized query runs 171 characters while the model dimension is set to 512 in Section 4.2), and thus the transformer performs faster than the recurrent model.
4 EXPERIMENTAL SETUP
We measured the answer quality of a problem by comparing the decoded derivation to the reference target derivation, and chose to report unigram BLEU, a score in [0,1], at the corpus level as our performance metric (Papineni et al., 2002). Accuracy, the share of problems that generated a numerical result within a ±10% margin of a gold solution-cluster center, was used for offline model tuning.
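For reference, the snippet below computes corpus-level unigram BLEU with NLTK and the ±10% accuracy check over toy values; it sketches the metrics rather than reproducing our evaluation scripts.

```python
from nltk.translate.bleu_score import corpus_bleu

# Corpus-level unigram BLEU over character tokens: weights=(1.0,) restricts
# BLEU to unigram precision (a sketch of the metric, not our evaluation script).
references = [[list("v0 = 950 - 47*83")]]    # one list of references per problem
hypotheses = [list("v0 = 950 - 47*83")]      # decoded derivations
print(corpus_bleu(references, hypotheses, weights=(1.0,)))   # 1.0 for an exact match

def accuracy(decoded, gold_centers, margin=0.10):
    """Share of problems whose decoded numeric result lies within a +/-10%
    margin of the corresponding gold solution-cluster center."""
    hits = [abs(d - g) <= margin * abs(g) for d, g in zip(decoded, gold_centers)]
    return sum(hits) / len(hits)

print(accuracy([10.4, 98.0, -3.0], [10.0, 100.0, -2.0]))   # 2 of 3 within margin
```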
4.1 Corpus
To address model scalability, we constructed a sufficiently large synthetic dataset of fifty thousand word problems, with the provision to load any subset thereof for experimentation. Loaded data was partitioned into train, validation, and test splits using an 8:1:1 ratio. In Table 3, we show the field definitions of a kinematics word-problem object with their corresponding string values. The first instructive fields include a running problem ID for object indexing, followed by the missing variable and thereafter the implied one-of-three kinematic equations to use. Next in the object data structure are the asked variable and an ordered key-value list of the known variables. The list facilitates a dictionary that looks up the pair of a numeric value and an SI unit of measurement for a symbolic variable. In our current implementation, variable-assigned values are numbers randomly sampled from a wide range of 1 to 10,000. Succeeding in the object record are the gold elements, which include the target derivation statement and the problem numeric solution we used in training our model. Lastly, the composition of the summarized natural-language query is shown.
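The record layout and the 8:1:1 partition can be pictured with the short sketch below; the dictionary keys mirror Table 3, but the exact serialization of the released corpus may differ.

```python
import random

# Illustrative record mirroring the Table 3 fields (the released serialization
# may differ), plus the 8:1:1 train/validation/test partition described above.
problem = {
    "problem id": 5, "missing": "t",
    "equation": "v^2 = v0^2 + 2 a (x - x0)",
    "asked": "x0", "known": ["v", "v0", "a", "x"],
    "values": ["7830 m/s", "5530 m/s", "5297 m/s/s", "7892 m"],
    "derivation": "x0 = (5530^2 - 7830^2 - 2*5297*7892)/(2*5297)",
    "solution": "10792.51 m",
    "question": "a car has a final velocity of 7830 m/s, initial velocity of "
                "5530 m/s, acceleration of 5297 m/s/s, final displacement of "
                "7892 m . what is the initial displacement ?",
}

def split_dataset(problems, ratios=(8, 1, 1), seed=13):
    random.Random(seed).shuffle(problems)
    total = sum(ratios)
    n_train = len(problems) * ratios[0] // total
    n_val = len(problems) * ratios[1] // total
    return (problems[:n_train], problems[n_train:n_train + n_val],
            problems[n_train + n_val:])

corpus = [{**problem, "problem id": i} for i in range(10000)]
train, val, test = split_dataset(corpus)
print(len(train), len(val), len(test))   # 8000 1000 1000
```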
Figure 5: Visualization of the grid-based distribution of the asked variable (a, t, v, v0, x, x0) over the missing variables (t, v, x) using a 50,000 problem dataset.
Figure 6: Visualization of problem solution clusters for each asked variable (a, t, v, v0, x, x0), plotted over two projected dimensions, using a 50,000 problem dataset. Clusters are used to determine rough problem answers.
We conducted an initial analysis of our synthesized problem data. In Figure 5, we show the grid-based distribution of asked variables across missing variables. At each asked-missing variable intersection we draw the count of occurrences using a distinct color. As expected, the number of total observations for the kinematics equation identified with the missing variable x, which is described by four variables, is larger than for the other two equations that are each represented by five variables. Given a missing variable, asked variables consistently render a relatively even distribution. In Figure 6, we show problem solution clusters arranged for each asked variable.
Table 4: Number of character tokens and longest sequence size used for the input and output of the translation module, shown for the synthesized and real-world data.

          Synthesized          Real-World
          tokens  sequence     tokens  sequence
Encoder   35      171          47      303
Decoder   26      40           27      40
We use t-distributed stochastic neighbor embedding (t-SNE) (van der Maaten and Hinton, 2008) to project the large solution space of 50,000 word problems onto a two-dimensional extent. Numerical solutions to our synthesized problems appear overall well behaved, with no apparent outliers.
To contrast model performance on our machine-made data, we extracted real-world kinematic word problems from the open Web. Unlike the broad selection available for math, and to the extent of our knowledge, there are no large, publicly available word-problem repositories in the domain of physics. We conducted manual searches and managed to obtain one hundred high-quality kinematic problems that we used as our real-world test set. Real-world queries were pre-processed to abide by the SI units of physical measurement, and lowercased for consistency with our synthesized data. Figure 7 shows a typical real-world query sampled from an open grammar, while Table 4 lists the number of tokens and the longest sequence in our concise character vocabulary, for both the input and output of the translation module. On the encoder side, the figures are considerably larger for the real-world test set.
a pitcher throws a fastball with a velocity of 43.5
m/s. it is determined that during the windup and
delivery the ball covers a displacement of 2.5 m.
this is from the point behind the body when the
ball is at rest to the point of release. calculate the
acceleration during his throwing motion.
Figure 7: A typical real-world word problem sampled from
an open grammar.
4.2 Training
In our work, we used both Keras (Chollet et al., 2015) and PyTorch (Paszke et al., 2017) as our deep learning research platforms, for training and evaluating the sequence-to-sequence and transformer translation models, respectively.

Keras is a high-level deep learning interface that runs on top of the TensorFlow software library (https://www.tensorflow.org/).
Figure 8: Epoch loss progression (loss versus training epoch) of sequence-to-sequence model training, parameterized by the number of contrived kinematic problems (curves for 1,000 to 25,000 problems).
We trained our model using the RMSProp optimizer (Tieleman and Hinton, 2012) with an initial learning rate of 0.1 and a variable mini-batch size based on the loaded number of word problems. Uniformly over both the encoder and decoder components, we used 100-dimensional character vectors and set the LSTM hidden state to a 256-element vector. To overcome overfitting of the train data, we used a dropout of 0.2 and a patience of two epochs for early stopping of training. In Figure 8, we show the sequence-to-sequence loss behavior during model training, parameterized by five discrete choices of the number of dataset problems: {1,000, 2,500, 5,000, 10,000, 25,000}. As evidenced by the plot, from about the fifth epoch onward the loss curves for 10,000 and 25,000 programmatic problems draw exceedingly close.
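A Keras-style configuration along the lines described above might look as follows. This is a hedged sketch rather than our training script: the additive attention connection shown in Figure 3 is omitted for brevity, teacher forcing of the decoder input is assumed, and the vocabulary sizes are taken from Table 4.

```python
from tensorflow import keras

EMBED, HIDDEN = 100, 256          # 100-d character vectors, 256-element LSTM state
ENC_TOKENS, DEC_TOKENS = 35, 26   # character vocabularies from Table 4 (synthesized)

# Encoder: character ids -> final LSTM state handed to the decoder.
enc_in = keras.Input(shape=(None,), dtype="int32")
enc_emb = keras.layers.Embedding(ENC_TOKENS, EMBED)(enc_in)
_, h, c = keras.layers.LSTM(HIDDEN, return_state=True, dropout=0.2)(enc_emb)

# Decoder: teacher-forced derivation characters conditioned on the encoder state.
dec_in = keras.Input(shape=(None,), dtype="int32")
dec_emb = keras.layers.Embedding(DEC_TOKENS, EMBED)(dec_in)
dec_seq = keras.layers.LSTM(HIDDEN, return_sequences=True, dropout=0.2)(
    dec_emb, initial_state=[h, c])
probs = keras.layers.Dense(DEC_TOKENS, activation="softmax")(dec_seq)

model = keras.Model([enc_in, dec_in], probs)
model.compile(optimizer=keras.optimizers.RMSprop(learning_rate=0.1),
              loss="sparse_categorical_crossentropy")
early_stop = keras.callbacks.EarlyStopping(patience=2)   # two-epoch patience
# model.fit([enc_chars, dec_chars], dec_targets, epochs=10,
#           validation_split=0.1, callbacks=[early_stop])
```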
PyTorch is a Python-based scientific computing package that provides tensor computation and efficient parallel execution for building deep neural architectures. We incorporated the annotated PyTorch implementation of the transformer (Rush, 2018) into our software framework and modified it to suit our task and datasets. The transformer was trained on the CPU with modest runtime concurrency, as we used a two-layer stack for each of the encoder and decoder components, rather than the six-layer default. Multi-head attention was configured with h = 8 heads and model size d_model = 512, with the query, key, and value vectors each set to d_k = d_v = d_model/h = 64. The inner layer of the encoder and decoder had dimensionality d_ff = 2048. We used the Adam optimizer (Kingma and Ba, 2014) and followed the transformer method of increasing the learning rate over the first warmup epochs of training and then decreasing it proportionally to the inverse square root of the epoch number. To avoid overfitting in training, we used a dropout of 0.1.
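The warmup schedule can be written compactly as below; the formula follows the rate popularized by Vaswani et al. (2017) and Rush (2018), stated here over update steps with an illustrative warmup constant (the text above applies the same idea per epoch).

```python
# Warmup learning-rate schedule: the rate grows linearly for `warmup` steps and
# then decays with the inverse square root of the step number; the warmup
# constant of 4000 is illustrative, d_model follows the 512 setting above.
def transformer_lr(step, d_model=512, warmup=4000, factor=1.0):
    step = max(step, 1)
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

for step in (100, 4000, 20000):
    print(step, round(transformer_lr(step), 6))
```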
We trained the expression encoder separately using a dataset of compact problem objects, each composed of a pair of derivation and solution fields (Table 3).
Figure 9: Sequence-to-sequence model scalability, showing BLEU scores for the real-world baseline and synthesized test problems as a function of a non-descending number of trained word problems.
Figure 10: Transformer model scalability, showing BLEU scores for the real-world baseline and synthesized test problems as a function of a non-descending number of trained word problems.
Due to the large evaluation space of signed floating-point numeric quantities, we expected the expression train set to be considerably larger than the translation set, hence our motivation for computing rough problem results. We used k-means to bin the gold solutions and used the cluster centers as the reference against which decoded rough solutions are compared. The number of clusters is a user-settable hyperparameter for tuning system accuracy.
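The rough-solution scheme is summarized by the sketch below, which bins synthetic gold values with k-means and scores decoded answers against the centers of their gold clusters; scikit-learn's KMeans stands in for whichever implementation is used, and the cluster count and data are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# Bin gold solutions with k-means and keep the cluster centers as references
# for decoded rough answers; the cluster count is the tunable hyperparameter.
rng = np.random.default_rng(0)
gold = rng.uniform(-60, 60, size=(1000, 1))            # gold numeric solutions
kmeans = KMeans(n_clusters=16, n_init=10, random_state=0).fit(gold)

gold_centers = kmeans.cluster_centers_[kmeans.predict(gold[:5]), 0]
decoded = gold[:5, 0] * 1.05                           # pretend decoded rough answers
within = np.abs(decoded - gold_centers) <= 0.10 * np.abs(gold_centers)
print(f"accuracy: {within.mean():.2f}")
```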
5 EXPERIMENTAL RESULTS
In this section, we report the quality of kinematic problem answers for both the sequence-to-sequence and transformer neural models. BLEU scores for sequence-to-sequence scalability are shown in Figure 9 for both the real-world and synthesized test data, as a function of an increasing number of loaded word problems. Initially, rates climb precipitously up to about 2,500 problems, then move on to a more moderate upslope, or a decline for real-world data, and end in a fairly flat course starting at about 10,000 problems. On average, machine-made data performance is higher than the real-world baseline by about 0.12 BLEU.
In Figure 10, we show the performance of our transformer model at scale. Upon first observation, BLEU rates advance in a pattern that for the most part resembles the behavior of our neural recurrent model. Notably however, the transformer quality flattens at around 5,000 problems, and moreover, with BLEU scores of 0.54 and 0.81, it outperforms our recurrent model by factors of 1.45 and 1.65 on real-world and synthesized data, respectively. Markedly, the transformer encountered diminishing returns in performance with relatively modest amounts of data and saturated rather quickly in contrast to the recurrent model. We hypothesize this is mostly owing to the simpler transformer architecture (Zhu et al., 2015).
Results of our comparative analysis of quality across neural models purposed for science language understanding are shown in Table 5. The problem data types used in the listed studies range from purely real-world or machine-generated to a merger of the two, while the dataset dimensionality varies from tens of thousands to one hundred thousand word problems. The data-averaged performance of our recurrent model, at (0.49 + 0.37)/2 = 0.43 BLEU, is on a par with the 0.48 BLEU score reported by Wong (2018). That work pertains to the math domain and uses a sequence-to-sequence model similar to ours. We note that their approach concatenates both word- and character-level embeddings and uses SymPy, a Python-based symbolic mathematics library, to evaluate a text expression, unlike our model, which applies a dedicated neural network to this task. At a rate of 0.54 BLEU on the real-world test set, the translation quality of our transformer model tops the rest of the neural models.
Using our synthesized data, we chose twenty human experts, including high school students, and assessed their written responses to randomly selected questions from a set of 10,000 problems. Labeled queries were uniformly sampled from each of the derivation templates (Table 2), and each participant in this experiment was asked to solve fourteen problems, leading to a total of 280 queries for the entire group. The average score of the expert users was about 0.93 BLEU, compared to 0.81 BLEU for our transformer-based model (Table 5). Remarkably, the most challenging task we observed for the students was reasoning about the selection of one of the three kinematic equations, while developing the derivation statement required little effort.
Table 5: Comparative model quality of science language understanding. Human expert performance is provided for reference.

Model                     Domain   Data Type               Word Problems  BLEU
Ling et al. (2017)        Math     Real-World, Synthetic   100,000        0.27
Wong (2018)               Math     Real-World, Synthetic   18,000         0.48
Our Model (recurrent)     Physics  Real-World              25,000         0.37
Our Model (recurrent)     Physics  Synthetic               25,000         0.49
Our Model (transformer)   Physics  Real-World              10,000         0.54
Our Model (transformer)   Physics  Synthetic               10,000         0.81
Our Data (human expert)   Physics  Synthetic               10,000         0.93
Although the broader scope of attention scalability is outside the extent of this paper, the concept of pointer networks (Vinyals et al., 2015; See et al., 2017) warrants a brief discussion. These networks use attention as a pointer that selects a member of the input sequence as the output. They target problems whose outputs are a sequence of discrete tokens corresponding to positions in the input. Our model maps summarized natural language queries onto an equation derivation comprised of distinct symbols, and hence pointer networks are likely to boost answer quality.
6 CONCLUSIONS
In this work, we proposed to extend the encoder-decoder based sequence translation architecture with a neural expression evaluator that computes a rough problem solution, in order to effectively learn kinematic word problems. Counter to common intuition, we showed that the performance of both recurrent and self-attended sequence models may not scale as the dataset size increases, a statement we demonstrated on both synthesized and real-world content. We contend that analyzing the data-scale pattern for science language understanding is key to advancing learning algorithms in the underlying network foundation, and to proving continued gains in answer quality by efficiently using large datasets.
We showed that the problem answer quality of the multi-head, self-attended transformer significantly outperformed our recurrent sequence-to-sequence architecture, by at least fifteen percentage points of BLEU. Combined with its outstanding training efficiency, this positions the transformer as the more compelling solution for our task.
Given a neural model with an output that corresponds to positions in the query sequence, adopting the self-attention rules evidenced in the transformer architecture is our natural progression toward improved answer quality. We envision our neural expression evaluator taking part not just in a backward path, with the objective of assisting the training of the translation subsystem, but also more directly in inference, to quantify the correctness of the problem numerical solution. Separating the formal variables from the question clause in the query is a plausible approach to explore a more effective mapping to a derivation template by simulating human expert reasoning. Lastly, our framework is generic and easily extensible to aid in learning other physics problem domains.
ACKNOWLEDGMENTS
We would like to thank the anonymous reviewers for
their insightful suggestions and feedback.
REFERENCES
Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural ma-
chine translation by jointly learning to align and trans-
late. In ICLR, San Diego, California.
Bobrow, D. G. (1964). Natural language input for a com-
puter problem solving system. Technical report, MIT,
Cambridge, MA. http://hdl.handle.net/1721.1/6903.
Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau,
D., Bougares, F., Schwenk, H., and Bengio, Y.
(2014). Learning phrase representations using RNN
encoder–decoder for statistical machine translation. In
Empirical Methods in Natural Language Processing
(EMNLP), pages 1724–1734, Doha, Qatar.
Chollet, F. et al. (2015). Keras. https://keras.io.
Elbayad, M., Besacier, L., and Verbeek, J. (2018). Pervasive attention: 2D convolutional neural networks for sequence-to-sequence prediction. In Proceedings of the 22nd Conference on Computational Natural Language Learning (CoNLL), pages 97–107, Brussels, Belgium.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term
memory. Neural Computation, 9(8):1735–1780.
Hosseini, M. J., Hajishirzi, H., Etzioni, O., and Kushman,
N. (2014). Learning to solve arithmetic word prob-
lems with verb categorization. In Empirical Meth-
ods in Natural Language Processing (EMNLP), pages
523–533, Doha, Qatar.
Huang, D., Shi, S., Lin, C.-Y., and Yin, J. (2017). Learn-
ing fine-grained expressions to solve math word prob-
lems. In Empirical Methods in Natural Language Pro-
cessing (EMNLP), pages 805–814, Copenhagen, Den-
mark.
Huang, D., Shi, S., Lin, C.-Y., Yin, J., and Ma, W.-Y.
(2016). How well do computers solve math word
problems? large-scale dataset construction and evalu-
ation. In Annual Meeting of the Association for Com-
putational Linguistics (ACL), pages 887–896, Berlin,
Germany.
Huang, D., Yao, J.-G., Lin, C.-Y., Zhou, Q., and Yin, J.
(2018). Using intermediate representations to solve
math word problems. In Annual Meeting of the As-
sociation for Computational Linguistics (ACL), pages
419–428, Melbourne, Australia.
Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. CoRR, abs/1412.6980. http://arxiv.org/abs/1412.6980.
Kushman, N., Artzi, Y., Zettlemoyer, L., and Barzilay,
R. (2014). Learning to automatically solve algebra
word problems. In Annual Meeting of the Association
for Computational Linguistics (ACL), pages 271–281,
Baltimore, Maryland.
Leszczynski, M. and Moreira, J. (2016). Machine solver for
physics word problems. In Neural Information Pro-
cessing Systems (NIPS) Intuitive Physics Workshop,
Barcelona, Spain.
Ling, W., Yogatama, D., Dyer, C., and Blunsom, P. (2017).
Program induction by rationale generation: Learn-
ing to solve and explain algebraic word problems.
In Annual Meeting of the Association for Computa-
tional Linguistics (ACL), pages 158–167, Vancouver,
Canada.
Matsuzaki, T., Ito, T., Iwane, H., Anai, H., and Arai, N. H. (2017). Semantic parsing of pre-university math problems. In Annual Meeting of the Association for Computational Linguistics (ACL), pages 2131–2141, Vancouver, Canada.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002).
BLEU: a method for automatic evaluation of machine
translation. In Annual Meeting of the Association
for Computational Linguistics (ACL), pages 311–318,
Philadelphia, Pennsylvania.
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E.,
DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and
Lerer, A. (2017). Automatic differentiation in py-
torch. In Workshop on Autodiff, Advances in Neu-
ral Information Processing Systems (NIPS) ), Long
Beach, California.
Pennington, J., Socher, R., and Manning, C. (2014). Glove:
Global vectors for word representation. In Empirical
Methods in Natural Language Processing (EMNLP),
pages 1532–1543, Doha, Qatar.
R Core Team (2013). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org/.
Roy, S. and Roth, D. (2015). Solving general arithmetic
word problems. In Empirical Methods in Natural Lan-
guage Processing (EMNLP), pages 1743–1752, Lis-
bon, Portugal.
Rush, A. (2018). The annotated transformer. In Proceedings
of Workshop for NLP Open Source Software (NLP-
OSS), pages 52–60, Melbourne, Australia.
See, A., Liu, P. J., and Manning, C. D. (2017). Get to
the point: Summarization with pointer-generator net-
works. In Annual Meeting of the Association for Com-
putational Linguistics (ACL), pages 1073–1083, Van-
couver, Canada.
Shi, S., Wang, Y., Lin, C.-Y., Liu, X., and Rui, Y. (2015).
Automatically solving number word problems by se-
mantic parsing and reasoning. In Empirical Meth-
ods in Natural Language Processing (EMNLP), pages
1132–1142, Lisbon, Portugal.
Suppes, P., Böttner, M., and Liang, L. (1998). Machine learning of physics word problems: A preliminary report. In Computing Natural Language, pages 141–154. Stanford University, California, USA.
Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Se-
quence to sequence learning with neural networks. In
Advances in Neural Information Processing Systems
(NIPS), pages 3104–3112. Curran Associates, Inc.,
Red Hook, NY.
Tieleman, T. and Hinton, G. E. (2012). Lecture 6.5-RMSProp: Divide the gradient by a running average of its recent magnitude. Technical report, COURSERA: Neural Networks for Machine Learning, 4, 26-30.
van der Maaten, L. and Hinton, G. E. (2008). Visualizing
data using t-sne. Journal of Machine Learning Re-
search (JMLR), 9(Nov):2579–2605.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I.
(2017). Attention is all you need. In Advances in
Neural Information Processing Systems (NIPS), pages
5998–6008. Curran Associates, Inc., Red Hook, NY.
Vilenius-Tuohimaa, P. M., Aunola, K., and Nurmi, J.
(2008). The association between mathematical word
problems and reading comprehension. Educational
Psychology, 28(4):409–426.
Vinyals, O., Fortunato, M., and Jaitly, N. (2015). Pointer
networks. In Advances in Neural Information Pro-
cessing Systems (NIPS), pages 2692–2700. Curran
Associates, Inc.
Wang, L., Wang, Y., Cai, D., Zhang, D., and Liu, X. (2018).
Translating a math word problem to a expression tree.
In Empirical Methods in Natural Language Process-
ing (EMNLP), pages 1064–1069, Brussels, Belgium.
Wang, Y., Liu, X., and Shi, S. (2017). Deep neural solver for
math word problems. In Empirical Methods in Natu-
ral Language Processing (EMNLP), pages 845–854,
Copenhagen, Denmark.
Wong, R. (2018). Solving math word problems. Technical report, Stanford University, Palo Alto, CA. https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1184/reports/6866023.pdf.
Zhou, L., Dai, S., and Chen, L. (2015). Learn to solve alge-
bra word problems using quadratic programming. In
Empirical Methods in Natural Language Processing
(EMNLP), pages 817–822, Lisbon, Portugal.
Zhu, X., Vondrick, C., Fowlkes, C. C., and Ramanan, D.
(2015). Do we need more training data? CoRR,
abs/1503.01508. http://arxiv.org/abs/1503.01508.