Memory Net: Generalizable Common-Sense Reasoning over Real-World
Actions and Objects
Julian Eggert¹, Joerg Deigmoeller¹, Pavel Smirnov¹, Johane Takeuchi² and Andreas Richter¹
¹Honda Research Institute Europe, Carl-Legien-Straße 30, 63073 Offenbach am Main, Germany
²Honda Research Institute Japan, 8-1 Honcho, Wako, Saitama 351-0114, Japan
Keywords:
Situational Question Answering, Knowledge Representation and Reasoning, Natural Language
Understanding.
Abstract:
In this paper, we explore how artificial agents (AAs) can understand and reason about so-called "action patterns" within real-world settings. Essentially, we want AAs to determine which tools fit specific actions, and which actions can be executed with certain tools, objects or agents, based on real-world situations. To achieve this, we utilize a comprehensive Knowledge Graph, called "Memory Net", filled with interconnected everyday concepts, common actions, and environmental data. Our approach involves an inference technique that harnesses semantic proximity through subgraph matching. Comparing our approach against human responses and a state-of-the-art machine learning approach based on a neural language model in a home scenario, our Knowledge Graph method demonstrated strong generalization capabilities, suggesting its promise in dynamic, incremental and interactive real-world settings.
1 INTRODUCTION
Our work aims to enable artificial agents (AA) to
use practical, everyday common-sense knowledge in
physical environments. The AA should evaluate ac-
tions, tools, objects, locations, and collaboration part-
ners within a specific context. We use a simulated
home environment paired with a Knowledge Graph,
known as Memory Net (MemNet). A natural lan-
guage interface aids in aligning text input with Mem-
Net concepts. This concept aligns with Embodied Question Answering (EQA) (Das et al., 2017), but emphasizes situational context, hence termed Situational Question Answering (SQA) (Deigmöller et al., 2022).
We contend that SQA is important for effective
human-machine cooperation, rooted in mutual situa-
tional understanding. This shared comprehension lets
agents predict or strategize, advancing towards Co-
operative Intelligence (Sendhoff and Wersing, 2020),
which promotes harmonious human-machine task
collaboration.
In this study, we gauge mutual understanding
through a question-answering task. The AA, using
MemNet, suggests solutions, ranking them from most
to least suitable. In a tool identification task, the AA
compares human-selected tools to those from Mem-
Net for questions like ”What can I use to store a
book?”. The choices are limited to 69 tools within
the simulated environment.
All common-sense and instance data exist in a
single graph, ensuring streamlined reasoning. We’ve
equipped MemNet with an inference system utilizing
semantic proximity via approximate subgraph match-
ing over action patterns.
The paper will first investigate related works, fol-
lowed by explaining the principles of MemNet, and
finally compare our system’s performance against hu-
man annotated data as well as a neural language
model in the tool identification task.
1.1 Related Work
Artificial agents (AA) rely heavily on the quality
of their knowledge sources for question-answering
performance. Knowledge-graphs like WikiData and
DBPedia predominantly offer factoid, encyclope-
dic knowledge about various topics (Antoniou and
Bassiliades, 2022). Systems built on such data of-
ten pair natural language understanding with SparQL
query generation (Zafar et al., 2018; Diefenbach et al.,
2018; Diefenbach et al., 2020), with benchmarks cen-
tered around instance-related questions (Dubey et al.,
2019). While these systems show progress, relying
solely on encyclopedic knowledge isn’t sufficient for
embodied agents to interact in a physical environ-
ment.
Common-sense knowledge graphs, unlike ency-
clopedic ones, focus on general world knowledge.
These are sourced from plain texts, search-engine
logs, language models, or human curations such as
WebChild (Tandon et al., 2017), Atomic (Sap et al.,
2019), and Wordnet (Miller, 1998). Systems based
on this knowledge usually employ neural networks
or language models for question-answering (Storks
et al., 2019).
Embodied Question Answering (EQA) handles
questions about physical entities, necessitating both
natural language understanding and robot localization
and navigation. However, benchmarks like (Shrid-
har et al., 2020; Srivastava et al., 2022) mainly as-
sess sequences of expert-driven actions rather than
the robot’s common-sense capabilities. (Vassiliades
et al., 2021) explored richer semantic understand-
ing of scene objects and implemented a knowledge
retrieval system for cognitive robots. Knowledge
representations in robotics, like (Paulius and Sun,
2018) and (Thosar et al., 2018), detail actions for
task executions but often overlook language inter-
actions, especially in resolving ambiguities. While
some robotics applications include language under-
standing, they typically lack the extensive knowledge
needed to execute complex tasks or comprehend their
context (Fischer et al., 2018; Venkatesh et al., 2021; Lynch and Sermanet, 2021; Mühlig et al., 2020).
A promising development is found in (Ahn et al.,
2022), which employs a large language model for
common-sense reasoning. They introduced an af-
fordance function to anchor the language model in
the physical world. However, we advocate for struc-
tured knowledge representation for transparency in
decision-making and real-time knowledge incorpora-
tion, especially if robot behaviors or knowledge need
refining through interactions.
2 MEMORY NET KNOWLEDGE
GRAPH
We utilize a unique knowledge graph structure for
common-sense reasoning, as presented in (Eggert
et al., 2020) and (Eggert et al., 2019). This graph
describes heavily interlinked entities, with nodes rep-
resenting concepts and instances, and links denoting
relationships between them.
Our knowledge graph acts as the sole storage for
varied data: from abstract concepts to real-world enti-
ties, objects, tools, actions, states, and even temporal
observations. Unlike the semantic web (Berners-Lee
et al., 2001) and many property graphs (Angles, 2018)
that rely on node and link category naming to carry se-
mantics requiring human interpretation, our MemNet
assumes no implicit semantics. This ensures entirely
machine-driven operations and allows the AA to ex-
pand its knowledge during operations.
Concept semantics emerge from the graph’s struc-
ture and a minimal set of link types, enabling oper-
ations like property reasoning (”hasProp”), composi-
tionality (”hasPart”), and transformations (”transTo”).
In MemNet, a ”property pattern” represents each con-
cept, linking a central node (the main concept) to
child nodes that signify subconcepts and properties.
These properties ideally correlate to external ”mea-
surements”, such as linguistic utterances or visual
recognition.
Except for certain root nodes, all concept nodes
are automatically linked to others via specialization
or inheritance links. A concept specializes from an-
other if it maintains its parent’s traits but has addi-
tional unique properties and relationships. This spe-
cialization is explicitly denoted in the graph, helping
to maintain a concise representation and minimize re-
dundancies. The concept’s meaning arises both from
its property pattern and its links to other concepts.
One central concept for an AA that operates in the real world is that of an action, since it links multiple different types of concepts, such as involved objects and tools, a predicted result, or a participating actor. This
is closely tied to verb semantics as highlighted in lin-
guistics (Baker et al., 1998).
Within our knowledge graph, an ”action pattern”
defines the components of an action. These are spe-
cialized versions of original concepts, given their spe-
cific roles in the action. Hierarchical taxonomies help
detail agents, transformations, and objects, which
then take on specific roles in an action. For instance,
a knife becomes a tool in the context of the ”cut” ac-
tion, while an agent might be an actor or recipient in
a ”bring” action, depending on its context.
A key benefit of MemNet’s specialization schema
is its seamless representation of both abstract con-
cepts and real-world entities.
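To make this schema more tangible, the following minimal sketch (in Python, purely illustrative and not the actual MemNet implementation; all class and field names are our assumptions) shows how concepts, specialization links, and an action pattern with role-specific specializations could be encoded.

from dataclasses import dataclass, field

# Purely illustrative: MemNet itself is a single graph with typed links
# ("hasProp", "hasPart", "transTo", specialization); here we only model
# enough to show how an action pattern ties role-specialized concepts
# to a taxonomy.

@dataclass
class Concept:
    name: str                                      # e.g. "knife.n.01"
    lemmas: list = field(default_factory=list)     # natural language expressions
    parents: list = field(default_factory=list)    # specialization ("is-a") links

@dataclass
class ActionPattern:
    action: Concept   # action role, e.g. a "cut" specialized for this pattern
    obj: Concept      # object acted upon, e.g. an "onion"
    tool: Concept     # tool role, e.g. a "knife"

# A tiny taxonomy fragment.
cutting_tool = Concept("cutting_tool.n.01", ["cutting tool"])
knife = Concept("knife.n.01", ["knife"], parents=[cutting_tool])
vegetable = Concept("vegetable.n.01", ["vegetable"])
onion = Concept("onion.n.01", ["onion"], parents=[vegetable])
cut = Concept("cut.v.01", ["cut"])

# The elements of an action pattern are specializations of generic concepts,
# so the pattern remains connected to the taxonomy.
cut_onion_with_knife = ActionPattern(
    action=Concept("cut@cut_onion", ["cut"], parents=[cut]),
    obj=Concept("onion@cut_onion", ["onion"], parents=[onion]),
    tool=Concept("knife@cut_onion", ["knife"], parents=[knife]),
)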
3 EVERYDAY COMMON-SENSE
KNOWLEDGE
To infuse our MemNet with general-knowledge con-
cepts and links, we utilized multiple taxonomies from
public datasets. Primarily, we turned to WordNet
(Miller, 1998) to extract hub nodes for specialization
taxonomies. These nodes, organized as ”synsets”,
connect synonyms with shared meanings, serving
as our MemNet’s foundational structure. However,
WordNet lacks details on action patterns.
To fill this gap, we leveraged ConceptNet (Speer
et al., 2017). Its natural language triplets, such as
”[a net]-[used for]-[catching fish]”, were connected
to MemNet concepts through spaCy (https://spacy.io/) (Honnibal and Montani, 2017). For accurate connections, action and
Montani, 2017). For accurate connections, action and
object lemmas from phrases were extracted (Losing
et al., 2021), and Word Sense Disambiguation fa-
cilitated the mapping between WordNet synsets and
MemNet concepts.
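As an illustration of the lemma extraction step, the short Python sketch below uses spaCy to pull action and object lemmas out of a ConceptNet phrase; it simplifies the pipeline of (Losing et al., 2021), the model name is an assumption, and the Word Sense Disambiguation step is omitted.

import spacy

# Simplified lemma extraction for ConceptNet phrases such as "catching fish".
# Assumes the small English model is installed
# (python -m spacy download en_core_web_sm).
nlp = spacy.load("en_core_web_sm")

def extract_lemmas(phrase: str):
    """Return (action lemmas, object lemmas) found in a short phrase."""
    doc = nlp(phrase)
    actions = [t.lemma_ for t in doc if t.pos_ == "VERB"]
    objects = [t.lemma_ for t in doc if t.pos_ == "NOUN"]
    return actions, objects

# "[a net]-[used for]-[catching fish]" -> action "catch", object "fish"
print(extract_lemmas("catching fish"))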
Although ConceptNet was useful, its domain cov-
erage was inconsistent. From it, we retrieved 271
statements, primarily of the ”used for” relation that
were relevant for our scenario. To broaden our cover-
age, we introduced 70 high-level action patterns, ex-
amples include:
make.v.03, food.n.01, cooking utensil.n.01
make.v.03, food.n.01, kitchen appliance.n.01
drink.v.01, beverage.n.01, drinking vessel.n.01
...
For practical testing, we employed VirtualHome
(Puig et al., 2018), simulating an agent’s real-world
interactions. We incorporated 69 stateful objects from
VirtualHome into our graph.
In summary, three sources enriched our MemNet: WordNet for base categorizations, ConceptNet for action patterns, and VirtualHome for tangible household items. On top of this, a language interface was integrated,
utilizing syntactic parsing via spaCy, allowing us to
transform user queries into MemNet queries. For
more information, refer to (Deigmöller et al., 2022).
4 REASONING AND
GENERALIZATION
The MemNet database now integrates information
on both abstract and everyday concepts, objects, in-
stances, and the pivotal action patterns linking them.
Importantly, every concept, object, and action pattern
has an associated natural language expression (”lem-
mas”).
Envision this typical reasoning process: A user
poses a query with an incomplete action pattern. Per-
haps they’re asking about the object being acted upon,
a related tool, its common location, or the acting
agent. A hint typically suggests the missing piece,
which might be an agent, object, or tool. Questions
could be phrased as, ”Which tool can I use to cut
onions?” (here, the missing element is a tool) or ”Who
can take that?” (where an agent is the missing link).
Addressing communication uncertainties, ambi-
guities, or general underspecifications is crucial. Our
method utilizes the graph structure, transforming
the challenge of pinpointing compatible action pat-
terns into an approximate subgraph matching prob-
lem—governed by semantic distance. This distance
is gauged by the connectivity of the knowledge en-
gine. We’re primarily searching for closely connected
action patterns, enabling the system to generalize and
tap into the depth of the MemNet graph.
The reasoning process is tri-phasic: associa-
tion, generalization, and filtering. In the association
phase, lemmas guide towards related concepts in our
database. In generalization, a broad spectrum of se-
mantically linked action patterns is explored. Finally,
these patterns are verified based on a) the semantic
distance between their concepts and the initial hints,
and b) their appropriateness in the given context, like
item availability or achieving a goal state. Prioritiz-
ing by semantic distance refines the focus on the most
relevant options.
4.1 Action Pattern Inference
In a query such as ”What can I use to cut onions?”,
our language understanding identifies ”cut” as the ac-
tion’s lemma and ”onions” as the object’s lemma.
Given the question structure, we infer that a tool is
being sought, initializing the tool’s lemma as ”any-
thing”. Since lemmas are tied to concepts, the ini-
tial step is identifying all concepts linked with ”cut”,
”onions”, and ”anything”. This identifies our seed
concepts for actions, objects, and tools (lines 1-3 in
Listing 1). Owing to language’s inherent ambigu-
ity, there might be multiple seed concepts for each
lemma.
Next, these seed concepts link to more specific
subconcepts and broader superconcepts through spe-
cialization links. For instance, ”cutting” can be gen-
eralized as ”separating” or ”dividing” and specialized
as ”slicing” or ”cubing”. A similar procedure applies
to ”onions”. This step enlarges our pool of concept
candidates (lines 4-6).
We then identify matching action patterns by en-
suring they integrate an action, object, and tool con-
cept, each stemming from the respective set of con-
cept candidates or their specialized versions in the ac-
tion pattern (line 7).
Finally, line 8 directs us to the action patterns’
originating action, object, and tool concepts from the
candidate concept sets. These foundational concepts
guide us in identifying real-world instances fit for the
action patterns. For example, in the question ”What
can I use to cut onions?”, while a real-world object
like a ”knife” is a viable candidate, it isn’t involved in
any specified action yet.
Listing 1: Starting from lemmas for action pattern elements
like action, object and tool, search the knowledge graph for
semantically related action patterns.
1 action_seed = Get concept with lemma action_lemma
2 object_seed = Get concept with lemma object_lemma
3 tool_seed = Get concept with lemma tool_lemma
4 action_concept_candidate = Get sub+superconcepts of action_seed
5 object_concept_candidate = Get sub+superconcepts of object_seed
6 tool_concept_candidate = Get sub+superconcepts of tool_seed
7 (action, object, tool, ...) = Get all APs where action is subconcept of any of action_concept_candidate AND object is subconcept of any of object_concept_candidate AND tool is subconcept of any of tool_concept_candidate
8 (action_concept, object_concept, tool_concept, ...) = Get corresponding superconcepts closest to (action, object, tool, ...) from the concept candidate pools
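The following Python sketch runs the steps of Listing 1 over a small hand-made toy graph; the graph, the lemma table, and the action patterns are invented for illustration, and the real MemNet API differs.

from collections import deque

# Toy specialization graph: concept -> direct superconcepts.
parents = {
    "slice.v.01": ["cut.v.01"], "cut.v.01": ["separate.v.01"],
    "onion.n.01": ["vegetable.n.01"], "vegetable.n.01": ["food.n.01"],
    "kitchen_knife.n.01": ["knife.n.01"], "knife.n.01": ["cutting_tool.n.01"],
    "cutting_tool.n.01": ["tool.n.01"],
}
children = {}
for c, ps in parents.items():
    for p in ps:
        children.setdefault(p, []).append(c)

lemma_to_concepts = {"cut": ["cut.v.01"], "onion": ["onion.n.01"],
                     "anything": ["tool.n.01"]}

def closure(start, links):
    """All concepts reachable from start by following the given links."""
    out, queue = set(), deque([start])
    while queue:
        for n in links.get(queue.popleft(), []):
            if n not in out:
                out.add(n)
                queue.append(n)
    return out

def candidates(seeds):
    """Listing 1, lines 4-6: seeds plus their sub- and superconcepts."""
    out = set(seeds)
    for s in seeds:
        out |= closure(s, parents) | closure(s, children)
    return out

action_cands = candidates(lemma_to_concepts["cut"])       # lines 1 and 4
object_cands = candidates(lemma_to_concepts["onion"])     # lines 2 and 5
tool_cands = candidates(lemma_to_concepts["anything"])    # lines 3 and 6

# Line 7: keep action patterns whose elements fall into the candidate pools.
action_patterns = [("slice.v.01", "vegetable.n.01", "knife.n.01"),
                   ("drink.v.01", "beverage.n.01", "drinking_vessel.n.01")]
matches = [ap for ap in action_patterns
           if ap[0] in action_cands and ap[1] in object_cands and ap[2] in tool_cands]
print(matches)   # [('slice.v.01', 'vegetable.n.01', 'knife.n.01')]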
4.2 Generalization and Ranking
Action patterns, retrieved as per listing 1, match the
initial lemmas. Moreover, by associating these with
more abstract or specialized concepts, the system can
retrieve a broader range of action patterns semanti-
cally linked to the user’s query. For instance, ask-
ing ”What can I use to cut onions?” might yield re-
sults not only specific to ”cutting onions with a knife”
but also more generalized or specialized patterns like
”separating vegetables with a cutting utensil” or ”cub-
ing vegetables with a paring knife”, provided they ex-
ist in the knowledge base.
Once these patterns are retrieved, their semantic
proximity is evaluated. To achieve this, the shortest
abstraction or specialization paths between the seed
concepts and the concepts currently populating the
AP candidates are identified (lines 1-3 in Listing 2).
These paths then inform the computation of element-
wise abstraction or specialization distances (lines 4-
9).
For every action pattern, the total semantic dis-
tance is computed by summing up their individual
element-wise distances (lines 10-11). These distances
guide the ranking of the action patterns. Special-
ization is straightforward since a specialized AP is
wholly compatible with a broader one. In contrast,
abstracting creates broader APs, which can be less ac-
curate in specific contexts.
The resultant ranking arranges the responses, fo-
cusing on the most pertinent ones, typically those
closest to the initial query. While highly abstract
APs like ”manipulate solid food with a tool” might
be valid, they may not be relevant. On the other hand,
extremely specialized APs, like "slicing those shallots with a particular slicer only Tom owns", might not resonate with the broader context.
Listing 2: Ranking of action patterns by semantic similarity.
1 action_path = Get shortest sub+superconcept path between action_seed and action_concept
2 object_path = Get shortest sub+superconcept path between object_seed and object_concept
3 tool_path = Get shortest sub+superconcept path between tool_seed and tool_concept
4 action_abstraction_distance = length(action_path) if action_concept is more abstract than action_seed, 0 otherwise
5 object_abstraction_distance = length(object_path) if object_concept is more abstract than object_seed, 0 otherwise
6 tool_abstraction_distance = length(tool_path) if tool_concept is more abstract than tool_seed, 0 otherwise
7 action_specialization_distance = length(action_path) if action_concept is specialized from action_seed, 0 otherwise
8 object_specialization_distance = length(object_path) if object_concept is specialized from object_seed, 0 otherwise
9 tool_specialization_distance = length(tool_path) if tool_concept is specialized from tool_seed, 0 otherwise
10 ap_abstraction_distance = weighted sum of (action_abstraction_distance, object_abstraction_distance, tool_abstraction_distance, ...)
11 ap_specialization_distance = weighted sum of (action_specialization_distance, object_specialization_distance, tool_specialization_distance, ...)
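A corresponding Python sketch of Listing 2 over the same kind of toy graph is given below; path lengths along the specialization links stand in for MemNet's sub/superconcept paths, and the per-element weights are arbitrary assumptions.

# Toy specialization graph as above: concept -> direct superconcepts.
parents = {
    "slice.v.01": ["cut.v.01"], "cut.v.01": ["separate.v.01"],
    "onion.n.01": ["vegetable.n.01"], "vegetable.n.01": ["food.n.01"],
    "kitchen_knife.n.01": ["knife.n.01"], "knife.n.01": ["cutting_tool.n.01"],
    "cutting_tool.n.01": ["tool.n.01"],
}

def path_up(src, dst):
    """Number of specialization steps from src up to dst, or None if dst is not above src."""
    steps, frontier = 0, {src}
    while dst not in frontier:
        frontier = {p for c in frontier for p in parents.get(c, [])}
        if not frontier:
            return None
        steps += 1
    return steps

def element_distances(seed, concept):
    """Listing 2, lines 4-9: abstraction vs. specialization distance of one element."""
    up = path_up(seed, concept)      # concept is more abstract than the seed
    down = path_up(concept, seed)    # concept is specialized from the seed
    return (up or 0), (down or 0)

def ap_distances(seeds, ap, weights=(1.0, 1.0, 1.0)):
    """Listing 2, lines 10-11: weighted sums over the pattern elements."""
    abstraction = specialization = 0.0
    for w, seed, concept in zip(weights, seeds, ap):
        a, s = element_distances(seed, concept)
        abstraction += w * a
        specialization += w * s
    return abstraction, specialization

# "cut onions with anything" against the AP "slice vegetables with a knife".
seeds = ("cut.v.01", "onion.n.01", "tool.n.01")
ap = ("slice.v.01", "vegetable.n.01", "knife.n.01")
print(ap_distances(seeds, ap))   # (1.0, 3.0) for this toy graph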
4.3 Retrieval of Instance Candidates
Once we have a ranked set of compatible action patterns, we can proceed to check whether there are concrete items in the environment that might be compatible with those actions. For this purpose, we search for concepts that are specializations of the action, object, tool, ... concepts of the APs and which fulfill certain additional criteria.
Since one constraining criterion is given by the seed information itself (e.g., if the user asks "What knife can I use to cut onions?", the category "knife" of the tool is already given as a seed), we determine the pairwise lowest concept (in terms of specialization distance) between the seed concepts and the concepts of the AP elements, see lines 1-3 of Listing 3. From that starting point, in lines 4-6 we proceed to find specialized versions of those concepts.
Let us concentrate on the case of a requested tool
in ”What can I use to cut onions?”. From the request,
there is no further constraint on the tool, so the tool
seed can be anything that is usable as a tool. One
might consider e.g. ”knife”, ”cutting tool”, ”kitchen
knife” and ”swiss army knife” as tool concepts being
part of corresponding AP’s. Searching for instances
of these concepts in a real scene might reveal that
there is no ”swiss army knife”, but several instances
of kitchen knives and other cutting tools. The semantic distance evaluation finally reveals that a kitchen knife might be the best candidate for use as a tool.
Listing 3: Retrieval of suitable candidates for action pattern
elements.
1 action_instance_seed = Get pairwise lowest concept between action_seed and action_concept
2 object_instance_seed = Get pairwise lowest concept between object_seed and object_concept
3 tool_instance_seed = Get pairwise lowest concept between tool_seed and tool_concept
4 action_instance_candidate = Get subconcepts from action_instance_seed
5 object_instance_candidate = Get subconcepts from object_instance_seed
6 tool_instance_candidate = Get subconcepts from tool_instance_seed
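The scene check described above can be sketched in Python as follows; the scene inventory and concept names are made-up examples, and the subconcept expansion of Listing 3 (lines 4-6) is collapsed into a simple lookup.

# Keep only those AP tool concepts that have concrete instances in the scene.
scene_instances = {
    "kitchen_knife.n.01": ["kitchen_knife_01", "kitchen_knife_02"],
    "scissors.n.01": ["scissors_01"],
}

tool_concepts = ["swiss_army_knife.n.01", "kitchen_knife.n.01"]

def instance_candidates(concepts, scene):
    """Map each candidate concept to its instances present in the scene."""
    return {c: scene[c] for c in concepts if c in scene}

print(instance_candidates(tool_concepts, scene_instances))
# {'kitchen_knife.n.01': ['kitchen_knife_01', 'kitchen_knife_02']}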
5 EXPERIMENTAL RESULTS
To evaluate the reasoning and generalization perfor-
mance of the knowledge engine, we compiled a set
of 109 questions related to our virtual home environ-
ment and asked 20 subjects to answer them. More
specifically, we asked which object/tool present in the environment might be useful for executing a certain action on another specific object, e.g. "What can I use to drink juice?". The sub-
jects could pick any candidate from the list of 47 ob-
ject/tool types (indicated in Fig. 1), and order them
from most to least applicable. The constraint was to
provide at least 3 candidates for each question. For
the benchmark, we collected the votes for each can-
didate, finally resulting in an ordered answer set for
each question ranked from high to low number of
votes. Answers with fewer than 3 votes were excluded from the benchmark. The goal of the evaluation
is to find out: (1) if the algorithm correctly identifies,
i.e. disambiguates the right action-object-tool context
from lemmas; (2) how well the results match with the
expectations of the human subjects. We could also
phrase this as identifying the shared context understanding between human and machine, as stated in Section 1.
To judge the knowledge engine performance, we
compared two more approaches with the user answers. One is based on a fine-tuned neural language model and the other provides a baseline that is as simple as possible. Both are explained below in further detail.
Figure 1: Excerpt of the tool icons (top) and an excerpt of the questionnaire (bottom). The questions refer to tools available as tool icons that could be applied for a given question stating an action and object, like "What can I use to drink coffee?". The subjects had to select at least three possible answer candidates from the tool icons, which correspond to the objects available in our simulator scene. The candidates had to be assigned to one of the categories from preferred (green) to least applicable (red).
5.1 Baseline
To generate a baseline answer set, we used the distri-
bution of the human answers. Based on the histogram
over all answers for all questions, we sorted the tool
candidates by the number of votes. This ranked order
is used as the standard answer set for every question. That is, we always give the same answer, independent of the question, based on the frequency of the candidates in the overall user answers. Although this approach is simple, it is a quite reasonable strategy for choosing a correct answer when the system is agnostic of the question, because our set of answer candidates, i.e. instances, is limited.
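A minimal Python sketch of this baseline (with invented vote data) is:

from collections import Counter

# Rank tools by their overall vote frequency across all questions and
# return that same ranking for every question.
all_votes = ["knife", "glass", "knife", "bowl", "knife", "glass", "pan"]  # invented

baseline_ranking = [tool for tool, _ in Counter(all_votes).most_common()]

def baseline_answer(question: str, k: int = 3):
    """Question-agnostic answer: always the k globally most frequent tools."""
    return baseline_ranking[:k]

print(baseline_answer("What can I use to drink juice?"))   # ['knife', 'glass', 'bowl']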
5.2 Neural Language Model
The Neural Language Model (NLM) we incorporated
is rooted in deep learning. We framed training and testing of the NLM as a binary classification problem so that the NLM could cope with a small amount of training data.
The dataset employed to train the NLM was
sourced from the same set of information archived in
MemNet, detailed in Section 3. Leveraging triples
from ConceptNet such as [bookshelf]-[used for]-
[storing books] or their equivalent representations in
MemNet's action patterns, we automatically generated a question-answer pair: "What can I use to store a book?" paired with the answer "bookshelf". This provided one labeled instance for supervised training, tagged with the output "1". Simi-
larly, negative samples for this question statement are generated by randomly selecting from other tool candidates, e.g. "glass", which then gets an output value of 0.
Figure 2: Match score (top) and F1 score (bottom) for increasing number of answers, comparing three approaches: the knowledge engine (MemNet), the language model and the baseline.
The model was constructed by fine-tuning with
the above training data using BERT (Devlin et al.,
2018). For each positive sample, 20 negative examples were generated, resulting in a total of 6030 negative examples.
Upon completion of the training phase, the refined
neural model produced a value within the range [0, 1]
for every question-tool combination. This value was
interpreted as the likelihood of the statement being
accurate. For every query posed, we then arranged
all plausible tool responses based on this computed
likelihood. This provided a hierarchical list of poten-
tial responses, facilitating a side-by-side comparison
with the datasets generated by human participants and
the MemNet.
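For illustration, a minimal fine-tuning and scoring sketch with the Hugging Face Transformers library is shown below; it reflects the binary question-tool setup described above, but the authors' hyperparameters, data handling, and exact training procedure are not reproduced, and the two sample pairs are invented.

import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Positive and negative question-tool pairs, as generated from the MemNet facts.
pairs = [("What can I use to store a book?", "bookshelf", 1),
         ("What can I use to store a book?", "glass", 0)]
questions = [q for q, _, _ in pairs]
tools = [t for _, t, _ in pairs]
labels = torch.tensor([y for _, _, y in pairs])
inputs = tokenizer(questions, tools, padding=True, truncation=True, return_tensors="pt")

# One fine-tuning step; a real run iterates over the full labeled dataset.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
loss = model(**inputs, labels=labels).loss
loss.backward()
optimizer.step()

# Scoring: the probability of label "1" ranks all tool candidates per question.
model.eval()
with torch.no_grad():
    scores = torch.softmax(model(**inputs).logits, dim=-1)[:, 1]
print(scores)   # one value in [0, 1] per question-tool pair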
5.3 Evaluation Measures
For our evaluation, we have chosen two measures: the standard F1 score and a match score explained in more detail below. We introduced the match score measure for two reasons. First, we cannot map standard question answering measures, the most obvious being the mean reciprocal rank (MRR; Möller et al., 2022), to our problem setting. This is because we have a set of reference answers and a set of system answers, instead of only a single reference answer as used for the MRR. Second, we think that the match score reflects well the message of our investigation, which is how well the tool candidates a human has in mind for a given question overlap with the set of system responses.
The match score is motivated by set theory and is based on the intersection R ∩ S between the reference answer set R and the system answer set S. We count a positive match if there is any intersection between the two sets, i.e. |R ∩ S| > 0, otherwise we count it as a no match:

M = 1 if |R ∩ S| > 0, and M = 0 otherwise.

To analyze the ranking of answers, we compute M for an increasing set size of R and S between 1 and 10 (cf. Fig. 2).
Similar to the match score, we can use the sets for calculating the F1 score, where precision = |R ∩ S|/|S| and recall = |R ∩ S|/|R|. Again, we compute the F1 score for increasing set sizes, which gives us the performance curve (cf. Fig. 2, bottom).
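A small Python sketch of both measures for increasing set sizes is given below; the ranked answer lists are invented examples.

reference = ["knife", "scissors", "cutter", "axe"]         # human answers, ranked by votes
system = ["kitchen knife", "knife", "scissors", "spoon"]   # system answers, ranked

def match_and_f1(reference, system, k):
    """Compare the top-k reference answers with the top-k system answers."""
    R, S = set(reference[:k]), set(system[:k])
    inter = len(R & S)
    match = 1 if inter > 0 else 0
    precision = inter / len(S)
    recall = inter / len(R)
    f1 = 2 * precision * recall / (precision + recall) if inter else 0.0
    return match, f1

for k in range(1, 5):
    print(k, match_and_f1(reference, system, k))   # e.g. k=2 -> (1, 0.5)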
5.4 Comparison of Results
We compared three approaches against our reference
answers from the subjects. The match score in Fig. 2 (top) shows that the baseline method performs worst, far below the language model and the knowledge engine. Nevertheless, the baseline answers already provide a quite good guess, even though they are not related to the individual question.
The language model and the knowledge engine
perform comparably well considering the first 2 an-
swers only. If we increase our set of answers to 3 and
higher, we can see that the knowledge engine outper-
forms the language model, with a maximum differ-
ence of around 10%. With increasing set sizes, all ap-
proaches converge for the match score, which means
that the set size is large enough to cover at least one
correct answer in the reference data.
The plots in Fig. 3 show how many candidates
contribute to a positive match for the knowledge en-
gine (top) and the language model (bottom).
Figure 3: Number of correct answers for knowledge engine (top) and language model (bottom). It shows the number of correct candidates for increasing number of included answers, as in Fig. 2. The envelope matches the Match Score from Fig. 2.
The distribution indicates that the knowledge engine provides
significantly more correct answers in the earlier ranks
than the language model. This is especially the case
for matches including 4 (red) and 5 (purple) correct
candidates.
Looking at the F1 score in Fig. 2 (bottom), we
can observe a similar tendency as for the match score
(top). The difference between the knowledge engine and the language model becomes even more prominent. This is probably due to the difference in the number of correct candidates per set, which we already observed in Fig. 3.
6 CONCLUSION
In this paper, we introduced a generalizable inference method for Memory Nets, a type of knowledge graph, that reasons about actions using action patterns. These patterns are derived from common-sense facts from sources like ConceptNet and connect objects, tools, agents, or locations in an action context.
We assessed this method by comparing user and
system responses to scenario-based tool-related ques-
tions. Alongside, a language model trained on the
same common-sense facts was used for benchmark-
ing. Our findings from Section 5.4 highlight that our approach offers similar or better inference than the
language model. Notably, our method provides more
rational answers by working at a conceptual level,
while the language model hinges on word patterns. A
significant advantage of our system is its transparent
reasoning through the knowledge graph.
However, we aim beyond just comparison. Con-
ventional machine learning models, like BERT, de-
mand retraining for each unique task or situation.
While the underlying BERT language model might
contain latent representations about objects, tools and
actions from language correlation statistics, the embedding of the system into a specific situational context occurs via the training data, meaning that the system has to be retrained for every task and every situation.
ation. In contrast, Memory Nets assimilate changing
situations or new action patterns without retraining,
thanks to their graph-based inference mechanism.
In conclusion, we believe that a system which operates in dynamically changing situations and which deals with an incrementally growing knowledge base will not be possible with standard batch-learning based models but will require more flexible reasoning mechanisms like the one proposed in this paper. Nevertheless, targeting a hybrid solution that combines the advantages of both approaches is a promising direction.
REFERENCES
Ahn, M., Brohan, A., Brown, N., Chebotar, Y., Cortes, O.,
David, B., Finn, C., Fu, C., Gopalakrishnan, K., Haus-
man, K., et al. (2022). Do as i can, not as i say:
Grounding language in robotic affordances. arXiv
preprint arXiv:2204.01691.
Angles, R. (2018). The property graph database model. In
AMW.
Antoniou, C. and Bassiliades, N. (2022). A survey on se-
mantic question answering systems. The Knowledge
Engineering Review, 37.
Baker, C., Fillmore, C., and Lowe, J. (1998). The Berkeley FrameNet project. In Proceedings of COLING-ACL.
Berners-Lee, T., Hendler, J., and Lassila, O. (2001). The
semantic web. Scientific american, 284(5):34–43.
Das, A., Datta, S., Gkioxari, G., S. Lee, D. P., and
Batra, D. (2017). Embodied question answering.
https://arxiv.org/abs/1711.11543. Accessed: 2022-12-
16.
Deigmöller, J., Smirnov, P., Wang, C., Takeuchi, J., and Eggert, J. (2022). Situational question answering using memory nets. In 14th International Conference on Knowledge Engineering and Ontology Development (KEOD).
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.
(2018). Bert: Pre-training of deep bidirectional trans-
formers for language understanding. arXiv preprint
arXiv:1810.04805.
Diefenbach, D., Both, A., Singh, K., and Maret, P. (2018).
Towards a question answering system over the seman-
tic web (2018). arXiv preprint arXiv:1803.00832.
Diefenbach, D., Giménez-García, J., Both, A., Singh, K., and Maret, P. (2020). Qanswer kg: designing a portable question answering system over rdf data. In European Semantic Web Conference, pages 429–445. Springer.
Dubey, M., Banerjee, D., Abdelkawi, A., and Lehmann, J.
(2019). Lc-quad 2.0: A large dataset for complex
question answering over wikidata and dbpedia. In
International semantic web conference, pages 69–78.
Springer.
Eggert, J., Deigmöller, J., Fischer, L., and Richter, A.
(2019). Memory nets: Knowledge representation
for intelligent agent operations in real world. In In-
ternational Joint Conference on Knowledge Discov-
ery, Knowledge Engineering, and Knowledge Man-
agement, pages 260–282. Springer.
Eggert, J., Deigmöller, J., Fischer, L., and Richter, A.
(2020). Action representation for intelligent agents
using memory nets. In Communications in Computer
and Information Science, volume 1297. Springer.
Fischer, L., Hasler, S., Deigmöller, J., Schnürer, T., Red-
ert, M., Pluntke, U., Nagel, K., Senzel, C., Ploen-
nigs, J., Richter, A., et al. (2018). Which tool to
use? grounded reasoning in everyday environments
with assistant robots. In CogRob@ KR, pages 3–10.
Honnibal, M. and Montani, I. (2017). spaCy 2: Natural lan-
guage understanding with Bloom embeddings, con-
volutional neural networks and incremental parsing.
https://sentometrics-research.com/publication/72.
Accessed: 2022-12-16.
Losing, V., Fischer, L., and Deigmöller, J. (2021). Extrac-
tion of common-sense relations from procedural task
instructions using bert. In Proceedings of the 11th
Global Wordnet Conference, pages 81–90.
Lynch, C. and Sermanet, P. (2021). Translating natural
language instructions to computer programs for robot
manipulation. In Robotics Science and Systems.
Miller, G. A. (1998). WordNet: An electronic lexical
database. MIT press.
Möller, T., Pietsch, M., and Rusic, M. (2022). Met-
rics to evaluate a question answering system.
https://www.deepset.ai/blog/metrics-to-evaluate-a-
question-answering-system. Accessed: 2022-12-16.
Mühlig, M., Fischer, L., Hasler, S., and Deigmöller, J.
(2020). A knowledge-based multi-entity and cooper-
ative system architecture. In 2020 IEEE International
Conference on Human-Machine Systems (ICHMS),
pages 1–6. IEEE.
Paulius, D. and Sun, Y. (2018). A survey of knowledge rep-
resentation in service robotics. In Robotics and Au-
tonomous Systems.
Puig, X., Ra, K., Boben, M., Li, J., Wang, T., Fidler, S.,
and Torralba, A. (2018). Virtualhome: Simulating
household activities via programs. In Proceedings of
the IEEE Conference on Computer Vision and Pattern
Recognition, pages 8494–8502.
Sap, M., Le Bras, R., Allaway, E., Bhagavatula, C., Lourie,
N., Rashkin, H., Roof, B., Smith, N. A., and Choi, Y.
(2019). Atomic: An atlas of machine commonsense
for if-then reasoning. In Proceedings of the AAAI
conference on artificial intelligence, volume 33, pages
3027–3035.
Sendhoff, B. and Wersing, H. (2020). Cooperative intelli-
gence – a humane perspective. In 2020 IEEE Confer-
ence on Human-Machine Systems.
Shridhar, M., Thomason, J., Gordon, D., Bisk, Y., Han, W.,
Mottaghi, R., Zettlemoyer, L., and Fox, D. (2020).
Alfred: A benchmark for interpreting grounded in-
structions for everyday tasks. In Proceedings of the
IEEE/CVF conference on computer vision and pattern
recognition, pages 10740–10749.
Speer, R., Chin, J., and Havasi, C. (2017). Conceptnet 5.5:
An open multilingual graph of general knowledge. In
Thirty-first AAAI conference on artificial intelligence.
Srivastava, S., Li, C., Lingelbach, M., Martín-Martín, R.,
Xia, F., Vainio, K. E., Lian, Z., Gokmen, C., Buch,
S., Liu, K., et al. (2022). Behavior: Benchmark for
everyday household activities in virtual, interactive,
and ecological environments. In Conference on Robot
Learning, pages 477–490. PMLR.
Storks, S., Gao, Q., and Chai, J. Y. (2019). Commonsense
reasoning for natural language understanding: A sur-
vey of benchmarks, resources, and approaches. arXiv
preprint arXiv:1904.01172, pages 1–60.
Tandon, N., De Melo, G., and Weikum, G. (2017). We-
bchild 2.0: Fine-grained commonsense knowledge
distillation. In Proceedings of ACL 2017, System
Demonstrations, pages 115–120.
Thosar, M., Zug, S., Skaria, A. M., and Jain, A. (2018).
A review of knowledge bases for service robots in
household environments. In 6th International Work-
shop on Artificial Intelligence and Cognition.
Vassiliades, A., Bassiliades, N., and Patkos, T. (2021).
Commonsense reasoning with argumentation for cog-
nitive robotics. In RuleML+ RR (Supplement).
Venkatesh, S. G., Upadrashta, R., and Amrutur, B. (2021).
Translating natural language instructions to computer
programs for robot manipulation. In 2021 IEEE/RSJ
International Conference on Intelligent Robots and
Systems (IROS), pages 1919–1926. IEEE.
Zafar, H., Napolitano, G., and Lehmann, J. (2018). Formal
query generation for question answering over knowl-
edge bases. In European semantic web conference,
pages 714–728. Springer.