Data Mining for Automatic Linguistic Description of Data

Textual Weather Prediction as a Classiﬁcation Problem

J. Janeiro, I. Rodriguez-Fdez, A. Ramos-Soto and A. Bugarín

CITIUS, University of Santiago de Compostela, Santiago de Compostela, Spain

Keywords:

Linguistic Descriptions of Data, Natural Language Generation, Weather Forecasting, Classiﬁcation.

Abstract:

In this paper we present the results and performance of five different classifiers applied to the task of

automatically generating textual weather forecasts from raw meteorological data. This methodology

applies to template-based forecasts, which can be transformed into an intermediate language

that can be directly mapped to classes (or values of variables). Experimental validation and tests of statistical

signiﬁcance were conducted using nine datasets from three real meteorological publicly accessible websites,

showing that RandomForest, IBk and PART are statistically the best classifiers for this task in terms of F-Score,

with RandomForest providing slightly better results.

1 INTRODUCTION

Weather forecasting has been one of the most sci-

entiﬁcally and technologically challenging problems

around the world in the last century. To make an ac-

curate prediction is one of the major challenges mete-

orologists face on a daily basis. Weather forecasts are

made by collecting quantitative data about the current

state of the atmosphere at a given place and using

scientific understanding of atmospheric processes to

project how the atmosphere will evolve at that place.

Modern weather forecasting is largely based on

numerical weather predictions (NWP), which essen-

tially are massive atmosphere simulations run on su-

percomputers. The output of NWP models is a set of

predictions of meteorological parameters or variables

(wind speed, temperature, precipitation, etc.) for various

spatial locations and at various points in time.

Weather forecasting organizations take NWP data

and modify it according to their local knowledge and

expertise. They also interpolate between the locations

in the source NWP model, again using local knowl-

edge and expertise. The result is a modiﬁed set of

predicted numerical weather values, for locations of

interest to their customers.

Initially, the NWP data was used by expert mete-

orologists to manually describe the weather forecast

using texts for different places. With the increasing

accuracy of predictions and the need to generate tex-

tual forecasts for a large number of locations, weather

forecasting organizations require solutions which au-

tomatically build these texts.

There are several ofﬁcial meteorological agen-

cies that offer weather forecast services, such as the

Spanish AEMET (AEMET, 2014), American NWS

(NWF, 2014) or the British Met Ofﬁce (MetOfﬁce,

2014b). Other private organizations like Weather-

Forecast (WeatherForecast, 2014) or Intellicast (Intel-

licast, 2014) offer their own forecast services. Some

of them provide forecast data for speciﬁc domains,

such as skiing or surfing, allowing users to find the

best conditions in which to perform these kinds of

activities. Furthermore, due to the need to provide

textual forecasts to an increasing number of locations,

some meteorological agencies started offering auto-

matically generated forecast texts. For instance, in the

1990s, NLG systems such as FoG (Goldberg et al.,

1994) and MultiMeteo (Coch, 1998) were used by

meteorological agencies to provide this kind of infor-

mation services. More recently, the Met Ofﬁce with

Data2Text (MetOfﬁce, 2014a) or the Galician Meteo-

Galicia with GALiWeather (Ramos Soto et al., 2014)

are also employing this sort of technology to address

the creation of textual forecasts for increasing quanti-

ties of localized data.

Several techniques can be used for automated gen-

eration of weather forecast texts. These techniques

can be divided into two broad categories: knowledge-

intensive (KI) and knowledge-light (KL) approaches

(Adeyanju, 2012). KI approaches require extensive

consultation with domain experts during data analysis

and throughout the text generation approach development

process. On the other hand, KL approaches

rely more on the use of automated methods which are

mainly statistical.

Janeiro J., Rodriguez-Fdez I., Ramos-Soto A. and Bugarín A.

Data Mining for Automatic Linguistic Description of Data - Textual Weather Prediction as a Classification Problem.

DOI: 10.5220/0005282905560562

In Proceedings of the International Conference on Agents and Artificial Intelligence (ICAART-2015), pages 556-562

ISBN: 978-989-758-074-1

© 2015 SCITEPRESS (Science and Technology Publications, Lda.)

The earliest KI systems generated forecast texts

by inserting numeric values in standard manually-

created templates. Multiple templates are created

for each possible scenario and one of them is ran-

domly selected during text generation to provide va-

riety. Other KI systems developed linguistic models

using manually-authored rules obtained from domain

experts and corpus analysis.

The KL approach to generate forecast texts typ-

ically employs machine learning techniques. Train-

able systems are built using models based on statisti-

cal methods such as probabilistic context-free gram-

mars and phrase based machine translation. The ad-

vantage is that systems are built in less time and with

less human effort as compared to the KI approach.

In this paper we consider forecast services with a

KI approach and use these templated textual forecasts

to obtain linguistic predictions presented as a classi-

ﬁcation problem to generate natural language (NLG)

descriptions. The paper is organized as follows: in

section 2 we present the problem and the different

types of automatic textual forecasts. In section 3 we

provide the steps needed to solve it using classiﬁca-

tion techniques. In section 4 we explain the ﬁve dif-

ferent classiﬁcation techniques tested, the results ob-

tained for each one and a statistical comparison be-

tween them and, ﬁnally, we present the most relevant

conclusions of this approach.

2 LINGUISTIC WEATHER

PREDICTIONS AS AN NLG

PROBLEM

The generation of natural language text uses the NWP

data and additional expert information to generate

textual weather forecasts that are issued to the public.

There are two main approaches for generating textual

forecasts automatically (Van Deemter et al., 2005):

• Template-based systems are natural language generating

systems that map their non-linguistic input

directly to the linguistic surface structure. This

linguistic structure may contain gaps that must be

ﬁlled with linguistic structures that do not contain

gaps. For example, in a template such as “[amount]

rain at [time]”, the gaps represented by

[amount] and [time] can be filled with information

from the data.

• Standard NLG systems, by contrast, use less di-

rect mapping between input and surface form.

These systems could start from the same input se-

mantic representation subjecting it to a number of

consecutive transformations until a surface struc-

ture results. Various NLG submodules would op-

erate on it, jointly transforming the representation

into an intermediate representation where lexical

items and style of reference have been determined

while linguistic morphology is still absent. This

intermediate representation may in turn be trans-

formed into a proper sentence in one of the avail-

able output languages.
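The gap-filling step of a template-based system can be sketched as follows. This is a minimal illustration, not the implementation used by any of the services discussed here; the function name `fill_template` and the example values are assumptions.

```python
import re


def fill_template(template: str, values: dict) -> str:
    """Replace every [gap] in the template with the matching value."""

    def substitute(match):
        key = match.group(1)
        if key not in values:
            raise KeyError(f"no value supplied for gap [{key}]")
        return str(values[key])

    # A gap is any word enclosed in square brackets, e.g. [amount].
    return re.sub(r"\[(\w+)\]", substitute, template)


# Hypothetical values; in a real system they would come from the NWP data.
text = fill_template("[amount] rain at [time]", {"amount": "Light", "time": "noon"})
print(text)  # Light rain at noon
```

Raising an error for an unfilled gap keeps the output free of leftover placeholders, which matches the requirement that gaps be filled with structures that contain no gaps themselves.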

The typical stages of natural language generation

systems (Reiter et al., 2000) are:

• Content determination: Deciding what informa-

tion to mention in the text.

• Document structuring: Overall organization of the

information to convey.

• Aggregation: Merging of similar sentences to im-

prove readability and naturalness.

• Lexical choice: Mapping words to concepts.

• Referring expression generation: Creating refer-

ring expressions that identify objects and regions.

• Realization: Creating the actual text, which

should be correct according to the rules of syntax,

morphology, and orthography.

The texts generated by these two approaches usu-

ally have a similar structure, from which we can ex-

tract the main information and apply data mining

techniques to the raw data to generate the same fore-

casts. To achieve this, we applied classiﬁcation algo-

rithms that learn the textual forecasts using data sam-

ples. In the next example, using the temperature data

values, we can learn the forecast text for the weekly

temperature:

• Full forecast: “Mostly dry. Warm. Mainly fresh

winds.”

• Daily temperature values (°C): 21, 22, 20, 19, 20, 18, 19

• Learned textual temperature value: “Warm”
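To make the idea concrete, the sketch below labels a week of temperatures with the class of its closest training sample (a 1-nearest-neighbour rule, the idea underlying instance-based classifiers such as IBk). Only the first training row is taken from the example above; the other rows, and the pairing of their values with class labels, are invented for illustration.

```python
import math

# Seven daily temperatures (°C) paired with the textual class.
# Only the first row comes from the example; the rest are hypothetical.
TRAINING = [
    ([21, 22, 20, 19, 20, 18, 19], "warm"),
    ([12, 13, 11, 12, 14, 13, 12], "very mild"),
    ([1, -2, 0, 2, -1, 1, 0], "freeze-thaw conditions"),
]


def classify(week):
    """Return the label of the training sample closest to `week`."""

    def distance(sample):
        features, _label = sample
        return math.dist(week, features)  # Euclidean distance

    return min(TRAINING, key=distance)[1]


print(classify([20, 21, 21, 19, 18, 19, 20]))  # warm
```

With real data the training set would contain many samples per class, and the classifier would be chosen and evaluated as described in the experimental section.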

3 LINGUISTIC PREDICTIONS AS

A CLASSIFICATION PROBLEM

Of the two approaches for automatically generating

textual forecasts explained above, we selected the

template-based forecasts since they are more abundant

and they have a more regular structure that allows

us to extract the relevant information from the text.

To test the classiﬁcation of these forecasts we need

DataMiningforAutomaticLinguisticDescriptionofData-TextualWeatherPredictionasaClassificationProblem

557

to transform the textual forecast into a class, extract-

ing the relevant information and building descriptive

phrases. We selected three different forecast services from the

web that offer NWP data and a descriptive, template-based

textual forecast. Then we transformed these

textual forecasts into classes and used them along

with the raw meteorological data to perform the clas-

siﬁcation.
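One way to pair the raw values with the extracted class is to write them in ARFF, the text format read by Weka. The sketch below is an assumption about how such a file could be produced, not the authors' actual tooling; the relation name, attribute names and the second sample are hypothetical.

```python
def to_arff(relation, feature_names, samples):
    """Build ARFF text from a list of (numeric_features, class_label) pairs."""
    labels = sorted({label for _, label in samples})
    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute {name} numeric" for name in feature_names]
    # Nominal values that contain spaces must be quoted in ARFF.
    quoted = ",".join(f"'{label}'" for label in labels)
    lines += [f"@attribute class {{{quoted}}}", "", "@data"]
    for features, label in samples:
        values = ",".join(str(v) for v in features)
        lines.append(f"{values},'{label}'")
    return "\n".join(lines) + "\n"


# Hypothetical attribute names: one temperature value per day of the week.
days = [f"temp_day{i}" for i in range(1, 8)]
samples = [
    ([21, 22, 20, 19, 20, 18, 19], "warm"),
    ([12, 13, 11, 12, 14, 13, 12], "very mild"),  # invented sample
]
arff = to_arff("temperature", days, samples)
print(arff)
```

The sketch assumes class labels contain no quote characters; labels with apostrophes would need escaping before being written.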

3.1 Weather-forecast Dataset

Weather-Forecast (WeatherForecast, 2014) uses the

Global Forecast System from the National Oceanic

and Atmospheric Administration (NOAA) to get its

raw forecast data and uses its own computers to

generate the actual forecasts. Their textual forecasts

include information about precipitation, temperature

and wind, as shown in the example that follows:

Mostly dry. Warm (max 29°C on Tue afternoon, min 23°C on Wed night). Wind will be generally light.

From this forecast service we can extract three

datasets (one for each of the variables considered),

as indicated in table 1. The selected samples from

this service come from different locations worldwide.

Some examples of the classes considered are:

• Precipitation: “mostly dry”, “light rain”, “some

drizzle”, “moderate rain”.

• Temperature: “warm”, “very mild”, “freeze-

thaw conditions”.

• Wind: “generally light”, “increasing light to

fresh winds”, “mainly fresh”, “decreasing fresh

to calm”.
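The transformation of a Weather-Forecast text into one class per variable can be sketched as below. The parsing rules (dropping the parenthesised numeric details and the fixed "Wind will be" lead-in) are assumptions inferred from the single example above and would need extending for other phrasings.

```python
import re


def forecast_to_classes(text: str) -> dict:
    """Split a three-sentence forecast into one class per variable."""
    # Drop parenthesised numeric details such as "(max 29°C on Tue afternoon, ...)".
    cleaned = re.sub(r"\s*\([^)]*\)", "", text)
    # Split into sentences and normalise case and whitespace.
    sentences = [s.strip().lower() for s in cleaned.split(".") if s.strip()]
    if len(sentences) != 3:
        raise ValueError(f"expected 3 sentences, found {len(sentences)}")
    precipitation, temperature, wind = sentences
    # Assumed convention: remove the fixed lead-in of the wind sentence.
    wind = re.sub(r"^wind will be\s+", "", wind)
    return {"precipitation": precipitation, "temperature": temperature, "wind": wind}


example = ("Mostly dry. Warm (max 29°C on Tue afternoon, "
           "min 23°C on Wed night). Wind will be generally light.")
print(forecast_to_classes(example))
```

For the example forecast this yields the classes "mostly dry", "warm" and "generally light", one for each of the three datasets.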

3.2 National Weather Service Dataset

The National Weather Service (NWF, 2014) is a com-

ponent of the National Oceanic and Atmospheric Ad-

ministration (NOAA). They provide weather, water,

and climate data, forecasts and warnings for the U.S.

territory. Their textual forecasts include information

about precipitation, cloud coverage and wind:

A chance of showers, mainly before 11pm. Mostly

cloudy, with a low around 60. West wind 3 to 5 mph.

From this forecast we can extract three datasets as

indicated in table 2. The selected samples from this

service come from different locations in the United

States of America. Some class examples considered

from this dataset are:

• Precipitation: “chance showers”, “showers

likely”, “scattered showers and thunderstorms”,

“slight chance showers then slight chance show-

ers and thunderstorms”.

• Cloud coverage: “mostly cloudy”, “partly

sunny”, “mostly clear”, “sunny and hot”.

• Wind: “west”, “calm becoming west”, “north-

west becoming calm”, “west becoming north-

east”.
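A comparable sketch for the National Weather Service texts is shown below. The rules used here (keeping the text before the first comma, removing a small stopword list, and keeping only the words that precede "wind") are assumptions chosen to reproduce the classes of the example above; they do not cover compound classes such as "calm becoming west".

```python
STOPWORDS = {"a", "an", "of", "the"}  # hypothetical minimal stopword list


def head_phrase(sentence: str) -> str:
    """Lower-case the text before the first comma and drop stopwords."""
    words = sentence.split(",")[0].lower().split()
    return " ".join(w for w in words if w not in STOPWORDS)


def nws_to_classes(text: str) -> dict:
    """Split a three-sentence NWS forecast into one class per variable."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    if len(sentences) != 3:
        raise ValueError(f"expected 3 sentences, found {len(sentences)}")
    precipitation, clouds, wind = sentences
    wind_words = wind.lower().split()
    # Keep only the words that precede "wind" (the direction).
    if "wind" in wind_words:
        wind_words = wind_words[: wind_words.index("wind")]
    return {
        "precipitation": head_phrase(precipitation),
        "cloud_coverage": head_phrase(clouds),
        "wind": " ".join(wind_words),
    }


example = ("A chance of showers, mainly before 11pm. "
           "Mostly cloudy, with a low around 60. West wind 3 to 5 mph.")
print(nws_to_classes(example))
```

For the example forecast this produces "chance showers", "mostly cloudy" and "west".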

3.3 Intellicast Dataset

Intellicast (Intellicast, 2014) delivers site-speciﬁc

forecasts for 60,000 sites in the U.S. and around the

globe, ranging from detailed local forecasts to hurricane

tracks, severe weather warnings and international

conditions. Their textual forecasts include informa-

tion about cloud coverage, precipitation, temperature

and wind, for example:

Partly cloudy skies. Hot. High 93F. Winds WSW

at 10 to 20 mph.

From this forecast service we can extract three

datasets, one of which includes both cloud coverage

and precipitation information, as indicated in table 3.

The selected samples from this service come from dif-

ferent locations worldwide. Some examples of the

classes considered are:

• Mixed cloud coverage and precipitation: “partly

cloudy”, “sunshine and clouds”, “partly cloudy

with thunderstorms”, “mix of clouds and sun with

the chance of isolated thunderstorm”.

• Temperature: “hot”, “warm”, “hot and humid”,

“very hot”.

• Wind: “WSW”, “light and variable”, “S decreasing”.

4 EXPERIMENTAL SETUP

We evaluated the performance of five different classification

techniques on the datasets from the three services introduced

previously, using the data mining software “Weka”

(Hall et al., 2009). We selected these techniques to cover

different types of supervised learning methods which,

in general, provide comprehensible visual models.

Other techniques such as Artiﬁcial Neural Networks

were not considered due to their black-box structure.

4.1 Classiﬁcation Methods

The classiﬁcation techniques applied are:

• J48 (Quinlan, 1993) is an open source Java im-

plementation of the C4.5 algorithm. C4.5 builds

decision trees from a set of training data. At each

node of the tree, C4.5 chooses the attribute of the

data that most effectively splits its set of samples

ICAART2015-InternationalConferenceonAgentsandArtificialIntelligence

558