on more label instances and thus improve Model 2.
6.3 Relevance of the Word Embeddings
The use of the CamemBERT word embeddings also improved the results. The pollutant category (P) benefits the most from the use of these vectors. For comparison, with our preliminary model implementing the Word2vec method, the pollutant annotation precision was only 0.05, whereas with the current model the score has increased to 0.56 without lowering the recall. A pollutant expression is usually a nominal phrase, and it is very difficult to distinguish it from any other nominal component at the syntactic level. Moreover, unlike institution names or chemicals, pollutant expressions do not involve changes of word case or the use of nomenclatures. The most promising ways to recognize them are therefore to analyse the polarity (positive or negative) of the context and to model the meaning of the words themselves, both of which require complex semantic features. Unlike syntactic features, semantic features are hard to extract and hard for the algorithm to exploit. The CamemBERT model, which embeds semantic features in the form of word vectors, enables the neural network to learn annotation patterns at the semantic level. As a result, our model can recognize some typical pollutant expressions, such as tensio actif (surfactant) and other chemical products, which is exactly the information that must be extracted in order to build the memory of polluted sites.
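As an illustration, the short sketch below shows one way to obtain such contextual word vectors with the Hugging Face transformers library. It is a minimal example rather than our exact pipeline: the checkpoint name (camembert-base) and the input sentence are only assumptions chosen for demonstration.

```python
# Minimal sketch: contextual CamemBERT vectors for the tokens of a French
# sentence, which can then feed a downstream sequence-labelling network.
import torch
from transformers import CamembertTokenizerFast, CamembertModel

# Illustrative checkpoint; the paper does not specify the exact variant used.
tokenizer = CamembertTokenizerFast.from_pretrained("camembert-base")
model = CamembertModel.from_pretrained("camembert-base")
model.eval()

# Illustrative example sentence, not taken from the corpus.
sentence = "Un tensioactif a été détecté dans les eaux usées du site."
encoded = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**encoded)

# One contextual vector per subword piece; shape: (1, sequence_length, 768).
token_vectors = outputs.last_hidden_state
print(token_vectors.shape)
```

Each subword piece receives a vector that depends on its sentence context, which is what allows semantically similar pollutant mentions to receive similar representations even when their surface forms differ.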
7 CONCLUSION
In this paper, we have described an approach for event-related information extraction from a corpus focused on industrial pollution. With a supervised deep learning method, we trained two models that can simulate our manual annotation of industrial event features. The models trained with Bi-LSTM neural networks have given promising results, but they still need to detect event triggers and industrial activities more reliably before they can be applied to other text resources. Given that the models are trained on only a small portion of the corpus and that the neural network configurations have not been fully explored, there is room for improvement. Aside from increasing the amount of training text and adjusting the neural network settings, it would also be interesting to see whether the model performs better with paragraphs instead of sentences as the input of the neural networks, since the narration of an event is not limited to a single sentence.
This work is devoted to the construction of the polluted site memory, based on a single consistent and complete database. Eventually, the event-related information extracted by the models will be inserted into this database. For future work, we will apply a syntactic parser to link the extracted event features through dependency relations, and train a classifier to categorize the events, so that they can be integrated into the database with an appropriate structure. The models will also be tested and used on other corpora in the domain of industrial pollution, in order to connect other sources of data and enrich the polluted site memory.
REFERENCES
Arnulphy, B. (2012). Désignations nominales des événements: étude et extraction automatique dans les textes. PhD thesis, Université Paris 11.
Basaldella, M., Antolli, E., Serra, G., and Tasso, C. (2018).
Bidirectional LSTM Recurrent Neural Network for
Keyphrase Extraction, pages 180–187. Springer.
Battistelli, D., Charnois, T., Minel, J.-L., and Teissèdre, C. (2013). Detecting salient events in large corpora by a combination of NLP and data mining techniques. In Conference on Intelligent Text Processing and Computational Linguistics, volume 17(2), pages 229–237, Samos, Greece.
Lecolle, M. (2009). Éléments pour la caractérisation des toponymes en emploi événementiel. In Evrard, I., Pierrard, M., Rosier, L., and Raemdonck, D. V., editors, Les sens en marge: Représentations linguistiques et observables discursifs, pages 29–43. L'Harmattan.
Martin, L., Muller, B., Ortiz Suárez, P. J., Dupont, Y., Romary, L., de la Clergerie, É., Seddah, D., and Sagot, B. (2020). CamemBERT: a tasty French language model. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7203–7219, Online. Association for Computational Linguistics.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013).
Efficient estimation of word representations in vector
space.
Panchendrarajan, R. and Amaresan, A. (2018). Bidi-
rectional LSTM-CRF for named entity recognition.
In Proceedings of the 32nd Pacific Asia Conference
on Language, Information and Computation, Hong
Kong. Association for Computational Linguistics.
Schmid, H. (1994). Probabilistic part-of-speech tagging us-
ing decision trees.
Shin, H. J., Park, J. Y., Yuk, D. B., and Lee, J. S. (2020).
BERT-based spatial information extraction. In Pro-
ceedings of the Third International Workshop on Spa-
tial Language Understanding, pages 10–17, Online.
Association for Computational Linguistics.