1. Group sentences that share common N-grams.
2. Within each group, choose the sentence with the
largest number of key-words.
3. Generate the summary from the sentences chosen
in the previous step.
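The three steps above can be sketched as follows. This is a minimal
illustration, not the implementation evaluated in this paper: the helper
names (ngrams, group_sentences, keyword_count, summarize), the use of
word bigrams, whitespace tokenization, and the grouping of sentences by
transitively shared N-grams are all assumptions made for the example.

```python
from itertools import combinations


def ngrams(sentence, n=2):
    """Return the set of word N-grams of a sentence (lower-cased, whitespace tokens)."""
    words = sentence.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def group_sentences(sentences, n=2):
    """Step 1: put sentences that share at least one N-gram into the same group."""
    parent = list(range(len(sentences)))  # union-find over sentence indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    grams = [ngrams(s, n) for s in sentences]
    for i, j in combinations(range(len(sentences)), 2):
        if grams[i] & grams[j]:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(sentences)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def keyword_count(sentence, keywords):
    """Number of tokens of the sentence that are key-words."""
    return sum(1 for w in sentence.lower().split() if w in keywords)


def summarize(sentences, keywords, n=2):
    """Steps 2 and 3: pick the key-word-richest sentence per group, keep source order."""
    chosen = []
    for group in group_sentences(sentences, n):
        best = max(group, key=lambda i: keyword_count(sentences[i], keywords))
        chosen.append(best)
    return [sentences[i] for i in sorted(chosen)]
```

In this sketch a sentence that shares no N-gram with any other sentence
forms a group of its own and is therefore always selected; whether such
sentences should be kept is a design choice that the steps above leave
open.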
The corpora used contain information about emergency
situations, so numerical data is of particular
importance and will receive special attention. Such
information is very helpful for emergency work
specialists. The summary should contain it, and its
presence will be used in the evaluation process.
Finally, the main and most meaningful research should
be done on synonyms. Since the basic similarity is
calculated from the presence of common words in two
sentences, it is very important to add a synonyms
dictionary. Sentence S_A may contain the N-gram
"underground tremors" and sentence S_B the word
"earthquake"; the meanings of these N-grams are nearly
the same, but the implemented algorithm will not
recognize the similarity.
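To make the synonym problem concrete, the sketch below maps every known
synonym to one canonical term before two sentences are compared. The
dictionary SYNONYMS, the helper names, and the plain count of shared
words are hypothetical and serve only as an illustration; they are not
the implemented algorithm.

```python
# Hypothetical synonyms dictionary: each phrase maps to one canonical term.
SYNONYMS = {
    "underground tremors": "earthquake",
}


def normalize(sentence, synonyms=SYNONYMS):
    """Lower-case the sentence and replace each known synonym with its canonical term."""
    text = sentence.lower()
    # Replace longer phrases first so that multi-word synonyms are not split.
    # Simplification: plain substring replacement, no word-boundary check.
    for phrase in sorted(synonyms, key=len, reverse=True):
        text = text.replace(phrase, synonyms[phrase])
    return text.split()


def common_words(sentence_a, sentence_b, synonyms=SYNONYMS):
    """Basic similarity: number of distinct words shared by two sentences."""
    words_a = set(normalize(sentence_a, synonyms))
    words_b = set(normalize(sentence_b, synonyms))
    return len(words_a & words_b)
```

With an empty dictionary the two example sentences "Underground tremors
were felt in Almaty" and "The earthquake was felt in Almaty" share three
words; with the dictionary above they share four, because both now
contain the canonical term. The quality of such a measure depends
entirely on how complete the dictionary is.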
4 CONCLUSION
We carried out a study of existing work in the field
of automatic summary extraction. The implemented
algorithms were compared, and the results of this
comparison show the practical value of this work.
The results of the summary evaluation mostly matched
the comparison described in (Barrios et al., 2016).
General TextRank performed best, generating summaries
with the highest distribution of key-words: its
average key-words distribution is 0.180.
LongestCommonSubstring, the algorithm that is easiest
to implement, has a key-words concentration of 0.175.
The lowest distribution, 0.169, belongs to BM25.
Tests showed that using the presence of identical
words as the measure of sentence importance is not
suitable for all data. Firstly, we noticed that
unimportant N-grams repeated in several sentences
cause those sentences to be included in the summary.
Probably not all N-grams should participate in the
sentence similarity calculation. Secondly, synonyms
are not taken into account. Sentence S_A may contain
the N-gram "underground tremors" and sentence S_B the
word "earthquake"; the meanings of these N-grams are
nearly the same, but the implemented algorithm will
not recognize the similarity. However, the addition of
synonyms will depend on the existence of a dictionary
of synonyms and on its completeness. Thirdly,
numerical data is not taken into account. The corpora
used contain information about emergency situations,
so numerical data is of particular importance. In the
future we would like to continue research on
completely different or hybrid summary algorithms to
avoid the issues described above.
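As an illustration of the first and third issues, the sketch below drops
N-grams that consist only of stop-words before similarity is computed
and extracts numeric tokens so that they can be weighted separately. The
stop list STOP_WORDS, the number pattern, and the helper names are
hypothetical; this is one possible direction under those assumptions,
not a method evaluated in this work.

```python
import re

# Hypothetical stop list; a real one would come from the dictionary extraction step.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "at", "and", "was", "were"}

# Plain integers or decimals such as "40", "5.6" or "5,6".
NUMBER = re.compile(r"^\d+([.,]\d+)?$")


def informative_ngrams(sentence, n=2, stop_words=STOP_WORDS):
    """Word N-grams that contain at least one token outside the stop list."""
    words = sentence.lower().split()
    grams = (tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return {g for g in grams if any(w not in stop_words for w in g)}


def numeric_tokens(sentence):
    """Whitespace-separated tokens that are plain numbers."""
    return [w for w in sentence.split() if NUMBER.match(w)]
```

For the sentence "magnitude 5.6 earthquake damaged 40 houses",
numeric_tokens returns the tokens "5.6" and "40", which a summarizer
could use to favor sentences that carry figures.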
More research will be done on dictionary extraction,
the synonyms dictionary, and summary evaluation.
Dictionary extraction requires more work, since it is
very important in summary evaluation, and its open
problems, such as stop-words and stemming, should be
resolved. In most cases, all three algorithms cut off
useless information, leaving only the important part
that contains topic keywords.
REFERENCES
Barrios, F., López, F., Argerich, L., and Wachenchauzer, R.
(2016). Variations of the similarity function of
TextRank for automated summarization. In Proc.
Argentine Symposium on Artificial Intelligence, ASAI.
Fukumoto, F., Suzuki, Y., and Fukumoto, J.-i. (1997). An
automatic extraction of key paragraphs based on
context dependency. pages 291–298.
Jaruskulchai, C. and Kruengkrai, C. (2003). A practical text
summarizer by paragraph extraction for Thai.
pages 9–16.
Lin, C.-Y. (2004). ROUGE: A package for automatic
evaluation of summaries. In Proceedings of the ACL
Workshop: Text Summarization Branches Out 2004, page 10.
Mitra, M., Singhal, A., and Buckley, C. (2000). Automatic
text summarization by paragraph extraction.
Mussina, A. and Aubakirov, S. (2017). Dictionary
extraction based on statistical data. Vestnik KazNU,
pages 72–82.
Page, L., Brin, S., Motwani, R., and Winograd, T. (1998).
The PageRank citation ranking: Bringing order to the
web.
Sripada, S. and Jagarlamudi, J. (2009). Summarization
approaches based on document probability distributions.
In PACLIC.
Yacko (2002). Simmetrichnoe referirovanie: teoreticheskie
osnovy i metodika. pages 18–28.
DATA 2018 - 7th International Conference on Data Science, Technology and Applications