hill climbing method. Therefore, we decided to
focus on the pseudo backward hill climbing method
for parameter tuning. We ran a wide
variety of experiments with all four ML methods.
Due to space limitations, we detail only the best
result, which was achieved by J48.
We decided to limit the number of leaves in each
sub-decision tree to 18, which helps prevent
overfitting. In addition, we disabled pruning.
Features #11 and #12, whose weights were zero
according to InfoGain, were omitted. With this
configuration, we achieved an accuracy of 91.2%.
After omitting a third feature, the accuracy
rate fell to 91.05%.
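The omission of zero-weight features can be illustrated with a minimal sketch of information gain. It assumes discrete feature values (numeric features would need to be discretized first), and the function names `entropy`, `info_gain`, and `drop_zero_gain`, as well as the toy data, are hypothetical and not taken from our implementation.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def info_gain(feature_values, labels):
    """Information gain of one discrete feature with respect to the labels."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    total = len(labels)
    # Expected entropy of the labels after splitting on the feature value.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def drop_zero_gain(columns, labels, eps=1e-12):
    """Return the names of features whose information gain exceeds eps."""
    return [name for name, values in columns.items()
            if info_gain(values, labels) > eps]


if __name__ == "__main__":
    # Toy data (hypothetical): 1 = quotation, 0 = not a quotation.
    labels = [1, 1, 0, 0]
    columns = {
        "f_a": [1, 1, 0, 0],  # perfectly predictive of the label
        "f_b": [0, 1, 0, 1],  # independent of the label
        "f_c": [5, 5, 5, 5],  # constant, carries no information
    }
    print(drop_zero_gain(columns, labels))  # ['f_a']
```

A feature with zero information gain leaves the class distribution unchanged in every split, so removing it does not discard any signal that a single-feature test could exploit.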
To sum up, the best accuracy result (91.2%) was
achieved by J48, based on 17 features using the
pseudo backward hill climbing method. It represents
an improvement of 8.05 percentage points over the
baseline result (83.15%).
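The selection procedure summarized above can be sketched as follows. This is only one possible reading, assuming that the pseudo backward hill climbing step removes features in ascending order of InfoGain and stops as soon as a removal lowers the accuracy. The names `backward_elimination` and `toy_evaluate`, the feature names, and the toy accuracy values are hypothetical; in practice `evaluate` would train and test a classifier such as J48 on the given feature subset.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def backward_elimination(
    features: Sequence[str],
    gain: Dict[str, float],
    evaluate: Callable[[List[str]], float],
) -> Tuple[List[str], float]:
    """Drop features in ascending gain order while accuracy does not fall."""
    current = list(features)
    best_acc = evaluate(current)
    # The weakest features (lowest gain) are tried for removal first.
    for name in sorted(features, key=lambda f: gain[f]):
        if len(current) == 1:
            break  # always keep at least one feature
        candidate = [f for f in current if f != name]
        acc = evaluate(candidate)
        if acc < best_acc:
            break  # the removal hurt accuracy: keep the current subset
        current, best_acc = candidate, acc
    return current, best_acc


def toy_evaluate(subset: List[str]) -> float:
    """Hypothetical accuracy: noisy features hurt, useful ones help."""
    noise = sum(1 for f in ("f1", "f2") if f in subset)
    missing = sum(1 for f in ("f3", "f4", "f5") if f not in subset)
    return (90 - noise - 5 * missing) / 100


if __name__ == "__main__":
    gain = {"f1": 0.0, "f2": 0.0, "f3": 0.1, "f4": 0.3, "f5": 0.5}
    selected, acc = backward_elimination(list(gain), gain, toy_evaluate)
    print(selected, acc)  # ['f3', 'f4', 'f5'] 0.9
```

Under this reading, the reported figures correspond to a loop that removes features #11 and #12 (91.2%) and stops when removing a third feature lowers the accuracy to 91.05%, leaving 17 features.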
7 SUMMARY AND FUTURE WORK
We present an automatic process that identifies
quotations included in rabbinic documents written in
Hebrew-Aramaic. The identification was based on a
combination of at most nineteen features that belong
to seven feature sets: matches, best matches, sums of
weights, weighted averages, weighted medians,
common words, and quotation indicators.
Our research is unique in addressing the much
more difficult problem of identifying
quotations included in rabbinic literature.
Furthermore, we define and apply features, some
of which have not been used in previous research.
Experiments on various combinations of these
features were performed using four common ML
methods. A combination of 17 features using J48
(Weka's implementation of C4.5) achieves an accuracy of
91.2%, an improvement of about 8 percentage points
over the baseline result.
Other Semitic languages, such as Arabic, also
have similarly complex morphology. For example,
prefixes and terminal letters might be included in the
citations. Research that emphasizes morphological
features might be fruitful not only to Hebrew-
Aramaic rabbinic texts but also to other kinds of
texts from other Semitic languages.
We propose the following directions for future
research: (1) identifying more complex cases, e.g.,
nested quotations and quotations attributed to
unspecified authors; (2) implementing morphological
features that are appropriate for various Semitic
languages; and (3) disambiguating ambiguous quotations.
AUTOMATIC IDENTIFICATION OF BIBLICAL QUOTATIONS IN HEBREW-ARAMAIC DOCUMENTS