for machine translation evaluation with reinforced factors. In Machine Translation Summit.
Jiang, Z., Araki, J., Ding, H., and Neubig, G. (2021). How can we know when language models know? On the calibration of language models for question answering.
Kahle, P., Colutto, S., Hackl, G., and Mühlberger, G. (2017). Transkribus - a service platform for transcription, recognition and retrieval of historical documents. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 04, pages 19–24.
Kintz, M., Dukino, C., Blohm, M., and Hanussek, M. (2020). Make your Customers Happy Again. AI and NLP for a Customer Complaint Management Platform.
Korbak, T., Shi, K., Chen, A., Bhalerao, R., Buckley, C. L., Phang, J., Bowman, S. R., and Perez, E. (2023). Pretraining language models with human preferences.
Lambert, N., Castricato, L., von Werra, L., and Havrilla, A. (2022). Illustrating reinforcement learning from human feedback (RLHF). Hugging Face Blog. https://huggingface.co/blog/rlhf.
Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics Doklady, vol. 10, no. 8, pages 707–710.
Lin, C.-Y. (2004). ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
Microsoft (2022). Document AI (intelligent document processing). https://www.microsoft.com/en-us/research/project/document-ai/. Accessed: 2022-12-20.
Möller, T., Risch, J., and Pietsch, M. (2021). GermanQuAD and GermanDPR: Improving non-English question answering and passage retrieval. In Proceedings of the 3rd Workshop on Machine Reading for Question Answering, pages 42–50, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Neudecker, C., Baierer, K., Federbusch, M., Boenig, M., Würzner, K.-M., Hartmann, V., and Herrmann, E. (2019). OCR-D: An end-to-end open source OCR framework for historical printed documents. In Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage, DATeCH2019, page 53–58, New York, NY, USA. Association for Computing Machinery.
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright,
C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K.,
Gray, A., Schulman, J., Hilton, J., Kelton, F., Miller,
L., Simens, M., Askell, A., Welinder, P., Christiano,
P., Leike, J., and Lowe, R. (2022). Training language
models to follow instructions with human feedback.
In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K.,
editors, Advances in Neural Information Processing
Systems.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, page 311–318, USA. Association for Computational Linguistics.
Pearce, K., Zhan, T., Komanduri, A., and Zhan, J. (2021). A comparative study of transformer-based language models on extractive question answering. CoRR, abs/2110.03142.
Rajpurkar, P., Jia, R., and Liang, P. (2018). Know what you don’t know: Unanswerable questions for SQuAD.
Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. (2016). SQuAD: 100,000+ questions for machine comprehension of text.
Sanderson, M. and Croft, W. B. (2012). The history of information retrieval research. Proceedings of the IEEE, 100(Special Centennial Issue):1444–1451.
Schaeffer, R., Miranda, B., and Koyejo, S. (2023). Are
emergent abilities of large language models a mirage?
Smith, R. (2019). Tesseract OCR: an optical character recognition engine for various operating systems. https://github.com/tesseract-ocr/tesseract. Accessed: 2021-11-10.
Su, D., Xu, Y., Winata, G. I., Xu, P., Kim, H., Liu, Z., and Fung, P. (2019). Generalizing question answering system with pre-trained language model fine-tuning. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pages 203–211, Hong Kong, China. Association for Computational Linguistics.
Tang, Z., Yang, Z., Wang, G., Fang, Y., Liu, Y., Zhu, C., Zeng, M., Zhang, C., and Bansal, M. (2022). Unifying vision, text, and layout for universal document processing.
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Scao, T. L., Gugger, S., Drame, M., Lhoest, Q., and Rush, A. M. (2020). Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.
Xu, P., Liang, D., Huang, Z., and Xiang, B. (2021). Attention-guided generative models for extractive question answering. CoRR, abs/2110.06393.
Zhang, J., Chen, Y., Niu, N., and Liu, C. (2023). A preliminary evaluation of ChatGPT in requirements information retrieval.
Ziegler, D. M., Stiennon, N., Wu, J., Brown, T. B., Radford, A., Amodei, D., Christiano, P., and Irving, G. (2019). Fine-tuning language models from human preferences. ArXiv, abs/1909.08593.
Fine-Tuning and Aligning Question Answering Models for Complex Information Extraction Tasks