
Khattab, O., Santhanam, K., Li, X. L., Hall, D., Liang, P., Potts, C., and Zaharia, M. (2022). Demonstrate-Search-Predict: Composing retrieval and language models for knowledge-intensive NLP. Preprint, arXiv:2212.14024.
Li, M., Zhao, Y., Yu, B., Song, F., Li, H., Yu, H., Li, Z., Huang, F., and Li, Y. (2023). API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs. Preprint, arXiv:2304.08244.
Liu, W., Huang, X., Zeng, X., Hao, X., Yu, S., Li, D., Wang, S., Gan, W., Liu, Z., Yu, Y., Wang, Z., Wang, Y., Ning, W., Hou, Y., Wang, B., Wu, C., Wang, X., Liu, Y., Wang, Y., Tang, D., Tu, D., Shang, L., Jiang, X., Tang, R., Lian, D., Liu, Q., and Chen, E. (2024). ToolACE: Winning the Points of LLM Function Calling. Preprint, arXiv:2409.00920.
Lu, J., Holleis, T., Zhang, Y., Aumayer, B., Nan, F., Bai, F., Ma, S., Ma, S., Li, M., Yin, G., Wang, Z., and Pang, R. (2024). ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities. Preprint, arXiv:2408.04682.
Ma, X., Gong, Y., He, P., Zhao, H., and Duan, N. (2023). Query Rewriting for Retrieval-Augmented Large Language Models. Preprint, arXiv:2305.14283.
Moon, S., Jha, S., Erdogan, L. E., Kim, S., Lim, W., Keutzer, K., and Gholami, A. (2024). Efficient and Scalable Estimation of Tool Representations in Vector Space. Preprint, arXiv:2409.02141.
Papineni, K. (2001). Why Inverse Document Frequency? In Second Meeting of the North American Chapter of the Association for Computational Linguistics.
Patil, S. G., Zhang, T., Wang, X., and Gonzalez, J. E. (2023). Gorilla: Large Language Model Connected with Massive APIs. Preprint, arXiv:2305.15334.
Peng, W., Li, G., Jiang, Y., Wang, Z., Ou, D., Zeng, X., Xu, D., Xu, T., and Chen, E. (2024). Large Language Model based Long-tail Query Rewriting in Taobao Search. Preprint, arXiv:2311.03758.
Qin, Y., Liang, S., Ye, Y., Zhu, K., Yan, L., Lu, Y., Lin, Y., Cong, X., Tang, X., Qian, B., Zhao, S., Hong, L., Tian, R., Xie, R., Zhou, J., Gerstein, M., Li, D., Liu, Z., and Sun, M. (2023). ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs. Preprint, arXiv:2307.16789.
Raudaschl, A. H. (2023). Forget RAG, the future is RAG-Fusion: The next frontier of search: Retrieval Augmented Generation meets Reciprocal Rank Fusion and generated queries.
Robertson, S. and Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends® in Information Retrieval, 3(4):333–389.
Roucher, A. (2023). Agentic RAG: Turbocharge your RAG with query reformulation and self-query!
Sawarkar, K., Mangal, A., and Solanki, S. R. (2024). Blended RAG: Improving RAG (Retriever-Augmented Generation) Accuracy with Semantic Search and Hybrid Query-Based Retrievers. Preprint, arXiv:2404.07220.
Setty, S., Thakkar, H., Lee, A., Chung, E., and Vidra, N. (2024). Improving Retrieval for RAG based Question Answering Models on Financial Documents. Preprint, arXiv:2404.07221.
Sun, W., Yan, L., Ma, X., Wang, S., Ren, P., Chen, Z., Yin, D., and Ren, Z. (2023). Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents. Preprint, arXiv:2304.09542.
Tang, Q., Deng, Z., Lin, H., Han, X., Liang, Q., Cao, B., and Sun, L. (2023). ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases. Preprint, arXiv:2306.05301.
Tang, Y. and Yang, Y. (2024). MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries. Preprint, arXiv:2401.15391.
Theja, R. (2023). Boosting RAG: Picking the best embedding & reranker models.
Trivedi, H., Balasubramanian, N., Khot, T., and Sabharwal, A. (2023). Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions. Preprint, arXiv:2212.10509.
Wang, L., Yang, N., and Wei, F. (2023). Query2doc: Query Expansion with Large Language Models. Preprint, arXiv:2303.07678.
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., and Zhou, D. (2023). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Preprint, arXiv:2201.11903.
Wu, M., Zhu, T., Han, H., Tan, C., Zhang, X., and Chen, W. (2024). Seal-Tools: Self-Instruct Tool Learning Dataset for Agent Tuning and Detailed Benchmark. Preprint, arXiv:2405.08355.
Xu, B., Peng, Z., Lei, B., Mukherjee, S., Liu, Y., and Xu, D. (2023). ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models. Preprint, arXiv:2305.18323.
Yan, S.-Q., Gu, J.-C., Zhu, Y., and Ling, Z.-H. (2024). Corrective Retrieval Augmented Generation. Preprint, arXiv:2401.15884.
Yao, S., Shinn, N., Razavi, P., and Narasimhan, K. (2024). τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. Preprint, arXiv:2406.12045.
Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., and Cao, Y. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. Preprint, arXiv:2210.03629.
Yuan, L., Chen, Y., Wang, X., Fung, Y. R., Peng, H., and Ji, H. (2024a). CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets. Preprint, arXiv:2309.17428.
Yuan, S., Song, K., Chen, J., Tan, X., Shen, Y., Kan, R., Li, D., and Yang, D. (2024b). EASYTOOL: Enhancing LLM-based Agents with Concise Tool Instruction. Preprint, arXiv:2401.06201.
Zheng, H. S., Mishra, S., Chen, X., Cheng, H.-T., Chi, E. H., Le, Q. V., and Zhou, D. (2024a). Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models. Preprint, arXiv:2310.06117.
Zheng, Y., Li, P., Liu, W., Liu, Y., Luan, J., and Wang, B. (2024b). ToolRerank: Adaptive and Hierarchy-Aware Reranking for Tool Retrieval. Preprint, arXiv:2403.06551.