
5.2 Ensemble Results
All of the above models, on their own, yielded better results than the BERT model. However, accuracy was improved further by reinterpreting the confidence scores while ensembling the models, as can be observed in Table 6.
The ensemble of large language models (GTE-large-en v1.5, Mistral-7B, and MiniGPT4-7B), after reinterpretation of the confidence scores, achieves an accuracy of 89.5%, which is 1.2 percentage points higher than the vanilla ensemble accuracy of 88.3%. Note that the vanilla ensemble of the large language models gave slightly lower accuracy than GTE-large on its own, which achieved 88.7%.
The ensemble of the smaller language models (RoBERTa-base, DistilBERT, and ALBERT) also benefited slightly: with the argmax-of-product strategy, the reinterpretation gained 0.4 percentage points of accuracy. The reinterpretation improved the accuracy of the AHP ensemble by 0.3 percentage points.
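To make the combination concrete, the following is a minimal sketch of the argmax-of-product strategy, assuming each model's scores have already been reinterpreted; the array shapes, function name, and toy values are illustrative and not taken from the paper.

import numpy as np

def argmax_of_product(reinterpreted_scores):
    # reinterpreted_scores: (n_models, n_samples, n_classes), where each
    # entry is a model's reinterpreted confidence (e.g., an estimated
    # precision) for that class, not the raw softmax score.
    scores = np.asarray(reinterpreted_scores)
    product = scores.prod(axis=0)      # class-wise product across models
    return product.argmax(axis=1)      # predicted class for each sample

# Hypothetical example: three models, two samples, three classes.
model_scores = np.array([
    [[0.70, 0.20, 0.10], [0.30, 0.50, 0.20]],
    [[0.60, 0.30, 0.10], [0.25, 0.55, 0.20]],
    [[0.80, 0.15, 0.05], [0.40, 0.40, 0.20]],
])
print(argmax_of_product(model_scores))  # -> [0 1]

The product favours classes on which all models are confident, so a single model with a low reinterpreted score can effectively veto a class.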
6 CONCLUSION
Reinterpreting confidence scores on the basis of precision as a function of the confidence score improved the performance of the ensemble of large language models by 1.2 percentage points, as can be seen from the difference between the F-scores of PV-LLM and PR-LLM in Table 6.
The reinterpretation also had a small positive impact of 0.4 percentage points on the accuracy of the ensemble of smaller language models, as can be inferred from the difference between the F-scores of PV-SLM and PR-SLM in Table 6.
This shows that precision as a function of the confidence score is an insightful metric for an ensemble. It quantifies the reliability of model predictions, enabling more informed decisions about which prediction to trust, and it gives an edge over simply using the softmax scores reported by the models to reach a verdict.
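As an illustration of the idea, precision as a function of the confidence score can be estimated on a held-out calibration split by binning the raw softmax confidences and measuring how often predictions in each bin are correct; the binning scheme and names below are assumptions for this sketch, not the paper's exact procedure.

import numpy as np

def precision_by_confidence(confidences, predictions, labels, n_bins=10):
    # confidences: max softmax score per prediction, shape (n_samples,)
    # predictions, labels: predicted and gold class indices, shape (n_samples,)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    precisions = np.full(n_bins, np.nan)
    bins = np.clip(np.digitize(confidences, edges) - 1, 0, n_bins - 1)
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            # Fraction of predictions in this confidence bin that are correct.
            precisions[b] = (predictions[mask] == labels[mask]).mean()
    return edges, precisions

def reinterpret(confidence, edges, precisions):
    # Replace a raw confidence with the empirical precision of its bin.
    b = np.clip(np.digitize(confidence, edges) - 1, 0, len(precisions) - 1)
    return precisions[b]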
7 LIMITATIONS
The performance of the models, and that of the ensembles, was restricted by many factors, including, but not limited to, the dated and primitive method of data augmentation through the addition of noise. SMOTE is the preferred data augmentation method in recent literature pertaining to language models and machine learning.
The decision to experiment with the addition of white noise was taken in view of limited VRAM, which precluded the use of SMOTE at a sufficiently large sequence length. The lack of a comprehensive strategy for tackling class imbalance may therefore be a limiting factor in this work.
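For reference, a minimal sketch of the kind of noise-based oversampling used here as a stand-in for SMOTE; the assumption that the noise is added to fixed-length feature vectors, as well as the noise scale, are illustrative choices for this example rather than details reported in the paper.

import numpy as np

def oversample_with_white_noise(features, labels, minority_class,
                                copies=1, noise_std=0.01, seed=0):
    # features: (n_samples, feature_dim) array; labels: (n_samples,) class ids.
    # Creates noisy copies of minority-class feature vectors to reduce
    # class imbalance without the memory cost of SMOTE.
    rng = np.random.default_rng(seed)
    minority = features[labels == minority_class]
    noisy = [minority + rng.normal(0.0, noise_std, size=minority.shape)
             for _ in range(copies)]
    new_features = np.concatenate([features, *noisy], axis=0)
    new_labels = np.concatenate(
        [labels, np.full(len(minority) * copies, minority_class)])
    return new_features, new_labels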
Another limitation is the need to reserve a large portion of the dataset for developing the ensemble strategy. The mapping of confidence scores to class incidence probabilities requires a large sample size for the reinterpretations to be statistically significant.
REFERENCES
Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357.
Devlin, J., Chang, M., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.
Dodge, J., Ilharco, G., Schwartz, R., Farhadi, A., Hajishirzi,
H., and Smith, N. (2020). Fine-tuning pretrained lan-
guage models: Weight initializations, data orders, and
early stopping.
Doering, N., Gorlla, C., Tuttle, T., and Vijay, A. (2024).
Empirical analysis of efficient fine-tuning methods for
large pre-trained language models.
Hinton, G., Vinyals, O., and Dean, J. (2015). Distilling the
knowledge in a neural network.
Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford,
C., Chaplot, D. S., de las Casas, D., Bressand, F.,
Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R.,
Lachaux, M.-A., Stock, P., Scao, T. L., Lavril, T.,
Wang, T., Lacroix, T., and Sayed, W. E. (2023). Mistral 7B.
Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., and Soricut, R. (2020). ALBERT: A Lite BERT for self-supervised learning of language representations.
Li, J., Li, D., Savarese, S., and Hoi, S. (2023). BLIP-
2: Bootstrapping language-image pre-training with
frozen image encoders and large language models. In
Krause, A., Brunskill, E., Cho, K., Engelhardt, B.,
Sabato, S., and Scarlett, J., editors, Proceedings of the
40th International Conference on Machine Learning,
volume 202 of Proceedings of Machine Learning Re-
search, pages 19730–19742. PMLR.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach.
Reynolds, L. and McDonell, K. (2021). Prompt program-
ming for large language models: Beyond the few-shot
paradigm.
Saaty, T. L. (1990). How to make a decision: The analytic hierarchy process. European Journal of Operational Research.