COMPARATIVE ANALYSIS OF THE EFFECTIVENESS OF LARGE LANGUAGE MODELS FOR METAPHOR IDENTIFICATION: ZERO-SHOT AND FINE-TUNING METHODS
DOI: https://doi.org/10.32782/folium/2025.7.9

Keywords: large language models (LLM), metaphor, metaphor identification, fine-tuning, zero-shot, computational linguistics, natural language processing (NLP).

Abstract
The article addresses automatic metaphor identification, one of the most complex tasks in natural language processing (NLP). Building on the principles of cognitive linguistics, which define metaphor as a fundamental mechanism of thought (Lakoff & Johnson, 1980), it examines the role of metaphor as a powerful framing tool in political and media discourse. Although the ability to analyse metaphorical patterns at scale is crucial for identifying manipulative techniques, recognising them is complicated by contextual dependence, creativity, and the need for encyclopaedic knowledge. A central question of this article is the assessment of the potential of modern large language models (LLMs) for automatic metaphor identification. The paper compares two key approaches: relying on the models' so-called 'innate' knowledge without additional tuning (the zero-shot approach) and their specialised adaptation through fine-tuning. The latest models (as of July 2025) from leading developers were investigated: OpenAI (GPT-4o), Google (Gemini 2.5 Pro, Gemini 2.5 Flash), and Anthropic (Claude Sonnet 4). Special attention was paid to the experimental methodology. The analysis was based on the NAACL 2020 Shared Task on Metaphor Detection corpus, and standard binary classification metrics were used to evaluate model performance: precision, recall, and the F1-score. The article describes the fine-tuning procedure and identifies practical limitations arising from the varying availability of tuning tools across leading artificial intelligence ecosystems. The results show that the baseline models deliver low and unbalanced performance, while fine-tuning significantly improves their output (the F1-score increases by 24–29%).
A comparative analysis of the fine-tuned models revealed that GPT-4o achieves the best balance between recall and precision (F1-score 64.20%), while Gemini 2.5 Flash retains a slight advantage in precision. The article contributes to the study of LLM capabilities for analysing figurative language, demonstrating that fine-tuning is a crucial method for adapting them to complex linguistic tasks.
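The binary classification metrics named in the abstract can be made concrete with a short sketch. The function below is illustrative only (it is not taken from the article's code) and assumes token-level binary labels, 1 = metaphorical, 0 = literal; the example labels are hypothetical.

```python
# Illustrative sketch: precision, recall, and F1 for the positive
# (metaphor) class, as used to evaluate the models in the study.

def precision_recall_f1(gold, pred):
    """Compute precision, recall, and F1 for the metaphor (1) class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical gold and predicted labels for six tokens:
gold = [1, 0, 1, 1, 0, 0]
pred = [1, 0, 0, 1, 1, 0]
p, r, f1 = precision_recall_f1(gold, pred)
```

Reporting F1 rather than accuracy matters here because metaphorical tokens are a minority class, so a model that predicts "literal" everywhere would still score high accuracy while detecting nothing.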
References
Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., & Amodei, D. (2020). Language Models Are Few-Shot Learners. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, & H. Lin (Eds.), Advances in Neural Information Processing Systems, 33, 1877–1901. https://doi.org/10.48550/arXiv.2005.14165
Charteris-Black, J. (2004). Corpus Approaches to Critical Metaphor Analysis. Basingstoke: Palgrave Macmillan. https://doi.org/10.1057/9780230000612
Lakoff, G., & Johnson, M. (1980). Metaphors We Live By. Chicago: University of Chicago Press.
Leong, C., Beigman Klebanov, B., & Shutova, E. (2020). Report on the 2020 Metaphor Detection Shared Task. In Proceedings of the Second Workshop on Figurative Language Processing. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.figlang-1.3
Musolff, A. (2006). Metaphor scenarios in public discourse. Metaphor and Symbol. https://doi.org/10.1207/s15327868ms2101_2
Pragglejaz Group. (2007). MIP: A method for identifying metaphorically used words in discourse. Metaphor and Symbol. https://doi.org/10.1080/10926480709336752
Shutova, E. (2010). Models of metaphor in NLP. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (pp. 688–697). Association for Computational Linguistics. https://dl.acm.org/doi/10.5555/1858681.1858752
Steen, G.J., Dorst, A.G., Herrmann, J.B., Kaal, A.A., Krennmayr, T., & Pasma, T. (2010). A Method for Linguistic Metaphor Identification: From MIP to MIPVU. Amsterdam/Philadelphia: John Benjamins Publishing Company. https://doi.org/10.1075/celcr.14
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., & Polosukhin, I. (2017). Attention Is All You Need. In Advances in Neural Information Processing Systems, 30. Curran Associates. https://doi.org/10.48550/arXiv.1706.03762
OpenAI API Fine-tuning Documentation. OpenAI. https://platform.openai.com/docs/guides/supervised-fine-tuning
Google Cloud Vertex AI Fine-tuning Guide. Google Cloud Documentation. https://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini-use-supervised-tuning
Anthropic Claude Model Overview. Anthropic. https://www.anthropic.com/claude
OpenAI GPT-4 Product Page. OpenAI. https://openai.com/research/gpt-4
Google Gemini Model Overview. Google DeepMind Blog. Retrieved from https://deepmind.google/models/gemini/
YU-NLPLab. (n.d.). VU Amsterdam Metaphor Corpus – training subset (first 1500 sentences) [Data set]. GitHub. https://github.com/YU-NLPLab/DeepMet/blob/master/corpora/VUA/vuamc_corpus_train.csv
Jin, G. (n.d.). VU Amsterdam Metaphor Corpus – test subset (203 sentences) [Data set]. GitHub. https://github.com/jin530/MelBERT/blob/main/data_sample/VUAtok_sample/test.tsv
Пасічник, В., & Яромич, М. (2025). Особливості жанрової класифікації літератури за допомогою великих мовних моделей [Features of genre classification of literature using large language models]. Folium, 6, 132–143. https://doi.org/10.32782/folium/2025.6.19