COMPARATIVE ANALYSIS OF THE EFFECTIVENESS OF LARGE LANGUAGE MODELS FOR METAPHOR IDENTIFICATION: ZERO-SHOT AND FINE-TUNING METHODS
DOI: https://doi.org/10.32782/folium/2025.7.9

Keywords: large language models (LLM), metaphor, metaphor identification, fine-tuning, zero-shot, computational linguistics, natural language processing (NLP).

Abstract
The article addresses automatic metaphor identification, one of the most complex tasks in natural language processing (NLP). Building on the principles of cognitive linguistics, which define metaphor as a fundamental mechanism of thought (Lakoff & Johnson, 1980), it examines the role of metaphor as a powerful framing tool in political and media discourse. Although the ability to analyse metaphorical patterns at scale is crucial for identifying manipulative techniques, recognising them is complicated by contextual dependence, creativity, and the need for encyclopaedic knowledge. A central question of this article is the assessment of the potential of modern large language models (LLMs) for automatic metaphor identification. The paper compares two key approaches: relying on the models' so-called 'innate' knowledge without additional tuning (the zero-shot approach) and their specialised adaptation through fine-tuning. The latest models (as of July 2025) from leading developers were investigated: OpenAI (GPT-4o), Google (Gemini 2.5 Pro, Gemini 2.5 Flash), and Anthropic (Claude Sonnet 4). Special attention was paid to the experimental methodology. The analysis was based on the NAACL 2020 Shared Task on Metaphor Detection corpus, and standard binary classification metrics were used to evaluate model performance: precision, recall, and the F1-score. The article describes the fine-tuning procedure and identifies practical limitations arising from the varying availability of tuning tools across leading artificial intelligence ecosystems. The results show that the baseline models deliver low and unbalanced performance, while fine-tuning significantly improves their output (the F1-score increases by 24–29%).
A comparative analysis of the fine-tuned models revealed that GPT-4o achieves the best balance between recall and precision (F1-score 64.20%), while Gemini 2.5 Flash retains a slight advantage in precision. The article contributes to the study of LLM capabilities for analysing figurative language, demonstrating that fine-tuning is a crucial method for adapting them to complex linguistic tasks.
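The binary classification metrics named in the abstract can be made concrete with a short sketch. The function below is illustrative only (it is not taken from the article's code) and assumes token-level binary labels, 1 = metaphorical, 0 = literal; the example labels are hypothetical.

```python
# Illustrative sketch: precision, recall, and F1 for the positive
# (metaphor) class, as used to evaluate the models in the study.

def precision_recall_f1(gold, pred):
    """Compute precision, recall, and F1 for the metaphor (1) class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical gold and predicted labels for six tokens:
gold = [1, 0, 1, 1, 0, 0]
pred = [1, 0, 0, 1, 1, 0]
p, r, f1 = precision_recall_f1(gold, pred)
```

Reporting F1 rather than accuracy matters here because metaphorical tokens are a minority class, so a model that predicts "literal" everywhere would still score high accuracy while detecting nothing.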
References
Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., & Amodei, D. (2020). Language Models Are Few-Shot Learners. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, & H. Lin (Eds.), Advances in Neural Information Processing Systems, 33, 1877–1901. https://doi.org/10.48550/arXiv.2005.14165
Charteris-Black, J. (2004). Corpus Approaches to Critical Metaphor Analysis. Basingstoke: Palgrave Macmillan. https://doi.org/10.1057/9780230000612
Lakoff, G., & Johnson, M. (1980). Metaphors We Live By. Chicago: University of Chicago Press.
Leong, C., Beigman Klebanov, B., & Shutova, E. (2020). Report on the 2020 Metaphor Detection Shared Task. In Proceedings of the Second Workshop on Figurative Language Processing. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.figlang-1.3
Musolff, A. (2006). Metaphor scenarios in public discourse. Metaphor and Symbol. https://doi.org/10.1207/s15327868ms2101_2
Pragglejaz Group. (2007). MIP: A method for identifying metaphorically used words in discourse. Metaphor and Symbol. https://doi.org/10.1080/10926480709336752
Shutova, E. (2010). Models of metaphor in NLP. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (pp. 688–697). Association for Computational Linguistics. https://dl.acm.org/doi/10.5555/1858681.1858752
Steen, G.J., Dorst, A.G., Herrmann, J.B., Kaal, A.A., Krennmayr, T., & Pasma, T. (2010). A Method for Linguistic Metaphor Identification: From MIP to MIPVU. Amsterdam/Philadelphia: John Benjamins Publishing Company. https://doi.org/10.1075/celcr.14
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., & Polosukhin, I. (2017). Attention Is All You Need. In Advances in Neural Information Processing Systems, 30. Curran Associates. https://doi.org/10.48550/arXiv.1706.03762
OpenAI API Fine-tuning Documentation. OpenAI. https://platform.openai.com/docs/guides/supervised-fine-tuning
Google Cloud Vertex AI Fine-tuning Guide. Google Cloud Documentation. https://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini-use-supervised-tuning
Anthropic Claude Model Overview. Anthropic. https://www.anthropic.com/claude
OpenAI GPT-4 Product Page. OpenAI. https://openai.com/research/gpt-4
Google Gemini Model Overview. Google DeepMind Blog. Retrieved from https://deepmind.google/models/gemini/
YU-NLPLab. (n.d.). VU Amsterdam Metaphor Corpus – training subset (first 1500 sentences) [Data set]. GitHub. https://github.com/YU-NLPLab/DeepMet/blob/master/corpora/VUA/vuamc_corpus_train.csv
Jin, G. (n.d.). VU Amsterdam Metaphor Corpus – test subset (203 sentences) [Data set]. GitHub. https://github.com/jin530/MelBERT/blob/main/data_sample/VUAtok_sample/test.tsv
Пасічник, В., & Яромич, М. (2025). Особливості жанрової класифікації літератури за допомогою великих мовних моделей [Features of genre classification of literature using large language models]. Folium, 6, 132–143. https://doi.org/10.32782/folium/2025.6.19