LLM-based stemming for improved Gujarati information retrieval

Kaur, Jasleen; Patel, Smit

doi:10.1108/AJIM-04-2025-0241

Purpose

Stemming is a critical preprocessing phase in information retrieval (IR) and natural language processing (NLP) tasks, aimed at reducing words to their root forms to improve query matching and retrieval performance. While effective stemming algorithms exist for high-resource languages, low-resource languages such as Gujarati lack a robust solution. Existing rule-based, dictionary-based and hybrid stemming techniques for the Gujarati language struggle to handle morphological variations, contextual understanding and out-of-vocabulary words, which limits their effectiveness in search engines and text-processing applications.

Design/methodology/approach

This study proposes two stemming techniques for Gujarati: (1) an IndoWordNet-based method and (2) a novel large language model (LLM)-based approach comprising three variants: word-level (LLM-WL), sentence-level (LLM-SL) and part-of-speech sentence-level (LLM-POS-SL). The LLM variants were implemented with the GPT-4-0613 model. These methods were evaluated on a curated Gujarati corpus of 20,849 words using standard metrics: precision, recall and F-score. The impact of stemming on IR performance was assessed using Gujarati Wikipedia search queries, with mean average precision (MAP) and Precision@10 as the evaluation metrics.
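For readers unfamiliar with the two IR metrics named above, the following is a minimal sketch of how Precision@10 and MAP are conventionally computed from ranked result lists. It is not the authors' evaluation code: the function names, the document and query identifiers and the toy data are all hypothetical, and each ranked list is assumed to contain no duplicate documents.

```python
from typing import Dict, List, Set


def precision_at_k(ranked: List[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / k


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Mean of the precision values at each rank where a relevant document appears."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs: Dict[str, List[str]],
                           qrels: Dict[str, Set[str]]) -> float:
    """Average of per-query average precision over all queries in `runs`."""
    if not runs:
        return 0.0
    scores = [average_precision(ranked, qrels.get(query, set()))
              for query, ranked in runs.items()]
    return sum(scores) / len(scores)


# Hypothetical toy data: two queries with ranked results and relevance judgments.
runs = {"q1": ["d1", "d2", "d3"], "q2": ["d4", "d5"]}
qrels = {"q1": {"d1", "d3"}, "q2": {"d5"}}

map_score = mean_average_precision(runs, qrels)
p_at_10_q1 = precision_at_k(runs["q1"], qrels["q1"], k=10)
```

In a stemming study, one would typically compute these scores once per stemmer (indexing and querying with each stemmer's output) and compare the resulting values across methods.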

Findings

The LLM-POS-SL variant achieved the best results, with an average precision of 96.7%, a recall of 91.95% and an F-score of 94.26%. IR experiments further confirmed that LLM-POS-SL enhances search relevance and ranking with superior MAP and Precision@10.
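As a quick consistency check on the figures above, the sketch below computes the F-score as the harmonic mean of precision and recall. This assumes the balanced F1 definition; the abstract does not state which F-score variant was used, and the function name is illustrative.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported averages for LLM-POS-SL, in percent.
f = f_score(96.7, 91.95)
print(f"{f:.2f}")
```

The harmonic mean of the two reported averages comes to roughly 94.27, in line with the reported 94.26%. The small difference may reflect rounding or an F-score averaged over individual evaluation units rather than computed from the averaged precision and recall; the abstract does not say which.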

Practical implications

These findings underscore the potential of LLM-based stemming for enhancing Gujarati NLP tools and improving the effectiveness of digital search systems. This research advances socio-technical development in regional-language IR and promotes more equitable access to information. Future work will focus on optimizing efficiency, scalability and domain-specific adaptations to further improve stemming accuracy.

Originality/value

To the best of our knowledge, this is the first reported study to leverage an LLM for the stemming task in Gujarati or any other Indian language.

2025

Emerald Publishing Limited

Licensed re-use rights only
