Purpose

Stemming is a critical preprocessing step in information retrieval (IR) and natural language processing (NLP), in which words are reduced to their root forms to improve query matching and retrieval performance. Effective stemming algorithms exist for high-resource languages, but low-resource languages such as Gujarati lack a robust solution. Existing rule-based, dictionary-based and hybrid stemming techniques for Gujarati struggle with morphological variation, contextual understanding and out-of-vocabulary words, which limits their effectiveness in search engines and text-processing applications.

Design/methodology/approach

This study proposes two distinct stemming techniques for Gujarati: (1) an IndoWordNet-based method and (2) a novel large language model (LLM)-based approach comprising three variants: word-level (LLM-WL), sentence-level (LLM-SL) and part-of-speech sentence-level (LLM-POS-SL). The LLM variants were implemented with the GPT-4-0613 model. All methods were evaluated on a curated Gujarati corpus of 20,849 words using standard metrics: precision, recall and F-score. The impact of stemming on IR performance was assessed using Gujarati Wikipedia search queries, with mean average precision (MAP) and Precision@10 as the evaluation metrics.
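For readers unfamiliar with the two IR metrics named above, the following is a minimal sketch of how Precision@10 and MAP are commonly computed. It is not the authors' evaluation code: the function names, the document identifiers and the choice to normalize average precision by the total number of relevant documents are assumptions made for illustration.

```python
from typing import Iterable, Sequence, Set, Tuple


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the top-k ranked documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Positions beyond the end of a short ranking count as non-relevant.
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Mean of the precision values at each rank where a relevant document appears.

    Assumption: normalized by the total number of relevant documents,
    so relevant documents that are never retrieved lower the score.
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average of average_precision over all (ranking, relevant set) pairs."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical query results, for illustration only.
example_ranking = ["d1", "d2", "d3", "d4"]
example_relevant = {"d1", "d3"}
example_ap = average_precision(example_ranking, example_relevant)
example_p2 = precision_at_k(example_ranking, example_relevant, k=2)
```

In this toy ranking the relevant documents sit at ranks 1 and 3, so the average precision is (1/1 + 2/3) / 2. A stemmer that conflates more inflected forms with the query terms would be expected to move relevant documents higher in the ranking, which raises both metrics.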

Findings

The LLM-POS-SL variant achieved the best results, with an average precision of 96.7%, a recall of 91.95% and an F-score of 94.26%. IR experiments further confirmed that LLM-POS-SL enhances search relevance and ranking with superior MAP and Precision@10.
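As a quick sanity check on the reported figures, the sketch below computes the F-score as the harmonic mean of precision and recall. It assumes the balanced F1 form (beta = 1), which the abstract does not state explicitly; the function name is illustrative.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta = 1 gives the harmonic mean of precision and recall."""
    if precision <= 0 and recall <= 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Reported averages for LLM-POS-SL, in percent.
f1 = f_score(96.7, 91.95)
print(round(f1, 2))
```

The harmonic mean of the two reported averages comes to roughly 94.27%, which agrees closely with the reported 94.26%. An exact match is not expected if the reported F-score is itself an average over several evaluations, because the mean of per-evaluation F-scores generally differs slightly from the F-score of the mean precision and recall.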

Practical implications

These findings underscore the potential of LLM-based stemming in enhancing Gujarati NLP tools, improving the effectiveness of digital search systems. This research advances socio-technical development in regional-language IR and promotes more equitable access to information. Future work will focus on optimizing efficiency, scalability and domain-specific adaptations to further improve stemming accuracy.

Originality/value

To the best of our knowledge, this is the first reported study to leverage an LLM for the stemming task in Gujarati or any other Indian language.
