Purpose

This study presents a robust Arabic spelling-correction model that simultaneously handles non-word errors and multiple categories of real-word errors. Arabic is a morphologically rich and relatively low-resource language, which makes end-to-end correction across diverse error types particularly challenging. Our work targets these challenges directly and addresses key gaps in current methodologies for Arabic spelling correction.

Design/methodology/approach

The study introduces a comprehensive training framework with dynamic error synthesis, which injects errors at sampling time so that each pass over the data produces fresh perturbations. The synthesized errors cover non-word, real-word, grammatical and punctuation categories. We train on a new corpus of 191,000+ news articles comprising 3.2 million sentences and formulate correction as a sequence-to-sequence supervised learning task using Transformer-based models. We evaluate several pretraining configurations, including BERT2BERT and BERT2GPT2, and quantify the benefits of transfer learning by comparing them against a baseline Seq2Seq Transformer trained from random initialization.
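
As a rough illustration of what injecting errors at sampling time can look like, the following minimal Python sketch corrupts clean sentences on the fly, so that each pass over the data yields different (noisy, clean) pairs. The perturbation functions, the `CONFUSABLES` table, the `error_rate` parameter and all function names are assumptions made for illustration; the abstract does not specify the concrete operations used in the study.

```python
import random

# Hypothetical table of commonly confused Arabic letters (illustrative only).
CONFUSABLES = {"ه": "ة", "ة": "ه", "ا": "أ", "أ": "ا", "ى": "ي", "ي": "ى"}
PUNCTUATION = "،.؛؟!"


def swap_adjacent(word, rng):
    """Transpose two neighbouring characters (a typical typing slip)."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def confuse_letter(word, rng):
    """Replace one letter with a commonly confused counterpart, if any."""
    positions = [i for i, ch in enumerate(word) if ch in CONFUSABLES]
    if not positions:
        return word
    i = rng.choice(positions)
    return word[:i] + CONFUSABLES[word[i]] + word[i + 1:]


def drop_punctuation(word, rng):
    """Remove punctuation marks attached to the word."""
    return "".join(ch for ch in word if ch not in PUNCTUATION)


def corrupt(sentence, rng, error_rate=0.2):
    """Perturb each word with probability `error_rate` (assumed parameter)."""
    operations = [swap_adjacent, confuse_letter, drop_punctuation]
    noisy = []
    for word in sentence.split():
        if rng.random() < error_rate:
            word = rng.choice(operations)(word, rng)
        if word:  # skip tokens that became empty after punctuation removal
            noisy.append(word)
    return " ".join(noisy)


def training_pairs(clean_sentences, rng, error_rate=0.2):
    """Yield (noisy source, clean target) pairs, re-sampled on every call."""
    for sentence in clean_sentences:
        yield corrupt(sentence, rng, error_rate), sentence
```

Because the corruption is drawn when a sentence is sampled rather than stored ahead of time, calling `training_pairs` once per epoch with the same random generator gives the model new perturbations of the same clean targets on each pass.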

Findings

All pretrained models outperform their randomly initialized counterparts. The best configuration, BERT2GPT2, attains a word error rate of 6.86%, compared with 8.02% for the baseline, an absolute reduction of 1.16 percentage points (about 14.5% relative). The proposed system achieves state-of-the-art results on both QALB shared tasks.
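
For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance between the system output and the reference, divided by the number of reference words. The sketch below implements that standard definition in Python; the function name `word_error_rate` and the whitespace tokenization are assumptions, and the abstract does not state how the study aggregates the metric (for example, per sentence or over the whole test set).

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from output
            insertion = curr[j - 1] + 1   # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, an output that differs from a four-word reference by a single substituted word has a word error rate of 0.25.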

Originality/value

To the best of our knowledge, this work provides the first large-scale, high-quality Arabic corpus accompanied by a scalable error-synthesis framework and a comprehensive transfer-learning study for spelling correction. We release a novel dataset comprising 3,221,524 samples. The dynamic error-synthesis pipeline systematically injects real-word, non-word, grammatical and punctuation errors, demonstrating practical applicability for training robust correction models.
