Lexical density in English newspapers – a cross-analysis of the New York Times and Arab News

Alzahrani, Hayat

doi:10.1108/SJLS-12-2024-0070

Purpose

This study examines the lexical density of articles from The New York Times (TNYT) and Arab News (AN), analyzing its impact on the readability and accessibility of content for general audiences, particularly non-native English speakers. It also considers other features affecting readability in these newspapers.

Design/methodology/approach

Lexical density is measured with Ure’s content-to-total word ratio, and readability is calculated with the Flesch–Kincaid Grade Level. Descriptive statistics and Pearson’s correlation coefficient are used to determine the relationship between lexical density and readability. A qualitative content analysis provides an in-depth examination of contextual factors affecting text comprehensibility. The dataset comprises 100 articles, 50 from TNYT and 50 from AN, published between 2022 and 2024 and covering topics related to climate change and the environment.

Findings

The findings revealed that AN has an average lexical density of 56.54%, while TNYT has 57.96%. This difference is not statistically significant. In addition, lexical density has a positive correlation (0.6332) with Flesch–Kincaid Grade-Level scores, which implies that texts become harder to read as lexical density increases. Qualitative findings indicate that AN emphasizes regional adaptation and provides more contextual explanation, which can increase comprehensibility for non-native English speakers.

Research limitations/implications

One limitation of this research is that it examines only two English-language newspapers, one from a Western region and one from a Middle-Eastern region. The findings therefore cannot accurately reflect the audiences in these two regions. Additionally, this study did not compare tolerance thresholds for lexical density among different groups, such as native and non-native English speakers. Future research should examine the same topic using other Western and Middle-Eastern newspapers that were not covered in this study. In addition, expert interviews and comprehension tests should be administered to compare the impact of lexical density on comprehensibility between native and non-native English speakers.

Originality/value

This research contributes to the knowledge of how lexical density affects readability in Western and non-Western English newspapers. It provides insights for media outlets aiming to enhance content accessibility for diverse, global audiences.

English-language newspapers serve many individuals across the world. The media significantly contributes to disseminating information on human relationships, raising awareness of pressing concerns, and influencing changes in consumption behavior. Increasing digitization allows individuals to search for information from online media outlets. Newspapers such as The New York Times (TNYT), published in the United States, and Arab News (AN), representing the Saudi-Arabian population, disseminate information in English. Lexical density, defined as the ratio of content words to the overall number of words, has emerged as a crucial factor in determining the readability and complexity of a discourse (Bakuuro, 2024; Khorina and Handani, 2022). Variation in lexical density can create barriers to the overarching aim of media outlets to inform and engage diverse sociodemographic groups. Previous research has indicated that higher lexical density increases cognitive load and reduces reading speed (Liu and Dou, 2023). This study examines the lexical density of TNYT and AN articles published between 2022 and 2024. The focus of this study is significant given the two publications' different cultural and socio-political contexts and the increased adoption of English as a global language. In addition, the comparative approach is relevant since AN serves a predominantly English as a second language (ESL) audience. Moreover, research has not explored the lexical density, readability, and comprehensibility of newspapers in cross-cultural settings. This study can suggest strategies for enhancing the accessibility of news content to diverse audiences.

Research questions

This study explores three research questions:

RQ1.

How does lexical density differ between articles published between 2022 and 2024 in TNYT and AN?

RQ2.

What is the impact of the lexical density of these articles on their comprehensibility and readability for the general audience?

RQ3.

What factors other than lexical density affect the comprehensibility and readability of the articles in the two newspapers?

Originality and contribution

This research is the first attempt to compare the lexical density and the overall readability of newspapers representing two different cultures, The New York Times and Arab News. Therefore, it reveals patterns in cross-cultural journalism that can guide organizations in adjusting their writing patterns for increased engagement of audiences in the global segment. Moreover, the results can help news entities in the development of effective strategies to enhance the readability and comprehensibility of articles without sacrificing their information-carrying capacity.

Definition of terms

Lexical density: The proportion of content words relative to the total word count in a text or corpus (Ure, 1971).

Content words: Nouns, verbs, adjectives, and adverbs that carry meaning (Ure, 1971).

Readability: The ease with which an individual can read and understand a material (Kembaren and Aswani, 2022).

Comprehensibility: The degree to which a text can be understood by its intended audience (Kembaren and Aswani, 2022).

Lexical items: Individual words that have semantic content and contribute to the overall lexical density in a text (Khorina and Handani, 2022).

Literature review

Lexical density indicates the information-carrying capacity in written and oral discourse. Lexical density is measured as the proportion of content words, which primarily encompass adverbs, verbs, nouns, and adjectives, in the corpus or overall text (Ure, 1971). In other approaches for computing lexical density, lexemes are expressed as a ratio of a ranking clause (words that convey a complete idea) (Halliday, 1985). Most studies rely on Ure’s (1971) method in which the functional perspective of language is de-emphasized, and multiple words are not integrated into a single unit (Aziz and Riaz, 2024). Ure’s (1971) method is more straightforward and applies primarily to linguistic appraisal.

Lexical density in news media

Several studies have examined lexical density in spoken and print media outlets. Examination of TNYT, Business of Fashion, and BBC News using Halliday’s (1985) and Ure’s (1971) approaches led to lexical density scores above 50% (Kembaren and Aswani, 2022; Khorina and Handani, 2022; Rahayu and Syaifullah, 2022). Khorina and Handani (2022) used a sample of 10 pieces of written and spoken news from BBC News. The findings indicate that written news had a slightly higher average lexical density (55.05%) than spoken samples (52.15%). Research has also indicated that the lexical density of magazine articles varies significantly depending on the topic and target audience. Sari and Ekawati (2021) found that the lexical density of articles in Reader’s Digest Magazine was 64.5% for travel, 54.68% for parenting, 58.84% for language, 61.56% for animals, and 61.34% for health. The findings indicate that the majority of the articles could be classified as intricate since their lexical density values were above 60%.

Readability and comprehension

Few studies have established a relationship between lexical density, readability, and comprehension in different contexts. Bakuuro (2024) applied both Flesch’s (1948) and Gunning’s (1952) frameworks to examine the readability of high school English texts. The author inferred a negative correlation between lexical density and readability. Ure’s (1971) and Halliday’s (1985) approaches provided similar values of linguistic complexity. Liu and Dou (2023) indicate that higher ratios of lexemes are associated with a greater cognitive load during information processing. In another study (Rahayu and Syaifullah, 2022), TNYT articles had a lexical density of 58.1%, congruent with a grammatical intricacy level of 12, compared to the Business of Fashion's lexical density of 54.2% and corresponding grammatical intricacy level of 8. Nevertheless, that study did not examine the relationship between readability and lexical density. While widely used in the literature, readability formulas have been criticized for relying on a limited number of textual features to predict reader comprehension (Crossley, 2024). Studies have established additional factors that influence reader comprehension. For example, Ji et al. (2023) demonstrated that the distribution of information over long documents influences comprehension. In addition, Hackemann et al. (2022) indicated that technical terms and longer sentences with multiple clauses can substantially reduce the comprehensibility of a corpus.

Cross-cultural considerations

Studies have examined English-language newspapers in ESL countries, revealing important cultural variations (Ghani et al., 2022; Liu and Dou, 2023; Salihoglu and Karatepe, 2023). Using Halliday’s (1985) model of systemic functional linguistics, Ghani et al. (2022) found that the ratios of lexemes in Pakistani and UK news articles were 49.89% and 53.09%, respectively. The findings suggest that cultural adaptations in non-native English-speaking countries often lead to simplified syntactic structures. Liu and Dou (2023) found that translation processes in cross-cultural contexts affect lexical complexity: direct expression of content in English was linked with more complex vocabulary and more content words than the interpretation of texts from Chinese or Russian into English.

Research gap and justification of research objectives

Some studies have examined lexical density indices in Western newspapers such as BBC News and TNYT (Kembaren and Aswani, 2022; Khorina and Handani, 2022; Rahayu and Syaifullah, 2022; Sari and Ekawati, 2021). However, the limited research that has examined Western and Middle-Eastern English-language newspapers did not consider the relationship between information density and corresponding readability levels (Ghani et al., 2022; Salihoglu and Karatepe, 2023). This study fills this gap by comparing the lexical density and readability of articles from two cultural contexts, Western and Middle-Eastern publications. To address this gap, the study is structured around three objectives:

(1)
To examine the difference between lexical density of articles published in TNYT and AN.
(2)
To investigate the impact of lexical density on readability and comprehensibility.
(3)
To identify factors beyond lexical density that affect readability.

The examination of these objectives will address the gap in the role of lexical and structural choices in cross-cultural news. In addition, the objectives provide a framework for understanding how English-language newspapers can adjust and present their messages in a manner that aligns with the diverse international audiences. The objectives support a study approach that combines quantitative and qualitative approaches, which contributes to a better understanding of textual complexity, as well as other cultural-linguistic aspects influencing the reader's comprehension. The findings are relevant to news agencies aiming to disseminate detailed informational content while ensuring that their articles demonstrate reasonable comprehensibility and readability for both the local and the international audiences.

Methodology

Design

This study applied a mixed-methods approach to examine the lexical density and readability of TNYT and AN articles. The quantitative component involves analyzing numerical data to establish facts about the phenomenon under study, and it reduces the bias associated with the subjective interpretation of results inherent to qualitative designs. A quantitative approach is relevant given that the current study compares linguistic measures for the two newspapers. A qualitative content analysis allowed for in-depth analysis of contextual factors affecting text comprehensibility.

Data collection

Sampling

The sample in this study comprised 100 newspaper articles published between 2022 and 2024 and covering topics related to climate change and the environment. Fifty articles were obtained from TNYT and the remaining fifty from AN. Purposive sampling was applied to help the researcher select articles relevant to the study objectives. This sampling approach can enhance the internal validity of findings by focusing on articles that represent the phenomenon under investigation. The method was useful for collecting specific and meaningful data related to lexical density in media discourse.

Data sources and selection criteria

Articles were retrieved from the official websites of TNYT and AN. To be selected, an article had to be written in English and contain a minimum of 300 words. In addition, articles from wire services were excluded from the study.

Quantitative analysis

Lexical density measurement

Ure’s (1971) formula was applied to evaluate the lexical density of the selected articles. The content-to-total word ratio indicates the percentage of nouns, verbs, adjectives, and adverbs (lexical items) in the text. Function words such as prepositions are usually excluded from the count of lexical items due to a lack of semantic content. The formula is given below (Khorina and Handani, 2022):

Lexical density = [(Lexical items) / (total number of words)] * 100
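As a minimal sketch of the ratio above, the following Python function computes lexical density from tokens that have already been tagged for part of speech. The tag names (NOUN, VERB, ADJ, ADV), the function name, and the sample sentence are illustrative assumptions and not part of the study's procedure; the tagging step itself is assumed to be done elsewhere.

```python
# Tags treated as lexical items (content words). The tag names are an
# assumption for this sketch; any tagger's labels could be mapped onto them.
CONTENT_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}


def lexical_density(tagged_tokens):
    """Return lexical density (percent) for a list of (word, tag) pairs."""
    # Keep only tokens containing a letter or digit, so that punctuation
    # does not inflate the total word count.
    words = [(w, t) for w, t in tagged_tokens if any(ch.isalnum() for ch in w)]
    if not words:
        raise ValueError("text contains no words")
    lexical_items = sum(1 for _, tag in words if tag in CONTENT_TAGS)
    return 100.0 * lexical_items / len(words)


# Hypothetical, hand-tagged sample sentence (not taken from the dataset).
sample = [
    ("Rising", "ADJ"), ("seas", "NOUN"), ("threaten", "VERB"),
    ("the", "DET"), ("coastal", "ADJ"), ("cities", "NOUN"),
    ("of", "ADP"), ("the", "DET"), ("region", "NOUN"), (".", "PUNCT"),
]

# 6 lexical items out of 9 words.
print(round(lexical_density(sample), 2))  # prints 66.67
```

Whether borderline categories such as auxiliary verbs count as lexical items is a coding decision that the sketch leaves to the tag mapping.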

Comprehensibility and readability analysis

In addition, the Flesch-Kincaid Grade Level, developed by Kincaid et al. (1975), was used to calculate readability. Readability is defined as the proficiency level required for individuals to comprehend the text; it was therefore treated as synonymous with comprehensibility (Kembaren and Aswani, 2022). The Flesch-Kincaid framework uses sentence length and syllable count to establish the grade level required to comprehend each text. As shown in Table 1, the complexity of the text increases as the Flesch-Kincaid Grade Level score rises (Kincaid et al., 1975). The Automatic Readability Checker, a text analysis tool that scores texts using the Flesch-Kincaid Grade Level, was used for data entry and processing. Manual verification and adjustment were performed. Data were stored in an Excel spreadsheet for further statistical analysis. The Flesch-Kincaid formula for appraising readability is as follows (Kincaid et al., 1975):

\text{Flesch-Kincaid Grade Level} = 0.39 \left( \frac{\text{total words}}{\text{total sentences}} \right) + 11.8 \left( \frac{\text{total syllables}}{\text{total words}} \right) - 15.59
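The grade-level calculation can be sketched in a few lines of Python. This is a hedged illustration only: it uses the standard published form of the Kincaid et al. (1975) formula, in which the syllable term is added, and it takes the word, sentence, and syllable counts as given rather than reproducing how the Automatic Readability Checker derives them. The function name and the example counts are hypothetical.

```python
def flesch_kincaid_grade(total_words, total_sentences, total_syllables):
    """Flesch-Kincaid Grade Level computed from raw counts."""
    if total_words <= 0 or total_sentences <= 0:
        raise ValueError("word and sentence counts must be positive")
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


# Hypothetical counts: 100 words in 5 sentences with 150 syllables,
# i.e. 20 words per sentence and 1.5 syllables per word.
grade = flesch_kincaid_grade(total_words=100, total_sentences=5, total_syllables=150)
print(round(grade, 2))  # prints 9.91
```

Longer sentences and longer words both raise the score, which is why the two averages enter with positive coefficients.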

Table 1

Interpretation of Flesch-Kincaid grade level

Flesch-Kincaid score	School level	Age range	Readability ease
1–5	1st–5th Grade	11	Very Easy
6	6th Grade	11–12	Easy
7	7th Grade	12–13	Fairly Easy
8–9	8th–9th Grade	13–15	Standard; Plain English
10–11	10th–11th Grade	15–17	Fairly Difficult
12–15	College	17–20	Difficult
16+	College Graduates	20+	Very Difficult

Source(s): Adapted from Kembaren and Aswani (2022) and Kincaid et al. (1975)

Statistical analysis

Descriptive statistics were computed to determine the means and standard deviations for lexical density and Flesch-Kincaid grade-level scores. The relationship between lexical density and readability scores was examined by computing Pearson's correlation coefficient. In addition, independent-samples t-tests were conducted to examine whether the mean lexical density of the two newspapers differed significantly.
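The statistical steps described above can be illustrated with a short Python sketch that uses only the standard library. The data values are invented toy numbers, not the study's measurements, and the helper names are hypothetical. The t statistic shown is the pooled-variance (Student) form; the source does not state which variant was used, so that choice is an assumption.

```python
import math
from statistics import mean, stdev, variance


def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("samples must have equal length of at least 2")
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def pooled_t(a, b):
    """Independent-samples t statistic (pooled variance) and degrees of freedom."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    # Pooled variance weights each sample variance by its degrees of freedom.
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(sp2 * (1 / na + 1 / nb))
    return t, df


# Toy lexical density values (percent) for two hypothetical newspapers.
density_a = [55.0, 58.0, 61.0, 57.0]
density_b = [54.0, 56.0, 59.0, 55.0]
# Toy grade-level scores paired with the eight articles above, in order.
grades = [10.2, 11.5, 13.0, 11.0, 9.8, 10.9, 12.1, 10.4]

print("mean A:", mean(density_a), "SD A:", round(stdev(density_a), 2))
print("mean B:", mean(density_b), "SD B:", round(stdev(density_b), 2))
print("Pearson r:", round(pearson_r(density_a + density_b, grades), 4))
t_stat, dof = pooled_t(density_a, density_b)
print("t:", round(t_stat, 4), "df:", dof)
```

The t statistic would then be compared against the t distribution with the reported degrees of freedom to obtain a p-value, a step that statistical packages perform directly.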

Qualitative content analysis

The qualitative content analysis involved selecting five articles from each of the two newspapers. A thorough examination of the articles is essential for familiarization and for understanding assumptions and patterns in the content (Nicmanis, 2024). Sentences and phrases from the articles were analyzed to determine cultural-linguistic factors that can affect comprehensibility for a general audience. The researcher formed categories and themes that allowed the identification of similarities and differences between AN's and TNYT's climate change reporting styles.

Validity and reliability

Several measures were taken to ensure that the findings reflected the linguistic characteristics of each newspaper. Reliability concerns the consistency of measures, while validity concerns their accuracy. Ure’s (1971) lexical density formula and the Flesch-Kincaid Grade Level (Kincaid et al., 1975) have been shown to be reliable in previous studies. In addition, two researchers independently measured the variables for a subset of articles, and the results indicated a high level of inter-rater reliability. The use of an established linguistic measure (Ure, 1971) and the unambiguous operationalization of readability and lexical density enhanced the content and construct validity of the study. Nevertheless, external validity might be limited to newspapers with audiences and styles similar to those examined in this research.

Results

The sample for this study comprised 100 articles from TNYT and AN. Table 2 shows the descriptive statistics of the lexical density for the two newspapers. The summary of the individual lexical density elements for each newspaper is shown in Table 3. The average readability values in the form of Flesch-Kincaid Grade Level for TNYT and AN are depicted in Table 4. In addition, this study involved testing the statistical significance of the difference between TNYT and AN lexical density, which is shown in Table 5. A correlational analysis of the lexical density and Flesch-Kincaid Grade Level scores of all the articles from two newspapers is shown in Table 6.

Table 2

Lexical density results for TNYT and AN articles

Elements	TNYT	AN
Mean lexical density	57.96%	56.54%
Standard deviation	4.00	3.55
Range	50.12–63.36%	51.26–63.96%
Median	59.08%	56.62%