The increasing use of the terms “fake data” and “fake information” in digital systems has led to conceptual ambiguity and inconsistent usage across the literature. This study aims to clarify the definitional boundaries and relationships between these terms by examining how they are used and interpreted in existing research.
A structured literature review was conducted following the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines. A total of 334 studies published between 2009 and 2026 were selected after screening and eligibility assessment. Their abstracts were analyzed thematically: recurring patterns were identified and coded along dimensions such as intent of creation, transformation processes and usage context.
The results reveal that neither “fake data” nor “fake information” is explicitly defined in the literature, with meanings typically inferred from context. The analysis also shows a marked increase in the use of these terms since 2020, with fake data more prevalent in technical domains such as cybersecurity and data science, and fake information more common in media and social contexts.
The review is restricted to computer science and related fields (e.g. IT, data science and artificial intelligence), which may limit the generalizability of the findings to other disciplines. Despite this limitation, the study highlights the need for consistent terminology and provides a foundation for future research to refine and validate the proposed conceptual distinctions across broader contexts.
The proposed framework supports improved classification, risk assessment and intervention strategies by distinguishing between data-level manipulation and information-level distortion in digital systems.
This study introduces a two-level framework, grounded in the data–information–knowledge hierarchy, that resolves ambiguity in the use of the terms “fake data” and “fake information”. The first level distinguishes between legitimate and deceptive intent, while the second differentiates by what is produced (data vs. information). This structure enables a clearer conceptual separation between fake data, fake information and synthetic data.
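
As an illustration only, the two-level distinction can be sketched as a small lookup in Python. The names `Intent`, `Output`, `LABELS` and `classify` are hypothetical, and the assignment of each term to a cell is an assumption inferred from the wording above, not a definition taken from the study; the combination of legitimate intent with information is left unnamed because the abstract does not label it.

```python
from enum import Enum
from typing import Optional


class Intent(Enum):
    """Level 1 of the framework: why the artefact was created."""
    LEGITIMATE = "legitimate"
    DECEPTIVE = "deceptive"


class Output(Enum):
    """Level 2 of the framework: what is being produced."""
    DATA = "data"
    INFORMATION = "information"


# ASSUMPTION: this cell assignment is inferred from the abstract's wording.
# The abstract names three terms but does not state which cell each occupies,
# and it gives no name for (LEGITIMATE, INFORMATION), so that cell is omitted.
LABELS = {
    (Intent.LEGITIMATE, Output.DATA): "synthetic data",
    (Intent.DECEPTIVE, Output.DATA): "fake data",
    (Intent.DECEPTIVE, Output.INFORMATION): "fake information",
}


def classify(intent: Intent, output: Output) -> Optional[str]:
    """Return the assumed label for a cell, or None if the cell is unnamed."""
    return LABELS.get((intent, output))


if __name__ == "__main__":
    for intent in Intent:
        for output in Output:
            print(intent.value, output.value, classify(intent, output))
```

Keeping the two levels as separate enumerations mirrors the framework's claim that intent and output type are independent dimensions, so a fuller taxonomy could extend either one without touching the other.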
