Shadow archives and underground RAGs: copyright evasion as a mechanism for knowledge resilience in competitive innovation

Li, Rende

doi:10.1108/JD-01-2026-0011


Research Article | June 02 2026

Rende Li (ORCID: 0000-0003-1056-8188)

University of Shanghai for Science and Technology, Shanghai, China

Author & Article Information

Rende Li can be contacted at: lirende@usst.edu.cn

Publisher: Emerald Publishing

Received: January 08 2026

Revision Received: April 12 2026

Accepted: May 06 2026

Online ISSN: 1758-7379

Print ISSN: 0022-0418

Funding

This work was supported by the China Youth and Children Research Association “Climb Plan” (award no. G2025-0901-005) and Projects of USST (award nos. CFTD2026ZD17, JGXM262229, SH2026295, SH2026296).

2026 Emerald Publishing Limited. Licensed re-use rights only.

Journal of Documentation 1–21.

https://doi.org/10.1108/JD-01-2026-0011

Purpose

This paper investigates how innovation teams construct shadow archives, defined as unauthorised, machine-readable knowledge bases that circumvent copyright restrictions, to fuel retrieval-augmented generation (RAG) systems under competitive time pressure. The study examines whether such evasion measurably improves innovation outcomes and how institutional actors, specifically academic librarians, enable or tolerate these practices.

Design/methodology/approach

A computational digital ethnography was conducted during a two-week university bootcamp for 20 student teams (N = 60) preparing for the 2025 Mathematical Contest in Modeling. The research triangulated (1) SBERT vector-space comparison of each team's shadow archive against official library holdings, (2) RAG query-log analysis to quantify shadow dependency and (3) participant observation and post-competition interviews to trace data-laundering routines and librarian mediation.
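The abstract does not specify how the two quantitative measures were computed, so the following is only an illustrative sketch under stated assumptions. It assumes that the vector-space comparison is a mean pairwise cosine distance between document embeddings from the two collections, and that shadow dependency is the share of retrieved chunks in a query log that come from the shadow archive. The function names and the "shadow" / "licensed" source labels are hypothetical and do not come from the paper.

```python
import math


def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_cosine_distance(shadow_vecs, licensed_vecs):
    """Assumed divergence measure: mean distance over all cross-collection pairs."""
    dists = [cosine_distance(u, v) for u in shadow_vecs for v in licensed_vecs]
    return sum(dists) / len(dists)


def shadow_dependency(retrieved_sources):
    """Assumed dependency measure: fraction of retrieved chunks labelled 'shadow'.

    `retrieved_sources` is a hypothetical flat list of source labels,
    one per chunk retrieved across a team's RAG queries.
    """
    if not retrieved_sources:
        return 0.0
    shadow_hits = sum(1 for label in retrieved_sources if label == "shadow")
    return shadow_hits / len(retrieved_sources)


if __name__ == "__main__":
    # Toy embeddings standing in for SBERT vectors (real ones have hundreds of dimensions).
    shadow = [[1.0, 0.0], [0.0, 1.0]]
    licensed = [[1.0, 0.0]]
    print(mean_cosine_distance(shadow, licensed))
    print(shadow_dependency(["shadow", "licensed", "shadow", "shadow"]))
```

In practice the embeddings would come from an SBERT model rather than hand-written vectors, and the paper's actual operationalisation may differ from both definitions used here.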

Findings

Shadow archives diverged semantically from licensed collections (mean cosine d = 0.47, p < 0.001) and incorporated 68% grey literature, preprints and pirated texts. Teams with higher shadow dependency achieved significantly better competition scores (β = 0.62, p < 0.001, R² = 0.68). Librarians facilitated this outcome through studied ambiguity: teaching digital rights management (DRM) removal tools while disclaiming legal responsibility, thereby normalising a five-stage data-laundering pipeline (acquisition, DRM circumvention, OCR, format conversion, vectorisation).

Originality/value

The study reconceptualises copyright not as a binary compliance variable but as a socio-material boundary that spawns parallel knowledge subsystems. It introduces data laundering as an empirical process and quantifies an innovation premium from copyright evasion, demonstrating that rigid licensing paradoxically undermines the knowledge resilience it purports to protect.
