This paper investigates how innovation teams construct shadow archives, defined as unauthorised, machine-readable knowledge bases that circumvent copyright restrictions, to fuel retrieval-augmented generation (RAG) systems under competitive time pressure. The study examines whether such evasion measurably improves innovation outcomes and how institutional actors, specifically academic librarians, enable or tolerate these practices.
A computational digital ethnography was conducted during a two-week university bootcamp for 20 student teams (N = 60) preparing for the 2025 Mathematical Contest in Modeling. The research triangulated (1) SBERT vector-space comparison of each team's shadow archive against official library holdings, (2) RAG query-log analysis to quantify shadow dependency and (3) participant observation and post-competition interviews to trace data-laundering routines and librarian mediation.
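The two quantitative measures named above can be illustrated with a minimal sketch. It rests on assumptions the abstract does not state: that each archive is summarised by the centroid of its precomputed document embeddings (for example, SBERT vectors), that divergence is one minus cosine similarity between centroids, and that shadow dependency is the share of retrieved chunks whose source is the shadow archive. The function names and the `"shadow"` / `"licensed"` labels are hypothetical.

```python
import math
from typing import List, Sequence


def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Mean vector of a collection of equal-length embeddings."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def shadow_dependency(query_log: Sequence[Sequence[str]]) -> float:
    """Fraction of retrieved chunks drawn from the shadow archive.

    query_log holds, per query, the source label of each retrieved chunk
    (hypothetical labels: "shadow" or "licensed").
    """
    retrieved = [source for query in query_log for source in query]
    if not retrieved:
        return 0.0
    return sum(1 for s in retrieved if s == "shadow") / len(retrieved)


if __name__ == "__main__":
    # Toy two-dimensional embeddings, for illustration only.
    shadow_docs = [[1.0, 0.0], [3.0, 2.0]]
    library_docs = [[0.0, 1.0], [0.0, 3.0]]
    print(cosine_distance(centroid(shadow_docs), centroid(library_docs)))
    print(shadow_dependency([["shadow", "licensed"], ["shadow", "shadow"]]))
```

Comparing centroids is the simplest aggregation; a study could equally average pairwise document distances, which would yield a different value for the same archives.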
Shadow archives diverged semantically from licensed collections (mean cosine distance d = 0.47, p < 0.001) and incorporated 68% grey literature, preprints and pirated texts. Teams with higher shadow dependency achieved significantly better competition scores (β = 0.62, p < 0.001, R² = 0.68). Librarians facilitated this outcome through studied ambiguity: teaching digital rights management (DRM) removal tools while disclaiming legal responsibility, thereby normalising a five-stage data-laundering pipeline (acquisition, DRM circumvention, OCR, format conversion, vectorisation).
The study reconceptualises copyright not as a binary compliance variable but as a socio-material boundary that spawns parallel knowledge subsystems. It introduces data laundering as an empirical process and quantifies an innovation premium from copyright evasion, demonstrating that rigid licensing paradoxically undermines the knowledge resilience it purports to protect.
