Digitising historical resources has become a critical step in historical research, augmenting manual processes and facilitating big data studies. However, older material is often in poorer physical condition, which can lower the quality of its digitisation and bias the resulting data. We document and quantify this bias in the context of a trademark registry reconstruction project, which involves identifying and matching graphical trademarks from notifications published in the Official Gazette during the British Mandate in Palestine (1917–1948).
We develop a rule-based pipeline combining graphical object extraction and optical character recognition (OCR) to match trademark images to metadata records across more than 300 Gazette editions spanning 26 years. To measure the temporal dimension of digitisation quality, we apply logistic regression to the full dataset of 7,263 trademarks, isolating the effect of publication date on matching probability independently of structural layout changes.
The pipeline achieves an overall identification rate of 86.6%. The logistic regression shows a statistically significant temporal effect: each additional publication year is associated with a 0.38 percentage-point increase in matching probability, corresponding to an improvement of approximately 13% over the 26-year study period. The lower accuracy for 1920s editions is attributable to physical paper deterioration and scan quality rather than structural layout changes.
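As an illustration of how a per-year effect in percentage points can be read off a logistic model, the following minimal sketch fits a one-predictor logistic regression and reports the average marginal effect of publication year. Everything in it is an assumption made for illustration: the data are synthetic, the function names are hypothetical, the layout control mentioned above is omitted, and the average marginal effect is only one common way of converting a logistic coefficient into percentage points, not necessarily the one used in this study.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, iterations=25):
    """Fit P(y=1) = sigmoid(b0 + b1*x) by Newton-Raphson; returns (b0, b1)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p            # gradient w.r.t. intercept
            g1 += (y - p) * x      # gradient w.r.t. slope
            h00 += w               # entries of the 2x2 information matrix
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: beta += H^{-1} g, with the 2x2 inverse written out
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def average_marginal_effect(b0, b1, xs):
    """Mean of dP/dx = b1 * p * (1 - p) over the observed x values."""
    total = 0.0
    for x in xs:
        p = sigmoid(b0 + b1 * x)
        total += b1 * p * (1.0 - p)
    return total / len(xs)


# Synthetic stand-in data (NOT the Gazette dataset): publication year is
# expressed as years since the first edition (0..25), centred at 12.5.
rng = random.Random(0)
years = [rng.randint(0, 25) - 12.5 for _ in range(5000)]
matched = [1 if rng.random() < sigmoid(1.9 + 0.04 * x) else 0 for x in years]

b0, b1 = fit_logistic(years, matched)
ame = average_marginal_effect(b0, b1, years)
print(f"slope per year (log-odds): {b1:.4f}")
print(f"average marginal effect: {100 * ame:.2f} percentage points per year")
```

A model closer to the one described would add an indicator for each layout regime as a further predictor, so that the year coefficient is estimated net of layout changes.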
The pipeline is tailored to the Palestine Gazette’s specific conventions and relies on commercial OCR software, so it cannot be reused on other sources without adaptation. The study draws on a single jurisdiction and source type, so the magnitude of chronological bias may differ in other corpora.
The rule-based extraction and matching pipeline provides a replicable, low-threshold tool for analogous digitisation projects, including other colonial trademark or patent registries and official gazettes, without requiring advanced machine-learning infrastructure. The reconstructed British Mandate trademark registry, now openly available, enables historical and comparative research on commercial activity, legal history and colonial governance. The finding of a chronological bias suggests that digitisation project managers should incorporate temporal bias measurement into quality-assurance workflows and stratify or weight findings by publication period rather than aggregating results uniformly across time.
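The stratification and weighting recommended above can be sketched as follows. This is a minimal illustration under stated assumptions: the record format (publication year plus a matched flag), the five-year period length, the function names and the sample records are all hypothetical, and inverse-rate weighting is one simple option among several.

```python
from collections import defaultdict


def rates_by_period(records, period_length=5):
    """Return {period_start_year: matching rate} for (year, matched) pairs."""
    counts = defaultdict(lambda: [0, 0])  # period -> [matched, total]
    for year, matched in records:
        start = year - year % period_length
        counts[start][0] += int(matched)
        counts[start][1] += 1
    return {start: m / n for start, (m, n) in sorted(counts.items())}


def inverse_rate_weights(rates):
    """Weight matched records by 1/rate so poorly recovered periods count more.

    Periods with a zero matching rate cannot be reweighted and map to None.
    """
    return {start: (1.0 / r if r > 0 else None) for start, r in rates.items()}


# Hypothetical sample records, not taken from the Gazette dataset.
records = [(1925, True), (1926, False), (1941, True), (1942, True)]
rates = rates_by_period(records)
weights = inverse_rate_weights(rates)
print(rates)
print(weights)
```

Reporting the per-period rates alongside the overall figure makes it visible when an aggregate rate hides weaker coverage of the earliest editions.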
Our research reveals that a newspaper-analysing algorithm performs less well on older editions, which can introduce a chronological bias that skews historical data towards more recent periods. We document and highlight this factor to support more accurate and balanced historical research. The research tool we developed facilitates similar digitisation projects for trademarks and other images, along with their accompanying information. Such digitisation is a necessary step in reconstructing old or lost trademark registries, which can then contribute to understanding the economic and cultural dimensions of the relevant jurisdiction.
