Automatic classification of research data sets into the Chinese Library Classification with generative large language model

Luo, Pengcheng; Hong, Lingzi; Nie, Lei

doi:10.1108/EL-02-2025-0042

Research Article | June 10, 2025

Pengcheng Luo

Peking University Library, Peking University, Beijing, China

Lingzi Hong

College of Information, University of North Texas, Denton, Texas, USA

Lei Nie

Country and Area Studies Academy, Beijing Foreign Studies University, Beijing, China

Lei Nie can be contacted at: nielei@bfsu.edu.cn

Publisher: Emerald Publishing

Received: February 02 2025

Revision Received: April 15 2025

Accepted: May 15 2025

Online ISSN: 1758-616X

Print ISSN: 0264-0473

Funding

Award Id(s): 24CTQ022

2025 Emerald Publishing Limited. Licensed re-use rights only.

The Electronic Library (2025) 43 (4): 600–618.

https://doi.org/10.1108/EL-02-2025-0042

Purpose

Research data sets are typically distributed across different data repositories and lack standardized classification information, which hinders effective discovery and access. This study aims to develop an automated method that assigns Chinese Library Classification (CLC) codes to data sets so that users can search and browse them more easily.

Design/methodology/approach

This study experiments with a three-step method for the automatic classification of research data sets. First, a multilingual classification model is trained to identify data sets with valid descriptions. Second, a multilingual generative large language model, fine-tuned on book bibliographic data, generates CLC codes for data sets from their valid descriptions. Finally, the generated CLC codes are validated and corrected using a prefix tree constructed from valid CLC codes.
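The third step, validating and correcting generated codes against a prefix tree, can be illustrated with a minimal sketch. The abstract does not say how faulty codes are corrected, so the rule used here (fall back to the longest valid prefix of the generated code) is an assumption, as are the names `CLCTrie`, `correct_code` and `SAMPLE_CODES`. The sample codes are a small illustrative list, not the CLC schedule.

```python
class CLCTrie:
    """Character-level prefix tree over a set of valid classification codes."""

    _END = ""  # terminal marker; cannot collide with a single-character key

    def __init__(self, valid_codes):
        self.root = {}
        for code in valid_codes:
            node = self.root
            for ch in code:
                node = node.setdefault(ch, {})
            node[self._END] = True

    def longest_valid_prefix(self, code):
        """Return the longest prefix of `code` that is a valid code, or None."""
        node = self.root
        best = None
        for i, ch in enumerate(code):
            if ch not in node:
                break
            node = node[ch]
            if self._END in node:
                best = code[: i + 1]
        return best


def correct_code(generated, trie):
    """Assumed correction rule: keep valid codes unchanged, otherwise
    truncate to the longest valid prefix (None if nothing matches)."""
    return trie.longest_valid_prefix(generated.strip())


# Hypothetical sample of valid codes, for illustration only.
SAMPLE_CODES = ["T", "TP", "TP3", "TP39", "TP391", "TP391.1", "G", "G25", "G250"]

trie = CLCTrie(SAMPLE_CODES)
print(correct_code("TP391.1", trie))  # already valid, returned unchanged
print(correct_code("TP391.9", trie))  # invalid tail, truncated to a valid parent
print(correct_code("X99", trie))      # no valid prefix in this sample
```

Truncating to a valid ancestor keeps the output inside the classification scheme at the cost of specificity. Other correction rules, such as choosing the nearest valid sibling code, would fit the same tree structure.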

Findings

Experimental results demonstrate that the proposed three-step method effectively classifies data sets. The CLC codes generated by the model are highly consistent with the classification information provided by the data set contributors, achieving a classification accuracy of 0.8520 at the first-level category and 0.4080 at the full CLC code level.
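To make the two reported levels concrete, the sketch below computes accuracy at the first-level category and at the full code level. It assumes that the first-level category is the leading letter of a CLC code and that a prediction counts as correct only on an exact match at the chosen level. The function name `level_accuracy` and the toy codes are hypothetical and are not data from the study.

```python
def level_accuracy(predicted, reference, prefix_len=None):
    """Share of predictions that match the reference code.

    prefix_len=None compares full codes; prefix_len=1 compares only the
    leading character (assumed here to be the first-level category).
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("at least one code pair is required")

    def key(code):
        return code if prefix_len is None else code[:prefix_len]

    hits = sum(key(p) == key(r) for p, r in zip(predicted, reference))
    return hits / len(reference)


# Toy example (made-up codes, not results from the paper).
predicted = ["TP391.1", "G250.7", "Q94", "TP18"]
reference = ["TP391.4", "G250.7", "R3", "TP18"]

first_level = level_accuracy(predicted, reference, prefix_len=1)
full_code = level_accuracy(predicted, reference)
print(first_level, full_code)
```

Because a full-code match implies a first-level match, first-level accuracy can never be lower than full-code accuracy, which is consistent with the gap between the two reported figures.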

Originality/value

This study proposes a method for the hierarchical classification of multilingual research data sets by accurately identifying data sets with valid descriptions, generating classification codes and correcting faulty codes. It provides a scalable and effective solution for data set classification and management.
