Purpose

Research data sets are typically distributed across different data repositories and lack standardized classification information, which hinders effective discovery and access. This study aims to develop an automated method that assigns Chinese Library Classification (CLC) codes to data sets so that users can search and browse them more easily.

Design/methodology/approach

This study experiments with a three-step method for the automatic classification of research data sets. First, a multilingual classification model is trained to identify data sets with valid descriptions. Second, a multilingual generative large language model fine-tuned on book bibliographic data generates CLC codes for data sets from their valid descriptions. Finally, the generated CLC codes are validated and corrected using a prefix tree built from valid CLC codes.
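The third step can be illustrated with a minimal sketch of prefix-tree validation. The abstract does not say how faulty codes are corrected, so the fallback used here (truncate a generated code to its longest valid prefix) is an assumption, as are the class name `CLCTrie`, its methods, and the sample codes, which are illustrative rather than a complete CLC list.

```python
# Sketch only: a prefix tree (trie) over valid CLC codes.
# Correction rule (an assumption, not taken from the study): fall back to
# the longest prefix of the generated code that is itself a valid code.

_END = None  # sentinel key marking "a valid code ends here"; never a character


class CLCTrie:
    def __init__(self, valid_codes):
        self.root = {}
        for code in valid_codes:
            node = self.root
            for ch in code:
                node = node.setdefault(ch, {})
            node[_END] = True

    def longest_valid_prefix(self, code):
        """Return the longest prefix of `code` that is a valid code, or None."""
        node = self.root
        best = None
        for i, ch in enumerate(code):
            if ch not in node:
                break
            node = node[ch]
            if _END in node:
                best = code[: i + 1]
        return best

    def is_valid(self, code):
        """True only if `code` exactly matches a code stored in the trie."""
        return bool(code) and self.longest_valid_prefix(code) == code

    def correct(self, code):
        """Keep a valid code as is; otherwise truncate to a valid prefix."""
        return self.longest_valid_prefix(code.strip())


# Illustrative sample of codes (hypothetical subset, not the full scheme).
sample_codes = ["T", "TP", "TP3", "TP39", "TP391", "TP391.1", "G25"]
trie = CLCTrie(sample_codes)
corrected = trie.correct("TP391.9")  # not in the sample, so it is truncated
```

Under this rule a code with no valid prefix yields `None`, which a caller could treat as "unclassified" rather than guessing a category.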

Findings

Experimental results demonstrate that the proposed three-step method effectively classifies data sets. The CLC codes generated by the model are highly consistent with the classification information provided by the data set contributors, achieving a classification accuracy of 0.8520 for the first-level category and 0.4080 at the full CLC code level.
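The two accuracy levels reported above could be computed along the following lines. This is a sketch under stated assumptions: the first-level category is taken to be the first character of a CLC code, full-code accuracy is exact string match, and the function name `accuracy_at_levels` and the toy code pairs are hypothetical, not the study's data or evaluation script.

```python
# Sketch only: accuracy at the first-level category and at the full code level.
# Assumptions: first-level category = first character of the code;
# full-code match = exact string equality.

def accuracy_at_levels(predicted, reference):
    """Return (first_level_accuracy, full_code_accuracy) for paired code lists."""
    if not predicted or len(predicted) != len(reference):
        raise ValueError("predicted and reference must be non-empty and equal in length")
    n = len(predicted)
    pairs = list(zip(predicted, reference))
    # An empty prediction never counts as a first-level match.
    first_level = sum(bool(p) and p[:1] == r[:1] for p, r in pairs) / n
    full_code = sum(bool(p) and p == r for p, r in pairs) / n
    return first_level, full_code


# Toy pairs made up for illustration only.
toy_predicted = ["TP391.1", "TP18", "G250", "Q94"]
toy_reference = ["TP391.1", "TP391", "G254", "R1"]
first_acc, full_acc = accuracy_at_levels(toy_predicted, toy_reference)
```

Full-code accuracy can never exceed first-level accuracy under these definitions, since an exact match implies a matching first character.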

Originality/value

This study proposes a method for the hierarchical classification of multilingual research data sets by accurately identifying data sets with valid descriptions, generating classification codes and correcting faulty codes. It provides a scalable and effective solution for data set classification and management.

Licensed re-use rights only