Research data sets are typically distributed across different data repositories and lack standardized classification information, which hinders effective discovery and access. This study aims to develop an automated method that assigns Chinese Library Classification (CLC) codes to data sets, making them easier for users to search and browse.
This study experiments with a three-step method for the automatic classification of research data sets. First, a multilingual classification model is trained to identify data sets with valid descriptions. Second, a multilingual generative large language model fine-tuned on book bibliographic data generates CLC codes for data sets from their valid descriptions. Finally, the generated CLC codes are validated and corrected against a prefix tree built from valid CLC codes.
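The third step can be illustrated with a minimal sketch of prefix-tree validation. The correction rule used here, falling back to the longest prefix that is itself a valid code, is an assumption made for illustration, since the abstract does not state how faulty codes are corrected. The class name `CLCTrie`, the method `correct` and the sample codes are likewise hypothetical.

```python
END = ""  # end-of-code marker; never collides with a real character key


class CLCTrie:
    """Prefix tree over a set of valid classification codes (illustrative)."""

    def __init__(self, valid_codes):
        self.root = {}
        for code in valid_codes:
            node = self.root
            for ch in code:
                node = node.setdefault(ch, {})
            node[END] = True  # mark that a valid code ends at this node

    def correct(self, code):
        """Return the longest prefix of `code` that is a valid code, or None.

        A fully valid code is returned unchanged. Assumed correction rule:
        truncate a faulty code to its longest valid prefix.
        """
        node = self.root
        best = None
        for i, ch in enumerate(code):
            if ch not in node:
                break  # no valid code continues with this character
            node = node[ch]
            if END in node:
                best = code[: i + 1]
        return best


# Illustrative sample of codes, not the full CLC schedule.
trie = CLCTrie(["T", "TP", "TP3", "TP39", "TP391", "TP391.1"])
valid_result = trie.correct("TP391.1")     # already valid, kept as-is
repaired_result = trie.correct("TP391.9X")  # truncated to a valid prefix
rejected_result = trie.correct("ZZ")        # no valid prefix at all
```

Under this rule a generated code never leaves the set of valid codes: it is either accepted, shortened to a coarser valid category, or rejected outright.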
Experimental results demonstrate that the proposed three-step method effectively classifies data sets. The CLC codes generated by the model are highly consistent with the classification information provided by the data set contributors, achieving a classification accuracy of 0.8520 at the first-level category and 0.4080 at the full CLC code level.
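The two accuracy levels can be made concrete with a short sketch. It assumes that the first-level category is the leading character of a CLC code and that full-code accuracy requires an exact string match; the abstract does not spell out either definition. The function name `accuracy_at_levels` and the sample codes are hypothetical and unrelated to the reported results.

```python
def accuracy_at_levels(predicted, reference):
    """Return (first_level_accuracy, full_code_accuracy) for paired code lists.

    Assumptions: the first-level category is the first character of a code,
    and a full-code match is exact string equality.
    """
    if not reference or len(predicted) != len(reference):
        raise ValueError("predicted and reference must be non-empty and equal in length")
    n = len(reference)
    first_hits = sum(
        1 for p, r in zip(predicted, reference) if p and r and p[0] == r[0]
    )
    full_hits = sum(1 for p, r in zip(predicted, reference) if p == r)
    return first_hits / n, full_hits / n


# Toy data for illustration only.
predicted = ["TP391.1", "TP18", "Q94", "F27"]
reference = ["TP391.1", "TP391", "Q95", "G25"]
first_level_acc, full_code_acc = accuracy_at_levels(predicted, reference)
```

On this toy data, three of the four predictions share the reference's first-level category while only one matches the full code, which shows why the two figures can differ widely.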
This study proposes a method for the hierarchical classification of multilingual research data sets by accurately identifying data sets with valid descriptions, generating classification codes and correcting faulty codes. It provides a scalable and effective solution for data set classification and management.
