Building a training dataset for classification under a cost limitation

Chen, Yen-Liang; Cheng, Li-Chen; Zhang, Yi-Jun

doi:10.1108/EL-07-2020-0209

Volume 39, Issue 1

18 May 2021

Editors

Paolo Di Barba;

Fabrizio Dughiero;

Michele Forzan;

Maria Evelina Mognaschi

Research Article | February 24 2021

Yen-Liang Chen;

National Central University, Taoyuan, Taiwan

Li-Chen Cheng;

National Taipei University of Technology, Taipei, Taiwan

Li-Chen Cheng can be contacted at: lijen.cheng@gmail.com

Yi-Jun Zhang

National Central University, Taoyuan, Taiwan

Author & Article Information

Publisher: Emerald Publishing

Received: July 20 2020

Revision Received: October 26 2020

Accepted: November 19 2020

Online ISSN: 1758-616X

Print ISSN: 0264-0473

2021

Emerald Publishing Limited

Licensed re-use rights only

The Electronic Library (2021) 39 (1): 77–96.

https://doi.org/10.1108/EL-07-2020-0209

Purpose

A necessary preprocessing step in document classification is to label some documents so that a classifier can be built and then used to classify the remaining documents. Because documents differ in length and complexity, the cost of labeling differs from one document to another. The purpose of this paper is to consider how to select a subset of documents for labeling under a limited budget, so that the total labeling cost does not exceed the budget while the resulting classifier achieves the best possible classification results.
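
The selection problem described above can be written as a budget-constrained optimization. The notation below is an assumption introduced for illustration and is not taken from the paper: D is the set of unlabeled documents, c_d is the cost of labeling document d, B is the budget, and Acc(S) is the accuracy of a classifier trained on the labeled subset S.

```latex
\max_{S \subseteq D} \; \mathrm{Acc}(S)
\quad \text{subject to} \quad
\sum_{d \in S} c_d \le B
```

Since Acc(S) cannot be evaluated before the documents are labeled, a practical method has to rely on a proxy available beforehand, such as how well S represents the whole collection.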

Design/methodology/approach

This paper proposes a framework for selecting the instances to label that integrates two clustering algorithms and two centroid selection methods. Five different classifiers built from the selected and labeled instances achieved good classification accuracy, demonstrating the superiority of the selected instances.
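
The abstract does not name the clustering algorithms or the centroid selection methods, so the following is only a minimal sketch of the general idea, not the authors' framework. It assumes plain k-means as the clustering step and a simple greedy rule as the selection step: visit the clusters in turn and take the most central instance that still fits the remaining budget. The function names `kmeans` and `select_for_labeling`, the toy points, and the costs are all hypothetical.

```python
import math


def kmeans(points, k, iters=20):
    """Plain k-means; centroids start at the first k points (deterministic)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def select_for_labeling(points, costs, k, budget):
    """Greedy sketch: pick instances near cluster centroids within the budget."""
    labels, centroids = kmeans(points, k)
    # Per cluster: member indices ordered from most to least central.
    queues = []
    for j in range(k):
        members = [i for i, lab in enumerate(labels) if lab == j]
        members.sort(key=lambda i: math.dist(points[i], centroids[j]))
        queues.append(members)
    chosen, spent = [], 0.0
    progress = True
    while progress:
        progress = False
        # Round-robin over clusters so each one stays represented.
        for queue in queues:
            # The remaining budget only shrinks, so an instance that does
            # not fit now will never fit later and can be discarded.
            while queue and spent + costs[queue[0]] > budget:
                queue.pop(0)
            if queue:
                i = queue.pop(0)
                chosen.append(i)
                spent += costs[i]
                progress = True
    return chosen, spent


# Toy data (hypothetical): two well-separated groups of 2-D points,
# each point with its own labeling cost.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.4),
          (5.0, 5.0), (5.1, 4.8), (4.8, 5.3)]
costs = [3, 1, 2, 4, 1, 2]
chosen, spent = select_for_labeling(points, costs, k=2, budget=6)
print(chosen, spent)
```

The round-robin order is a design choice that favors representativeness: every cluster contributes an instance before any cluster contributes a second one. A cost-aware variant could instead rank candidates by distance to the centroid weighted by labeling cost.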

Findings

Experimental results show that the method can build a training data set containing the most suitable data while respecting the cost constraint. Because the selection accounts for both “data representativeness” and “data selection cost,” the training data labeled by experts can be used to build a classifier with high accuracy.

Originality/value

No previous research has considered how to establish a training set with a cost limit when each document has a distinct labeling cost. This paper is the first attempt to resolve this issue.
