Identifying domain relevant user generated content through noise reduction: a test in a Chinese stock discussion forum

Yan, Xiangbin; Li, Yumei; Fan, Weiguo

doi:10.1108/IDD-04-2017-0043

Research Article | November 20, 2017

Xiangbin Yan, University of Science and Technology, Beijing, China

Yumei Li, Harbin Institute of Technology, Harbin, China

Yumei Li can be contacted at: lym_27@126.com

Weiguo Fan, Department of Accounting and Information Systems, Virginia Polytechnic Institute and State University, Blacksburg, Virginia, USA

Publisher: Emerald Publishing

Received: April 20 2017

Revision Received: June 24 2017

Accepted: June 30 2017

Online ISSN: 2398-6255

Print ISSN: 2398-6247

2017 Emerald Publishing Limited. Licensed re-use rights only.

Information Discovery and Delivery (2017) 45 (4): 181–193.

https://doi.org/10.1108/IDD-04-2017-0043

Purpose

Obtaining high-quality data by removing noise from user-generated content (UGC) is the first step toward data mining and effective decision-making based on ubiquitous, unstructured social media data. This paper aims to design a framework for removing noisy data from UGC.

Design/methodology/approach

In this paper, the authors consider a classification-based framework for removing noise from unstructured UGC in social media communities. They treat as noise any message that is not relevant to the topic of concern and apply a text classification-based approach to remove it. They introduce a domain lexicon to help separate on-topic messages from noise, and they compare the performance of several classification algorithms combined with different feature selection methods.
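To make the role of a domain lexicon concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes messages are already tokenized (Chinese text would first need word segmentation), and the lexicon entries, the `lexicon_features` and `is_relevant` helpers, and the `min_ratio` threshold are all hypothetical. The threshold rule merely stands in for a trained classifier.

```python
# Hypothetical domain lexicon; the paper's actual lexicon is not reproduced here.
DOMAIN_LEXICON = {"stock", "shares", "dividend", "earnings", "bullish", "bearish"}


def lexicon_features(tokens, lexicon=DOMAIN_LEXICON):
    """Return (hit_count, hit_ratio) of lexicon terms among the tokens."""
    hits = sum(1 for t in tokens if t.lower() in lexicon)
    ratio = hits / len(tokens) if tokens else 0.0
    return hits, ratio


def is_relevant(tokens, min_ratio=0.2):
    """Toy threshold rule standing in for a trained classifier (assumption)."""
    _, ratio = lexicon_features(tokens)
    return ratio >= min_ratio


# Two illustrative, pre-tokenized messages: one on-topic, one noise.
messages = [
    "earnings beat forecasts and the stock looks bullish".split(),
    "anyone watching the football match tonight".split(),
]

# Keep only the messages judged relevant to the domain.
kept = [m for m in messages if is_relevant(m)]
```

In a classification-based setup such as the one described above, lexicon hit counts like these would more plausibly serve as input features to a learned model than as a hard filter.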

Findings

Experimental results based on a Chinese stock forum show that 84.9 per cent of all the noise data in the UGC could be removed with little loss of valuable information. The support vector machine classifier combined with the information gain feature extraction model is the best choice for this system. Message length also affects system performance: longer messages are classified more accurately.
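Information gain scores a term by how much knowing its presence or absence reduces the entropy of the class label. The sketch below is a generic, pure-Python illustration of that scoring under stated assumptions, not the authors' code: documents are represented as sets of tokens, labels are 1 (relevant) or 0 (noise), and the toy data and helper names are hypothetical. In a full system the top-ranked terms would then be passed to a classifier such as a support vector machine, which is omitted here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(docs, labels, term):
    """Entropy reduction from splitting documents on presence of `term`."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (
        (len(with_term) / n) * entropy(with_term)
        + (len(without_term) / n) * entropy(without_term)
    )
    return entropy(labels) - conditional


# Toy, hypothetical data: each document is a set of tokens.
docs = [
    {"stock", "rise"},
    {"stock", "fall"},
    {"football", "match"},
    {"weather", "rise"},
]
labels = [1, 1, 0, 0]  # 1 = relevant, 0 = noise

# Rank the vocabulary by information gain, highest first.
vocab = sorted(set().union(*docs))
ranked = sorted(
    vocab,
    key=lambda t: information_gain(docs, labels, t),
    reverse=True,
)
```

On this toy data, "stock" separates the two classes perfectly and therefore ranks first, while "rise" appears equally in both classes and carries no information.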

Originality/value

The proposed method could be used as a preprocessing step in text mining and in new knowledge discovery from big data.
