PFSA-ID: an annotated Indonesian corpus and baseline model of public figures statements attributions

Purnomo W.P., Yohanes Sigit; Kumar, Yogan Jaya; Zulkarnain, Nur Zareen

doi:10.1108/GKMC-04-2022-0091

Research Article | November 08 2022

Yohanes Sigit Purnomo W.P., Informatics Department, Universitas Atma Jaya Yogyakarta, Yogyakarta, Indonesia and Center for Advanced Computing Technology (C-ACT), Fakulti Teknologi Maklumat Dan Komunikasi, Universiti Teknikal Malaysia Melaka, Melaka, Malaysia

Yohanes Sigit Purnomo W.P. can be contacted at: sigit.purnomo@uajy.ac.id

Yogan Jaya Kumar, Center for Advanced Computing Technology (C-ACT), Fakulti Teknologi Maklumat Dan Komunikasi, Universiti Teknikal Malaysia Melaka, Melaka, Malaysia

Nur Zareen Zulkarnain, Center for Advanced Computing Technology (C-ACT), Fakulti Teknologi Maklumat Dan Komunikasi, Universiti Teknikal Malaysia Melaka, Melaka, Malaysia

Author & Article Information

Publisher: Emerald Publishing

Received: April 20 2022

Revision Received: August 24 2022

Accepted: October 17 2022

Online ISSN: 2514-9350

Print ISSN: 2514-9342

© 2022 Emerald Publishing Limited. Licensed re-use rights only.

Global Knowledge, Memory and Communication (2024) 73 (6-7): 853–870.

https://doi.org/10.1108/GKMC-04-2022-0091

Purpose

Corpora for the quotation extraction and quotation attribution tasks in Indonesian remain limited in quantity and depth. This study aims to develop an Indonesian corpus of public figure statement attributions and a baseline model for attribution extraction, thereby fostering research in information extraction for the Indonesian language.

Design/methodology/approach

The methodology is divided into corpus development and extraction model development. During corpus development, data were collected and annotated. The development of the extraction model entails feature extraction, the definition of the model architecture, parameter selection and configuration, model training and evaluation, as well as model selection.

Findings

The Indonesian corpus of public figure statement attributions achieved a 90.06% agreement level between the annotator and the experts and could serve as a gold-standard corpus. Furthermore, the baseline model correctly predicted most labels and achieved an F-score of 82.026%.
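
The abstract does not state how the agreement level or the F-score was computed. As a rough illustration only, the sketch below assumes token-level labels, simple percentage agreement and a micro-averaged F-score that ignores an "outside" label. The label names (PERSON, STATEMENT, CUE, O) and both function names are hypothetical and are not taken from the paper.

```python
def percent_agreement(labels_a, labels_b):
    """Percentage of positions where two annotations assign the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("annotations must cover the same items")
    if not labels_a:
        raise ValueError("annotations must not be empty")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100.0 * matches / len(labels_a)


def micro_f_score(gold, predicted, ignore_label="O"):
    """Micro-averaged F-score (in percent) over all labels except ignore_label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        if p != ignore_label and p == g:
            tp += 1              # correct non-outside label
        else:
            if p != ignore_label:
                fp += 1          # predicted a label that is wrong
            if g != ignore_label:
                fn += 1          # missed (or mislabeled) a gold label
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 100.0 * 2 * precision * recall / (precision + recall)


# Hypothetical five-token example (labels are illustrative assumptions).
gold = ["PERSON", "O", "STATEMENT", "STATEMENT", "O"]
pred = ["PERSON", "O", "STATEMENT", "O", "CUE"]

agreement = percent_agreement(gold, pred)
f_score = micro_f_score(gold, pred)
print(f"agreement = {agreement:.2f}%, F-score = {f_score:.2f}%")
```

In this toy example three of five labels match, and the model has two true positives, one false positive and one false negative. A span-level or chance-corrected measure (such as Cohen's kappa) would give different numbers, so the figures reported above should be read against the paper's own definitions.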

Originality/value

To the best of the authors’ knowledge, the resulting corpus is the first corpus for attribution of public figures’ statements in the Indonesian language, which makes it a significant step for research on attribution extraction in the language. The resulting corpus and the baseline model can be used as a benchmark for further research. Other researchers could follow the methods presented in this paper to develop a new corpus and baseline model for other languages.
