De-identifying student personally identifying information in discussion forum posts with large language models
