Harmonizing occupational classifications in the UK health sector: challenges and opportunities for SOC 2020 and ISCO-08 integration

Abubakar

,

H. D.

,

Umar

,

M.

, &

Bakale

,

A.

(

2022

).

Sentiment classification: Review of text vectorization methods: Bag of words, tf-idf, word2vec and doc2vec

.

SLU Journal of Science and Technology

,

4

(

1

),

27

–

33

.

Ahuir

,

V.

,

Hurtado

,

L. F.

,

García-Granada

,

F.

, &

Sanchis

,

E.

(

2023

). ELiRF-VRAIN at PoliticES-IberLEF2023: Dealing with long texts in transformer-based systems for user profiling. In

IberLEF@ SEPLN

.

https://doi.org/10.1145/3625224

Cheong

,

C. W.

,

Yin

,

K.

,

Cheung

,

W. K.

,

Fung

,

B. C.

, &

Poon

,

J.

(

2023

).

Adaptive integration of categorical and multi-relational ontologies with EHR data for medical concept embedding

.

ACM Transactions on Intelligent Systems and Technology

,

14

(

6

),

1

–

20

. doi:

.

https://doi.org/10.34768/amcs-2023-0043

Cichosz

,

P.

(

2023

).

Bag of words and embedding text representation methods for medical article classification

.

International Journal of Applied Mathematics and Computer Science

,

33

(

4

),

603

–

621

. doi:

.

Clausen

,

N. F.

(

2015

). The Danish demographic database—principles and methods for cleaning and standardisation of data. In

Population Reconstruction

(pp.

3

–

22

).

Cham

:

Springer International Publishing

.

https://doi.org/10.60087/jaigs.v5i1.149

Dai

,

S.

,

Li

,

K.

,

Luo

,

Z.

,

Zhao

,

P.

,

Hong

,

B.

,

Zhu

,

A.

, &

Liu

,

J.

(

2024

).

AI-based NLP section discusses the application and effect of bag-of-words models and TF-IDF in NLP tasks

.

Journal of Artificial Intelligence General science (JAIGS)

,

5

(

1

),

13

–

21

. doi:

.

https://www.health-ni.gov.uk/articles/staff-numbers [

Department of Health

(

2024

).

'Northern Ireland health and social care (HSC) workforce statistics June 2023 to December 2023

.

Department of Health

.

Available from:

accessed

24 June 2024].

Devlin

,

J.

,

Chang

,

M. W.

,

Lee

,

K.

, &

Toutanova

,

K.

(

2019

).

Bert: Pre-training of deep bidirectional transformers for language understanding

. In

Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies

(Vol.

1

, pp.

4171

–

4186

).

https://doi.org/10.1007/978-981-96-1264-2_33

Dhawaleswar Rao

,

C. H.

, &

Pani

,

P.

(

2025

).

Advanced semantic text similarity analysis using sentence transformers

. In

International Conference on Intelligent Computing and Communication

(Vol.

1240

, pp.

371

–

379

). doi:

.

https://doi.org/10.1111/1440-1630.12452

Eagers

,

J.

,

Franklin

,

R. C.

,

Yau

,

M. K.

, &

Broome

,

K.

(

2018

).

Pre-retirement job and the work-to-retirement occupational transition process in Australia: A review

.

Australian Occupational Therapy Journal

,

65

(

4

),

314

–

328

. doi:

.

https://doi.org/10.1038/s41467-021-22328-4

Fries

,

J. A.

,

Steinberg

,

E.

,

Khattar

,

S.

,

Fleming

,

S. L.

,

Posada

,

J.

,

Callahan

,

A.

, &

Shah

,

N. H.

(

2021

).

Ontology-driven weak supervision for clinical entity classification in electronic health records

.

Nature Communications

,

12

(

1

), 2017. doi:

.

Ge

,

C.

,

Friesen

,

M. C.

,

Locke

,

S. J.

,

Russ

,

D. E.

,

Burstyn

,

I.

,

Baker

,

C. J.

, &

Huss

,

A.

(

2023

).

Automated coding of job descriptions from a general population study: Overview of existing tools, their application and comparison

.

Annals of Work Exposures and Health

,

67

(

5

),

663

–

672

.

https://doi.org/10.1017/S0269888904000074

Giunchiglia

,

F.

, &

Shvaiko

,

P.

(

2003

).

Semantic matching

.

The Knowledge Engineering Review

,

18

(

3

),

265

–

280

. doi:

.

https://doi.org/10.1016/j.nlp.2025.100154

Graff

,

M.

,

Moctezuma

,

D.

, &

Téllez

,

E. S.

(

2025

).

'Bag-of-Word approach is not dead: A performance analysis on a myriad of text classification challenges

.

Natural Language Processing Journal

,

11

, 100154. doi:

.

https://doi.org/10.1016/j.array.2025.100467

Guleria

,

P.

,

Frnda

,

J.

, &

Naga Srinivasu

,

P.

(

2025

).

NLP based text classification using TF-IDF enabled fine-tuned long short-term memory: An empirical analysis

.

Array

,

27

, 100467. doi:

.

https://ilostat.ilo.org/methods/concepts-and-definitions/classification-occupation [

ILOSTAT-1

(

2024

).

International standard classification of occupations (ISCO)

.

International Labour Organization

.

Available from:

accessed

17 June 2024].

Jalilifard

,

A.

,

Caridá

,

V. F.

,

Mansano

,

A. F.

,

Cristo

,

R. S.

, &

da Fonseca

,

F. P. C.

(

2021

). Semantic sensitive TF-IDF to determine word relevance in documents. In

Thampi

,

S. M.

,

Gelenbe

,

E.

,

Atiquzzaman

,

M.

,

Chaudhary

,

V.

, &

Li

,

K. C.

(Eds),

Advances in Computing and Network Communications. Lecture Notes in Electrical Engineering

.

Singapore

:

Springer

. doi:

https://doi.org/10.1007/978-981-33-6987-0_27

.

Kouretsis

,

A.

,

Bampouris

,

A.

,

Morfiris

,

P.

, &

Papageorgiou

,

K.

(

2020

).

labourR: classify multilingual labour market free-text to standardized hierarchical occupations

.

R package

.

https://doi.org/10.2105/AJPH.2023.307463

Krasna

,

H.

,

Venkataraman

,

M.

,

Robins

,

M.

,

Patino

,

I.

, &

Leider

,

J. P.

(

2024

).

Standard occupational classification codes: Gaps in federal data on the public health workforce

.

American Journal of Public Health

,

114

(

1

),

48

–

56

. doi:

.

https://doi.org/10.3390/app11135868

Kuodytė

,

V.

, &

Petkevičius

,

L.

(

2021

).

Education-to-skill mapping using hierarchical classification and transformer neural network

.

Applied Sciences

,

11

(

13

),

5868

. doi:

.

https://doi.org/10.1007/s11859-007-0038-4

Li

,

J.

,

Qi ’na

,

F.

and

Kuo

,

Z.

(

2007, In this issue

).

Keyword extraction based on tf/idf for Chinese news document

.

Wuhan University Journal of Natural Sciences

,

12

(

5

),

917

–

921

. doi:

.

https://doi.org/10.38094/jastt61237

Lwin

,

N.

,

Jain

,

R.

,

Dal

,

R.

,

Yan

,

H.

,

Thaw

,

K.

, &

Naung

,

S.

(

2025

).

Text classification for Clickbait detection: A model-driven approach using CountVectorizer and ML classifiers

.

Journal of Applied Science and Technology Trends

,

06

(

01

),

43

–

49

. doi:

.

https://doi.org/10.1093/ije/dyg080

Mannetje

,

A. T.

, &

Kromhout

,

H.

(

2003

).

The use of occupation and industry classifications in general population studies

.

International Journal of Epidemiology

,

32

(

3

),

419

–

428

. doi:

.

https://digital.nhs.uk/data-and-information/publications/statistical/nhs-workforce-statistics [

NHS England

(

2024

).

NHS Workforce Statistics - December 2023

,

Available from:

accessed

22 June 2024].

ONS

(

2024

).

The current standard occupational classification for the UK

,

published in three volumes, Available from:

https://www.ons.gov.uk/methodology/classificationsandstandards/standardoccupationalclassificationsoc/soc2020 [

accessed

19 June 2024].

Prasad

,

A.

,

Mohammad Alenazy

,

W.

,

Ahmad

,

N.

,

Ali

,

G.

,

Abdallah

,

H.

, &

Ahmad

,

S.

(

2025

).

Optimizing IoT intrusion detection with cosine similarity based dataset balancing and hybrid deep learning

,

15

(

1

), 30939. doi:

https://doi.org/10.1038/s41598-025-15631-3

.

https://doi.org/10.1109/ICROIT.2014.6798296

Python Software Foundation

(

2023

).

Python 3.11.4

.

Anaconda

.

Available from:

https://www.python.org/ [

accessed

14 October 2025].

Rana

,

V.

, &

Singh

,

G.

(

2014

).

An analysis of semantic heterogeneity issues and their countermeasures prevailing in semantic web

. In

International Conference on Reliability Optimization and Information Technology (ICROIT)

,

Faridabad

(pp.

80

–

85

).

IEEE

. doi:

.

https://arxiv.org/abs/1908.10084

Reimers

,

N.

, &

Gurevych

,

I.

(

2019

).

Sentence-BERT: Sentence embeddings using Siamese BERT-networks

. In

Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing

.

Available from:

https://doi.org/10.1136/oemed-2015-103152

Russ

,

D. E.

,

Ho

,

K.-Y.

,

Colt

,

J. S.

,

Armenti

,

K. R.

,

Baris

,

D.

,

Chow

,

W. H.

, … &

Friesen

,

M. C.

(

2016

).

Computer-based coding of free-text job descriptions to efficiently identify occupations in epidemiological studies

.

Occupational and Environmental Medicine

,

73

(

6

),

417

–

424

. doi:

.

https://doi.org/10.1093/annweh/wxad020

Russ

,

D.E.

,

Josse

,

P.

,

Remen

,

T.

,

Hofmann

,

J.N.

,

Purdue

,

M.P.

,

Siemiatycki

,

J.

,

Silverman

,

D.T.

,

Zhang

,

Y.

,

Lavoué

,

J.

, &

Friesen

,

M.C.

(

2023

).

Evaluation of the updated SOCcer v2 algorithm for coding free-text job descriptions in three epidemiologic studies

.

Annals of Work Exposures and Health

,

67

(

6

),

772

–

783

. doi:

.

https://doi.org/10.1007/s44163-023-00050-y

Safikhani

,

P.

,

Avetisyan

,

H.

,

Föste-Eggers

,

D.

, &

Broneske

,

D.

(

2023

).

Automated occupation coding with hierarchical features: A data-centric approach to classification with pre-trained language models

.

Discover Artificial Intelligence

,

3

(

1

),

6

. doi:

.

https://doi.org/10.3390/brainsci12020270

Santander-Cruz

,

Y.

,

Salazar-Colores

,

S.

,

Paredes-García

,

W. J.

,

Guendulain-Arenas

,

H.

, &

Tovar-Arriaga

,

S.

(

2022

).

Semantic feature extraction using SBERT for Dementia detection

.

Brain Sciences

,

12

(

2

),

270

. doi:

.

https://doi.org/10.1007/978-1-4939-1887-4_28

Schmidtke

,

H. R.

(

2014

). Context and granularity. In

Context in Computing: A Cross-Disciplinary Approach for Modeling the Real World

(pp.

455

–

470

).

New York

:

Springer

. doi:

.

https://doi.org/10.1016/j.cmpb.2021.106146

Shahzad

,

S. K.

,

Ahmed

,

D.

,

Naqvi

,

M. R.

,

Mushtaq

,

M. T.

,

Iqbal

,

M. W.

, &

Munir

,

F.

(

2021

).

Ontology driven smart health service integration

.

Computer Methods and Programs in Biomedicine

,

207

, 106146. doi:

.

https://statswales.gov.wales/Catalogue/Health-and-Social-Care/NHS-Staff [

StatsWales

(

2024

).

Important update for StatsWales OData users, NHS staff

,

Available from:

accessed

24 June 2024].

Suarez Garcia

,

C. A.

,

Adisesh

,

A.

, &

Baker

,

C. J. O.

(

2021

) (

In this issue

).

S-464 Automated occupational encoding to the Canadian national occupation classification using an ensemble classifier from TF-IDF and Doc2Vec embeddings

.

Occupational and Environmental Medicine

,

78

(Supp.

1

),

A161

–

A161

. doi:

https://doi.org/10.1136/OEM-2021-EPI.442

.

https://doi.org/10.4114/intartif.vol24iss74

Susan

,

S.

,

Sharma

,

M.

, &

Choudhary

,

G.

(

2024

).

Uniqueness meets semantics: A novel semantically meaningful bag-of-words approach for matching resumes to job profiles

.

Journal Iberamia

,

24

(

74

),

117

–

132

. doi:

.

https://doi.org/10.13094/SMIF-2018-00007

Tijdens

,

K. G.

, &

Kaandorp

,

C. S.

(

2018

). Validating occupational coding indexes for use in multi-country surveys. In

Survey Methods: Insights from the Field

(pp.

1

–

12

). doi:

.

https://turasdata.nes.nhs.scot/data-and-reports/official-workforce-statistics/all-official-statistics-publications/04-june-2024-workforce/dashboards/nhs-scotland-workforce-phase-one/?pageid=11804 [

Turas Data Intelligence

(

2024

).

NHS Scotland workforce (phase one)

,

Available from:

accessed

21 June 2024].

UNECE

(

2019

).

Generic statistical Business Process Model. Version 5.1 January 2019

,

Available from:

https://unece.org/statistics/documents/2019/01/standards/gsbpm-v51 [

accessed

28 June 2024].

Warwick

(

2018

).

Cascot: Computer assisted structured coding tool

,

Available from:

https://warwick.ac.uk/fac/soc/ier/data_group/cascot [

accessed

20 June 2024].

Yilmaz

,

Y.

,

Jurado Nunez

,

A.

,

Ariaeinejad

,

A.

,

Lee

,

M.

,

Sherbino

,

J.

, &

Chan

,

T. M.

(

2022

).

Harnessing natural language processing to support decisions around workplace-based assessment: Machine learning study of competency-based medical education

.

JMIR Med Educ

,

8

(

2

), e30537. doi:

https://doi.org/10.2196/30537

.

Available from:

https://mededu.jmir.org/2022/2/e30537

https://doi.org/10.1080/22423982.2018.1492825

Young

,

T. K.

,

Fedkina

,

N.

,

Chatwood

,

S.

, &

Bjerregaard

,

P.

(

2018

).

Comparing healthcare workforce in circumpolar regions: Patterns, trends and challenges

.

International Journal of Circumpolar Health

,

77

(

1

), 1492825. doi:

.

https://doi.org/10.3233/jifs-161553

Zhang

,

L. Y.

,

Ren

,

J. D.

, &

Li

,

X. W.

(

2017

).

OIM-SM: A method for ontology integration based on semantic mapping

.

Journal of Intelligent and Fuzzy Systems

,

32

(

3

),

1983

–

1995

. doi:

.