Table 2. NLP techniques mentioned in the reviewed literature

| Reference | Technique | Brief description |
| --- | --- | --- |
| Text classification | | |
| Jafari et al. (2021) | Naïve Bayes (NB) | NB is a probabilistic algorithm that applies Bayes' theorem, under the naive assumption that features are independent, to calculate the most likely classification. According to the literature sample, it is one of the most widely used algorithms for classifying text documents |
| Alaka et al. (2019), Baker et al. (2020), Juszczyk (2018b) | Support vector machine (SVM) | An SVM is a binary linear classifier: it assigns each example to one of two categories. It maps the training points into a space and separates the two categories by a gap (margin) that is as wide as possible. Test points are then mapped into the same space and classified according to which side of the gap they fall on |
| Tajziyehchi et al. (2020), Ul Hassan et al. (2020) | K-nearest neighbours (KNN) | KNN is based on the premise that similar data points lie close to each other. It quantifies similarity as the distance between points, often the Euclidean distance, although many other distance measures can be used |
| Bloch and Sacks (2020) | K-means clustering (KMC) | K-means clustering is an unsupervised learning algorithm. Although its name also contains the letter k, it is a different method from KNN. It uses an iterative process in which k, the number of clusters to find in the data, is defined a priori. Each data point is assigned to the closest of the k centroids. After all points are assigned, the positions of the k centroids are recalculated. This process is repeated until the centroids no longer change position |
| Jallan et al. (2019), Hong et al. (2021) | Latent Dirichlet allocation (LDA) | LDA is a widely used algorithm for topic modelling, a subproblem of NLP. It can be pictured geometrically. A simplex is created (a Dirichlet distribution) in which each vertex is a topic and each point inside the simplex is a document. The number of topics is defined beforehand and determines the number of dimensions of the simplex. A second Dirichlet distribution is formed in which the vertices are the terms within the documents and the points inside are the topics. These distributions are associated with multinomial distributions: from the first we obtain topics and from the second, combinations of terms. Combining the two generates new documents that try to replicate the input ones. N documents are generated, corresponding to the N input documents (corpus). Comparing the generated corpus with the original one indicates how well the model fits |
| Hong et al. (2021) | Latent semantic analysis (LSA) | LSA is an unsupervised algorithm for finding hidden topics within documents or text. The hidden topics are then used to group similar documents (“clustering”). Strictly, LSA returns concepts rather than topics; concepts are combinations of words that describe the document. LSA works by decomposing the document-term matrix with singular value decomposition (SVD), which reduces dimensionality and, with it, computational cost. SVD factors the document-term matrix into three matrices: a column-orthogonal matrix, a row-orthogonal matrix and a diagonal matrix of singular values. The product of these matrices reproduces the document-term matrix |
| Tajziyehchi et al. (2020), Ul Hassan et al. (2020), Yaqubi and Salhotra (2019) | Random forest (RF) | As its name implies, RF consists of a large group of individual decision trees that operate as an ensemble. RF can be used for classification and regression tasks. Every tree in the RF yields a class prediction, and the most frequent class becomes the model’s prediction. Its advantage is that bagging and feature randomness keep the individual trees largely uncorrelated |
| Pessoa et al. (2021), Tajziyehchi et al. (2020), Yaqubi and Salhotra (2019) | Gradient boosting regression (GB) | GB is an ML algorithm for structured data sets. It is an ensemble method that combines multiple weak models to achieve better performance than any of them alone. It is capable of finding nonlinear relationships between the model target and the features. Similar to RF, it is widely usable because it can deal with missing values and outliers |
| Text processing | | |
| Baker et al. (2020) | Bag of words (BOW) | BOW simplifies the representation of text in NLP applications. The BOW method takes all unique words from a corpus of text and stores the frequency of occurrence of each. These frequencies represent the text or documents and can help algorithms select features in the training phase to enable later text classification |
| Kessler et al. (2019) | N-gram analysis | N-grams are sequences of n adjacent words or letters. For words, a 1-gram is a single word, a 2-gram is a phrase made of two words and so forth. The most advantageous n-gram length depends on the application |
| Moon et al. (2021b) | Named entity recognition (NER) | NER is a subtask of information extraction that aims to identify and classify rigid designators (named entities), such as organisations, people and places, in data (Goyal et al., 2018) |
| Kim et al. (2020), Guo et al. (2021) | Part-of-speech tagging (POS) | POS tagging marks each word in a sentence with its part of speech. Parts of speech include nouns, verbs, articles, adjectives, prepositions, pronouns and many other categories. POS tags indicate the lexical and functional categories of words |
| Text vectorization | | |
| Hong et al. (2021), Jeon et al. (2021a) | Word2Vec | Word2Vec algorithms use neural network models to learn associations between words from a large text corpus. Once trained, these models can detect synonyms or suggest similar words. Word2Vec represents each word as a vector. These vectors are a compact way of representing words in NLP applications, and comparing them with functions such as cosine similarity indicates how closely two words resemble each other |
| Moon et al. (2021a), Moon et al. (2021b) | Doc2Vec | Doc2Vec extends Word2Vec to represent whole documents in vector form, as the name implies. It uses the same word-vector representation as Word2Vec and adds a new vector specific to each document (paragraph vector). The document vector is trained alongside the word vectors and holds the numerical representation of a document. This representation is helpful in NLP applications because it supports training for later classification of topics in documents |
| Jeon et al. (2021a) | GloVe | GloVe is an unsupervised learning algorithm for obtaining vector representations of words. Words are mapped into a space in which their distance is related to their semantic similarity. It is an open-source project released by Stanford University. It is based on a log-bilinear regression model and has been applied to word analogy, word similarity and NER tasks (Pennington et al., 2014) |
Source: Created by authors
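
To make a few of the table entries concrete, the following minimal Python sketch chains together n-gram extraction, a bag-of-words representation, cosine similarity and a KNN classifier. It is an illustration only and is not taken from any of the reviewed studies: the function names, the whitespace tokenizer, the use of cosine similarity as the KNN distance measure and the toy corpus are all assumptions made for this example.

```python
from collections import Counter
from math import sqrt


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def ngrams(tokens, n):
    """Return every run of n adjacent tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_words(text):
    """Map each unique token to its frequency in the text."""
    return Counter(tokenize(text))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(query, labelled_docs, k=3):
    """Label a query by majority vote among its k most similar documents."""
    q = bag_of_words(query)
    ranked = sorted(
        labelled_docs,
        key=lambda pair: cosine_similarity(q, bag_of_words(pair[0])),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy corpus, invented for illustration only.
TRAINING = [
    ("crane collapsed on site during lifting", "safety"),
    ("worker injured by falling scaffold", "safety"),
    ("scaffold inspection found missing guardrail", "safety"),
    ("contractor claims payment delay from owner", "contract"),
    ("owner disputes contract payment terms", "contract"),
    ("delay penalty clause in contract", "contract"),
]

if __name__ == "__main__":
    print(ngrams(tokenize("payment delay from owner"), 2))
    print(bag_of_words("contract payment and contract delay"))
    print(knn_classify("scaffold collapsed and worker injured", TRAINING))
```

A practical pipeline would normally replace the raw counts with weighted or learned vectors (for example Word2Vec or GloVe embeddings) and use a tested library implementation of the classifier; the sketch only shows how the pieces described in the table fit together.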
