
Traditionally, both applying a hard prompt to sentences through handcrafted prompt templates and directly optimizing a soft or continuous prompt may fail to generalize sufficiently to unseen-domain data. This paper presents a parameter efficient learning method for a domain-agnostic soft prompt, which is developed for few-shot unsupervised domain adaptation. A pre-trained language model (PLM) is frozen and utilized to extract knowledge for unseen domains in various language understanding tasks. Meta learning and optimization over a set of trainable soft tokens are performed by minimizing the cross-entropy loss of the masked language model on support and query data in the source and target domains, respectively, where the masked tokens for the text category and the randomly masked tokens are predicted. The meta soft prompt is learned through a doubly-looped optimization over individual learners and a meta learner when implementing the unsupervised domain adaptation. The PLM is then closely adapted to compensate for the domain shift in a target domain. The domain adaptation loss and the prompt-based classification loss are jointly minimized through meta learning. The experiments on multi-domain natural language understanding illustrate the merit of the proposed meta soft prompt in pre-trained language modeling under the few-shot setting.

Large-scale language models have been trained by supervised and self-supervised schemes and have achieved cutting-edge performance in a wide range of speech and language tasks [5, 9]. The bidirectional encoder representations from transformers (BERT) [41, 15] is a successful pre-trained language model (PLM) which was trained on a large corpus by using the objectives of masked language modeling and next sentence prediction. However, in real applications, the feature representation of a PLM is strictly bounded for an unseen downstream task. Previous PLMs did not cope with the case where the domain of the input sentences is missing from, and far apart from, the domains of the training articles. The generalization issue is serious, and the resulting performance drops drastically. Such a phenomenon, also called domain shift [16, 42], is quite common in practice. The larger the domain shift, the more the system performance is restricted. In addition, collecting a large amount of labeled data over a variety of domains is time-consuming and sometimes intractable. Blindly predicting on an unknown domain is challenging. It is practical to conduct supervised or even unsupervised domain adaptation [40] in the few-shot setting. However, fine-tuning a large-scale PLM to a new domain by few-shot learning is prone to overfitting. To cope with this issue, parameter efficient learning (PEL) was implemented by incorporating adapters in transformer layers [22] or augmenting the inputs with prompt templates. PEL based on model reprogramming [43] was also developed. With the emergence of the powerful generative pre-trained transformer (GPT) [2], and particularly ChatGPT, it is crucial to utilize a PLM for various downstream tasks via prompt-based learning or tuning where a large-scale PLM can be frozen. PEL is basically performed by estimating a very limited set of parameters in model construction. Model capacity can be boosted by using various PEL methods.

Conventionally, the learning objective in previous studies [42] was designed to adapt a PLM to a target domain which had to be provided beforehand. In addition, the performance was likely degraded because pre-training and fine-tuning followed different learning objectives. The fine-tuned model was also sensitive to the variations in domain knowledge [33]. To deal with these problems, this study presents a new prompt-based learning method for unsupervised domain adaptation [29] where the parameter efficient adaptation of a PLM across multiple domains is developed for natural language understanding (NLU). The domain-agnostic soft prompt is proposed to generalize the adaptive prompt, based on a PLM, over different unseen domains. Fast adaptation in multi-domain language modeling is activated in the low-resource few-shot setting. Few-shot or even zero-shot domain adaptation is performed through prompt-based learning where only a few unlabeled samples, or even no samples at all, from an unseen target domain are enrolled. In particular, a meta learning approach to the proposed domain-agnostic soft prompt is addressed. Model generalization for unseen domains is tackled accordingly.

In particular, this paper integrates two machine learning paradigms. The first paradigm is soft prompting, where a number of continuous random vectors or soft prompt tokens are learned and augmented with the input text tokens to enrich the capacity of the input representation. Different from the discrete vectors of word tokens used as the hard prompt, which are found through a trial-and-error search procedure [38, 39], the proposed continuous vectors are automatically estimated from adaptation data as the soft prompt by an iterative gradient descent algorithm [27, 37]. Prompt optimization is performed to calculate the data-driven soft prompt, which appears more realistic and attractive than the hard prompt since both domain-specific expertise and the human-engineering process can be avoided. The second paradigm is model-agnostic meta learning (MAML) [17, 45], which is employed to estimate a general soft prompt from support data and then adapt the estimated soft prompt to various unseen target domains from query data [11, 10]. The adaptability of the soft prompt to an unseen domain is thereby enabled. The meta soft prompt is proposed by implementing the gradient-based MAML where the meta learner is estimated by using the gradients accumulated from individual task-specific learners over different downstream tasks. The individual meta-training tasks are merged in an unsupervised domain representation which is handled by soft prompt learning from the source-domain labeled data as support data as well as the target-domain unlabeled data as query data. Meta learning is run across a variety of meta-training tasks where the soft prompt is learned from a source domain and then generalized to an incoming target domain through an auxiliary task in the presence of few-shot unlabeled samples.
A new type of meta learning is developed for multi-domain NLU based on the domain-agnostic soft prompt, which is learned through a doubly-looped learning algorithm consisting of an inner loop for individual learners and an outer loop for a meta learner. Owing to the integration of the individual learners and the meta learner, the issue of sub-optimal performance in model construction and domain adaptation, which is caused by the inconsistent objectives in the pre-training stage and the fine-tuning stage [33, 44], is tackled. In the experiments, a series of analyses and evaluations are conducted to illustrate the performance of the proposed prompt-based language model for multi-domain sentiment classification.

The remainder of this paper is organized as follows. First, the fundamentals of prompt-based learning for unsupervised domain adaptation are introduced in Section 2. The basics of meta learning are surveyed, and the estimation of the soft prompt for multi-domain natural language understanding (NLU) is described. In Section 3, a new solution to learn the domain-agnostic soft prompt is proposed by utilizing a pre-trained language model which is frozen during model construction. The unsupervised domain adaptation based on meta learning for the soft prompt is developed, and the implementation algorithm is detailed. In Section 4, a set of experiments is conducted to analyze and evaluate the proposed meta soft prompt in multi-domain NLU. The ablation studies on the proposed prompt are evaluated through latent visualization and classification accuracy. Finally, the conclusions drawn from this study are given in Section 5.

This study aims to enrich the learning representation and facilitate the domain knowledge in text representation across multiple domains for natural language understanding. The fundamentals of prompt engineering and domain adaptation are first introduced.

Prompt-based language models have shown dramatic performance in the era of generative artificial intelligence since GPT-3 [2], and particularly ChatGPT [36], achieved state-of-the-art results in a variety of NLU tasks. Such a text representation or embedding basically treats the pre-trained language model (PLM) as a frozen backbone model where the input sentence is augmented by merging it with an adjustable prompt. The prompt-based language model has been recognized as a parameter efficient learning approach in the low-resource setting: it utilizes a PLM and only estimates a small masked language model (MLM) head and a small set of continuous prompt tokens, where a small amount of domain-specific data is enrolled for domain adaptation. The MLM head is added on top of the PLM and estimated by an unsupervised scheme through prediction of the randomly masked words. Basically, the performance of the PLM can be leveraged for a specific downstream task.

Figure 1 compares fine-tuning and prompt-based learning with MLM heads, which are illustrated for sentiment classification. The PLM is seen as an encoder to extract general features and the MLM head serves as the switch to a downstream task. The PLM parameters and the MLM head, which are frozen or adjustable, are depicted in blue or yellow, respectively. Traditionally, fine-tuning a large set of PLM parameters and estimating the task-specific head are inefficient. There was no masking scheme. The parameters are easily overestimated, especially in the low-resource setting where only a limited set of adaptation data is available. Prompt-based learning is seen as a parameter efficient solution which was designed to improve the fine-tuning of PLMs once the generative pre-trained transformer (GPT) had been publicly released. In general, the strategy of prompt-based learning allows training a model with a cloze-style input sentence which adds a textual prompt such as “It was —” to the original sentence, where the unfilled or masked slot ‘—’ is either ‘good’ or ‘bad’. Accordingly, the MLM head in the prompt-based language model is learned to predict the masked token in the augmented sentence. In addition to this unsupervised scheme, the MLM head can be further learned in a supervised manner to predict the target label of an input text. Such unsupervised and supervised schemes were also employed to strengthen the small-sized PLMs in Schick and Schütze [38, 39] and Gao et al. [18], where the whole set of model parameters was fine-tuned and the masked tokens were predicted.
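The cloze-style augmentation described above can be sketched in a few lines. This is only an illustration: the helper name `make_cloze`, the trailing period in the template, and the BERT-style `[MASK]` placeholder standing in for the unfilled slot are assumptions, while the label words ‘good’ and ‘bad’ follow the example in the text.

```python
# Hypothetical helper illustrating a hard (hand-crafted) cloze prompt.
MASK = "[MASK]"  # assumed BERT-style placeholder for the unfilled slot

# Label words that the MLM head is expected to rank at the masked slot.
LABEL_WORDS = {"positive": "good", "negative": "bad"}


def make_cloze(sentence: str, template: str = "It was {mask}.") -> str:
    """Append a cloze-style prompt with one masked slot to the sentence."""
    return f"{sentence.strip()} {template.format(mask=MASK)}"


print(make_cloze("The battery lasts all day."))
```

The MLM head then only has to score the label words at the masked position, so the classification task is phrased in the same form as the pre-training task.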

There are two types of prompting, namely the hard prompt and the soft prompt, in the implementation of a prompt sentence. The hard prompt is formed in natural language and consists of ‘discrete’ tokens from the vocabulary. However, selecting a suitable hard prompt for a specific domain or task requires domain-specific expertise with much trial and error in a prompt engineering procedure. When the hard prompt template is selected and fixed, both the PLM parameters and the MLM head are fine-tuned for a downstream task in a new target domain. More recently, the soft prompt was constructed in the continuous space with adaptability. Since the real-valued soft prompt {v1 v2 —} or {v1 v2 v3} is differentiable and can be easily updated by the gradient descent algorithm, the optimized prompt with the masked token ‘—’ can be obtained in a handcraft-free fashion. The learning objective is formulated for optimization in accordance with an end performance criterion as well as a prediction of the masked token. A recent approach was proposed to implement the soft prompt, which was represented in the continuous space and optimized in accordance with a real-valued objective function while both the PLM and the MLM head were frozen. Only the soft prompt tokens are estimated. Relative to fine-tuning and the hard prompt, the number of tunable parameters in the soft prompt is significantly reduced and the issue of overestimation can be mitigated. Nevertheless, neither fine-tuning nor prompting the PLM for a specific downstream task can sufficiently represent the variations across different downstream tasks. A rapid and unsupervised domain adaptation is required in multi-domain language modeling for natural language understanding. In what follows, this paper surveys the related works and basics of unsupervised domain adaptation and optimization-based meta learning.
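To make the contrast with the hard prompt concrete, the following minimal sketch treats the soft prompt as a small list of real-valued vectors that is joined with frozen word embeddings and updated by plain gradient descent. The function names, the Gaussian initialization, and the token layout (sentence, then soft tokens, then the masked slot) are assumptions made for illustration, not details taken from the paper.

```python
import random


def init_soft_prompt(k, dim, seed=0):
    """Draw k trainable soft tokens of size dim (assumed small Gaussian init)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(k)]


def build_prompt_input(token_embeddings, soft_prompt, mask_embedding):
    """Assumed layout: sentence tokens, then soft tokens, then the masked slot."""
    return list(token_embeddings) + list(soft_prompt) + [mask_embedding]


def sgd_step(soft_prompt, grads, lr):
    """One gradient descent step; only the soft tokens change."""
    return [[w - lr * g for w, g in zip(vec, gvec)]
            for vec, gvec in zip(soft_prompt, grads)]


dim = 4
tokens = [[0.1] * dim for _ in range(3)]  # stand-in for frozen word embeddings
mask = [0.0] * dim                        # stand-in for the masked-slot embedding
prompt = init_soft_prompt(k=2, dim=dim)
sequence = build_prompt_input(tokens, prompt, mask)
print(len(sequence))  # 3 words + 2 soft tokens + 1 masked slot
```

Because the backbone and the head stay frozen, the number of trainable values is just `k * dim`, which is the source of the parameter efficiency discussed above.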

In general, pre-trained language models have recently demonstrated convincing results for natural language understanding through a fine-tuning or prompting stage. However, fine-tuning, the hard prompt and the soft prompt using a PLM only work for a specific test domain, and cannot easily generalize to the diverse input sentences which may come from a variety of domains in practical circumstances. In the implementation, additional task-specific layers are configured as the MLM head for domain adaptation to a downstream task. The learned model is likely to face diverse inputs from unseen domains. Unsupervised domain adaptation (UDA) [8, 28] has become crucial for multi-domain language modeling in recent years [12, 6]. Basically, the goal of UDA is to compensate for the domain bias or shift caused by the differing distributions of source and target documents, where the class labels of the enrollment data in a new target domain are unavailable. A simple and meaningful approach to UDA is to fine-tune the PLM or provide a prompt to the input text so that the pre-training process is continued and extended by adopting domain-specific documents. BioBERT [26] was developed to fulfill this strategy based on fine-tuning. In addition, domain-specific pre-training [19] and task-specific pre-training [20] were jointly performed in a way that the domain-specific pre-training was conducted by further using the task-specific unlabeled documents to adjust the text representation to be closer to the task distribution. In Karouzos et al. [25] and Chen and Chien [3], UDA was implemented under the framework of language modeling where the losses due to the masked language model and the downstream text classification were jointly minimized to enhance the model robustness and improve the sample efficiency in unsupervised learning and adaptation.
However, in the aforementioned works, there existed an inconsistency between the learning objectives in pre-training stage and fine-tuning stage for downstream task, which considerably constrained the utilization of domain knowledge using PLM [33]. This study introduces a new approach to UDA where the soft prompt is adapted across various domains through meta learning where a consistent hybrid learning objective is implemented for both domain adaptation and prompt learning.

This study presents an approach to multi-domain adaptation for the prompt-based language model and text representation where the optimization-based meta learning [23] via model-agnostic meta learning (MAML) [17] is implemented. Basically, MAML conducts supervised learning to train individual learners and a meta learner so that rapid adaptation is realizable under the few-shot setting. MAML is applicable to any model architecture and learning objective that can be trained by the gradient descent method. The underlying concept of MAML is to learn a good initialization of the model parameters so that the model can perform well on, or adapt rapidly to, new tasks or domains from a small amount of enrollment data with a few steps of gradient updating. Considering a feature representation model with parameters θ, learning to generalize to new tasks can be seen as constructing an internal representation which is broadly suitable for a number of tasks sampled from the distribution p(𝒯) of different tasks 𝒯 = {𝒯i}. From the perspective of dynamic systems, meta learning can be implemented by maximizing the sensitivity of a learning objective to new tasks with respect to the model parameters θ. High sensitivity in task adaptation corresponds to the condition that the performance improvement due to new tasks is significant while only a few gradient updates are required for system optimization. Basically, the gradient-based method to estimate a model is performed by minimizing a loss function ℒ_{𝒯i} of a single task 𝒯i through the gradient descent updating

\[ \theta_i' = \theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(\theta) \tag{1} \]

where α is the learning rate. To activate the process for meta learning, the parameter of a meta learner is optimized to find the updated parameter θ’ where a number of tasks sampled from p(𝒯) are learned to achieve the minimum of total loss across different tasks

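\[ \min_\theta \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}(\theta_i') = \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}\big(\theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(\theta)\big) \tag{2} \]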

In the implementation, the meta optimization across different tasks is performed such that the meta parameter θ is updated by

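\[ \theta \leftarrow \theta - \beta \nabla_\theta \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}(\theta_i') \tag{3} \]

where β denotes the step size of the meta update and θi’ is the task-adapted parameter of task 𝒯i.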

The training procedure for optimization-based meta learning is shown in Figure 2 where a meta learner θmeta is optimized by using the gradients of individual tasks in the training stage and then the rapid adaptation of θmeta to a specific task is realizable in the test stage.
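The inner and outer updates of MAML can be traced on a toy problem. The sketch below is an illustration under stated assumptions rather than the paper's implementation: each task has a scalar parameter and a quadratic loss (θ − c)², the same loss is reused for the inner and the outer step, and the names `centers`, `alpha` and `beta` are hypothetical.

```python
def loss_grad(theta, center):
    """Gradient of the toy task loss (theta - center)**2."""
    return 2.0 * (theta - center)


def inner_update(theta, center, alpha):
    """Task-specific learner: one gradient step from the meta parameter."""
    return theta - alpha * loss_grad(theta, center)


def meta_gradient(theta, centers, alpha):
    """Sum over tasks of d loss(theta_i') / d theta, using the chain rule."""
    total = 0.0
    for c in centers:
        theta_i = inner_update(theta, c, alpha)
        # For this quadratic loss, d theta_i' / d theta = 1 - 2 * alpha.
        total += loss_grad(theta_i, c) * (1.0 - 2.0 * alpha)
    return total


def maml(centers, theta=0.0, alpha=0.1, beta=0.05, steps=100):
    """Outer loop: move the meta parameter along the accumulated gradient."""
    for _ in range(steps):
        theta -= beta * meta_gradient(theta, centers, alpha)
    return theta


theta_meta = maml(centers=[-1.0, 1.0, 3.0])
print(round(theta_meta, 4))
```

In this toy setting the meta parameter settles at the mean of the task optima, which is the initialization from which a single inner step helps every task equally.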

This paper presents a hybrid solution to the soft prompt language model and unsupervised domain adaptation under low-resource settings, as shown in Figure 3. The pre-trained language model (PLM) is utilized and frozen, and the labels of the target-domain data ƊT are ignored. There are two implementations of unsupervised domain adaptation (UDA). In the first implementation, zero-shot domain adaptation is performed by using the learned prompt, which is augmented with the test sample from ƊT. Only the test sentence in the target domain is used, and no extra adaptation data are required. In the second implementation, two stages are planned in a joint scenario of prompt estimation and low-resource domain adaptation where the frozen PLM is shared and applied. The first stage is to learn and adapt the soft prompt, where the labeled data from the source domain ƊS and few-shot unlabeled data from the target domain ƊT are provided for semi-supervised domain adaptation and meta learning is implemented. Subsequently, the second stage is to utilize the adapted soft prompt, shown in red, and further adjust it by using a test sentence in the target domain ƊT. To cope with the downstream tasks in multiple domains, the resulting soft prompt is estimated and adapted through the optimization-based meta learning.

In a practical situation, the input query in a natural language processing task originates from a variety of domains. It is important to design a general solution to the soft prompt over different domains, and to provide a rapid unsupervised adaptation to a new domain. This paper presents a new prompt-based language model for multi-domain natural language understanding. A system overview is shown in Figure 4 where the pre-trained language model (PLM) using the BERT encoder and the masked language model (MLM) head are frozen and utilized. The meta soft prompt is seen as the controllable model parameter which is trained and adapted by using support data (shown in purple) and query data (shown in orange) across multiple domains.

Figure 5 shows an example of various meta-training tasks containing support and query sets from two different domains. There are two stages in the implementation of meta learning in the system overview. First, the support data D_i^sup of a domain i are used to adapt the soft prompt parameter θ of the meta learner to an individual learner θi’ with the prompt, e.g. consisting of three continuous-valued tokens {v1, v2, v3} (shown in yellow), via a prompt-based learning objective from support data ℒsupport (or ℒprompt(D_i^sup, θ)). In the second stage, the query data D_i^qry of a domain i are used to calculate the gradients based on the updated prompt tokens of a learner θi = {v1, v2, v3} (here, the learner index i is omitted from the soft prompt tokens v) via a learning objective from query data ℒquery (or ℒprompt(D_i^qry, θ)). After accumulating the gradients from the query data of the individual learners, the soft prompt tokens of the meta learner θ = {v1, v2, v3} (shown in red) are updated accordingly. In the whole process, the frozen PLM is responsible for encoding the input sentence and extracting the embedding or feature representation of the soft prompt and the masked token [7, 30]. Next, the extracted features are passed to the MLM head to calculate the likelihood over the vocabulary of the language model for classification prediction using support data and query data in the natural language understanding tasks, where the learning objectives ℒsupport and ℒquery are calculated, respectively. Notably, the learning objectives are calculated from the labeled data of the source domain ƊS as well as the unlabeled data of the target domain ƊT. The masked token [MASK] is arranged for supervised and unsupervised domain adaptation. The back propagation algorithm is run to minimize ℒsupport and ℒquery to update the parameters of the individual learners θi and the meta learner θ, respectively, as shown by red dashed lines. In such a meta-training procedure, the support set is used to update the soft prompt of the learner θi = {v1, v2, v3}. The query set is then used to calculate the gradients for updating the parameters of the learner θi = {v1, v2, v3}, followed by updating the parameters of the meta learner θ = {v1, v2, v3}. Different from standard meta learning [17], which learns the whole model from a large number of few-shot classification tasks to allow the model to quickly adapt to an unseen classification task, the proposed meta learning focuses on tackling the domain shift by introducing soft prompt learning conditioned on a frozen PLM. All of the meta-training tasks are designed to simulate the scenarios which will be encountered during the test stage. Therefore, the frozen PLM can provide a seed model which is generalized from multiple source domains ƊS to a target domain by only tuning the parameters of the soft prompt.
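One way to picture the meta-training tasks is as episodes drawn from pairs of distinct domains, with the support set taken from one domain and the query set from the other. The sketch below is an assumption-laden illustration: the function `sample_task`, the toy domain names, and the set sizes are hypothetical and are not specified by the text.

```python
import random


def sample_task(domains, k_support, k_query, rng):
    """Draw one meta-training task from a pair of distinct domains."""
    source, target = rng.sample(sorted(domains), 2)
    return {
        "source": source,
        "target": target,
        "support": rng.sample(domains[source], k_support),
        "query": rng.sample(domains[target], k_query),
    }


# Toy corpus: three domains with twenty placeholder reviews each.
domains = {
    name: [f"{name}-review-{i}" for i in range(20)]
    for name in ("books", "dvd", "electronics")
}
rng = random.Random(0)
tasks = [sample_task(domains, k_support=4, k_query=4, rng=rng) for _ in range(8)]
print(tasks[0]["source"], "->", tasks[0]["target"])
```

Sampling the two sets from different domains is what lets every episode rehearse the domain shift that the prompt will face at test time.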

The soft prompt language model is implemented through prompt-based learning, which is driven by the soft prompt template and the label word [14]. The soft prompt template is composed of a set of k continuous trainable vectors or tokens which are appended to the input word string as the description to make a prediction in a target task. The label word is defined as the word with the highest probability that the prompt-based language model would like to predict. For the example of binary sentiment classification, the training pairs of an input string of L words x = {x1,…,xL} and the corresponding output label y ∈ 𝒴 = {positive, negative} are collected. Given the template function T(·), the input string x is transformed to an embedding of the prompt input xprompt as the MLM input, i.e.

(4)

where the masked token [MASK] is merged for class prediction and e(·) denotes the embedding function of the BERT-based PLM ℱ. For the label words, a verbalizer ℳ: 𝒴 ↦ 𝒱* is introduced to construct the label space by a set of label words 𝒱* ⊂ 𝒱 where 𝒱 is the vocabulary of ℱ. This paper considers ℱ as a function mapping the prompt input xprompt to a vocabulary distribution given the masked token [MASK], or equivalently finding the likelihood for predicting the output class y as

(5)

Therefore, the conditional likelihood of the predicted label word y* ∈ 𝒱* due to a masked token [MASK] by using the proposed soft prompt language model driven by a verbalizer ℳ is calculated as a softmax function by

(6)

where Vy* denotes the vocabulary of label words corresponding to the class label y, and vmask(Vy*) denotes the probability of label word Vy* in the vocabulary distribution vmask. The learning objective of soft prompt language model is calculated as a cross entropy error function between predicted output y* and true output y in a form of

(7)

where Ɗ = {x, y} denotes the training data, y is the ground truth label corresponding to a given input x and θ = {v1,…,vk} denotes the k trainable parameters of the soft prompt. In this study, the soft prompt parameter θ of the soft prompt language model is estimated by minimizing ℒprompt(Ɗ, θ). Notably, the meta soft prompt is estimated across a number of domains to construct the domain-agnostic soft prompt. The prompt-based learning objectives from support data ℒsupport (or ℒprompt(Ɗsup, θ)) and query data ℒquery (or ℒprompt(Ɗqry, θ)) are calculated in the implementation.
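The verbalizer and the prompt-based objective can be illustrated with a small numerical sketch. It assumes one label word per class (‘good’ and ‘bad’, borrowed from the earlier cloze example), takes the scores at the masked position as a plain dictionary, and averages the cross entropy over the batch; these choices and the function names are illustrative and may differ from the exact formulation of the paper.

```python
import math

# Assumed verbalizer: one label word per sentiment class.
VERBALIZER = {"positive": "good", "negative": "bad"}


def class_probs(mask_scores, verbalizer=VERBALIZER):
    """Softmax restricted to the label-word scores at the masked position."""
    scores = {y: mask_scores[word] for y, word in verbalizer.items()}
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    norm = sum(exps.values())
    return {y: e / norm for y, e in exps.items()}


def prompt_loss(batch):
    """Mean cross entropy between predicted and true classes."""
    return -sum(math.log(class_probs(s)[y]) for s, y in batch) / len(batch)


scores = {"good": 2.0, "bad": 0.0, "the": 5.0}  # made-up vocabulary scores
probs = class_probs(scores)
print(probs)
print(prompt_loss([(scores, "positive")]))
```

Words outside the verbalizer, such as "the" in this toy example, do not enter the normalization, so the loss depends only on how the label words are ranked against each other.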

This study aims to enhance the generalization of the soft prompt to unseen domains based on meta learning [24, 21], which acquires domain knowledge from various meta-training tasks. Different from traditional meta learning [17], which trains the model from a large number of few-shot classification tasks to allow the model to quickly adapt to an unseen few-shot classification task, the proposed meta learning is developed to tackle the domain shift issue by introducing soft prompt tuning conditioned on a frozen PLM. The meta-training tasks are executed and integrated to simulate the scenarios that may be encountered under an unseen test condition. Accordingly, the backbone PLM can generalize to a new unseen domain by simply tuning the parameters of the soft prompt. The learning procedure of the proposed domain-agnostic soft prompt is implemented through a nested-loop optimization. The inner loop of this optimization initializes the individual learners with the parameters of the meta learner and then updates the parameters of the learners in accordance with Equation (7) by using the support set D_i^sup in each individual meta-training task. In the outer loop, the meta learner is trained by optimizing the performance of the individual learners, which is examined by using the query set D_i^qry in each meta-training task. The overall learning process is run until the convergence condition is met. The parameters of a learner are optimized through the gradient-based method by

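\[ \theta_i' = \theta - \alpha \nabla_\theta \mathcal{L}_{\text{prompt}}(D_i^{\text{sup}}, \theta) \tag{8} \]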

using support data D_i^sup. The parameters of the meta learner are then trained by pursuing the goodness of each individual learner θi’, which is learned over many meta-training tasks 𝒯 = {𝒯i} from the distribution of tasks p(𝒯) using query data D_i^qry as

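\[ \theta \leftarrow \theta - \beta \nabla_\theta \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\text{prompt}}(D_i^{\text{qry}}, \theta_i') \tag{9} \]

where β denotes the learning rate of the outer loop.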

Algorithm 1 illustrates the entire training procedure for the domain-agnostic soft prompt across m domain pairs or meta-training tasks. This solution is implemented for multi-domain supervised domain adaptation by using the prompt-based learning objective ℒprompt where the prompt samples {xprompt, y} with input string x and its class label y from source domains are required.

To activate the unsupervised domain adaptation (UDA) by using the target-domain data without labels, the domain-agnostic soft prompt based on supervised learning via ℒprompt in Figure 4 is accordingly modified to that via semi-supervised domain adaptation in Figure 6 by using source-domain data with labels and target-domain data without labels in accordance with [13]

(10)

with a hyperparameter λ which balances the two learning objectives. In the nested-loop optimization, firstly, the learners are initialized in the inner loop by sharing the parameters of the meta learner. Then, the parameters of the individual learners are updated by using the support set in each meta-training task based on ℒuda. The support set D_i^sup consists of ks labeled samples from source domains {x_j^S, y_j^S}, j = 1,…,ks, and kt unlabeled samples from target domains {x_j^T}, j = 1,…,kt, which are used to calculate the prompt-based objective ℒprompt in Equation (7) and the masked language model (MLM) objective ℒmlm, respectively, and the query set D_i^qry, consisting of kq unlabeled samples from the target domain {x_j^T}, j = 1,…,kq, is used to calculate the MLM objective ℒmlm. Both MLM losses are calculated in the form of a cross-entropy error function. Different from the cross-entropy loss ℒprompt calculated for predicting the class words from labeled data ƊS, the MLM loss ℒmlm is measured for predicting the randomly masked word tokens from unlabeled text strings in target domains. Next, in the outer loop, the meta learner is optimized according to the performance of those learners, which is evaluated on the query set in each meta-training task by using the unlabeled data in the target domain ƊT. The overall learning process is repeated until convergence is reached. The meta learner θ^ is estimated by minimizing the prompt-based and MLM losses over individual learners θi and the MLM loss over meta learning across various meta-training tasks 𝒯i sampled from p(𝒯) without class labels

(11)

where the update with learning rate α can be implemented by using the popular AdamW optimizer. The overall meta soft prompt learning and adaptation is shown in Algorithm 2. A kind of multi-task learning over various domains is implemented. A combination of soft prompt learning and masked language modeling is fulfilled for unsupervised domain adaptation. The adapted soft prompt is able to simultaneously capture the target-domain information from the MLM objective and the task language knowledge from the prompt objective. The domain adaptation loss and the prompt-based classification loss are jointly optimized in the proposed meta soft prompt.
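The two ingredients of the unsupervised objective, random masking of unlabeled target text and a λ-weighted combination of the two losses, can be sketched as follows. Both the masking rate of 0.15 (the value commonly used for BERT) and the convex combination λ·ℒprompt + (1 − λ)·ℒmlm are assumptions made for illustration; the text only states that λ tunes the two objectives.

```python
import random

MASK = "[MASK]"


def random_mask(tokens, mask_prob, rng):
    """Replace random tokens by [MASK] and remember the originals as targets."""
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = tok  # the MLM loss is computed only at these positions
        else:
            masked.append(tok)
    return masked, targets


def uda_loss(prompt_loss, mlm_loss, lam):
    """Assumed convex combination of the classification and MLM objectives."""
    return lam * prompt_loss + (1.0 - lam) * mlm_loss


tokens = "this blender is loud but it crushes ice very well".split()
masked, targets = random_mask(tokens, mask_prob=0.15, rng=random.Random(0))
print(masked, targets)
print(uda_loss(prompt_loss=1.0, mlm_loss=2.0, lam=0.8))
```

Under this assumed form, a larger λ puts more weight on the labeled source data, while a smaller λ lets the unlabeled target text drive the adaptation.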

The proposed meta soft prompt learning was evaluated on two multi-domain text classification benchmarks, namely the Amazon review dataset [1] and the FDU-MTL dataset [31].

Amazon review dataset contained text documents from Amazon.com consisting of four domains including books, DVD, electronics and kitchen (or precisely kitchen and housewares). Each domain contained 1000 positive reviews and 1000 negative reviews. All reviews had the ratings from 1 to 5 stars which were converted into binary labels as the positive review and negative review. Each domain also contained additional unlabeled data which were collected for evaluation of unsupervised domain adaptation and text categorization as shown in Table 1 where the sizes of labeled and unlabeled data, and the document length are provided. FDU-MTL was a relatively challenging dataset consisting of up to 16 domains, which were broadly categorized into Amazon product reviews and movie reviews, where the statistics of experimental data are shown in Table 2. In the experimental setting for domain adaptation, one of the domains was selected as the target domain, and the remaining domains were viewed as the source domains. For meta learning over various source domains, the supervised domain adaptation was performed and the test data in a target domain were evaluated. Zero-shot domain adaptation was then implemented. For the setting of unsupervised domain adaptation, the classification labels in target-domain data were missing. Meta soft prompt was learned from labeled source data and few-shot unlabeled target data, and was evaluated by the rest of test data in target domain. Few-shot domain adaptation was implemented accordingly. Classification accuracy was reported in different sets of evaluation over various domains.
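The evaluation protocol in which one domain is held out as the target and the remaining domains serve as sources can be written as a short generator. The function name `leave_one_domain_out` and the lowercase domain identifiers are hypothetical; only the four domain names come from the text.

```python
def leave_one_domain_out(domains):
    """Yield (source_domains, target_domain) pairs, one per held-out domain."""
    for target in domains:
        sources = [d for d in domains if d != target]
        yield sources, target


AMAZON_DOMAINS = ["books", "dvd", "electronics", "kitchen"]
splits = list(leave_one_domain_out(AMAZON_DOMAINS))
for sources, target in splits:
    print(f"target={target:<12} sources={sources}")
```

Each of the four splits trains the meta soft prompt on three source domains and reports accuracy on the held-out one.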

In the implementation, the stochastic gradient descent algorithm based on the AdamW optimizer [34] was used. The mini-batch size was 256 tokens, and the number of meta-training tasks in one batch was eight. BERT [15] was used as the pre-trained language model (PLM). In meta learning, the learning rates in the inner-loop and outer-loop optimization were set differently. The outer-loop learning rate was fixed at 10^-3 for different lengths of soft prompt, and the inner-loop learning rate was set to 10^-2 and 5 × 10^-3 for soft prompt lengths of 2 and 10, respectively. The dimension of the input to the task-specific layer was 768. The parameters of the task-specific layer and the soft prompt tokens were trained with the same dimension of 768, with two layers and two soft tokens, respectively. In the evaluation, two-dimensional (2D) latent visualization based on t-distributed stochastic neighbor embedding (t-SNE) [35] was analyzed.
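For reference, the settings reported above can be gathered into one configuration object. The class and field names below are hypothetical; the numerical values are the ones stated in the text, and inner-loop learning rates are deliberately defined only for the two prompt lengths that the text reports.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MetaPromptConfig:
    hidden_size: int = 768       # dimension of the BERT features and soft tokens
    tasks_per_batch: int = 8     # meta-training tasks in one batch
    batch_tokens: int = 256      # mini-batch size in tokens
    outer_lr: float = 1e-3       # outer-loop (meta learner) learning rate
    inner_lr_by_length: dict = field(
        default_factory=lambda: {2: 1e-2, 10: 5e-3}
    )

    def inner_lr(self, prompt_length: int) -> float:
        """Inner-loop learning rate; other prompt lengths are not reported."""
        if prompt_length not in self.inner_lr_by_length:
            raise KeyError(f"no reported inner-loop rate for length {prompt_length}")
        return self.inner_lr_by_length[prompt_length]


config = MetaPromptConfig()
print(config.outer_lr, config.inner_lr(2), config.inner_lr(10))
```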

In the experiments, an ablation study on the effect of the learned meta soft prompt was first conducted by comparing two-dimensional visualizations of the original 768-dimensional samples of the masked tokens for the positive and negative reviews from the four individual domains in the Amazon review dataset. The results for those review samples without and with the meta soft prompt are displayed for comparison in Figure 7, where blue and red samples correspond to the positive and negative emotions, respectively, of test reviews in unseen domains for sentiment classification. As can be seen, without the meta soft prompt, the frozen PLM using BERT could not properly separate the different classes of Amazon review samples. However, the samples from positive and negative reviews are clearly separated by introducing the proposed meta soft prompt, which correctly guides the frozen BERT to extract distinguishable features for the two sentiment classes.

Next, the usefulness of the meta soft prompt language model is further examined by comparing two-dimensional latent visualizations of positive versus negative reviews and of source domain versus target domain, as illustrated in Figure 8, where the perspectives of classes and domains are analyzed, respectively. In this comparison, the samples of the masked tokens in the product reviews of ‘Books’, obtained with prompt sentences based on the meta soft prompt, are shown. The test samples in unseen domains are evaluated. From the perspective of classes, the learned meta soft prompt language model extracts useful feature representations for masked tokens, and the latent variables of the two classes are well separated. More interestingly, the 2D latent samples of the masked tokens of the two classes in the source domain and the target domain are close together. This finding suggests that the proposed meta soft prompt pursues the domain-matching property for multi-domain adaptation. In the above evaluation of latent visualization, the soft prompt length was fixed at 5 and λ = 0.8 was used.

Subsequently, the effect of the length of the meta soft prompt is investigated [27]. The experiments on classification accuracy are carried out by considering various prompt lengths. Figure 9 displays the classification accuracy when the length of the meta soft prompt was varied from 2 to 20. The four domains in the Amazon review dataset are analyzed individually. Domain ‘kitchen’ receives the highest accuracy while domain ‘DVD’ obtains the lowest accuracy. The accuracy increased as the number of soft tokens in meta learning increased, but saturated at a length of around 10.

Furthermore, the prediction of masked tokens based on the hard prompt and the meta soft prompt is compared. Table 3 shows examples of the templates of prompt sentences based on hard prompt and meta soft prompt, where the input text and the masked token [MASK] are included. A popular hard prompt is designed as “This {domain} is [MASK]” by introducing the domain name for input representation in a sentiment classification task. In this study, the soft tokens {v1,v2,v3} with prompt length 3 are estimated via meta learning to play the role of the hand-crafted hard tokens. Table 4 compares the top five predicted words for two examples from the product-review domains ‘Music’ and ‘Magazine’ in the FDU-MTL dataset by using the hard prompt and the meta soft prompt. With the hard prompt, the BERT backbone model is not correctly guided toward the right classification, and the prediction is prone to be irrelevant or wrong. In contrast, the proposed meta soft prompt tends to find the right sentiment class in the different domains.
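The two templates in Table 3 can be written out as simple string builders. This is only a sketch of the template layout: the placeholder names `[V1]`, `[V2]`, `[V3]` and the function names are hypothetical, and in practice the soft tokens are trainable continuous embedding vectors rather than vocabulary words, so the soft template here only marks where those vectors would be inserted.

```python
def build_hard_prompt(text, domain):
    """Hard prompt: the hand-crafted template 'This {domain} is [MASK].'"""
    return f"{text} This {domain} is [MASK]."


def build_soft_prompt(text, prompt_length=3):
    """Soft prompt: placeholder slots for trainable vectors, then the masked token."""
    soft_tokens = " ".join(f"[V{i}]" for i in range(1, prompt_length + 1))
    return f"{text} {soft_tokens} [MASK]."


if __name__ == "__main__":
    review = "This was very disappointing to say the least."
    print(build_hard_prompt(review, "album"))
    print(build_soft_prompt(review))
```

The masked language model then predicts a word at the [MASK] position, and a verbalizer maps that word to a sentiment class.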

For a comparative study, the sentiment classification accuracy of review data by using the fine-tuning (FT) method, the standard soft prompt (SP) [27] and the proposed meta soft prompt (MSP) is reported in Table 5, where the four individual domains in the Amazon review dataset are evaluated. The experiments were set up by choosing one domain as the target for testing and the remaining domains as the sources for meta training. Meta training over various pairs of source domains was performed in a supervised manner. Notably, testing is performed over an unseen target domain without any adaptation data, so zero-shot adaptation is examined. To conduct a fair evaluation, the number of parameters in the different methods was controlled to be comparable. For fine-tuning, the number of trainable parameters in the task-specific layer was 768 × 2. The length of the 768-dimensional soft tokens in SP and MSP was therefore fixed at 2. In this set of experiments, the proposed meta soft prompt achieves the highest accuracy when compared with fine-tuning and standard soft prompt in most domains. In addition, the experimental setting on review or document classification is extended to the case of SP and MSP where the length of soft prompt is increased and fixed at 5. For this case, the accuracy using MSP is consistently higher than that using SP for the various domains. The meta soft prompt learns cross-domain knowledge from multiple domain pairs, which helps domain adaptation more than the standard soft prompt, where multi-domain information is simply integrated without pairwise cross-domain learning.

Next, the proposed MSP language model is compared with previous multi-task learning methods over the 16 individual domains of the FDU-MTL dataset. The adapted meta soft prompt works together with the frozen PLM to conduct the evaluation on the target domain. Similar to the experimental setting in Table 5, the different learning methods were performed over various source domains in a supervised way, but testing was run on an unseen target domain without additional adaptation data. The classification result is shown in Table 6. The result using the proposed meta soft prompt (MSP) is compared with strong baselines including the multi-task deep neural network (MT-DNN) [32], the adversarial multi-task learning (ASP-MTL) [31], the multinomial adversarial network with the least square loss (MAN-L2) and with the negative log-likelihood loss (MAN-NLL) [4], and the BERT model [15] which is fine-tuned (FT) on each domain. ASP-MTL [31] used additional abundant unlabeled samples from the target domain. This set of experiments shows that MSP performs better than the other methods in most domains. On average, MSP obtains a classification accuracy of 89.5% over the 16 domains, which is the highest in this comparison.
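As a quick check of the averages quoted for Table 6, the per-domain accuracies of fine-tuned BERT (FT) and MSP can be averaged directly. The numbers below are copied from Table 6; the variable names are illustrative only.

```python
from statistics import mean

# Accuracy (%) per domain as (FT, MSP), copied from Table 6.
TABLE6 = {
    "Books": (87.0, 89.0), "DVD": (85.6, 88.1),
    "Electronics": (88.3, 90.3), "Kitchen": (91.0, 90.7),
    "Apparel": (90.0, 92.0), "Camera": (90.0, 90.8),
    "Health": (88.3, 91.3), "Music": (86.8, 87.8),
    "Toys": (90.3, 90.8), "Video": (88.0, 88.4),
    "Baby": (91.5, 91.3), "Magazine": (89.8, 90.2),
    "Software": (89.3, 90.9), "Sports": (90.8, 91.8),
    "IMDB": (85.8, 88.3), "MR": (74.0, 80.4),
}

ft_avg = mean(ft for ft, _ in TABLE6.values())
msp_avg = mean(msp for _, msp in TABLE6.values())
# Number of domains where MSP is more accurate than fine-tuning.
msp_wins = sum(msp > ft for ft, msp in TABLE6.values())

if __name__ == "__main__":
    print(f"FT average:  {ft_avg:.1f}")
    print(f"MSP average: {msp_avg:.1f}")
    print(f"MSP beats FT in {msp_wins} of {len(TABLE6)} domains")
```

The recomputed averages agree with the AVG row of Table 6 after rounding to one decimal place.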

In addition, the comparison of classification accuracy is extended to a recent work called MoE-Tr [42]. MoE-Tr is a transformer-based multi-source domain adaptation method, which introduced a mixture of experts (MoE) technique where multiple PLMs were involved in adaptation. This method required additional unlabeled target-domain data for domain adversarial training. MoE-Tr performed domain adversarial learning [40] where many shots were used. Different from MoE-Tr, the proposed MSP involves only a single PLM, which is frozen and used to train the soft prompt via meta learning. The length of soft prompt is now extended to 10 for both SP and MSP. Table 7 shows that MSP performs better than SP. This likely happens because MSP is learned from pairwise cross-domain information, which is richer than the integrated domain information in SP without pairwise domain adaptation. Meanwhile, the performance of MSP without additional unlabeled adaptation data is not as good as that of MoE-Tr, which requires unlabeled adaptation data. However, the number of trainable parameters using MSP is only 7.68K, which is far smaller than the 264M used by MoE-Tr. MoE-Tr requires 33.6K times as many trainable parameters as MSP, and therefore suffers from a high parameter size and computational cost.
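The parameter counts quoted in this comparison follow from the prompt length and the 768-dimensional embedding. The short calculation below is a sketch using the rounded figures from the text; `soft_prompt_params` is a hypothetical helper. With the rounded value of 264M the ratio comes out at about 34.4K, the same order as the 33.6K quoted above; the small gap may come from rounding of the MoE-Tr parameter count.

```python
HIDDEN_DIM = 768  # embedding dimension stated in the implementation details


def soft_prompt_params(length, dim=HIDDEN_DIM):
    """Trainable parameters of a soft prompt: one dim-sized vector per soft token."""
    return length * dim


msp_params = soft_prompt_params(10)   # prompt length 10, as used in Table 7
moe_tr_params = 264_000_000           # 264M, rounded figure quoted in the text
ratio = moe_tr_params / msp_params

if __name__ == "__main__":
    print(f"length 2:  {soft_prompt_params(2)} parameters")
    print(f"length 10: {msp_params} parameters")
    print(f"MoE-Tr / MSP ratio: {ratio:,.0f}")
```

The length-2 case gives 768 × 2 parameters, matching the budget of the task-specific layer used for fine-tuning in the Table 5 setting.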

To conduct a fair comparison, the set of experiments in Table 7 also includes the results of SP and MSP where additional few-shot unlabeled samples are provided for unsupervised domain adaptation. The classification accuracy shows that MSP works considerably better than SP. This is likely because MSP is learned from different pairs of domain adaptation, which results in a model that adapts better than SP, where learning to learn for cross-domain adaptation is ignored. In addition, relative to MoE-Tr, which involved multiple PLMs and utilized abundant adaptation data in the target domain, SP and MSP are implemented by using only 4 or 8 shots of unlabeled adaptation samples. When eight shots of adaptation data are employed for unsupervised meta soft prompt learning, the resulting classification accuracy is close to that of MoE-Tr, where a large-scale controllable model and multiple PLMs were required and a large number of adaptation samples were enrolled. The additional few-shot unsupervised domain adaptation (either 4 or 8 shots) is further evaluated on the FDU-MTL dataset as shown in Table 8. In this set of experiments, the ablation study without and with unsupervised domain adaptation (UDA) is also shown. The length of soft tokens in SP and MSP is fixed at 10. There are some findings.

First, similar to Table 7 for the Amazon review dataset, the classification accuracy on the FDU-MTL dataset is consistently improved from SP to MSP in the setting of unsupervised domain adaptation. With SP, UDA does not work as well as with MSP. This is possibly because MSP learns a better foundation of soft prompt than SP. Also, increasing the number of shots in MSP raises the classification accuracy in most target domains. In general, improvements from increasing the shot number and the soft prompt length are obtained through the proposed meta soft prompting and learning.
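To make the Amazon review comparison in Table 7 concrete, the four per-domain accuracies of each method can be averaged. The values below are copied from Table 7 in the order Books, DVD, Electronics, Kitchen; the averaging itself is an illustration added here and is not reported in the table.

```python
# Accuracy (%) for Books, DVD, Electronics, Kitchen, copied from Table 7.
TABLE7 = {
    "MoE-Tr": [90.0, 89.3, 90.6, 90.8],
    "SP": [87.5, 86.2, 87.4, 88.5],
    "MSP": [88.6, 86.9, 88.4, 89.8],
    "SP (4)": [87.9, 87.0, 87.9, 89.2],
    "MSP (4)": [88.7, 88.5, 89.2, 90.5],
    "MSP (8)": [89.0, 88.1, 90.3, 90.7],
}

# Mean accuracy per method over the four domains.
averages = {method: sum(acc) / len(acc) for method, acc in TABLE7.items()}
# Remaining gap between MoE-Tr and MSP with eight unlabeled shots.
gap_to_moe_tr = averages["MoE-Tr"] - averages["MSP (8)"]

if __name__ == "__main__":
    for method, avg in averages.items():
        print(f"{method:8s} {avg:.2f}")
    print(f"MoE-Tr minus MSP (8): {gap_to_moe_tr:.2f} points")
```

Under this simple average, MSP with eight unlabeled shots sits within about one accuracy point of MoE-Tr while training 7.68K parameters.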

This paper has presented a meta learning approach to domain-agnostic soft prompts for multi-domain text representation and language understanding. The soft prompt was combined with the observed and the masked word tokens in a prompt sentence to enrich the input representation. Based on the doubly-looped optimization, the learned meta learner was able to extract useful features from the frozen pre-trained language model, so that the fine-tuning process was avoided. In addition, an unsupervised domain adaptation was developed with very limited trainable parameters and without the need for labeled data in the target domain. This unsupervised domain adaptation was proposed to pursue the extra benefit of learning with unlabeled data in the target domain. The experiments on multi-domain representation showed that this method was able to obtain latent spaces with obvious class separation. The results have shown that the meta soft prompt could successfully boost a frozen pre-trained model to capture domain-specific information and achieve desirable classification performance by training only a few parameters. This method has obtained higher accuracy than the other methods in most domains over two text datasets. For future work, the proposed method could be integrated with other unsupervised domain adaptation methods in the inner optimization for meta learning. The proposed method could also be combined with not only masked language models but also sequence-to-sequence or autoregressive models for different applications.

This work was supported in part by the National Science and Technology Council, Taiwan, under Contract NSTC 112-2634-F-A49-006.

Jen-Tzung Chien is currently the Lifetime Chair Professor at National Yang Ming Chiao Tung University, Hsinchu, Taiwan. He has authored more than 250 peer-reviewed articles in machine learning with applications on natural language processing and computer vision, and three books including Bayesian Speech and Language Processing, Cambridge University Press, 2015, Source Separation and Machine Learning, Academic Press, 2018, and Machine Learning for Speaker Recognition, Cambridge University Press, 2020. He was a Tutorial Speaker of AAAI, IJCAI, ACL, MM, KDD, ICASSP and Interspeech. He received the Best Paper Award in IEEE Workshop on Automatic Speech Recognition and Understanding in 2011, and IEEE International Workshop on Machine Learning for Signal Processing in 2023.

Ming-Yen Chen received the M.S. degree from the Artificial Intelligence Graduate Program at National Yang Ming Chiao Tung University, Hsinchu, Taiwan in 2022. His research interests include meta learning, prompt engineering and natural language processing.

Ching-hsien Lee is currently the Lead of the Computational Intelligence Technology Division in the Information and Communications Labs, Industrial Technology Research Institute, Hsinchu, Taiwan. His research interests include large language model, natural language technology and generative artificial intelligence.

Jing-Hao Xue received the Dr.Eng. degree in signal and information processing from Tsinghua University in 1998, and the Ph.D. degree in statistics from the University of Glasgow in 2008. He is a Professor of Statistical Pattern Recognition in the Department of Statistical Science, University College London. His research interests include statistical pattern recognition, machine learning, and computer vision. He received the Best Associate Editor Award of 2021 from the IEEE Transactions on Circuits and Systems for Video Technology, and the Outstanding Associate Editor Award of 2022 from the IEEE Transactions on Neural Networks and Learning Systems.

References

[1] J. Blitzer, M. Dredze, and F. Pereira, “Biographies, bollywood, boom-boxes and blenders: domain adaptation for sentiment classification”, in Proc. of Annual Meeting of Association of Computational Linguistics, 2007, 440–447.
[2] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., “Language models are few-shot learners”, Advances in Neural Information Processing Systems, 33, 2020, 1877–1901.
[3] H.-Y. Chen and J.-T. Chien, “Deep semi-supervised learning for domain adaptation”, in Proc. of International Workshop on Machine Learning for Signal Processing, 2015, 1–6.
[4] X. Chen and C. Cardie, “Multinomial adversarial networks for multi-domain text classification”, in Proc. of Conference of North American Chapter of Association for Computational Linguistics: Human Language Technologies, 2018, 1226–1240.
[5] J.-T. Chien, “Deep Bayesian natural language processing”, in Proc. of Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts, 2019, 25–30.
[6] J.-T. Chien and C.-W. Huang, “Stochastic adversarial learning for domain adaptation”, in Proc. of International Joint Conference on Neural Networks, 2020, 1–7.
[7] J.-T. Chien and Y.-H. Huang, “Bayesian transformer using disentangled mask attention”, in Proc. of Annual Conference of International Speech Communication Association, 2022, 1761–1765.
[8] J.-T. Chien and J.-C. Junqua, “Unsupervised hierarchical adaptation using reliable selection of cluster-dependent parameters”, Speech Communication, 30(4), 2000, 235–253.
[9] J.-T. Chien and Y.-C. Ku, “Bayesian recurrent neural network for language modeling”, IEEE Transactions on Neural Networks and Learning Systems, 27(2), 2015, 361–374.
[10] J.-T. Chien and W. Lai, “Variational skill embeddings for meta reinforcement learning”, in Proc. of International Joint Conference on Neural Networks, 2023, 1–8.
[11] J.-T. Chien and W. X. Lieow, “Meta learning for hyperparameter optimization in dialogue system”, in Proc. of Annual Conference of International Speech Communication Association, 2019, 839–843.
[12] J.-T. Chien and Y.-Y. Lyu, “Partially adversarial learning and adaptation”, in Proc. of European Signal Processing Conference, 2019, 15.
[13] J.-T. Chien, M.-Y. Chen, and J.-H. Xue, “Learning meta soft prompt for few-shot language models”, in Proc. of Asia Pacific Signal and Information Processing Association Annual Summit and Conference, 2023, 57–62.
[14] J.-T. Chien, H.-T. Wang, and C.-H. Lee, “Contrastive meta learning for soft prompts using dynamic mixup”, in Proc. of International Joint Conference on Neural Networks, 2024, 1–6.
[15] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, in Proc. of Conference of North American Chapter of Association for Computational Linguistics, 2019, 4171–4186.
[16] C. Du, H. Sun, J. Wang, Q. Qi, and J. Liao, “Adversarial and domain-aware BERT for cross-domain sentiment analysis”, in Proc. of Annual Meeting of Association for Computational Linguistics, 2020, 4019–4028.
[17] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks”, in Proc. of International Conference on Machine Learning, 2017, 1126–1135.
[18] T. Gao, A. Fisch, and D. Chen, “Making pre-trained language models better few-shot learners”, in Proc. of International Joint Conference on Natural Language Processing, 2021, 3816–3830.
[19] S. Gururangan, A. Marasovic, S. Swayamdipta, K. Lo, I. Beltagy, D. Downey, and N. A. Smith, “Don’t stop pretraining: adapt language models to domains and tasks”, in Proc. of Annual Meeting of Association for Computational Linguistics, 2020, 8342–8360.
[20] X. Han and J. Eisenstein, “Unsupervised domain adaptation of contextualized embeddings for sequence labeling”, in Proc. of Conference on Empirical Methods in Natural Language Processing, 2019, 4238–4248.
[21] Y. Hou, H. Dong, X. Wang, B. Li, and W. Che, “MetaPrompting: learning to learn better prompts”, in Proc. of International Conference on Computational Linguistics, 2022, 3251–3262.
[22] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly, “Parameter-efficient transfer learning for NLP”, in Proc. of International Conference on Machine Learning, 2019, 2790–2799.
[23] Y. Huang, K. Qian, and Z. Yu, “Learning a Better Initialization for Soft Prompts via Meta-Learning”, in Proc. of International Joint Conference on Natural Language Processing, 2023, 67–75.
[24] W. Jiang, Y. Zhang, and J. Kwok, “Effective structured prompting by meta-learning and representative verbalizer”, in Proc. of International Conference on Machine Learning, 2023, 15186–99.
[25] C. Karouzos, G. Paraskevopoulos, and A. Potamianos, “UDALM: unsupervised domain adaptation through language modeling”, in Proc. of Conference of North American Chapter of Association for Computational Linguistics: Human Language Technologies, 2021, 2579–2590.
[26] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, and J. Kang, “BioBERT: a pre-trained biomedical language representation model for biomedical text mining”, Bioinformatics, 36(4), 2020, 1234–1240.
[27] B. Lester, R. Al-Rfou, and N. Constant, “The power of scale for parameter-efficient prompt tuning”, in Proc. of Conference on Empirical Methods in Natural Language Processing, 2021, 3045–3059.
[28] L. Li, M.-W. Mak, and J.-T. Chien, “Contrastive adversarial domain adaptation networks for speaker recognition”, IEEE Transactions on Neural Networks and Learning Systems, 33(5), 2022, 2236–2245.
[29] W. Lin, M.-W. Mak, and J.-T. Chien, “Multisource i-vectors domain adaptation using maximum mean discrepancy based autoencoders”, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(12), 2018, 2412–2422.
[30] H. Lio, S.-E. Li, and J.-T. Chien, “Adversarial mask transformer for sequential learning”, in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing, 2022.
[31] P. Liu, X. Qiu, and X. Huang, “Adversarial multi-task learning for text classification”, in Proc. of Annual Meeting of Association for Computational Linguistics, 2017, 1–10.
[32] X. Liu, J. Gao, X. He, L. Deng, K. Duh, and Y. Wang, “Representation learning using multi-task deep neural networks for semantic classification and information retrieval”, in Proc. of Conference of North American Chapter of Association for Computational Linguistics, 2015, 912–921.
[33] X. Liu, Y. Zheng, Z. Du, M. Ding, Y. Qian, Z. Yang, and J. Tang, “GPT understands, too”, arXiv preprint arXiv:2103.10385, 2021.
[34] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization”, in Proc. of International Conference on Learning Representations, 2019.
[35] L. van der Maaten and G. Hinton, “Visualizing data using t-SNE”, Journal of Machine Learning Research, 9(86), 2008, 2579–2605.
[36] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Gray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe, “Training language models to follow instructions with human feedback”, in Advances in Neural Information Processing Systems, 2022.
[37] G. Qin and J. Eisner, “Learning how to ask: querying LMs with mixtures of soft prompts”, in Proc. of Conference of North American Chapter of Association for Computational Linguistics: Human Language Technologies, 2021, 5203–5212.
[38] T. Schick and H. Schütze, “Exploiting cloze-questions for few-shot text classification and natural language inference”, in Proc. of Conference of European Chapter of the Association for Computational Linguistics, 2021, 255–269.
[39] T. Schick and H. Schütze, “It’s not just size that matters: small language models are also few-shot learners”, in Proc. of Conference of North American Chapter of Association for Computational Linguistics: Human Language Technologies, 2021, 2339–2352.
[40] J.-C. Tsai and J.-T. Chien, “Adversarial domain separation and adaptation”, in Proc. of International Workshop on Machine Learning for Signal Processing, 2017, 1–6.
[41] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need”, in Advances in Neural Information Processing Systems, 2017, 5998–6008.
[42] D. Wright and I. Augenstein, “Transformer based multi-source domain adaptation”, in Proc. of Conference on Empirical Methods in Natural Language Processing, 2020, 7963–7974.
[43] C.-H. Yang, Y.-Y. Tsai, and P.-Y. Chen, “Voice2series: reprogramming acoustic models for time series classification”, in Proc. of International Conference on Machine Learning, 2021, 11808–19.
[44] L.-J. Yang, I.-P. Yeh, and J.-T. Chien, “Low-resource speech synthesis with speaker-aware embedding”, in Proc. of International Symposium on Chinese Spoken Language Processing, 2022, 235–239.
[45] H. Zhang, X. Zhang, H. Huang, and L. Yu, “Prompt-based meta-learning for few-shot text classification”, in Proc. of Conference on Empirical Methods in Natural Language Processing, 2022, 1342–1357.
[46] N. Zhang, L. Li, X. Chen, S. Deng, Z. Bi, C. Tan, F. Huang, and H. Chen, “Differentiable prompt makes pre-trained language models better few-shot learners”, in Proc. of International Conference on Learning Representations, 2022.
Published in APSIPA Transactions on Signal and Information Processing. Published by Emerald Publishing Limited. This article is published under the Creative Commons Attribution-NonCommercial (CC BY-NC 4.0) licence. Anyone may reproduce, distribute, translate and create derivative works of this article (for non-commercial purposes only), subject to full attribution to the original publication and authors. The full terms of this licence may be seen at Link to the terms of the CC BY-NC 4.0 licence.

Data & Figures

Figure 1

Illustration of (a) fine-tuning, (b) hard prompt and (c) soft prompt for sentiment classification where the pre-trained language model is utilized. The tunable parameters are shown by yellow while the frozen parameters are shown by blue. Outputs of these methods could be either class labels or word tokens (shown by green) via verbalizer.

Figure 2

Illustration of optimization-based meta learning which optimizes for a meta representation θmeta (shown by black) that can be quickly adapted to new tasks {θ1*,θ2*,θ3*}. The blue arrows indicate the gradients θTi or simply ∇ℒi of individual tasks 𝒯i (i = 1, 2, 3) and the red arrows indicate the updates of the corresponding models θi*.

Figure 3

Soft prompt language models for two implementations of unsupervised domain adaptation (UDA) under low-resource settings. (a) Zero-shot UDA is performed by using the learned soft prompt combined with the test data from 𝒟T. (b) Few-shot semi-supervised domain adaptation is performed by using few-shot labeled data from 𝒟S and unlabeled data from 𝒟T, where the learned soft prompt (shown by yellow) is applied. The soft prompt is then adapted (as shown by red) and employed for UDA using test data from 𝒟T. The pre-trained language model is utilized and frozen in these two realizations of UDA.

Figure 4

System overview for meta soft prompting and learning, which are developed for multi-domain adaptation by using the labeled data from source domain 𝒟S where the learning objectives ℒsupport and ℒquery are calculated from class labels by using the masked tokens [MASK] (shown by white). The learning objectives are also accumulated from unlabeled data in target domain 𝒟T for prediction of the randomly masked tokens [MASK] in input sentences. The objectives ℒsupport and ℒquery are calculated and minimized to clone the soft prompt tokens through the error back propagation.

Figure 5

Meta training over different support set (shown by purple) and query set (shown by orange). There are m individual mappings of domain adaptation across domain pairs for sentiment classification.

Algorithm 1:

Training meta soft prompt over various pairs of source domains 𝒟S

Figure 6

Unsupervised domain adaptation for multi-domain language modeling. Domain agnostic soft prompt is learned by using support data 𝒟sup (shown by purple) and query data 𝒟qry (shown by orange) from various source domains 𝒟S and target domains 𝒟T where few-shot unlabeled data in target domain 𝒟T are enrolled. Calculation flow of UDA and MLM losses is shown. Error backpropagation over losses ℒuda and ℒmlm in inner and outer optimizations is shown by red dashed lines, respectively.

Algorithm 2:

Unsupervised domain adaptation over various source domains 𝒟S and target domains 𝒟T

Figure 7

Two-dimensional latent visualization for the positive reviews and negative reviews in Amazon review dataset shown by blue and red, respectively, for four individual domains where the results of (a) without meta soft prompt and (b) with meta soft prompt are compared.

Figure 8

Two-dimensional latent visualization for the product reviews of ‘Books’ in target domain in Amazon review dataset by using the learned meta soft prompt. The results of (a) positive reviews (blue) versus negative reviews (red), and (b) the reviews in source domain (orange) versus target domain (green) are compared.

Figure 9

Comparison of classification accuracy for four individual domains in Amazon review dataset where the length of meta soft prompt was varied for evaluation.

Table 1

Statistics of individual domains in Amazon review dataset.

| Domain | labeled data size | unlabeled data size | avg. len. |
| --- | --- | --- | --- |
| Books | 2000 | 2000 | 159 |
| DVD | 2000 | 2000 | 173 |
| Electronics | 1998 | 2000 | 101 |
| Kitchen | 2000 | 2000 | 89 |

Table 2

Statistics of individual domains in FDU-MTL dataset.

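| Domain | labeled data size | unlabeled data size | avg. len. |
| --- | --- | --- | --- |
| Books | 2000 | 2000 | 159 |
| DVD | 2000 | 2000 | 173 |
| Electronics | 1998 | 2000 | 101 |
| Kitchen | 2000 | 2000 | 89 |
| Apparel | 2000 | 2000 | 57 |
| Camera | 2000 | 2000 | 130 |
| Health | 1998 | 2000 | 81 |
| Music | 2000 | 2000 | 136 |
| Toys | 2000 | 2000 | 90 |
| Video | 2000 | 2000 | 156 |
| Baby | 1998 | 2000 | 104 |
| Magazine | 2000 | 2000 | 117 |
| Software | 2000 | 2000 | 129 |
| Sports | 2000 | 2000 | 94 |
| IMDB | 1998 | 2000 | 269 |
| MR | 2000 | 2000 | 21 |

Table 3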

Comparison of prompt templates based on hard prompt and soft prompt where input text and masked token are included to construct the prompt sentences. A standard hard prompt for sentiment classification is designed by applying different domains listed in Tables 1 and 2.

| | prompt sentence |
| --- | --- |
| hard prompt | input text This {domain} is [MASK]. |
| soft prompt | input text v1,v2,v3 [MASK]. |

Table 4

Top five predicted words based on the probability for the masked token where text representations using hard prompt and meta soft prompt are compared. The examples of product reviews of the domains (a) Music and (b) Magazine in FDU-MTL dataset are shown. Red shows that the predicted word is seen as an irrelevant or wrong sentiment for the given product review. Blue shows that the predicted word is viewed as a correct sentiment.

(a) Music product review

input text: This album contains only rap and no rock songs. This was very disappointing to say the least. → negative review

| rank | hard prompt | meta soft prompt |
| --- | --- | --- |
| 1 | positive | bad |
| 2 | worth | unacceptable |
| 3 | negative | disappointing |
| 4 | lacking | terrible |
| 5 | good | wrong |

(b) Magazine product review

input text: I still have not received this magazine, what is taking so long! → negative review

| rank | hard prompt | meta soft prompt |
| --- | --- | --- |
| 1 | interesting | terrible |
| 2 | good | difficult |
| 3 | great | unacceptable |
| 4 | excellent | frightening |
| 5 | boring | complicated |

Table 5

Comparison of classification accuracy (%) on various domains in the Amazon review dataset using fine-tuning (FT), soft prompt (SP) [27] and the proposed meta soft prompt (MSP), where the length of soft prompt is set to 2 and 5 for both SP and MSP, as indicated in each column header. The highest accuracy in each row is shown in bold.

| Domain | FT | SP (length 2) | MSP (length 2) | SP (length 5) | MSP (length 5) |
|---|---|---|---|---|---|
| Books | 80.8 | 85.2 | 83.6 | 86.8 | **88.0** |
| DVD | 79.3 | 83.2 | 83.4 | 84.4 | **85.6** |
| Electronics | 79.4 | 82.1 | 82.6 | 84.8 | **85.1** |
| Kitchen | 79.5 | 84.5 | 86.2 | 86.1 | **87.6** |
Table 6

Comparison of classification accuracy (%) on 16 domains in FDU-MTL dataset where the previous methods (MT-DNN, ASP-MTL, MAN-L2, MAN-NLL, FT) and the proposed MSP are evaluated. Length of soft prompt is set as 5.

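| Domain | MT-DNN | ASP-MTL | MAN-L2 | MAN-NLL | FT | MSP |
|---|---|---|---|---|---|---|
| Books | 82.2 | 84.0 | 87.6 | 86.8 | 87.0 | 89.0 |
| DVD | 84.2 | 85.5 | 88.1 | 88.6 | 85.6 | 88.1 |
| Electronics | 81.7 | 86.8 | 87.4 | 88.8 | 88.3 | 90.3 |
| Kitchen | 80.7 | 86.2 | 89.8 | 89.9 | 91.0 | 90.7 |
| Apparel | 85.0 | 87.0 | 87.6 | 87.6 | 90.0 | 92.0 |
| Camera | 86.2 | 89.2 | 91.4 | 90.7 | 90.0 | 90.8 |
| Health | 85.7 | 88.2 | 89.8 | 89.4 | 88.3 | 91.3 |
| Music | 84.7 | 82.5 | 85.9 | 85.5 | 86.8 | 87.8 |
| Toys | 87.7 | 88.0 | 90.0 | 90.4 | 90.3 | 90.8 |
| Video | 85.0 | 84.5 | 89.5 | 89.6 | 88.0 | 88.4 |
| Baby | 88.0 | 88.2 | 90.0 | 90.2 | 91.5 | 91.3 |
| Magazine | 89.5 | 92.2 | 92.5 | 92.9 | 89.8 | 90.2 |
| Software | 85.7 | 87.2 | 90.4 | 90.9 | 89.3 | 90.9 |
| Sports | 83.2 | 85.7 | 89.0 | 89.0 | 90.8 | 91.8 |
| IMDB | 83.2 | 85.5 | 86.6 | 87.0 | 85.8 | 88.3 |
| MR | 75.5 | 76.7 | 76.1 | 76.7 | 74.0 | 80.4 |
| AVG | 84.3 | 86.1 | 88.2 | 88.4 | 87.9 | 89.5 |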
Table 7

Comparison of classification accuracy (%) and number of trainable parameters (N) for the methods without and with additional few-shot unsupervised domain adaptation (UDA), where the number of shots, shown in parentheses, is set to 4 and 8. The length of soft prompt is fixed at 10. MoE-Tr [42] is included in the comparison. Notably, MoE-Tr requires a large-scale trainable model and multiple PLMs, whereas MSP is parameter efficient and involves only one PLM. Various domains in the Amazon review dataset are evaluated.

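| Domain | MoE-Tr | SP | MSP | SP (4) | MSP (4) | MSP (8) |
|---|---|---|---|---|---|---|
| Books | 90.0 | 87.5 | 88.6 | 87.9 | 88.7 | 89.0 |
| DVD | 89.3 | 86.2 | 86.9 | 87.0 | 88.5 | 88.1 |
| Electronics | 90.6 | 87.4 | 88.4 | 87.9 | 89.2 | 90.3 |
| Kitchen | 90.8 | 88.5 | 89.8 | 89.2 | 90.5 | 90.7 |
| UDA | yes | no | no | yes | yes | yes |
| N | 264M | 7.68K | 7.68K | 7.68K | 7.68K | 7.68K |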
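
The parameter count N = 7.68K reported for SP and MSP is consistent with a soft prompt of length 10 whose tokens each have one trainable embedding vector. The short sketch below reproduces that arithmetic under the assumption that the frozen PLM has a hidden size of 768 (as in BERT-base); the hidden size and the helper names are assumptions for illustration and are not stated in the table.

```python
def soft_prompt_params(prompt_length: int, hidden_size: int) -> int:
    """Number of trainable values when each soft token is one embedding vector."""
    return prompt_length * hidden_size


def format_k(n: int) -> str:
    """Format a count in thousands with two decimals, e.g. 7680 -> '7.68K'."""
    return f"{n / 1000:.2f}K"


if __name__ == "__main__":
    # Assumption: hidden size 768 (BERT-base); prompt length 10 as in Table 7.
    n = soft_prompt_params(prompt_length=10, hidden_size=768)
    print(n, format_k(n))
```

Under this assumption the count depends only on the prompt length and the embedding dimension, which is why it stays the same for SP and MSP regardless of the number of shots.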
Table 8

Comparison of classification accuracy (%) for the methods without and with additional few-shot unsupervised domain adaptation (UDA) where number of shots is set as 4 and 8. Length of soft prompt is fixed as 10. The randomly-selected domains in FDU-MTL dataset are evaluated.

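| Domain | FT | SP | MSP | SP (4) | MSP (4) | SP (8) | MSP (8) |
|---|---|---|---|---|---|---|---|
| Health | 88.3 | 90.5 | 91.3 | 90.8 | 91.8 | 90.7 | 92.2 |
| Music | 86.8 | 86.2 | 87.8 | 86.0 | 88.4 | 86.4 | 89.2 |
| Toys | 90.3 | 90.0 | 90.8 | 90.7 | 91.9 | 90.5 | 91.8 |
| Magazine | 89.8 | 88.9 | 90.2 | 88.7 | 91.0 | 88.9 | 92.1 |
| UDA | no | no | no | yes | yes | yes | yes |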

References

[1] J. Blitzer, M. Dredze, and F. Pereira, “Biographies, bollywood, boom-boxes and blenders: domain adaptation for sentiment classification”, in Proc. of Annual Meeting of Association of Computational Linguistics, 2007, 440–447.
[2] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., “Language models are few-shot learners”, Advances in Neural Information Processing Systems, 33, 2020, 1877–1901.
[3] H.-Y. Chen and J.-T. Chien, “Deep semi-supervised learning for domain adaptation”, in Proc. of International Workshop on Machine Learning for Signal Processing, 2015, 1–6.
[4] X. Chen and C. Cardie, “Multinomial adversarial networks for multi-domain text classification”, in Proc. of Conference of North American Chapter of Association for Computational Linguistics: Human Language Technologies, 2018, 1226–1240.
[5] J.-T. Chien, “Deep Bayesian natural language processing”, in Proc. of Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts, 2019, 25–30.
[6] J.-T. Chien and C.-W. Huang, “Stochastic adversarial learning for domain adaptation”, in Proc. of International Joint Conference on Neural Networks, 2020, 1–7.
[7] J.-T. Chien and Y.-H. Huang, “Bayesian transformer using disentangled mask attention”, in Proc. of Annual Conference of International Speech Communication Association, 2022, 1761–1765.
[8] J.-T. Chien and J.-C. Junqua, “Unsupervised hierarchical adaptation using reliable selection of cluster-dependent parameters”, Speech Communication, 30(4), 2000, 235–253.
[9] J.-T. Chien and Y.-C. Ku, “Bayesian recurrent neural network for language modeling”, IEEE Transactions on Neural Networks and Learning Systems, 27(2), 2015, 361–374.
[10] J.-T. Chien and W. Lai, “Variational skill embeddings for meta reinforcement learning”, in Proc. of International Joint Conference on Neural Networks, 2023, 1–8.
[11] J.-T. Chien and W. X. Lieow, “Meta learning for hyperparameter optimization in dialogue system”, in Proc. of Annual Conference of International Speech Communication Association, 2019, 839–843.
[12] J.-T. Chien and Y.-Y. Lyu, “Partially adversarial learning and adaptation”, in Proc. of European Signal Processing Conference, 2019, 15.
[13] J.-T. Chien, M.-Y. Chen, and J.-H. Xue, “Learning meta soft prompt for few-shot language models”, in Proc. of Asia Pacific Signal and Information Processing Association Annual Summit and Conference, 2023, 57–62.
[14] J.-T. Chien, H.-T. Wang, and C.-H. Lee, “Contrastive meta learning for soft prompts using dynamic mixup”, in Proc. of International Joint Conference on Neural Networks, 2024, 1–6.
[15] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, in Proc. of Conference of North American Chapter of Association for Computational Linguistics, 2019, 4171–4186.
[16] C. Du, H. Sun, J. Wang, Q. Qi, and J. Liao, “Adversarial and domain-aware BERT for cross-domain sentiment analysis”, in Proc. of Annual Meeting of Association for Computational Linguistics, 2020, 4019–4028.
[17] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks”, in Proc. of International Conference on Machine Learning, 2017, 1126–1135.
[18] T. Gao, A. Fisch, and D. Chen, “Making pre-trained language models better few-shot learners”, in Proc. of International Joint Conference on Natural Language Processing, 2021, 3816–3830.
[19] S. Gururangan, A. Marasovic, S. Swayamdipta, K. Lo, I. Beltagy, D. Downey, and N. A. Smith, “Don’t stop pretraining: adapt language models to domains and tasks”, in Proc. of Annual Meeting of Association for Computational Linguistics, 2020, 8342–8360.
[20] X. Han and J. Eisenstein, “Unsupervised domain adaptation of contextualized embeddings for sequence labeling”, in Proc. of Conference on Empirical Methods in Natural Language Processing, 2019, 4238–4248.
[21] Y. Hou, H. Dong, X. Wang, B. Li, and W. Che, “MetaPrompting: learning to learn better prompts”, in Proc. of International Conference on Computational Linguistics, 2022, 3251–3262.
[22] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly, “Parameter-efficient transfer learning for NLP”, in Proc. of International Conference on Machine Learning, 2019, 2790–2799.
[23] Y. Huang, K. Qian, and Z. Yu, “Learning a Better Initialization for Soft Prompts via Meta-Learning”, in Proc. of International Joint Conference on Natural Language Processing, 2023, 67–75.
[24] W. Jiang, Y. Zhang, and J. Kwok, “Effective structured prompting by meta-learning and representative verbalizer”, in Proc. of International Conference on Machine Learning, 2023, 15186–99.
[25] C. Karouzos, G. Paraskevopoulos, and A. Potamianos, “UDALM: unsupervised domain adaptation through language modeling”, in Proc. of Conference of North American Chapter of Association for Computational Linguistics: Human Language Technologies, 2021, 2579–2590.
[26] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, and J. Kang, “BioBERT: a pre-trained biomedical language representation model for biomedical text mining”, Bioinformatics, 36(4), 2020, 1234–1240.
[27] B. Lester, R. Al-Rfou, and N. Constant, “The power of scale for parameter-efficient prompt tuning”, in Proc. of Conference on Empirical Methods in Natural Language Processing, 2021, 3045–3059.
[28] L. Li, M.-W. Mak, and J.-T. Chien, “Contrastive adversarial domain adaptation networks for speaker recognition”, IEEE Transactions on Neural Networks and Learning Systems, 33(5), 2022, 2236–2245.
[29] W. Lin, M.-W. Mak, and J.-T. Chien, “Multisource i-vectors domain adaptation using maximum mean discrepancy based autoencoders”, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(12), 2018, 2412–2422.
[30] H. Lio, S.-E. Li, and J.-T. Chien, “Adversarial mask transformer for sequential learning”, in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing, 2022.
[31] P. Liu, X. Qiu, and X. Huang, “Adversarial multi-task learning for text classification”, in Proc. of Annual Meeting of Association for Computational Linguistics, 2017, 1–10.
[32] X. Liu, J. Gao, X. He, L. Deng, K. Duh, and Y. Wang, “Representation learning using multi-task deep neural networks for semantic classification and information retrieval”, in Proc. of Conference of North American Chapter of Association for Computational Linguistics, 2015, 912–921.
[33] X. Liu, Y. Zheng, Z. Du, M. Ding, Y. Qian, Z. Yang, and J. Tang, “GPT understands, too”, arXiv preprint arXiv:2103.10385, 2021.
[34] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization”, in Proc. of International Conference on Learning Representations, 2019.
[35] L. van der Maaten and G. Hinton, “Visualizing data using t-SNE”, Journal of Machine Learning Research, 9(86), 2008, 2579–2605.
[36] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Gray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe, “Training language models to follow instructions with human feedback”, in Advances in Neural Information Processing Systems, 2022.
[37] G. Qin and J. Eisner, “Learning how to ask: querying LMs with mixtures of soft prompts”, in Proc. of Conference of North American Chapter of Association for Computational Linguistics: Human Language Technologies, 2021, 5203–5212.
[38] T. Schick and H. Schütze, “Exploiting cloze-questions for few-shot text classification and natural language inference”, in Proc. of Conference of European Chapter of the Association for Computational Linguistics, 2021, 255–269.
[39] T. Schick and H. Schütze, “It’s not just size that matters: small language models are also few-shot learners”, in Proc. of Conference of North American Chapter of Association for Computational Linguistics: Human Language Technologies, 2021, 2339–2352.
[40] J.-C. Tsai and J.-T. Chien, “Adversarial domain separation and adaptation”, in Proc. of International Workshop on Machine Learning for Signal Processing, 2017, 1–6.
[41] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need”, in Advances in Neural Information Processing Systems, 2017, 5998–6008.
[42] D. Wright and I. Augenstein, “Transformer based multi-source domain adaptation”, in Proc. of Conference on Empirical Methods in Natural Language Processing, 2020, 7963–7974.
[43] C.-H. Yang, Y.-Y. Tsai, and P.-Y. Chen, “Voice2series: reprogramming acoustic models for time series classification”, in Proc. of International Conference on Machine Learning, 2021, 11808–19.
[44] L.-J. Yang, I.-P. Yeh, and J.-T. Chien, “Low-resource speech synthesis with speaker-aware embedding”, in Proc. of International Symposium on Chinese Spoken Language Processing, 2022, 235–239.
[45] H. Zhang, X. Zhang, H. Huang, and L. Yu, “Prompt-based meta-learning for few-shot text classification”, in Proc. of Conference on Empirical Methods in Natural Language Processing, 2022, 1342–1357.
[46] N. Zhang, L. Li, X. Chen, S. Deng, Z. Bi, C. Tan, F. Huang, and H. Chen, “Differentiable prompt makes pre-trained language models better few-shot learners”, in Proc. of International Conference on Learning Representations, 2022.
