1 Introduction

Code-mixing involves borrowing words from one language and incorporating them into another without affecting the context [1, 2]. Code-switching, or language alternation, occurs when individuals alternate between two or more languages within a single conversation or situation [3]. In the context of code-mixed and code-switched (CMCS) text, we distinguish two subtypes: (1) text comprising words that alternate between two languages, and (2) text transitioning from one script to another by substituting letters in a predictable manner, known as Transliteration [4].

Code-mixing and code-switching are intricate phenomena of linguistic behaviour, characterized by the intentional or spontaneous alternation of languages within a single discourse. Another characteristic of CMCS data is lexical borrowing, where words or phrases from one language are used in another. Grammatical hybridity [5], a distinct feature of CMCS, results in blending grammatical structures from different languages. Furthermore, CMCS is influenced by linguistic, social, and cultural constraints, leading to a specific contextual framework.

CMCS is commonly observed in online conversations. A thorough understanding of CMCS data is pivotal for effective communication, advertising, sentiment analysis, and fostering inclusivity across language boundaries. However, the inherent characteristics of CMCS data introduce unique challenges to Natural Language Processing (NLP) systems. In particular, the inclusion of multiple scripts and lexical patterns, together with the potential misidentification of transliterated tokens, poses challenges even to modern NLP systems when processing such text. These challenges are particularly pronounced when working with low-resource languages [6, 7].

In recent years, the domain of NLP has witnessed remarkable advancements, notably propelled by the emergence of pre-trained language models (PLMs) [8, 9]. These PLMs are trained on extensive datasets while remaining agnostic to the specific tasks for which they will later be used. To leverage the extensive knowledge embedded in PLMs for diverse NLP tasks, the PLM has to be fine-tuned with task-specific data [10]. This “pre-train and fine-tune” paradigm has been able to activate and harness the comprehensive knowledge within PLMs, leading to very promising results across various downstream tasks such as text classification and named entity recognition [10, 11]. On the negative side, this paradigm suffers from the disparity between pre-training and fine-tuning objectives, which leads to inefficient use of PLMs across diverse tasks: fine-tuned models may be unstable in low-resource settings and less transferable to new tasks [10,11,12,13].

Prompt-based learning has recently been demonstrated to yield promising results compared to full fine-tuning of PLMs for many downstream tasks [13], even in low-resource scenarios [14]. This paradigm involves redefining downstream tasks using textual prompts, encompassing both prompt engineering and answer engineering [11]. In contrast to fine-tuning, prompt-based learning leverages the existing knowledge of PLMs by redefining downstream tasks as pre-training objectives [10, 11, 15]. This removes the need for extensive parameter updates in PLMs, thus preserving their transferability across various tasks. Prompt-based learning has been extended to incorporate pre-trained multilingual language models (PMLMs) as well, enabling experimentation in languages beyond English [16,17,18].

Existing research on CMCS text classification mainly focuses on the full fine-tuning of PMLMs for downstream tasks [6, 19]. On the other hand, while prompt-based learning has shown success over full fine-tuning for monolingual text, its application to CMCS data has not been explored. Given that prompt-based learning relies on textual prompts, designing effective prompts for CMCS text remains an open question. In other words, a prompt formulated in one language might not be suitable for effectively classifying CMCS data. The absence of multilingual prompts poses a challenge in inducing knowledge from PMLMs effectively, and the potential misidentification of transliterated tokens adds further complexity to accurate classification. These challenges are even more pronounced for low-resource languages. Therefore, addressing these unique challenges is crucial for advancing CMCS text classification through prompt-based learning.

In this study, we focus on prompt-based learning for CMCS text classification. To the best of our knowledge, we are the first to explore prompt-based learning for CMCS text classification. Therefore, we first delve into the challenges surrounding CMCS text classification and the intricacies introduced by the presence of multiple scripts within a single text. Our experiments unveil that the performance of prompt-based CMCS text classification is influenced by the inclusion of multiple scripts and the intensity of code-mixing.

In response to the aforementioned challenges, we propose a novel methodology named Dynamic+AdapterPrompt. This approach employs distinct models for each script to generate script-specific representations by considering the script of the input sentence (DynamicPrompt). Additionally, it effectively captures task-specific representations necessary for the respective CMCS classification tasks through the utilization of adapters (AdapterPrompt). This combined approach leverages the benefits of both adapters and dynamic script considerations.

We have conducted extensive experiments across Sinhala-English, Kannada-English, and Hindi-English datasets, for the tasks of sentiment classification, hate-speech detection, and humour detection. It is noteworthy that Sinhala and Kannada are categorized as low-resource languages [20]. The outcomes demonstrate that our novel approach, Dynamic+AdapterPrompt, outperforms the existing methodologies: full fine-tuning, adapter-based fine-tuning, and conventional prompt-based learning techniques.

To summarize, the key contributions of this paper are as follows:

  • We present an extensive study on prompt-based learning for CMCS text classification and the first comprehensive exploration of the impact of the script on CMCS text classification.

  • We introduce a novel prompt tuning approach for CMCS text classification termed Dynamic+AdapterPrompt that provides script-specific and task-specific representations, to address the intricacies introduced by the inclusion of multiple scripts in CMCS data.

2 Related work

In this section, we delve into three key areas: prompt-based learning, adapter-based fine-tuning of PLMs, and the challenges and advancements in CMCS text classification.

2.1 Prompt-based learning

Until recently, full fine-tuning, also known as vanilla fine-tuning, was the predominant method for adapting PLMs to downstream tasks [13, 21, 22]. In full fine-tuning, all the parameters of the PLM are trained for an underlying downstream task, which demands a significant amount of computational resources. Full fine-tuning also struggles to fully exploit the linguistic knowledge acquired during pre-training, due to the disparity between the objectives of the pre-training and fine-tuning stages [10, 23, 24]. While pre-training typically encompasses self-supervised tasks such as masked language modelling, full fine-tuning has to use task-specific training objectives (e.g. classification, sequence labelling, or generation). Prompt-based learning aims to bridge this gap between pre-training and fine-tuning objectives. In other words, prompt-based learning reformulates downstream tasks to be similar to the training objectives used during PLM pre-training [11]. For encoder-based models that use a masked language modelling objective, one such reformulation technique is to convert the downstream task into a cloze-style format, as illustrated in Figure 1.
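To make the cloze-style reformulation concrete, the following is a minimal sketch using the Hugging Face transformers library and xlm-roberta-base; the template, label words, and variable names are illustrative assumptions, not the prompts used in this work.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Minimal cloze-style sentiment classification sketch (illustrative template and verbalizer).
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

sentence = "I passed the exam"
template = f"{sentence} . It was {tokenizer.mask_token} ."      # cloze-style prompt
label_words = {"positive": "great", "negative": "terrible"}      # discrete verbalizer

inputs = tokenizer(template, return_tensors="pt")
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
with torch.no_grad():
    mask_logits = model(**inputs).logits[0, mask_pos]

# Score each label by the logit of the first sub-token of its label word.
scores = {label: mask_logits[tokenizer(word, add_special_tokens=False).input_ids[0]].item()
          for label, word in label_words.items()}
print(max(scores, key=scores.get))
```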

Fig. 1 Prompt-Based Learning

Prompt-based learning primarily comprises three key components: the prompt, the PLM, and the verbalizer [11]. As depicted in Figure 1, prompt engineering involves the selection of a prompt template for a downstream task. Early research used manually designed human-readable prompts, referred to as manual or discrete prompts [22, 23, 25]. Subsequent studies have shifted focus towards soft prompts, also known as continuous prompts, which are optimized during training for specific downstream tasks [22, 23, 25]. Answer engineering refers to the selection of the verbalizer. The verbalizer is the component that maps the PLM's prediction for the mask token to the intended label [26], as illustrated in Figure 1. Verbalizers that are human-readable are denoted as discrete verbalizers, whereas soft verbalizers undergo optimization during the training process. Several studies have explored designing suitable verbalizers for downstream tasks, utilizing both discrete and soft tokens [12, 16, 27]. The primary aim of these studies has been to broaden the coverage of the answer space of the verbalizer for each respective label. The effectiveness of this pipeline is significantly determined by prompt engineering and answer engineering [28].

2.2 Adapter-based fine-tuning of PLMs

Adapters are compact trainable modules that can be integrated into transformer layers. They provide a lightweight fine-tuning alternative to the full fine-tuning approach [29]. Houlsby [29] and Pfeiffer [30] are the two most commonly used adapter architectures. The key distinction between the two is that the Houlsby adapter employs two down- and up-projection modules per layer, whereas the Pfeiffer adapter utilizes only one. Adapters can generally be categorized into two types: task adapters, which learn task-specific representations, and language adapters, which learn language-specific representations [30]. Typically, language adapters are used in conjunction with task adapters [6, 31]. Extensive research has been conducted on adapters as a parameter-efficient fine-tuning method for various tasks. In Rathnayake et al. [6], Sinhala-English CMCS text classification was performed employing different combinations of adapters, yielding improved results compared to full fine-tuning with minimal parameter updates. Moreover, Rücklé et al. [32] demonstrated the benefits of adapters beyond lightweight fine-tuning: they observed a minimal impact on task performance when adapters were dropped from the lower layers of the PLM.

The application of adapters has proven to be beneficial for prompt-based learning as well. Karimi Mahabadi et al. [15] introduced a few-shot learning method utilizing a masked language modelling objective, and leveraged task-specific adapters as a prompt-free strategy. Their experimental results showcased the effectiveness of this technique in comparison to manual and soft prompts.

Smaller language models face difficulties with soft prompts, as discussed by Shah et al. [33]. Li and Liang [22] and Reynolds and McDonell [34] suggest that, as the model size increases, the performance gap between prompt-based approaches and fine-tuning narrows, indicating that smaller models benefit less from prompt-based approaches. To enhance the effectiveness of smaller language models, Shah et al. [33] suggest using adapters in combination with soft prompts. Their approach shows promise in optimizing smaller models, achieving up to 98% of the performance of full fine-tuning.

2.3 CMCS text classification

Classifying CMCS text poses a significant challenge in NLP, largely due to the scarcity of annotated datasets, particularly in the context of low-resource languages. Despite these challenges, efforts have been made to develop manually annotated CMCS text classification datasets for low-resource languages [2, 6, 19, 35, 36]. A range of Deep Learning (DL) approaches has been employed for classifying CMCS data. For instance, Chathuranga and Ranathunga [37] and Kamble and Joshi [38] utilized techniques such as capsule networks, LSTM, and BiLSTM for CMCS text classification.

Currently, state-of-the-art performance in CMCS text classification is achieved using PMLMs [4, 6, 19, 31, 39,40,41,42,43]. However, Zhang et al. [44] showed that PMLMs are not perfectly compatible with code-switching. In zero-shot settings (i.e. when no training examples are provided), PLMs are less effective on CMCS-related tasks than models specifically trained for the task, and they exhibit limited learning capabilities in few-shot settings. Table 1 provides a summary of different PMLM approaches for CMCS text classification.

Table 1 Related Work in CMCS Text Classification

3 Datasets

We select three publicly available CMCS datasets. These cover low-resource languages (Sinhala, Kannada), as well as Hindi, a high-resource language [20]. They exhibit different levels of code-mixing and have been annotated for various classification tasks.

Table 2 Variations of CMCS Data Across Datasets

The first dataset [6] includes CMCS sentences in Sinhala and English languages. This dataset has been annotated for sentiment classification, humour detection, and hate-speech detection tasks. The second dataset [46] consists of Kannada and English CMCS content and has annotations for sentiment analysis and hate-speech detection. The Hindi-English datasetFootnote 1 contains CMCS content written in the Latin script, which has been annotated for the humour detection task. Each language possesses its unique script for writing (Latin for English, Sinhala for Sinhala, Kannada for Kannada, and Devanagari for Hindi).

Altogether, these corpora exhibit six distinct CMCS variations with respect to the script used in training instances, as shown in Table 2. In the first two variants, the text is exclusively composed in one language, employing characters from the same language. Conversely, in the next two variants, the text is written in one language, utilizing characters from a different language. The fifth variant comprises sentences that alternate between languages, with each sentence written in the script corresponding to the language, while the last variant involves sentences that blend elements from two or more of the aforementioned types.

To better analyze the extent of language mixing, we systematically classify sentences in each corpus based on the percentage of characters from each scriptFootnote 2 as outlined in Algorithm 1. Opting to examine sentences at the character level, as opposed to the word level, enables us to capture finer details of language mixing. In CMCS sentences, particularly in informal communication, individual words can seamlessly blend characters from multiple scripts.

Algorithm 1 Instance Classification Based on Script. The term [other] represents the script of the language combined with English in the CMCS context.

A threshold of 100% is considered, implying that if all characters in a sentence belong to one script, it is categorized under that script; otherwise, it is labelled as a mixed-script sentence. The following examples illustrate this algorithm:

  • Latin Script Example:

    Sentence: “Mama wibhagaya samath una” (Sinhala written in Latin script)
    Script Label: Latin (100% of characters are in Latin script)

  • [Other] Script Example:

    Sentence: “ ” (Sinhala written in Sinhala script)
    Script Label: Sinhala (100% of characters are in Sinhala script)

  • Mixed-Script Example:

    Sentence: “I passed the විභාගය”
    Script Label: Mixed (a combination of Latin and Sinhala script characters)
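The script-identification step described by Algorithm 1 and illustrated in the examples above can be approximated as follows. This is a minimal sketch assuming a Sinhala-English pair; the Unicode-name test, function name, and threshold handling are our own illustrative choices.

```python
import unicodedata

# Character-level script tests (assumed for the Sinhala-English case).
SCRIPT_TESTS = {
    "Latin": lambda ch: "LATIN" in unicodedata.name(ch, ""),
    "Sinhala": lambda ch: "SINHALA" in unicodedata.name(ch, ""),
}

def identify_script(sentence: str, threshold: float = 1.0) -> str:
    letters = [ch for ch in sentence if ch.isalpha()]     # ignore digits, spaces, punctuation
    if not letters:
        return "Mixed"
    for name, test in SCRIPT_TESTS.items():
        share = sum(test(ch) for ch in letters) / len(letters)
        if share >= threshold:                            # 100% of characters in one script
            return name
    return "Mixed"

print(identify_script("Mama wibhagaya samath una"))       # -> Latin
```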

As shown in Table 2, note that the Hindi-English dataset does not have content in Devanagari script; instead, Hindi words have been written in Latin script. Comprehensive statistics for all three datasets are provided in Appendix A.

4 Baselines

For our experiments, we use a random baseline (assigning class labels to instances at random) and majority/minority class baselines (assigning every instance the majority or minority class label), along with three additional baselines associated with PMLMs. As mentioned earlier, only full fine-tuning and adapter-based fine-tuning of PMLMs have been employed for CMCS text classification [6]. Therefore, we use these two techniques as our baselines. Basic prompting entails training artificial tokens while keeping the PMLM frozen. Since prompt-based learning has not been attempted previously for CMCS text classification, we utilize Soft Prompt + Soft Verbalizer as our baseline for prompting.

4.1 Full fine-tuning (Full FT)

We train the PLM by updating all parameters, including the task-dependent sequence classification head added on top, as proposed by Devlin et al. [9]. Throughout this process, the PLM weights are adjusted using task-specific data, which facilitates the learning of task-specific representations. We fine-tune the PLM separately for each downstream task (single-task fine-tuning).
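As a point of reference, a minimal sketch of this setting with the Hugging Face transformers library is shown below; the model name and label count are illustrative.

```python
from transformers import AutoModelForSequenceClassification

# Full fine-tuning sketch: the PLM plus a sequence classification head, with every
# parameter trainable (model name and label count are illustrative).
model = AutoModelForSequenceClassification.from_pretrained("xlm-roberta-base", num_labels=3)
assert all(p.requires_grad for p in model.parameters())   # all weights are updated during training
```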

4.2 Adapter-based fine-tuning (A-B FT)

We integrate randomly initialized adapters into the PLM. During the fine-tuning phase, we specifically train the introduced adapter parameters while keeping the original PLM parameters frozen. For each downstream classification task, we train distinct sets of adapters. We experiment with both Houlsby [29] and Pfeiffer [30] adapter architectures.
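A minimal sketch of this setting, assuming the Adapter-Transformers `AutoAdapterModel` API (the task name and label count are illustrative):

```python
from transformers import AutoAdapterModel   # Adapter-Transformers API (assumed)

# Adapter-based fine-tuning sketch: a randomly initialized adapter and classification
# head are added per task, and only these are trained while the PLM stays frozen.
model = AutoAdapterModel.from_pretrained("xlm-roberta-base")
model.add_adapter("sentiment", config="houlsby")           # or config="pfeiffer"
model.add_classification_head("sentiment", num_labels=3)
model.train_adapter("sentiment")                           # freezes the PLM weights
```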

4.3 Prompt-based learning with soft prompt + soft verbalizer (SP+SV)

We employ a soft prompt (SP), which comprises artificial token embeddings, and a soft verbalizer (SV), which consists of artificial tokens as label words, together with the PLM, as proposed by Hambardzumyan et al. [27]. SP and SV replace traditional discrete tokens with artificial ones. During the training phase, we fine-tune the SP and SV while keeping the PLM parameters frozen.
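For illustration, the following is a minimal PyTorch sketch of the SP+SV idea: trainable prompt embeddings are prepended to the input, and a trainable verbalizer maps the mask-position logits to label scores while the PLM stays frozen. The prompt length, template, and names are assumptions; the actual experiments use the OpenPrompt framework.

```python
import torch
import torch.nn as nn
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
plm = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")
for p in plm.parameters():                      # the PLM stays frozen
    p.requires_grad = False

n_prompt, hidden, n_labels = 10, plm.config.hidden_size, 2
soft_prompt = nn.Parameter(torch.randn(n_prompt, hidden) * 0.02)          # soft prompt
soft_verbalizer = nn.Linear(plm.config.vocab_size, n_labels, bias=False)  # soft verbalizer

def sp_sv_forward(sentence: str) -> torch.Tensor:
    enc = tokenizer(f"{sentence} {tokenizer.mask_token}", return_tensors="pt")
    tok_embeds = plm.get_input_embeddings()(enc.input_ids)                 # (1, L, H)
    inputs_embeds = torch.cat([soft_prompt.unsqueeze(0), tok_embeds], dim=1)
    attn = torch.cat([torch.ones(1, n_prompt, dtype=enc.attention_mask.dtype),
                      enc.attention_mask], dim=1)
    logits = plm(inputs_embeds=inputs_embeds, attention_mask=attn).logits
    mask_pos = n_prompt + (enc.input_ids[0] == tokenizer.mask_token_id).nonzero()[0, 0]
    return soft_verbalizer(logits[0, mask_pos])                            # label scores

print(sp_sv_forward("I passed the exam").shape)   # torch.Size([2])
```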

5 Experimental setup

PMLMs excel in CMCS text classification by leveraging their contextual understanding and transfer learning capabilities. Their multilingual proficiency, derived from diverse training datasets, enables effective handling of language variations within the same text. In this study, we utilize the XLM-RoBERTa-base (XLM-R) [8] model as the PMLM for our experiments. This choice is motivated by the fact that XLM-R has been pre-trained on a wide range of languages, including the languages considered in our study. Moreover, it proves to be well-suited for our work, particularly within the constraints of a resource-efficient computing infrastructure. For full fine-tuning and adapter-based fine-tuning, we employ the code released by Rathnayake et al. [6]. We implement all the prompt-based learning models using the OpenPromptFootnote 3 [26] framework, which supports Hugging Face TransformersFootnote 4 and is built upon the PyTorch frameworkFootnote 5. For adapter-based implementations within OpenPrompt, we utilize the Adapter-TransformersFootnote 6 library, which is built on Hugging Face Transformers.

The datasets specified in Section 3 are partitioned into training, validation, and testing subsets in a stratified manner, with respective proportions of 80%, 10%, and 10% (statistics are provided in Appendix A). As suggested by Rathnayake et al. [6], we employ Random Oversampling (ROS) to address the class imbalance issue of the hate-speech detection task within the Sinhala-English CMCS dataset. Given the pronounced class imbalance in these datasets, we adopt the Macro F1-Score as our primary evaluation metric, as it facilitates a more consistent and reliable comparison.

All models are tested across three different seeds (8, 42, 77), and the average results are reported. The maximum sequence length for the input sentence is set at 128. We conduct each experiment for 20 epochs with a batch size of 32. Early stopping is employed in the experiments with a patience of 5 epochs. An evaluation is conducted at the end of each epoch, and the best-performing model is chosen for testing. We use the Adam optimizer as the gradient optimizer, paired with a linear learning rate scheduler. Additionally, a grid search is conducted for hyperparameter tuning to boost the performance of each model. The optimized hyperparameters for each experiment are delineated in Appendix B. All the experiments are conducted using NVIDIA Tesla P100 GPU machines on the KaggleFootnote 7 platform.
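For reference, the training configuration described above can be collected as follows (the dictionary and its key names are illustrative; the values are those stated in this section):

```python
# Training configuration used across experiments (collected from this section).
TRAIN_CONFIG = {
    "seeds": [8, 42, 77],            # results averaged over three seeds
    "max_seq_length": 128,
    "epochs": 20,
    "batch_size": 32,
    "early_stopping_patience": 5,    # evaluation at the end of every epoch
    "optimizer": "Adam",
    "lr_scheduler": "linear",
    "metric": "macro_f1",
}
```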

6 Impact of script variation and code-mixing intensity on CMCS text classification

The variations in CMCS data, as outlined in Table 2, underscore the unique properties and characteristics inherent to CMCS data. To better understand the complexities in handling CMCS text, consider an illustrative example: “I passed the විභාගය”. This sentence illustrates a classic instance of Sinhala-English code-mixing. The word “විභාගය” is a Sinhala term for “examination”. When entirely transliterated into the Latin script, it might read: “I passed the wibhagaya”.

For a model primarily trained on English data, the term might be unfamiliar. Conversely, for a model with extensive Sinhala training, the transliterated version “wibhagaya” might pose confusion.

To explore the influence of scripts on CMCS text classification, we conduct training on baseline models using the training set outlined in Section 5, which encompasses training samples from all scripts. Table 3 illustrates the script-wise results of this experiment. In the Sinhala-English context, despite the Latin script demonstrating the best performance in full fine-tuning, the Sinhala script outperforms the other two scripts in adapter-based fine-tuning and SP+SV. Conversely, in the Kannada-English context, the Latin script yields the highest performance in full fine-tuning and SP+SV, while the Mixed script excels in adapter-based fine-tuning. The sentences in the Kannada script exhibit the lowest results. It is evident that in both CMCS contexts, significant performance variations exist based on the script of the training instance.

Table 3 Results by Script obtained through training using the entire training dataset: Sentiment Classification

To delve further into the impact of script on CMCS text classification, we create distinct training sets by considering the language script of the training instances. These training sets are employed to train the models, each focusing on a single script, to investigate the impact of the training script. For both the Sinhala-English and Kannada-English datasets, we first create separate training sets based on script type, resulting in distinct portions for each script (e.g., for the Sinhala-English corpus, we have portions for Latin script only, Sinhala script only, and mixed script). The first experiment involves selecting a subset of training data from each script-based portion, ensuring that the size of each subset equals 10% of the total training data.

We then expand our analysis to include a larger subset, comprising 20% of the overall training data for each script category. In this phase, 20% of the training data is selected from the Latin script instances and another 20% from a combined subset of Sinhala/Kannada and mixed script instances, because the sentences in Sinhala/Kannada and mixed scripts individually constitute less than 20% of the overall training data. Each subset is stratified based on the label distribution of the task, ensuring a balanced representation of task labels within each script category. Subsequently, we train the PMLM utilizing the aforementioned training subsets.
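A minimal sketch of how such a script-specific, label-stratified subset can be drawn, assuming scikit-learn and a dataframe with `script` and `label` columns (both names are illustrative):

```python
from sklearn.model_selection import train_test_split

# Draw a subset from one script-based portion whose size equals a fraction of the
# *full* training set, stratified by the task label.
def script_subset(df, script, frac=0.10, seed=42):
    portion = df[df["script"] == script]
    n = int(frac * len(df))                      # size relative to the full training set
    if n >= len(portion):                        # portion smaller than the requested size
        return portion
    subset, _ = train_test_split(portion, train_size=n,
                                 stratify=portion["label"], random_state=seed)
    return subset
```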

The test dataset utilized in all experiments is as described in Section 5. Note that the Hindi-English dataset, consisting solely of Latin script instances, is excluded from this particular experiment. Also, note that this analysis is focused exclusively on the sentiment classification task.

Table 4 Script-Based Analysis: Sentiment Classification

Table 4 depicts that in the Sinhala-English context, training on Latin-script sentences leads to optimal results across all three baselines. With the 10% training dataset proportion, full fine-tuning shows similar outcomes when trained on either Sinhala-script or mixed-script sentences. With adapter-based fine-tuning, training on mixed-script sentences yields better performance compared to training on Sinhala-script sentences, whereas SP+SV shows better results with Sinhala-script compared to the mixed script. At a 20% training dataset proportion, training with a combination of Sinhala and mixed-script sentences results in lower performance compared to training with Latin-script sentences.

In the Kannada-English context, with the 10% training dataset proportion, training with mixed-script sentences excels over Latin and Kannada scripts in full fine-tuning and adapter-based fine-tuning, while SP+SV is most effective with Latin-script sentences. Training on Kannada-script sentences results in the lowest performance across all baselines. At a 20% training dataset proportion, the patterns between Latin-script and Kannada+mixed-script training resemble those at the 10% level.

The prompt-based learning baseline, SP+SV, reaches its highest performance with Latin-script training in both Sinhala-English and Kannada-English contexts. It can be observed that the performance of full fine-tuning, adapter-based fine-tuning, and SP+SV for both datasets exhibit significant fluctuations based on the training script.

Revisiting our example, when trained on Latin-script sentences, the model might proficiently classify the sentence “I passed the wibhagaya” due to the dominance of Latin content. However, the original sentence, “I passed the විභාගය”, which blends Sinhala and Latin scripts, might be more challenging, primarily contingent on the model's familiarity with Sinhala characters.

The findings from the aforementioned experiments make it evident that performance disparities are contingent on the script of the input sentence. We therefore conclude that CMCS text classification performance is significantly influenced by the inclusion of multiple scripts and the degree of code-mixing intensity.

7 Optimizing prompt-based learning through script-based adaptations

To address the limitations observed when employing soft prompts with small PLMs [33], as mentioned in Section 2.2, we conduct experiments with adapters in the context of prompt-based learning, referred to as AdapterPrompt. As elaborated in Section 6, the effectiveness of prompt-based learning for CMCS text classification depends on the script of the input text. To address this dependency, we posit that dynamically determining the prompt and verbalizer based on the input script, instead of using the same soft prompt and soft verbalizer for inputs of all scripts, could enhance the model's capability. To achieve this, we propose DynamicPrompt. Finally, we combine DynamicPrompt with adapters, forming a fusion of DynamicPrompt and AdapterPrompt that leverages the strengths of both, and present Dynamic+AdapterPrompt.

7.1 AdapterPrompt

As mentioned in Section 2.2, when utilizing soft prompts, the effectiveness of small pre-trained language models such as XLM-R diminishes, thereby reducing the efficacy of prompt-based learning [33]. To mitigate this, we utilize AdapterPrompt, while preserving the static state of the PLM parameters.

Fig. 2 AdapterPrompt without Adapter Dropping

Fig. 3 AdapterPrompt with Adapter Dropping

In AdapterPrompt, we integrate adapters with the SP+SV model to classify CMCS data, as depicted in Figure 2. Instead of solely querying the PLM using soft prompts as in the SP+SV approach, we incorporate task adapters into the PLM. This enhancement augments the task-specific representation for the underlying task.

We experiment with the two commonly used adapter architectures, Houlsby [29] and Pfeiffer [30], integrating both into SP+SV models. Additionally, employing the adapter-dropping technique [32], we progressively remove adapters, starting from the higher layers of the PLM, as illustrated in Figure 3. This iterative process aims to identify the optimal set of adapters necessary to effectively acquire the task-specific representations for the respective task associated with the adapters.
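A minimal sketch of adapter dropping, assuming the Adapter-Transformers config API; using `leave_out` to skip the top layer is our assumed mechanism for keeping adapters only in layers 0-10 of the 12-layer XLM-R base model:

```python
from transformers import AutoAdapterModel
from transformers.adapters import HoulsbyConfig   # Adapter-Transformers API (assumed)

# Adapter-dropping sketch: adapters are omitted from the listed layers via `leave_out`.
plm = AutoAdapterModel.from_pretrained("xlm-roberta-base")
plm.add_adapter("prompt_task", config=HoulsbyConfig(leave_out=[11]))  # drop the top-layer adapter
plm.train_adapter("prompt_task")                                      # PLM weights remain frozen
```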

7.2 DynamicPrompt

Our DynamicPrompt approach consists of separate SP+SV modelsFootnote 8, each optimized for a specific script category. Each model is trained exclusively with sentences from its respective script, yielding script-specific representations. While the PLM remains frozen, all SP+SV models share this common PLM.

Fig. 4 DynamicPrompt Architecture: A Sinhala-English Example. The input sentence translates into English as “I like to watch cricket matches”

The script of the input sentence is programmatically determined by the script identifier, as shown in Figure 4, based on the percentage of characters from each script, as elaborated in Section 3. A threshold of 100% is applied, meaning that if all characters in a sentence belong to one script, it is categorized under that script; otherwise, it is labelled as a mixed-script sentence.

Based on the identified script, the corresponding SP+SV model is selected dynamically. We then concatenate the soft prompt with the input sentence and feed it into the PLM, which predicts the masked token based on the surrounding context. Subsequently, the soft verbalizer maps the predicted answer tokens to the corresponding label using soft answer tokens.
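The routing logic of DynamicPrompt can be sketched as follows; `identify_script()` refers to the sketch in Section 3, and the class and attribute names are illustrative.

```python
# Routing sketch for DynamicPrompt: a separate SP+SV model per script category,
# selected at inference time by the script identifier.
class DynamicPrompt:
    def __init__(self, sp_sv_models):
        # e.g. {"Latin": sp_sv_latin, "Sinhala": sp_sv_sinhala, "Mixed": sp_sv_mixed},
        # where every SP+SV model wraps the same frozen PLM.
        self.sp_sv_models = sp_sv_models

    def classify(self, sentence: str):
        script = identify_script(sentence)     # programmatic script identification
        model = self.sp_sv_models[script]      # script-specific soft prompt + verbalizer
        return model(sentence)                 # label scores from the selected model
```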

7.3 Dynamic+AdapterPrompt

By combining the aforementioned approaches, we propose a novel prompt-based learning methodology termed Dynamic+AdapterPrompt. With the introduction of Dynamic+AdapterPrompt, we train separate SP+SV models, each augmented with adapters, for each script category. Each model is exclusively trained on sentences corresponding to its designated script, thereby providing a script-specific representation for enhanced classification.

The frozen PLM serves as the backbone shared across all the SP+SV models, with adapters being integrated into the PLM to encapsulate task-specific functionality. This strategy effectively capitalizes on the inherent strengths of both DynamicPrompt and AdapterPrompt.

Two architectural variants of Dynamic+AdapterPrompt can be implemented, employing distinct methods for integrating adapters into the PLM. For both variants, separate SP+SV models are employed for each script category. These variants are described in the following two sub-sections.

7.3.1 Dynamic+AdapterPrompt with shared adapters setting

As illustrated in Figure 5, we employ adapters that are shared across all SP+SV models. This means that each SP+SV model in Dynamic+AdapterPrompt shares the same set of adapters with the PLM. The goal of this approach is to allow the models to leverage common task-specific functionality through shared adapters while simultaneously benefiting from the script-specific representation provided by the separate SP+SV models.

Fig. 5 Dynamic+AdapterPrompt Architecture with Shared Adapters Setting: A Sinhala-English Example. The input sentence translates into English as “I like to watch cricket matches”

7.3.2 Dynamic+AdapterPrompt with distinct adapters setting

As depicted in Figure 6, we integrate distinct adapters for each SP+SV model. This approach involves using a common PLM across all SP+SV models, but each SP+SV model employs a separate set of adapters that are not shared among them. When the script of the input is identified, the set of adapters relevant to that script is activated along with the corresponding SP+SV model. The objective of this approach is to facilitate fine-tuning and adaptation specific to the characteristics of each script category.

Fig. 6 Dynamic+AdapterPrompt Architecture with Distinct Adapters Setting: A Sinhala-English Example. The input sentence translates into English as “I like to watch cricket matches”

In our experiments, we explore the first architectural variant, employing separate SP+SV models for each script category, while applying shared adapters across all models. Subsequently, in an ablation study detailed in Section 8.3, we investigate the second variant to compare the effectiveness of both architectures and to determine the impact and viability of such architectural variants. Within these variations, we experiment with the Houlsby adapter architecture.
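The difference between the two settings can be sketched as follows, again assuming the Adapter-Transformers API; the adapter names, the script list, and `identify_script()` (from the Section 3 sketch) are illustrative assumptions.

```python
from transformers import AutoAdapterModel   # Adapter-Transformers API (assumed)

plm = AutoAdapterModel.from_pretrained("xlm-roberta-base")

# Shared setting: one task adapter, used by every script-specific SP+SV model.
plm.add_adapter("task_shared", config="houlsby")

# Distinct setting: one adapter per script category, activated to match the input script.
for script in ["Latin", "Sinhala", "Mixed"]:
    plm.add_adapter(f"task_{script}", config="houlsby")

def activate_for(sentence: str) -> None:
    plm.set_active_adapters(f"task_{identify_script(sentence)}")
```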

8 Evaluation and analysis

In this section, we evaluate the effectiveness of the proposed Dynamic+AdapterPrompt approach compared to the baselines. Subsequently, we conduct two ablation studies, with a particular focus on the sentiment classification task in the Sinhala-English context: (1) a script-based analysis to examine the impact of different scripts, and (2) a comparative study to explore the effectiveness of the two adapter integration architectures within Dynamic+AdapterPrompt - shared adapters and script-wise adapters. Following these analyses, an error analysis is conducted to identify common issues related to misclassified sentences within the sentiment classification task for the Sinhala-English context using the Dynamic+AdapterPrompt approach. The training and test datasets utilized in all experiments are as outlined in Section 5. Throughout this section, Ac. denotes Accuracy, while Precision (Pr.), Recall (Re.), and F1 correspond to macro averages.

8.1 Overall evaluation

We first conduct a detailed study to determine the optimal adapter architecture for prompt-based learning in CMCS text classification. In the results for the Sinhala-English sentiment analysis task, as depicted in Appendix C, the Houlsby architecture, with adapters activated in layers 0-10, yields the highest performance in AdapterPrompt. Therefore, the results reported in Tables 5, 6, and 7 are obtained with the Houlsby adapter architecture.

Table 5 Overall Results: Sentiment Classification
Table 6 Overall Results: Hate-Speech Detection
Table 7 Overall Results: Humour Detection

In the baseline evaluation presented in Tables 5, 6, and 7, for the majority class, minority class, and random baselines, it is observed that the random baseline outperforms the majority and minority class baselines. However, all three of these baselines exhibit significantly lower performance in comparison to the baselines associated with PLMs.

Notably, the SP+SV approach outperforms the XLM-R full fine-tuning in all cases, except in the Hindi-English context. It achieves superior or competitive results compared to adapter-based fine-tuning, except for hate-speech detection in the Sinhala-English context. The enhanced performance of SP+SV over full fine-tuning can be traced back to the substantial discrepancy in objectives during the pre-training and fine-tuning phases within the fine-tuning paradigm, which hinders the full exploitation of knowledge within PLMs, as we discussed in Section 2.1.

AdapterPrompt consistently demonstrates superiority over the baseline results in sentiment classification, hate-speech detection, and humour detection as shown in Tables 5, 6, and 7 across all language contexts, except hate-speech detection in the Sinhala-English context. This observation is aligned with previous research that highlights the effectiveness of integrating adapters into the PLM within the fine-tuning paradigm for improving CMCS text classification [6]. Importantly, our findings reiterate this trend, emphasizing that even within the prompt-based learning paradigm, the integration of adapters into the PLM results in performance improvements. This strategic use of adapters significantly enhanced the SP+SV approach, showcasing a substantial improvement in the model’s understanding of specific task intricacies by providing task-specific representations.

DynamicPrompt exhibits lower performance compared to the baselines in all language contexts, except for sentiment classification in the Sinhala-English context. Notably, DynamicPrompt yields inferior results compared to AdapterPrompt across all tasks and language contexts. In Section 8.2, we delve into an in-depth analysis of the models’ performance across different script categories.

Despite the lower performance observed with DynamicPrompt, the Dynamic+AdapterPrompt approach outperforms the results of DynamicPrompt and AdapterPrompt in the majority of tasks across all language contexts (except for sentiment classification). This improvement can be attributed to the adapters’ ability to learn task-specific representations, while the soft prompt and soft verbalizer within each model acquire script-specific knowledge for classification. This underscores the proficiency of the combined approach in adeptly addressing challenges intrinsic to both script and task across various language combinations. Note that in the context of humour detection in Hindi-English, the DynamicPrompt and Dynamic+AdapterPrompt techniques are not explored, primarily because the dataset is entirely in the Latin script.

In summary, the SP+SV approach demonstrates superior or competitive results compared to XLM-R full fine-tuning and adapter-based fine-tuning for most tasks, with only a few exceptions. AdapterPrompt consistently outperforms baseline results, showcasing the effectiveness of integrating adapters into the PLM within the prompt-based learning paradigm. DynamicPrompt alone exhibits lower overall performance. However, the combination of DynamicPrompt and AdapterPrompt, Dynamic+AdapterPrompt, emerges as the most effective strategy. This underscores the benefits of leveraging both script-based prompts and adapters to address the intricacies of script and task variations in CMCS text classification.

8.2 Script-based analysis

Section 6 unveiled a significant variance in performance, particularly in the context of prompt-based learning, depending on the script of the input sentence. In this ablation study, we further analyze the impact of the script, employing the sentiment classification task in the Sinhala-English context as a case study.

Table 8 Results by Script: Sentiment Classification for Sinhala-English

Table 8 presents the script-wise results of the sentiment classification task for Sinhala-English, further highlighting the discrepancy introduced by the inclusion of multiple scripts. Despite the integration of adapters into SP+SV (AdapterPrompt), this variance persists. This can be ascribed to the adapters providing only task-specific representations, which may not fully rectify the script-related disparities. However, it is noteworthy that AdapterPrompt has demonstrated notable effectiveness when the input is in a single script.

Although DynamicPrompt has a relatively lower performance, it has reduced the script’s influence on CMCS text classification, as demonstrated in Table 8. This method is effective because DynamicPrompt provides script-specific representations, making it more robust against script variations compared to fine-tuning, SP+SV, and AdapterPrompt. This underscores its contribution to addressing challenges related to including multiple scripts in CMCS text classification.

When considering Dynamic+AdapterPrompt, the outcomes indicate that the integration of adapters led to a variance in performance, similar to the observations in AdapterPrompt. However, it narrows the gap between Latin and Mixed scripts compared to SP+SV and AdapterPrompt, due to the script-specific representations provided by DynamicPrompt.

In conclusion, DynamicPrompt helps reduce script variations, and AdapterPrompt enhances performance with task-specific representations. The combination, Dynamic+AdapterPrompt, achieves even better results by leveraging the strengths of both approaches.

8.3 Evaluating the efficacy of the architecture in Dynamic+AdapterPrompt

In Section 7.3, we employ the shared adapter architecture; however, it is crucial to note that the script-wise adapter architecture remains a viable alternative. To explore the comparative effectiveness of these two adapter integration architectures, we conduct an ablation study, and the results are presented in Table 9. The superiority observed in the shared adapter architecture can be attributed to its capacity to develop a uniform task representation across all scripts. By sharing adapters across all SP+SV models, the shared adapters in the Dynamic+AdapterPrompt model can benefit from the knowledge obtained from the entire training dataset. Conversely, script-wise adapters are confined to learning solely from the samples of a specific script with which they are associated, resulting in a more circumscribed and script-dependent understanding. Consequently, in the Dynamic+AdapterPrompt, the shared adapter architecture enables the model to leverage a more extensive spectrum of training data, leading to a performance that is markedly superior to that of the script-wise adapter architecture.

Table 9 Dynamic+AdapterPrompt Results by Architecture: Sentiment Classification for Sinhala-English

8.4 Error analysis

To identify issues related to misclassified sentences, we conduct an error analysis on the Sinhala-English dataset in the sentiment classification task. For this analysis, we employ the Dynamic+AdapterPrompt approach for handling the intricacies introduced by the inclusion of multiple scripts in CMCS data. Based on our analysis, we have identified the following issues with misclassified sentences:

  1. Context-specific sentiment in an input sentence: An input sentence may convey sentiment in a specific context that is not apparent to the model. For example, the sentence “Matews kiyanne poll buruwek” was labelled as Negative but predicted as Neutral. This is a Sinhala sentence written in Latin script, where “poll buruwek” means “idiot”. The model may not understand this context, especially since the word “poll” has an entirely different meaning in English.

  2. Words highlighting polarity: Words that emphasize the positive or negative polarity of a sentence can impact the prediction. For example, the sentence “Nidokin ara dilshan ain wena ekamy hoda...” was labelled as Negative but predicted as Positive. The word “hoda”, a Sinhala term written in Latin script meaning “good”, highlights the positive sentiment of the sentence.

9 Conclusion and future work

In this paper, we explore the potential of prompt-based learning for CMCS text classification, providing a thorough investigation into the impact of the script on classifying CMCS text. Our comprehensive experiments reveal that the effectiveness of prompt-based CMCS text classification is significantly affected by the inclusion of multiple scripts and the intensity of code-mixing. In light of these findings, we introduce a novel prompt-based tuning method named Dynamic+AdapterPrompt. We employ separate models for each script category, integrated with adapters to encapsulate the script-specific representation and the task-oriented functionality of CMCS text. The experimental results prove that our proposed method outperforms strong baselines across various CMCS contexts and text classification tasks. This underscores its robustness and efficiency in classifying CMCS text, particularly involving low-resource languages.

Our proposed approach, Dynamic+AdapterPrompt, is suitable for CMCS text where the scripts are distinguishable from each other (such as Sinhala and English). For CMCS contexts where the scripts are the same or similar (such as German and English), replacing or supplementing script identification with language identification would be effective; we leave that for future work. Additionally, our approach requires balanced data for each script in the dataset to yield optimal results. However, balanced datasets, especially in low-resource settings, are typically challenging to obtain, potentially limiting the advancement of CMCS classification using this approach.

As part of our future work, we intend to delve into the application of multi-task learning for code-mixed text classification, leveraging prompt-based learning techniques. Future research could also validate the generalizability of Dynamic+AdapterPrompt by incorporating a wider range of language scripts and classification tasks. Additionally, exploring adapter fine-tuning variants, such as LoRA (Low-Rank Adaptation), in conjunction with the proposed approach could provide valuable insights. We have released our code to facilitate future researchFootnote 9.