DOI: https://doi.org/10.48448/e1kc-1605
Technical paper
MergeDistill: Merging Language Models using Pre-trained Distillation
