降低语义差异(Bridge the gap between Pre-training and Fine-tuning) :预训练任务主要以Masked Language Modeling(MLM)为主,而下游任务则重新引入新的训练参数,因此两个阶段的目标通常有较大差异。因此需要解决如何缩小Pre-training和Fine-tuning两个阶段目标差距过大的问题;
避免过拟合(Overfitting of the head) :由于在Fine-tuning阶段需要新引入额外的参数以适配相应的任务需要,因此在样本数量有限的情况容易发生过拟合,降低了模型的泛化能力。因此需要面对预训练语言模型的过拟合问题。
本文的目标是介绍Prompt-Tuning的方法,而Prompt-Tuning的动机则是进一步拉近微调与预训练阶段的任务目标,因此本部分则以常用的BERT为主,简单介绍Pre-training的经典方法,更加详细的解读,可参考:【预训练语言模型】BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding(BERT)。
因此以BERT为例,首先喂入一个文本It is very cold today, we need to wear more clothes. ,然后随机mask掉一个token,并结合一些特殊标记得到:[cls] It is very cold today, we need to [mask] more clothes. [sep] ,喂入到多层的Transformer结构中,则可以得到最后一层每个token的隐状态向量。MLM则通过在[mask]头部添加一个MLP映射到词表上,得到所有词预测的概率分布。
现如今有诸多针对MLM的改进版本,我们挑选两个经典的改进进行介绍:
Whole Word Masking(WWM) :来源于RoBERTa等,其认为BERT经过分词后得到的是word piece,而BERT的MLM则是基于word piece进行随机替换操作的,即Single-token Masking,因此被mask的token语义并不完整。而WWM则表示被mask的必须是一个完整的单词。
在BERT中,NSP任务则视为sentence-pair任务,例如输入两个句子S1:It is very cold today. 和 S2:We need to wear more clothes.,通过拼接特殊字符后,得到:[cls] It is very cold today. [sep] We need to wear more clothes. [sep],然后喂入到多层Transformer中,可以得到[cls]token的隐状态向量,同样通过MLP映射到一个3分类上获得各个类的概率分布:
Span Text Prediction(区间预测) :常见的任务类型有抽取式阅读理解、实体抽取、抽取式摘要等。给定一个passage和query,根据query寻找passage中可靠的字序列作为预测答案。通常该类任务需要模型预测区间的起始位置,因此在Transformer头部添加两个分类器以预测两个位置。
我们依然以二分类的情感分析作为例子,描述Prompt-tuning的工作原理。给定一个句子[CLS] I like the Disney films very much. [SEP] 传统的Fine-tuning方法是将其通过BERT的Transformer获得 [CLS]表征之后再喂入新增加的MLP分类器进行二分类,预测该句子是积极的(positive)还是消极的(negative),因此需要一定量的训练数据来训练。
而Prompt-Tuning则执行如下步骤:
构建模板(Template Construction) :通过人工定义、自动搜索、文本生成等方法,生成与给定句子相关的一个含有[MASK]标记的模板。例如It was [MASK].,并拼接到原始的文本中,获得Prompt-Tuning的输入:[CLS] I like the Disney films very much. [SEP] It was [MASK]. [SEP]。将其喂入BERT模型中,并复用预训练好的MLM分类器(在huggingface中为BertForMaskedLM),即可直接得到[MASK]预测的各个token的概率分布;
标签词映射(Label Word Verbalizer) :因为[MASK]部分我们只对部分词感兴趣,因此需要建立一个映射关系。例如如果[MASK]预测的词是“great”,则认为是positive类,如果是“terrible”,则认为是negative类。
In-context Learning :是Prompt的前身。其通过从训练集中挑选一些样本作为任务的提示提示(Natural Language Prompt),来实现免参数更新的模型预测;
Demonstration Learning :添加一些新的文本作为提示。例如在对“I like the Disney film. It was [MASK]”进行情感分析时,可以拼接一些相似场景的ground-truth文本“I like the book, it was great.”、“The music is boring. It is terrible for me.”等。此时模型在根据新添加的两个样例句子就可以“照葫芦画瓢”式地预测结果了。
GPT-3中提供的提示(Natural Language Prompt)过于简单,并不难使用在一些具体的任务场景,因此需要单独设计一套组件实现。
因此,大名鼎鼎的PET模型问世,PET(Pattern-Exploiting Training)出自《Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference》(EACL2021),根据论文题目则可以猜出,Prompt-Tuning启发于文本分类任务,并且试图将所有的分类任务转换为与MLM一致的完形填空。
Patterns Ensembling :同一个句子设计多个不同的pattern,例如It was [mask].,I think it is [mask].,This comment denotes as [mask]. 等,此时,原先只有一个句子,却可以生成多个不同的样本,也变相起到数据增强的作用。在训练时,可以当作单独的样本进行训练,推理时,则可以对所有Pattern的结果进行投票或加权。如下图所示:
给定一个具体的任务(例如分类任务),可以实现定义若干个模板(例如正则化工具),然后根据具体的句子内容,向模板中填充相关实体,以贴合句子实际的描述。例如清华大学刘知远团队提出的 PTR (PTR: Prompt Tuning with Rules for Text Classification)利用启发式的规则定义若干子模板(sub-prompt),并通过若干子模板的组合来形成最终的Pattern。
例如在关系抽取任务中,通常给定一个短文本,两个实体(记作subject和object),假如给定句子“Mark Twain was the father of Langdon. ”以及两个实体“Mark Twain”和“Landon”。那么可以定义3个子模板:
头实体 (subject entity) : the [mask] , 对应于: “the [mask] Mark Twain”, 可用于预测头实体的类型;
尾实体 (object entity) : the [mask] , 对应于: “the [mask] Landon”, 可用于尾实体的类型;
基于上述定义的 3 个规则, 则可以结合起来形成最终模板, 即 , 即 “the [mask] Mark Twain [mask] the [mask] Landon”。如图所示:
PTR的详细解读请参考博主的论文解读:论文解读:PTR: Prompt Tuning with Rules fo Text Classification:https://wjn1996.blog.csdn.net/article/details/120256178
因此不论给定哪个句子,模板不会完全固定不变,而是根据不同的实体而相应改变模板的字符序列。
相比之下, AutoPrompt 则是另一种典型的方法,其由加州大学提出《AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts(EMNLP2021),如下图所示,给定原始的输入,额外定义若干离散的字符作为trigger,并组成Template,喂入MLM中预测对应label word的概率。而这些trigger最终通过梯度搜索的方法进行挑选。
领域知识指导搜索离散的label word:《Knowledgeable Prompt-tuning: Incorporating Knowledge into Prompt Verbalizer for Text Classification》,代表方法为KPT;
原型网络动态生成label representations:《Prototypical Verbalizer for Prompt-based Few-shot Tuning》,代表方法为ProtoVerb。
KPT(Knowledgeable Prompt Tuning)
KPT的详细内容请参考博主的论文解读:论文解读:Knowledgeable Prompt-tuning: Incorporation Knowledge into Prompt Verbalizer for Text Classification:https://wjn1996.blog.csdn.net/article/details/120790512
例如在对电影评论进行二分类的时候,最简单的提示模板是“. It was [mask].”,但是其并没有突出该任务的具体特性,我们可以为其设计一个能够突出该任务特性的模板,例如“The movie review is . It was [mask].”,然后根据mask位置的输出结果通过Verbalizer映射到具体的标签上。这一类具备任务特性的模板可以称之为 指令(Instruction) 。
在转换范式的时候, 会发现输入的句子需要融合标签, 因此需要涉及到为不同标签设计对应的指令。如下图所示, 对于情感分析任务, 输入的句子是 “ total waste of time”, 给定一个标签“Positive”, 对应的指令则是“Is the review positive?”。整体的输入是 “ . Is the review positive? Answer: [mask].”。此时我们只需要约束 mask位置的输出是Yes 和 No 即可, 例如概例子中 No 的概率最大
回顾第一节我们介绍的几个预训练语言模型,我们发现目前绝大多数的双向预训练语言模型都包含Masked Language Modeling(MLM),单向预训练语言模型都包含Autoregressive Language Modeling(ALM),这些任务是预训练目标,本质上是预测被mask的位置的词,在训练时让模型理解语言的上下文信息。之所以设计Template和指令,就是希望在下游任务时能够复用这些预训练的目标,避免引入新的参数而导致过拟合。因此,我们可以将Prompt升华到一个新的高度,即 Prompt Tuning的本质是复用预训练语言模型在预训练阶段所使用的目标和参数 。
典型的方法例如:《UNIFIEDQA: Crossing format boundaries with a single QA system》、《ProQA- Structural Prompt-based Pre-training for Unified Question Answering》,其采用端到端的预训练语言模型(例如BART、T5),并复用预训练阶段的训练目标。
经典的方法比如《Unifying Question Answering, Text Classification, and Regression via Span Extraction》,苏剑林提出的Global Pointer。博主也运用该思想在2022年AIWIN春季赛“中文保险小样本”中获得第二名成绩。
介绍了上述的一些参数有效性方法,我们发现,Prompt-Tuning也符合其主旨。基于参数有效性的思想,也有许多工作致力于Prompt与参数有效性的结合,例如《Delta Tuning: A Comprehensive Study of Parameter Efficient Methods for Pre-trained Language Models》、《LiST: Lite Prompted Self-training Makes Parameter-efficient Few-shot Learners》、《Making Parameter-efficient Tuning More Efficient: A Unified Framework for Classification Tasks》、《P-Adapters- Robustly Extracting Factual Information from Language Models with Diverse Prompts》、《Context-Tuning: Learning Contextualized Prompts for Natural Language Generation》,由于相关工作非常多而且更新频繁,这里不一一介绍。
更多分析可阅读博主的博文:【In-Context Learning】Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?:https://blog.csdn.net/qq_36426650/article/details/129818361?spm=1001.2014.3001.5501
作者认为,不同的实验设置(例如Template的不同、数据集的不同等),Random Label与No Label所产生的效果差异是不同的,因此不能直接做出“In-context example mapping does not affect in-context learning performance much”片面的判定。
横坐标(Training Set ID)表示10个不同的In-Context Example集合,用来观察不同的样本挑选对ICL的影响情况;对于每一个集合,4个样本可以有24种排列,每一个排列进行一次实验,对应图中的一个点,因此每一个集合都对应一共有24个点,采用箱式图来观察不同的排列对ICL的影响情况。纵坐标为准确率。
《What Makes Good In-Context Examples for GPT-3?》:代表方法KATE;
《Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity》:简称Fantastically
KATE
该工作也在SST-2的预实验中发现不同的In-Context Example会得到不同的准确率,说明样本的挑选很重要。另外作者在Natural Question数据集上进行测试,发现当挑选的In-Context Example如果在Embedding空间中与Test Example更近,将会带来更好的效果。因此提出KATE(Knn-Augmented in-conText Example selection),即基于近邻原则挑选In-Context Example。
关于KATE更详细的解读可参考博主的博文:【In-Context Learning】What Makes Good In-Context Examples for GPT-3?:https://wjn1996.blog.csdn.net/article/details/129816707?spm=1001.2014.3001.5502
训练阶段使用MOE进行预训练。预训练语料:BOOK-CORPUS plus Wikipedia, CC-NEWS, OPENWEB- TEXT, and STORIES。分别对每个语料抽取100k句子(STORIES只抽取10k)。最终大约有100w个句子,每个类型的self-supervised task平均25w个样本。
Scaling up model size alone has not proved sufficient for achieving high performance on challenging tasks such as arithmetic, commonsense, and symbolic reasoning.
期望探索如何对大模型进行推理的简单方法:
对于算术类推理任务,期望生成自然语言逻辑依据来指导并生成最终答案;但是获得逻辑依据是比较复杂昂贵的。It is costly to create a large set of high quality rationales, which is much more complicated than simple input–output pairs used in normal machine learning
对某个task,为大模型提供一些上下文in-context example作为prompt;简单的示例可能并非能够提升推理能力。It works poorly on tasks that require reasoning abilities, and often does not improve substantially with increasing language model scale
因此,提出 思维链(Chain-of-Thought) 。思维链的定义如下:A chain of thought is a series of intermediate natural language reasoning steps that lead to the final output, and we refer to this approach as chain-of-thought prompting.
(3)Least-to-Most Prompting Enables Complex Reasoning in Large Language Models:https://arxiv.org/abs/2205.10625
最近CoT的提出进一步拉近了人类与机器智能的距离,通过natural language rationales和self-consistency来提升大模型在推理任务上的性能。然而CoT依然存在一些不足:即其很难对超出demonstration example难度程度的问题进行解答。为此,该工作尝试将一个复杂的任务分解为若干简单的子任务。
[1] 【预训练语言模型】Attention Is All You Need(Transformer): https://blog.csdn.net/qq_36426650/article/details/112222115
[2] 【预训练语言模型】BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding(BERT): https://blog.csdn.net/qq_36426650/article/details/112223838
[3] 《Language Models are Few-Shot Learners》(NIPS2020): https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html
[4] 《Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference》(EACL2021): https://doi.org/10.18653/v1/2021.eacl-main.20
[6] PTR: Prompt Tuning with Rules for Text Classification: https://arxiv.org/abs/2105.11259
[7] 论文解读:PTR: Prompt Tuning with Rules fo Text Classification: https://wjn1996.blog.csdn.net/article/details/120256178
[8] 《AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts(EMNLP2021): https://aclanthology.org/2020.emnlp-main.346.pdf
[9] 《Making Pre-trained Language Models Better Few-shot Learners》(ACL2021): https://doi.org/10.18653/v1/2021.acl-long.295
[10] 论文解读:Making Pre-trained Language Models Better Few-shot Learners(LM-BFF): https://wjn1996.blog.csdn.net/article/details/115640052
[11] 《The Power of Scale for Parameter-Efficient Prompt Tuning》: https://aclanthology.org/2021.emnlp-main.243.pdf
[15] 《RLPROMPT: Optimizing Discrete Text Prompts with Reinforcement Learning》: https://arxiv.org/pdf/2205.12548.pdf
[16] 《Knowledgeable Prompt-tuning: Incorporating Knowledge into Prompt Verbalizer for Text Classification》: https://aclanthology.org/2022.acl-long.158.pdf
[17] 《Prototypical Verbalizer for Prompt-based Few-shot Tuning》: https://aclanthology.org/2022.acl-long.483.pdf
[18] 论文解读:Knowledgeable Prompt-tuning: Incorporation Knowledge into Prompt Verbalizer for Text Classification: https://wjn1996.blog.csdn.net/article/details/120790512
[19] 《PromptBERT: Improving BERT Sentence Embeddings with Prompts》: https://arxiv.org/pdf/2201.04337
[20] 《TransPrompt: Towards an Automatic Transferable Prompting Framework for Few-shot Text Classification》: https://aclanthology.org/2021.emnlp-main.221.pdf
[21] 《Adapting Language Models for Zero-shot Learning by Meta-tuning on Dataset and Prompt Collections》: https://aclanthology.org/2021.findings-emnlp.244.pdf
[22] UNIFIEDQA: Crossing format boundaries with a single QA system: https://aclanthology.org/2020.findings-emnlp.171.pdf
[23] ProQA- Structural Prompt-based Pre-training for Unified Question Answering: https://aclanthology.org/2022.naacl-main.313.pdf
[24] 《Unifying Question Answering, Text Classification, and Regression via Span Extraction》: https://arxiv.org/pdf/1904.09286
[25] Global Pointer: https://spaces.ac.cn/archives/8373
[26] 《Entailment as Few-Shot Learner》(EFL): https://arxiv.org/pdf/2104.14690.pdf
[33] 《UniPELT: A Unified Framework for Parameter-Efficient Language Model Tuning》: https://aclanthology.org/2022.acl-long.433.pdf
[34] 《Delta Tuning: A Comprehensive Study of Parameter Efficient Methods for Pre-trained Language Models》: https://aclanthology.org/2022.acl-long.433.pdf
[35] 《LiST: Lite Prompted Self-training Makes Parameter-efficient Few-shot Learners》: https://aclanthology.org/2022.findings-naacl.174.pdf
[36] 《Making Parameter-efficient Tuning More Efficient: A Unified Framework for Classification Tasks》: https://aclanthology.org/2022.findings-naacl.174.pdf
[37] 《P-Adapters- Robustly Extracting Factual Information from Language Models with Diverse Prompts》: https://openreview.net/forum?id=DhzIU48OcZh
[38] 《Context-Tuning: Learning Contextualized Prompts for Natural Language Generation》: https://aclanthology.org/2022.coling-1.552.pdf
[39] 《Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?》: https://aclanthology.org/2022.emnlp-main.759.pdf
[40] 《Ground-Truth Labels Matter: A Deeper Look into Input-Label Demonstrations》: https://aclanthology.org/2022.emnlp-main.155.pdf
[41] 【In-Context Learning】Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?: https://blog.csdn.net/qq_36426650/article/details/129818361?spm=1001.2014.3001.5501
[42] 《What Makes Good In-Context Examples for GPT-3?》: https://aclanthology.org/2022.deelio-1.10.pdf
[43] 《Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity》: https://aclanthology.org/2022.acl-long.556.pdf
[44] 【In-Context Learning】What Makes Good In-Context Examples for GPT-3?: https://wjn1996.blog.csdn.net/article/details/129816707?spm=1001.2014.3001.5502
[45] 《Improving In-Context Few-Shot Learning via Self-Supervised Training》: https://aclanthology.org/2022.naacl-main.260.pdf
[46] 《Meta-learning via Language Model In-context Tuning》: https://doi.org/10.18653/v1/2022.acl-long.53
[47] 《MetaICL: Learning to Learn In Context》: https://github.com/facebookresearch/MetaICL
[48] Self-consistency Improves Chain Of Thought Reasoning in Language Models: https://arxiv.org/abs/2203.11171
[49] Large Language Models are Zero-Shot Reasoners: https://arxiv.org/abs/2205.11916
[50] Automatic Chain of Thought Prompting in Large Language Models: http://arxiv.org/abs/2210.03493
[51] Least-to-Most Prompting Enables Complex Reasoning in Large Language Models: https://arxiv.org/abs/2205.10625