1 Training a new tokenizer from an old one
1.1 How tokenizer training differs from model training
1.1.1 Training a tokenizer is a statistical process that tries to identify which subwords are the best to pick for a given corpus, and the exact rules used to pick them depend on the tokenization algorithm.
1.1.2 Model training uses stochastic gradient descent to make the loss a little bit smaller for each batch; it's randomized by nature.
1.2 Assembling a corpus
def get_training_corpus():
    return (
        raw_datasets["train"][i : i + 1000]["whole_func_string"]
        for i in range(0, len(raw_datasets["train"]), 1000)
    )

training_corpus = get_training_corpus()
1.3 Training a new tokenizer
from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 52000)
1.3.3 🤗 Transformers provides a method you can use to train a new tokenizer with the same characteristics as an existing one: AutoTokenizer.train_new_from_iterator().
1.3.4 Note that AutoTokenizer.train_new_from_iterator() only works if the tokenizer you are using is a “fast” tokenizer.
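A quick sanity check and save (the example snippet and save path below are just illustrations):
example = "def add_numbers(a, b):\n    return a + b"
print(old_tokenizer.tokenize(example))  # GPT-2 tokenizer, not adapted to code
print(tokenizer.tokenize(example))      # retrained tokenizer: fewer, more code-aware tokens
tokenizer.save_pretrained("code-search-net-tokenizer")  # example path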
2 Fast tokenizers’ special powers
2.1 Batch encoding
2.1.1 We can map any word or token to characters in the original text, and vice versa, via the word_to_chars(), token_to_chars(), char_to_word(), and char_to_token() methods.
2.1.2 tokenizer.is_fast
2.1.3 encoding.is_fast
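A minimal sketch of these offset-mapping features (the checkpoint is just an example):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
example = "My name is Sylvain."
encoding = tokenizer(example)
print(tokenizer.is_fast, encoding.is_fast)  # both True for a fast tokenizer
print(encoding.tokens())                    # subword tokens
print(encoding.word_ids())                  # index of the word each token comes from
start, end = encoding.word_to_chars(3)      # character span of the word at index 3
print(example[start:end])                   # "Sylvain"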
2.2 Inside the token-classification pipeline
2.3 Getting the base results with the pipeline
2.4 A fast tokenizer can process inputs faster than a slow tokenizer when you batch lots of inputs together.
2.5 A fast tokenizer has some additional features allowing you to map tokens to the span of text that created them.
3 Fast tokenizers in the QA pipeline
3.1 Using the question-answering pipeline
3.1.1 question_answerer = pipeline("question-answering", device=device)
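A quick usage example, assuming the question_answerer pipeline above (the context and question strings are placeholders):
context = "🤗 Transformers is backed by the three most popular deep learning libraries: Jax, PyTorch, and TensorFlow."
question = "Which deep learning libraries back 🤗 Transformers?"
print(question_answerer(question=question, context=context))
# returns a dict with 'score', 'start', 'end', and 'answer'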
3.2 Using a model for question answering
3.2.1 Logits
The logit was originally a mathematical function, the inverse of the sigmoid: logit(p) = log(p / (1 - p)). In deep learning, however, "logits" usually refers to the raw outputs of the final fully connected layer rather than to that function. A network produces logits first and then turns them into probabilities with a sigmoid or softmax function, so in most cases the expression of the logit function itself is never needed.
From logits to probabilities: to convert those logits into probabilities, we will apply a softmax function, but before that we need to make sure we mask the indices that are not part of the context.
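A minimal sketch of that masking + softmax step (the checkpoint is just an example; question and context are assumed to be defined as in the pipeline example above):
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

model_checkpoint = "distilbert-base-cased-distilled-squad"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
start_logits = outputs.start_logits
end_logits = outputs.end_logits

# sequence_ids() is 0 for question tokens, 1 for context tokens, None for special tokens
sequence_ids = inputs.sequence_ids()
mask = [i != 1 for i in sequence_ids]
mask[0] = False  # keep [CLS] unmasked so the model can signal "no answer"
mask = torch.tensor(mask)[None]

start_logits[mask] = -10000  # masked positions get a very negative logit
end_logits[mask] = -10000
start_probabilities = torch.nn.functional.softmax(start_logits, dim=-1)[0]
end_probabilities = torch.nn.functional.softmax(end_logits, dim=-1)[0]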
3.3 Handling long contexts
3.3.1 Specifying tokenizer parameters
inputs = tokenizer(sentence, truncation=True, return_overflowing_tokens=True, max_length=6, stride=2)
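Decoding each returned chunk shows the overlapping pieces produced by max_length and stride (assuming sentence and tokenizer are already defined):
for ids in inputs["input_ids"]:
    print(tokenizer.decode(ids))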
inputs = tokenizer(
    question,
    long_context,
    stride=128,
    max_length=384,
    padding="longest",
    truncation="only_second",
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
)
3.3.2 Keys
print(inputs.keys())
dict_keys(['input_ids', 'attention_mask', 'overflow_to_sample_mapping'])
The last key, overflow_to_sample_mapping, is a map that tells us which sentence each of the results corresponds to.
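For instance, tokenizing two sentences at once (a small illustrative example) makes that mapping visible:
sentences = [
    "This sentence is not too long but we are going to split it anyway.",
    "This sentence is shorter but will still get split.",
]
inputs = tokenizer(
    sentences, truncation=True, return_overflowing_tokens=True, max_length=6, stride=2
)
print(inputs["overflow_to_sample_mapping"])
# chunks from the first sentence map to 0, chunks from the second to 1,
# e.g. [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]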
4 Normalization and pre-tokenization
4.1 Normalization
4.1.1 The normalization step involves some general cleanup, such as removing needless whitespace, lowercasing, and/or removing accents.
4.1.2 A 🤗 Transformers tokenizer has an attribute called backend_tokenizer that provides access to the underlying tokenizer from the 🤗 Tokenizers library:
print(type(tokenizer.backend_tokenizer))
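For instance, we can ask the normalizer what it does to a string (output shown for a bert-base-uncased checkpoint, whose normalizer lowercases and strips accents):
from transformers import AutoTokenizer

bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(bert_tokenizer.backend_tokenizer.normalizer.normalize_str("Héllò hôw are ü?"))
# hello how are u?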
4.2 Pre-tokenization
4.2.1 To see how a fast tokenizer performs pre-tokenization, we can use the pre_tokenize_str() method of the pre_tokenizer attribute of the tokenizer object:
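For example, with a GPT-2 fast tokenizer:
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(gpt2_tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello, how are  you?"))
# a byte-level pre-tokenizer returns the split words with their character offsets,
# e.g. [('Hello', (0, 5)), (',', (5, 6)), ('Ġhow', (7, 10)), ...]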
4.3 SentencePiece
4.3.1 SentencePiece is a tokenization algorithm for the preprocessing of text that you can use with any of the models we will see in the next three sections.
4.3.2 It treats the text as a sequence of Unicode characters and replaces spaces with a special character, ▁, which is very useful for languages where the space character is not used (like Chinese or Japanese).
4.4 Algorithm overview
5 Algorithm overview
5.1 Byte-Pair Encoding tokenization
5.1.1 It’s used by a lot of Transformer models, including GPT, GPT-2, RoBERTa, BART, and DeBERTa.
5.1.2 Training algorithm
- BPE training starts by computing the unique set of words used in the corpus (after the normalization and pre-tokenization steps are completed), then building the vocabulary by taking all the symbols used to write those words.
- BPE is a subword tokenization algorithm that starts with a small vocabulary and learns merge rules. If an example you are tokenizing uses a character that is not in the training corpus, that character will be converted to the unknown token.
- That's one reason why lots of NLP models are very bad at analyzing content with emojis.
- After getting this base vocabulary, we add new tokens until the desired vocabulary size is reached by learning merges, which are rules to merge two elements of the existing vocabulary together into a new one.
- At each step, the BPE algorithm searches for the most frequent pair of existing tokens (by "pair," here we mean two consecutive tokens in a word). That most frequent pair is the one that will be merged, and we rinse and repeat for the next step (see the sketch after this list).
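A minimal sketch of one training merge step, using toy word frequencies (not the 🤗 Tokenizers implementation):
from collections import Counter

word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}  # toy corpus counts
splits = {word: list(word) for word in word_freqs}  # base vocabulary: single characters

def most_frequent_pair(splits, word_freqs):
    pair_counts = Counter()
    for word, freq in word_freqs.items():
        symbols = splits[word]
        for a, b in zip(symbols, symbols[1:]):
            pair_counts[(a, b)] += freq
    return pair_counts.most_common(1)[0][0]

def apply_merge(pair, splits):
    a, b = pair
    for word, symbols in splits.items():
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i : i + 2] = [a + b]
            else:
                i += 1
    return splits

best = most_frequent_pair(splits, word_freqs)  # ('u', 'g') for these counts
splits = apply_merge(best, splits)
print(best, splits["hugs"])  # ('u', 'g') ['h', 'ug', 's']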
5.1.3 Tokenization algorithm
Inputs are tokenized by applying the following steps:
- Normalization
- Pre-tokenization
- Splitting the words into individual characters
- Applying the merge rules learned, in order, on those splits (see the sketch after this list)
- A word or character that is not in the vocabulary will be tokenized as the unknown token, "[UNK]"
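A minimal sketch of applying learned merges at tokenization time (the merges and vocabulary below are toy examples continuing the training sketch above):
merges = [("u", "g"), ("u", "n"), ("h", "ug")]  # hypothetical learned merge rules, in order
vocab = {"b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"}

def bpe_tokenize(word, merges, vocab):
    # split into characters; unknown characters become "[UNK]"
    symbols = [c if c in vocab else "[UNK]" for c in word]
    for a, b in merges:  # apply merge rules in the order they were learned
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i : i + 2] = [a + b]
            else:
                i += 1
    return symbols

print(bpe_tokenize("bugs", merges, vocab))  # ['b', 'ug', 's']
print(bpe_tokenize("mug", merges, vocab))   # ['[UNK]', 'ug']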
5.2 WordPiece tokenization
5.2.1 WordPiece is the tokenization algorithm Google developed to pretrain BERT.
5.2.2 It has since been reused in quite a few Transformer models based on BERT, such as DistilBERT, MobileBERT, Funnel Transformers, and MPNET.
5.2.3 Training algorithm
- Like BPE, WordPiece starts from a small vocabulary including the special tokens used by the model and the initial alphabet.
- WordPiece tokenizes words into subwords by finding the longest subword, starting from the beginning, that is in the vocabulary, then repeating the process for the rest of the text.
- Instead of selecting the most frequent pair, WordPiece computes a score for each pair, using the following formula: score = freq_of_pair / (freq_of_first_element × freq_of_second_element).
5.2.4 Tokenization algorithm
Tokenization differs in WordPiece and BPE in that WordPiece only saves the final vocabulary, not the merge rules learned.
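A minimal sketch of this longest-match-first tokenization, with a toy vocabulary ("##" marks subwords that continue a word):
vocab = {"b", "h", "p", "##g", "##n", "##s", "##u", "hu", "hug", "##gs"}

def wordpiece_tokenize(word, vocab):
    tokens = []
    while word:
        i = len(word)
        while i > 0 and word[:i] not in vocab:  # find the longest prefix in the vocabulary
            i -= 1
        if i == 0:
            return ["[UNK]"]                    # no known prefix: the whole word is unknown
        tokens.append(word[:i])
        word = word[i:]
        if word:
            word = "##" + word                  # the remainder continues the word
    return tokens

print(wordpiece_tokenize("hugs", vocab))  # ['hug', '##s']
print(wordpiece_tokenize("mug", vocab))   # ['[UNK]']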
5.3 Unigram tokenization
5.3.1 The Unigram algorithm is often used in SentencePiece, which is the tokenization algorithm used by models like ALBERT, T5, mBART, Big Bird, and XLNet.
5.3.2 Training algorithm
- Compared with BPE and WordPiece, Unigram works in the other direction: it starts from a big vocabulary and removes tokens from it until it reaches the desired vocabulary size.
- At each step of the training, the Unigram algorithm computes a loss over the corpus given the current vocabulary.
- Then, for each symbol in the vocabulary, the algorithm computes how much the overall loss would increase if the symbol was removed, and looks for the symbols that would increase it the least.
- Unigram tokenizes words into subwords by finding the most likely segmentation into tokens, according to the model.
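A toy sketch of scoring candidate segmentations with a unigram model (the token frequencies below are made up for illustration; the real algorithm finds the best segmentation efficiently with the Viterbi algorithm):
import math

token_freqs = {"h": 15, "u": 36, "g": 20, "hu": 15, "ug": 20, "hug": 15}  # toy counts
total = sum(token_freqs.values())

def segmentation_score(tokens):
    # the unigram model treats tokens as independent: log P(seg) = sum of log P(token)
    return sum(math.log(token_freqs[t] / total) for t in tokens)

for seg in (["hug"], ["hu", "g"], ["h", "u", "g"]):
    print(seg, segmentation_score(seg))
# the segmentation with the highest (least negative) score is the one Unigram picks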
6 Building a tokenizer, block by block
6.1 Tokenization comprises several steps:
6.1.1 Normalization (any cleanup of the text that is deemed necessary, such as removing spaces or accents, Unicode normalization, etc.)
6.1.2 Pre-tokenization (splitting the input into words)
6.1.3 Running the input through the model (using the pre-tokenized words to produce a sequence of tokens)
6.1.4 Post-processing (adding the special tokens of the tokenizer, generating the attention mask and token type IDs)
6.2 Submodules of the 🤗 Tokenizers library
6.2.1 normalizers contains all the possible types of Normalizer you can use (complete list here).
6.2.2 pre_tokenizers contains all the possible types of PreTokenizer you can use (complete list here).
6.2.3 models contains the various types of Model you can use, like BPE, WordPiece, and Unigram (complete list here).
6.2.4 trainers contains all the different types of Trainer you can use to train your model on a corpus (one per type of model; complete list here).
6.2.5 post_processors contains the various types of PostProcessor you can use (complete list here).
6.2.6 decoders contains the various types of Decoder you can use to decode the outputs of tokenization (complete list here).
6.2.7 https://huggingface.co/docs/tokenizers/components
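A minimal sketch assembling these blocks into a BERT-style WordPiece tokenizer (the settings are illustrative; get_training_corpus() is the generator defined earlier):
from tokenizers import Tokenizer, decoders, models, normalizers, pre_tokenizers, processors, trainers

tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFD(), normalizers.Lowercase(), normalizers.StripAccents()]
)
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.WordPieceTrainer(vocab_size=25000, special_tokens=special_tokens)
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)

cls_id = tokenizer.token_to_id("[CLS]")
sep_id = tokenizer.token_to_id("[SEP]")
tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS]:0 $A:0 [SEP]:0",
    pair="[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
    special_tokens=[("[CLS]", cls_id), ("[SEP]", sep_id)],
)
tokenizer.decoder = decoders.WordPiece(prefix="##")

encoding = tokenizer.encode("Let's test this tokenizer.")
print(encoding.tokens)  # includes the [CLS] and [SEP] added by the post-processor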
6.3 Passing a generator to train_new_from_iterator() avoids loading the whole dataset into memory at once.
6.4 How the token-classification pipeline handles entities that span several tokens
6.4.1 There is a label for the beginning of an entity and a label for the continuation of an entity.
6.4.2 In a given word, as long as the first token has the label of the entity, the whole word is considered labeled with that entity.
6.4.3 When a token has the label of a given entity, any other following token with the same label is considered part of the same entity, unless it’s labeled as the start of a new entity.
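A minimal sketch of that grouping logic over token-level labels (the tokens and labels below are made up; the real pipeline also uses character offsets to rebuild the entity text):
tokens = ["My", "name", "is", "S", "##yl", "##va", "##in", "at", "Hugging", "Face"]
labels = ["O", "O", "O", "B-PER", "I-PER", "I-PER", "I-PER", "O", "B-ORG", "I-ORG"]

entities = []
current = None
for token, label in zip(tokens, labels):
    if label.startswith("B-") or (label.startswith("I-") and current is None):
        current = {"entity": label[2:], "tokens": [token]}  # start of a new entity
        entities.append(current)
    elif label.startswith("I-") and label[2:] == current["entity"]:
        current["tokens"].append(token)                     # continuation of the same entity
    else:
        current = None                                      # "O" or a mismatched label ends the entity
print(entities)
# [{'entity': 'PER', 'tokens': ['S', '##yl', '##va', '##in']},
#  {'entity': 'ORG', 'tokens': ['Hugging', 'Face']}]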