在线3d建模网站_天津市建设工程信息网官网首页_关于seo的行业岗位有哪些_网页制作与设计教程

时间:2025/7/12 6:32:41来源：https://blog.csdn.net/chengyq116/article/details/144252778 浏览次数:0次

Large Language Model {LLM} Tokenizers - bos_token - eos_token - unk_token

1. NVIDIA NeMo Framework
- 1.1. Tokenizers
2. PyTorch Module code
- 2.1. `torchtune.modules.tokenizers._tiktoken`
References

1. NVIDIA NeMo Framework

https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html

NVIDIA NeMo Framework is a scalable and cloud-native generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (e.g. Automatic Speech Recognition and Text-to-Speech).

It enables users to efficiently create, customize, and deploy new generative AI models by leveraging existing code and pre-trained model checkpoints.

NeMo Framework provides end-to-end support for developing Large Language Models (LLMs) and Multimodal Models (MMs).

1.1. Tokenizers

class nemo.collections.common.tokenizers.AutoTokenizer(pretrained_model_name: str,vocab_file: str | None = None,merges_file: str | None = None,mask_token: str | None = None,bos_token: str | None = None,eos_token: str | None = None,pad_token: str | None = None,sep_token: str | None = None,cls_token: str | None = None,unk_token: str | None = None,additional_special_tokens: List | None = [],use_fast: bool | None = False,trust_remote_code: bool | None = False,
)

pretrained_model_name - corresponds to HuggingFace-AutoTokenizer’s ‘pretrained_model_name_or_path’ input argument.

vocab_file - path to file with vocabulary which consists of characters separated by newlines.

mask_token - mask token

bos_token - the beginning of sequence token

eos_token - the end of sequence token. Usually equal to sep_token

pad_token - token to use for padding

sep_token - token used for separating sequences

cls_token - class token. Usually equal to bos_token

unk_token - token to use for unknown tokens

additional_special_tokens - list of other tokens beside standard special tokens (bos, eos, pad, etc.). For example, sentinel tokens for T5 (<extra_id_0>, <extra_id_1>, etc.)

use_fast - whether to use fast HuggingFace tokenizer

2. PyTorch Module code

https://pytorch.org/torchtune/0.1/_modules/index.html

2.1. `torchtune.modules.tokenizers._tiktoken`

https://pytorch.org/torchtune/0.1/_modules/torchtune/modules/tokenizers/_tiktoken.html

        path (str): Path to pretrained tokenizer checkpoint file.name (str): Name of the tokenizer (used by tiktoken for identification).pattern (str): Regex pattern used to for string parsing.all_special_tokens (Optional[List[str]]): List of all special tokens. First element must be bos token, second element must be eos token, final element must be python tag. All elements must be unique. Length must be at most 256. Default: None (will use ALL_SPECIAL_TOKENS)bos_token (str): Beginning of sequence token. Defaults to BEGIN_OF_TEXT.eos_token (str): End of sequence token. Defaults to END_OF_TEXT.start_header_id (str): Start header token. Defaults to START_HEADER_ID.end_header_id (str): End header token. Defaults to END_HEADER_ID.step_id (str): Step token. Defaults to STEP_ID.eom_id (str): End of message token. Defaults to EOM_ID.eot_id (str): End of turn token. Defaults to EOT_ID.python_tag (str): Python tag token. Defaults to PYTHON_TAG.

References

[1] Yongqiang Cheng, https://yongqiang.blog.csdn.net/
[2] How do LLMs process text data - A deep dive into Tokenization (Part-1), https://gdevakumar.medium.com/how-do-llms-process-text-data-a-deep-dive-into-tokenization-part-1-342bd365c6dc

关键字：在线3d建模网站_天津市建设工程信息网官网首页_关于seo的行业岗位有哪些_网页制作与设计教程

本网仅为发布的内容提供存储空间，不对发表、转载的内容提供任何形式的保证。凡本网注明“来源：XXX网络”的作品，均转载自其它媒体，著作权归作者所有，商业转载请联系作者获得授权，非商业转载请注明出处。

我们尊重并感谢每一位作者，均已注明文章来源和作者。如因作品内容、版权或其它问题，请及时与我们联系，联系邮箱：809451989@qq.com，投稿邮箱：809451989@qq.com