Let's take a look at a few examples of tokenization. In tokens, Ġ represents a whitespace, and in the following we separate tokens with dashes (-). Common words are usually single tokens preceded by a whitespace: Ġword, Ġcivilization, ĠEarth, etc. Complex words and uncommon proper nouns are split into multiple tokens: Ġhom - onym, ĠKam - ala, ĠSuper - cal - if - rag - il - ist - ice - xp - ial - id - ocious. This sentence will be tokenized as: This - Ġsentence - Ġwill - Ġbe - Ġtoken - ized - Ġas - :.
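To make the Ġ convention concrete, here is a minimal sketch of a greedy longest-match tokenizer over a toy, hypothetical vocabulary. Real byte-level BPE tokenizers build tokens by applying learned merge rules rather than longest-match lookup, and their vocabularies are far larger, but the split it produces for the example sentence matches the one above.

```python
# Toy vocabulary (hypothetical, not any real model's vocab).
# Ġ marks a preceding whitespace, as in byte-level BPE tokenizers.
VOCAB = {"This", "Ġsentence", "Ġwill", "Ġbe", "Ġtoken", "ized", "Ġas", ":"}

def tokenize(text: str) -> list[str]:
    s = text.replace(" ", "Ġ")  # fold spaces into the following word
    tokens, i = [], 0
    while i < len(s):
        # Greedily take the longest vocabulary entry starting at i.
        for j in range(len(s), i, -1):
            if s[i:j] in VOCAB:
                tokens.append(s[i:j])
                i = j
                break
        else:
            tokens.append(s[i])  # unknown character: emit it alone
            i += 1
    return tokens

print(" - ".join(tokenize("This sentence will be tokenized as:")))
# This - Ġsentence - Ġwill - Ġbe - Ġtoken - ized - Ġas - :
```

Note how "tokenized" splits into Ġtoken - ized: the vocabulary has no single entry for the whole word, so the tokenizer falls back to smaller pieces.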
On occasion, you have to be mindful of tokens. For instance, in ✍️ Create, the model can only generate a fixed number of tokens, which may cause it to stop generating in the middle of a complex word. Similarly, features such as word biases can only influence the first token of a complex word: setting word_biases = {'ticketing': +5} is effectively equivalent to setting word_biases = {'ticket': +5}, because "ticketing" is tokenized as Ġticket - ing.
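The collapse of a multi-token word bias onto its first token can be sketched as follows. The bias-resolution logic here is a hypothetical illustration (not the actual ✍️ Create implementation), and the tokenizer is a fixed stand-in that mirrors the Ġticket - ing split from the text.

```python
# Stand-in tokenizer: hard-codes the splits from the document's example.
def tokenize(word: str) -> list[str]:
    return {"ticketing": ["Ġticket", "ing"]}.get(word, ["Ġ" + word])

def resolve_word_biases(word_biases: dict[str, float]) -> dict[str, float]:
    """Map word-level biases to token-level biases.

    Only the first token of each word receives the bias, which is why a
    bias on a multi-token word effectively biases its leading token.
    """
    token_biases = {}
    for word, bias in word_biases.items():
        tokens = tokenize(word)
        token_biases[tokens[0]] = bias  # later tokens ("ing") get no bias
    return token_biases

print(resolve_word_biases({"ticketing": +5}))
# {'Ġticket': 5}
```

Once Ġticket has been sampled, nothing steers the model toward the "ing" continuation, so the bias boosts "ticket" and all words starting with that token equally.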