✂️ Tokens
**Muse models don't actually process text as a sequence of characters or words, but as a sequence of tokens.** Tokens are to our models what syllables are to us: they are building blocks that can be combined into words or sentences. Tokens are constructed to be sequences of characters with useful semantics, but they are sensitive to whitespace and capitalization.
Let's take a look at a few examples of tokenization. In tokens, `Ġ` represents a whitespace, and in the following we separate tokens with dashes. Common words are usually single tokens with a preceding whitespace: `Ġword`, `Ġcivilization`, `ĠEarth`, etc. Complex words and uncommon proper nouns will be made of multiple tokens: `Ġhom`-`onym`, `ĠKam`-`ala`, `ĠSuper`-`cal`-`if`-`rag`-`il`-`ist`-`ice`-`xp`-`iral`-`id`-`ocious`. This sentence will be tokenized as: `This`-`Ġsentence`-`Ġwill`-`Ġbe`-`Ġtoken`-`ized`-`Ġas`-`:`.
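If you want to reproduce these examples programmatically, here is a minimal sketch using the Hugging Face `transformers` GPT-2 tokenizer, which uses the same byte-pair-encoding scheme and `Ġ` whitespace convention as the examples above (Muse's own tokenizer may differ in its exact vocabulary):

```python
from transformers import GPT2TokenizerFast

# A GPT-2 style byte-pair-encoding tokenizer; like the examples above,
# it marks a leading whitespace with "Ġ".
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

print(tokenizer.tokenize("This sentence will be tokenized as:"))
# ['This', 'Ġsentence', 'Ġwill', 'Ġbe', 'Ġtoken', 'ized', 'Ġas', ':']

# Whitespace matters: the same characters tokenize differently
# with and without a preceding space.
print(tokenizer.tokenize("word"))           # ['word']
print(tokenizer.tokenize(" word"))          # ['Ġword']
print(tokenizer.tokenize(" civilization"))  # ['Ġcivilization']
```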
On average, a token equals 3/4 of a word, or 4 characters, in English. This will also vary with the style of the text: "simple" writing will use fewer tokens (on average one per word), whereas complex technical writing will use more. You can play around with the ✂️ Tokenizer on the Muse Playground pricing page to get a better feel for how this works and how it differs across languages.
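These averages are handy for back-of-the-envelope estimates. As a sketch (the helper names below are hypothetical, not part of the Muse API), you can approximate a text's token count from its length:

```python
def estimate_tokens_from_chars(text: str) -> int:
    """Rough estimate: English averages ~4 characters per token."""
    return max(1, round(len(text) / 4))

def estimate_tokens_from_words(text: str) -> int:
    """Rough estimate: English averages ~3/4 of a word per token."""
    return max(1, round(len(text.split()) / 0.75))

sample = "This sentence will be tokenized as:"
print(estimate_tokens_from_chars(sample))  # ~9; the true count is 8
print(estimate_tokens_from_words(sample))  # 8
```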
On occasion, you have to be mindful of tokens. For instance, in ✍️ Create, the model can only generate a fixed number of tokens, which may cause it to stop generating in the middle of a complex word. Similarly, features such as word biases can only influence the first token of a complex word: setting `word_biases = {'ticketing': +5}` is effectively equivalent to setting `word_biases = {'ticket': +5}`, because "ticketing" is tokenized as `Ġticket`-`ing`.
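You can check both effects with the same GPT-2 style tokenizer as above (again a sketch; Muse's own tokenizer may split words slightly differently):

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# 1. A fixed token budget can stop generation partway through a word:
tokens = tokenizer.tokenize(" Supercalifragilisticexpialidocious")
truncated = tokens[:3]  # pretend the budget ran out after 3 tokens
print(tokenizer.convert_tokens_to_string(truncated))  # e.g. " Supercalif"

# 2. A word bias only reaches the first token of a multi-token word:
print(tokenizer.tokenize(" ticketing"))  # ['Ġticket', 'ing']
print(tokenizer.tokenize(" ticket"))     # ['Ġticket']
# Both start with 'Ġticket', so biasing 'ticketing' effectively
# biases 'ticket' as well.
```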