AI’s neural network
THE OPENING sequence in the movie ‘Heart of Stone’ is the latest of many I’ve seen where tokens play a special role, either in solidifying a scene or as an excuse for an escape. In that scene, Rachel Stone, played by Gal Gadot, picks up her tokens and sashays away from the table while showing a phone screen reading “Blackjack” to the suspicious casino security head.
Tokens are used as a form of currency. In casinos at least.
These small discs, used in table games like poker, blackjack, and roulette, represent money and are more convenient to carry around than large bundles of cash. Tokens come in many denominations, so one token can be worth hundreds of dollars while a similar-looking one with different markings and colors may represent far less.
In artificial intelligence, tokens have similar characteristics. They represent something, carry value (or weight), and may look the same but mean different things. Tokenization is the process of breaking text down into smaller units called tokens, which can be words, characters, sub-words, symbols, or even signals, depending on the chosen tokenization method or scheme.
According to Microsoft AI, tokens are the basic units of text or code that an LLM AI uses to process and generate language. They are assigned numerical values or identifiers, arranged in sequences or vectors, fed into or output from the model, and serve as the building blocks of language for the model.
Tokenization is an essential step in natural language processing (NLP) and machine learning (ML) tasks such as text classification, sentiment analysis, and language translation. Like a casino token, it is designed to make things easier for the machine: it reduces the complexity of the text so the machine can understand and process it in a logical way.
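To make the idea concrete, here is a minimal sketch of a naive word-level scheme in Python. It is purely illustrative and not the tokenizer of any particular model: text is split into pieces, and each unique piece is given a numerical identifier.

```python
# A naive, illustrative word-level tokenizer. Real LLM tokenizers are
# subword-based and far more sophisticated; this only shows the idea of
# turning text into numbered units.

def tokenize(text: str) -> list[str]:
    # Lowercase and split on whitespace; punctuation handling is ignored here.
    return text.lower().split()

def build_vocabulary(tokens: list[str]) -> dict[str, int]:
    # Assign each unique token a numerical identifier.
    return {token: idx for idx, token in enumerate(sorted(set(tokens)))}

text = "The dealer shuffles the deck and the game begins"
tokens = tokenize(text)
vocab = build_vocabulary(tokens)
ids = [vocab[t] for t in tokens]

print(tokens)  # ['the', 'dealer', 'shuffles', 'the', 'deck', ...]
print(ids)     # the same sentence as a sequence of numbers
```

The same word always maps to the same number, which is what lets the model treat text as data it can compute on.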
There is another term related to tokenization: granularity. In breaking down text instructions, called prompts, the AI sees common sequences of characters found in text. In OpenAI’s ChatGPT, the models “understand the statistical relationships between these tokens, and excel at producing the next token in a sequence of tokens,” according to OpenAI’s platform documentation.
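For a look at how a GPT-style tokenizer actually splits text, a short sketch follows. It assumes OpenAI’s open-source tiktoken library is installed; the encoding name “cl100k_base” is the byte-pair encoding associated with recent GPT models, and the exact token boundaries and counts should be treated as illustrative.

```python
# Requires: pip install tiktoken
import tiktoken

# "cl100k_base" is a byte-pair encoding used by recent GPT models.
enc = tiktoken.get_encoding("cl100k_base")

text = "Tokenization turns text into numbers"
token_ids = enc.encode(text)

print(token_ids)                                 # a list of integers
print([enc.decode([tid]) for tid in token_ids])  # the text chunk behind each id
```

Notice that frequent words tend to come out as a single token, while rarer words are split into several smaller pieces.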
Tokenization’s crucial role
Tokenization plays a crucial role in generative AI and large language models like ChatGPT.
These models are trained on large amounts of text data and use tokenization to break the text into individual tokens or words. Tokenization is a fundamental pre-processing step for most natural language processing (NLP) applications.
For instance, Bing’s new AI-powered search engine and Edge browser use tokenization to deliver better search results, more complete answers, a new chat experience, and the ability to generate content. The new Bing gives you an improved version of the familiar search experience, providing more relevant results for simple things like sports scores, stock prices, and weather, along with a new sidebar that shows more comprehensive answers if you want them.
Bing reviews results from across the web to find and summarize the answer you’re looking for. For more complex searches — such as for planning a detailed trip itinerary or researching what TV to buy — the new Bing offers new, interactive chat.
The tokenization process
From a process standpoint, this is how tokenization works when turning a prompt (or question) into an answer (or output). The sequence runs as follows, with a small code sketch after the list:
- Text Input: The AI receives a piece of text as input. This text can be a sentence, a paragraph, or even an entire document.
- Segmentation: The text is divided into segments, which can be sentences, paragraphs, or other logical chunks. Each segment is then processed independently, which can improve the performance of tokenization.
- Tokenization: Within each segment, the text is broken down into tokens. Tokens can be individual words (e.g., “cat”), subwords (e.g., “unhappi” as part of “unhappy”), or characters, depending on the tokenization method used.
- Special Tokens: In NLP, special tokens are often added to provide context or instructions. For instance, a “start of sequence” token or an “end of sequence” token might be added to indicate the beginning and end of a sentence.
- Vocabulary Mapping: Each token is mapped to an index in a predefined vocabulary. This vocabulary consists of a list of unique tokens that the AI model understands. If a token is not in the vocabulary, it might be split into subwords or characters that are in the vocabulary.
- Encoding: The AI model typically represents tokens as numerical values (indices) corresponding to their positions in the vocabulary. This numerical representation enables the model to process and analyze the text.
- Adding Special Information: Information such as positional embeddings (to indicate the position of a token in the sequence) and attention masks (to determine which tokens should receive more attention during processing) might be added.
- Input Format: Finally, the tokenized text is converted into a suitable format for input to the AI model. This format often involves arranging the tokens in a sequence, adding special tokens, and padding if necessary.
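The toy sketch below walks through these steps end to end. The special token names, the tiny vocabulary, and the fixed input length are all invented for illustration; real models use much larger subword vocabularies, and positional information is added inside the model itself.

```python
# Toy end-to-end pipeline: segment, tokenize, add special tokens,
# map to vocabulary indices, build an attention mask, and pad.
# Every name and size here is illustrative, not taken from a real model.

SPECIAL = {"<pad>": 0, "<bos>": 1, "<eos>": 2, "<unk>": 3}
VOCAB = {**SPECIAL, "the": 4, "cat": 5, "sat": 6, "on": 7, "mat": 8}
MAX_LEN = 10  # fixed input length our imaginary model expects

def segment(text: str) -> list[str]:
    # Segmentation: split on sentence-ending periods.
    return [s.strip() for s in text.split(".") if s.strip()]

def tokenize(segment_text: str) -> list[str]:
    # Tokenization: naive whitespace split (real models use subwords).
    return segment_text.lower().split()

def encode(tokens: list[str]) -> tuple[list[int], list[int]]:
    # Special tokens mark the start and end of the sequence.
    ids = [VOCAB["<bos>"]]
    for tok in tokens:
        # Vocabulary mapping: unknown tokens fall back to <unk>.
        ids.append(VOCAB.get(tok, VOCAB["<unk>"]))
    ids.append(VOCAB["<eos>"])
    # Attention mask: 1 for real tokens, 0 for padding.
    mask = [1] * len(ids)
    # Padding brings every sequence up to the fixed input length.
    while len(ids) < MAX_LEN:
        ids.append(VOCAB["<pad>"])
        mask.append(0)
    return ids, mask

for seg in segment("The cat sat on the mat. The dog barked."):
    ids, mask = encode(tokenize(seg))
    print(seg, "->", ids, mask)
```

Running it shows how each sentence becomes a fixed-length row of numbers, with words the vocabulary does not know collapsing to the <unk> token.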
“Tokenization allows AI models to process and understand human language by breaking it down into manageable units,” said Vitaly Kamluk, head of Kaspersky’s Global Research and Analysis Team (GReAT), who concluded that the choice of tokenization method and vocabulary can significantly impact the performance of language models, as they affect how well the model can capture the nuances of text data.
Why AI needs to tokenize
Artificial Intelligence (AI) isn’t a software package. What we see or experience in Bard or ChatGPT is the culmination of the processes listed above. AI is actually a broad field of computer science, and one of its aims is to create machines that attempt to mimic the human brain to perform tasks that typically require human intelligence, such as visual perception, speech recognition, decision-making, and language translation.
AI mimics only the simplest operations of the brain. It therefore needs Machine Learning (ML) to put all of this together through algorithms. ML is a subset of AI that involves training algorithms to learn from data and make predictions or decisions without being explicitly programmed. In other words, ML is a way to achieve AI.
For AI to respond to queries and provide answers, Large Language Models (LLMs) are needed. LLMs are a type of AI system that works with language. They are the algorithmic basis for chatbots like OpenAI’s ChatGPT and Google’s Bard. LLMs are currently trained on a massive trove of articles, Wikipedia entries, books, internet-based resources, and other input to produce human-like responses to natural language queries.
Thus LLMs use tokens as the basic units of text or code to process and generate language; these tokens are the building blocks of language for the model.