A Gigamodel in a china shop… Part 1

Summary

Draw me a Gigamodel

First playground: language

Unlimited data with self-supervised learning

The cement of "foundation models"

LLMs, gigantic and generative language models

A Gigamodel writes code... with or without a license

Gigamodels on display

Speech is not left behind

Conference: ViaDialog participates in the Artificial Intelligence Festival on November 15, 2022!



For the past two or three years, Gigamodels such as BERT, GPT-3, wav2vec, or DALL-E have been a major trend in Artificial Intelligence (AI). These models deliver impressive performance, exceed a billion parameters, require dedicated infrastructure, and are produced by only a small handful of players. How is this new landscape disrupting both AI research and industry?

Draw me a Gigamodel

What is a Gigamodel? A model with hundreds of millions of parameters, or even hundreds of billions for the largest ones today. Think of BERT, GPT-3, wav2vec (pronounced in English as "wave to vec"), DALL-E (pronounced like "Dali")... Some call them "foundation models" or "pre-trained models", others speak of "large language models" (LLMs). They are generally models that rely on self-supervised learning. But what do these terms cover?

First playground: language

The first Gigamodels to gain attention were natural language processing (NLP) models: the BERT model (Bidirectional Encoder Representations from Transformers), released by Google in 2018, which has several hundred million parameters, followed by the highly publicized, but much more closed, GPT-3 model trained by OpenAI and released in 2020, the first to reach one hundred billion parameters.

BERT, like GPT-3, takes text as input, or rather "embeddings" in the spirit of word2vec, which encode each word as a vector based on the distribution of its neighboring words, and performs language processing tasks. They share a so-called "transformer" architecture, a deep neural architecture typically comprising an encoder, a decoder, and attention layers, which makes them capable of taking an extended context into account. Above all, they are able to learn from very large volumes of unlabeled raw data, which is known as "self-supervised" learning.
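To make this concrete, here is a minimal sketch of how a pre-trained BERT turns a sentence into contextual vectors. It assumes the Hugging Face transformers library and the public bert-base-uncased checkpoint, neither of which is prescribed by the article.

```python
# Minimal sketch: contextual embeddings from a pre-trained BERT.
# Assumes the Hugging Face "transformers" library and the public
# "bert-base-uncased" checkpoint (illustrative choices only).
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Gigamodels are reshaping NLP.", return_tensors="pt")
outputs = model(**inputs)

# One vector per (sub)word token, each informed by the surrounding context.
print(outputs.last_hidden_state.shape)  # (batch_size, number_of_tokens, 768)
```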

Unlimited data with self-supervised learning

Traditional models obtained through machine learning use supervised learning: input data, such as sentences, are presented to the model, which tries to predict an output, such as a class label, that is then compared with the reference label. The model adjusts itself as needed to get as close as possible to the desired output. The availability of sufficient written and labeled data for each targeted task, each language, and each domain is a limiting factor of this kind of approach.
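As a point of comparison, here is a tiny sketch of that supervised setup, using scikit-learn with toy sentences and labels; the library choice and the data are purely illustrative.

```python
# Minimal sketch of supervised learning for text classification.
# The library (scikit-learn) and the toy sentences/labels are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

sentences = ["the service was excellent", "I waited an hour for nothing",
             "very helpful agent", "the call was a waste of time"]
labels = ["positive", "negative", "positive", "negative"]  # reference labels

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(sentences, labels)  # predictions are compared with the labels, then adjusted
print(model.predict(["the agent was very kind"]))
```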

Self-supervised learning, on the other hand, consists of learning a generic model from unlabeled data, optimizing it on self-supervised tasks that depend only on the data itself: for example, as illustrated in the diagram below, guessing words that have been arbitrarily masked in a text from all the other surrounding words, or guessing whether two sentences follow each other in the corpus or not.
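The masked-word game can be tried directly with a pre-trained model; the short sketch below assumes the Hugging Face transformers library and the bert-base-uncased checkpoint, both illustrative choices.

```python
# Minimal sketch of the masked-word self-supervised task.
# Assumes the Hugging Face "transformers" library and "bert-base-uncased".
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The model must guess the hidden word from all the surrounding words.
for prediction in fill_mask("Paris is the [MASK] of France."):
    print(prediction["token_str"], round(prediction["score"], 3))
```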

This type of self-supervised approach frees itself from the need for labeled data and opens up the possibility of learning generic models on very large volumes of raw data collected broadly. The self-supervised model implicitly captures information about the very structure of the data, here the distribution of words in the sentences of the language, their synonyms, antonyms, analogies, agreements, etc.

The generic model, also called a pre-trained model, can then be complemented with specific "fine-tuning" for the targeted task (classification, sentiment analysis, question answering, language understanding, automatic summarization, translation...), and a much smaller volume of labeled data may suffice, since the generic model already comes with strong generalization capabilities.
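As an illustration of this second step, the sketch below fine-tunes a pre-trained BERT on a handful of labeled sentences; the libraries (transformers and datasets), the checkpoint, and the toy data are all assumptions made for the example.

```python
# Minimal fine-tuning sketch: a generic pre-trained BERT plus a small labeled set.
# Libraries (transformers, datasets), checkpoint and data are illustrative.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

data = Dataset.from_dict({
    "text": ["great service", "terrible wait time", "very helpful", "never again"],
    "label": [1, 0, 1, 0],
})

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=32)

data = data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-finetuned",
                           num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=data,
)
trainer.train()  # only this small labeled set is added on top of the generic model
```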

The cement of "foundation models"

Upon its release in 2018, the BERT model revolutionized the world of natural language processing by allowing "fine-tuned" models to dethrone previous approaches on most NLP benchmarks, particularly on semantic analysis tasks. BERT is notably integrated into Google's search engine to provide structured answers, when possible, to semantically analyzed queries.

The fact that a single pre-trained model can be adapted into numerous tailored models led to the term "foundation models": models that serve as a base on which others are built.

LLMs, gigantic and generative language models

After BERT, GPT-3 amazed with its native text generation capabilities and its abilities as a "few-shot learner", or even a "zero-shot learner": the ability to learn a new task from just a few examples, or even without any prior example at all.
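Few-shot learning works by placing a handful of worked examples directly in the prompt. Since GPT-3 sits behind a paid API, the sketch below uses the small open GPT-2 model as a stand-in, purely to show the shape of such a prompt.

```python
# Sketch of a few-shot prompt: the task is "taught" inside the prompt itself.
# GPT-2 is used here only as a freely available stand-in for GPT-3.
from transformers import pipeline

generate = pipeline("text-generation", model="gpt2")

prompt = (
    "Translate English to French.\n"
    "sea otter => loutre de mer\n"
    "cheese => fromage\n"
    "peppermint => "
)
print(generate(prompt, max_new_tokens=5, do_sample=False)[0]["generated_text"])
```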

These models are called "language models" and, for some time now, "large language models" (LLMs), owing to their gigantic size.

All these LLMs (GPT-3, T5, LaMDA, BLOOM, OPT-175B, Megatron-Turing NLG…) have generative capabilities.

At its core, a language model is simply a model that assigns a probability to sequences of words or characters. Do recent LLMs exhibit language capabilities of their own, going beyond this initial scope? Despite their prowess, the debate remains open...
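That definition can be made concrete: the sketch below scores two sentences with a small causal language model (GPT-2, assumed here only because it is freely available), summing the log-probability of each token given the tokens before it.

```python
# Sketch: a language model assigns a probability to a word sequence.
# GPT-2 is assumed here only because it is small and freely available.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def log_prob(sentence):
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # The loss is the average negative log-likelihood of each next token.
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.shape[1] - 1)

print(log_prob("The cat sat on the mat."))
print(log_prob("Mat the on sat cat the."))  # should score lower
```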

The generative capability of LLMs is used not only to produce natural (i.e., human) language but also other kinds of output, the most emblematic at the moment being computer code and images.

A Gigamodel writes code... with or without a license

In 2021, Microsoft, in partnership with OpenAI, in which the Redmond firm invested $1 billion in 2019, and its subsidiary GitHub, a very popular code hosting platform, released "GitHub Copilot", a paid application that suggests lines of code to developers. A survey of its users indicates that about a third of their code is already produced by Copilot.

GitHub Copilot was trained on enormous amounts of open-source code collected from the web. A collective recently filed a lawsuit against Microsoft, GitHub, and OpenAI, claiming that entire portions of open-source code can be "spat out" verbatim by Copilot without attribution to their authors, in violation of open-source licenses. The proceedings are only just beginning; whatever the outcome, it will have an impact on the rights surrounding all Gigamodels.

Gigamodels on display

The most visible Gigamodels at the moment are certainly text-to-image generation models such as DALL-E, released in 2021 by OpenAI, followed by Stable Diffusion, MidJourney, and Parti/Imagen (Google).

These models, trained on enormous quantities of images and texts collected from the Web, produce, from textual statements called "prompts", images that are generally aesthetic and highly realistic, like the famous "a teddy bear on a skateboard in Times Square" produced by DALL-E 2.
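Of these models, only Stable Diffusion has openly released weights, so the sketch below uses it through the Hugging Face diffusers library; the checkpoint name and the need for a GPU are assumptions made for the example.

```python
# Sketch: generating an image from a text prompt with Stable Diffusion.
# Assumes the "diffusers" library, the "runwayml/stable-diffusion-v1-5"
# checkpoint and a CUDA GPU; none of these is prescribed by the article.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe("a teddy bear on a skateboard in Times Square").images[0]
image.save("teddy_bear_times_square.png")
```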

We can also mention this image produced by MidJourney to illustrate an article in The Economist titled "Huge Foundation models are turbo-charging AI progress".

A work produced with MidJourney recently won an award in Colorado, in the "digital art" category, which certainly sparked controversy.

A new feature called "outpainting" also makes it possible to imagine continuations of existing works beyond their frame.

Speech is not left behind

The deep learning revolution began in 2012 with image recognition and speech recognition, as evidenced by the Turing Award given in 2019 jointly to Yann LeCun (image recognition), Yoshua Bengio (speech recognition), and Geoff Hinton (mentor to both).

Natural language processing quickly seized on these new approaches and was the first to give rise to self-supervised Gigamodels. It was only natural that speech processing and image processing would, in turn, feed off the advances made in language processing.

It was in 2019 that wav2vec was released, followed by wav2vec2 in 2020, from the laboratories of Meta (then Facebook). Wav2vec is the audio equivalent of BERT: a self-supervised model learned from large amounts of unlabeled raw audio.

Because the speech signal is continuous by nature, wav2vec2 must first learn discrete audio tokens (marked with the letter "q" in the diagram below) to characterize sounds, before playing, like BERT, the game of guessing masked portions of the data.

Since 2020, the top places in classic speech recognition benchmarks have been swept by systems based on wav2vec2, as shown in the leaderboard below for the "LibriSpeech" benchmark, a corpus of read speech from numerous speakers...
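As an illustration, the sketch below transcribes a 16 kHz audio clip with a wav2vec2 model fine-tuned for recognition on LibriSpeech; the libraries (transformers, soundfile), the checkpoint, and the sample file name are assumptions made for the example.

```python
# Sketch: speech recognition with a wav2vec2 model fine-tuned for ASR (CTC).
# Assumes the "transformers" and "soundfile" libraries, the public
# "facebook/wav2vec2-base-960h" checkpoint and a 16 kHz mono WAV file.
import torch
import soundfile as sf
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

speech, sampling_rate = sf.read("sample_16khz.wav")  # hypothetical audio file
inputs = processor(speech, sampling_rate=sampling_rate, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits

predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])
```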

But then, what radical changes are Gigamodels poised to bring to the world of research and industry?

To be continued…

Conference: ViaDialog participates in the Artificial Intelligence Festival on November 15, 2022!

The Avignon Computer Science Laboratory (LIA) and the LIAvignon chair in artificial intelligence are hosting the 3rd edition of the AI festival this year, from November 14 to 16.

ViaDialog will take part alongside Google and Hugging Face, giving a talk on November 15. Ariane Nabeth-Halber, Director of the AI Division, will join the conference and the debates around Gigamodels in AI.

Location: Campus Hannah Arendt, Avignon University, room AT06, starting at 4:30 PM.
