Meta, the parent company of Facebook, Instagram and WhatsApp, has entered the artificial intelligence race by launching CM3leon, a generative artificial intelligence (AI) model that can generate images from text and text from images.
The company announced that CM3leon is the first multimodal model trained with a recipe adapted from text-only language models, including a large-scale retrieval-augmented pre-training stage and a second stage of multitask supervised fine-tuning (SFT).
“This recipe produces a simple yet robust model, and further demonstrates that tokenizer-based transformers can be trained as effectively as existing generative diffusion-based models,” the company says.
CM3leon achieves state-of-the-art performance in text-to-image generation, despite being trained with five times less compute than previous transformer-based methods.
It retains the versatility and strength of autoregressive models while keeping training costs low and inference efficient. “It is a causal masked mixed-modal model (CM3) because it can generate sequences of text and images conditioned on arbitrary sequences of other image and text content. This greatly expands the functionality of previous models, which were either text-to-image or image-to-text only.”
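To make the mixed-modal idea concrete, the following is a minimal Python sketch of how text and image content can be flattened into one token stream that a single autoregressive model conditions on. The tokenizer stand-ins, vocabulary sizes, and id scheme are illustrative placeholders, not CM3leon’s actual components.

```python
# Illustrative only: stand-ins for a real text tokenizer and a real image
# tokenizer (e.g. a VQ-style encoder). All sizes are placeholder choices.

TEXT_VOCAB_SIZE = 50_000     # hypothetical text vocabulary size
IMAGE_CODEBOOK_SIZE = 8_192  # hypothetical image codebook size
IMAGE_GRID_TOKENS = 16       # hypothetical tokens per image

def text_tokens(text: str) -> list[int]:
    # Stand-in tokenizer: one arbitrary id per word (ids vary across runs).
    return [hash(word) % TEXT_VOCAB_SIZE for word in text.split()]

def image_tokens(image_id: str) -> list[int]:
    # Stand-in image tokenizer: a fixed-length grid of codebook ids.
    # Offsetting by TEXT_VOCAB_SIZE keeps the two modalities in
    # disjoint id ranges within one shared vocabulary.
    return [
        TEXT_VOCAB_SIZE + (hash((image_id, i)) % IMAGE_CODEBOOK_SIZE)
        for i in range(IMAGE_GRID_TOKENS)
    ]

# One flat sequence interleaving a text instruction with an image; an
# autoregressive model over this stream can condition any modality on
# any prefix, e.g. continue the sequence with the tokens of an edited image.
sequence = text_tokens("change the color of the sky") + image_tokens("input_photo")
print(len(sequence), sequence[:8])
```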
Its capabilities include generating images from text; for example, typing ‘Small cactus in the Sahara with straw hat and neon sunglasses’ will produce an image matching that description.
Another function is editing an image based on a text instruction, such as “change the color of the sky” in a photo or “add a mustache” to Johannes Vermeer’s painting “Girl with a Pearl Earring”.
Additionally, the tool may be asked to describe a photo in words.
According to Meta, the CM3leon architecture uses a decoder-only transformer, similar to well-established text-based models.
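For readers unfamiliar with the term, here is a minimal, self-contained PyTorch sketch of a decoder-only transformer: token embeddings flow through blocks of causally masked self-attention and feed-forward layers, and the output is next-token logits. The layer count, dimensions, and vocabulary size below are placeholder choices, not CM3leon’s published configuration.

```python
# A toy decoder-only transformer, for illustration; not CM3leon's architecture.
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One pre-norm decoder block: causal self-attention + feed-forward."""
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal mask: each position may attend only to itself and earlier
        # positions, which is what makes autoregressive generation possible.
        seq_len = x.size(1)
        mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device),
            diagonal=1,
        )
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out
        return x + self.ff(self.norm2(x))

class TinyDecoderLM(nn.Module):
    """Token embedding -> decoder blocks -> next-token logits."""
    def __init__(self, vocab_size: int = 1000, d_model: int = 512,
                 n_layers: int = 2, max_len: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)  # learned positions
        self.blocks = nn.ModuleList(DecoderBlock(d_model) for _ in range(n_layers))
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        positions = torch.arange(tokens.size(1), device=tokens.device)
        x = self.embed(tokens) + self.pos(positions)
        for block in self.blocks:
            x = block(x)
        return self.out(x)  # logits over the next token at each position

# Usage: next-token logits for a batch of 2 sequences of 16 token ids.
model = TinyDecoderLM()
logits = model(torch.randint(0, 1000, (2, 16)))  # shape (2, 16, 1000)
```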
“With the goal of building high-quality generative models, we believe CM3leon’s strong performance across a variety of tasks is a step toward higher-fidelity image generation and understanding. Models like CM3leon can help support creativity and better applications, pushing the limits of multimodal language models. We look forward to exploring and launching more models in the future.”
Source: Exame
