Google Unveils Gemini Omni: AI Creates Video From Any Input

Google's new Gemini Omni AI models can now generate video from images, audio, and text, integrating multiple inputs to create coherent outputs. This marks a significant step towards a fully multimodal AI.

Lisa Thomas
Lisa Thomas covers biotech & health for Techawave.

At its Google I/O developer conference on May 19, 2026, Google announced Gemini Omni, a new family of multimodal artificial intelligence models designed to generate content from a combination of inputs including text, images, and audio. Google CEO Sundar Pichai stated that Omni aims to “create anything from any input,” with its initial application focusing on video creation. Unlike previous models that might simply stitch together different media types, Gemini Omni reasons across all provided inputs to produce a unified, consistent video output that demonstrates an understanding of physical and cultural contexts.

Google DeepMind director of product management Nicole Brichtova highlighted that Gemini Omni represents a significant evolution beyond existing tools like Veo, which primarily converts text and images into videos. "It’s the next step towards the progression of combining the intelligence of Gemini with the rendering capabilities of our media models," Brichtova explained. During a media briefing, DeepMind's chief technologist Koray Kavukcuoglu demonstrated Omni's capabilities by showing a claymation video explaining protein folding, generated from a simple text prompt. The AI produced a stop-motion explainer complete with accurate narration.

Advancing Multimodal AI and Reality Simulation

The long-term vision for Gemini Omni extends beyond video generation. Google envisions the model being capable of generating images from audio or audio from video, showcasing a truly comprehensive multimodal understanding. "When we first announced Gemini, it was our first AI model to be natively multimodal," Pichai said. "We knew that training it on a combination of text, code, audio, images, and video would give it a deeper understanding of the world. With world models, AI is moving from predicting text to simulating reality. Gemini Omni is the next step in that direction." This ambition positions Google as a leader in developing AI that can process and generate content across a wide spectrum of modalities.

Addressing concerns about misuse, Google has implemented safeguards for its new avatar generation feature, a capability popularized by OpenAI's Sora. Users will undergo a dedicated onboarding process, including recording themselves speaking a series of numbers, to verify their identity before an avatar can be created. Furthermore, all videos produced by Gemini Omni will feature Google's SynthID digital watermark, allowing for easy verification of AI-generated content. This commitment to transparency and security aims to combat the proliferation of deepfakes.

The first model within the Omni family, Gemini Omni Flash, is set to roll out immediately to the Gemini app, YouTube Shorts, and the AI creative studio Flow. Flash will initially generate 10-second videos. Brichtova clarified that this duration is a deliberate choice to facilitate wider user adoption and cater to current user preferences, with longer video generation capabilities planned for future updates. Google is positioning Flash as a consumer-focused tool, exemplified by personal use cases like creating videos of oneself receiving awards or simulating travel experiences. Research engineer Gabe Barth-Maron described these as akin to "personalized memes."

Despite the initial consumer focus, the potential for Gemini Omni in enterprise and creative sectors is substantial. Google plans to make Omni available via an API in the coming weeks, enabling developers and businesses to integrate its advanced capabilities into their own applications. Content creators are expected to leverage the avatar-generation tool, available on Shorts starting today. For industries like advertising and filmmaking, an end-to-end multimodal workflow could be transformative. Startups like Luma AI are developing similar unified models for generating ad campaigns. "We’re actually pretty proud of the model’s text-rendering capabilities, which is really useful for things like advertising," Brichtova noted. "If you want a product somewhere, or even just a slogan, it needs to be accurate..." The Omni Pro model, intended for more demanding professional use cases, is also in development, with Google promising a release when it offers a distinct advancement over the Flash version.
