Google’s AI just got ears

The Google Gemini AI logo. Image: Google

AI chatbots are already capable of “seeing” the world through images and video. Now Google has announced audio-understanding capabilities as part of its latest update to Gemini Pro. In Gemini 1.5 Pro, the model can “hear” audio files uploaded into its system and extract the information in them as text.

The company has made this version of the LLM available as a public preview on its Vertex AI development platform. That opens the feature to more enterprise-focused users and expands its user base beyond the private rollout in February, when the model was first announced and offered only to a limited group of developers and enterprise customers.

1. Breaking down + understanding a long video

I uploaded the entire NBA dunk contest from last night and asked which dunk had the highest score.

Gemini 1.5 was incredibly able to find the specific perfect 50 dunk and details from just its long context video understanding! pic.twitter.com/01iUfqfiAO

— Rowan Cheung (@rowancheung) February 18, 2024

Google shared details of the update at its Cloud Next conference, which is currently taking place in Las Vegas. Having previously called Gemini Ultra, the LLM that powers its Gemini Advanced chatbot, the most powerful model in the Gemini family, Google now calls Gemini 1.5 Pro its most capable generative model. The company added that this version is better at learning without additional fine-tuning of the model.

Gemini 1.5 Pro is multimodal: it can turn different types of audio into text, including audio from TV shows, movies, radio broadcasts, and conference call recordings. It is also multilingual, so it can process audio in several different languages. The LLM may also be able to create transcripts from videos; however, the quality of those transcripts may be unreliable, as TechCrunch noted.

When the model was first announced, Google explained that Gemini 1.5 Pro uses a token system to process raw data. A million tokens equate to approximately 700,000 words or 30,000 lines of code. In media terms, that is about an hour of video or around 11 hours of audio.
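To get a feel for those numbers, the rough ratios can be turned into a back-of-the-envelope estimator. The Python sketch below is only an illustration: it assumes the figures quoted above scale linearly, it does not use Google's actual tokenizer, and the names `estimate_tokens` and `fits_in_window` are hypothetical helpers rather than part of any Google API.

```python
# Approximate tokens per unit, derived from the figures quoted in the article:
# 1,000,000 tokens ~ 700,000 words ~ 30,000 lines of code
#                  ~ 1 hour of video ~ 11 hours of audio.
# These are ballpark ratios (an assumption), not the model's real tokenizer.
WINDOW_TOKENS = 1_000_000

TOKENS_PER_UNIT = {
    "words": WINDOW_TOKENS / 700_000,
    "code_lines": WINDOW_TOKENS / 30_000,
    "video_hours": WINDOW_TOKENS / 1,
    "audio_hours": WINDOW_TOKENS / 11,
}


def estimate_tokens(**amounts: float) -> int:
    """Estimate the token count of a mixed input (hypothetical helper)."""
    total = 0.0
    for kind, amount in amounts.items():
        if kind not in TOKENS_PER_UNIT:
            raise ValueError(f"unknown input type: {kind}")
        total += amount * TOKENS_PER_UNIT[kind]
    return round(total)


def fits_in_window(**amounts: float) -> bool:
    """Check whether the estimated input fits in a one-million-token budget."""
    return estimate_tokens(**amounts) <= WINDOW_TOKENS


if __name__ == "__main__":
    # A 90-minute conference call plus a 20,000-word written report.
    print(estimate_tokens(audio_hours=1.5, words=20_000))
    print(fits_in_window(audio_hours=1.5, words=20_000))
```

By this estimate, a 90-minute call recording plus a 20,000-word report comes to roughly 165,000 tokens, well within a million-token budget, while two hours of video on its own would not fit.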

Some private preview demos of Gemini 1.5 Pro have shown how the LLM is able to find specific moments in a video transcript. For example, AI enthusiast Rowan Cheung got early access and detailed how his demo found an exact action shot in a sports contest and summarized the event, as seen in the tweet embedded above.

However, Google noted that other early adopters, including United Wholesale Mortgage, TBS, and Replit, are opting for more enterprise-focused use cases, such as mortgage underwriting, automating metadata tagging, and generating, explaining, and updating code.
