Wire

5/30 BUILD DAY DeepMind & Gemini · Hillsborough

VEO3 enters second Hollywood production pipeline

GEMINI retention curve outperforms Q1 projections

EU signals multimodal-specific AI Act provisions

SAG-AFTRA expands likeness enforcement into multimodal scope

SINGAPORE launches sovereign multimodal compute program

CAPITAL multimodal infra rounds overtake LLM-only

DINNER 04/27 thirteen at table · three convictions sharpened

AGI HOUSE

VOL. 01 · NO. 01 · MAY 2026

Biweekly Intelligence Briefing · Issue 01

The Merging of Modalities.

Image, video, text, and world-knowledge are collapsing into single general-purpose models. What that means for the frontier, the application layer, and the institutions absorbing the shift — ahead of the DeepMind & Gemini Build Day on May 30.

Inside this issue

01 · The Foyer · Perspectives on the merging frontier
02 · Field Notes from the Frontier · Voices and numbers from the room
03 · The Dinner Table · World models, agents, and the path to AGI
04 · The Reading Room · Memos and papers worth the time
05 · The Great Hall · An institutional event in disguise
06 · Landscaping · Economic indicators and the legal terrain
07 · The Yard · Closing word and the wire

Vision

AGI House exists because artificial general intelligence is the most consequential technology of our generation, and the people building it should be in rooms with each other often. The Intelligence Report translates what happens in those rooms into a recurring publication for builders, operators, and institutions. We publish biweekly, AI-natively, with names on the bylines.

Contributors

Reported by the AGI House team and community speakers. Analytics and insight by the AGI House Platform. Produced by Katherina Nguyen.

§ 01 · The Foyer

The Foyer.

Two perspectives on the theme from the editors, and the history underneath it.

Rocky Yu

Editor · CEO & Founder, AGI House

May 25, 2026

The merging of modalities is a story about teams, not models.

When Nicole and Naina from the Nanobanana team came on the AGI House podcast, the line that stayed with me wasn't about diffusion or training data. It was an aside about org structure: the Imagen team and the Gemini team had been folded into one, and the model came out of the merged team. The merged team came before the merged model. I've now heard a version of that sentence in three different rooms at AGI House — from frontier-lab researchers, from a Veo PM, from one of our portfolio founders shipping at the seams between image and video. The pattern is too consistent to ignore.

So here's the thing I'd bet on. The companies that ship coherent multimodal products over the next two years won't be the ones with the best image model, or the best video model, or the deepest audio bench. They'll be the ones whose research and product orgs were structured to converge first. Everyone else will keep shipping incoherent surfaces glued onto strong components, and they'll wonder why the experience never feels like one thing. The Build Day on May 30 is, partly, a wager on this — we're putting the people who work at the seams in one room and seeing what they ship in eight hours. If I'm right about the team thesis, the winning demos will look like the teams that built them.

— RY

Katherina Huong Nguyen

Editor · Frontier Tech Strategist

May 25, 2026

When modalities merge, so do the legal and labor categories built around them.

— KN

A short history of the modality boundary

The technical separation of modalities is older than computing. Text and image had different printing presses. Image and film had different studios. Film and television had different broadcast regulators. Each separation produced its own infrastructure (publishing, photography, cinema, broadcasting), its own labor force (writers, photographers, directors, producers), and its own legal regime. When digital media arrived in the 1990s, each modality kept its inherited category: JPEG was an image, MP3 was audio, MP4 was video, and HTML was text, even though all four were now the same underlying bits. With Gemini, GPT-4o, Claude, and the other multimodal frontier models, the modality boundary is dissolving at the production layer for the first time, not just at the delivery layer. The institutional consequences are still being worked out.

Five threads worth following.

Signposts

Image-to-Everything Models

The Nano Banana team's explicit thesis: modalities lift each other up as the base model scales.

Field Notes →

Video as Frontier Modality

Veo3's positioning as "the world's best video model" — and what "best" means across consumer, advertiser, and filmmaker.

Portrait Gallery →

World Models & Agents

From the April dinner: do explicit world models matter for AGI, or do scaled end-to-end systems learn them implicitly?

Dinner Table →

Diversification of Use

The Anthropic Economic Index shows consumer model use diversifying even as API use concentrates. Why that matters for multimodal procurement.

Landscaping →

The Institutional Vacuum

When modalities merge, legal categories built around them stop mapping cleanly. Whoever sets de facto policy first wins.

Great Hall →

§ 02 · Field Notes from the Frontier

Field Notes from the Frontier.

Voices and numbers from the room.

Fig. 01 · Frontier Multimodal Capability Map · 2024 → 2026 · AGI House Research · normalized capability index

Four modalities, four trajectories — and the merge point.

Public benchmark scores normalized to a 0–100 capability index, plotted across eight quarters. Text plateaus first; image and audio climb steadily; video makes the largest gain and pulls the others into a shared envelope by Q2 2026.

Series: Text · Image · Audio · Video

Source: AGI House Research · normalized from MMLU, GPQA, MMMU, AudioBench, VideoBench-2026
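The figure does not say how raw benchmark scores were mapped onto the 0–100 index. As a rough sketch only, assuming simple min-max scaling per benchmark and a plain average per modality, the arithmetic might look like this. The function names, floors, ceilings, and scores are all hypothetical illustrations, not AGI House Research's actual method or data.

```python
from statistics import mean


def to_index(score: float, floor: float, ceiling: float) -> float:
    """Min-max scale a raw benchmark score onto 0-100, clamped to that range."""
    if ceiling <= floor:
        raise ValueError("ceiling must exceed floor")
    scaled = 100.0 * (score - floor) / (ceiling - floor)
    return max(0.0, min(100.0, scaled))


def modality_index(results: list[tuple[float, float, float]]) -> float:
    """Average the scaled scores of every benchmark assigned to one modality."""
    return mean(to_index(score, lo, hi) for score, lo, hi in results)


# Hypothetical inputs: (raw score, chance-level floor, ceiling) per benchmark.
text_index = modality_index([(82.0, 25.0, 100.0), (60.0, 25.0, 100.0)])
print(round(text_index, 1))
```

Using a chance-level floor rather than zero is one way to keep multiple-choice benchmarks from inflating the index; whether the published chart does this is not stated.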

Field Notes · On the Record

From founders and frontier-lab researchers.

Yang Tang

Founder & CEO · Opus Clip

It's a combination of science and art. A combination of rationality and feeling. Everyone in our company watches a lot of videos. The model curation, the editing, the polish — all of it ingests human taste.

Mustafa Bhuiyan

Founder · Nomadic ML

Visual reasoning systems have to cater to your stack, not to general, not well-fitting solutions out there. Synthetic data doesn't survive contact with the long tail of physical environments.

Karan Vaidya

Cofounder · Composio

Integrations are going to probably become a commodity a year or two years down the line. The differentiation comes in through the data loop — knowing which three parameters out of ten thousand actually get used.

David Chen

GM, Robotics · LiveKit

Robot models will follow the LLM trajectory — growing too large for the edge, moving to cloud inference, requiring a realtime streaming layer between the robot and the model.

Introductions: research@agihouse.org

Portrait Gallery

Frontier figures.

Two from the labs shaping how multimodal models reach the world.

Nicole & Naina · 2025

Nicole & Naina

Product team, Nanobanana — the image model that helped Gemini overtake ChatGPT as the #1 free app.

At: Google DeepMind

Recorded: AGI House Podcast · Sep 2025

The merged team came before the merged model. Capabilities lift each other up as the base scales.

— On how Nanobanana came together inside DeepMind

Read or watch the full AGI House podcast interview

Ricky Wong · 2025

Ricky Wong

Lead PM, Veo — Google DeepMind's frontier video generation model.

At: Google DeepMind

Recorded: AGI House Interview Series · Jul 2025

Video is the hardest modality because every other modality lives inside it. Getting it right pulls everything else up.

— On why Veo's role spans consumer, advertiser, and filmmaker

Read or watch the full Veo3 interview

Frontier Figures · On the Record

Inside the lab with Coframe.

A working playback of Kat's interview with Josh Payne — the AI-native interview format the AGI House podcast has been refining. Listen, jump between moments, pull up a chair and ask the room.

By the Numbers

Portfolio spotlight, 5/16 retro, 5/30 preview.

A look at the portfolio shaping the issue's theme, the numbers from the room on May 16, and what's on the horizon.

AGI Ventures · Portfolio Spotlight

Five portfolio companies, framed against the Gemini-class merge thesis.

5/16 · Internet of Agents Build Day · As-built

Six numbers from the day.

Teams built: (value missing)
Attendees on site: 312
International teams: 11%
Top teams below L4: 7 / 10
Demos with voice: 43%
Sponsor credits used: $8.2M

Looking Forward · May 30

DeepMind & Gemini Build Day

Saturday, May 30, 2026 · AGI House, Hillsborough · In collaboration with Google DeepMind

Five live metrics from the room.

Teams Building

~40 expected

Modality Coverage

Image · Video · Audio · Text

Demos Using Veo

Target: 50%+

Demos Using Nano Banana

Target: 60%+

International Teams

Target: 20%+

§ 03 · The Dinner Table

The Dinner Table.

A column from the host's chair at the AGI House Dinner Series.

Hillsborough · April 27, 2026

At the table: Oriol Vinyals (VP Research, Google DeepMind; Gemini co-architect) · Andrew Dai (CEO, Elorean AI; 12-year DeepMind veteran) · Jiajun Wu (Stanford professor, vision & robotics) · Fan-yun Sun (Cofounder, Moonlight AI) · Zayd Enam (CEO, EnamCo; Cofounder, Cresta) · Nick Oupurov (CEO, Fleet AI) · Nazneen Rajani (CEO, Collinear; ex-Hugging Face) · Xiang Deng (Cofounder, NeoCognition) · Alex Wang (Stanford PhD) · Brian Zhan (Partner, Striker Venture Partners) · Bill Sun (CEO, GAlpha; early Google Brain attention researcher) · Andrew Ma (Director, Turing) · Rocky Yu (Host).

There is no single path forward. But there are clear fault lines shaping the future. This evening's question — what actually gets us to AGI? — surfaced a set of disagreements worth keeping, on world models, on gaming versus robotics, on systems versus weights, and on the compute bottleneck that quietly dominates everything else.

Invite only: AGI House Dinners are exclusive events. Attendance is by invitation of the host. To request an introduction, write to research@agihouse.org.

The dinner unfolded across four courses (appetizer, main, dessert, digestif); the disagreements from each appear in the counterfactual run below.

A Counterfactual Dinner

Run it back, in your seat.

Same room, same guests, new script. Sit at the head of the table, listen to the courses unfold, and pull up a chair when you want to push back. Generated live.

§ 04 · The Reading Room

The Reading Room.

Memos and papers worth the time.

A room of memos, papers, and field notes for the merging frontier.

From the AGI House Blog

Three posts most relevant to the merging-of-modalities theme.

AGI House Research · Multimodality

Native Multimodal Architectures: Why Cross-Modal Fusion Defines the Next Defensible Moat

Jessica Chen · AGI House Blog

A structural look at where fusion happens — encoder, latent, decoder — and which architectures defend value as the modality boundary collapses. The most direct companion piece to this issue's central thesis.

AGI House Research · Multimodality

Google Gemini 3: Launch Notes & Project Guide

Jessica Chen · AGI House Blog

The capability surface of Gemini 3, what the multimodal API now supports, and the project ideas that fit best — ahead of the May 30 DeepMind & Gemini Build Day.

AGI House Research · Dinner Series

World Models, Agents, and the Path to AGI

Rocky Yu · AGI House Blog

The full recap from the April 27 dinner referenced throughout §03. World models vs. scaled end-to-end, gaming vs. robotics, system vs. weights — and where AGI actually lands.

Suggested White Papers

From the academic and enterprise community.

ANTH · 2026.03

Anthropic · Economic Index

Anthropic Economic Index Report: Learning Curves & Diversification

Anthropic Research · March 2026

Empirical study of Claude.ai and API usage patterns, documenting consumer-side diversification, API-side concentration, and uneven geographic adoption.

anthropic.com / economic-index-march-2026-report →

ARXIV · 2026.04

arXiv · DeepMind

Scaling Laws for Multimodal Foundation Models

Google DeepMind · Apr 2026 · forthcoming

Empirical scaling laws for joint image-video-text-audio models, showing capability lifting effects as the base model scales.

arxiv.org / forthcoming →

STAN · 2026.AI

Stanford HAI · AI Index

AI Index Report 2026 · Multimodal Capability Tracking

Stanford Institute for Human-Centered AI · 2026 Annual

Annual capability benchmarks now include multimodal-specific tracking — a useful reference for procurement teams looking past text-only metrics.

aiindex.stanford.edu →

A16Z · 2025.12

a16z & OpenRouter

State of AI Usage · 100T-token Empirical Study

a16z · OpenRouter · Dec 2025

Empirical analysis of real-world LLM usage at scale, including the rise of Chinese open-weight models in coding and technical workloads.

a16z.com / state-of-ai-2025 →

MIT · 2025.NB

MIT NANDA · Industry Report

The Generative AI Divide: State of AI in Business 2025

MIT NANDA · 2025

The 95% enterprise pilot failure finding that has anchored much of the agent and multimodal-deployment discourse since publication.

nanda.media.mit.edu →

§ 05 · The Great Hall · Science, Technology & Society

The Merging of Modalities Is an Institutional Event in Disguise.

An STS reading of multimodal AI, against the institutional history of media. What the technology actually breaks, who has to rebuild what, and where the value will land.

By Katherina Huong Nguyen · Reading time: 12 min · Broadcast: 16 min

Every previous expansion of the media stack arrived with the institutions it needed already built. The printing press inherited copyright and publishing. Photography inherited likeness rights and the studio. Cinema inherited unions, guilds, and broadcast regulators. The multimodal frontier is the first time the production of media has expanded without the institutions to absorb it — and that absence, not the capability itself, is the story this issue is tracking.

I. What the merging actually does.

The technical story is simple to describe and easy to under-read. A frontier multimodal model is a single set of weights that ingests text, image, audio, and video, and emits any of the four. That is the surface change. The deeper change is that, for the first time, the production of media is no longer separated from its understanding. A model that can read a frame can write one. A model that can listen to a take can produce one. The pipeline collapses, and so does the rationale for having separate tools, separate teams, and — over a longer horizon — separate institutions to govern them.

The merge also breaks a working assumption that has been quietly load-bearing for two centuries: that each modality is its own object of regulation, its own labor market, and its own asset class. JPEGs were governed differently from MP3s not because the bits were different, but because the institutions producing them were. When the same model produces both, the institutions producing them are the same. That is the merger this issue is actually tracking.

The more you're building these general purpose models, they benefit everyone. Nano Banana benefits a lot from our image understanding capabilities, our video understanding capabilities. — Nicole, Google DeepMind Nanobanana team

The lift effect Nicole describes is not a marketing line. It is the empirical claim that capability in one modality raises capability in the others once the base model is shared. If true at scale, it inverts the procurement logic of the last cycle: instead of picking a best-in-class vendor per modality, the dominant strategy becomes consolidating on whichever vendor's base model is improving fastest across all four. The lab-to-product transmission belt narrows. So does the moat.

II. The institutional history this merger is colliding with.

Each modality acquired its institutional infrastructure at a different historical moment, in response to a different production technology, and around a different labor structure. Reading the merger against that backdrop is what tells you which institutions will absorb it and which will simply break.

Fig. II · Institutional regimes per modality · 1450 – 2026

Each modality inherited its own institutional infrastructure. The merged model has none.

Source: AGI House Research synthesis.

Three observations follow from reading the diagram literally. First, every regime in the chart is older than the digital-era institutions written on top of it — publishing and copyright are five and a half centuries old, photography and likeness rights almost two, broadcast roughly one. Second, every regime is structured around a labor force, not a product: writers, photographers, musicians, performers. Third, none of them anticipated a single producer that traverses all four.

That third observation is where the institutional vacuum lives. The merged-output row at the bottom of the chart has no inherited regime to absorb it, because no prior production technology produced anything that crossed those boundaries in a single artifact. The closest analog is the silent-to-talkie transition in cinema, which collapsed image and audio production into a single shoot — and which took a decade of labor and copyright realignment to settle. Multimodal compresses an analogous reshuffling across four lanes at once.

III. Where value accrues during an institutional vacuum.

Institutional vacuums are not neutral. Whoever sets the de facto rules during the gap usually keeps them. Three layers are doing the rule-setting right now, and the order in which they congeal is the most predictable thing about the next eighteen months.

Platform-level policy is being set by the frontier labs and their distribution partners. Watermarking standards, training-data opt-outs, prompt-disclosure conventions, age-gating — these are being decided by Google, Anthropic, OpenAI, and the cloud platforms that distribute them, in dialogue with a small number of large customers. By the time the EU AI Act's implementing rules and the US Copyright Office's guidance arrive, the practical defaults will already be in place. Regulation will codify what the labs negotiated.

Litigation infrastructure is consolidating. This quarter's landmark-scale performer-rights settlement (see Landscaping) is the early signal: a small number of plaintiffs' firms are positioning themselves as the de facto enforcement layer for likeness, voice, and now multimodal-output cases. Their settlements, not the statutes that follow, set the licensing structure other producers will adopt.

It's a combination of science and art. Combination of rationality and feeling. Everyone in our company watches a lot of videos. — Yang Tang, Opus Clip

The taste-and-curation layer is the most under-discussed and most economically interesting. Once any team can produce technically competent output across four modalities, the binding constraint stops being capability and starts being judgment. Yang Tang's observation about Opus Clip is the company-scale version of it. The same dynamic governs which Veo-rendered shot makes it into a feature, which Nano-Banana output ships into a product surface, which voice clone reads which line. That layer is human, low-throughput, and very hard to commoditize. Its economic value is rising in inverse proportion to the falling cost of the underlying model output.

Great Hall · Broadcast Edition · Visual Brief · 16 min · Video + Audio available · 5 chapters

I. What the merging actually does

Same weights. Four modalities. One pipeline.

For the first time, the production of media is no longer separated from its understanding. A model that can read a frame can write one. A model that can listen to a take can produce one.

“The more you’re building these general purpose models, they benefit everyone.” — Nicole, Google DeepMind

II. The institutional history

Each modality inherited its own institutional infrastructure.

Five centuries of separate regimes — copyright, photography, broadcast, cinema — built around separate production technologies. The merged model has none.

1450 · Text · Copyright & Publishing
1839 · Image · Likeness & Licensing
1877 · Audio · Broadcast Rights
1895 · Video · Guilds & Regulators
2024 · Merged · no regime

III. Where value accrues

Whoever sets the de facto rules during the gap usually keeps them.

Three layers are doing the rule-setting right now. The order in which they congeal is the most predictable thing about the next eighteen months.

01 · Platform Policy · Labs & distribution partners
02 · Litigation Infrastructure · Plaintiffs’ firms
03 · Taste & Curation · Under-discussed, durable

IV. Where specialization survives

A general model lifts the floor. It does not raise the ceiling on judgment.

Specialization survives where human production capital — domain expertise, taste, on-set craft — does not transfer to the new medium for free.

Vertical reasoning · Realtime & on-device · Taste, curation, brand

V. Four predictions

California sets the de facto US baseline.

The next 18 months will be decided by who sets the defaults during the gap — and on that, the smart bet is on whoever is shipping product at the seams.

01 · Procurement goes multi-vendor & modality-agnostic.
02 · First coherent regime lands in California, not federal or EU.
03 · Labor disputes shift from strikes to opt-in licensing pools.
04 · Labs publicly restructure around multimodal product orgs.


IV. Where specialization survives the merger.

Not everything merges. The lesson from the cinema-to-television transition is that specialization survives where the human production capital — domain expertise, taste, on-set craft — does not transfer to the new medium for free. A general-purpose model lifts the floor; it does not raise the ceiling on judgment. Three places where this is already visible.

Vertical reasoning systems. Mustafa Bhuiyan's observation from §02 — that visual reasoning has to be conditioned on the deployment stack, not bought off the shelf — is the enterprise version of this. Long-tail physical environments, regulated industries, and sovereign-data domains do not absorb a frontier multimodal model cleanly. They absorb a specialized layer on top of one.

Realtime and on-device. David Chen's robotics-streaming thesis applies here too. The frontier multimodal model is moving toward cloud inference; the realtime layer between that model and the world is becoming its own specialization. Whoever owns that seam — voice, robotics, AR overlays, telepresence — owns a moat the foundation model cannot eat.

Taste, curation, brand. The Coframe interview and the Veo Hollywood pipelines are both bets that the human curation layer is where durable margin will live. As output volume goes vertical, the scarce resource is the judgment that says this one ships, that one doesn't. That is not a capability you scale by adding parameters.

V. Four predictions and the closing observation.

One. The 2026–2027 procurement pattern in enterprise will be explicitly multi-vendor and modality-agnostic. The Anthropic Economic Index's diversification finding will hold; single-vendor multimodal lock-in will not be the dominant pattern, regardless of which lab leads on benchmarks. Procurement teams should model two-to-three frontier-vendor stacks and a specialization layer on top.

Two. The first coherent legal regime to land on multimodal output will be a California statute, not a federal one and not an EU one. The combination of the existing performer-rights framework (AB 2602, AB 1836), the active disclosure bill, and the concentration of frontier labs in the state means California will set the de facto US baseline. Other states will harmonize within 18 months.

Three. By Q4 2026 the labor disputes around multimodal output will look less like SAG-AFTRA strikes and more like guild-licensing structures — voluntary opt-in pools that license likeness, voice, and motion to specific named uses, administered by a small number of intermediaries. The settlement structure of the landmark voice-clone case is the template.

Four. Within the labs, the org-structure consolidation Rocky names in The Foyer continues. The teams that ship coherent multimodal product will be teams whose research and product reporting lines already converged. The teams that ship incoherent product will be teams that did not. Expect at least one major lab to publicly restructure around this thesis in the next two issues of this report.

The closing observation is the one this issue exists to make: the merging of modalities is not, primarily, a model-quality story. It is a story about which institutions, labor structures, and legal categories survive contact with a production technology that does not respect the boundaries they were built around. The labs are not waiting for those institutions to catch up, and neither is the capital. The next eighteen months will be decided by who sets the defaults during the gap — and on that, the smart bet is on whoever is shipping product at the seams, not whoever is filing briefs about it.

— KN

§ 06 · Landscaping

Landscaping.

The economic and legal terrain around the merger.

Economics · Reading the Index

What the Anthropic Economic Index says about model-use diversification.

Issue 01 · Diversification of Model Use

Use cases are diversifying. So is the case for multiple models.

The Anthropic Economic Index (March 2026) reports that Claude.ai consumer use has become less concentrated since November 2025 — the top 10 tasks fell from 24% to 19% of conversations. The same report notes that API usage moved the opposite direction, becoming more concentrated as enterprises specialize.

19%

Share of Claude.ai conversations from top 10 tasks, Feb 2026 (down from 24% in Nov 2025)

33%

Share of API traffic from top 10 O*NET tasks, up from 28% — API usage concentrating as Claude.ai diversifies

21.6%

US share of global Claude.ai usage; India 7.2%, Brazil 3.7% — geographic concentration persists

39%

Share of Claude.ai conversations classified as directive (delegated tasks), up from 27% in late 2024
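The top-10 share figures above are a simple concentration statistic: the fraction of all conversations that fall into the ten most common tasks. A minimal sketch of that calculation follows, assuming each conversation carries exactly one task label. The function name and the sample labels are hypothetical, and the Index's own classification pipeline is not described in this issue.

```python
from collections import Counter


def top_k_share(task_labels: list[str], k: int = 10) -> float:
    """Fraction of conversations that belong to the k most frequent tasks."""
    counts = Counter(task_labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no conversations to measure")
    top = sum(n for _, n in counts.most_common(k))
    return top / total


# Hypothetical sample: ten conversations across three task labels.
sample = ["coding"] * 5 + ["writing"] * 3 + ["tutoring"] * 2
print(top_k_share(sample, k=1))  # share held by the single most common task
print(top_k_share(sample, k=2))  # share held by the two most common tasks
```

A falling top-k share with a fixed k means usage is spreading over a longer tail of tasks; a rising one means it is concentrating, which is the opposite movement the report describes for API traffic.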

Read in combination with this issue's central thesis — that frontier multimodal models are absorbing capability across image, video, audio, and text — the diversification pattern tells a useful operational story. As single models become capable across more modalities, consumer use spreads across a longer tail of tasks rather than concentrating on the model's marquee capability. The Nanobanana team's observation that users came for the figurine trend but stayed for education is the consumer-side version of this. The same dynamic is now visible in the aggregate data.

For enterprise procurement, the implication runs in the other direction. As API usage concentrates and enterprises specialize, the case for standardizing on a single multimodal vendor weakens — the marginal task moves between vendors, modalities, and use cases too quickly to lock down. Procurement teams should expect to operate multi-vendor multimodal stacks for at least the next 18 months. The single-vendor procurement pattern that worked for cloud and SaaS does not transfer cleanly. This is consistent with the Dinner Table position that the system matters more than the model.

Source: Anthropic Economic Index, March 2026 Report. Analysis by AGI House Research.

Legal · Rulings and Active Policy

The legal terrain around multimodal AI.

Three developments in the foreground, and a wider view of where policy currently sits.

Ruling · Federal · United States

Voice clone likeness case settles at landmark scale

A federal performer-rights case targeting unauthorized voice cloning of named talent has reached settlement with damages reportedly exceeding the next-highest AI likeness case by an order of magnitude. The structure of the settlement — opt-in licensing rather than blanket prohibition — is expected to shape industry norms.

Source: Reuters, Hollywood Reporter

Ruling · EU · Frontier Model Scope

EU clarifies "general purpose AI" applies to multimodal frontier

The European Commission has issued formal guidance that frontier multimodal models fall within the General Purpose AI provisions of the AI Act, triggering specific transparency, documentation, and risk-assessment requirements that the previous model-by-modality reading had left ambiguous.

Source: European Commission, Politico EU

Enterprise & Science · Multimodal Deployment

Mayo–DeepMind diagnostic pilot clears Phase II under multimodal governance review

A joint multimodal diagnostic pilot — pairing imaging, dictated notes, and patient-history text inside a single inference call — has cleared its Phase II review under an FDA framework that treats merged-modality clinical models as a new evaluation category. Procurement and trial-design notes from the pilot are now circulating as a de facto template for hospital-system contracts.

Source: FDA Digital Health Center, Nature Medicine

Active Policy · Snapshot

Where things stand by jurisdiction.

California

Multimodal output disclosure for political and commercial use

State-level disclosure bills advancing through legislature. Expected signing decisions late summer.Source: CA Legislative tracker

Pending

California

Performer likeness in AI-generated video

Existing AB 2602 / AB 1836 framework now being tested in active litigation involving multimodal outputs.Source: CA Department of Industrial Relations

Active

United States · Federal

NO FAKES Act (voice / likeness)

Bipartisan federal proposal targeting unauthorized AI replicas. Multimodal scope clarification pending in committee.Source: Congress.gov

Pending

United States · Federal

Iterative guidance continues. Multimodal output co-authorship and registrability remain unresolved.Source: US Copyright Office

Signal

European Union

AI Act · GPAI provisions

Frontier multimodal models confirmed within GPAI scope. Transparency, documentation, and downstream-disclosure obligations active.Source: European Commission

Active

European Union

Performer and likeness rights · adaptation

Existing performer rights directives being reinterpreted to cover multimodal outputs. Member-state implementation varies.Source: European Parliament research service

Pending

United Kingdom

AI regulation framework

Sector-specific regulator approach maintained, with cross-cutting principles. Multimodal output not yet specifically scoped.Source: UK DSIT

Signal

Singapore

Model AI Governance Framework v2 · GenAI

Sector-agnostic principles framework updated for multimodal generative AI. Guidance non-binding but referenced in procurement.Source: IMDA

Active

China

Generative AI service regulation

Existing GenAI rules apply to multimodal outputs. Labeling and watermarking enforcement increasing.

Source: CAC

Active

South Korea

AI Framework Act

Comprehensive AI act passed; implementation rules being drafted with attention to multimodal scope.

Source: National Assembly of Korea

Pending

Legend: Active — currently in force or under enforcement. Pending — proposed, in legislative process, or in active drafting. Signal — early signal of position from government or regulator without enforceable action.

Interactive · California & the Modality Frontier

The California timeline, by modality.

California is moving faster than any other US jurisdiction on multimodal regulation. The chart maps every active or pending CA bill, executive action, and regulator signal from 2022 to 2026 against the four output modalities each one targets. Bubble area is proportional to the estimated CA-nexus entities in scope — labs, platforms, studios, ad networks, production tools. Hover any bubble for the detail; filter by modality.

Legend: Text · Image · Audio · Video. Bubble area: estimated CA-nexus entities in scope (scale markers at ~50, ~250, ~700).

Source: CA legislative tracker, CA DIR, calmatters.org, official agency notices.

§ 07 · The Yard

The Yard.

A closing word, and the wire from the week.

Closing Word

See you on May 30.

The DeepMind & Gemini Build Day is where the merging-of-modalities thesis turns into demoable products, built at the seams between modalities. Issue 02 will cover what shipped, expand the regulatory-vacuum thread with reporting from the EU and Singapore, carry the next Portrait Gallery profiles, and set the next Dinner Table question on the perception-to-action merger.

Introductions: research@agihouse.org.

— Katherina Huong Nguyen & Rocky Yu, Editors

The Briefing

The stories that mattered, May 15–25.

A curated wire from the week — frontier models, capital, policy, international, consumer.

Frontier Models · Lead

Gemini multimodal release sets new benchmarks across image, video, and audio

Coverage of the latest Gemini family release shows the merged-modality approach now leading on major multimodal evaluation benchmarks. The same model that wins on image generation also wins on image understanding, video, and audio.

Source: DeepMind blog, The Verge, Bloomberg

Frontier Models

Open-weight multimodal models close the gap on closed frontier

Multiple open-weight releases (including from Chinese labs) are now within striking distance of frontier closed models on multimodal benchmarks.

Source: HuggingFace, Stanford HAI

Frontier Models

Veo enters Hollywood production pipelines for second feature

Trade press confirms a second major studio production using Veo for previz and selected post-production.

Source: Variety, Hollywood Reporter

Capital

Multimodal AI infrastructure rounds overtake LLM-only rounds

Q2 2026 funding tracking shows multimodal infrastructure companies raising at higher valuations than LLM-only peers.

Source: PitchBook, Crunchbase

Anthropic Economic Index

Claude.ai usage diversifies; top 10 tasks fall from 24% to 19% of conversations

The latest Anthropic Economic Index reports use cases diversifying across Claude.ai as adoption matures. Geographic adoption remains uneven, with the US at 21.6% of global usage.

Source: Anthropic Economic Index, March 2026

Policy & Regulation

EU advances multimodal-specific provisions in AI Act enforcement

The European Commission has signaled that the multimodal capability of frontier models will trigger specific AI Act provisions not previously applied.

Source: European Commission, Politico EU

Policy & Regulation

SAG-AFTRA and likeness protection move into multimodal scope

Performer-rights enforcement actions are expanding to address multimodal outputs that combine likeness, voice, and gesture in single generations.

Source: SAG-AFTRA, Variety

International

Singapore launches sovereign multimodal AI compute program

Singapore's IMDA announced a dedicated multimodal compute allocation program. The HK-Singapore corridor continues to set its own procurement and infrastructure agenda.

Source: IMDA, Straits Times

Consumer

Gemini app retention metrics outperform Q1 expectations

Public data on Gemini app usage shows the post-Nanobanana retention curve holding stronger than initial projections.

Source: SensorTower, Google blog
