
The Science Behind VidCognition: How AI Predicts Brain Engagement in Video

VidCognition predicts how the human brain responds to video content — second by second, across the full cortical surface — without requiring participants in a scanner. This is made possible by TRIBE v2, a neural encoding model developed by Meta AI Research, trained on fMRI data from approximately 720 participants across more than 1,000 hours of naturalistic video stimuli, and validated against 7T fMRI recordings with roughly twice the predictive accuracy of prior state-of-the-art models. This page explains the science: what fMRI measures, how neural encoding models work, what TRIBE v2 predicts, and what the output means for creators optimizing video content.

1. What fMRI Measures (and Why It Matters for Video)

The BOLD Signal

Functional magnetic resonance imaging measures neural activity indirectly, via the BOLD (Blood Oxygen Level Dependent) signal. When neurons in a brain region fire more actively, local blood flow increases to deliver more oxygenated hemoglobin. Oxygenated and deoxygenated hemoglobin have different magnetic properties — fMRI detects this difference, producing a spatial map of relative neural activity across the brain volume.

For video research: participants watch video stimuli inside an fMRI scanner while researchers record BOLD signal changes across the full brain, time-locked to the video. The result is a 4D dataset: three spatial dimensions (the brain volume) plus time (one measurement per approximately 1–2 seconds of video). This produces ground-truth data on which brain regions activated, how strongly, and when — as each moment of the video played.
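The resulting data structure can be pictured as a NumPy array. The dimensions below are hypothetical but typical of the setup just described: a voxel grid scanned once per TR for the duration of the video.

```python
import numpy as np

# Hypothetical scan: a 64 x 64 x 40 voxel volume, one volume every 1.5 s (TR),
# for a 3-minute video -> 120 time points.
n_x, n_y, n_z, tr_s, video_s = 64, 64, 40, 1.5, 180
n_timepoints = int(video_s / tr_s)

bold = np.zeros((n_x, n_y, n_z, n_timepoints))  # the 4D dataset

# The BOLD time course of a single voxel is one slice along the time axis:
voxel_timeseries = bold[32, 32, 20, :]
print(bold.shape)              # (64, 64, 40, 120)
print(voxel_timeseries.shape)  # (120,)
```

Each position along the fourth axis is one brain volume, time-locked to the corresponding moment of the video.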

Why 7T fMRI Is Significant

Standard clinical fMRI scanners operate at 1.5T or 3T field strength. 7T (tesla) scanners produce significantly higher signal-to-noise ratio and spatial resolution, resolving cortical features at sub-millimeter scale rather than the 3–4mm voxels typical of 3T scanning. This matters for neural encoding research because it allows mapping to finer cortical regions and distinguishing between adjacent functional areas. TRIBE v2's zero-shot generalization was validated on 7T fMRI recordings from the Human Connectome Project, demonstrating that the model's learned representations transfer to higher-resolution imaging data it was not trained on.

The Challenge of Video fMRI Research

Studying naturalistic video in an fMRI scanner is technically demanding. Participants must remain still inside a loud, enclosed scanner while watching content. The temporal resolution of fMRI (typically 1–2 second TR — repetition time) is coarser than the stimulus (video at 24–60fps). Hemodynamic response function lag adds approximately 4–6 seconds of delay between neural event and BOLD signal peak, requiring deconvolution. Despite these constraints, fMRI video research has produced the most comprehensive datasets on how the human brain processes dynamic visual content — datasets that neural encoding models can be trained on.
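The hemodynamic lag can be illustrated with a toy double-gamma HRF (a common SPM-style shape). The parameters and sampling here are illustrative, stdlib-only, and not a real preprocessing pipeline:

```python
import math

def gamma_pdf(t, a):
    """Gamma(shape=a, scale=1) density, used to build a double-gamma HRF."""
    return 0.0 if t <= 0 else t ** (a - 1) * math.exp(-t) / math.gamma(a)

# Toy double-gamma HRF sampled at 1 s: an early positive peak minus a
# later, smaller undershoot.
hrf = [gamma_pdf(t, 6) - 0.35 * gamma_pdf(t, 16) for t in range(30)]

# A brief neural event at t = 10 s ...
neural = [0.0] * 60
neural[10] = 1.0

# ... convolved with the HRF yields a BOLD response that peaks seconds later.
bold = [sum(neural[t - k] * hrf[k] for k in range(min(t + 1, 30)))
        for t in range(60)]
peak = max(range(60), key=lambda t: bold[t])
print(peak - 10)  # BOLD peak lags the neural event by ~5 s
```

Deconvolution is the inverse problem: recovering the timing of the underlying neural event from the delayed, smeared BOLD trace.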

2. Neural Encoding Models — How Machines Learn to Predict Brain Response

What a Neural Encoding Model Is

A neural encoding model learns the mapping between a stimulus (in this case, video) and the brain response it produces. Given a new stimulus the model has never seen, it predicts the brain response that stimulus would evoke.

Training procedure:

  1. Collect fMRI data from participants watching video stimuli
  2. Extract features from the video (visual, motion, temporal, semantic, audio)
  3. Train a model to predict the fMRI response at each brain location from those features
  4. Validate on held-out participants watching held-out videos
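These four steps can be sketched in miniature using the classic linear (ridge regression) formulation that later sections contrast with TRIBE v2. The data below are entirely synthetic; feature counts, vertex counts, and the regularization value are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 200 s of video, 10 stimulus features per second, 50 "vertices".
X = rng.standard_normal((200, 10))           # stimulus features (time x feature)
W_true = rng.standard_normal((10, 50))       # unknown feature->vertex mapping
Y = X @ W_true + 0.1 * rng.standard_normal((200, 50))  # noisy "BOLD" responses

# Classic encoding-model fit: ridge regression, one weight vector per vertex.
lam = 1.0
W_hat = np.linalg.solve(X.T @ X + lam * np.eye(10), X.T @ Y)

# Encoding: predict the brain response to held-out stimulus features.
X_new = rng.standard_normal((20, 10))
Y_pred = X_new @ W_hat
print(Y_pred.shape)  # (20, 50): 20 s of predicted activity at 50 vertices
```

The learned weights are exactly the "when a video has feature X, region Y activates at strength Z" mapping, one column per brain location.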

The model captures the statistical relationship between video content and cortical activation — essentially learning "when a video has feature X, region Y activates at strength Z."

From Early Visual Models to TRIBE v2

Neural encoding models for video have a research history going back to early work on primary visual cortex. The first models predicted V1 responses from simple features (orientation, spatial frequency, contrast); later models extended to higher visual areas using more complex feature representations. These early approaches relied on linear regression from handcrafted features, a severe limitation on both expressivity and the number of brain locations that could be modeled simultaneously. Deep neural network-based encoders dramatically expanded predictive accuracy across the full cortical hierarchy. TRIBE v2 represents the current state of the art: a multimodal, whole-brain video encoding model trained across hundreds of participants, with demonstrated zero-shot generalization to new subjects, content, and imaging protocols.

3. TRIBE v2 — The Model Powering VidCognition

What TRIBE v2 Is

TRIBE v2 (Temporal Representation in Brain Encoding, version 2) is Meta AI Research's neural encoding model for video, published in March 2026.

d'Ascoli S, Rapin J, Benchetrit Y, Brooks T, Begany K, Raugel J, Banville H, King J-R. "A Foundation Model of Vision, Audition, and Language for In-Silico Neuroscience." Meta AI Research (FAIR), March 2026. ai.meta.com/research

Key Technical Specifications

  • Training data: fMRI recordings from approximately 720 participants, over 1,000 hours of naturalistic video and audio stimuli (films, TV episodes, podcasts)
  • Output: Predicted BOLD signal across the fsaverage5 cortical surface mesh — approximately 20,484 vertices covering the full cortical surface
  • Temporal resolution: Approximately 1-second predictions synchronized to video playback
  • Feature extractors: V-JEPA-2-Gigantic-256 (video), LLaMA 3.2-3B (language), Wav2Vec-BERT-2.0 (audio) — fused via a Transformer backbone
  • Validation: Roughly twice the predictive accuracy of prior state-of-the-art models on held-out participants and stimuli; near noise ceiling in auditory and language cortex regions
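The output implied by these specifications is a time-by-vertex array. A minimal sketch with synthetic values (the vertex count is the fsaverage5 figure from the specs; the clip length and vertex index are hypothetical):

```python
import numpy as np

N_VERTICES = 20484   # fsaverage5 cortical surface mesh (from the specs above)
clip_seconds = 45    # hypothetical clip length

# A TRIBE v2-style output: one predicted activation value per vertex per
# second. Values here are random placeholders, not real model output.
pred = np.random.default_rng(0).standard_normal((clip_seconds, N_VERTICES))

cortex_at_10s = pred[10]    # full-cortex snapshot at second 10
vertex_ts = pred[:, 1234]   # one vertex's time course across the clip
print(pred.shape, cortex_at_10s.shape, vertex_ts.shape)
```

Slicing along one axis gives a whole-brain map for a moment of video; slicing along the other gives a single cortical location's trajectory over the clip.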

What the Model Predicts

TRIBE v2 outputs a prediction of cortical activation at each vertex of the fsaverage5 surface mesh, for each second of video. The fsaverage5 mesh is the standard cortical surface template used in neuroscience research — it maps to anatomical brain regions in a standardized coordinate system, allowing comparison across participants and studies. The prediction covers the full cortical surface: early visual regions, higher visual areas, motion-selective regions, face-selective regions, attention networks, auditory regions, language regions, and limbic-adjacent regions involved in emotional processing.

What Makes TRIBE v2 Different from Prior Models

Prior state-of-the-art video encoding models used linear regression from handcrafted or single-modality features, trained on a small number of participants, predicting activity in a limited set of brain parcels. TRIBE v2 improves on this in five ways: (1) nonlinear deep learning rather than ridge regression; (2) multi-subject training enabling generalization to new individuals; (3) true multimodal fusion of video, audio, and language features with modality dropout regularization; (4) whole-brain prediction across approximately 20,484 cortical vertices rather than hundreds of parcels; (5) zero-shot generalization to new subjects, imaging protocols, and languages not seen during training.

4. Brain Regions and What They Mean for Creators

Below, each major brain region is mapped to creator-relevant language, with the neuroscience cited to primary sources.

Primary Visual Cortex (V1, V2, V3)

Felleman DJ, Van Essen DC. "Distributed Hierarchical Processing in the Primate Cerebral Cortex." Cerebral Cortex, 1(1):1–47, 1991.

Function: Processes basic visual features — edges, contrast, luminance, color, orientation, spatial frequency.

Activation means: The content is generating visual signal at the most fundamental level.

Creator relevance: Low-contrast, visually flat content produces minimal early visual cortex activation. Sharp edges, high contrast, strong color, and motion produce strong activation. A hook that doesn't clear this threshold loses viewers before any higher-level processing occurs.

Motion-Selective Regions (MT/V5, MST)

Function: Process motion direction, speed, and optical flow across the visual field.

Activation means: Directional motion is present and being processed.

Creator relevance: Static shots, slow pans, and talking-head content with no movement produce low MT activation. Fast cuts, camera movement, whip pans, and on-screen action spike it. Motion-selective regions respond within the first 400ms of viewing, among the fastest signals in visual processing, which makes them directly relevant to pattern interrupt effectiveness.

Fusiform Face Area (FFA)

Kanwisher N, McDermott J, Chun MM. "The Fusiform Face Area: A Module in Human Extrastriate Cortex Specialized for Face Perception." Journal of Neuroscience, 17(11):4302–4311, 1997.

Function: Responds selectively to faces, especially eyes and expressions. Response is strongest for faces oriented toward the viewer (direct gaze).

Activation means: Human faces are registering — the social cognition system is engaged.

Creator relevance: Direct-to-camera eye contact is one of the strongest FFA activators in video. Looking away from camera, talking to an off-screen subject, or using graphics-only content without faces reduces FFA activation and social engagement signal.

Anterior Cingulate Cortex (ACC)

Posner MI, Petersen SE. "The Attention System of the Human Brain." Annual Review of Neuroscience, 13:25–42, 1990. · Botvinick MM, Braver TS, Barch DM, Carter CS, Cohen JD. "Conflict Monitoring and Cognitive Control." Psychological Review, 108(3):624–652, 2001.

Function: Sustained attention, behavioral relevance monitoring, conflict processing. The ACC signals to the rest of the brain that a stimulus is worth continued processing.

Activation means: The brain is maintaining active engagement with the content — it has been flagged as relevant and worth continued attention allocation.

Creator relevance: Open loops — questions, incomplete statements, unresolved setups — keep the ACC active. Resolved information gaps drop ACC activation. High, sustained ACC activation through the hook window (0–3s) is the single strongest predictor of continued viewing in neural engagement data.

Amygdala and Limbic-Adjacent Regions

McGaugh JL. "The Amygdala Modulates the Consolidation of Memories of Emotionally Arousing Experiences." Annual Review of Neuroscience, 27:1–28, 2004. · LeDoux JE. "Emotion Circuits in the Brain." Annual Review of Neuroscience, 23:155–184, 2000.

Function: Emotional salience processing, threat and reward detection, arousal. The amygdala activates for emotionally significant stimuli — both positive and negative — and drives the encoding of emotionally salient memories into long-term storage.

Activation means: The content is generating emotional signal — visceral response is present.

Creator relevance: Emotionally flat content produces minimal amygdala activation. High-arousal moments — surprise, awe, amusement, social threat — spike it. Amygdala activation during content correlates with sharing behavior and memory recall, which is why emotionally resonant content spreads more readily than informationally equivalent but emotionally flat content.

Hippocampal Regions (Memory Encoding)

Function: Encoding new information into long-term memory. Hippocampal activation during learning is associated with subsequent recall.

Activation means: The content is being encoded — the brain is storing it, not just processing and discarding.

Creator relevance: Brand mentions, CTAs, or key messages that coincide with moments of high emotional activation (amygdala) and attention (ACC) are more likely to be remembered. This is the neural basis of the "peak-end rule" in content design — emotional peaks drive encoding.

5. From Model Output to Creator Tool

TRIBE v2 outputs a high-dimensional activation vector — approximately 20,484 cortical vertices × seconds of video. Raw cortical activation data is not interpretable by creators without neuroscience training. VidCognition translates this output into three usable formats:

Brain engagement score (per second)

A weighted aggregate of activation across the regions most predictive of sustained viewing — ACC, early visual cortex, motion-selective regions, FFA. Displayed as a 0–100 index per second.
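A minimal sketch of such a weighted aggregation. The per-second region activations and the weights below are invented for illustration; they are not VidCognition's actual values:

```python
import numpy as np

# Hypothetical per-second mean activation (z-scored) in four region groups,
# for a 6-second clip. All numbers are invented.
regions = {
    "ACC":   np.array([1.2, 0.9, 0.4, 0.1, 0.8, 1.1]),
    "V1-V3": np.array([1.5, 1.0, 0.6, 0.5, 0.7, 0.9]),
    "MT/V5": np.array([0.8, 1.1, 0.3, 0.2, 0.9, 1.0]),
    "FFA":   np.array([0.5, 0.7, 0.6, 0.4, 0.3, 0.6]),
}
weights = {"ACC": 0.4, "V1-V3": 0.2, "MT/V5": 0.2, "FFA": 0.2}

# Weighted aggregate across regions, one value per second of video.
raw = sum(weights[r] * act for r, act in regions.items())

# Rescale to a 0-100 index across the clip.
score = 100 * (raw - raw.min()) / (raw.max() - raw.min())
print(score.round(1))
```

In this toy clip the opening second scores highest, driven by the strong ACC and early visual activation in the first row.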

Brain engagement timeline

The per-second score plotted across the full video duration. Shows the hook window, retention checkpoints, mid-video floor, and payoff delivery — all as predicted neural engagement rather than post-hoc behavioral data.
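As a toy illustration of how drop-off markers could be placed on such a timeline, the snippet below flags any second where the score falls sharply relative to the previous one. The scores and threshold are invented, not VidCognition's actual logic:

```python
import numpy as np

# Hypothetical per-second engagement scores (0-100 index) for a 12 s clip.
score = np.array([90, 85, 80, 78, 55, 52, 50, 70, 72, 45, 44, 43])

# Flag a drop-off marker wherever the score falls by more than a threshold
# between consecutive seconds.
THRESHOLD = 15
drops = [t for t in range(1, len(score)) if score[t - 1] - score[t] > THRESHOLD]
print(drops)  # seconds where predicted engagement collapses: [4, 9]
```

On a rendered timeline, these indices would become the markers pointing a creator at the exact seconds that need reworking.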

3D brain heatmap

The full cortical surface rendered with activation overlaid, synchronized to video playback. Shows which specific regions are active at each moment — allowing creators to see whether a drop in engagement is coming from visual processing failure, attention loss, or emotional signal drop.

What VidCognition Adds Beyond the Meta Demo

Meta's hosted TRIBE v2 demo (aidemos.atmeta.com/tribev2) plays pre-loaded video stimuli with a cortical activation overlay. VidCognition productizes this research capability:

  • Custom video upload (any creator's content, not pre-loaded stimuli)
  • Engagement score aggregation (single interpretable number per second)
  • Timeline visualization with drop-off markers
  • Pre-publish workflow (analyze before posting, not after)
  • Creator-facing UX with no neuroscience training required

6. How VidCognition Compares to Other Neuromarketing Tools

| Approach | What it measures | Limitation | Example tools |
|---|---|---|---|
| AI-predicted fMRI (TRIBE v2) | Full-brain cortical response | Predicts average response, not individual variation | VidCognition |
| AI-predicted eye-tracking | Gaze fixation and attention | Attention only; no emotion, no memory encoding | Neurons Inc |
| Real EEG | Electrical brain activity | Requires hardware; lab setting; lower spatial resolution | iMotions, Emotiv |
| Real fMRI | Ground-truth brain activation | $500–$1,500/hr scanner time; lab-only | Neurensics, NIQ BASES |
| Wearable biometrics | Heart rate, skin conductance | Arousal proxy only; requires real participants wearing devices | Immersion Neuroscience |
| Webcam facial coding | Facial expression micro-movements | Emotion only; requires participant consent and webcam | Realeyes, Adverteyes |

The key distinction: VidCognition is the only creator-accessible tool that predicts full-brain fMRI cortical response. Eye-tracking-based tools (like Neurons Inc at approximately €15,000/year) predict where the eye points — an attention signal only. fMRI prediction captures attention, emotion, memory encoding, and social processing simultaneously.

See also: VidCognition vs Neurons Inc · VidCognition vs AttentionInsight · VidCognition vs Brainsight · VidCognition vs GoViral

7. Glossary

fMRI (Functional Magnetic Resonance Imaging)
A brain imaging technique that measures neural activity by detecting changes in blood oxygenation. When neurons fire, they consume more oxygen, causing increased oxygenated blood flow to the active region. fMRI detects the magnetic difference between oxygenated and deoxygenated hemoglobin (the BOLD signal), producing a spatial + temporal map of brain activity. For video research, fMRI reveals which brain regions activate — and how strongly — at each second of a video.
BOLD Signal (Blood Oxygen Level Dependent)
The physiological signal measured by fMRI. Neural activity causes local increases in oxygenated blood flow; fMRI detects the magnetic properties of this change. The BOLD signal is an indirect measure of neural activity with approximately 1–2 second temporal resolution and millimeter-scale spatial resolution at high field strengths.
Neural Encoding Model
A computational model trained to predict brain responses from stimulus features. For video: a model that learns the mapping between video content (visual features, motion, temporal dynamics, audio, language) and fMRI cortical activation patterns. Given a new video, a neural encoding model predicts the cortical response that video would produce in a human viewer without requiring live scanning.
TRIBE v2 (Temporal Representation in Brain Encoding, version 2)
Meta AI Research's neural encoding model for video. Trained on fMRI data from approximately 720 participants across over 1,000 hours of naturalistic stimuli, TRIBE v2 predicts whole-brain cortical responses to any video at approximately 1-second temporal resolution, achieving roughly twice the predictive accuracy of prior state-of-the-art models on held-out participants and stimuli. Published March 2026.
Cortical Surface / fsaverage5
The outer layer of the brain (cerebral cortex) is where most cognitive processing occurs. For computational modeling, the cortical surface is represented as a mesh of approximately 20,484 vertices (in the fsaverage5 standard template), each corresponding to a small patch of cortical tissue. TRIBE v2 predicts activation at each vertex, enabling whole-brain coverage.
Anterior Cingulate Cortex (ACC)
A brain region involved in sustained attention, behavioral relevance monitoring, and conflict processing. In video engagement research, ACC activation is a key predictor of continued viewing — it signals that the brain has flagged the content as worth sustained attention allocation.
Fusiform Face Area (FFA)
A cortical region in the temporal lobe that responds selectively to faces, particularly eyes and facial expressions oriented toward the viewer. Direct eye contact produces strong FFA activation and drives social engagement signals in video content.
Amygdala
A subcortical structure involved in processing emotional salience, threat and reward signals, and arousal. Amygdala activation during content viewing is associated with emotional response intensity and — through its interaction with the hippocampus — with memory encoding of emotionally significant events.
Engagement Curve
A second-by-second graph of audience attention or neural engagement over the duration of a video. Platform analytics produce behavioral engagement curves (percentage of viewers remaining). VidCognition produces brain engagement curves (predicted neural activation per second).
Hook
The opening 1–3 seconds of a short-form video, designed to recruit viewer attention before the brain's pre-conscious go/no-go attention decision is made. An effective hook triggers the visual salience check (pattern interrupt) and opens a cognitive loop that sustains attention (ACC engagement) through the first retention checkpoint.

8. Frequently Asked Questions

What is TRIBE v2?

TRIBE v2 (Temporal Representation in Brain Encoding, version 2) is a neural encoding model developed by Meta AI Research. Published in March 2026, it predicts how the human brain responds to video content — second by second, across the full cortical surface — without requiring participants in an fMRI scanner. The model was trained on fMRI data from approximately 720 participants across over 1,000 hours of naturalistic video and audio stimuli, and achieves roughly twice the predictive accuracy of prior state-of-the-art models on held-out participants and stimuli. VidCognition uses TRIBE v2 to generate brain engagement timelines for any video before posting.

How does VidCognition predict brain engagement?

VidCognition runs your video through TRIBE v2, which extracts video, audio, and language features using pretrained foundation models and maps them to predicted fMRI cortical activation patterns. For each second of your video, the model predicts which brain regions activate and how strongly — based on what it learned from fMRI recordings of human participants watching naturalistic video stimuli. This prediction is aggregated into an engagement score per second and rendered as a brain engagement timeline and 3D cortical heatmap.

Does VidCognition measure my viewers' actual brain activity?

No. VidCognition predicts the brain response a typical viewer would have to your video, based on TRIBE v2's neural encoding model. These are computational predictions — not live measurements. The predictions are validated against actual fMRI responses measured from held-out participants watching held-out videos, making them a reliable proxy for real neural engagement, but they represent an expected average response rather than any individual viewer's brain activity.

What's the difference between fMRI prediction and eye-tracking prediction?

Eye-tracking-based tools (such as Neurons Inc) predict where a viewer's eyes will fixate — an attention signal only. fMRI prediction captures a much broader picture: attention, emotional processing (amygdala activation), memory encoding (hippocampal regions), social engagement (fusiform face area), and sustained relevance signaling (anterior cingulate cortex). Eye-tracking tells you where the viewer looked; fMRI prediction tells you what the brain was doing cognitively and emotionally at each moment.

How accurate is AI fMRI video prediction?

TRIBE v2 achieves roughly twice the predictive accuracy of prior state-of-the-art video encoding models on held-out participants and stimuli, including zero-shot generalization to new subjects watching content not seen during training. Accuracy varies by brain region — early visual cortex and auditory/language cortex predictions approach the noise ceiling of what can be predicted from any model; higher-order association cortex predictions are somewhat lower. VidCognition's engagement score focuses on the regions where prediction accuracy is highest and predictive validity for viewing behavior is best established.

Why does VidCognition focus on specific brain regions?

Not all brain regions are equally predictive of viewing behavior. VidCognition's engagement score weights activation in the regions with the strongest established links to attention and engagement: anterior cingulate cortex (sustained attention and relevance), early visual cortex and motion regions (sensory salience), fusiform face area (social engagement), and amygdala-adjacent regions (emotional salience). The 3D heatmap shows full cortical surface activation for users who want region-level detail.

See your video's brain engagement timeline

Upload any video. Get a second-by-second cortical activation prediction before you post.

Analyze your video →