Functional magnetic resonance imaging (fMRI) measures neural activity indirectly, via the BOLD (Blood Oxygen Level Dependent) signal. When neurons in a brain region fire more actively, local blood flow increases to deliver more oxygenated hemoglobin. Oxygenated and deoxygenated hemoglobin have different magnetic properties; fMRI detects this difference, producing a spatial map of relative neural activity across the brain volume.
For video research: participants watch video stimuli inside an fMRI scanner while researchers record BOLD signal changes across the full brain, time-locked to the video. The result is a 4D dataset: three spatial dimensions (the brain volume) plus time (one measurement per approximately 1–2 seconds of video). This produces ground-truth data on which brain regions activated, how strongly, and when — as each moment of the video played.
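To make the 4D structure concrete, here is a minimal sketch of inspecting such a dataset with the nibabel library. The filename and the printed shape are hypothetical examples, not data from the TRIBE v2 paper.

```python
import nibabel as nib  # standard Python library for reading neuroimaging files

# Hypothetical filename for one participant's functional run recorded
# while they watched a video (not a real file from this article)
img = nib.load("sub-01_task-video_bold.nii.gz")
data = img.get_fdata()

# Three spatial dimensions plus time, e.g. (64, 64, 40, 450): a 64 x 64 x 40
# voxel brain volume sampled 450 times, one volume per repetition time (TR)
print(data.shape)

# For a 4D NIfTI, the last "zoom" in the header is the TR in seconds (~1-2 s)
print(img.header.get_zooms()[-1], "seconds per volume")
```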
Standard clinical fMRI scanners operate at a field strength of 1.5 or 3 tesla (T). 7T scanners produce a significantly higher signal-to-noise ratio and finer spatial resolution, resolving cortical features at sub-millimeter scale rather than the 3–4 mm voxels typical of 3T scanning. This matters for neural encoding research because it allows mapping to finer cortical regions and distinguishing between adjacent functional areas. TRIBE v2's zero-shot generalization performance was validated on 7T fMRI recordings from the Human Connectome Project, demonstrating that the model's learned representations transfer to higher-resolution imaging data it was not trained on.
Studying naturalistic video in an fMRI scanner is technically demanding. Participants must remain still inside a loud, enclosed scanner while watching content. The temporal resolution of fMRI (typically a 1–2 second repetition time, or TR) is far coarser than the stimulus (video at 24–60 fps). The hemodynamic response function adds approximately 4–6 seconds of lag between a neural event and the BOLD signal peak, a delay that analyses must model explicitly (for example, by convolving predicted responses with a hemodynamic response function, or by deconvolving the measured signal). Despite these constraints, fMRI video research has produced the most comprehensive datasets on how the human brain processes dynamic visual content: datasets that neural encoding models can be trained on.
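The lag itself is easy to visualize. The sketch below convolves a brief stimulus event with a canonical double-gamma hemodynamic response function (common default parameters, assumed here for illustration rather than taken from TRIBE v2's preprocessing) and shows the BOLD peak arriving several seconds after the event.

```python
import numpy as np
from scipy.stats import gamma

# Canonical double-gamma HRF with common default parameters (peak ~5 s,
# small undershoot ~15 s). These are generic defaults, not TRIBE v2's
# preprocessing, which this article does not specify.
t = np.arange(0, 30, 1.0)                       # seconds, sampled at a 1 s TR
hrf = gamma.pdf(t, 6) - gamma.pdf(t, 16) / 6.0
hrf /= hrf.sum()

# A brief 2-second visual event beginning at t = 0 in a 60-second clip
stimulus = np.zeros(60)
stimulus[0:2] = 1.0

# The expected BOLD response is the stimulus convolved with the HRF; its
# peak arrives roughly 4-6 s after the neural event, which is the lag that
# encoding models must account for.
bold = np.convolve(stimulus, hrf)[: len(stimulus)]
print("BOLD peak ~", int(np.argmax(bold)), "seconds after stimulus onset")
```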
A neural encoding model learns the mapping between a stimulus (in this case, video) and the brain response it produces. Given a new stimulus the model has never seen, it predicts the brain response that stimulus would evoke.
Training procedure:
1. Participants watch naturalistic videos in the scanner while whole-brain BOLD responses are recorded, time-locked to the stimulus.
2. Features describing the video's visual, audio, and language content are extracted for each second of the stimulus.
3. A model is fit to map those features to the recorded response at each cortical location, accounting for hemodynamic lag.
4. The fitted model is evaluated on held-out videos and held-out participants it has never seen.
The model captures the statistical relationship between video content and cortical activation — essentially learning "when a video has feature X, region Y activates at strength Z."
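As a minimal sketch of this stimulus-to-response framing, the classic baseline fits an independent ridge regression from per-second stimulus features to the response at each cortical location. The feature and response matrices below are synthetic placeholders; TRIBE v2 itself replaces this linear map with a nonlinear deep network, but the framing is the same.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic stand-ins for illustration only:
#   X: one feature vector per second of video (e.g. from a pretrained encoder)
#   Y: the measured fMRI response at each cortical vertex for that second
rng = np.random.default_rng(0)
n_seconds, n_features, n_vertices = 600, 512, 20484
X = rng.standard_normal((n_seconds, n_features))
Y = rng.standard_normal((n_seconds, n_vertices))

# Train on the first 500 seconds of video, hold out the remaining 100
X_train, X_test = X[:500], X[500:]
Y_train, Y_test = Y[:500], Y[500:]

# Classic linear baseline: ridge regression fit jointly to all vertices
# ("when a video has feature X, vertex Y responds at strength Z")
model = Ridge(alpha=10.0).fit(X_train, Y_train)
Y_pred = model.predict(X_test)

# Standard evaluation: Pearson correlation between predicted and measured
# responses at each vertex on the held-out seconds (first 5 vertices shown)
r = [np.corrcoef(Y_pred[:, v], Y_test[:, v])[0, 1] for v in range(5)]
print("example per-vertex correlations:", np.round(r, 3))
```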
Neural encoding models for video have a research history going back to early work on primary visual cortex. Early models predicted V1 responses from simple features (orientation, spatial frequency, contrast). Later models extended to higher visual areas using more complex feature representations. Early neural encoding approaches relied on linear regression mapping from handcrafted features — a severe limitation on both expressivity and the number of brain locations that could be modeled simultaneously. Deep neural network-based encoders dramatically expanded predictive accuracy across the full cortical hierarchy. TRIBE v2 represents the current state of the art: a multimodal, whole-brain video encoding model trained across hundreds of participants, with demonstrated zero-shot generalization to new subjects, content, and imaging protocols.
TRIBE v2 (Temporal Representation in Brain Encoding, version 2) is Meta AI Research's neural encoding model for video, published in March 2026.
d'Ascoli S, Rapin J, Benchetrit Y, Brooks T, Begany K, Raugel J, Banville H, King J-R. "A Foundation Model of Vision, Audition, and Language for In-Silico Neuroscience." Meta AI Research (FAIR), March 2026. ai.meta.com/research · GitHub · HuggingFace
Key Technical Specifications
TRIBE v2 outputs a prediction of cortical activation at each vertex of the fsaverage5 surface mesh, for each second of video. The fsaverage5 mesh is the standard cortical surface template used in neuroscience research — it maps to anatomical brain regions in a standardized coordinate system, allowing comparison across participants and studies. The prediction covers the full cortical surface: early visual regions, higher visual areas, motion-selective regions, face-selective regions, attention networks, auditory regions, language regions, and limbic-adjacent regions involved in emotional processing.
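For readers who want to see what one value per fsaverage5 vertex per second looks like in practice, the sketch below fetches the fsaverage5 mesh with nilearn and renders a single time point. The prediction array is synthetic, and the assumption that the first 10,242 values correspond to the left hemisphere is illustrative rather than documented TRIBE v2 output ordering.

```python
import numpy as np
from nilearn import datasets, plotting

# fsaverage5 has 10,242 vertices per hemisphere, 20,484 in total
fsaverage = datasets.fetch_surf_fsaverage(mesh="fsaverage5")

# Synthetic stand-in for a TRIBE v2-style prediction: one activation value
# per cortical vertex for each second of a 60-second video
n_vertices, n_seconds = 20484, 60
prediction = np.random.default_rng(0).standard_normal((n_vertices, n_seconds))

# Render the left hemisphere at t = 10 s on the inflated surface
# (assumes the first 10,242 values are left-hemisphere vertices)
plotting.plot_surf_stat_map(
    fsaverage.infl_left,
    prediction[: n_vertices // 2, 10],
    bg_map=fsaverage.sulc_left,
    title="Predicted activation at t = 10 s (synthetic data)",
)
plotting.show()
```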
Prior state-of-the-art video encoding models used linear regression from handcrafted or single-modality features, trained on a small number of participants, predicting activity in a limited set of brain parcels. TRIBE v2 improves on this in five ways: (1) nonlinear deep learning rather than ridge regression; (2) multi-subject training enabling generalization to new individuals; (3) true multimodal fusion of video, audio, and language features with modality dropout regularization; (4) whole-brain prediction across approximately 20,484 cortical vertices rather than hundreds of parcels; (5) zero-shot generalization to new subjects, imaging protocols, and languages not seen during training.
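Point (3) mentions modality dropout. The sketch below illustrates the general technique (randomly zeroing an entire modality's feature stream during training so the model cannot learn to rely on any single input). It is a generic illustration, not TRIBE v2's actual implementation, and the feature dimensions are hypothetical.

```python
import torch

def modality_dropout(video, audio, text, p=0.2, training=True):
    """Zero out an entire modality's features with probability p each.

    Generic illustration of modality dropout; TRIBE v2's actual fusion and
    regularization details are not specified in this article.
    """
    if not training:
        return video, audio, text
    streams = []
    for feats in (video, audio, text):
        keep = (torch.rand(()) > p).float()  # one coin flip per modality per batch
        streams.append(feats * keep)
    return tuple(streams)

# Hypothetical per-second feature tensors for a 60-second clip
video_feats = torch.randn(60, 768)   # e.g. from a pretrained video encoder
audio_feats = torch.randn(60, 512)   # e.g. from a pretrained audio encoder
text_feats = torch.randn(60, 1024)   # e.g. from a pretrained language model
video_feats, audio_feats, text_feats = modality_dropout(video_feats, audio_feats, text_feats)
```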
Below, each major brain region is mapped to creator-relevant language, with the neuroscience cited to primary sources.
Early visual cortex (V1)
Felleman DJ, Van Essen DC. "Distributed Hierarchical Processing in the Primate Cerebral Cortex." Cerebral Cortex, 1(1):1–47, 1991.
Function: Processes basic visual features — edges, contrast, luminance, color, orientation, spatial frequency.
Activation means: The content is generating visual signal at the most fundamental level.
Creator relevance: Low-contrast, visually flat content produces minimal early visual cortex activation. Sharp edges, high contrast, strong color, and motion produce strong activation. A hook that doesn't clear this threshold loses viewers before any higher-level processing occurs.
Motion-selective cortex (MT/V5)
Function: Processes motion direction, speed, and optical flow across the visual field.
Activation means: Directional motion is present and being processed.
Creator relevance: Static shots, slow pans, and talking-head content with no movement produce low MT activation. Fast cuts, camera movement, whip pans, and on-screen action spike it. Motion-selective regions are among the fastest to respond, with activation emerging within roughly the first 400ms, which makes them directly relevant to pattern-interrupt effectiveness.
Fusiform face area (FFA)
Kanwisher N, McDermott J, Chun MM. "The Fusiform Face Area: A Module in Human Extrastriate Cortex Specialized for Face Perception." Journal of Neuroscience, 17(11):4302–4311, 1997.
Function: Responds selectively to faces, especially eyes and expressions. Response is strongest for faces oriented toward the viewer (direct gaze).
Activation means: Human faces are registering — the social cognition system is engaged.
Creator relevance: Direct-to-camera eye contact is one of the strongest FFA activators in video. Looking away from camera, talking to an off-screen subject, or using graphics-only content without faces reduces FFA activation and social engagement signal.
Anterior cingulate cortex (ACC)
Posner MI, Petersen SE. "The Attention System of the Human Brain." Annual Review of Neuroscience, 13:25–42, 1990. · Botvinick MM, Braver TS, Barch DM, Carter CS, Cohen JD. "Conflict Monitoring and Cognitive Control." Psychological Review, 108(3):624–652, 2001.
Function: Sustained attention, behavioral relevance monitoring, conflict processing. The ACC signals to the rest of the brain that a stimulus is worth continued processing.
Activation means: The brain is maintaining active engagement with the content — it has been flagged as relevant and worth continued attention allocation.
Creator relevance: Open loops — questions, incomplete statements, unresolved setups — keep the ACC active. Resolved information gaps drop ACC activation. High, sustained ACC activation through the hook window (0–3s) is the single strongest predictor of continued viewing in neural engagement data.
Amygdala
McGaugh JL. "The Amygdala Modulates the Consolidation of Memories of Emotionally Arousing Experiences." Annual Review of Neuroscience, 27:1–28, 2004. · LeDoux JE. "Emotion Circuits in the Brain." Annual Review of Neuroscience, 23:155–184, 2000.
Function: Emotional salience processing, threat and reward detection, arousal. The amygdala activates for emotionally significant stimuli — both positive and negative — and drives the encoding of emotionally salient memories into long-term storage.
Activation means: The content is generating emotional signal — visceral response is present.
Creator relevance: Emotionally flat content produces minimal amygdala activation. High-arousal moments — surprise, awe, amusement, social threat — spike it. Amygdala activation during content correlates with sharing behavior and memory recall, which is why emotionally resonant content spreads more readily than informationally equivalent but emotionally flat content.
Hippocampus
Function: Encoding new information into long-term memory. Hippocampal activation during learning is associated with subsequent recall.
Activation means: The content is being encoded — the brain is storing it, not just processing and discarding.
Creator relevance: Brand mentions, CTAs, or key messages that coincide with moments of high emotional activation (amygdala) and attention (ACC) are more likely to be remembered. This is the neural basis of the "peak-end rule" in content design — emotional peaks drive encoding.
TRIBE v2 outputs a high-dimensional activation array: 20,484 cortical vertices × one value per second of video. Raw cortical activation data is not interpretable by creators without neuroscience training. VidCognition translates this output into three usable formats:
Brain engagement score (per second)
A weighted aggregate of activation across the regions most predictive of sustained viewing: ACC, early visual cortex, motion-selective regions, FFA. Displayed as a 0–100 index per second. (A computation sketch follows the three formats below.)
Brain engagement timeline
The per-second score plotted across the full video duration. Shows the hook window, retention checkpoints, mid-video floor, and payoff delivery — all as predicted neural engagement rather than post-hoc behavioral data.
3D brain heatmap
The full cortical surface rendered with activation overlaid, synchronized to video playback. Shows which specific regions are active at each moment — allowing creators to see whether a drop in engagement is coming from visual processing failure, attention loss, or emotional signal drop.
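The exact region weights and normalization behind the engagement score are not published here, so the following is only a minimal sketch: it assumes a (vertices × seconds) prediction array, hypothetical boolean region masks, and illustrative weights, then collapses them into a 0–100 index per second.

```python
import numpy as np

def engagement_score(prediction, region_masks, weights):
    """Collapse a (vertices x seconds) prediction into a 0-100 score per second.

    region_masks maps region name -> boolean array over vertices; weights maps
    region name -> relative weight. Both are hypothetical placeholders here;
    the article does not publish VidCognition's exact masks or weights.
    """
    n_seconds = prediction.shape[1]
    score = np.zeros(n_seconds)
    for name, mask in region_masks.items():
        # Mean activation within the region, weighted by its importance
        score += weights[name] * prediction[mask].mean(axis=0)
    # Min-max scale to a 0-100 index across this video
    return 100 * (score - score.min()) / (score.max() - score.min() + 1e-9)

# Synthetic example: 20,484 vertices x 60 seconds, with random region masks
rng = np.random.default_rng(0)
prediction = rng.standard_normal((20484, 60))
regions = ("ACC", "early_visual", "motion", "FFA")
masks = {r: rng.random(20484) < 0.02 for r in regions}
weights = {"ACC": 0.4, "early_visual": 0.2, "motion": 0.2, "FFA": 0.2}
print(engagement_score(prediction, masks, weights).round(1)[:10])
```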
Meta's hosted TRIBE v2 demo (aidemos.atmeta.com/tribev2) plays pre-loaded video stimuli with a cortical activation overlay. VidCognition productizes this research capability. The table below situates predicted-fMRI analysis among other approaches to measuring or predicting viewer response:
| Approach | What it measures | Limitation | Example tools |
|---|---|---|---|
| AI-predicted fMRI (TRIBE v2) | Full-brain cortical response | Predicts average response, not individual variation | VidCognition |
| AI-predicted eye-tracking | Gaze fixation and attention | Attention only — no emotion, no memory encoding | Neurons Inc |
| Real EEG | Electrical brain activity | Requires hardware; lab setting; lower spatial resolution | iMotions, Emotiv |
| Real fMRI | Ground-truth brain activation | $500–$1,500/hr scanner time; lab-only | Neurensics, NIQ BASES |
| Wearable biometrics | Heart rate, skin conductance | Arousal proxy only; requires real participants wearing devices | Immersion Neuroscience |
| Webcam facial coding | Facial expression micro-movements | Emotion only; requires participant consent and webcam | Realeyes, Adverteyes |
The key distinction: VidCognition is the only creator-accessible tool that predicts full-brain fMRI cortical response. Eye-tracking-based tools (like Neurons Inc at approximately €15,000/year) predict where the eye points — an attention signal only. fMRI prediction captures attention, emotion, memory encoding, and social processing simultaneously.
See also: VidCognition vs Neurons Inc · VidCognition vs AttentionInsight · VidCognition vs Brainsight · VidCognition vs GoViral
TRIBE v2 (Temporal Representation in Brain Encoding, version 2) is a neural encoding model developed by Meta AI Research. Published in March 2026, it predicts how the human brain responds to video content — second by second, across the full cortical surface — without requiring participants in an fMRI scanner. The model was trained on fMRI data from approximately 720 participants across over 1,000 hours of naturalistic video and audio stimuli, and achieves roughly twice the predictive accuracy of prior state-of-the-art models on held-out participants and stimuli. VidCognition uses TRIBE v2 to generate brain engagement timelines for any video before posting.
VidCognition runs your video through TRIBE v2, which extracts video, audio, and language features using pretrained foundation models and maps them to predicted fMRI cortical activation patterns. For each second of your video, the model predicts which brain regions activate and how strongly — based on what it learned from fMRI recordings of human participants watching naturalistic video stimuli. This prediction is aggregated into an engagement score per second and rendered as a brain engagement timeline and 3D cortical heatmap.
VidCognition does not measure brain activity directly. It predicts the brain response a typical viewer would have to your video, based on TRIBE v2's neural encoding model. These are computational predictions, not live measurements. The predictions are validated against actual fMRI responses measured from held-out participants watching held-out videos, making them a reliable proxy for real neural engagement, but they represent an expected average response rather than any individual viewer's brain activity.
Eye-tracking-based tools (such as Neurons Inc) predict where a viewer's eyes will fixate — an attention signal only. fMRI prediction captures a much broader picture: attention, emotional processing (amygdala activation), memory encoding (hippocampal regions), social engagement (fusiform face area), and sustained relevance signaling (anterior cingulate cortex). Eye-tracking tells you where the viewer looked; fMRI prediction tells you what the brain was doing cognitively and emotionally at each moment.
TRIBE v2 achieves roughly twice the predictive accuracy of prior state-of-the-art video encoding models on held-out participants and stimuli, including zero-shot generalization to new subjects watching content not seen during training. Accuracy varies by brain region — early visual cortex and auditory/language cortex predictions approach the noise ceiling of what can be predicted from any model; higher-order association cortex predictions are somewhat lower. VidCognition's engagement score focuses on the regions where prediction accuracy is highest and predictive validity for viewing behavior is best established.
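For context on how such accuracy figures are typically computed, the sketch below uses the generic evaluation recipe (not the paper's exact protocol): per-vertex Pearson correlation between predicted and measured held-out responses, compared against a split-half noise ceiling estimated from repeated viewings of the same video. All arrays are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n_seconds, n_vertices = 300, 1000   # a small subset of vertices, for illustration

# Synthetic stand-ins: a shared "signal" plus independent noise for the model's
# prediction and for two repeated fMRI measurements of the same held-out video
signal = rng.standard_normal((n_seconds, n_vertices))
predicted = signal + rng.standard_normal((n_seconds, n_vertices))
measured_run1 = signal + rng.standard_normal((n_seconds, n_vertices))
measured_run2 = signal + rng.standard_normal((n_seconds, n_vertices))

def columnwise_r(a, b):
    """Pearson correlation between matching columns of two (time x vertices) arrays."""
    a = (a - a.mean(0)) / a.std(0)
    b = (b - b.mean(0)) / b.std(0)
    return (a * b).mean(0)

# Prediction accuracy: how well the model tracks one measured run, per vertex
accuracy = columnwise_r(predicted, measured_run1)

# Noise ceiling: how well one measured run predicts another; a model cannot
# reliably exceed this, so accuracy is often reported relative to it
ceiling = columnwise_r(measured_run1, measured_run2)
print("median prediction accuracy:", round(float(np.median(accuracy)), 3))
print("median noise ceiling:", round(float(np.median(ceiling)), 3))
```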
Not all brain regions are equally predictive of viewing behavior. VidCognition's engagement score weights activation in the regions with the strongest established links to attention and engagement: anterior cingulate cortex (sustained attention and relevance), early visual cortex and motion regions (sensory salience), fusiform face area (social engagement), and amygdala-adjacent regions (emotional salience). The 3D heatmap shows full cortical surface activation for users who want region-level detail.
Upload any video. Get a second-by-second cortical activation prediction before you post.
Analyze your video →