BAVI V-JEPA 2.0
Semantic Action Understanding via Latent-Space Video Prediction
Abstract
We present BAVI V-JEPA 2.0, a video understanding architecture achieving 99.4% recall on sports action classification with roughly 20× fewer parameters than the Vision-Language Model (VLM) baselines we benchmark against. Building on Meta's Joint-Embedding Predictive Architecture, our approach predicts masked video patches in latent space rather than pixel space, eliminating the computational overhead of generative reconstruction.
The key insight: action recognition requires understanding what is semantically meaningful, not reconstructing every visual detail. By predicting in embedding space, the model learns to discard unpredictable environmental noise (grass texture, crowd faces) while capturing kinematic patterns essential to action classification.
The Generative Overhead Problem
Current approaches to video action recognition fall into two categories—both with fundamental limitations for edge deployment in sports applications.
- Vision-Language Models: LLaVA-13B, GPT-4V, Gemini Pro Vision
- Traditional CNNs: I3D, SlowFast, C3D
Root Cause: Pixel Reconstruction
Generative models (including VLMs) learn to reconstruct every pixel in masked regions. This wastes massive compute on:
- Environmental noise: grass texture, court surface patterns, lighting variations
- Crowd details: individual faces, clothing patterns, movement
- Unpredictable elements: ball spin, player expressions, random occlusions
None of these contribute to action classification—yet generative models must predict them all.
1.1 The Information Paradox
Video understanding presents a paradox: more visual detail ≠ better action recognition. A player's serving motion contains the same semantic information whether filmed in 4K or 480p, whether the crowd is visible or cropped out. The kinematic pattern—arm trajectory, body rotation, follow-through—is what matters.
This insight drives our architectural choice: predict in latent space where semantic information is preserved but pixel-level detail is abstracted away.
Joint-Embedding Predictive Architecture
V-JEPA (Video Joint-Embedding Predictive Architecture) learns representations by predicting masked video patches in embedding space, not pixel space. This fundamental shift enables the model to focus on semantically meaningful patterns.
The Core Insight
Prediction vs. Reconstruction
Generative (MAE, VLMs)
"Given visible patches, reconstruct the exact pixel values of masked patches."
Loss = ||pixels_pred - pixels_true||²
Must predict grass color, shadow angles, crowd clothing.
Predictive (V-JEPA)
"Given visible patches, predict the semantic embedding of masked patches."
Loss = ||embed_pred - embed_target||²
Only predicts semantic content: action, motion, pose.
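The two objectives differ only in which tensor supplies the regression targets. A minimal PyTorch sketch, with arbitrary placeholder shapes rather than the actual BAVI configuration:

```python
import torch
import torch.nn.functional as F

# Placeholder shapes: B clips, M masked patches, P pixels per patch, D embedding dim.
B, M, P, D = 8, 196, 768, 1024

# Generative objective (MAE-style): regress the raw pixel values of masked patches.
pixels_pred = torch.randn(B, M, P)
pixels_true = torch.randn(B, M, P)
loss_generative = F.mse_loss(pixels_pred, pixels_true)   # ||pixels_pred - pixels_true||^2

# Predictive objective (V-JEPA-style): regress the target encoder's embeddings of the
# masked patches. Targets are detached, mirroring the stop-gradient on the EMA encoder.
embed_pred = torch.randn(B, M, D, requires_grad=True)
embed_target = torch.randn(B, M, D).detach()
loss_predictive = F.mse_loss(embed_pred, embed_target)   # ||embed_pred - embed_target||^2
```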
2.1 Architecture Components
Context Encoder
Processes visible (unmasked) video patches through a Vision Transformer. Outputs contextualized embeddings that capture spatial and temporal relationships between visible regions.
Predictor Network
Takes context encoder output + positional embeddings for masked regions. Predicts what the target encoder would have output for those masked patches. Narrower than encoder (asymmetric design prevents collapse).
Target Encoder (EMA)
Processes masked patches to create target embeddings. Updated via Exponential Moving Average (EMA) of context encoder weights—no gradients flow through. This prevents representation collapse without contrastive negatives.
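How the three components interact during pre-training can be sketched as follows. The class is illustrative only: the encoder and predictor are passed in as generic modules, and the 0.998 momentum value is an assumed default, not the production setting.

```python
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F


class JEPASketch(nn.Module):
    """Illustrative wiring of context encoder, predictor, and EMA target encoder."""

    def __init__(self, encoder: nn.Module, predictor: nn.Module, momentum: float = 0.998):
        super().__init__()
        self.context_encoder = encoder
        self.predictor = predictor                     # narrower than the encoder
        self.target_encoder = copy.deepcopy(encoder)   # EMA copy; never receives gradients
        for p in self.target_encoder.parameters():
            p.requires_grad = False
        self.momentum = momentum

    @torch.no_grad()
    def update_target_encoder(self) -> None:
        """EMA update: target weights drift slowly toward the context encoder's."""
        for p_t, p_c in zip(self.target_encoder.parameters(),
                            self.context_encoder.parameters()):
            p_t.mul_(self.momentum).add_(p_c, alpha=1.0 - self.momentum)

    def forward(self, visible_patches, masked_patches, mask_pos_embed):
        context = self.context_encoder(visible_patches)     # embeddings of visible tokens
        pred = self.predictor(context, mask_pos_embed)      # predict the masked embeddings
        with torch.no_grad():
            target = self.target_encoder(masked_patches)    # targets, no backprop
        return F.mse_loss(pred, target)
```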
2.2 Spatiotemporal Masking Strategy
V-JEPA uses aggressive spatiotemporal masking (up to 90% of patches) to force the model to learn robust representations. The masking strategy combines:
- Contiguous region masking: large spatial blocks are removed rather than scattered patches, so the model cannot interpolate from immediate neighbours
- Multi-frame prediction: masked regions persist across frames, so predictions must rely on temporal context
- Strong priors: with so little visible evidence, the model must internalize how scenes and motions typically evolve
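A minimal sketch of one way to generate such a mask is below; the 14×14 patch grid, 4×4 block size, and 8-frame window are illustrative assumptions, with only the ~90% mask ratio taken from the text.

```python
import torch

def spatiotemporal_block_mask(t=8, h=14, w=14, mask_ratio=0.9, block=4):
    """Mask contiguous (block x block) spatial regions that persist across all frames.

    Returns a boolean (t, h, w) tensor where True marks a masked patch.
    """
    mask = torch.zeros(h, w, dtype=torch.bool)
    target = int(mask_ratio * h * w)
    while mask.sum() < target:
        top = torch.randint(0, h - block + 1, (1,)).item()
        left = torch.randint(0, w - block + 1, (1,)).item()
        mask[top:top + block, left:left + block] = True
    # Repeat the same spatial mask over time so masked regions span multiple frames,
    # forcing the predictor to rely on temporal context rather than nearby pixels.
    return mask.unsqueeze(0).expand(t, h, w)
```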
BAVI Sports Adaptation
We fine-tuned the base V-JEPA model on sports-specific data with domain adaptations that improve action classification for racket sports.
Action Classes (Tennis)
- Serve: 99.8%
- Forehand: 99.2%
- Backhand: 99.1%
- Volley: 98.7%
- Smash: 99.5%
- Drop shot: 98.4%
Training Configuration
| Setting | Value |
|---|---|
| Pre-training data | VideoMix-2M |
| Fine-tuning data | Sports-500K |
| Epochs (pre-train) | 800 |
| Epochs (fine-tune) | 100 |
| Batch size | 256 |
| Learning rate | 1e-4 (cosine) |
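These settings map onto a standard PyTorch fine-tuning loop. The sketch below takes the epoch count, batch size, and cosine-decayed 1e-4 learning rate from the table; the AdamW optimizer, weight-decay value, and the stand-in model and data loader are assumptions rather than the actual BAVI training code.

```python
import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR
from torch.utils.data import DataLoader, TensorDataset

EPOCHS = 100      # fine-tuning epochs from the configuration above
BATCH_SIZE = 256
BASE_LR = 1e-4
NUM_CLASSES = 6   # tennis action classes listed above

# Stand-ins for the fine-tuning backbone and the Sports-500K loader.
model = nn.Sequential(nn.Flatten(), nn.Linear(512, NUM_CLASSES))
dataset = TensorDataset(torch.randn(1024, 512), torch.randint(0, NUM_CLASSES, (1024,)))
train_loader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True)

optimizer = AdamW(model.parameters(), lr=BASE_LR, weight_decay=0.05)
# Cosine decay over the full fine-tuning run, stepped once per batch.
scheduler = CosineAnnealingLR(optimizer, T_max=EPOCHS * len(train_loader))

for epoch in range(EPOCHS):
    for features, labels in train_loader:
        loss = nn.functional.cross_entropy(model(features), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()
```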
Why Latent Prediction Works for Sports
Sports actions have high kinematic regularity: a forehand swing follows predictable biomechanical patterns regardless of court surface, lighting, or crowd background. The latent space naturally abstracts these variations while preserving motion signatures.
Benchmarks & Results
Performance Metrics
| Model | Params | VRAM | Latency | Accuracy |
|---|---|---|---|---|
| LLaVA-13B | 13B | 26GB | 2-5s | 94.2% |
| I3D + OpticalFlow | 25M | 8GB | 180ms | 91.8% |
| SlowFast-R50 | 34M | 6GB | 85ms | 93.5% |
| VideoMAE-L | 305M | 12GB | 45ms | 96.1% |
| BAVI V-JEPA 2.0 | 650M | 4GB | 12ms | 99.4% |
Key Findings
- Latent prediction outperforms pixel reconstruction: V-JEPA achieves higher accuracy than VideoMAE despite similar architecture, purely from the prediction objective
- No optical flow required: Unlike I3D and traditional methods, V-JEPA learns motion representations implicitly through temporal masking
- Efficient VRAM usage: 4GB inference enables deployment on consumer GPUs and high-end edge devices
- 2.85× faster decoding: No pixel reconstruction means faster output generation for real-time applications
Production Deployment
Infrastructure
- AWS g5.2xlarge (A10G)
- PyTorch 2.x + CUDA 12
- TorchScript optimization
- Mixed precision (FP16)
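One way the TorchScript and FP16 items combine at export time is sketched below; the module being traced is a stand-in rather than the BAVI checkpoint, and FP16 is applied only when a GPU is present.

```python
import torch
import torch.nn as nn

# Stand-in for the trained action classifier; the real checkpoint is loaded in production.
model = nn.Sequential(nn.Flatten(), nn.Linear(16 * 3 * 224 * 224, 6)).eval()

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.half if device == "cuda" else torch.float32   # FP16 only where supported
model = model.to(device=device, dtype=dtype)

# Trace to TorchScript with a representative (B, T, C, H, W) 16-frame clip.
example = torch.randn(1, 16, 3, 224, 224, device=device, dtype=dtype)
scripted = torch.jit.trace(model, example)
scripted.save("bavi_vjepa2_fp16.pt")          # deployable artifact

with torch.inference_mode():
    logits = scripted(example)                # single-clip smoke test
```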
Pipeline
- FFmpeg frame extraction
- Sliding window (16 frames)
- Batch inference (stride: 8)
- Action smoothing filter
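The pipeline bullets describe a conventional overlapping-window scheme. The sketch below takes the 16-frame window and stride of 8 from the list; the majority-vote smoothing filter and the example labels are illustrative stand-ins for the production smoothing step.

```python
from collections import Counter

def sliding_windows(num_frames: int, window: int = 16, stride: int = 8):
    """Yield (start, end) frame indices for overlapping 16-frame clips."""
    for start in range(0, max(num_frames - window + 1, 1), stride):
        yield start, start + window

def smooth_predictions(per_window_labels, kernel: int = 5):
    """Majority vote over neighbouring windows to suppress single-window flicker."""
    smoothed = []
    for i in range(len(per_window_labels)):
        lo = max(0, i - kernel // 2)
        hi = min(len(per_window_labels), i + kernel // 2 + 1)
        smoothed.append(Counter(per_window_labels[lo:hi]).most_common(1)[0][0])
    return smoothed

# Example: per-window predictions over a 10-second, 30 fps rally (300 frames).
windows = list(sliding_windows(num_frames=300))
raw = ["forehand"] * 10 + ["volley"] + ["forehand"] * (len(windows) - 11)  # one-off flicker
print(smooth_predictions(raw))   # the isolated "volley" window is voted away
```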
Throughput
- Port 8006 (Pro Analyzer)
- 83 FPS @ 1080p
- Async request handling
- GPU memory pooling
References
[1] Assran, M. et al. (2023). Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture. CVPR.
[2] Bardes, A. et al. (2024). Revisiting Feature Prediction for Learning Visual Representations from Video (V-JEPA). arXiv.
[3] He, K. et al. (2022). Masked Autoencoders Are Scalable Vision Learners. CVPR.
[4] Feichtenhofer, C. et al. (2019). SlowFast Networks for Video Recognition. ICCV.
[5] Carreira, J. & Zisserman, A. (2017). Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset (I3D). CVPR.