EgoGroups: A Benchmark For Detecting Social Groups of People in the Wild

Jeffri Murrugarra-Llerena1, Pranav Chitale1, Zicheng Liu1, Kai Ao1, Yujin Ham2, Guha Balakrishnan2, Paola Cascante-Bonilla1
1Stony Brook University   ·   2Rice University
TBD 2026

Full Annotation Example
A complete walkthrough of EgoGroups annotations, showing person bounding boxes, track IDs, and social group assignments in a first-person view scene.
A. Coarse Annotations
High-level group labels assigned to people in first-person view scenes.
[Coarse annotation examples 7, 34, 51, and 1557]
B. Fine-grained Annotations
Dense per-person labels with detailed social group membership and interaction type.
[Fine-grained annotation examples 7, 34, 51, and 1557]

Abstract

Social group detection, or the identification of humans involved in reciprocal interpersonal interactions (e.g., family members, friends, and customers and merchants), is a crucial component of the social intelligence needed for agents operating in the world. The few existing benchmarks for social group detection are limited by low scene diversity and reliance on third-person camera sources (e.g., surveillance footage). Consequently, these benchmarks generally lack real-world evaluation of how groups form and evolve in diverse cultural contexts and unconstrained settings. To address this gap, we introduce EgoGroups, a first-person view dataset that captures social dynamics in cities around the world. EgoGroups spans 64 countries, covering low-, medium-, and high-crowd settings under four weather/time-of-day conditions. We include dense human annotations for persons and social groups, along with rich geographic and scene metadata. Using this dataset, we perform an extensive evaluation of the group detection capabilities of state-of-the-art VLMs/LLMs and supervised models. We report several interesting findings, including that VLMs and LLMs can outperform supervised baselines in a zero-shot setting, and that crowd density and cultural region clearly influence model performance.

EgoGroups concept overview

Method

Building EgoGroups
Dataset construction pipeline
Building EgoGroups pipeline diagram
Approach
We implement a pipeline that derives diverse metadata from person bounding boxes and their track IDs, and we develop an effective prompting strategy that enables LLMs and VLMs to detect social groups in first-person view videos.
Approach diagram
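As a concrete illustration of the prompting side of this pipeline, here is a minimal sketch of how tracked person boxes might be serialized into a text prompt for an LLM. The field names (`track_id`, `frame`, `bbox`) and the prompt wording are illustrative assumptions, not the paper's actual prompt.

```python
# Sketch: serializing tracked person boxes into a group-detection prompt.
# Field names and prompt phrasing are assumptions for illustration only.

def boxes_to_prompt(tracks):
    """tracks: list of dicts with track_id, frame, and bbox=(x, y, w, h)."""
    lines = ["People visible in this first-person view clip:"]
    for t in tracks:
        x, y, w, h = t["bbox"]
        lines.append(
            f"- person {t['track_id']} at frame {t['frame']}: "
            f"box x={x}, y={y}, w={w}, h={h}"
        )
    lines.append(
        "Group the person IDs into social groups. "
        "Answer as a list of ID sets, e.g. [[1, 2], [3]]."
    )
    return "\n".join(lines)

prompt = boxes_to_prompt([
    {"track_id": 1, "frame": 0, "bbox": (120, 80, 40, 110)},
    {"track_id": 2, "frame": 0, "bbox": (165, 82, 38, 108)},
])
```

A language-only model sees only this textual scene description, while a VLM would additionally receive the frames themselves.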

Results

1. Group Detection by Crowd Density
Average Precision (AP) across group sizes G1–G5 under Scattered, Moderate, and Crowded settings. EgoGroups is a challenging benchmark: the Crowded setting is the hardest for all models.
| Model | Params | Type | Scattered G1 | G2 | G3 | G4 | G5 | AP | Moderate G1 | G2 | G3 | G4 | G5 | AP | Crowded G1 | G2 | G3 | G4 | G5 | AP | All AP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cosmos-Reason2 | 8B | VLM | 57.18 | 74.42 | 76.36 | 75.35 | 80.60 | 72.78 | 28.92 | 45.28 | 59.54 | 68.02 | 68.55 | 54.06 | 12.80 | 30.12 | 50.05 | 61.53 | 73.18 | 45.53 | 51.07 |
| Cosmos-Reason2 | 8B | LLM | 60.72 | 71.03 | 72.84 | 70.49 | 75.00 | 70.02 | 32.47 | 49.61 | 65.81 | 62.54 | 76.02 | 57.29 | 20.58 | 40.56 | 55.90 | 59.69 | 65.15 | 48.38 | 53.38 |
| Qwen2.5 | 32B | VLM | 58.94 | 83.37 | 86.92 | 81.60 | 92.27 | 80.62 | 30.28 | 62.72 | 77.55 | 84.21 | 79.38 | 66.83 | 19.11 | 52.42 | 73.26 | 76.05 | 71.56 | 58.48 | 63.37 |
| Qwen2.5 | 32B | LLM | 63.71 | 85.28 | 82.38 | 79.51 | 62.73 | 74.72 | 40.49 | 73.14 | 75.28 | 73.02 | 70.62 | 66.51 | 30.45 | 64.66 | 72.69 | 68.27 | 64.38 | 60.09 | 63.54 |
| Qwen2.5 | 72B | VLM | 60.53 | 85.10 | 85.11 | 79.17 | 84.54 | 78.89 | 37.18 | 66.91 | 79.14 | 78.45 | 82.76 | 68.89 | 25.89 | 59.79 | 73.79 | 77.81 | 74.18 | 62.29 | 66.00 |
| Qwen2.5 | 72B | LLM | 63.97 | 85.84 | 80.24 | 75.00 | 59.55 | 72.92 | 38.73 | 73.68 | 76.40 | 71.85 | 69.28 | 65.99 | 28.02 | 65.93 | 73.18 | 68.64 | 61.92 | 59.54 | 62.93 |
| Qwen3 | 30B | VLM | 71.59 | 80.28 | 74.26 | 75.00 | 59.09 | 72.04 | 50.29 | 66.02 | 71.33 | 71.07 | 70.36 | 65.81 | 36.89 | 57.26 | 63.70 | 60.82 | 58.39 | 55.41 | 63.23 |
| Qwen3 | 30B | LLM | 69.41 | 83.56 | 79.37 | 78.47 | 62.73 | 74.71 | 54.94 | 69.94 | 71.17 | 72.14 | 70.15 | 67.67 | 47.59 | 62.84 | 69.35 | 63.25 | 59.25 | 60.46 | 64.23 |
| Gemini-3-Pro | – | VLM | 82.67 | 80.85 | 73.92 | 76.39 | 68.79 | 76.52 | 70.37 | 75.50 | 68.24 | 67.24 | 57.07 | 67.69 | 63.09 | 69.78 | 64.16 | 54.43 | 47.90 | 59.87 | 64.05 |
| Gemini-3-Pro | – | LLM | 77.66 | 81.85 | 75.33 | 75.35 | 59.55 | 73.95 | 60.29 | 74.18 | 73.23 | 70.90 | 62.12 | 68.14 | 50.95 | 69.00 | 68.63 | 63.71 | 54.88 | 61.44 | 64.85 |

VLM = Vision-Language Model · LLM = Language-only Model
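For intuition on what a group-level AP measures, the sketch below implements one common matching criterion from the group-detection literature: a predicted group counts as a true positive when the IoU of its member set with a ground-truth group clears a threshold. The 0.5 threshold and the greedy one-to-one matching are assumptions; the benchmark's exact protocol may differ.

```python
# Sketch of group-level matching underlying a metric like group detection AP.
# Assumption: predicted and ground-truth groups are matched by member-set IoU.

def set_iou(a, b):
    """IoU between two groups, represented as collections of person IDs."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def match_groups(pred_groups, gt_groups, thresh=0.5):
    """Greedy one-to-one matching; returns the number of true positives."""
    used, tp = set(), 0
    for p in pred_groups:
        best, best_iou = None, thresh
        for i, g in enumerate(gt_groups):
            if i in used:
                continue
            iou = set_iou(p, g)
            if iou >= best_iou:
                best, best_iou = i, iou
        if best is not None:
            used.add(best)
            tp += 1
    return tp

tp = match_groups([[1, 2], [3, 4, 5]], [[1, 2], [3, 4], [6]])
```

With ranked predictions, the same matching yields the precision/recall points from which AP is computed, and restricting ground truth to groups of a given size gives the per-size G1–G5 columns.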
2. Group Detection by World Region
AP broken down by geographic region. Models show consistent gaps of 10–15 AP points across regions, with underrepresented areas such as Africa (AF) and the Middle East (ME) showing low performance.
| Model | Params | Type | AF | AN | CA | EU | GE | LA | LE | ME | NE | SA | O |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cosmos-Reason2 | 8B | VLM | 47.19 | 50.49 | 54.01 | 51.69 | 40.07 | 51.12 | 52.98 | 47.05 | 49.86 | 52.86 | 58.89 |
| Cosmos-Reason2 | 8B | LLM | 45.95 | 56.89 | 55.25 | 54.59 | 50.95 | 54.17 | 54.40 | 49.18 | 51.21 | 52.95 | 53.37 |
| Qwen2.5 | 32B | VLM | 61.99 | 64.63 | 63.57 | 58.82 | 60.48 | 66.55 | 64.09 | 57.98 | 71.75 | 65.75 | 50.00 |
| Qwen2.5 | 32B | LLM | 61.87 | 65.01 | 62.16 | 61.69 | 62.84 | 66.89 | 63.84 | 59.75 | 69.36 | 63.28 | 61.63 |
| Qwen2.5 | 72B | VLM | 58.64 | 67.24 | 65.78 | 64.54 | 67.06 | 70.06 | 67.69 | 59.42 | 71.45 | 67.30 | 65.80 |
| Qwen2.5 | 72B | LLM | 61.94 | 65.22 | 61.88 | 62.32 | 64.28 | 65.04 | 62.61 | 57.33 | 67.31 | 64.49 | 54.21 |
| Qwen3 | 30B | VLM | 56.07 | 64.76 | 58.49 | 60.26 | 65.85 | 59.92 | 63.42 | 53.94 | 56.30 | 64.20 | 63.10 |
| Qwen3 | 30B | LLM | 62.64 | 65.61 | 64.43 | 65.95 | 66.34 | 66.73 | 63.30 | 57.70 | 67.66 | 66.31 | 55.60 |
| Gemini-3-Pro | – | VLM | 62.41 | 66.70 | 64.01 | 68.23 | 65.53 | 64.79 | 64.32 | 60.24 | 71.84 | 66.03 | 62.31 |
| Gemini-3-Pro | – | LLM | 60.87 | 64.59 | 64.69 | 68.61 | 64.37 | 62.55 | 66.47 | 59.51 | 62.44 | 66.69 | 64.70 |

Regions: AF = Africa · AN = Andean · CA = Central Asia · EU = Europe · GE = Germanic · LA = Latin America · LE = Levant · ME = Middle East · NE = Northeast Asia · SA = South Asia · O = Oceania
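A regional breakdown like the one above can be reproduced from per-clip scores with a simple aggregation. The sketch below assumes each evaluated clip carries one of the region codes from the legend; the pairing of scores with region metadata is the only assumption.

```python
# Sketch: averaging per-clip AP by region code to produce a regional breakdown.
from collections import defaultdict

def ap_by_region(results):
    """results: list of (region_code, ap) pairs for individual clips."""
    buckets = defaultdict(list)
    for region, ap in results:
        buckets[region].append(ap)
    return {r: sum(v) / len(v) for r, v in buckets.items()}

table = ap_by_region([("AF", 47.0), ("AF", 49.0), ("EU", 52.0)])
```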
3. Group Detection from Panoramic Views
Full benchmark results including Group Detection AP (G1–G5) and Group ID Prediction (Precision, Recall, F1). VLMs and LLMs outperform supervised baselines in a zero-shot setting.
| Model | Params | Type | G1 | G2 | G3 | G4 | G5 | AP | Precision | Recall | F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| JLSG | – | Sup. | 8.00 | 29.30 | 37.50 | 65.40 | 67.00 | 41.40 | – | – | – |
| JRDB-Act | – | Sup. | 81.40 | 64.80 | 49.10 | 63.20 | 37.20 | 59.20 | – | – | – |
| DVT3 | – | Sup. | – | – | – | – | – | – | 61.16 | 31.06 | 41.19 |
| Cosmos-Reason2 | 8B | VLM | 44.88 | 34.70 | 30.96 | 36.83 | 55.00 | 40.47 | 19.83 | 5.83 | 9.02 |
| Cosmos-Reason2 | 8B | LLM | 48.47 | 36.19 | 29.05 | 31.71 | 56.26 | 40.34 | 24.70 | 6.93 | 10.82 |
| Qwen2.5 | 32B | VLM | 25.54 | 47.44 | 47.23 | 54.03 | 69.30 | 48.71 | 22.39 | 16.81 | 19.20 |
| Qwen2.5 | 32B | LLM | 35.13 | 56.78 | 54.11 | 56.96 | 69.49 | 54.49 | 28.45 | 23.88 | 25.97 |
| Qwen2.5 | 72B | VLM | 28.40 | 45.44 | 43.62 | 51.26 | 71.99 | 48.14 | 23.54 | 17.28 | 19.93 |
| Qwen2.5 | 72B | LLM | 32.95 | 55.42 | 55.34 | 60.99 | 66.86 | 54.31 | 28.98 | 25.46 | 27.11 |
| Qwen3 | 30B | VLM | 30.50 | 44.20 | 41.50 | 44.71 | 72.48 | 46.68 | 30.19 | 17.45 | 22.12 |
| Qwen3 | 30B | LLM | 47.15 | 57.38 | 54.45 | 57.97 | 66.62 | 56.71 | 40.25 | 27.84 | 32.91 |
| Qwen3 | 235B | VLM | 31.54 | 44.28 | 41.36 | 50.59 | 77.24 | 49.00 | 36.80 | 21.51 | 27.15 |
| Qwen3 | 235B | LLM | 31.00 | 51.29 | 60.40 | 69.63 | 79.95 | 58.46 | 34.03 | 25.53 | 29.17 |
| Gemini-3-Pro | – | VLM | 56.66 | 71.79 | 79.33 | 79.70 | 57.49 | 68.99 | 42.99 | 46.38 | 44.62 |
| Gemini-3-Pro | – | LLM | 51.65 | 69.97 | 82.52 | 81.80 | 64.08 | 70.00 | 40.89 | 43.97 | 42.37 |

Baselines: JLSG, JRDB-Act, and DVT3 (Sup.) are supervised baselines; all other models are evaluated zero-shot.
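The Group ID Prediction columns are standard precision, recall, and F1. A minimal sketch, assuming true positives are matched predicted/ground-truth groups, unmatched predictions are false positives, and unmatched ground-truth groups are false negatives (the paper's exact protocol may differ):

```python
# Sketch of Precision / Recall / F1 from group-level match counts.
# tp = matched groups, n_pred = total predictions, n_gt = total ground truth.

def prf1(tp, n_pred, n_gt):
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gt if n_gt else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f = prf1(tp=30, n_pred=70, n_gt=100)
```

Note how a model that predicts few, conservative groups can post decent precision with very low recall, which is the pattern visible in the Cosmos-Reason2 rows above.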

Qualitative Results
Visual examples of group detection predictions across different crowd densities.
Qualitative results

Social Activities Results
Cultural samples and activity-level heatmaps showing how social interaction patterns vary across world regions, as detected by Qwen2.5-72B.
Cultural samples Activity heatmap

BibTeX

@article{Murrugarra_2026_egogroups,
  author    = {Murrugarra-Llerena, Jeffri and Chitale, Pranav and Liu, Zicheng and Ao, Kai and Ham, Yujin and Balakrishnan, Guha and Cascante-Bonilla, Paola},
  title     = {EgoGroups: A Benchmark For Detecting Social Groups of People in the Wild},
  journal   = {TBD},
  year      = {2026},
}