EgoGroups: A Benchmark For Detecting Social Groups of People in the Wild

Jeffri Murrugarra-Llerena1, Pranav Chitale1, Zicheng Liu1, Kai Ao1, Yujin Ham2, Guha Balakrishnan2, Paola Cascante-Bonilla1
1Stony Brook University   ·   2Rice University
TBD 2026

Full Annotation Example
A complete walkthrough of EgoGroups annotations, showing person bounding boxes, track IDs, and social group assignments in a first-person view scene.
A. Coarse Annotations
High-level group labels assigned to people in first-person view scenes.
[Coarse annotation examples 7, 34, 51, and 1557]
B. Fine-grained Annotations
Dense per-person labels with detailed social group membership and interaction type.
[Fine-grained annotation examples 7, 34, 51, and 1557]

Abstract

Social group detection, or the identification of humans involved in reciprocal interpersonal interactions (e.g., family members, friends, and customers and merchants), is a crucial component of the social intelligence needed for agents operating in the world. The few existing benchmarks for social group detection are limited by low scene diversity and reliance on third-person camera sources (e.g., surveillance footage). Consequently, these benchmarks generally lack real-world evaluation of how groups form and evolve in diverse cultural contexts and unconstrained settings. To address this gap, we introduce EgoGroups, a first-person view dataset that captures social dynamics in cities around the world. EgoGroups spans 64 countries, covering low-, medium-, and high-crowd settings under four weather/time-of-day conditions. We include dense human annotations for persons and social groups, along with rich geographic and scene metadata. Using this dataset, we perform an extensive evaluation of the group detection capabilities of state-of-the-art VLMs/LLMs and supervised models. We report several interesting findings, including that VLMs and LLMs can outperform supervised baselines in a zero-shot setting, and that crowd density and cultural region clearly influence model performance.

EgoGroups concept overview

Method

Building EgoGroups
Dataset construction pipeline
Building EgoGroups pipeline diagram
Approach
We implement a pipeline that derives diverse metadata from person bounding boxes and their track IDs, and we develop an effective prompting strategy that enables LLMs and VLMs to detect social groups in first-person view videos.
Approach diagram
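As a concrete illustration of the prompting side of this pipeline, here is a minimal sketch of how tracked person boxes might be serialized into a text prompt for an LLM. The field names (`track_id`, `frame`, `bbox`) and the prompt wording are illustrative assumptions, not the paper's actual prompt.

```python
# Sketch: serializing tracked person boxes into a group-detection prompt.
# Field names and prompt phrasing are assumptions for illustration only.

def boxes_to_prompt(tracks):
    """tracks: list of dicts with track_id, frame, and bbox=(x, y, w, h)."""
    lines = ["People visible in this first-person view clip:"]
    for t in tracks:
        x, y, w, h = t["bbox"]
        lines.append(
            f"- person {t['track_id']} at frame {t['frame']}: "
            f"box x={x}, y={y}, w={w}, h={h}"
        )
    lines.append(
        "Group the person IDs into social groups. "
        "Answer as a list of ID sets, e.g. [[1, 2], [3]]."
    )
    return "\n".join(lines)

prompt = boxes_to_prompt([
    {"track_id": 1, "frame": 0, "bbox": (120, 80, 40, 110)},
    {"track_id": 2, "frame": 0, "bbox": (165, 82, 38, 108)},
])
```

A language-only model sees only this textual scene description, while a VLM would additionally receive the frames themselves.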

Results

1. Group Detection by Crowd Density
Average Precision (AP) across group sizes G1–G5 under Scattered, Moderate, and Crowded settings. EgoGroups is a challenging benchmark: the Crowded setting is the hardest for all models.
| Model | Params | Type | Scattered G1 | G2 | G3 | G4 | G5 | AP | Moderate G1 | G2 | G3 | G4 | G5 | AP | Crowded G1 | G2 | G3 | G4 | G5 | AP | All AP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cosmos-Reason2 | 8B | VLM | 57.18 | 74.42 | 76.36 | 75.35 | 80.60 | 72.78 | 28.92 | 45.28 | 59.54 | 68.02 | 68.55 | 54.06 | 12.80 | 30.12 | 50.05 | 61.53 | 73.18 | 45.53 | 51.07 |
| Cosmos-Reason2 | 8B | LLM | 60.72 | 71.03 | 72.84 | 70.49 | 75.00 | 70.02 | 32.47 | 49.61 | 65.81 | 62.54 | 76.02 | 57.29 | 20.58 | 40.56 | 55.90 | 59.69 | 65.15 | 48.38 | 53.38 |
| Qwen2.5 | 32B | VLM | 58.94 | 83.37 | 86.92 | 81.60 | 92.27 | 80.62 | 30.28 | 62.72 | 77.55 | 84.21 | 79.38 | 66.83 | 19.11 | 52.42 | 73.26 | 76.05 | 71.56 | 58.48 | 63.37 |
| Qwen2.5 | 32B | LLM | 63.71 | 85.28 | 82.38 | 79.51 | 62.73 | 74.72 | 40.49 | 73.14 | 75.28 | 73.02 | 70.62 | 66.51 | 30.45 | 64.66 | 72.69 | 68.27 | 64.38 | 60.09 | 63.54 |
| Qwen2.5 | 72B | VLM | 60.53 | 85.10 | 85.11 | 79.17 | 84.54 | 78.89 | 37.18 | 66.91 | 79.14 | 78.45 | 82.76 | 68.89 | 25.89 | 59.79 | 73.79 | 77.81 | 74.18 | 62.29 | 66.00 |
| Qwen2.5 | 72B | LLM | 63.97 | 85.84 | 80.24 | 75.00 | 59.55 | 72.92 | 38.73 | 73.68 | 76.40 | 71.85 | 69.28 | 65.99 | 28.02 | 65.93 | 73.18 | 68.64 | 61.92 | 59.54 | 62.93 |
| Qwen3 | 30B | VLM | 71.59 | 80.28 | 74.26 | 75.00 | 59.09 | 72.04 | 50.29 | 66.02 | 71.33 | 71.07 | 70.36 | 65.81 | 36.89 | 57.26 | 63.70 | 60.82 | 58.39 | 55.41 | 63.23 |
| Qwen3 | 30B | LLM | 69.41 | 83.56 | 79.37 | 78.47 | 62.73 | 74.71 | 54.94 | 69.94 | 71.17 | 72.14 | 70.15 | 67.67 | 47.59 | 62.84 | 69.35 | 63.25 | 59.25 | 60.46 | 64.23 |
| Gemini-3-Pro | – | VLM | 82.67 | 80.85 | 73.92 | 76.39 | 68.79 | 76.52 | 70.37 | 75.50 | 68.24 | 67.24 | 57.07 | 67.69 | 63.09 | 69.78 | 64.16 | 54.43 | 47.90 | 59.87 | 64.05 |
| Gemini-3-Pro | – | LLM | 77.66 | 81.85 | 75.33 | 75.35 | 59.55 | 73.95 | 60.29 | 74.18 | 73.23 | 70.90 | 62.12 | 68.14 | 50.95 | 69.00 | 68.63 | 63.71 | 54.88 | 61.44 | 64.85 |

VLM = Vision-Language Model · LLM = Language-only Model
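For intuition on what a group-level AP measures, the sketch below implements one common matching criterion from the group-detection literature: a predicted group counts as a true positive when the IoU of its member set with a ground-truth group clears a threshold. The 0.5 threshold and the greedy one-to-one matching are assumptions; the benchmark's exact protocol may differ.

```python
# Sketch of group-level matching underlying a metric like group detection AP.
# Assumption: predicted and ground-truth groups are matched by member-set IoU.

def set_iou(a, b):
    """IoU between two groups, represented as collections of person IDs."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def match_groups(pred_groups, gt_groups, thresh=0.5):
    """Greedy one-to-one matching; returns the number of true positives."""
    used, tp = set(), 0
    for p in pred_groups:
        best, best_iou = None, thresh
        for i, g in enumerate(gt_groups):
            if i in used:
                continue
            iou = set_iou(p, g)
            if iou >= best_iou:
                best, best_iou = i, iou
        if best is not None:
            used.add(best)
            tp += 1
    return tp

tp = match_groups([[1, 2], [3, 4, 5]], [[1, 2], [3, 4], [6]])
```

With ranked predictions, the same matching yields the precision/recall points from which AP is computed, and restricting ground truth to groups of a given size gives the per-size G1–G5 columns.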
2. Group Detection by World Region
AP broken down by geographic region. Models show consistent gaps of 10–15 AP points across regions, with underrepresented areas such as Africa (AF) and the Middle East (ME) showing low performance.
| Model | Params | Type | AF | AN | CA | EU | GE | LA | LE | ME | NE | SA | O |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cosmos-Reason2 | 8B | VLM | 47.19 | 50.49 | 54.01 | 51.69 | 40.07 | 51.12 | 52.98 | 47.05 | 49.86 | 52.86 | 58.89 |
| Cosmos-Reason2 | 8B | LLM | 45.95 | 56.89 | 55.25 | 54.59 | 50.95 | 54.17 | 54.40 | 49.18 | 51.21 | 52.95 | 53.37 |
| Qwen2.5 | 32B | VLM | 61.99 | 64.63 | 63.57 | 58.82 | 60.48 | 66.55 | 64.09 | 57.98 | 71.75 | 65.75 | 50.00 |
| Qwen2.5 | 32B | LLM | 61.87 | 65.01 | 62.16 | 61.69 | 62.84 | 66.89 | 63.84 | 59.75 | 69.36 | 63.28 | 61.63 |
| Qwen2.5 | 72B | VLM | 58.64 | 67.24 | 65.78 | 64.54 | 67.06 | 70.06 | 67.69 | 59.42 | 71.45 | 67.30 | 65.80 |
| Qwen2.5 | 72B | LLM | 61.94 | 65.22 | 61.88 | 62.32 | 64.28 | 65.04 | 62.61 | 57.33 | 67.31 | 64.49 | 54.21 |
| Qwen3 | 30B | VLM | 56.07 | 64.76 | 58.49 | 60.26 | 65.85 | 59.92 | 63.42 | 53.94 | 56.30 | 64.20 | 63.10 |
| Qwen3 | 30B | LLM | 62.64 | 65.61 | 64.43 | 65.95 | 66.34 | 66.73 | 63.30 | 57.70 | 67.66 | 66.31 | 55.60 |
| Gemini-3-Pro | – | VLM | 62.41 | 66.70 | 64.01 | 68.23 | 65.53 | 64.79 | 64.32 | 60.24 | 71.84 | 66.03 | 62.31 |
| Gemini-3-Pro | – | LLM | 60.87 | 64.59 | 64.69 | 68.61 | 64.37 | 62.55 | 66.47 | 59.51 | 62.44 | 66.69 | 64.70 |

Regions: AF = Africa · AN = Andean · CA = Central Asia · EU = Europe · GE = Germanic · LA = Latin America · LE = Levant · ME = Middle East · NE = Northeast Asia · SA = South Asia · O = Oceania
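A regional breakdown like the one above can be reproduced from per-clip scores with a simple aggregation. The sketch below assumes each evaluated clip carries one of the region codes from the legend; the pairing of scores with region metadata is the only assumption.

```python
# Sketch: averaging per-clip AP by region code to produce a regional breakdown.
from collections import defaultdict

def ap_by_region(results):
    """results: list of (region_code, ap) pairs for individual clips."""
    buckets = defaultdict(list)
    for region, ap in results:
        buckets[region].append(ap)
    return {r: sum(v) / len(v) for r, v in buckets.items()}

table = ap_by_region([("AF", 47.0), ("AF", 49.0), ("EU", 52.0)])
```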
3. Group Detection from Panoramic Views
Full benchmark results including Group Detection AP (G1–G5) and Group ID Prediction (Precision, Recall, F1). VLMs and LLMs outperform supervised baselines in a zero-shot setting.
| Model | Params | Type | G1 | G2 | G3 | G4 | G5 | AP | Precision | Recall | F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| JLSG | – | Sup. | 8.00 | 29.30 | 37.50 | 65.40 | 67.00 | 41.40 | – | – | – |
| JRDB-Act | – | Sup. | 81.40 | 64.80 | 49.10 | 63.20 | 37.20 | 59.20 | – | – | – |
| DVT3 | – | Sup. | – | – | – | – | – | – | 61.16 | 31.06 | 41.19 |
| Cosmos-Reason2 | 8B | VLM | 44.88 | 34.70 | 30.96 | 36.83 | 55.00 | 40.47 | 19.83 | 5.83 | 9.02 |
| Cosmos-Reason2 | 8B | LLM | 48.47 | 36.19 | 29.05 | 31.71 | 56.26 | 40.34 | 24.70 | 6.93 | 10.82 |
| Qwen2.5 | 32B | VLM | 25.54 | 47.44 | 47.23 | 54.03 | 69.30 | 48.71 | 22.39 | 16.81 | 19.20 |
| Qwen2.5 | 32B | LLM | 35.13 | 56.78 | 54.11 | 56.96 | 69.49 | 54.49 | 28.45 | 23.88 | 25.97 |
| Qwen2.5 | 72B | VLM | 28.40 | 45.44 | 43.62 | 51.26 | 71.99 | 48.14 | 23.54 | 17.28 | 19.93 |
| Qwen2.5 | 72B | LLM | 32.95 | 55.42 | 55.34 | 60.99 | 66.86 | 54.31 | 28.98 | 25.46 | 27.11 |
| Qwen3 | 30B | VLM | 30.50 | 44.20 | 41.50 | 44.71 | 72.48 | 46.68 | 30.19 | 17.45 | 22.12 |
| Qwen3 | 30B | LLM | 47.15 | 57.38 | 54.45 | 57.97 | 66.62 | 56.71 | 40.25 | 27.84 | 32.91 |
| Qwen3 | 235B | VLM | 31.54 | 44.28 | 41.36 | 50.59 | 77.24 | 49.00 | 36.80 | 21.51 | 27.15 |
| Qwen3 | 235B | LLM | 31.00 | 51.29 | 60.40 | 69.63 | 79.95 | 58.46 | 34.03 | 25.53 | 29.17 |
| Gemini-3-Pro | – | VLM | 56.66 | 71.79 | 79.33 | 79.70 | 57.49 | 68.99 | 42.99 | 46.38 | 44.62 |
| Gemini-3-Pro | – | LLM | 51.65 | 69.97 | 82.52 | 81.80 | 64.08 | 70.00 | 40.89 | 43.97 | 42.37 |

Baselines: JLSG, JRDB-Act, and DVT3 (Sup.) are supervised baselines; all other models are evaluated zero-shot.
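The Group ID Prediction columns are standard precision, recall, and F1. A minimal sketch, assuming true positives are matched predicted/ground-truth groups, unmatched predictions are false positives, and unmatched ground-truth groups are false negatives (the paper's exact protocol may differ):

```python
# Sketch of Precision / Recall / F1 from group-level match counts.
# tp = matched groups, n_pred = total predictions, n_gt = total ground truth.

def prf1(tp, n_pred, n_gt):
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gt if n_gt else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f = prf1(tp=30, n_pred=70, n_gt=100)
```

Note how a model that predicts few, conservative groups can post decent precision with very low recall, which is the pattern visible in the Cosmos-Reason2 rows above.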

Qualitative Results
Visual examples of group detection predictions across different crowd densities.
Qualitative results

Social Activities Results
Cultural samples and activity-level heatmaps showing how social interaction patterns vary across world regions, as detected by Qwen2.5-72B.
Cultural samples Activity heatmap

BibTeX

@article{Murrugarra_2026_egogroups,
  author    = {Murrugarra-Llerena, Jeffri and Chitale, Pranav and Liu, Zicheng and Ao, Kai and Ham, Yujin and Balakrishnan, Guha and Cascante-Bonilla, Paola},
  title     = {EgoGroups: A Benchmark For Detecting Social Groups of People in the Wild},
  journal   = {TBD},
  year      = {2026},
}