Grounded Teacher — IJCV 2026

Abstract

Source-free object detection (SFOD) faces persistent challenges due to class imbalance–driven context bias and instability in teacher–student training under noisy pseudo-labels. Existing techniques tend to ignore the context bias and class imbalance shift, especially in medical data. To tackle this, we propose Grounded Teacher (GT), a bias-aware source-free framework that grounds the teacher model through relational and semantic regularization. GT introduces a Relational Context Module (RCM), which maintains an EMA estimate of cross-domain contextual bias, modeling directional inter-class confusions. Building upon this, a Semantic Augmentation (SA) strategy selectively augments minority and confusable classes through adaptive MixUp in both source-similar and source-dissimilar target regions. A Semantic-Aware Loss (SAL) applies diagonally normalized weights, preventing gradient explosion while emphasizing minority–majority corrections. Additionally, a frozen Expert branch derived from large vision foundation models (LVFMs) serves as a supervisory reference during training, refining pseudo-label quality without adding inference overhead. Evaluations on Cityscapes→Foggy (50.8 mAP) and medical transfers (+5.9 AP50 on DDSM→INBreast) show consistent gains and improved minority-class detection, with less than 12% additional training cost.

Why SFOD Remains Hard

Three compounding problems existing methods fail to address simultaneously

⚠️

Mode Collapse via Pseudo-label Noise

Biased teacher models produce skewed pseudo-labels. As the EMA update propagates student weights back to the teacher, errors accumulate and the teacher collapses, predicting only the dominant classes.

🔀

Severe Domain Shift

The source-free setting prohibits access to source data during adaptation. Models must transfer knowledge blindly, which compounds context bias when the source-target gap is large.

⚖️

Class Imbalance in Medical Data

Breast-cancer datasets such as DDSM are severely imbalanced: most images contain no minority-class instances. Conventional image-level resampling is ineffective for detection.

How Grounded Teacher Works

Three interlocking modules that break the bias-collapse cycle in source-free settings

MODULE 01

Relational Context Module (RCM)

Maintains an EMA-updated global confusion matrix across training. Models directional inter-class confusions so downstream augmentation and loss focus on genuinely problematic pairs.
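A minimal sketch of how such an EMA-updated confusion matrix could be maintained. It assumes rows index a reference class (for example the pseudo-label) and columns index the predicted class; the function names, the momentum `alpha = 0.9`, and the row-normalisation step are illustrative assumptions, not details taken from the paper.

```python
def batch_confusion(ref_labels, pred_labels, num_classes):
    """Row-normalised confusion counts for one batch (rows: reference class)."""
    counts = [[0.0] * num_classes for _ in range(num_classes)]
    for r, p in zip(ref_labels, pred_labels):
        counts[r][p] += 1.0
    for row in counts:
        total = sum(row)
        if total > 0:
            for j in range(num_classes):
                row[j] /= total
    return counts


def ema_update(global_cm, batch_cm, ref_labels, alpha=0.9):
    """Blend batch statistics into the running matrix for classes seen in the batch."""
    for i in set(ref_labels):
        global_cm[i] = [alpha * g + (1.0 - alpha) * b
                        for g, b in zip(global_cm[i], batch_cm[i])]
    return global_cm


def most_confused_with(global_cm, cls):
    """Class that `cls` is most often mistaken for (directional: row -> column)."""
    row = global_cm[cls]
    candidates = [j for j in range(len(row)) if j != cls]
    return max(candidates, key=lambda j: row[j])


# Toy usage: three classes, running matrix initialised to the identity.
num_classes = 3
cm = [[1.0 if i == j else 0.0 for j in range(num_classes)]
      for i in range(num_classes)]
ref = [0, 0, 1, 2, 2, 2]
pred = [0, 1, 1, 2, 2, 0]
cm = ema_update(cm, batch_confusion(ref, pred, num_classes), ref)
partner = most_confused_with(cm, 0)
```

Because the matrix is not symmetric, `cm[0][1]` and `cm[1][0]` can differ, which is what makes the confusion estimate directional.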

MODULE 02

Semantic Augmentation (SA) + Semantic-Aware Loss (SAL)

RCM-guided relation-MixUp blends minority instances with their most-confused majority neighbours drawn from a FIFO Cropbank. SAL applies diagonal-normalized, bounded weights to the classification loss, preventing gradient explosion while amplifying minority corrections.
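One plausible reading of "diagonal-normalized, bounded weights" is sketched below: each off-diagonal confusion entry is divided by the diagonal entry of its row, and the result is clipped at a ceiling. The exact formula, the cap `w_max = 5.0`, and the function names are assumptions made for illustration.

```python
import math


def sal_weights(cm, w_max=5.0, eps=1e-6):
    """Weight for (true class i, predicted class j): 1 + cm[i][j] / cm[i][i], capped at w_max."""
    n = len(cm)
    weights = [[1.0] * n for _ in range(n)]
    for i in range(n):
        diag = cm[i][i] + eps  # eps avoids division by zero for never-correct classes
        for j in range(n):
            if i != j:
                weights[i][j] = min(1.0 + cm[i][j] / diag, w_max)
    return weights


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def semantic_aware_ce(logits, target, weights):
    """Cross-entropy scaled by the weight of the (target, predicted) class pair."""
    probs = softmax(logits)
    pred = max(range(len(probs)), key=lambda j: probs[j])
    return weights[target][pred] * -math.log(probs[target])


# Toy usage: class 1 is a minority class that is often predicted as class 0.
cm = [[0.9, 0.1, 0.0],
      [0.6, 0.3, 0.1],
      [0.0, 0.0, 1.0]]
w = sal_weights(cm)
uniform = [[1.0] * 3 for _ in range(3)]
weighted_loss = semantic_aware_ce([2.0, 1.0, 0.0], target=1, weights=w)
plain_loss = semantic_aware_ce([2.0, 1.0, 0.0], target=1, weights=uniform)
```

The cap is what keeps the loss bounded: without it, a class with a near-zero diagonal entry would receive an arbitrarily large weight.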

MODULE 03

Expert Foundational Branch

A frozen Large Vision Foundation Model (LVFM: BioMedParse for medical images, GroundingDINO for natural images) supervises the student via pseudo-label regression and SAL. It is used during training only, so it adds zero inference overhead.

EMA update: Teacher ← Student weights
Pseudo-label feedback: filtered by τ = 0.8 confidence
Expert supervision: training-only, no inference cost
Domain split: variance-based, source-similar / source-dissimilar
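The EMA update and the confidence-filtered pseudo-label feedback can be sketched as follows. Parameters are represented as plain floats standing in for tensors; the momentum value, the dictionary layout, and the use of `>=` at the threshold are assumptions. Only τ = 0.8 comes from the text above.

```python
def ema_update_teacher(teacher, student, momentum=0.999):
    """teacher <- momentum * teacher + (1 - momentum) * student, parameter by parameter."""
    for name, s_val in student.items():
        teacher[name] = momentum * teacher[name] + (1.0 - momentum) * s_val
    return teacher


def filter_pseudo_labels(detections, tau=0.8):
    """Keep only teacher detections whose confidence reaches the threshold tau."""
    return [d for d in detections if d["score"] >= tau]


# Toy usage with scalar "weights" and three teacher detections.
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 1.0}
teacher = ema_update_teacher(teacher, student, momentum=0.9)

detections = [
    {"label": "car", "score": 0.93},
    {"label": "truck", "score": 0.55},
    {"label": "bus", "score": 0.80},
]
kept = filter_pseudo_labels(detections, tau=0.8)
```

A high momentum makes the teacher change slowly, which is why noisy student updates take many iterations to corrupt it, and also why accumulated bias is slow to undo.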

Framework Architecture

The GT framework consists of a student–teacher pair connected by EMA weight updates. The Relational Context Module sits between them, maintaining a running confusion matrix that guides both the Cropbank sampling strategy and the Semantic-Aware Loss.
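As a rough illustration of confusion-guided Cropbank sampling, the sketch below stores crops in per-class FIFO queues and mixes a crop with the newest crop of its most-confused class. Crops are flat lists of equal length for simplicity; the class and function names, the queue capacity, and the fixed mixing coefficient `lam` are hypothetical.

```python
from collections import deque


class CropBank:
    """Per-class FIFO store of object crops; the oldest crop is evicted when full."""

    def __init__(self, num_classes, capacity=4):
        self.banks = [deque(maxlen=capacity) for _ in range(num_classes)]

    def push(self, cls, crop):
        self.banks[cls].append(crop)

    def latest(self, cls):
        return self.banks[cls][-1] if self.banks[cls] else None


def relation_mixup(crop, partner_crop, lam=0.7):
    """Pixel-wise convex combination of two equally sized crops."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(crop, partner_crop)]


def augment(cls, crop, bank, confusion, lam=0.7):
    """Mix `crop` with a crop of the class that `cls` is most confused with."""
    row = confusion[cls]
    partner_cls = max((j for j in range(len(row)) if j != cls),
                      key=lambda j: row[j])
    partner = bank.latest(partner_cls)
    if partner is None:  # nothing stored yet for the partner class
        return crop
    return relation_mixup(crop, partner, lam)


# Toy usage: class 1 is mostly confused with class 0.
confusion = [[0.9, 0.1, 0.0],
             [0.6, 0.3, 0.1],
             [0.0, 0.0, 1.0]]
bank = CropBank(num_classes=3, capacity=2)
for value in (0.0, 1.0, 2.0):
    bank.push(0, [value, value])  # the first crop is evicted by the third
mixed = augment(1, [10.0, 10.0], bank, confusion, lam=0.5)
unchanged = augment(0, [7.0, 7.0], bank, confusion, lam=0.5)
```

A real pipeline would resize crops to a common shape and paste the result back into the image with an updated box label; both steps are omitted here.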

A frozen Expert Branch (BioMedParse / GroundingDINO) provides an additional supervision signal to the student during training. At inference, only the student runs, so the expert adds zero latency.

The variance-based domain splitter partitions target images into source-similar and source-dissimilar subsets, allowing tailored augmentation strength in each region.
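The text does not say which quantity's variance drives the split, or which side of the threshold counts as source-similar. The sketch below assumes the variance of per-image detection confidences, uses the median as the threshold, and exposes the direction as a flag; all of these choices, and the function name, are assumptions.

```python
from statistics import median, pvariance


def split_by_variance(image_scores, high_is_similar=True):
    """Partition image ids by the variance of their detection confidences.

    image_scores maps an image id to a list of confidence scores.
    Images above the median variance go to one subset, the rest to the other.
    """
    variances = {img: (pvariance(s) if s else 0.0)
                 for img, s in image_scores.items()}
    threshold = median(variances.values())
    similar, dissimilar = [], []
    for img, var in variances.items():
        is_high = var > threshold
        (similar if is_high == high_is_similar else dissimilar).append(img)
    return similar, dissimilar


# Toy usage with four target images.
scores = {
    "a": [0.9, 0.1, 0.5],
    "b": [0.5, 0.5, 0.5],
    "c": [0.6, 0.4],
    "d": [0.95, 0.05],
}
similar, dissimilar = split_by_variance(scores)
```

Each subset can then be given its own augmentation strength, as described above.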

State-of-the-Art Performance

Evaluated on natural and medical domain adaptation benchmarks

Table 4 — Cityscapes → Foggy Cityscapes

Method	SF	Venue	Person	Rider	Car	Truck	Bus	Train	M.cycle	Bicycle	mAP
Source Only	✗	—	22.4	26.6	28.5	9.0	16.0	4.3	15.2	25.3	18.4
DA-Faster	✗	CVPR'18	29.2	40.4	43.4	19.7	38.3	28.5	23.7	32.7	32.0
EPM	✗	ECCV'20	44.2	46.6	58.5	24.8	45.2	29.1	28.6	34.6	39.0
AT	✗	CVPR'22	43.7	54.1	62.3	31.9	54.4	49.3	35.2	47.9	47.4
MRT	✗	ICCV'23	52.8	51.7	68.7	35.9	58.1	54.5	41.0	47.1	51.2
SFOD-Mosaic	✓	AAAI'21	25.5	44.5	40.7	33.2	22.2	28.4	34.1	39.0	33.5
HCL	✓	NeurIPS'21	38.7	46.0	47.9	33.0	45.7	38.9	32.8	34.9	39.7
LODS	✓	CVPR'22	34.0	45.7	48.8	27.3	39.7	19.6	33.2	37.8	35.8
IRG	✓	CVPR'23	37.4	45.2	51.9	24.4	39.6	25.2	31.5	41.6	37.1
PETS	✓	ICCV'23	46.1	52.8	63.4	21.8	46.7	5.5	37.4	48.4	40.3
SF-UT	✓	ECCV'24	40.9	48.0	58.9	29.6	51.9	50.2	36.2	44.1	45.0
LPLD	✓	ECCV'24	39.7	49.1	56.6	29.6	46.3	26.4	36.1	43.6	40.9
GT (Ours)	✓	IJCV'26	42.7	55.4	61.7	40.7	62.0	54.6	39.1	53.0	50.8
Oracle	✗	—	66.3	61.1	80.8	45.6	68.8	52.0	49.1	54.9	59.8

Medical Imaging Benchmarks (FROC Recall & F1)

Table 1 — DDSM → INBreast (large-to-small, digitised film → full-field digital)

Method	SF	Venue	R@0.05	R@0.30	R@0.50	R@1.00	AUC	F1
IRG	✓	CVPR'23	0.05	0.05	0.07	0.09	0.110	0.120
LPLD	✓	ECCV'24	0.09	0.15	0.35	0.35	0.548	0.635
D-MASTER	✗	MICCAI'24	0.25	0.61	0.70	0.82	0.808	0.524
GT (Ours)	✓	IJCV'26	0.06	0.45	0.65	0.92	0.589	0.758

Table 2 — DDSM → RSNA-BSD1K (cross-geography, US acquisition)

Method	SF	Venue	R@0.05	R@0.30	R@0.50	R@1.00	AUC	F1
IRG	✓	CVPR'23	0.16	0.25	0.37	0.39	0.308	0.235
MRT	✗	ICCV'23	0.32	0.52	0.69	0.72	0.741	0.352
GT (Ours)	✓	IJCV'26	0.10	0.43	0.58	0.91	0.781	0.530

Table 3 — RSNA → INBreast (cross-machine acquisition)

Method	SF	Venue	R@0.05	R@0.30	R@0.50	R@1.00	AUC	F1
IRG	✓	CVPR'23	0.05	0.05	0.07	0.09	0.110	0.120
MRT	✗	ICCV'23	0.03	0.09	0.12	0.17	0.739	0.587
GT (Ours)	✓	IJCV'26	0.01	0.28	0.49	0.90	0.638	0.605
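The R@x columns in the medical tables are read here as recall (sensitivity) at x false positives per image, the usual FROC operating points. That reading is an assumption, as are the function name and the simplification that every true-positive detection matches a distinct lesion. Under those assumptions, one operating point can be computed like this:

```python
def recall_at_fpi(detections, num_lesions, num_images, fpi):
    """Recall reached before the false-positive budget (fpi * num_images) is exceeded.

    detections: list of (score, is_true_positive) pairs pooled over all images.
    """
    max_fp = fpi * num_images
    tp = fp = 0
    recall = 0.0
    # Sweep the confidence threshold from high to low.
    for _, is_tp in sorted(detections, key=lambda d: d[0], reverse=True):
        if is_tp:
            tp += 1
        else:
            fp += 1
        if fp > max_fp:
            break
        recall = tp / num_lesions
    return recall


# Toy usage: 4 images containing 4 lesions in total.
dets = [(0.95, True), (0.90, False), (0.80, True), (0.70, True),
        (0.60, False), (0.50, False), (0.40, True)]
r_025 = recall_at_fpi(dets, num_lesions=4, num_images=4, fpi=0.25)
r_100 = recall_at_fpi(dets, num_lesions=4, num_images=4, fpi=1.0)
```

Low-FPI columns reward a detector whose highest-confidence predictions are correct, while R@1.00 rewards overall coverage.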

Qualitative Results

Visual comparisons of GT against prior SFOD methods across natural and medical domain transfers. Detection boxes are colour-coded:

Green — True Positive
Blue — Misclassified
Red — False Negative
Pink — False Positive

GT consistently recovers minority-class detections that source-free baselines miss entirely, such as trucks and buses in fog and lesions in mammograms.

Foggy Cityscapes — GT vs MRT / IRG / LPLD. GT recovers trucks and buses missed by baselines.