Scale Where It Matters:
Training-Free Localized Scaling for Diffusion Models

1Stony Brook University    2Nanyang Technological University    3UT Austin
4Johns Hopkins University    5Texas A&M University    6SparcAI Research
LoTTS Teaser — Global vs. Localized Test-Time Scaling

Where should extra inference compute go? Typical test-time scaling (TTS) perturbs or resamples the whole image, even when only a small region is wrong. LoTTS uses quality-aware attention to find those weak regions and runs test-time scaling only there, leaving high-quality pixels fixed: the method is training-free and searches over a much smaller space.

News


[Mar 2026] Code and project page released.
[Nov 2025] Paper available on arXiv.

Overview


Test-time scaling for diffusion models usually perturbs the entire image, yet quality is often uneven across the canvas.

Defects are typically localized: additional compute is better spent on weak regions than on restarting the whole sample.

LoTTS is training-free: it localizes defects with attention-derived masks and refines them via masked resampling with consistency controls (see Method), and is evaluated on SD2.1, SDXL, and FLUX (see Results).

Method


Overview of LoTTS

Given a text prompt, LoTTS first generates candidate images from different noise seeds. It then localizes defective regions using high-/low-quality prompt contrast and constructs a quality-aware mask. Noise is injected only inside the masked regions, followed by localized denoising with spatial and temporal consistency. A verifier finally selects the best refined sample.

LoTTS Overview
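The pipeline above can be sketched as a short loop: sample candidates, mask their weak regions, resample only inside the masks, and let a verifier pick the winner. The sketch below is a toy numpy illustration of that control flow only; every helper (`generate_candidate`, `defect_mask`, `localized_resample`, `verifier_score`) is a hypothetical stand-in, not the paper's actual diffusion model, attention mask, or reward model.

```python
import numpy as np

def generate_candidate(seed, size=16):
    """Stand-in for sampling an image from a diffusion model with a given seed."""
    return np.random.default_rng(seed).random((size, size))

def defect_mask(img, thresh=0.8):
    """Stand-in for the quality-aware attention mask: 1 marks 'defective' pixels."""
    return (img > thresh).astype(float)

def localized_resample(img, mask, noise_scale=0.5, seed=0):
    """Re-noise only inside the mask, then 'denoise' (toy local averaging)."""
    noise = np.random.default_rng(seed).normal(0.0, noise_scale, img.shape)
    noisy = img + mask * noise
    refined = 0.5 * noisy + 0.5 * noisy.mean()   # toy denoiser
    # Global integration: pixels outside the mask are kept exactly as-is.
    return mask * refined + (1.0 - mask) * img

def verifier_score(img):
    """Toy verifier: prefer images with fewer 'defective' pixels."""
    return -defect_mask(img).sum()

# Candidates from different seeds -> localized refinement -> verifier selection.
candidates = [generate_candidate(s) for s in range(4)]
refined = [localized_resample(x, defect_mask(x), seed=s)
           for s, x in enumerate(candidates)]
best = max(refined, key=verifier_score)
```

Note that unmasked pixels pass through unchanged, which is what shrinks the search space relative to whole-image resampling.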

Defect Localization

Defect Localization Overview. From extracted cross-attention maps, the pipeline produces coherent defect masks through prompt-driven discrimination, context-aware propagation, semantic-guided reweighting, and quality-aware mask generation.

Defect Localization Pipeline
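The localization stages above can be caricatured in a few lines: contrast the cross-attention responses of a high-quality vs. a low-quality prompt variant, propagate the contrast over neighboring pixels so the mask is spatially coherent, and threshold. This is a minimal sketch under stated assumptions, not the paper's pipeline: the box blur stands in for context-aware propagation, semantic-guided reweighting is omitted, and the quantile threshold is an arbitrary choice.

```python
import numpy as np

def quality_aware_mask(attn_hi, attn_lo, smooth=1, quantile=0.75):
    """Toy quality-aware mask from two (H, W) cross-attention maps.

    attn_hi / attn_lo: attention for the high-/low-quality prompt variants.
    Returns a binary mask flagging the worst (1 - quantile) fraction of pixels.
    """
    # Prompt-driven discrimination: where does 'low quality' attend more?
    contrast = attn_lo - attn_hi
    # Context-aware propagation (stand-in): box blur to get coherent regions.
    H, W = contrast.shape
    padded = np.pad(contrast, smooth, mode="edge")
    prop = np.zeros_like(contrast)
    for dy in range(2 * smooth + 1):
        for dx in range(2 * smooth + 1):
            prop += padded[dy:dy + H, dx:dx + W]
    prop /= (2 * smooth + 1) ** 2
    # Quality-aware mask generation: keep only the worst-scoring pixels.
    return (prop >= np.quantile(prop, quantile)).astype(float)
```

In practice one would extract `attn_hi` / `attn_lo` from the denoiser's cross-attention layers; here they are just arrays.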

Consistency Maintenance

Localized Resampling Process. Initial Denoising: standard denoising produces \(\mathbf{x}_0\) with localized artifacts (IR = 0.54). Localized Resample: LoTTS injects noise within the defect mask \(\mathbf{M}\) at step \(t_0\), then performs Masked Refinement followed by Global Integration, yielding \(\tilde{\mathbf{x}}_0\) with improved quality (IR = 1.03).

Localized Resampling Process
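The re-noising step in this figure is the standard DDPM forward process applied only inside the mask \(\mathbf{M}\): masked pixels are pushed back to step \(t_0\) via \(\mathbf{x}_{t_0} = \sqrt{\bar\alpha_{t_0}}\,\mathbf{x}_0 + \sqrt{1-\bar\alpha_{t_0}}\,\boldsymbol\epsilon\), while unmasked pixels stay clean. The numpy sketch below assumes the cumulative schedule value \(\bar\alpha_{t_0}\) is given; the function names and the toy blending are our own, not the paper's code.

```python
import numpy as np

def masked_renoise(x0, mask, alpha_bar_t0, seed=0):
    """Push masked pixels of a clean sample x0 back to diffusion step t0.

    DDPM forward: x_t0 = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps,
    applied only where mask == 1; pixels with mask == 0 stay clean.
    alpha_bar_t0: cumulative noise-schedule value at t0 (assumed known).
    """
    eps = np.random.default_rng(seed).normal(size=x0.shape)
    x_t0 = np.sqrt(alpha_bar_t0) * x0 + np.sqrt(1.0 - alpha_bar_t0) * eps
    return mask * x_t0 + (1.0 - mask) * x0

def global_integration(x_refined, x0, mask):
    """After masked refinement, paste refined pixels back into the clean image."""
    return mask * x_refined + (1.0 - mask) * x0
```

A real implementation would run the denoiser from \(t_0\) down to 0 on the masked mixture before calling `global_integration`; the spatial/temporal consistency controls from the Method section are omitted here.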

Results


Quantitative Comparison

LoTTS consistently outperforms Resampling, Best-of-N, and Particle Sampling under matched NFE budgets across three architectures (SD2.1, SDXL, FLUX) and three benchmarks.

Model  | Method            | Pick-a-Pic                   | DrawBench                    | COCO2014
       |                   | HPS↑   AES↑   Pick↑   IR↑    | HPS↑   AES↑   Pick↑   IR↑    | FID↓   CLIP↑
SD2.1  | Resampling        | 20.44  5.377  20.32   0.236  | 21.34  5.456  20.23   0.244  | 15.33  0.201
       | Best-of-N         | 21.56  5.534  21.04   0.470  | 22.45  5.589  20.59   0.446  | 13.21  0.252
       | Particle Sampling | 23.44  5.980  21.30   0.530  | 22.19  5.790  21.23   0.520  | 12.34  0.260
       | LoTTS (Ours)      | 24.52  5.805  21.32   0.680  | 23.29  5.911  21.47   0.698  | 10.89  0.263
SDXL   | Resampling        | 23.44  6.011  21.18   0.680  | 23.84  6.034  21.09   0.657  | 9.56   0.234
       | Best-of-N         | 24.54  6.198  22.01   0.790  | 25.27  6.238  22.23   0.756  | 8.34   0.268
       | Particle Sampling | 25.33  6.235  22.05   0.865  | 26.46  6.233  22.31   0.844  | 7.99   0.271
       | LoTTS (Ours)      | 28.23  6.304  22.30   1.102  | 28.90  6.321  22.38   1.111  | 7.33   0.297
FLUX   | Resampling        | 29.34  6.298  22.07   1.038  | 29.28  6.223  22.05   1.100  | 7.01   0.282
       | Best-of-N         | 30.23  6.299  22.89   1.235  | 30.46  6.290  22.33   1.221  | 6.34   0.306
       | Particle Sampling | 31.56  6.532  23.31   1.450  | 32.28  6.523  22.90   1.445  | 6.02   0.332
       | LoTTS (Ours)      | 33.33  6.501  23.04   1.605  | 33.90  6.890  23.21   1.623  | 5.31   0.351

Qualitative Comparison

Qualitative results on challenging text-to-image prompts. Compared to Resampling, Best-of-N, and Particle Sampling, LoTTS better follows complex prompts. Green borders indicate high-quality generations, red marks lower-quality ones, and brown shows the initial image used as the starting point of localized refinement.

Qualitative comparison

Localized Refinement

Localized Refinement on SD2.1. LoTTS corrects diverse localized artifacts (e.g., distorted hands and faces). Heatmaps show per-region artifact scores, and binary masks indicate the regions selected for refinement.

Localized refinement examples

Scaling Efficiency

(1) Parameter sensitivity analysis — LoTTS maintains stable improvements across varying \(k\), \(r\), and \(t_0\). (2) Scaling Comparison — LoTTS reaches the same HPS as Best-of-N with far fewer samples, yielding 2.8–4× speedup.

Parameter sensitivity and scaling comparison

Ablation Studies


(1) Ablation of Mask Generation. Removing any component consistently degrades all four metrics. (2) Prompt Design. Radar chart comparing two prompt strategies: LoTTS is robust to prompt phrasing.

Ablation studies

Comparison with MUSIQ quality maps. Each triplet shows the original image, our attention heatmap, and the MUSIQ heatmap (red = lower quality). Both methods consistently highlight similar low-quality regions — despite ours being entirely training-free.

Mask comparison with MUSIQ

Citation


@article{ren2025lotts,
  title   = {Scale Where It Matters: Training-Free Localized
             Scaling for Diffusion Models},
  author  = {Ren, Qin and Wang, Yufei and Guo, Lanqing
             and Zhang, Wen and Fan, Zhiwen and You, Chenyu},
  journal = {arXiv preprint arXiv:2511.19917},
  year    = {2025}
}