SlideGen
Collaborative Multimodal Agents for Scientific Slide Generation
Xin Liang1 Xiang Zhang2 Yiwei Xu3 Siqi Sun4 Chenyu You1
1Stony Brook University 2University of British Columbia
3University of California, Los Angeles 4Fudan University
Input: Paper.pdf
Output: Slide.pptx
Abstract
Generating academic slides from scientific papers is a challenging multimodal reasoning task that requires both long-context understanding and deliberate visual planning. Existing approaches largely reduce it to text-only summarization, overlooking the visual component and design-intensive nature of slide creation. In this paper, we introduce SlideGen, an agentic, modular, and visual-in-the-loop framework for scientific paper-to-slide generation. SlideGen orchestrates a group of vision–language agents that reason collaboratively over the document structure and semantics, producing editable PPTX slides with logical flow and compelling visual presentation. By integrating coordinated outlining, mapping, arrangement, note synthesis, and iterative refinement, our system consistently delivers slides of expert-level quality. Across diverse benchmarks and strong baselines, SlideGen outperforms existing methods in visual quality, content faithfulness, and readability, positioning it as the new state of the art in automated slide generation. Our work establishes a foundation for design-aware multimodal slide generation, demonstrating how agentic collaboration can bridge understanding and presentation in complex multimodal reasoning tasks.
Method Overview
SlideGen is a modular, agentic multimodal framework that converts scientific papers into structured, editable PPTX slides. It first extracts paper content into an explicit outline, then plans each slide with visual reasoning for readability and compactness. The pipeline comprises six specialized agents (Outliner, Mapper, Formulizer, Arranger, Refiner, and Speaker) that together cover the full workflow, as sketched below.
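To make the hand-off concrete, here is a minimal Python sketch in which each agent is a callable that reads and extends a shared state dictionary. The stub functions, their names, and the state keys are our own illustration under that assumption, not the released interface.

# Minimal sketch of the six-agent hand-off; each stub stands in for one agent.
def outline(state):          state["outline"] = {"slides": []};  return state
def map_assets(state):       state["mapping"] = {};              return state
def extract_formulas(state): state["formulas"] = [];             return state
def arrange_layouts(state):  state["layouts"] = [];              return state
def refine_deck(state):      return state
def write_notes(state):      state["notes"] = [];                return state

def run_pipeline(paper_pdf):
    state = {"pdf": paper_pdf}
    for agent in (outline, map_assets, extract_formulas,
                  arrange_layouts, refine_deck, write_notes):
        state = agent(state)          # each agent enriches the shared state in turn
    return state

print(sorted(run_pipeline("Paper.pdf").keys()))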
We parse the PDF into a Markdown-like format and build a shared asset library: text assets align headings with paragraph content, and visual assets align figure/table captions with extracted images.
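One possible shape for that asset library, assuming plain dataclass records; the field names are ours rather than the paper's schema.

# Sketch of the shared asset library; field names are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class TextAsset:
    heading: str       # section or subsection heading
    body: str          # paragraph text aligned with that heading

@dataclass
class VisualAsset:
    asset_id: str      # e.g. "fig2" or "tab1"
    caption: str       # figure/table caption from the paper
    image_path: str    # path to the extracted image

asset_library = {
    "text":   [TextAsset("Method", "SlideGen orchestrates six specialized agents ...")],
    "visual": [VisualAsset("fig2", "Figure 2: System overview.", "assets/fig2.png")],
}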
Outliner reads the document and produces a two-level presentation outline: section ordering plus slide-level plans with titles and concise summaries. It outputs a JSON-like structure with global metadata and an ordered slide list.
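An illustrative instance of that JSON-like structure, shown here as a Python dict; the field names are assumptions based on the description above, not the exact schema.

# Illustrative Outliner output: global metadata plus an ordered slide list.
outline = {
    "meta": {"title": "SlideGen", "n_slides": 12},
    "slides": [
        {"index": 1, "section": "Introduction",
         "title": "Why Paper-to-Slide Generation Is Hard",
         "summary": "Long-context understanding plus deliberate visual planning."},
        {"index": 2, "section": "Method",
         "title": "Six Collaborating Agents",
         "summary": "Outliner, Mapper, Formulizer, Arranger, Refiner, Speaker."},
    ],
}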
Mapper assigns figures and tables to the slide(s) they best support and records a short rationale for each placement. Assets can be reused across slides, while irrelevant visuals are left unassigned.
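A sketch of what the Mapper's assignments could look like; the asset ids, slide indices, and rationales below are invented for illustration.

# Hypothetical Mapper output: each visual asset lists the slides it supports,
# with a short rationale; an empty list means the asset stays unassigned.
asset_to_slides = {
    "fig2": {"slides": [2, 3], "rationale": "System overview supports both method slides."},
    "tab1": {"slides": [9],    "rationale": "Main quantitative comparison."},
    "fig7": {"slides": [],     "rationale": "Appendix ablation, not needed for the talk."},
}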
Formulizer extracts formulas and links them to the most relevant sections, storing a normalized representation (LaTeX or crop) plus a brief explanation from surrounding context. It supports both automatic extraction and human-in-the-loop selection.
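One possible record format for the Formulizer; the LaTeX string is a placeholder rather than an equation from any particular paper, and all field names are assumptions.

# Hypothetical Formulizer record: normalized LaTeX (or an image crop when LaTeX
# recovery is unreliable) linked to a section, with a one-line explanation.
formula_record = {
    "formula_id": "eq3",
    "latex": r"\mathcal{L} = \mathcal{L}_{\mathrm{content}} + \lambda \mathcal{L}_{\mathrm{layout}}",
    "crop_path": None,            # set instead of "latex" when only a crop is kept
    "section": "Method",
    "explanation": "Placeholder objective combining a content term and a layout term.",
}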
Arranger selects layouts from an extensible template library (e.g., text-only, image-left/right, multi-image, formula strip) based on element types and image sizes. Separating layout selection from content planning in this way improves visual balance and consistency across the deck.
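An illustrative selection rule over the template names listed above; the actual Arranger also weighs image sizes and text length, which this toy rule omits.

# Toy layout selection over the named templates; real logic would also consider
# image aspect ratios and how much text the slide carries.
def choose_template(n_images: int, has_formula: bool) -> str:
    if has_formula:
        return "formula_strip"
    if n_images == 0:
        return "text_only"
    if n_images == 1:
        return "image_right"      # or "image_left", depending on text balance
    return "multi_image"

print(choose_template(n_images=2, has_formula=False))   # multi_image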
Refiner improves global coherence by merging redundant slides, choosing better text templates, and deriving a readable theme color from figures. Speaker then generates short presenter notes per slide, producing a talk-ready script aligned with the final deck.
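Because the output is ordinary PPTX, the final rendering step can be pictured with python-pptx, including the Speaker's notes on the notes page; the slide text here is invented for illustration, and this is a sketch of the format rather than the system's actual renderer.

# Rendering one planned slide into an editable PPTX with python-pptx,
# attaching presenter notes to the slide's notes page.
from pptx import Presentation

prs = Presentation()
slide = prs.slides.add_slide(prs.slide_layouts[1])        # title + content layout
slide.shapes.title.text = "Six Collaborating Agents"
slide.placeholders[1].text = "Outliner, Mapper, Formulizer, Arranger, Refiner, Speaker"
slide.notes_slide.notes_text_frame.text = (
    "Walk through the hand-off: outline, map visuals, extract formulas, "
    "arrange layouts, refine, then script the talk."
)
prs.save("Slide.pptx")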
Aesthetic-Aware VLMs
SlideGen integrates visual design as a first-class objective in automated slide generation, reducing visual fatigue and producing more polished decks.
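One way to picture the visual-in-the-loop check, assuming a hypothetical vlm_score() that rates a rendered slide image for readability and clutter; slides scoring below a threshold are sent back for re-arrangement. None of these names or thresholds come from the paper.

# Hypothetical visual-in-the-loop refinement: render each slide, score it with
# a VLM-based critic, and re-arrange the ones that look cluttered.
def refine_visually(slides, render, rearrange, vlm_score,
                    threshold=0.7, max_rounds=2):
    for _ in range(max_rounds):
        scores = [vlm_score(render(s)) for s in slides]   # 0 = cluttered, 1 = clean
        if min(scores) >= threshold:
            break                                         # every slide passes
        slides = [rearrange(s) if sc < threshold else s
                  for s, sc in zip(slides, scores)]
    return slides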
JSON-Formatted Outline
Layout Template
Quantitative Results
BibTeX
@article{liang2025slidegen,
  title={SlideGen: Collaborative Multimodal Agents for Scientific Slide Generation},
  author={Liang, Xin and Zhang, Xiang and Xu, Yiwei and Sun, Siqi and You, Chenyu},
  journal={arXiv preprint arXiv:2512.04529},
  year={2025}
}