SlideGen
Collaborative Multimodal Agents for Scientific Slide Generation
Xin Liang1 Xiang Zhang2 Yiwei Xu3 Siqi Sun4 Chenyu You1
1Stony Brook University 2University of British Columbia
3University of California, Los Angeles 4Fudan University
Input: Paper.pdf
Output: Slide.pptx
Abstract
Generating academic slides from scientific papers is a challenging multimodal reasoning task that requires both long-context understanding and deliberate visual planning. Existing approaches largely reduce it to text-only summarization, overlooking the visual component and design-intensive nature of slide creation. In this paper, we introduce SlideGen, an agentic, modular, and visual-in-the-loop framework for scientific paper-to-slide generation. SlideGen orchestrates a group of vision–language agents that reason collaboratively over the document structure and semantics, producing editable PPTX slides with logical flow and compelling visual presentation. By integrating coordinated outlining, mapping, arrangement, note synthesis, and iterative refinement, our system consistently delivers slides of expert-level quality. Across diverse benchmarks and strong baselines, SlideGen outperforms existing methods in visual quality, content faithfulness, and readability, positioning it as the new state of the art in automated slide generation. Our work establishes a foundation for design-aware multimodal slide generation, demonstrating how agentic collaboration can bridge understanding and presentation in complex multimodal reasoning tasks.
Method Overview
SlideGen is a modular, agentic multimodal framework that converts scientific papers into structured, editable PPTX slides. It first extracts paper content into an explicit outline, then plans each slide with visual reasoning for readability and compactness. The pipeline comprises six specialized agents (Outliner, Mapper, Formulizer, Arranger, Refiner, and Speaker) that together cover the full workflow, as sketched below.
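To make the hand-off concrete, here is a minimal Python sketch in which each agent is a callable that reads and extends a shared state dictionary. The stub functions, their names, and the state keys are our own illustration under that assumption, not the released interface.

# Minimal sketch of the six-agent hand-off; each stub stands in for one agent.
def outline(state):          state["outline"] = {"slides": []};  return state
def map_assets(state):       state["mapping"] = {};              return state
def extract_formulas(state): state["formulas"] = [];             return state
def arrange_layouts(state):  state["layouts"] = [];              return state
def refine_deck(state):      return state
def write_notes(state):      state["notes"] = [];                return state

def run_pipeline(paper_pdf):
    state = {"pdf": paper_pdf}
    for agent in (outline, map_assets, extract_formulas,
                  arrange_layouts, refine_deck, write_notes):
        state = agent(state)          # each agent enriches the shared state in turn
    return state

print(sorted(run_pipeline("Paper.pdf").keys()))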
We parse the PDF into a Markdown-like format and build a shared asset library: text assets align headings with paragraph content, and visual assets align figure/table captions with extracted images.
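One possible shape for that asset library, assuming plain dataclass records; the field names are ours rather than the paper's schema.

# Sketch of the shared asset library; field names are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class TextAsset:
    heading: str       # section or subsection heading
    body: str          # paragraph text aligned with that heading

@dataclass
class VisualAsset:
    asset_id: str      # e.g. "fig2" or "tab1"
    caption: str       # figure/table caption from the paper
    image_path: str    # path to the extracted image

asset_library = {
    "text":   [TextAsset("Method", "SlideGen orchestrates six specialized agents ...")],
    "visual": [VisualAsset("fig2", "Figure 2: System overview.", "assets/fig2.png")],
}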
Outliner reads the document and produces a two-level presentation outline: section ordering plus slide-level plans with titles and concise summaries. It outputs a JSON-like structure with global metadata and an ordered slide list.
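An illustrative instance of that JSON-like structure, shown here as a Python dict; the field names are assumptions based on the description above, not the exact schema.

# Illustrative Outliner output: global metadata plus an ordered slide list.
outline = {
    "meta": {"title": "SlideGen", "n_slides": 12},
    "slides": [
        {"index": 1, "section": "Introduction",
         "title": "Why Paper-to-Slide Generation Is Hard",
         "summary": "Long-context understanding plus deliberate visual planning."},
        {"index": 2, "section": "Method",
         "title": "Six Collaborating Agents",
         "summary": "Outliner, Mapper, Formulizer, Arranger, Refiner, Speaker."},
    ],
}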
Mapper assigns figures and tables to the slide(s) they best support and records a short rationale for each placement. Assets can be reused across slides, while irrelevant visuals are left unassigned.
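A sketch of what the Mapper's assignments could look like; the asset ids, slide indices, and rationales below are invented for illustration.

# Hypothetical Mapper output: each visual asset lists the slides it supports,
# with a short rationale; an empty list means the asset stays unassigned.
asset_to_slides = {
    "fig2": {"slides": [2, 3], "rationale": "System overview supports both method slides."},
    "tab1": {"slides": [9],    "rationale": "Main quantitative comparison."},
    "fig7": {"slides": [],     "rationale": "Appendix ablation, not needed for the talk."},
}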
Formulizer extracts formulas and links them to the most relevant sections, storing a normalized representation (LaTeX or crop) plus a brief explanation from surrounding context. It supports both automatic extraction and human-in-the-loop selection.
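One possible record format for the Formulizer; the LaTeX string is a placeholder rather than an equation from any particular paper, and all field names are assumptions.

# Hypothetical Formulizer record: normalized LaTeX (or an image crop when LaTeX
# recovery is unreliable) linked to a section, with a one-line explanation.
formula_record = {
    "formula_id": "eq3",
    "latex": r"\mathcal{L} = \mathcal{L}_{\mathrm{content}} + \lambda \mathcal{L}_{\mathrm{layout}}",
    "crop_path": None,            # set instead of "latex" when only a crop is kept
    "section": "Method",
    "explanation": "Placeholder objective combining a content term and a layout term.",
}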
Arranger selects layouts from an extensible template library (e.g., text-only, image-left/right, multi-image, formula strip) based on element types and image sizes. Separating layout selection from content planning in this way improves visual balance and consistency across the deck.
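An illustrative selection rule over the template names listed above; the actual Arranger also weighs image sizes and text length, which this toy rule omits.

# Toy layout selection over the named templates; real logic would also consider
# image aspect ratios and how much text the slide carries.
def choose_template(n_images: int, has_formula: bool) -> str:
    if has_formula:
        return "formula_strip"
    if n_images == 0:
        return "text_only"
    if n_images == 1:
        return "image_right"      # or "image_left", depending on text balance
    return "multi_image"

print(choose_template(n_images=2, has_formula=False))   # multi_image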
Refiner improves global coherence by merging redundant slides, choosing better text templates, and deriving a readable theme color from figures. Speaker then generates short presenter notes per slide, producing a talk-ready script aligned with the final deck.
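Because the output is ordinary PPTX, the final rendering step can be pictured with python-pptx, including the Speaker's notes on the notes page; the slide text here is invented for illustration, and this is a sketch of the format rather than the system's actual renderer.

# Rendering one planned slide into an editable PPTX with python-pptx,
# attaching presenter notes to the slide's notes page.
from pptx import Presentation

prs = Presentation()
slide = prs.slides.add_slide(prs.slide_layouts[1])        # title + content layout
slide.shapes.title.text = "Six Collaborating Agents"
slide.placeholders[1].text = "Outliner, Mapper, Formulizer, Arranger, Refiner, Speaker"
slide.notes_slide.notes_text_frame.text = (
    "Walk through the hand-off: outline, map visuals, extract formulas, "
    "arrange layouts, refine, then script the talk."
)
prs.save("Slide.pptx")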
Aesthetic-Aware VLMs
SlideGen integrates visual design as a first-class objective in automated slide generation, reducing visual fatigue and producing more polished decks.
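One way to picture the visual-in-the-loop check, assuming a hypothetical vlm_score() that rates a rendered slide image for readability and clutter; slides scoring below a threshold are sent back for re-arrangement. None of these names or thresholds come from the paper.

# Hypothetical visual-in-the-loop refinement: render each slide, score it with
# a VLM-based critic, and re-arrange the ones that look cluttered.
def refine_visually(slides, render, rearrange, vlm_score,
                    threshold=0.7, max_rounds=2):
    for _ in range(max_rounds):
        scores = [vlm_score(render(s)) for s in slides]   # 0 = cluttered, 1 = clean
        if min(scores) >= threshold:
            break                                         # every slide passes
        slides = [rearrange(s) if sc < threshold else s
                  for s, sc in zip(slides, scores)]
    return slides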
JSON-Formatted Outline
Layout Template
Quantitative Results
BibTeX
@article{liang2025slidegen,
  title={SlideGen: Collaborative Multimodal Agents for Scientific Slide Generation},
  author={Liang, Xin and Zhang, Xiang and Xu, Yiwei and Sun, Siqi and You, Chenyu},
  journal={arXiv preprint arXiv:2512.04529},
  year={2025}
}