Kevin Galim

Senior AI Research Engineer · FuriosaAI · Seoul, South Korea


I am a Senior AI Research Engineer at FuriosaAI, where I work on efficient LLM inference, post-training systems, and accelerator-aware training/inference pipelines. My research sits at the intersection of efficient inference, post-training, and AI systems for large language models.

I have authored 10+ publications at venues including ICLR, ICML, ACL, CVPR, ECCV, and WACV, with work spanning KV-cache and prompt/context compression, parameter-efficient adaptation for state space models, and diffusion LLMs with parallel decoding.

Before joining FuriosaAI, I worked on applied computer vision at Funzin, including autonomous golf cart perception and CES 2021 demos; GPU-accelerated image processing at ARRI in Munich; and freelance AR/web development. I received my M.Sc. in Informatics (Games Engineering) from the Technical University of Munich (grade 1.4), including a semester of research in computer graphics at the University of Tokyo.

Research interests:

  • Efficient LLM inference: KV-cache and prompt/context compression, speculative/draft-based decoding, and approximate inference
  • Parameter-efficient fine-tuning: LoRA, state space models, and Mamba-style architectures
  • Diffusion LLMs, parallel decoding, and generative systems
  • Post-training systems: on-policy distillation, asynchronous RL/OPD pipelines, stale rollout correction, teacher-cache constraints, and throughput–quality trade-offs
  • Accelerator-aware LLM systems: rollout generation, inference pipelines, and custom hardware integration

Ongoing work:

  • AsyncOPD: How Stale Can On-Policy Distillation Be? Studies stale rollouts, KL-direction sensitivity, teacher-cache constraints, estimator design, and throughput–quality trade-offs in asynchronous OPD pipelines.

Languages: German (native) · English (fluent) · Korean (professional, TOPIK 5)

selected publications

  1. Draft-based Approximate Inference for LLMs
     Kevin Galim*, Ethan Ewer*, Wonjun Kang, and 3 more authors
     In International Conference on Learning Representations (ICLR), 2026
  2. ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs
     Wonjun Kang*, Kevin Galim*, Seunghyuk Oh*, and 8 more authors
     In International Conference on Learning Representations (ICLR), 2026
  3. Parameter-Efficient Fine-Tuning of State Space Models
     Kevin Galim*, Wonjun Kang*, Yuchen Zeng*, and 2 more authors
     In International Conference on Machine Learning (ICML), 2025