I am a system software engineer at MangoBoost, developing full-stack, highly optimized systems for large language models (LLMs).
I received my Ph.D. at KAIST, advised by Dongsu Han in the INA lab.
Before that, I received my M.S. in Electrical Engineering at KAIST, advised by Yung Yi.
I also received my B.S. in Electrical Engineering at KAIST.
My interests lie in crafting computer systems for AI/ML and cloud applications. I focus on integrating AI/ML techniques into real-world systems to enable more efficient and effective performance management policies.
My work aims to (1) design efficient systems built on AI/ML approaches, and (2) improve AI/ML approaches by exploiting system-specific assumptions.
Each of my projects involves several months of implementation followed by thorough testing on real-world data.
During my Ph.D., I developed an edge-assisted inference framework that splits LLM workloads between edge and server GPUs using a speculative decoding scheme.
I have also worked on a range of topics, including resource optimization and adaptive overload control for SLO-oriented microservices, optimizing image compression to accelerate neural restoration, and designing a wireless MAC with a multi-agent reinforcement learning approach.
Publications
-
SpecEdge: Scalable Edge-Assisted Serving Framework for Interactive LLMs
Jinwoo Park, Seunggeun Cho, and Dongsu Han
The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS), Spotlight, 2025
Large language models (LLMs) power many modern applications, but serving them at scale remains costly and resource-intensive. Current server-centric systems overlook consumer-grade GPUs at the edge.
We introduce SpecEdge, an edge-assisted inference framework that splits LLM workloads between edge and server GPUs using a speculative decoding scheme, exchanging only token outputs over the network.
SpecEdge employs proactive edge drafting to overlap edge token creation with server verification and pipeline-aware scheduling that interleaves multiple user requests to increase server-side throughput.
Experiments show that SpecEdge improves overall cost efficiency by 1.91x by achieving 2.22x server throughput, and reduces inter-token latency by 11.24% compared to a server-only baseline, introducing a scalable, cost-effective paradigm for LLM serving.
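To make the mechanism concrete, below is a minimal, runnable sketch of the greedy draft-and-verify loop that speculative decoding builds on, with a toy deterministic model standing in for the draft and target LLMs. All names here (ToyLM, edge_draft, server_verify) are illustrative assumptions, not SpecEdge's actual API; SpecEdge additionally overlaps edge drafting with server verification and schedules multiple request streams on the server.

```python
# Minimal sketch of greedy speculative decoding (illustrative, not SpecEdge's API).

class ToyLM:
    """Stand-in for a real LLM: deterministically maps a context to a token ID."""
    def __init__(self, bias=0):
        self.bias = bias

    def next_token(self, context):
        return (sum(context) + self.bias) % 50

def edge_draft(draft_model, prefix, k):
    """Edge side: greedily draft k candidate tokens with the small model."""
    drafted = []
    for _ in range(k):
        drafted.append(draft_model.next_token(prefix + drafted))
    return drafted

def server_verify(target_model, prefix, drafted):
    """Server side: check drafted tokens against the large model.

    A real system does this in a single batched forward pass over
    prefix + drafted; it is unrolled here for clarity. Every round trip
    yields at least one target-model token.
    """
    accepted, ctx = [], list(prefix)
    for tok in drafted:
        target_tok = target_model.next_token(ctx)
        if target_tok != tok:
            return accepted + [target_tok]   # first mismatch: correct and stop
        accepted.append(tok)
        ctx.append(tok)
    return accepted + [target_model.next_token(ctx)]  # all accepted: bonus token

def generate(draft_model, target_model, prompt, max_tokens, k=4):
    output = list(prompt)
    while len(output) < max_tokens:
        drafted = edge_draft(draft_model, output, k)             # edge GPU
        output += server_verify(target_model, output, drafted)   # server GPU
    return output

print(generate(ToyLM(bias=1), ToyLM(bias=0), prompt=[1, 2, 3], max_tokens=12))
```

Because only token IDs cross the edge-server boundary, each round trip carries a small network message rather than activations or KV-cache state.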
TopFull: An Adaptive Top-Down Overload Control for SLO-Oriented Microservices
Jinwoo Park, Jaehyeong Park, Youngmok Jung, Hwijoon Lim, Hyunho Yeo, and Dongsu Han
ACM Special Interest Group on Data Communication (SIGCOMM), 2024
Microservices have become a de facto standard for building large-scale cloud applications. Overload control is essential to prevent microservice failures and to maintain system performance under overload.
Although several approaches have been proposed, they are limited to mitigating overload at individual microservices and lack a view across interdependent microservices and APIs.
This paper presents TopFull, a holistic overload control framework for microservices that leverages global coordination to maximize the throughput that meets service-level objectives (i.e., goodput).
TopFull (a) dynamically clusters APIs according to their dependencies on overloaded microservices, (b) chooses which APIs to load-control among those in contending relationships, and (c) takes actions from RL agents that adaptively adjust the admitted rates of the APIs to maximize goodput.
Our experiments on various open-source benchmarks demonstrate that TopFull significantly increases the goodput under overload scenarios, outperforming DAGOR by 1.82x and Breakwater by 2.26x.
Furthermore, the Kubernetes autoscaler with TopFull serves up to 3.91x more requests under traffic surges and tolerates traffic spikes with up to 57% fewer resources than the Kubernetes autoscaler alone.
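As a rough illustration of the actuation loop, the sketch below adjusts each API's admitted rate multiplicatively based on an agent's action. The state features, discrete action set, and placeholder policy are my own simplifications for illustration, not TopFull's actual design.

```python
# Illustrative RL-style admission control; not TopFull's actual design.
import random

ACTIONS = [0.7, 0.9, 1.0, 1.1, 1.3]    # multiplicative rate adjustments

class AdmissionController:
    """Per-API rate limiter whose admitted rate is steered by a policy."""
    def __init__(self, api, init_rate):
        self.api = api
        self.rate = init_rate           # admitted requests/sec

    def step(self, goodput, p99_ms, slo_ms, policy):
        state = (goodput / self.rate, p99_ms / slo_ms)   # assumed features
        self.rate = max(1.0, self.rate * ACTIONS[policy(state)])
        return self.rate

def placeholder_policy(state):
    """Stand-in for a trained RL agent: back off on SLO violation,
    probe upward when the admitted rate is fully utilized."""
    utilization, slo_ratio = state
    if slo_ratio > 1.0:
        return 0                        # cut the rate sharply
    return random.choice([3, 4]) if utilization > 0.95 else 2

ctrl = AdmissionController("checkout", init_rate=200.0)
for goodput, p99 in [(198.0, 80.0), (199.0, 130.0), (140.0, 60.0)]:
    print(round(ctrl.step(goodput, p99, slo_ms=100.0, policy=placeholder_policy)))
```

In TopFull, a trained RL agent replaces the placeholder policy and coordinates rates across the APIs that contend for an overloaded microservice.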
Graph Neural Network-Based SLO-Aware Proactive Resource Autoscaling Framework for Microservices
Jinwoo Park, Byungkwon Choi, Chunghan Lee, and Dongsu Han
IEEE/ACM Transactions on Networking, 2024
The microservice architectural style is widely adopted in various latency-sensitive cloud applications.
As with monoliths, autoscaling has attracted operators' attention as a way to manage the resource utilization of microservices.
However, it is still challenging to optimize resources with respect to a latency service-level objective (SLO) without human intervention.
In this paper, we present GRAF, a graph neural network-based SLO-aware proactive resource autoscaling framework that minimizes total CPU resources while satisfying the latency SLO.
GRAF leverages front-end workload, distributed tracing data, and machine learning approaches to (a) observe and estimate the impact of traffic changes, (b) find optimal resource combinations, and (c) perform proactive resource allocation.
Experiments using various open-source benchmarks demonstrate that GRAF successfully targets latency SLO while saving up to 19% of total CPU resources compared to the fine-tuned autoscaler.
GRAF also handles a traffic surge with 36% fewer resources while achieving up to 2.6x faster tail latency convergence compared to the Kubernetes autoscaler.
Moreover, we verify the scalability of GRAF on large-scale deployments, where it saves 21.6% of CPU resources and 25.4% of memory resources.
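For intuition, here is a minimal sketch of latency prediction over a microservice call graph with a small message-passing network. The two-layer aggregation, feature choice, and untrained weights are generic stand-ins rather than GRAF's actual architecture, which is trained on distributed traces.

```python
# Generic GNN latency predictor over a call graph; not GRAF's architecture.
import numpy as np

def gnn_latency(adj, feats, w1, w2):
    """Two rounds of neighbor aggregation, then a per-service latency head."""
    h = np.maximum(adj @ feats @ w1, 0.0)   # layer 1: aggregate + ReLU
    h = np.maximum(adj @ h @ w2, 0.0)       # layer 2
    return h.sum(axis=1)                    # per-service latency estimate

# Toy 3-service chain: frontend -> cart -> db (adjacency with self-loops).
adj = np.array([[1.0, 1.0, 0.0],
                [0.0, 1.0, 1.0],
                [0.0, 0.0, 1.0]])
feats = np.array([[100.0, 2.0],             # per-service [request rate, CPU cores]
                  [100.0, 1.0],
                  [100.0, 1.0]])
rng = np.random.default_rng(0)
w1, w2 = rng.normal(size=(2, 8)), rng.normal(size=(8, 8))
print(gnn_latency(adj, feats, w1, w2))      # untrained weights: illustrates shapes only
```

A search loop can then query such a predictor with candidate CPU allocations and pick the cheapest one whose predicted latency meets the SLO.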
SAND: A Storage Abstraction for Video-based Deep Learning
Uitaek Hong, Hwijoon Lim, Hyunho Yeo, Jinwoo Park, and Dongsu Han
15th ACM Workshop on Hot Topics in Storage and File Systems (HotStorage), 2023
AccelIR: Task-Aware Image Compression for Accelerating Neural Restoration
Juncheol Ye, Hyunho Yeo, Jinwoo Park, and Dongsu Han
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023
Recently, deep neural networks have been successfully applied to image restoration (IR) (e.g., super-resolution, de-noising, de-blurring). Despite their promising performance, running IR networks requires heavy computation.
A large body of work has been devoted to addressing this issue by designing novel neural networks or pruning their parameters.
However, the common limitation is that while images are saved in a compressed format before being enhanced by IR, prior work does not consider the impact of compression on the IR quality.
In this paper, we present AccelIR, a framework that optimizes image compression considering the end-to-end pipeline of IR tasks.
AccelIR encodes an image through IR-aware compression that optimizes compression levels across image blocks within an image according to the impact on the IR quality.
Then, it runs a lightweight IR network on the compressed image, effectively reducing IR computation, while maintaining the same IR quality and image size.
Our extensive evaluation using seven IR networks shows that AccelIR can reduce the computing overhead of super-resolution, de-noising, and de-blurring by 49%, 29%, and 32% on average, respectively.
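The heart of the idea can be phrased as a budgeted allocation problem: spend bytes where they help restoration quality most. Below is a small greedy sketch over synthetic per-block numbers; AccelIR instead predicts the IR-quality impact with a lightweight network, so the table and the block/quality granularity here are illustrative assumptions.

```python
# Greedy IR-aware bit allocation under a size budget (synthetic numbers).

QUALITIES = [50, 70, 90]   # assumed per-block JPEG quality levels

def greedy_allocate(blocks, budget):
    """blocks: {block_id: {quality: (bytes, predicted_ir_gain)}}.
    Start every block at the lowest quality, then repeatedly upgrade the
    block with the best predicted-IR-gain per extra byte until the budget
    is exhausted."""
    choice = {b: QUALITIES[0] for b in blocks}
    spent = sum(blocks[b][choice[b]][0] for b in blocks)
    while True:
        best = None
        for b in blocks:
            idx = QUALITIES.index(choice[b])
            if idx + 1 == len(QUALITIES):
                continue                       # already at max quality
            nq = QUALITIES[idx + 1]
            d_bytes = blocks[b][nq][0] - blocks[b][choice[b]][0]
            d_gain = blocks[b][nq][1] - blocks[b][choice[b]][1]
            if spent + d_bytes <= budget and (best is None or d_gain / d_bytes > best[0]):
                best = (d_gain / d_bytes, b, nq, d_bytes)
        if best is None:
            return choice
        _, b, nq, d_bytes = best
        choice[b], spent = nq, spent + d_bytes

# Blocks with flat content (sky) gain little from extra bytes; detailed
# blocks (face) gain a lot, so they receive the higher quality levels.
blocks = {
    "sky":     {50: (400, 0.10), 70: (650, 0.12), 90: (1100, 0.13)},
    "face":    {50: (500, 0.20), 70: (800, 0.45), 90: (1400, 0.60)},
    "texture": {50: (450, 0.15), 70: (700, 0.30), 90: (1300, 0.38)},
}
print(greedy_allocate(blocks, budget=2600))    # {'sky': 50, 'face': 90, 'texture': 70}
```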
GRAF: A graph neural network based proactive resource allocation framework for SLO-oriented microservices
Jinwoo Park, Byungkwon Choi, Chunghan Lee, and Dongsu Han
Proceedings of the 17th International Conference on emerging Networking EXperiments and Technologies (CoNEXT), 2021
The microservice architectural style has been widely adopted in various latency-sensitive applications. As with monoliths, autoscaling has attracted operators' attention as a way to manage the resource utilization of microservices.
However, it is still challenging to optimize resources with respect to a latency service-level objective (SLO) without human intervention.
In this paper, we present GRAF, a graph neural network-based proactive resource allocation framework that minimizes total CPU resources while satisfying the latency SLO.
GRAF leverages front-end workload, distributed tracing data, and machine learning approaches to (a) observe and estimate the impact of traffic changes, (b) find optimal resource combinations, and (c) perform proactive resource allocation.
Experiments using various open-source benchmarks demonstrate that GRAF successfully targets latency SLO while saving up to 19% of total CPU resources compared to the fine-tuned autoscaler.
Moreover, GRAF handles traffic surges with 36% fewer resources while achieving up to 2.6x faster tail latency convergence compared to the Kubernetes autoscaler.
Neuro-DCF: Design of wireless MAC via multi-agent reinforcement learning approach
Sangwoo Moon, Sumyeong Ahn, Kyunghwan Son, Jinwoo Park, and Yung Yi
Proceedings of the Twenty-second International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing (MobiHoc), 2021
The carrier sense multiple access (CSMA) algorithm has been used for wireless medium access control (MAC) in the standard 802.11 implementation due to its simplicity and generality.
An extensive body of research on CSMA has developed over the years, not only in the context of practical protocols but also toward distributed, optimal MAC scheduling.
However, the current state-of-the-art CSMA (and its extensions) still suffers from poor performance, especially in multi-hop scenarios, and often requires patch-based solutions rather than a universal one.
In this paper, we propose an algorithm that adopts an experience-driven approach and trains a CSMA-based wireless MAC using deep reinforcement learning; we name our protocol Neuro-DCF.
Two key challenges are: (i) a stable training method for distributed execution and (ii) a unified training method for embracing various interference patterns and configurations.
For (i), we adopt a multi-agent reinforcement learning framework, and for (ii) we introduce a novel graph neural network (GNN)-based training structure.
We provide extensive simulation results demonstrating that Neuro-DCF significantly outperforms 802.11 DCF and O-DCF, a recent theory-based MAC protocol, especially in delay performance, while preserving optimal utility.
We believe our multi-agent reinforcement learning-based approach will be of broad interest to other learning-based network controllers in different layers that require distributed operation.
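To illustrate the decentralized-execution side, the toy slotted simulation below has every node run one shared policy over purely local observations (carrier sense and queue backlog) to decide transmit or defer in each slot. The observation features and the heuristic standing in for a trained policy are my own assumptions; Neuro-DCF trains its policy with multi-agent RL and a GNN over interference patterns, not anything shown here.

```python
# Toy slotted-MAC simulation with one shared per-node policy (illustrative).
import random

def shared_policy(obs):
    """Stand-in for a trained policy: transmit with a probability that
    grows with queue backlog and drops to zero when the channel was busy."""
    channel_busy, backlog = obs
    p = 0.0 if channel_busy else min(0.5, 0.05 * backlog)
    return random.random() < p

def simulate(n_nodes=5, slots=1000, arrival_prob=0.1):
    queues = [random.randint(1, 10) for _ in range(n_nodes)]
    delivered = collisions = 0
    busy = False                       # last slot's carrier-sense outcome
    for _ in range(slots):
        txs = [i for i in range(n_nodes)
               if queues[i] > 0 and shared_policy((busy, queues[i]))]
        if len(txs) == 1:              # exactly one transmitter: success
            queues[txs[0]] -= 1
            delivered += 1
        elif len(txs) > 1:             # simultaneous transmissions collide
            collisions += 1
        busy = len(txs) > 0
        for i in range(n_nodes):       # Bernoulli packet arrivals
            queues[i] += random.random() < arrival_prob
    return delivered, collisions

print(simulate())
```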
pHPA: A proactive autoscaling framework for microservice chain
Byungkwon Choi, Jinwoo Park, Chunghan Lee, and Dongsu Han
5th Asia-Pacific Workshop on Networking, 2021
The microservice architectural style breaks down monolithic applications into smaller services and has been widely adopted by a variety of enterprises.
As with monoliths, autoscaling has attracted operators' attention for scaling microservices. However, most existing autoscaling approaches do not consider the microservice chain and severely degrade microservice performance when traffic surges.
In this paper, we present pHPA, an autoscaling framework for microservice chains. pHPA proactively allocates resources to microservice chains and effectively handles traffic surges.
Our evaluation using various open-source benchmarks shows that, under traffic surges, pHPA reduces 99th-percentile latency and resource usage by up to 70% and 58%, respectively, compared to the most widely used autoscaler.
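A minimal sketch of the chain-aware, proactive idea: forecast front-end traffic one step ahead, propagate it down the chain with per-hop request amplification factors, and size each service's replicas before the surge arrives. The chain, amplification factors, per-replica capacities, and the linear-extrapolation predictor are all illustrative assumptions, not pHPA's actual models.

```python
# Chain-aware proactive replica planning (all numbers illustrative).
import math

CHAIN = ["frontend", "cart", "payment"]
AMPLIFICATION = {"frontend": 1.0, "cart": 2.0, "payment": 1.5}   # reqs per upstream req
CAPACITY = {"frontend": 100, "cart": 80, "payment": 50}          # reqs/sec per replica

def predict_next_rps(history):
    """Placeholder predictor: linear extrapolation of the last two samples."""
    return max(0.0, 2 * history[-1] - history[-2])

def plan_replicas(history):
    """Propagate the predicted front-end rate down the chain and size each
    service's replicas before the surge reaches it."""
    rps = predict_next_rps(history)
    plan = {}
    for svc in CHAIN:
        rps *= AMPLIFICATION[svc]
        plan[svc] = math.ceil(rps / CAPACITY[svc])
    return plan

# A reactive per-service autoscaler would scale "payment" only after the
# surge propagates to it; here it is sized together with the front end.
print(plan_replicas([300.0, 420.0]))   # {'frontend': 6, 'cart': 14, 'payment': 33}
```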