This repository contains our solution for the LG Aimers 8th AI Hackathon, where we achieved 3rd place.
- Task: Model compression of EXAONE-4.0-1.2B
- Goal: Reduce model size and improve efficiency while maintaining performance under a fully private evaluation setting
Hackathon page: https://dacon.io/competitions/official/236689/overview/description
We propose a practical and robust compression pipeline tailored for LLM deployment under constrained environments.
- We handled activation outliers by applying W8A8 quantization and QuantizationModifier, which gave more stable and reliable quantization than naive approaches. (We determined the best settings empirically.)
- We compressed the KV cache to FP8 precision, which effectively reduced the memory bandwidth bottleneck and improved inference speed.
- We included two types of calibration data, both synthesized by Gemini: instruction-following QA (i.e., in the style of IFEval from Google) and general Korean text. Because the evaluation setting was fully private, we chose the most common task types (raw text and instruction-following QA) to help the model generalize.
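
To illustrate why activation outliers matter for 8-bit quantization, here is a minimal sketch in plain Python. It is not the code from our pipeline: the symmetric int8 scheme, the per-token scale, and the toy activation values are all assumptions chosen for illustration. It compares one scale shared across all tokens against a separate scale per token when a single value is an outlier.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric int8 quantization with a single scale for `values`."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Map int8 codes back to floats."""
    return [q * scale for q in quantized]


def mean_abs_error(a: List[float], b: List[float]) -> float:
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Toy activations (hypothetical values): the second token holds an outlier.
tokens = [
    [0.10, -0.20, 0.05, 0.15],
    [0.12, -0.18, 60.0, 0.07],
]

# One scale for everything: the outlier stretches the scale,
# so the small values of the first token all round to zero.
flat = [v for row in tokens for v in row]
q_flat, shared_scale = quantize_int8(flat)
recon_flat = dequantize(q_flat, shared_scale)
per_tensor_err = mean_abs_error(tokens[0], recon_flat[:4])

# One scale per token: the first token is no longer affected by the outlier.
q_tok0, tok0_scale = quantize_int8(tokens[0])
per_token_err = mean_abs_error(tokens[0], dequantize(q_tok0, tok0_scale))

print(f"shared scale error on token 0:    {per_tensor_err:.5f}")
print(f"per-token scale error on token 0: {per_token_err:.5f}")
```

With a shared scale the outlier dominates the quantization range and the ordinary values lose almost all resolution, which is the failure mode that outlier-aware recipes try to avoid.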
- Finding the optimal combination of quantization values and recipe empirically is important
- KV cache compression is a highly effective but often overlooked optimization lever
- EXAONE-4.0-1.2B has duplicated layers that can be removed without significant performance loss
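
The KV cache point can be made concrete with a rough back-of-the-envelope calculation. The sketch below is only an estimate under stated assumptions: the layer count, KV head count, head dimension, and sequence length are hypothetical placeholders, not the actual EXAONE-4.0-1.2B configuration, and the small overhead of storing FP8 scaling factors is ignored.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int,
    bytes_per_elem: int,
) -> int:
    """Approximate KV cache size: one K and one V tensor per layer."""
    return (
        2 * num_layers * num_kv_heads * head_dim
        * seq_len * batch_size * bytes_per_elem
    )


# Hypothetical model shape, for illustration only.
cfg = dict(num_layers=30, num_kv_heads=8, head_dim=64, seq_len=4096, batch_size=1)

fp16_bytes = kv_cache_bytes(**cfg, bytes_per_elem=2)  # 16-bit cache
fp8_bytes = kv_cache_bytes(**cfg, bytes_per_elem=1)   # 8-bit cache

mib = 1024 * 1024
print(f"FP16 KV cache: {fp16_bytes / mib:.0f} MiB")
print(f"FP8 KV cache:  {fp8_bytes / mib:.0f} MiB")
```

Because every decoding step reads the whole cache, halving the bytes per element also roughly halves the memory traffic for attention, which is why the saving shows up as speed and not only as a smaller footprint.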
You can find our presentation slides here.
We thank the organizers of LG Aimers and DACON for providing a challenging and well-designed benchmark environment.