publications
(*) denotes equal contribution
2025
- Every expert matters: Towards effective knowledge distillation for mixture-of-experts language models
  Gyeongman Kim*, Gyouk Chu*, and Eunho Yang
  arXiv preprint arXiv:2502.12947, 2025
With the emergence of Mixture-of-Experts (MoE), the efficient scaling of model size has accelerated the development of large language models in recent years. However, their high memory requirements prevent their use in resource-constrained environments. While knowledge distillation (KD) has been a proven method for model compression, its application to MoE teacher models remains underexplored. Through our investigation, we discover that non-activated experts in MoE models possess valuable knowledge that benefits student models. We further demonstrate that existing KD methods are not optimal for compressing MoE models, as they fail to leverage this knowledge effectively. To address this, we propose two intuitive MoE-specific KD methods for the first time: Knowledge Augmentation (KA) and Student-Aware Router (SAR), both designed to effectively extract knowledge from all experts. Specifically, KA augments knowledge by sampling experts multiple times, while SAR uses all experts and adjusts the expert weights through router training to provide optimal knowledge. Extensive experiments show that our methods outperform conventional KD methods, demonstrating their effectiveness for MoE teacher models.
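To make the idea concrete, here is a minimal, illustrative sketch (not code from the paper) of a knowledge-augmentation-style distillation loss: the teacher's logits are obtained several times under different expert samplings, and the student's distribution is matched to each sample via a temperature-scaled KL divergence. All function names are hypothetical.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a 1-D logit vector."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions."""
    p, q = np.asarray(p), np.asarray(q)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def ka_style_distill_loss(teacher_logit_samples, student_logits, temperature=2.0):
    """Average KL between the student and several teacher outputs,
    each obtained by re-sampling which experts are activated."""
    student_probs = softmax(student_logits, temperature)
    losses = [
        kl_divergence(softmax(t, temperature), student_probs)
        for t in teacher_logit_samples
    ]
    return sum(losses) / len(losses)
```

When the student already matches every sampled teacher output, the loss is zero; otherwise it grows with the mismatch, so averaging over expert samplings exposes the student to knowledge from experts a single forward pass would not activate.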
@article{kim2025every, title = {Every expert matters: Towards effective knowledge distillation for mixture-of-experts language models}, author = {Kim, Gyeongman and Chu, Gyouk and Yang, Eunho}, journal = {arXiv preprint arXiv:2502.12947}, year = {2025}, }
- ReviewScore: Misinformed Peer Review Detection with Large Language Models
  Hyun Ryu, Doohyuk Jang, Hyemin S Lee, and 8 more authors
  arXiv preprint arXiv:2509.21679, 2025
Peer review serves as a backbone of academic research, but in most AI conferences, review quality is degrading as the number of submissions explodes. To reliably detect low-quality reviews, we define misinformed review points as either "weaknesses" in a review that contain incorrect premises, or "questions" in a review that are already answered by the paper. We verify that 15.2% of weaknesses and 26.4% of questions are misinformed and introduce ReviewScore, which indicates whether a review point is misinformed. To evaluate the factuality of each premise of a weakness, we propose an automated engine that reconstructs every explicit and implicit premise from the weakness. We build a human expert-annotated ReviewScore dataset to check the ability of LLMs to automate ReviewScore evaluation. Then, we measure human-model agreement on ReviewScore using eight current state-of-the-art LLMs and verify moderate agreement. We also show that evaluating premise-level factuality yields significantly higher agreement than evaluating weakness-level factuality. A thorough disagreement analysis further supports the potential of fully automated ReviewScore evaluation.
@article{ryu2025reviewscore, title = {ReviewScore: Misinformed Peer Review Detection with Large Language Models}, author = {Ryu, Hyun and Jang, Doohyuk and Lee, Hyemin S and Jeong, Joonhyun and Kim, Gyeongman and Cho, Donghyeon and Chu, Gyouk and Hwang, Minyeong and Jang, Hyeongwon and Kim, Changhun and others}, journal = {arXiv preprint arXiv:2509.21679}, year = {2025}, }
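The human-model agreement the abstract mentions can be measured with standard statistics. As a rough illustration (the paper's exact protocol is not shown here), the following sketch computes raw agreement and chance-corrected agreement (Cohen's kappa) between human and model labels of whether a review point is misinformed; the function names are hypothetical.

```python
def agreement_rate(human_labels, model_labels):
    """Fraction of review points where human and model labels match."""
    assert len(human_labels) == len(model_labels)
    matches = sum(h == m for h, m in zip(human_labels, model_labels))
    return matches / len(human_labels)

def cohens_kappa(human_labels, model_labels):
    """Chance-corrected agreement between two annotators."""
    n = len(human_labels)
    p_observed = agreement_rate(human_labels, model_labels)
    labels = set(human_labels) | set(model_labels)
    # Expected agreement if both annotators labeled independently at random
    p_expected = sum(
        (human_labels.count(l) / n) * (model_labels.count(l) / n)
        for l in labels
    )
    if p_expected == 1.0:
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)
```

Kappa discounts agreement that would occur by chance alone, which matters when one label (e.g., "not misinformed") dominates the dataset.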
2024
- KorMedMCQA: Multi-choice question answering benchmark for Korean healthcare professional licensing examinations
  Sunjun Kweon*, Byungjin Choi*, Gyouk Chu, and 7 more authors
  arXiv preprint arXiv:2403.01469, 2024
We present KorMedMCQA, the first Korean Medical Multiple-Choice Question Answering benchmark, derived from professional healthcare licensing examinations conducted in Korea between 2012 and 2024. The dataset contains 7,469 questions from examinations for doctors, nurses, pharmacists, and dentists, covering a wide range of medical disciplines. We evaluate the performance of 59 large language models, spanning proprietary and open-source models, multilingual and Korean-specialized models, and those fine-tuned for clinical applications. Our results show that applying Chain of Thought (CoT) reasoning can enhance model performance by up to 4.5% compared to direct answering approaches. We also investigate whether MedQA, one of the most widely used medical benchmarks derived from the U.S. Medical Licensing Examination, can serve as a reliable proxy for evaluating model performance in other regions, in this case Korea. Our correlation analysis between model scores on KorMedMCQA and MedQA reveals that these two benchmarks align no better than benchmarks from entirely different domains (e.g., MedQA and MMLU-Pro). This finding underscores the substantial linguistic and clinical differences between Korean and U.S. medical contexts, reinforcing the need for region-specific medical QA benchmarks. To support ongoing research in Korean healthcare AI, we publicly release KorMedMCQA via Hugging Face.
@article{kweon2024kormedmcqa, title = {KorMedMCQA: Multi-choice question answering benchmark for Korean healthcare professional licensing examinations}, author = {Kweon, Sunjun and Choi, Byungjin and Chu, Gyouk and Song, Junyeong and Hyeon, Daeun and Gan, Sujin and Kim, Jueon and Kim, Minkyu and Park, Rae Woong and Choi, Edward}, journal = {arXiv preprint arXiv:2403.01469}, year = {2024}, }
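The benchmark-alignment claim rests on correlating per-model scores across benchmarks. As a minimal sketch of that kind of analysis (not the paper's code), the Pearson correlation over the 59 models' score pairs could be computed as follows:

```python
def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists,
    e.g., per-model accuracies on two benchmarks."""
    assert len(xs) == len(ys) and len(xs) > 1
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    std_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (std_x * std_y)
```

A correlation near 1 would mean one benchmark ranks models much like the other; the abstract's finding is that the KorMedMCQA-MedQA correlation is no stronger than that between unrelated benchmarks.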
2023
- Prediction-segmentation tasks for self-supervision of anomaly detection networks under noisy conditions
  Jihoon Choi, Gyouk Chu, and Jung-Woo Choi
  In INTER-NOISE and NOISE-CON Congress and Conference Proceedings, 2023
For detecting anomalies in sounds generated by electronic devices, self-supervised learning of deep neural networks (DNNs) has been widely employed. In self-supervised learning, a DNN model is trained on normal data to solve some pretext task, and test data yielding reduced task performance are regarded as anomalies. Popular choices for the pretext task are reconstruction and classification, where a model is trained to predict masked parts of the spectrogram or to classify the internal classes of normal data, respectively. However, the reconstruction task struggles to distinguish anomalies from noise in noisy conditions, and the classification task often fails to learn meaningful features when the diversity across internal classes is too small or too evident. We propose a combination of prediction and segmentation tasks to overcome these limitations. For the proposed tasks, two different machine sounds are mixed at a constant ratio, and a model is trained to predict both the future mixed spectrogram and the mixing ratio from the present and past sound mixture. We train a WaveNet-based model on the dual tasks simultaneously, which yields remarkable performance improvements over conventional models and achieves state-of-the-art performance on the DCASE 2020 Task 2 dataset.
@inproceedings{choi2023prediction, title = {Prediction-segmentation tasks for self-supervision of anomaly detection networks under noisy conditions}, author = {Choi, Jihoon and Chu, Gyouk and Choi, Jung-Woo}, booktitle = {INTER-NOISE and NOISE-CON Congress and Conference Proceedings}, volume = {268}, number = {6}, pages = {2241--2248}, year = {2023}, organization = {Institute of Noise Control Engineering}, }
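The dual-task setup described in the abstract can be sketched as a simple data-preparation step: mix two spectrograms at a fixed ratio, then split the mixture into the model's input context and the two targets (the future mixed frame and the mixing ratio). This is an illustrative sketch under assumed array shapes, not the paper's implementation.

```python
import numpy as np

def make_dual_task_example(spec_a, spec_b, ratio, context_frames):
    """Mix two machine-sound spectrograms (freq x time arrays) at a
    constant ratio and produce:
      x        -- past/present context frames (model input)
      y_future -- the next mixed frame (prediction-task target)
      y_ratio  -- the mixing ratio (segmentation-task target)
    """
    assert spec_a.shape == spec_b.shape
    assert 0.0 <= ratio <= 1.0
    mixed = ratio * spec_a + (1.0 - ratio) * spec_b
    x = mixed[:, :context_frames]        # frames the model observes
    y_future = mixed[:, context_frames]  # frame the model must predict
    return x, y_future, ratio
```

Training on both targets forces the model to track each source within the mixture, which is the property the authors exploit so that anomalies, unlike background noise, disrupt both the prediction and the ratio estimate.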