ENCODE Lab


Yungu Campus

Westlake University

Hangzhou, China

About Us:
The ENCODE Lab is led by Dr. Huan Wang, a Tenure-Track Assistant Professor in the AI Department at Westlake University. Our lab is dedicated to advancing the field of Artificial Intelligence by focusing on creating efficient and effective AI solutions.

Research Focus:
Our research focuses on Efficient AI in vision and language modeling, spanning image classification / detection / segmentation [GReg, PaI-Survey, TPP], neural style transfer [Ultra-Resolution-NST], single image super-resolution [ASSL/GASSL, SRP, ISS-P, Oracle-Pruning-Sanity-Check], 3D novel view synthesis / neural rendering / NeRF / NeLF [R2L, MobileR2L, LightAvatar], AIGC / diffusion models / Stable Diffusion [SnapFusion, FreeBlend], LLM / MLLM [DyCoke, Poison-as-Cure], and snapshot compressive imaging (SCI) [QuantizedSCI, MobileSCI].

Our Mission:
Our mission is to advance AI by creating efficient, broadly applicable methods and models. We’re dedicated to driving both theoretical innovation and tangible solutions for diverse real-world problems.

News

2025/02 [CVPR'25] DyCoke is accepted by CVPR’25! Congrats to Keda!🎉 DyCoke is a training-free, plug-and-play token compression method for fast video LLMs: 1.5x wall-clock inference speedup and 1.4x memory reduction with no performance drop. [arxiv] [code]
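To make "training-free token compression" concrete, here is a minimal sketch of the generic idea of attention-based token pruning for video LLMs. This is an illustrative heuristic only, not the actual DyCoke algorithm; the function name and the keep-top-k-by-attention criterion are assumptions for the example.

```python
import numpy as np

def prune_tokens(tokens: np.ndarray, attn: np.ndarray, keep_ratio: float = 0.5):
    """Generic training-free token compression sketch (NOT DyCoke itself).

    tokens: (N, D) token embeddings; attn: (N, N) attention matrix.
    Keeps the top-k tokens by average attention received, a common
    plug-and-play heuristic for shrinking the video token sequence.
    """
    scores = attn.mean(axis=0)                    # avg attention each token receives
    k = max(1, int(len(tokens) * keep_ratio))
    keep = np.sort(np.argsort(scores)[::-1][:k])  # top-k indices, original order
    return tokens[keep], keep

rng = np.random.default_rng(0)
toks = rng.normal(size=(8, 4))   # 8 video tokens, 4-dim embeddings
attn = rng.random((8, 8))        # toy attention matrix
kept, idx = prune_tokens(toks, attn, keep_ratio=0.5)
print(kept.shape)  # (4, 4)
```

Because the criterion only reads attention scores produced at inference time, no retraining is needed, which is what makes this family of methods plug-and-play.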
2025/02 [Preprint] Can diffusion models blend visual concepts that are semantically very dissimilar (e.g., an orange and a teddy bear)? Yes: we introduce FreeBlend, a new method that blends arbitrary concepts. [arxiv] [code] [webpage]
2025/01 [Preprint] Is adversarial visual noise always "poison" to our models? No: we find it can also be a cure that mitigates the hallucination problem of VLMs. [arxiv] [code] [webpage]
2025/01 [ICLR'25] One paper on distilling large foundation models at low cost, "Compressing Vision Foundation Models at ImageNet-level Costs", is accepted by ICLR'25. Thanks to the lead author Yitian!
2024/12 [Preprint] We present empirical evidence to show that oracle pruning, the “ground-truth” pruning paradigm that has been followed for around 35 years in the pruning community, does not hold in practice. [arxiv][webpage]
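For context, the classic oracle criterion removes the parameter whose deletion least increases the loss. The toy example below illustrates that criterion on a linear model; it is a generic illustration of the paradigm the paper questions, not the paper's experiments, and the function names are assumptions.

```python
import numpy as np

def loss(w, x, y):
    """Mean squared error of a linear model."""
    return float(np.mean((x @ w - y) ** 2))

def oracle_prune_one(w, x, y):
    """Oracle criterion: zero out the single weight whose removal
    keeps the loss lowest, found by exhaustively trying each one."""
    best_i, best_loss = None, np.inf
    for i in range(len(w)):
        trial = w.copy()
        trial[i] = 0.0
        l = loss(trial, x, y)
        if l < best_loss:
            best_i, best_loss = i, l
    pruned = w.copy()
    pruned[best_i] = 0.0
    return pruned, best_i

rng = np.random.default_rng(1)
x = rng.normal(size=(32, 5))
w_true = np.array([2.0, 0.0, -1.0, 0.5, 3.0])  # second weight is ~useless
y = x @ w_true
w = w_true + 0.01 * rng.normal(size=5)
pruned, i = oracle_prune_one(w, x, y)  # picks the near-zero weight (index 1)
```

The paper's finding is that this locally optimal, exhaustive choice, despite its "ground-truth" status in the community, does not reliably predict the best model after fine-tuning.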
2024/07 [NeurIPS'24] We introduce Scala, a training framework for learning slimmable ViTs. With Scala, a ViT is trained once but can run inference at different widths, matching the needs of devices with different resource budgets. The project is led by Yitian. Congrats!
2024/07 [MM'24] We present the first real-time on-device video SCI (Snapshot Compressive Imaging) framework via dedicated network design and a distillation-based training strategy. Congrats to Miao!
2024/07 [ECCV'24] One paper about efficient video SCI (Snapshot Compressive Imaging) via network quantization is accepted by ECCV’24 as an oral. Congrats to Miao! [code]

Selected Publications

  1. arXiv’25/05
    HoliTom: Holistic Token Merging for Fast Video Large Language Models
    arXiv preprint arXiv:2505.21334, 2025
  2. arXiv’25/05
    Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps
    Sicheng Feng*, Song Wang*, Shuyi Ouyang, Lingdong Kong, Zikai Song, Jianke Zhu, Huan Wang, and Xinchao Wang
    arXiv preprint arXiv:2505.18675, 2025
  3. CVPR’25
    DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models
    CVPR, 2025
  4. ICLR’25
    Accessing Vision Foundation Models at ImageNet-level Costs
    ICLR, 2025
  5. arXiv’24/11
    Is Oracle Pruning the True Oracle?
    arXiv preprint arXiv:2412.00143, 2024
  6. ACM MM’24
    Towards Real-time Video Compressive Sensing on Mobile Devices
    Miao Cao, Lishun Wang, Huan Wang, Guoqing Wang, and Xin Yuan
    ACM MM, 2024
  7. ECCV’24 Oral
    A Simple Low-bit Quantization Framework for Video Snapshot Compressive Imaging
    Miao Cao, Lishun Wang, Huan Wang, and Xin Yuan
    ECCV, 2024