Vision-Language Coach for Ingestive Behaviors
Fine-tuning Qwen2.5-VL with LoRA to jointly classify eating-behavior quality and generate clinician-style coaching feedback (NIH R01 — DIBS).
A vision-language model that watches a person eat and produces both a behavior-quality score and free-text coaching feedback in a single pass. Built on Qwen2.5-VL with LoRA adapters and trained on profile-view meal video collected by the DIBS team.
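To illustrate what a single-pass output carrying both a score and feedback might look like downstream, here is a minimal sketch of a parser. It assumes the model is prompted to emit one JSON object with `label` and `feedback` keys; that schema, the `CoachOutput` type, and the `parse_coach_output` helper are hypothetical and not taken from the project.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class CoachOutput:
    label: str     # hypothetical behavior-quality class, e.g. "good"
    feedback: str  # free-text coaching feedback


def parse_coach_output(text: str) -> Optional[CoachOutput]:
    """Extract one JSON object from generated text; return None if malformed."""
    # Tolerate extra text around the JSON by slicing from the first "{"
    # to the last "}".
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        obj = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    label, feedback = obj.get("label"), obj.get("feedback")
    # Both fields must be present and be strings (assumed schema).
    if not isinstance(label, str) or not isinstance(feedback, str):
        return None
    return CoachOutput(label=label, feedback=feedback)
```

Returning `None` on malformed generations, rather than raising, makes it straightforward to count what fraction of outputs are parseable.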
Key contributions
- Designed a multi-bite temporal windowing strategy that captures chewing rhythm and inter-bite pauses, improving classification accuracy over per-bite clips.
- Showed that a single LoRA adapter on a VLM matches the classification accuracy of a prior multi-head spatial-temporal architecture while adding clinician-grounded language generation in one pass.
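- Built a participant-level cross-validation pipeline, with folds split by participant so that no individual appears in both training and evaluation data, and with model outputs that are structurally parseable for clinical review.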
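The windowing and participant-level splitting ideas above can be sketched in a few lines. This is an illustration under stated assumptions, not the project's implementation: bites are represented by hypothetical timestamps in seconds, the window size and stride are placeholder values, and the round-robin fold assignment is a stand-in for whatever splitter the pipeline uses (scikit-learn's `GroupKFold` is a common alternative).

```python
def multi_bite_windows(bite_times, window_size=3, stride=1):
    """Group consecutive bite timestamps (seconds) into overlapping windows.

    Each window records its start and end time plus the inter-bite pauses
    it spans. window_size and stride are illustrative defaults.
    """
    if window_size < 2:
        raise ValueError("window_size must be >= 2 to contain a pause")
    times = sorted(bite_times)
    windows = []
    for i in range(0, len(times) - window_size + 1, stride):
        chunk = times[i:i + window_size]
        pauses = [b - a for a, b in zip(chunk, chunk[1:])]
        windows.append({"start": chunk[0], "end": chunk[-1], "pauses": pauses})
    return windows


def participant_folds(sample_participants, n_folds=3):
    """Return (train_indices, test_indices) per fold, split by participant.

    Whole participants are assigned to folds round-robin, so no participant
    contributes samples to both sides of any fold.
    """
    participants = sorted(set(sample_participants))
    fold_of = {p: i % n_folds for i, p in enumerate(participants)}
    folds = []
    for k in range(n_folds):
        test = [i for i, p in enumerate(sample_participants) if fold_of[p] == k]
        train = [i for i, p in enumerate(sample_participants) if fold_of[p] != k]
        folds.append((train, test))
    return folds
```

For example, `multi_bite_windows([0.0, 4.0, 9.5, 12.0])` yields two three-bite windows, each carrying two inter-bite pauses. Splitting by participant rather than by clip avoids leaking a person's appearance or eating style from training into evaluation.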
Presented as a first-author poster at the ECBE Graduate Student Poster Competition, University of Rhode Island, 2026.
Funded under NIH R01 — Diet, Ingestion, and Behavior Sensing (DIBS).