Selected Publications
I'm interested in vision-language models, agents, vision/video understanding, and beyond. If you share these interests, please feel free to reach out!
|
|
Training Task Experts Through Retrieval Based Distillation
Jiaxin Ge*,
Xueying Jia*,
Vijay Viswanathan,
Hongyin Luo,
Graham Neubig
preprint, 2024
abstract /
pdf /
code
One of the most reliable ways to create deployable models for specialized tasks is to obtain an adequate amount of high-quality task-specific data. However, for specialized tasks, such datasets often do not exist.
Existing methods address this by generating such data with large language models (LLMs) and then distilling that knowledge into smaller models. However, these methods are limited by the quality of the LLMs' outputs and tend to generate repetitive or incorrect data.
In this work, we present Retrieval Based Distillation (ReBase), a method that first retrieves data from rich online sources and then transforms it into domain-specific data.
This greatly enhances data diversity. Moreover, ReBase generates Chain-of-Thought reasoning and distills the reasoning capacity of LLMs. We test our method on four benchmarks, and the results show that it significantly improves performance by up to 7.8% on SQuAD, 1.37% on MNLI, and 1.94% on BigBench-Hard.
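To make the retrieve-then-transform idea concrete, here is a minimal sketch (illustrative only, not the released ReBase code): the datastore, the keyword-overlap retriever, and the stubbed LLM call are all assumptions introduced for this example.

```python
# Illustrative retrieve-then-transform pipeline.
# The datastore, retriever, and "LLM" transformation are toy stand-ins.

def retrieve_candidates(task_description, datastore, k=2):
    """Rank datastore rows by naive keyword overlap with the task description."""
    query_terms = set(task_description.lower().split())
    scored = sorted(
        datastore,
        key=lambda row: len(query_terms & set(row["text"].lower().split())),
        reverse=True,
    )
    return scored[:k]

def transform_to_task_format(row, task_description):
    """Placeholder for an LLM call that rewrites a retrieved row into a
    task-specific example with a chain-of-thought rationale."""
    return {
        "input": row["text"],
        "rationale": f"(LLM-generated reasoning for: {task_description})",
        "output": "(LLM-generated answer)",
    }

datastore = [
    {"text": "The Eiffel Tower is located in Paris, France."},
    {"text": "Water boils at 100 degrees Celsius at sea level."},
]
task = "question answering about European landmarks"
train_set = [
    transform_to_task_format(row, task)
    for row in retrieve_candidates(task, datastore)
]
print(train_set[0])
```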
|
|
Recursive Visual Programming
Jiaxin Ge,
Sanjay Subramanian,
Baifeng Shi,
Roei Herzig,
Trevor Darrell
ECCV, 2024
abstract /
pdf /
code
Visual Programming (VP) has emerged as a powerful framework for Visual Question Answering (VQA). By generating and executing bespoke code for each question, these methods demonstrate impressive compositional and reasoning capabilities, especially in few-shot and zero-shot scenarios. However, existing VP methods generate all code in a single function, resulting in code that is suboptimal in terms of both accuracy and interpretability. Inspired by human coding practices, we propose Recursive Visual Programming (RVP), which simplifies generated routines, provides more efficient problem solving, and can manage more complex data structures. RVP approaches VQA tasks with an iterative, recursive code generation scheme that decomposes complicated problems into smaller parts. Notably, RVP is capable of dynamic type assignment: as the system recursively generates a new piece of code, it autonomously determines the appropriate return type and crafts the requisite code to produce that output. We show RVP's efficacy through extensive experiments on benchmarks including VSR, COVR, GQA, and NextQA, underscoring the value of adopting human-like recursive and modular programming techniques for solving VQA tasks through coding.
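As a loose illustration of the recursive decomposition idea (a hand-written toy, not the paper's code generator: in RVP each sub-routine would be produced by a language model and executed over a real image), a compound question can be answered by recursing into simpler sub-questions:

```python
# Toy recursive decomposition for VQA-style questions.
# The "image" is a dict of object attributes; a real system would generate
# each branch of this logic with an LLM and execute it over actual pixels.

def recursive_query(image, question):
    """Answer a question, recursing into simpler sub-questions when possible."""
    if question.startswith("are both"):
        # e.g. "are both the cat and the dog black?" -> two "is the X black?" calls
        body = question.removeprefix("are both ").rstrip("?")
        left, right = body.split(" and the ")
        first = left.removeprefix("the ")
        second, _, attribute = right.rpartition(" ")
        return all(
            recursive_query(image, f"is the {name} {attribute}?")
            for name in (first, second)
        )
    if question.startswith("is the"):
        # Base case: a single attribute lookup returning a bool.
        _, _, name, attribute = question.rstrip("?").split()
        return attribute in image.get(name, set())
    raise ValueError(f"unsupported question: {question}")

image = {"cat": {"black", "small"}, "dog": {"black", "large"}}
print(recursive_query(image, "are both the cat and the dog black?"))  # True
```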
|
|
Self-Corrected Multimodal Large Language Model for End-to-End Robot
Manipulation
Jiaming Liu,
Chenxuan Li,
Guanqun Wang,
Lily Lee,
Kaichen Zhou,
Sixiang Chen,
Chuyan Xiong,
Jiaxin Ge,
Renrui Zhang,
Shanghang Zhang
preprint, 2024
abstract /
pdf
Robot manipulation policies have shown unsatisfactory performance when confronted with novel task or object instances. Hence, the capability to automatically detect and self-correct failure actions is essential for a practical robotic system. Recently, Multimodal Large Language Models (MLLMs) have shown promise in visual instruction following and demonstrated strong reasoning abilities in various tasks. To unleash general MLLMs as an end-to-end robotic agent, we introduce a Self-Corrected (SC)-MLLM, equipping our model not only to predict end-effector poses but also to autonomously recognize and correct failure actions. Specifically, we first conduct parameter-efficient fine-tuning to empower the MLLM with pose prediction ability, which is reframed as a language modeling problem. When facing execution failures, our model learns to identify low-level action error causes (i.e., position and rotation errors) and adaptively seeks prompt feedback from experts. Based on the feedback, SC-MLLM rethinks the current failure scene and generates corrected actions. Furthermore, we design a continuous policy learning method for successfully corrected samples, enhancing the model's adaptability to the current scene configuration and reducing the frequency of expert intervention. To evaluate SC-MLLM, we conduct extensive experiments in both simulation and real-world settings. The SC-MLLM agent significantly improves manipulation accuracy compared to the previous state-of-the-art robotic MLLM (ManipLLM), from 57% to 79% on seen object categories and from 47% to 69% on unseen novel categories.
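A schematic of the predict-detect-correct loop described above (all components here are toy placeholders standing in for the MLLM-based pose predictor, failure diagnosis, and expert feedback):

```python
# Schematic closed loop: predict a pose, execute, diagnose the failure cause,
# apply (stubbed) expert feedback, and retry with a corrected pose.

def predict_pose(scene):
    """Stand-in for the fine-tuned MLLM's end-effector pose prediction."""
    return {"xyz": (0.30, 0.10, 0.25), "rot": 90.0}

def execute(scene, pose):
    """Stand-in for executing the action; succeeds only near the target pose."""
    target = scene["target_pose"]
    pos_err = sum(abs(a - b) for a, b in zip(pose["xyz"], target["xyz"]))
    rot_err = abs(pose["rot"] - target["rot"])
    return pos_err < 0.04 and rot_err < 10.0

def diagnose_and_correct(scene, pose):
    """Stand-in for identifying position/rotation error causes and applying
    expert feedback to produce a corrected pose."""
    target = scene["target_pose"]
    return {"xyz": target["xyz"], "rot": target["rot"]}

scene = {"target_pose": {"xyz": (0.32, 0.11, 0.27), "rot": 85.0}}
pose = predict_pose(scene)
for attempt in range(1, 4):
    if execute(scene, pose):
        print(f"success on attempt {attempt}")
        break
    pose = diagnose_and_correct(scene, pose)
```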
|
|
Natural Language Embedded Programs for Hybrid Language Symbolic Reasoning
Tianhua Zhang*,
Jiaxin Ge*,
Hongyin Luo*,
Yung-Sung Chuang,
Mingye Gao,
Yuan Gong,
Xixin Wu,
Yoon Kim,
Helen Meng,
Jim Glass
NAACL, 2024
abstract /
pdf /
code
How can we perform computations over
natural language representations to solve tasks that require symbolic and numeric reasoning? We
propose natural language embedded programs (NLEP) as a unifying framework for addressing
math/symbolic reasoning, natural language understanding, and instruction following tasks. Our
approach prompts a language model to generate full Python programs that define functions over data
structures which contain natural language representations of structured knowledge. A Python
interpreter then executes the generated code and prints the output. Despite using a task-general
prompt, we find that this approach can improve upon strong baselines across a range of different
tasks including math and symbolic reasoning, text classification, question answering, and
instruction following. We find that the generated programs are interpretable, since they outline the exact reasoning process followed by the program interpreter.
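To make this concrete, here is the kind of program an NLEP-style prompt might elicit (a hypothetical example rather than actual output from the paper's prompts): structured knowledge is stored as natural language inside an ordinary Python data structure, a function computes over it, and the interpreter prints the result.

```python
# Hypothetical "natural language embedded program": knowledge is kept as
# natural-language fields in a plain data structure, and a small function
# performs the symbolic/numeric step before printing the answer.

presidents = [
    {"name": "George Washington", "took_office": 1789, "fact": "first U.S. president"},
    {"name": "Abraham Lincoln", "took_office": 1861, "fact": "led the U.S. through the Civil War"},
    {"name": "Theodore Roosevelt", "took_office": 1901, "fact": "youngest person to become president"},
]

def presidents_before(year):
    """Return the names of presidents who took office strictly before `year`."""
    return [p["name"] for p in presidents if p["took_office"] < year]

answer = presidents_before(1900)
print(f"Presidents in this list who took office before 1900: {', '.join(answer)}")
```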
|
|
Lyfe Agents: Generative agents for low-cost real-time social interactions
Zhao Kaiya,
Michelangelo Naim,
Jovana Kondic,
Manuel Cortes,
Jiaxin Ge,
Shuying Luo,
Guangyu Robert Yang,
Andrew Ahn
preprint, 2023
abstract /
pdf
Highly autonomous generative agents
powered by large language models promise to simulate intricate social behaviors in virtual
societies. However, achieving real-time interactions with humans at a low computational cost
remains challenging. Here, we introduce Lyfe Agents, which combine low cost with real-time responsiveness while remaining intelligent and goal-oriented. Key innovations include: (1) an
option-action framework, reducing the cost of high-level decisions; (2) asynchronous
self-monitoring for better self-consistency; and (3) a Summarize-and-Forget memory mechanism,
prioritizing critical memory items at a low cost. We evaluate Lyfe Agents' self-motivation and
sociability across several multi-agent scenarios in our custom LyfeGame 3D virtual environment
platform. When equipped with our brain-inspired techniques, Lyfe Agents can exhibit human-like
self-motivated social reasoning. For example, the agents can solve a crime (a murder mystery)
through autonomous collaboration and information exchange. Meanwhile, our techniques enabled Lyfe
Agents to operate at a computational cost 10-100 times lower than existing alternatives. Our
findings underscore the transformative potential of autonomous generative agents to enrich human
social experiences in virtual worlds.
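As a rough illustration of the Summarize-and-Forget idea alone (the option-action framework and self-monitoring are omitted; the capacity and the trivial string-join summarizer are invented for this sketch), a memory might keep recent items verbatim and fold older ones into a compressed summary:

```python
# Toy Summarize-and-Forget memory: once the buffer exceeds a capacity,
# older items are folded into a running summary and forgotten.
# The "summarizer" here is a trivial join; an agent would use an LLM instead.

class SummarizeAndForgetMemory:
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.recent = []    # verbatim recent memory items
        self.summary = ""   # compressed record of forgotten items

    def add(self, item):
        self.recent.append(item)
        if len(self.recent) > self.capacity:
            forgotten = self.recent[: -self.capacity]
            self.recent = self.recent[-self.capacity:]
            # Placeholder summarization: a real agent would ask an LLM
            # to merge `forgotten` into `self.summary`.
            self.summary = (self.summary + " " + "; ".join(forgotten)).strip()

    def context(self):
        return {"summary": self.summary, "recent": list(self.recent)}

memory = SummarizeAndForgetMemory(capacity=3)
for event in ["met Ada at the cafe", "Ada mentioned a missing key",
              "searched the library", "found a locked drawer", "asked Bo for help"]:
    memory.add(event)
print(memory.context())
```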
|
|
From Wrong To Right: A Recursive Approach Towards Vision-Language
Explanation
Jiaxin Ge,
Sanjay Subramanian,
Trevor Darrell,
Boyi Li
EMNLP, 2023
abstract /
pdf /
code
Addressing the challenge of adapting
pre-trained vision-language models for generating insightful explanations for visual reasoning
tasks with limited annotations, we present ReVisE: a Recursive Visual Explanation algorithm. Our method
iteratively computes visual features (conditioned on the text input), an answer, and an
explanation, to improve the explanation quality step by step until the answer converges. We find
that this multi-step approach guides the model to correct its own answers and outperforms
single-step explanation generation. Furthermore, explanations generated by ReVisE also serve as
valuable annotations for few-shot self-training. Our approach outperforms previous methods while
utilizing merely 5% of the human-annotated explanations across 10 metrics, demonstrating up to a
4.2 and 1.3 increase in BLEU-1 score on the VCR and VQA-X datasets, underscoring the efficacy and
data-efficiency of our method.
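A schematic of the iterate-until-the-answer-converges loop (the feature encoder and the answer/explanation generator below are placeholder stubs, not the ReVisE models):

```python
# Schematic recursive explanation loop: recompute visual features conditioned on
# the latest explanation, regenerate answer + explanation, stop once the answer
# stops changing. All model calls are stand-ins for the actual components.

def encode_visual(image, text):
    """Stand-in for text-conditioned visual feature extraction."""
    return f"features({image}, conditioned_on={text!r})"

def generate_answer_and_explanation(features, question, step):
    """Stand-in for the vision-language model; converges after a couple of steps."""
    answer = "a red kite" if step >= 1 else "a balloon"
    explanation = f"At step {step}, the object near the tree looks like {answer}."
    return answer, explanation

def revise(image, question, max_steps=5):
    answer, explanation = None, question
    for step in range(max_steps):
        features = encode_visual(image, explanation)
        new_answer, explanation = generate_answer_and_explanation(features, question, step)
        if new_answer == answer:   # answer converged; stop iterating
            break
        answer = new_answer
    return answer, explanation

print(revise("photo_001.jpg", "What is flying near the tree?"))
```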
|
|
Entailment as Robust Self-Learner
Jiaxin Ge*,
Hongyin Luo*,
Yoon Kim,
Jim Glass
ACL, 2023
abstract /
pdf /
code
Entailment has been recognized as an
important metric for evaluating natural language understanding (NLU) models, and recent studies
have found that entailment pretraining benefits weakly supervised fine-tuning. In this work, we
design a prompting strategy that formulates a number of different NLU tasks as contextual
entailment. This approach improves the zero-shot adaptation of pretrained entailment models.
Secondly, we notice that self-training entailment-based models with unlabeled data can
significantly improve the adaptation performance on downstream tasks. To achieve more stable
improvement, we propose the Simple Pseudo-Label Editing (SimPLE) algorithm for better
pseudo-labeling quality in self-training. We also find that both pretrained entailment-based
models and the self-trained models are robust against adversarial evaluation data. Experiments on
binary and multi-class classification tasks show that SimPLE leads to more robust self-training
results, indicating that the self-trained entailment models are more efficient and trustworthy
than large language models on language understanding tasks.
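As a bare-bones illustration of improving pseudo-label quality in self-training (a simplification: the agreement-based screening, dummy model, and threshold below are invented for this sketch, and the actual SimPLE algorithm edits uncertain pseudo-labels rather than merely dropping them):

```python
# Toy self-training step with pseudo-label screening:
# label unlabeled examples with several ensembled passes, keep an example
# only when the votes agree strongly, then add it to the training set.

from collections import Counter

def predict_with_ensemble(model, example, n_views=5):
    """Stand-in for predictions under dropout/augmentation: n label votes."""
    return [model(example, seed) for seed in range(n_views)]

def screen_pseudo_labels(model, unlabeled, agreement=0.9):
    accepted = []
    for example in unlabeled:
        votes = predict_with_ensemble(model, example)
        label, count = Counter(votes).most_common(1)[0]
        if count / len(votes) >= agreement:   # keep confident, consistent labels
            accepted.append((example, label))
    return accepted

def dummy_model(example, seed):
    """Toy classifier with a little injected disagreement across views."""
    label = "entail" if "so" in example else "not_entail"
    if seed == 3 and len(example) % 2:        # occasionally flip the label
        return "not_entail" if label == "entail" else "entail"
    return label

unlabeled = ["it rained, so the grass is wet", "the cat sat quietly"]
print(screen_pseudo_labels(dummy_model, unlabeled))
# only the consistently-labeled example survives the agreement threshold
```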
|
|