Description
This paper might have scratched the surface of future VisionAgent models. Maybe Rabbit Inc. will be able to do 1% of what they have promised :p
The Problem
LLMs already deal with text quite efficiently, and with current developments their vision capabilities are quite impressive too. But if you want an AI agent that can book a flight for you or run some errands, the interaction with the outside world is currently the limiting factor.
In the paper, the authors fine-tune a vision LLM (VLM) that takes a visual input and a text prompt and outputs sequential action items. Notably, they introduce a framework that prompts the VLM to generate chain-of-thought (CoT) reasoning before outputting an action. This CoT reasoning is designed to enable efficient exploration of intermediate steps leading to the final action.
Notes
Basically, at a very high level, this is how their system works -
Please note that the VLM is actually fine-tuned with RL.
In a couple of previous works, LLM/VLM + RL has worked well for text-based tasks. But in this work, there are two important additional elements:
- Visual inputs
- Chain of Thought (CoT) reasoning
The authors use LLaVA-NeXT (a 7B model) as the VLM backbone. In the experiments (which we will discuss below) they found that chain-of-thought reasoning is essential for RL training, and the model outperformed GPT-4V and Gemini on several datasets.
The VLM also adapts to rewards from environment interaction, which is important for an AI agent because it cannot be trained on one fixed, task-specific dataset.
The authors also refer to the work of Wei et al., "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models", which shows that better prompting elicits better chain-of-thought reasoning in large language models.
Training VLM with RL
It is quite challenging for two reasons:
- open-ended text outputs
- estimating action probabilities
Their framework uses a task-specific prompt to generate formatted output including CoT reasoning, followed by a post-processing function to parse the text into executable actions.
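To make that concrete, here is a minimal sketch of what one rollout of such an agent could look like. The names env, vlm, build_prompt, and parse_action are placeholders I am assuming for illustration, not the paper's actual interfaces; the collected log-probabilities and rewards would then feed a PPO-style policy-gradient update of the VLM.

```python
def collect_rollout(env, vlm, build_prompt, parse_action):
    """One episode of VLM-agent interaction (illustrative sketch, not the paper's code).

    Returns a trajectory of (output text, token log-probs, action, reward) tuples
    that a PPO-style update of the VLM could consume.
    """
    trajectory = []
    obs_image, done = env.reset(), False
    while not done:
        prompt = build_prompt(obs_image)                      # task-specific prompt
        text, token_logps = vlm.generate(obs_image, prompt)   # CoT + action, as open-ended text
        action = parse_action(text)                           # map text to a legal action
        obs_image, reward, done = env.step(action)            # environment returns a scalar reward
        trajectory.append((text, token_logps, action, reward))
    return trajectory
```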
If we look in a bit more detail, this is how their complete pipeline looks:
Prompting is a big chunk of this: the quality of the prompt determines how well the model can be reinforced to perform better.
Task-specific input prompt => well-formatted output (from the model)
The paper shows an example input prompt and the corresponding output format. From it, you can see what kind of prompt we are looking for: detailed and comprehensive, breaking a complex task down into sub-parts, possibly with examples, all of which contribute to a good prompt.
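As a rough illustration (my own wording and a toy card-game task, not the authors' exact prompt), such a task-specific prompt might look like this:

```python
# Hypothetical task prompt in the spirit of the paper; the wording, task, and action
# space here are illustrative assumptions, not copied from the authors.
TASK_PROMPT = """You are playing a simple card game. The image shows your current hand.
Goal: choose the next move that maximizes your chance of winning.
Legal actions: "stand", "hit".

First reason step by step about what you see and the rules of the game,
then pick exactly one legal action. Respond strictly in this JSON format:
{"thoughts": "<your step-by-step reasoning>", "action": "<one legal action>"}
"""
```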
After you get the output, a post-processing function maps the open-ended text to legal action items, which are then used to execute the task in the environment.
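Here is a minimal sketch of such a post-processing function, assuming the JSON-style output format from the prompt above (the function and field names are mine, not the paper's):

```python
import json
import re

LEGAL_ACTIONS = {"stand", "hit"}  # hypothetical action space for illustration

def parse_action(model_output: str):
    """Map the model's open-ended text to a legal action, or None if nothing parses."""
    # First try the JSON object the prompt asked for.
    match = re.search(r"\{.*\}", model_output, re.DOTALL)
    if match:
        try:
            parsed = json.loads(match.group(0))
        except json.JSONDecodeError:
            parsed = {}
        if isinstance(parsed, dict):
            action = str(parsed.get("action", "")).strip().lower()
            if action in LEGAL_ACTIONS:
                return action
    # Fall back to scanning the raw text for a legal action keyword.
    for action in LEGAL_ACTIONS:
        if action in model_output.lower():
            return action
    return None  # illegal or unparseable output; the environment can penalize this
```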
Estimating the probabilities of action items is also quite important, because we want to understand the model's confidence distribution over actions for a given problem.
For the naive calculation, they consider a straightforward approach: compute the log probability of an action by summing the log-likelihoods of all tokens in the output. The problem with this naive approach is that the output v^out contains both the chain-of-thought reasoning (v^tht) and the actual action (v^act), and the CoT reasoning tokens are typically far more numerous than the action tokens.
As the token counts in the paper's comparison show, the CoT part is far longer than the action part. Due to this imbalance, the log probability would be largely determined by the CoT tokens rather than the action tokens, which is undesirable since only the action tokens are used for decision-making. (For instance, if the CoT part runs to dozens of tokens while the action is only a couple of tokens, the naive sum is dominated almost entirely by the CoT terms.)
To address this issue, they introduce a scaling factor λ ∈ [0, 1] to adjust the contribution of the CoT tokens.
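In other words (my notation, following the paper's description), the scaled action log-probability is the sum of the action-token log-likelihoods plus λ times the sum of the CoT-token log-likelihoods; λ = 1 recovers the naive sum and λ = 0 ignores the CoT entirely. A minimal sketch:

```python
def scaled_action_log_prob(cot_token_logps, action_token_logps, lam=0.5):
    """Down-weight the (much longer) CoT part when scoring an action.

    lam = 1.0 gives the naive sum over all output tokens; lam = 0.0 drops the CoT terms.
    """
    return lam * sum(cot_token_logps) + sum(action_token_logps)
```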
After experiments, the authors suggest a tuned λ ∈ [0.2, 0.5].
Experimental Results:
Their method improved average performance over the initial LLaVA-sft model by 27.1% on arithmetic tasks and 4.0% on visual semantic decision-making tasks.
It outperformed GPT-4V and Gemini on most tasks, despite using a much smaller 7B-parameter model.
Ablation studies showed that removing CoT reasoning significantly decreased performance.
Conclusions, Limitations, and Future Directions:
- The paper demonstrates that RL can effectively fine-tune VLMs for decision-making tasks, with CoT reasoning playing a crucial role.
- Limitations include not extensively exploring different prompting techniques and only improving performance on individual tasks rather than multiple tasks simultaneously.
- Future work could involve extending the method to improve multiple tasks at once and exploring different prompting strategies.