Why LLM+low-level control might fail to scale
A current trend in combining LLMs (LVMs) with robotics is to first use an LLM to decompose a high-level task into several subtasks, given the instruction and a scene image. For example, given the instruction “Put all toys into the basket.”, an LLM might correctly output the intermediate, grounded subtasks as
- “Put the car into the basket.”
- “Put the ball into the basket.”
- “Put the stuffed animal into the basket.”
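To make this decomposition step concrete, here is a minimal sketch in Python. The prompt template and the `call_llm` hook are placeholders I am assuming for illustration, not any particular API:

```python
# Minimal sketch of the high-level decomposition step. `call_llm` stands in for
# whatever chat/completions API is used; the prompt format is illustrative only.

DECOMPOSE_PROMPT = """You are a robot task planner.
Scene objects: {objects}
Instruction: {instruction}
List the subtasks, one per line."""

def decompose(instruction: str, objects: list[str], call_llm) -> list[str]:
    """Ask the LLM to break a high-level instruction into grounded subtasks."""
    prompt = DECOMPOSE_PROMPT.format(objects=", ".join(objects), instruction=instruction)
    reply = call_llm(prompt)  # e.g. "Put the car into the basket.\n..."
    return [line.strip() for line in reply.splitlines() if line.strip()]

# Example usage (with some call_llm implementation):
# decompose("Put all toys into the basket.", ["car", "ball", "stuffed animal"], call_llm)
```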
Sometimes it is even simpler: instead of being represented by an instruction, each subtask is given in some predefined structured representation, for example a bounding box indicating the target object and a point indicating the target location. For the example above, that would be
- Box(car), TargetLoc(basket)
- Box(ball), TargetLoc(basket)
- Box(animal), TargetLoc(basket)
If we restrict our use case to PickAndDrop on a clean tabletop, this box-point subtask representation might be good enough. All we need to do is train a low-level policy that, given the current observation and a box-point pair, finishes the pick-and-drop subtask, that is, moves the object highlighted by the bounding box and places it at the target location.
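As a sketch of what this interface could look like (the exact fields and action space are assumptions on my part, not taken from any particular system):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class BoxPointGoal:
    """One PickAndDrop subtask: a box around the object and a target drop point."""
    box: tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels
    target: tuple[float, float]             # (x, y) drop location in pixels

class PickAndDropPolicy:
    """Hypothetical low-level policy conditioned on the current image and a BoxPointGoal."""

    def act(self, rgb: np.ndarray, goal: BoxPointGoal) -> np.ndarray:
        """Return one action (e.g. a 7-DoF end-effector command) per control step."""
        raise NotImplementedError  # trained separately, e.g. by behavior cloning
```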
There are several downsides to this Box-Point representation, however:
- How well does this LLM+box-point policy system generalize to more scenarios? Probably not well. One can immediately see that the predefined intermediate representation, “Box-Point”, is the bottleneck here. What if we later want to pour water from one cup into another? The “Box-Point” representation is not well aligned with that task.
- Another caveat of this system is that there must be some “judge” that determines when the low-level policy has finished a subtask, in order to initiate the next one. Usually this judge can be hardcoded: for example, if a program finds that the target location falls inside the object’s bounding box, the current subtask is considered finished (see the sketch after this list). But the same rule does not easily generalize to the water-pouring task, so a new judge would need to be defined. Alternatively, a different LLM might serve as the judge, but it would have to be called frequently, with no accuracy guarantee.
- Finally, if the object is not currently in view, we probably need to tell the low-level policy to turn around and look for it. But how do we represent this turn-around subtask? With yet another new representation?
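To illustrate how narrow such a hardcoded judge is, here is a minimal sketch of the box-contains-point rule mentioned above (the field layout is an assumption); nothing in this geometric check transfers to a task like pouring water:

```python
def pick_and_drop_done(box: tuple[float, float, float, float],
                       target: tuple[float, float]) -> bool:
    """Hardcoded 'judge' for a Box-Point subtask: declare success once the tracked
    object box contains the target drop location. Purely illustrative; a real
    system would also check gripper state, object height, etc."""
    x_min, y_min, x_max, y_max = box
    tx, ty = target
    return x_min <= tx <= x_max and y_min <= ty <= y_max
```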
Goal representation
The output of the high-level LLM is really a goal representation: a goal dictates what the robot should do in each subtask. In the example above, the representation is either an instruction or a Box-Point pair.
A goal can be designed to be either general or specific. For example, an instruction as the goal is probably the most general form. In contrast, a Box-Point pair is very specific to PickAndDrop subtasks. We could define other goal representations as well, for example segmentation masks, optical flow, and so on.
(Figure: Vecerik et al. 2023)
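To make these alternative goal representations concrete, here is a sketch that gathers a few of them under one type; the field layouts are assumptions for illustration:

```python
from dataclasses import dataclass
from typing import Union
import numpy as np

@dataclass
class InstructionGoal:
    text: str  # most general form, e.g. "Put the car into the basket."

@dataclass
class MaskGoal:
    mask: np.ndarray  # (H, W) bool segmentation of the object to manipulate

@dataclass
class FlowGoal:
    flow: np.ndarray  # (H, W, 2) desired per-pixel displacement (optical flow)

# The BoxPointGoal sketched earlier would be another member of this union.
Goal = Union[InstructionGoal, MaskGoal, FlowGoal]
```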
Although a more general goal seems more scalable, it of course requires more training data for the low-level policy!
Why an instruction as the goal might also not be promising
So if we decide to use instructions as the general goal representation, then to build a universal manipulation robot we probably need to collect a large-scale dataset of instruction-labeled, short-horizon robotic data to train the low-level policy.
But which instructions qualify as goals? Or what exactly counts as a short-horizon subtask? Is “Put the toy into the basket.” short-horizon enough, or should we further decompose it into “Pick up the toy”, “Move the toy to the basket”, and “Drop the toy into the basket”? This question has to be answered before we can actually start collecting such a large-scale dataset, and the answer might not be the same for different robots.
In the end, there may always be a granularity mismatch between the LLM’s output space and the dataset’s instruction space.
My two cents
Given the reasoning above, I personally don’t think an off-the-shelf pretrained LLM (LVM) can output a goal representation that is simultaneously:
- General enough, and
- Friendly enough for training a general low-level policy,
no matter how hard one prompts the LLM.
Future: A Large Goal Model (LGM) is needed
My personal belief is that
1. we need to carefully think about an alternative (latent) goal representation that satisfies our needs, and
2. the goal representation and generation can be achieved by training a Large Goal Model (LGM).
For 1), the goal representation will probably turn out to be latent, which means it is basically infeasible to prompt existing LLMs to output such goals.
For 2), an LGM can be trained on large-scale internet language-video data, since it doesn’t require annotated robot actions. With proper definitions of the goal and subtask inputs, the low-level policy can hopefully also be trained scalably, either with in-domain collected data or with sim2real techniques.
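To pin down the interfaces I have in mind, here is a rough sketch; every name, shape, and dimension below is an assumption for illustration, not a trained system:

```python
import numpy as np

class LargeGoalModel:
    """Hypothetical LGM: maps the high-level instruction and the current observation
    to a latent goal vector for the next subtask. In principle trainable on
    internet-scale language-video data, since no robot actions are required."""

    def next_goal(self, instruction: str, rgb: np.ndarray) -> np.ndarray:
        # Placeholder: a real model would encode language and pixels here.
        return np.zeros(256)  # assumed latent goal dimension, for illustration only

class LatentGoalPolicy:
    """Low-level policy conditioned on the latent goal instead of text or boxes."""

    def act(self, rgb: np.ndarray, latent_goal: np.ndarray) -> np.ndarray:
        # Placeholder: e.g. a 7-DoF end-effector command per control step.
        return np.zeros(7)
```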
I do have some ideas in mind for training an LGM+low-level policy system, but they will probably be the topic of a future post.