An everyday robot moving around the office.

Lost in an unfamiliar office building, department store or warehouse? Just ask the nearest robot for directions.

In new research published Wednesday, a team of Google researchers combined natural language processing and computer vision to develop new robotic navigation tools.

Essentially, the team set out to teach a robot, in this case an Everyday Robots machine, to navigate an indoor space using natural language cues and visual data. Previously, robotic navigation required researchers not only to map out the environment in advance, but also to provide specific physical coordinates to control the machine. Recent advances in so-called vision-language navigation have allowed users to simply give robots commands in natural language, such as “go to the workstation.” Google researchers are taking this concept a step further by introducing multimodal capabilities, so that the robot can interpret natural language instructions and images at the same time.

[Embedded Instagram video: a post shared by Google DeepMind (@googledeepmind)]

For example, a user in a warehouse might show the robot an item and ask, “Which shelf is this on?” Using the capabilities of Gemini 1.5 Pro, the AI interprets both the spoken question and the visual information to formulate not only an answer, but also a navigation path that will take the user to the right place in the warehouse. The robots were also tested with commands such as “Take me to the conference room with the double doors,” “Where can I borrow hand sanitizer?” and “I want to hide something from the public. Where should I go?”
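To give a rough sense of what such a multimodal query could look like in code, here is a minimal sketch using Google’s publicly available google-generativeai Python SDK. The prompt wording, the “GOAL:” convention, and the file names are our own assumptions for illustration, not part of DeepMind’s robot stack.

```python
# Minimal sketch of a multimodal "answer + navigation target" query.
# Assumes the google-generativeai SDK; prompt format and filenames are hypothetical.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder credential
model = genai.GenerativeModel("gemini-1.5-pro")

item_photo = Image.open("item.jpg")  # photo of the object the user is holding up
prompt = (
    "You are a warehouse assistant robot. The user shows you this item and asks: "
    "'Which shelf is this on?' Answer the question, then on a new line starting "
    "with 'GOAL:' name the single location the robot should navigate to."
)

response = model.generate_content([item_photo, prompt])
print(response.text)
# A downstream planner would parse the 'GOAL:' line and drive toward that location.
```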

Or, as shown in the Instagram video above, a researcher activates the system with “OK, robot” before asking to be taken to a place where they “can draw.” The robot responds, “Give me a minute. Thinking with Gemini…” before rolling through DeepMind’s 9,000-square-foot office in search of a large whiteboard hanging on the wall.

To be fair, these robots were already familiar with the office’s layout. The team used a technique known as Multimodal Instruction Navigation with demonstration Tours (MINT). To do this, the team first manually guided the robot around the office, pointing out specific areas and features using natural language, although the same effect could be achieved by simply recording a video of the space on a smartphone. The AI then generates a topological graph from that tour, which it uses to match what the robot’s cameras currently see against the “goal frames” of the demonstration video.
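The actual pipeline is more sophisticated, but the core idea of matching live camera views against tour frames in a topological graph can be sketched roughly as follows. The pixel-based embed() stand-in and the similarity heuristic here are illustrative assumptions only.

```python
# Sketch of the "demonstration tour -> topological graph" idea described above.
# Not DeepMind's implementation: a real system would use a learned image encoder.
import numpy as np
import networkx as nx

def embed(image: np.ndarray) -> np.ndarray:
    """Crude stand-in for a learned image encoder: normalized, flattened pixels."""
    v = image.astype(np.float32).ravel()
    return v / (np.linalg.norm(v) + 1e-8)

def build_tour_graph(tour_frames):
    """Connect consecutive tour frames into a topological graph; nodes store embeddings."""
    graph = nx.Graph()
    for i, frame in enumerate(tour_frames):
        graph.add_node(i, embedding=embed(frame))
        if i > 0:
            graph.add_edge(i - 1, i)  # consecutive frames are physically adjacent
    return graph

def localize(graph, camera_image):
    """Return the tour frame ("goal frame") that best matches the current camera view."""
    query = embed(camera_image)
    scores = {
        node: float(np.dot(query, data["embedding"]))
        for node, data in graph.nodes(data=True)
    }
    return max(scores, key=scores.get)
```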

The team then uses a hierarchical Vision-Language-Action (VLA) navigation policy, “which combines environmental understanding and common sense,” to instruct the AI on how to translate user requests into actions.
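Conceptually, that hierarchy splits into a high-level step that asks the VLM to pick a goal frame from the tour for a given request, and a low-level step that plans a route over the topological graph to reach it. The sketch below (building on the graph above) illustrates the split; the function names, prompt format, and use of networkx shortest paths are assumptions, not DeepMind’s implementation.

```python
# Illustrative two-level split: VLM picks the goal frame, graph search plans the route.
import networkx as nx

def high_level_goal(model, tour_frames, user_query, user_image=None) -> int:
    """High level: ask the VLM which tour frame best satisfies the user's request."""
    parts = list(tour_frames)                # demonstration-tour frames (images)
    if user_image is not None:
        parts.append(user_image)             # optional image the user shows the robot
    parts.append(
        "Reply with only the index (0-based) of the tour frame the robot "
        f"should navigate to for this request: {user_query}"
    )
    return int(model.generate_content(parts).text.strip())

def low_level_waypoints(graph: nx.Graph, current_node: int, goal_node: int) -> list:
    """Low level: plan a sequence of waypoints through the topological graph."""
    return nx.shortest_path(graph, source=current_node, target=goal_node)
```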

The results were quite successful: the robots achieved “end-to-end success rates of 86 and 90 percent when solving previously intractable navigation tasks involving complex reasoning and multimodal user instructions in large real-world environments,” the researchers write.

Still, they acknowledge there is room for improvement, noting that the robot cannot (yet) autonomously perform its own demonstration tour, and that the AI’s clunky inference time (how long it takes to formulate a response) of 10 to 30 seconds makes interacting with the system a study in patience.


Source: Digital Trends

