Active Visual Object Search

Mutli-view object representations are acquired from objects held in the five-fingered hand. Hence, hand-object and background-object segmentation has to be performed.
Segmentation of 20 object views using sensor fusion techniques and occupancy grid mapping.
Ego-centric scene memory content after a successful active object search task. After several saccadic eye movements, the hypotheses lead to one identified object instance.
Aspect Graph based multi-view representation of an object.

In order to perceive and act in human-centered environments, the humanoid robot ARMAR-III needs the ability to adapt to inherent changes in such environments. In particular, the robot is confronted with different objects every day. In order to allow the robot for adapting to previously unobserved objects, we take an integrated view on the acquisition of object representations and the application of such representations in recognition tasks. Both, acquisition and recognition should be achieved online. For our research, the ability of ARMAR-III to use its motor capabilities in order to solve ill-posed problems in vision is exploited.

Object representations are generated autonomously from an object held in the five-fingered hand. Different salient views of the held object are explored by exploiting the redundancy of the arm. The object is segmented from the hand and background based on Bayesian filtering and fusion techniques. In order to reduce the number of acquired views, we investigate solutions for determining the next best view during exploration.

The resulting object views are accumulated in an aspect-graph based representation. Prototypical views for each feature channel are generated based on similarities between views in order to achieve a compact representation. These multi-view representations are used in active visual search tasks. The humanoid robot ARMAR-III is equipped with one foveal and one perspective stereo camera pair. Processing in the perspective images generates hypotheses of locations of object instances in the scene. The foveal images are used to verify the generated hypotheses. Therefore, a top-down attention scheme is proposed, which directs the gaze of the foveal camera based on spherical winner-takes-all networks. The information collected during the saccadic eye movement are accumulated within an ego-centric scene memory which guarantees consistency and persistency between the real world and the internal model of the scene.

Implementing the complete chain from exploration of unknown objects to recognition in a visual search tasks supports the incremental generation of world-knowledge in terms of visual object representations on the system. In order to apply such representations in manipulation tasks, a strong connection to haptic representations is necessary. We focus on this issue in the Multimodal Exploration section.