Jeremie Papon

Group(s): Computer Vision
Email: jpapon@gwdg.de

    Aein, M J. and Aksoy, E E. and Tamosiunaite, M. and Papon, J. and Ude, A. and Wörgötter, F. (2013).
    Toward a library of manipulation actions based on Semantic Object-Action Relations. IEEE/RSJ International Conference on Intelligent Robots and Systems. DOI: 10.1109/IROS.2013.6697011.
    BibTeX:
    @inproceedings{aeinaksoytamosiunaite2013,
      author = {Aein, M J. and Aksoy, E E. and Tamosiunaite, M. and Papon, J. and Ude, A. and Wörgötter, F.},
      title = {Toward a library of manipulation actions based on Semantic Object-Action Relations},
      booktitle = {IEEE/RSJ International Conference on Intelligent Robots and Systems},
      year = {2013},
      doi = {10.1109/IROS.2013.6697011},
      abstract = {The goal of this study is to provide an architecture for a generic definition of robot manipulation actions. We emphasize that the representation of actions presented here is procedural. Thus, we will define the structural elements of our action representations as execution protocols. To achieve this, manipulations are defined using three levels. The top level defines objects, their relations and the actions in an abstract and symbolic way. A mid-level sequencer, with which the action primitives are chained, is used to structure the actual action execution, which is performed via the bottom level. This (lowest) level collects data from sensors and communicates with the control system of the robot. This method enables robot manipulators to execute the same action in different situations, i.e. on different objects with different positions and orientations. In addition, two methods of detecting action failure are provided, which are necessary to handle faults in the system. To demonstrate the effectiveness of the proposed framework, several different actions are performed on our robotic setup and results are shown. This way we are creating a library of human-like robot actions, which can be used by higher-level task planners to execute more complex tasks.}}
    Abstract: The goal of this study is to provide an architecture for a generic definition of robot manipulation actions. We emphasize that the representation of actions presented here is procedural. Thus, we will define the structural elements of our action representations as execution protocols. To achieve this, manipulations are defined using three levels. The top level defines objects, their relations and the actions in an abstract and symbolic way. A mid-level sequencer, with which the action primitives are chained, is used to structure the actual action execution, which is performed via the bottom level. This (lowest) level collects data from sensors and communicates with the control system of the robot. This method enables robot manipulators to execute the same action in different situations, i.e. on different objects with different positions and orientations. In addition, two methods of detecting action failure are provided, which are necessary to handle faults in the system. To demonstrate the effectiveness of the proposed framework, several different actions are performed on our robotic setup and results are shown. This way we are creating a library of human-like robot actions, which can be used by higher-level task planners to execute more complex tasks.
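    As a rough illustration of the three-level structure described in the abstract (symbolic top level, mid-level sequencer of primitives, bottom level talking to sensors and the robot controller), here is a minimal Python sketch. All class, primitive and action names are invented for the example; the paper's actual interfaces and failure-detection methods are not reproduced.

      # Illustrative sketch of the three-level action structure described above.
      # All class and primitive names are hypothetical, not the paper's interface.

      class RobotInterface:
          """Bottom level: stands in for sensor reading and robot control."""
          def move_to(self, pose):
              print(f"moving gripper to {pose}")
              return True  # a real implementation would report success/failure
          def close_gripper(self):
              print("closing gripper")
              return True
          def open_gripper(self):
              print("opening gripper")
              return True

      def pick(robot, obj_pose):
          """Mid-level primitive: a chain of bottom-level calls with failure checks."""
          return robot.move_to(obj_pose) and robot.close_gripper()

      def place(robot, target_pose):
          return robot.move_to(target_pose) and robot.open_gripper()

      # Top level: a symbolic action from the library is a named sequence of primitives.
      ACTION_LIBRARY = {
          "pick_and_place": [pick, place],
      }

      def execute(action_name, robot, arguments):
          """Sequencer: run the primitives in order, abort on detected failure."""
          for primitive, args in zip(ACTION_LIBRARY[action_name], arguments):
              if not primitive(robot, args):
                  print(f"failure detected in {primitive.__name__}, aborting action")
                  return False
          return True

      if __name__ == "__main__":
          execute("pick_and_place", RobotInterface(), [(0.4, 0.1, 0.05), (0.1, 0.3, 0.05)])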
    Reich, S. and Abramov, A. and Papon, J. and Wörgötter, F. and Dellen, B. (2013).
    A Novel Real-time Edge-Preserving Smoothing Filter. International Conference on Computer Vision Theory and Applications, 5 - 14.
    BibTeX:
    @inproceedings{reichabramovpapon2013,
      author = {Reich, S. and Abramov, A. and Papon, J. and Wörgötter, F. and Dellen, B.},
      title = {A Novel Real-time Edge-Preserving Smoothing Filter},
      pages = {5 - 14},
      booktitle = {International Conference on Computer Vision Theory and Applications},
      year = {2013},
      location = {Barcelona (Spain)},
      month = {February 21-24},
      url = {http://www.visapp.visigrapp.org/Abstracts/2013/VISAPP_2013_Abstracts.htm},
      abstract = {The segmentation of textured and noisy areas in images is a very challenging task due to the large variety of objects and materials in natural environments, which cannot be solved by a single similarity measure. In this paper, we address this problem by proposing a novel edge-preserving texture filter, which smudges the color values inside uniformly textured areas, thus making the processed image more workable for color-based image segmentation. Due to the highly parallel structure of the method, the implementation on a GPU runs in real-time, allowing us to process standard images within tens of milliseconds. By preprocessing images with this novel filter before applying a recent real-time color-based image segmentation method, we obtain significant improvements in performance for images from the Berkeley dataset, outperforming an alternative version using a standard bilateral filter for preprocessing. We further show that our combined approach leads to better segmentations in terms of a standard performance measure than graph-based and mean-shift segmentation for the Berkeley image dataset.}}
    Abstract: The segmentation of textured and noisy areas in images is a very challenging task due to the large variety of objects and materials in natural environments, which cannot be solved by a single similarity measure. In this paper, we address this problem by proposing a novel edge-preserving texture filter, which smudges the color values inside uniformly textured areas, thus making the processed image more workable for color-based image segmentation. Due to the highly parallel structure of the method, the implementation on a GPU runs in real-time, allowing us to process standard images within tens of milliseconds. By preprocessing images with this novel filter before applying a recent real-time color-based image segmentation method, we obtain significant improvements in performance for images from the Berkeley dataset, outperforming an alternative version using a standard bilateral filter for preprocessing. We further show that our combined approach leads to better segmentations in terms of a standard performance measure than graph-based and mean-shift segmentation for the Berkeley image dataset.
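    The abstract does not give enough detail to reproduce the proposed texture filter itself, so the sketch below shows the standard bilateral filter that the paper uses as a preprocessing baseline: each pixel is replaced by an average of its neighbours, weighted by both spatial distance and intensity similarity, which is what makes the smoothing edge-preserving. Parameter names and values are our own, and this naive version is far from real-time.

      import numpy as np

      def bilateral_filter(img, radius=3, sigma_spatial=2.0, sigma_range=0.1):
          """Naive edge-preserving bilateral filter for a float grayscale image in [0, 1].
          Weights combine spatial closeness and intensity similarity, so smoothing
          stops at strong edges. O(N * radius^2); real-time use needs a much faster
          formulation such as the GPU filter proposed in the paper."""
          h, w = img.shape
          pad = np.pad(img, radius, mode="edge")
          out = np.zeros_like(img)
          # precompute the spatial Gaussian kernel once
          ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
          spatial = np.exp(-(xs**2 + ys**2) / (2.0 * sigma_spatial**2))
          for i in range(h):
              for j in range(w):
                  patch = pad[i:i + 2 * radius + 1, j:j + 2 * radius + 1]
                  rng = np.exp(-((patch - img[i, j]) ** 2) / (2.0 * sigma_range**2))
                  weights = spatial * rng
                  out[i, j] = np.sum(weights * patch) / np.sum(weights)
          return out

      if __name__ == "__main__":
          noisy_step = np.concatenate([np.zeros((32, 16)), np.ones((32, 16))], axis=1)
          noisy_step += 0.05 * np.random.randn(*noisy_step.shape)
          smoothed = bilateral_filter(noisy_step)
          print("edge contrast preserved:", smoothed[:, 14].mean(), "->", smoothed[:, 17].mean())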
    Papon, J. and Kulvicius, T. and Aksoy, E E. and Wörgötter, F. (2013).
    Point Cloud Video Object Segmentation using a Persistent Supervoxel World-Model. IEEE/RSJ International Conference on Intelligent Robots and Systems IROS, 3712-3718. DOI: 10.1109/IROS.2013.6696886.
    BibTeX:
    @inproceedings{paponkulviciusaksoy2013,
      author = {Papon, J. and Kulvicius, T. and Aksoy, E E. and Wörgötter, F.},
      title = {Point Cloud Video Object Segmentation using a Persistent Supervoxel World-Model},
      pages = {3712-3718},
      booktitle = {IEEE/RSJ International Conference on Intelligent Robots and Systems IROS},
      year = {2013},
      location = {Tokyo (Japan)},
      month = {November 3-8},
      doi = {10.1109/IROS.2013.6696886},
      abstract = {Robust visual tracking is an essential precursor to understanding and replicating human actions in robotic systems. In order to accurately evaluate the semantic meaning of a sequence of video frames, or to replicate an action contained therein, one must be able to coherently track and segment all observed agents and objects. This work proposes a novel online point cloud based algorithm which simultaneously tracks 6DoF pose and determines spatial extent of all entities in indoor scenarios. This is accomplished using a persistent supervoxel world-model which is updated, rather than replaced, as new frames of data arrive. Maintenance of a world model enables general object permanence, permitting successful tracking through full occlusions. Object models are tracked using a bank of independent adaptive particle filters which use a supervoxel observation model to give rough estimates of object state. These are united using a novel multi-model RANSAC-like approach, which seeks to minimize a global energy function associating world-model supervoxels to predicted states. We present results on a standard robotic assembly benchmark for two application scenarios - human trajectory imitation and semantic action understanding - demonstrating the usefulness of the tracking in intelligent robotic systems.}}
    Abstract: Robust visual tracking is an essential precursor to understanding and replicating human actions in robotic systems. In order to accurately evaluate the semantic meaning of a sequence of video frames, or to replicate an action contained therein, one must be able to coherently track and segment all observed agents and objects. This work proposes a novel online point cloud based algorithm which simultaneously tracks 6DoF pose and determines spatial extent of all entities in indoor scenarios. This is accomplished using a persistent supervoxel world-model which is updated, rather than replaced, as new frames of data arrive. Maintenance of a world model enables general object permanence, permitting successful tracking through full occlusions. Object models are tracked using a bank of independent adaptive particle filters which use a supervoxel observation model to give rough estimates of object state. These are united using a novel multi-model RANSAC-like approach, which seeks to minimize a global energy function associating world-model supervoxels to predicted states. We present results on a standard robotic assembly benchmark for two application scenarios - human trajectory imitation and semantic action understanding - demonstrating the usefulness of the tracking in intelligent robotic systems.
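    The tracker described above relies on a bank of adaptive particle filters with a supervoxel observation model, which cannot be reconstructed from the abstract alone. As a stand-in, the following sketch shows one predict/weight/resample cycle of a generic bootstrap particle filter for a 3-D position with a Gaussian likelihood; the paper's state, motion model and observation model are richer (full 6DoF pose, supervoxel correspondences), and all constants here are invented.

      import numpy as np

      rng = np.random.default_rng(0)

      def particle_filter_step(particles, weights, observation, motion_noise=0.01, obs_noise=0.05):
          """One predict / weight / resample cycle of a bootstrap particle filter.
          Particles are (N, 3) positions; the paper tracks full 6DoF poses with a
          supervoxel observation model, which this toy Gaussian likelihood replaces."""
          n = len(particles)
          # predict: random-walk motion model
          particles = particles + rng.normal(scale=motion_noise, size=particles.shape)
          # weight: Gaussian likelihood of the observation given each particle
          d2 = np.sum((particles - observation) ** 2, axis=1)
          weights = weights * np.exp(-d2 / (2.0 * obs_noise**2))
          weights /= weights.sum()
          # resample when the effective sample size collapses
          if 1.0 / np.sum(weights**2) < n / 2:
              idx = rng.choice(n, size=n, p=weights)
              particles, weights = particles[idx], np.full(n, 1.0 / n)
          return particles, weights

      if __name__ == "__main__":
          particles = rng.normal(size=(500, 3))
          weights = np.full(500, 1.0 / 500)
          true_pos = np.array([0.5, -0.2, 1.0])
          for _ in range(30):
              obs = true_pos + rng.normal(scale=0.05, size=3)
              particles, weights = particle_filter_step(particles, weights, obs)
          print("estimate:", np.average(particles, axis=0, weights=weights))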
    Papon, J. and Abramov, A. and Schoeler, M. and Wörgötter, F. (2013).
    Voxel Cloud Connectivity Segmentation - Supervoxels for Point Clouds. IEEE Conference on Computer Vision and Pattern Recognition CVPR, 2027 - 2034. DOI: 10.1109/CVPR.2013.264.
    BibTeX:
    @inproceedings{paponabramovschoeler2013,
      author = {Papon, J. and Abramov, A. and Schoeler, M. and Wörgötter, F.},
      title = {Voxel Cloud Connectivity Segmentation - Supervoxels for Point Clouds},
      pages = {2027 - 2034},
      booktitle = {IEEE Conference on Computer Vision and Pattern Recognition CVPR},
      year = {2013},
      location = {Portland, OR, USA},
      month = {06},
      doi = {10.1109/CVPR.2013.264},
      abstract = {Unsupervised over-segmentation of an image into regions of perceptually similar pixels, known as superpixels, is a widely used preprocessing step in segmentation algorithms. Superpixel methods reduce the number of regions that must be considered later by more computationally expensive algorithms, with a minimal loss of information. Nevertheless, as some information is inevitably lost, it is vital that superpixels not cross object boundaries, as such errors will propagate through later steps. Existing methods make use of projected color or depth information, but do not consider three dimensional geometric relationships between observed data points which can be used to prevent superpixels from crossing regions of empty space. We propose a novel over-segmentation algorithm which uses voxel relationships to produce over-segmentations which are fully consistent with the spatial geometry of the scene in three dimensional, rather than projective, space. Enforcing the constraint that segmented regions must have spatial connectivity prevents label flow across semantic object boundaries which might otherwise be violated. Additionally, as the algorithm works directly in 3D space, observations from several calibrated RGB+D cameras can be segmented jointly. Experiments on a large data set of human annotated RGB+D images demonstrate a significant reduction in occurrence of clusters crossing object boundaries, while maintaining speeds comparable to state-of-the-art 2D methods.}}
    Abstract: Unsupervised over-segmentation of an image into regions of perceptually similar pixels, known as superpixels, is a widely used preprocessing step in segmentation algorithms. Superpixel methods reduce the number of regions that must be considered later by more computationally expensive algorithms, with a minimal loss of information. Nevertheless, as some information is inevitably lost, it is vital that superpixels not cross object boundaries, as such errors will propagate through later steps. Existing methods make use of projected color or depth information, but do not consider three dimensional geometric relationships between observed data points which can be used to prevent superpixels from crossing regions of empty space. We propose a novel over-segmentation algorithm which uses voxel relationships to produce over-segmentations which are fully consistent with the spatial geometry of the scene in three dimensional, rather than projective, space. Enforcing the constraint that segmented regions must have spatial connectivity prevents label flow across semantic object boundaries which might otherwise be violated. Additionally, as the algorithm works directly in 3D space, observations from several calibrated RGB+D cameras can be segmented jointly. Experiments on a large data set of human annotated RGB+D images demonstrate a significant reduction in occurrence of clusters crossing object boundaries, while maintaining speeds comparable to state-of-the-art 2D methods.
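    A core ingredient of the method is that supervoxels may only grow across occupied, spatially adjacent voxels, which keeps them from leaking across empty space. The sketch below builds exactly that structure, a 26-neighbourhood adjacency graph over the occupied voxels of a point cloud (the resolution value is arbitrary); the seeding and feature-based clustering of the full algorithm, which is available in the Point Cloud Library, are not shown.

      import numpy as np
      from itertools import product

      def voxel_adjacency(points, resolution=0.05):
          """Voxelize a point cloud and connect occupied voxels that touch in their
          26-neighbourhood. Growing labels only along these edges is what keeps
          supervoxels from leaking across empty space."""
          keys = {tuple(k) for k in np.floor(points / resolution).astype(int)}
          offsets = [o for o in product((-1, 0, 1), repeat=3) if o != (0, 0, 0)]
          graph = {k: [] for k in keys}
          for k in keys:
              for o in offsets:
                  nb = (k[0] + o[0], k[1] + o[1], k[2] + o[2])
                  if nb in keys:
                      graph[k].append(nb)
          return graph

      if __name__ == "__main__":
          # two parallel planes separated by empty space end up in separate components
          xy = np.random.rand(2000, 2)
          cloud = np.vstack([np.c_[xy, np.zeros(2000)], np.c_[xy, np.full(2000, 0.5)]])
          graph = voxel_adjacency(cloud)
          seen, components = set(), 0
          for start in graph:
              if start in seen:
                  continue
              components += 1
              stack = [start]
              while stack:
                  v = stack.pop()
                  if v not in seen:
                      seen.add(v)
                      stack.extend(graph[v])
          print("occupied voxels:", len(graph), "connected components:", components)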
    Papon, J. and Abramov, A. and Aksoy, E. and Wörgötter, F. (2012).
    A modular system architecture for online parallel vision pipelines. IEEE Workshop on the Applications of Computer Vision (WACV), 361-368. DOI: 10.1109/WACV.2012.6163002.
    BibTeX:
    @inproceedings{paponabramovaksoy2012,
      author = {Papon, J. and Abramov, A. and Aksoy, E. and Wörgötter, F.},
      title = {A modular system architecture for online parallel vision pipelines},
      pages = {361-368},
      booktitle = {IEEE Workshop on the Applications of Computer Vision (WACV)},
      year = {2012},
      month = {jan},
      doi = {10.1109/WACV.2012.6163002},
      abstract = {We present an architecture for real-time, online vision systems which enables development and use of complex vision pipelines integrating any number of algorithms. Individual algorithms are implemented using modular plugins, allowing integration of independently developed algorithms and rapid testing of new vision pipeline configurations. The architecture exploits the parallelization of graphics processing units (GPUs) and multi-core systems to speed processing and achieve real-time performance. Additionally, the use of a global memory management system for frame buffering permits complex algorithmic flow (e.g. feedback loops) in online processing setups, while maintaining the benefits of threaded asynchronous operation of separate algorithms. To demonstrate the system, a typical real-time system setup is described which incorporates plugins for video and depth acquisition, GPU-based segmentation and optical flow, semantic graph generation, and online visualization of output. Performance numbers are shown which demonstrate the insignificant overhead cost of the architecture as well as speed-up over strictly CPU and single threaded implementations.}}
    Abstract: We present an architecture for real-time, online vision systems which enables development and use of complex vision pipelines integrating any number of algorithms. Individual algorithms are implemented using modular plugins, allowing integration of independently developed algorithms and rapid testing of new vision pipeline configurations. The architecture exploits the parallelization of graphics processing units (GPUs) and multi-core systems to speed processing and achieve real-time performance. Additionally, the use of a global memory management system for frame buffering permits complex algorithmic flow (e.g. feedback loops) in online processing setups, while maintaining the benefits of threaded asynchronous operation of separate algorithms. To demonstrate the system, a typical real-time system setup is described which incorporates plugins for video and depth acquisition, GPU-based segmentation and optical flow, semantic graph generation, and online visualization of output. Performance numbers are shown which demonstrate the insignificant overhead cost of the architecture as well as speed-up over strictly CPU and single threaded implementations.
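    To make the plugin-and-buffer idea concrete, here is a toy version using Python threads and queues: each plugin runs asynchronously, pulls frames from its input buffer and pushes results downstream. The paper's framework is a C++/GPU system with a global memory manager and feedback loops, none of which is modelled here; all names are invented.

      import queue
      import threading

      class Plugin(threading.Thread):
          """Minimal stand-in for a pipeline plugin: pull a frame from the input
          buffer, process it, push the result downstream. The paper's framework does
          this in C++ with GPU kernels and a global memory manager; this only shows
          the asynchronous chaining."""
          def __init__(self, name, func, inbox, outbox):
              super().__init__(daemon=True)
              self.name, self.func, self.inbox, self.outbox = name, func, inbox, outbox
          def run(self):
              while True:
                  frame = self.inbox.get()
                  if frame is None:          # sentinel: shut down and propagate
                      self.outbox.put(None)
                      return
                  self.outbox.put(self.func(frame))

      if __name__ == "__main__":
          q_in, q_mid, q_out = queue.Queue(), queue.Queue(), queue.Queue()
          Plugin("segment", lambda f: f + "->segmented", q_in, q_mid).start()
          Plugin("graph", lambda f: f + "->graph", q_mid, q_out).start()
          for i in range(3):
              q_in.put(f"frame{i}")
          q_in.put(None)
          while (result := q_out.get()) is not None:
              print(result)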
    Papon, J. and Abramov, A. and Wörgötter, F. (2012).
    Occlusion Handling in Video Segmentation via Predictive Feedback. Computer Vision ECCV 2012. Workshops and Demonstrations, 233-242, 7585. DOI: 10.1007/978-3-642-33885-4_24.
    BibTeX:
    @incollection{paponabramovwoergoetter2012,
      author = {Papon, J. and Abramov, A. and Wörgötter, F.},
      title = {Occlusion Handling in Video Segmentation via Predictive Feedback},
      pages = {233-242},
      booktitle = {Computer Vision ECCV 2012. Workshops and Demonstrations},
      year = {2012},
      volume= {7585},
      publisher = {Springer Berlin Heidelberg},
      series = {Lecture Notes in Computer Science},
      doi = {10.1007/978-3-642-33885-4_24},
      abstract = {We present a method for unsupervised on-line dense video segmentation which utilizes sequential Bayesian estimation techniques to resolve partial and full occlusions. Consistent labeling through occlusions is vital for applications which move from low-level object labels to high-level semantic knowledge - tasks such as activity recognition or robot control. The proposed method forms a predictive loop between segmentation and tracking, with tracking predictions used to seed the segmentation kernel, and segmentation results used to update tracked models. All segmented labels are tracked, without the use of a-priori models, using parallel color-histogram particle filters. Predictions are combined into a probabilistic representation of image labels, a realization of which is used to seed segmentation. A simulated annealing relaxation process allows the realization to converge to a minimal energy segmented image. Found segments are subsequently used to repopulate the particle sets, closing the loop. Results on the Cranfield benchmark sequence demonstrate that the prediction mechanism allows on-line segmentation to maintain temporally consistent labels through partial & full occlusions, significant appearance changes, and rapid erratic movements. Additionally, we show that tracking performance matches state-of-the art tracking methods on several challenging benchmark sequences.}}
    Abstract: We present a method for unsupervised on-line dense video segmentation which utilizes sequential Bayesian estimation techniques to resolve partial and full occlusions. Consistent labeling through occlusions is vital for applications which move from low-level object labels to high-level semantic knowledge - tasks such as activity recognition or robot control. The proposed method forms a predictive loop between segmentation and tracking, with tracking predictions used to seed the segmentation kernel, and segmentation results used to update tracked models. All segmented labels are tracked, without the use of a-priori models, using parallel color-histogram particle filters. Predictions are combined into a probabilistic representation of image labels, a realization of which is used to seed segmentation. A simulated annealing relaxation process allows the realization to converge to a minimal energy segmented image. Found segments are subsequently used to repopulate the particle sets, closing the loop. Results on the Cranfield benchmark sequence demonstrate that the prediction mechanism allows on-line segmentation to maintain temporally consistent labels through partial & full occlusions, significant appearance changes, and rapid erratic movements. Additionally, we show that tracking performance matches state-of-the art tracking methods on several challenging benchmark sequences.
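    The tracked labels above are followed by colour-histogram particle filters. The exact similarity measure is not stated in the abstract; a common choice for comparing an observed patch against a tracked model is the Bhattacharyya coefficient between normalised colour histograms, sketched below with invented patch data.

      import numpy as np

      def color_histogram(patch, bins=8):
          """Normalized joint RGB histogram of an (H, W, 3) uint8 patch."""
          idx = (patch // (256 // bins)).reshape(-1, 3)
          flat = idx[:, 0] * bins * bins + idx[:, 1] * bins + idx[:, 2]
          hist = np.bincount(flat, minlength=bins**3).astype(float)
          return hist / hist.sum()

      def bhattacharyya(h1, h2):
          """Similarity in [0, 1]; 1 means identical distributions."""
          return float(np.sum(np.sqrt(h1 * h2)))

      if __name__ == "__main__":
          rng = np.random.default_rng(1)
          red = np.zeros((32, 32, 3), np.uint8)
          red[..., 0] = rng.integers(180, 256, (32, 32))
          red_shifted = red.copy()
          red_shifted[..., 0] = np.clip(red[..., 0].astype(int) - 10, 0, 255).astype(np.uint8)
          blue = np.zeros((32, 32, 3), np.uint8)
          blue[..., 2] = rng.integers(180, 256, (32, 32))
          h_red = color_histogram(red)
          print("red vs shifted red:", round(bhattacharyya(h_red, color_histogram(red_shifted)), 3))
          print("red vs blue       :", round(bhattacharyya(h_red, color_histogram(blue)), 3))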
    Abramov, A. and Papon, J. and Pauwels, K. and Wörgötter, F. and Dellen, B. (2012).
    Depth-supported real-time video segmentation with the Kinect. IEEE Workshop on the Applications of Computer Vision (WACV). DOI: 10.1109/WACV.2012.6163000.
    BibTeX:
    @inproceedings{abramovpaponpauwels2012,
      author = {Abramov, A. and Papon, J. and Pauwels, K. and Wörgötter, F. and Dellen, B.},
      title = {Depth-supported real-time video segmentation with the Kinect},
      booktitle = {IEEE Workshop on the Applications of Computer Vision (WACV)},
      year = {2012},
      doi = {10.1109/WACV.2012.6163000},
      abstract = {We present a real-time technique for the spatiotemporal segmentation of color/depth movies. Images are segmented using a parallel Metropolis algorithm implemented on a GPU utilizing both color and depth information, acquired with the Microsoft Kinect. Segments represent the equilibrium states of a Potts model, where tracking of segments is achieved by warping obtained segment labels to the next frame using real-time optical flow, which reduces the number of iterations required for the Metropolis method to encounter the new equilibrium state. By including depth information into the framework, true objects boundaries can be found more easily, improving also the temporal coherency of the method. The algorithm has been tested for videos of medium resolutions showing human manipulations of objects. The framework provides an inexpensive visual front end for visual preprocessing of videos in industrial settings and robot labs which can potentially be used in various applications.}}
    Abstract: We present a real-time technique for the spatiotemporal segmentation of color/depth movies. Images are segmented using a parallel Metropolis algorithm implemented on a GPU utilizing both color and depth information, acquired with the Microsoft Kinect. Segments represent the equilibrium states of a Potts model, where tracking of segments is achieved by warping obtained segment labels to the next frame using real-time optical flow, which reduces the number of iterations required for the Metropolis method to encounter the new equilibrium state. By including depth information into the framework, true objects boundaries can be found more easily, improving also the temporal coherency of the method. The algorithm has been tested for videos of medium resolutions showing human manipulations of objects. The framework provides an inexpensive visual front end for visual preprocessing of videos in industrial settings and robot labs which can potentially be used in various applications.
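    As a much reduced sketch of the underlying idea, the code below runs Metropolis updates on a Potts-style labelling in which neighbouring pixels are only penalised for disagreeing when their colour and depth are similar, so adding depth strengthens true object boundaries. The coupling constants, temperature and schedule are invented, and the paper's GPU implementation and optical-flow warm start are not reproduced.

      import numpy as np

      rng = np.random.default_rng(0)

      def local_energy(labels, color, depth, y, x, lab, beta=2.0, sigma_c=0.1, sigma_d=0.05):
          """Potts energy contribution of pixel (y, x) if it carried label `lab`:
          neighbours with similar color and depth pay a penalty for disagreeing."""
          h, w = labels.shape
          e = 0.0
          for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
              ny, nx = y + dy, x + dx
              if 0 <= ny < h and 0 <= nx < w and labels[ny, nx] != lab:
                  sim = np.exp(-(color[y, x] - color[ny, nx]) ** 2 / sigma_c**2
                               - (depth[y, x] - depth[ny, nx]) ** 2 / sigma_d**2)
                  e += beta * sim
          return e

      def metropolis_sweep(labels, color, depth, n_labels, temperature=0.5):
          h, w = labels.shape
          for _ in range(h * w):
              y, x = rng.integers(h), rng.integers(w)
              proposal = rng.integers(n_labels)
              delta = (local_energy(labels, color, depth, y, x, proposal)
                       - local_energy(labels, color, depth, y, x, labels[y, x]))
              if delta <= 0 or rng.random() < np.exp(-delta / temperature):
                  labels[y, x] = proposal
          return labels

      if __name__ == "__main__":
          # synthetic scene: left/right halves differ in both color and depth
          color = np.hstack([np.zeros((24, 12)), np.ones((24, 12))]) + 0.02 * rng.standard_normal((24, 24))
          depth = np.hstack([np.full((24, 12), 1.0), np.full((24, 12), 1.5)])
          labels = rng.integers(0, 2, (24, 24))
          for _ in range(40):
              labels = metropolis_sweep(labels, color, depth, n_labels=2)
          print("label agreement inside left half :", (labels[:, :12] == labels[0, 0]).mean())
          print("label agreement inside right half:", (labels[:, 12:] == labels[0, 23]).mean())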
    Abramov, A. and Papon, J. and Pauwels, K. and Wörgötter, F. and Dellen, B. (2012).
    Real-time Segmentation of Stereo Videos on a Resource-limited System with a Mobile GPU. IEEE Transactions on Circuits and Systems for Video Technology, 1292 - 1305, 22, 9. DOI: 10.1109/TCSVT.2012.2199389.
    BibTeX:
    @article{abramovpaponpauwels2012a,
      author = {Abramov, A. and Papon, J. and Pauwels, K. and Wörgötter, F. and Dellen, B.},
      title = {Real-time Segmentation of Stereo Videos on a Resource-limited System with a Mobile GPU},
      pages = {1292 - 1305},
      journal = {IEEE Transactions on Circuits and Systems for Video Technology},
      year = {2012},
      volume= {22},
      number = {9},
      month = {09},
      doi = {10.1109/TCSVT.2012.2199389},
      abstract = {In mobile robotic applications, visual information needs to be processed fast despite resource limitations of the mobile system. Here, a novel real-time framework for model-free spatiotemporal segmentation of stereo videos is presented. It combines real-time optical flow and stereo with image segmentation and runs on a portable system with an integrated mobile graphics processing unit. The system performs online, automatic, and dense segmentation of stereo videos and serves as a visual front end for preprocessing in mobile robots, providing a condensed representation of the scene that can potentially be utilized in various applications, e.g., object manipulation, manipulation recognition, visual servoing. The method was tested on real-world sequences with arbitrary motions, including videos acquired with a moving camera.}}
    Abstract: In mobile robotic applications, visual information needs to be processed fast despite resource limitations of the mobile system. Here, a novel real-time framework for model-free spatiotemporal segmentation of stereo videos is presented. It combines real-time optical flow and stereo with image segmentation and runs on a portable system with an integrated mobile graphics processing unit. The system performs online, automatic, and dense segmentation of stereo videos and serves as a visual front end for preprocessing in mobile robots, providing a condensed representation of the scene that can potentially be utilized in various applications, e.g., object manipulation, manipulation recognition, visual servoing. The method was tested on real-world sequences with arbitrary motions, including videos acquired with a moving camera.
    Stein, S. and Schoeler, M. and Papon, J. and Wörgötter, F. (2014).
    Object Partitioning using Local Convexity. Conference on Computer Vision and Pattern Recognition CVPR, 304-311. DOI: 10.1109/CVPR.2014.46.
    BibTeX:
    @inproceedings{steinschoelerpapon2014,
      author = {Stein, S. and Schoeler, M. and Papon, J. and Wörgötter, F.},
      title = {Object Partitioning using Local Convexity},
      pages = {304-311},
      booktitle = {Conference on Computer Vision and Pattern Recognition CVPR},
      year = {2014},
      location = {Columbus, OH, USA},
      month = {06},
      doi = {10.1109/CVPR.2014.46},
      abstract = {The problem of how to arrive at an appropriate 3D-segmentation of a scene remains difficult. While current state-of-the-art methods continue to gradually improve in benchmark performance, they also grow more and more complex, for example by incorporating chains of classifiers, which require training on large manually annotated datasets. As an alternative to this, we present a new, efficient learning- and model-free approach for the segmentation of 3D point clouds into object parts. The algorithm begins by decomposing the scene into an adjacency-graph of surface patches based on a voxel grid. Edges in the graph are then classified as either convex or concave using a novel combination of simple criteria which operate on the local geometry of these patches. This way the graph is divided into locally convex connected subgraphs, which - with high accuracy - represent object parts. Additionally, we propose a novel depth dependent voxel grid to deal with the decreasing point-density at far distances in the point clouds. This improves segmentation, allowing the use of fixed parameters for vastly different scenes. The algorithm is straight-forward to implement and requires no training data, while nevertheless producing results that are comparable to state-of-the-art methods which incorporate high-level concepts involving classification, learning and model fitting.}}
    Abstract: The problem of how to arrive at an appropriate 3D-segmentation of a scene remains difficult. While current state-of-the-art methods continue to gradually improve in benchmark performance, they also grow more and more complex, for example by incorporating chains of classifiers, which require training on large manually annotated datasets. As an alternative to this, we present a new, efficient learning- and model-free approach for the segmentation of 3D point clouds into object parts. The algorithm begins by decomposing the scene into an adjacency-graph of surface patches based on a voxel grid. Edges in the graph are then classified as either convex or concave using a novel combination of simple criteria which operate on the local geometry of these patches. This way the graph is divided into locally convex connected subgraphs, which - with high accuracy - represent object parts. Additionally, we propose a novel depth dependent voxel grid to deal with the decreasing point-density at far distances in the point clouds. This improves segmentation, allowing the use of fixed parameters for vastly different scenes. The algorithm is straight-forward to implement and requires no training data, while nevertheless producing results that are comparable to state-of-the-art methods which incorporate high-level concepts involving classification, learning and model fitting.
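    The heart of the method is the local convexity test on edges of the patch adjacency graph. In simplified form, a connection between two patches with centroids c1, c2 and unit normals n1, n2 can be treated as convex when (c1 - c2) · (n1 - n2) >= 0, i.e. the normals open away from each other along the line joining the centroids; the paper adds angle thresholds and an additional sanity criterion on top of this. The sketch below applies the simplified test and then region-grows over convex edges only, separating a box from the table it stands on (all coordinates are invented).

      import numpy as np

      def is_convex(c1, n1, c2, n2, tol=0.0):
          """Simplified local convexity test between two surface patches
          (centroid c_i, unit normal n_i): convex if the normals 'open away'
          from each other, i.e. (c1 - c2) . (n1 - n2) >= -tol."""
          c1, n1, c2, n2 = (np.asarray(v, float) for v in (c1, n1, c2, n2))
          return float(np.dot(c1 - c2, n1 - n2)) >= -tol

      def segment(patches, edges, tol=0.0):
          """Group patches into locally convex connected components:
          only edges that pass the convexity test are traversed."""
          convex = {(i, j) for i, j in edges
                    if is_convex(patches[i][0], patches[i][1], patches[j][0], patches[j][1], tol)}
          labels, current = {}, 0
          for seed in patches:
              if seed in labels:
                  continue
              stack, labels[seed] = [seed], current
              while stack:
                  p = stack.pop()
                  for i, j in edges:
                      for a, b in ((i, j), (j, i)):
                          if a == p and b not in labels and (i, j) in convex:
                              labels[b] = current
                              stack.append(b)
              current += 1
          return labels

      if __name__ == "__main__":
          # a box on a table: the box top/side edge is convex, the box-table edge is
          # concave, and the flat table-table edge counts as convex, so the expected
          # grouping is {0: 0, 1: 0, 2: 1, 3: 1}.
          patches = {
              0: ((0.0, 0.0, 1.0), (0.0, 0.0, 1.0)),   # box top, normal up
              1: ((0.5, 0.0, 0.5), (1.0, 0.0, 0.0)),   # box side, normal +x
              2: ((1.0, 0.0, 0.0), (0.0, 0.0, 1.0)),   # table patch next to the box
              3: ((1.5, 0.0, 0.0), (0.0, 0.0, 1.0)),   # neighbouring table patch
          }
          edges = [(0, 1), (1, 2), (2, 3)]
          print(segment(patches, edges))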
    Stein, S. and Wörgötter, F. and Schoeler, M. and Papon, J. and Kulvicius, T. (2014).
    Convexity based object partitioning for robot applications. IEEE International Conference on Robotics and Automation (ICRA), 3213-3220. DOI: 10.1109/ICRA.2014.6907321.
    BibTeX:
    @inproceedings{steinwoergoetterschoeler2014,
      author = {Stein, S. and Wörgötter, F. and Schoeler, M. and Papon, J. and Kulvicius, T.},
      title = {Convexity based object partitioning for robot applications},
      pages = {3213-3220},
      booktitle = {IEEE International Conference on Robotics and Automation (ICRA)},
      year = {2014},
      month = {05},
      doi = {10.1109/ICRA.2014.6907321},
      abstract = {The idea that connected convex surfaces, separated by concave boundaries, play an important role for the perception of objects and their decomposition into parts has been discussed for a long time. Based on this idea, we present a new bottom-up approach for the segmentation of 3D point clouds into object parts. The algorithm approximates a scene using an adjacency-graph of spatially connected surface patches. Edges in the graph are then classified as either convex or concave using a novel, strictly local criterion. Region growing is employed to identify locally convex connected subgraphs, which represent the object parts. We show quantitatively that our algorithm, although conceptually easy to grasp and fast to compute, produces results that are comparable to far more complex state-of-the-art methods which use classification, learning and model fitting. This suggests that convexity/concavity is a powerful feature for object partitioning using 3D data. Furthermore we demonstrate that for many objects a natural decomposition into}}
    Abstract: The idea that connected convex surfaces, separated by concave boundaries, play an important role for the perception of objects and their decomposition into parts has been discussed for a long time. Based on this idea, we present a new bottom-up approach for the segmentation of 3D point clouds into object parts. The algorithm approximates a scene using an adjacency-graph of spatially connected surface patches. Edges in the graph are then classified as either convex or concave using a novel, strictly local criterion. Region growing is employed to identify locally convex connected subgraphs, which represent the object parts. We show quantitatively that our algorithm, although conceptually easy to grasp and fast to compute, produces results that are comparable to far more complex state-of-the-art methods which use classification, learning and model fitting. This suggests that convexity/concavity is a powerful feature for object partitioning using 3D data. Furthermore we demonstrate that for many objects a natural decomposition into
    Schoeler, M. and Stein, S. and Papon, J. and Abramov, A. and Wörgötter, F. (2014).
    Fast Self-supervised On-line Training for Object Recognition Specifically for Robotic Applications. International Conference on Computer Vision Theory and Applications VISAPP, 1 - 10.
    BibTeX:
    @inproceedings{schoelersteinpapon2014,
      author = {Schoeler, M. and Stein, S. and Papon, J. and Abramov, A. and Wörgötter, F.},
      title = {Fast Self-supervised On-line Training for Object Recognition Specifically for Robotic Applications},
      pages = {1 - 10},
      booktitle = {International Conference on Computer Vision Theory and Applications VISAPP},
      year = {2014},
      month = {January},
      abstract = {Today most recognition pipelines are trained at an off-line stage, providing systems with pre-segmented images and predefined objects, or at an on-line stage, which requires a human supervisor to tediously control the learning. Self-Supervised on-line training of recognition pipelines without human intervention is a highly desirable goal, as it allows systems to learn unknown, environment specific objects on-the-fly. We propose a fast and automatic system, which can extract and learn unknown objects with minimal human intervention by employing a two-level pipeline combining the advantages of RGB-D sensors for object extraction and high-resolution cameras for object recognition. Furthermore, we significantly improve recognition results with local features by implementing a novel keypoint orientation scheme, which leads to highly invariant but discriminative object signatures. Using only one image per object for training, our system is able to achieve a recognition rate of 79% for 18 objects, benchmarked on 42 scenes with random poses, scales and occlusion, while only taking 7 seconds for the training. Additionally, we evaluate our orientation scheme on the state-of-the-art 56-object SDU-dataset boosting accuracy for one training view per object by +37% to 78% and peaking at a performance of 98% for 11 training views.}}
    Abstract: Today most recognition pipelines are trained at an off-line stage, providing systems with pre-segmented images and predefined objects, or at an on-line stage, which requires a human supervisor to tediously control the learning. Self-Supervised on-line training of recognition pipelines without human intervention is a highly desirable goal, as it allows systems to learn unknown, environment specific objects on-the-fly. We propose a fast and automatic system, which can extract and learn unknown objects with minimal human intervention by employing a two-level pipeline combining the advantages of RGB-D sensors for object extraction and high-resolution cameras for object recognition. Furthermore, we significantly improve recognition results with local features by implementing a novel keypoint orientation scheme, which leads to highly invariant but discriminative object signatures. Using only one image per object for training, our system is able to achieve a recognition rate of 79% for 18 objects, benchmarked on 42 scenes with random poses, scales and occlusion, while only taking 7 seconds for the training. Additionally, we evaluate our orientation scheme on the state-of-the-art 56-object SDU-dataset boosting accuracy for one training view per object by +37% to 78% and peaking at a performance of 98% for 11 training views.
    Krüger, N. and Ude, A. and Petersen, H. and Nemec, B. and Ellekilde, L. and Savarimuthu, T. and Rytz, J. and Fischer, K. and Buch, A. and Kraft, D. and Mustafa, W. and Aksoy, E. and Papon, J. and Kramberger, A. and Wörgötter, F. (2014).
    Technologies for the Fast Set-Up of Automated Assembly Processes. KI - Künstliche Intelligenz, 1-9. DOI: 10.1007/s13218-014-0329-9.
    BibTeX:
    @article{kruegerudepetersen2014,
      author = {Krüger, N. and Ude, A. and Petersen, H. and Nemec, B. and Ellekilde, L. and Savarimuthu, T. and Rytz, J. and Fischer, K. and Buch, A. and Kraft, D. and Mustafa, W. and Aksoy, E. and Papon, J. and Kramberger, A. and Wörgötter, F.},
      title = {Technologies for the Fast Set-Up of Automated Assembly Processes},
      pages = {1-9},
      journal = {KI - Künstliche Intelligenz},
      year = {2014},
      language = {English},
      publisher = {Springer Berlin Heidelberg},
      url = {http://dx.doi.org/10.1007/s13218-014-0329-9},
      doi = {10.1007/s13218-014-0329-9},
      abstract = {In this article, we describe technologies facilitating the set-up of automated assembly solutions which have been developed in the context of the IntellAct project (2011-2014). Tedious procedures are currently still required to establish such robot solutions. This hinders especially the automation of so called few-of-a-kind production. Therefore, most production of this kind is done manually and thus often performed in low-wage countries. In the IntellAct project, we have developed a set of methods which facilitate the set-up of a complex automatic assembly process, and here we present our work on tele-operation, dexterous grasping, pose estimation and learning of control strategies. The prototype developed in IntellAct is at a TRL4 (corresponding to demonstration in lab environment).}}
    Abstract: In this article, we describe technologies facilitating the set-up of automated assembly solutions which have been developed in the context of the IntellAct project (2011-2014). Tedious procedures are currently still required to establish such robot solutions. This hinders especially the automation of so called few-of-a-kind production. Therefore, most production of this kind is done manually and thus often performed in low-wage countries. In the IntellAct project, we have developed a set of methods which facilitate the set-up of a complex automatic assembly process, and here we present our work on tele-operation, dexterous grasping, pose estimation and learning of control strategies. The prototype developed in IntellAct is at a TRL4 (corresponding to demonstration in lab environment).
    Schlette, C. and Buch, A. and Aksoy, E. and Steil, T. and Papon, J. and Savarimuthu, T. and Wörgötter, F. and Krüger, N. and Roßmann, J. (2014).
    A new benchmark for pose estimation with ground truth from virtual reality. Production Engineering, 745-754, 8, 6. DOI: 10.1007/s11740-014-0552-0.
    BibTeX:
    @article{schlettebuchaksoy2014,
      author = {Schlette, C. and Buch, A. and Aksoy, E. and Steil, T. and Papon, J. and Savarimuthu, T. and Wörgötter, F. and Krüger, N. and Roßmann, J.},
      title = {A new benchmark for pose estimation with ground truth from virtual reality},
      pages = {745-754},
      journal = {Production Engineering},
      year = {2014},
      volume= {8},
      number = {6},
      language = {English},
      publisher = {Springer Berlin Heidelberg},
      url = {http://dx.doi.org/10.1007/s11740-014-0552-0},
      doi = {10.1007/s11740-014-0552-0},
      abstract = {The development of programming paradigms for industrial assembly currently gets fresh impetus from approaches in human demonstration and programming-by-demonstration. Major low- and mid-level prerequisites for machine vision and learning in these intelligent robotic applications are pose estimation, stereo reconstruction and action recognition. As a basis for the machine vision and learning involved, pose estimation is used for deriving object positions and orientations and thus target frames for robot execution. Our contribution introduces and applies a novel benchmark for typical multi-sensor setups and algorithms in the field of demonstration-based automated assembly. The benchmark platform is equipped with a multi-sensor setup consisting of stereo cameras and depth scanning devices (see Fig. 1). The dimensions and abilities of the platform have been chosen in order to reflect typical manual assembly tasks. Following the eRobotics methodology, a simulatable 3D representation of this platform was modelled in virtual reality. Based on a detailed camera and sensor simulation, we generated a set of benchmark images and point clouds with controlled levels of noise as well as ground truth data such as object positions and time stamps. We demonstrate the application of the benchmark to evaluate our latest developments in pose estimation, stereo reconstruction and action recognition and publish the benchmark data for objective comparison of sensor setups and algorithms in industry.}}
    Abstract: The development of programming paradigms for industrial assembly currently gets fresh impetus from approaches in human demonstration and programming-by-demonstration. Major low- and mid-level prerequisites for machine vision and learning in these intelligent robotic applications are pose estimation, stereo reconstruction and action recognition. As a basis for the machine vision and learning involved, pose estimation is used for deriving object positions and orientations and thus target frames for robot execution. Our contribution introduces and applies a novel benchmark for typical multi-sensor setups and algorithms in the field of demonstration-based automated assembly. The benchmark platform is equipped with a multi-sensor setup consisting of stereo cameras and depth scanning devices (see Fig. 1). The dimensions and abilities of the platform have been chosen in order to reflect typical manual assembly tasks. Following the eRobotics methodology, a simulatable 3D representation of this platform was modelled in virtual reality. Based on a detailed camera and sensor simulation, we generated a set of benchmark images and point clouds with controlled levels of noise as well as ground truth data such as object positions and time stamps. We demonstrate the application of the benchmark to evaluate our latest developments in pose estimation, stereo reconstruction and action recognition and publish the benchmark data for objective comparison of sensor setups and algorithms in industry.
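    The benchmark provides ground-truth object poses, against which estimated poses can be scored. A typical way to do this (shown below as our own illustration, not the benchmark's official metric) is to report the translation distance together with the angle of the relative rotation between estimate and ground truth.

      import numpy as np

      def pose_error(R_est, t_est, R_gt, t_gt):
          """Translation error (metres) and rotation error (degrees) of an estimated
          pose (R_est, t_est) against ground truth. The rotation error is the angle
          of the relative rotation R_gt^T @ R_est."""
          t_err = float(np.linalg.norm(np.asarray(t_est) - np.asarray(t_gt)))
          R_rel = np.asarray(R_gt).T @ np.asarray(R_est)
          cos_angle = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
          return t_err, float(np.degrees(np.arccos(cos_angle)))

      def rot_z(deg):
          a = np.radians(deg)
          return np.array([[np.cos(a), -np.sin(a), 0.0],
                           [np.sin(a),  np.cos(a), 0.0],
                           [0.0,        0.0,       1.0]])

      if __name__ == "__main__":
          R_gt, t_gt = rot_z(30.0), np.array([0.2, 0.0, 0.5])
          R_est, t_est = rot_z(33.0), np.array([0.21, -0.01, 0.5])
          print(pose_error(R_est, t_est, R_gt, t_gt))   # about (0.014, 3.0)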
    Papon, J. and Schoeler, M. and Wörgötter, F. (2015).
    Spatially Stratified Correspondence Sampling for Real-Time Point Cloud Tracking. IEEE Winter Conference on Applications of Computer Vision (WACV), 124-131. DOI: 10.1109/WACV.2015.24.
    BibTeX:
    @inproceedings{paponschoelerwoergoetter2015,
      author = {Papon, J. and Schoeler, M. and Wörgötter, F.},
      title = {Spatially Stratified Correspondence Sampling for Real-Time Point Cloud Tracking},
      pages = {124-131},
      booktitle = {IEEE Winter Conference on Applications of Computer Vision (WACV)},
      year = {2015},
      month = {Jan},
      doi = {10.1109/WACV.2015.24},
      abstract = {In this paper we propose a novel spatially stratified sampling technique for evaluating the likelihood function in particle filters. In particular, we show that in the case where the measurement function uses spatial correspondence, we can greatly reduce computational cost by exploiting spatial structure to avoid redundant computations. We present results which quantitatively show that the technique permits accuracy equivalent to, and in some cases greater than, that of a reference point cloud particle filter at significantly faster run-times. We also compare to a GPU implementation, and show that we can exceed its performance on the CPU. In addition, we present results on a multi-target tracking application, demonstrating that the increases in efficiency permit online 6DoF multi-target tracking on standard hardware.}}
    Abstract: In this paper we propose a novel spatially stratified sampling technique for evaluating the likelihood function in particle filters. In particular, we show that in the case where the measurement function uses spatial correspondence, we can greatly reduce computational cost by exploiting spatial structure to avoid redundant computations. We present results which quantitatively show that the technique permits accuracy equivalent to, and in some cases greater than, that of a reference point cloud particle filter at significantly faster run-times. We also compare to a GPU implementation, and show that we can exceed its performance on the CPU. In addition, we present results on a multi-target tracking application, demonstrating that the increases in efficiency permit online 6DoF multi-target tracking on standard hardware.
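    A simplified reading of the idea, sketched below with invented data: when the likelihood of a particle is built from point correspondences, the model points are first grouped into spatial strata (here, voxel cells) and only a few correspondences are drawn from each stratum per evaluation, so every region of the object contributes while the per-particle cost drops. The paper's actual sampling scheme and observation model differ in detail.

      import numpy as np

      rng = np.random.default_rng(0)

      def stratify(points, cell=0.1):
          """Group point indices by the voxel cell they fall in (the spatial strata)."""
          strata = {}
          for idx, key in enumerate(map(tuple, np.floor(points / cell).astype(int))):
              strata.setdefault(key, []).append(idx)
          return list(strata.values())

      def stratified_likelihood(model, scene, strata, per_stratum=2, sigma=0.02):
          """Approximate correspondence likelihood using only a few sampled points per
          stratum rather than every model point. `scene` here is a same-size array of
          corresponding scene points, standing in for a nearest-neighbour lookup."""
          err, count = 0.0, 0
          for indices in strata:
              chosen = rng.choice(indices, size=min(per_stratum, len(indices)), replace=False)
              err += np.sum(np.linalg.norm(model[chosen] - scene[chosen], axis=1) ** 2)
              count += len(chosen)
          return np.exp(-err / (count * 2.0 * sigma**2))

      if __name__ == "__main__":
          model = rng.random((5000, 3))
          good_scene = model + rng.normal(scale=0.01, size=model.shape)   # well-aligned pose
          bad_scene = model + 0.05                                        # offset pose
          strata = stratify(model)
          print("aligned :", stratified_likelihood(model, good_scene, strata))
          print("offset  :", stratified_likelihood(model, bad_scene, strata))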
    Schoeler, M. and Wörgötter, F. and Papon, J. and Kulvicius, T. (2015).
    Unsupervised generation of context-relevant training-sets for visual object recognition employing multilinguality. IEEE Winter Conference on Applications of Computer Vision (WACV), 805-812. DOI: 10.1109/WACV.2015.112.
    BibTeX:
    @inproceedings{schoelerwoergoetterpapon2015,
      author = {Schoeler, M. and Wörgötter, F. and Papon, J. and Kulvicius, T.},
      title = {Unsupervised generation of context-relevant training-sets for visual object recognition employing multilinguality},
      pages = {805-812},
      booktitle = {IEEE Winter Conference on Applications of Computer Vision (WACV)},
      year = {2015},
      month = {Jan},
      doi = {10.1109/WACV.2015.112},
      abstract = {Image based object classification requires clean training data sets. Gathering such sets is usually done manually by humans, which is time-consuming and laborious. On the other hand, directly using images from search engines creates very noisy data due to ambiguous noun-focused indexing. However, in daily speech nouns and verbs are always coupled. We use this for the automatic generation of clean data sets by the here-presented TRANSCLEAN algorithm, which through the use of multiple languages also solves the problem of polysemes (a single spelling with multiple meanings). Thus, we use the implicit knowledge contained in verbs, e.g. in an imperative such as}}
    Abstract: Image based object classification requires clean training data sets. Gathering such sets is usually done manually by humans, which is time-consuming and laborious. On the other hand, directly using images from search engines creates very noisy data due to ambiguous noun-focused indexing. However, in daily speech nouns and verbs are always coupled. We use this for the automatic generation of clean data sets by the here-presented TRANSCLEAN algorithm, which through the use of multiple languages also solves the problem of polysemes (a single spelling with multiple meanings). Thus, we use the implicit knowledge contained in verbs, e.g. in an imperative such as
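    The abstract is cut off mid-sentence, but the core multilinguality idea can be caricatured as follows: keep only images that are returned for the query term in several languages, since a polyseme usually exists in only one of them. The word lists and result sets below are entirely invented and much simpler than the TRANSCLEAN pipeline.

      # Hypothetical search results: image ids returned for the word "nut" and its
      # translations. The English results mix the tool sense and the food sense;
      # requiring agreement across languages keeps only the intended (tool) sense.
      results = {
          "en:nut":     {"img01", "img02", "img03", "img07", "img09"},  # tool + food senses
          "de:Mutter":  {"img01", "img02", "img05", "img11"},           # tool sense (word also means 'mother')
          "fr:écrou":   {"img01", "img02", "img03", "img05"},           # tool sense only
      }

      min_languages = 2
      counts = {}
      for images in results.values():
          for img in images:
              counts[img] = counts.get(img, 0) + 1

      clean_set = sorted(img for img, c in counts.items() if c >= min_languages)
      print(clean_set)   # ['img01', 'img02', 'img03', 'img05']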
    Savarimuthu, R. and Papon, J. and Buch, A. G. and Aksoy, E. and Mustafa, W. and Wörgötter, F. and Krüger, N. (2015).
    An Online Vision System for Understanding Complex Assembly Tasks. International Conference on Computer Vision Theory and Applications, 1 - 8. DOI: 10.5220/0005260804540461.
    BibTeX:
    @inproceedings{savarimuthupaponbuch2015,
      author = {Savarimuthu, R. and Papon, J. and Buch, A. G. and Aksoy, E. and Mustafa, W. and Wörgötter, F. and Krüger, N.},
      title = {An Online Vision System for Understanding Complex Assembly Tasks},
      pages = {1 - 8},
      booktitle = {International Conference on Computer Vision Theory and Applications},
      year = {2015},
      location = {Berlin (Germany)},
      month = {March 11 - 14},
      doi = {10.5220/0005260804540461},
      abstract = {We present an integrated system for the recognition, pose estimation and simultaneous tracking of multiple objects in 3D scenes. Our target application is a complete semantic representation of dynamic scenes which requires three essential steps: recognition of objects, tracking their movements, and identification of interactions between them. We address this challenge with a complete system which uses object recognition and pose estimation to initiate object models and trajectories, a dynamic sequential octree structure to allow for full 6DOF tracking through occlusions, and a graph-based semantic representation to distil interactions. We evaluate the proposed method on real scenarios by comparing tracked outputs to ground truth trajectories and we compare the results to Iterative Closest Point and Particle Filter based trackers.}}
    Abstract: We present an integrated system for the recognition, pose estimation and simultaneous tracking of multiple objects in 3D scenes. Our target application is a complete semantic representation of dynamic scenes which requires three essential steps: recognition of objects, tracking their movements, and identification of interactions between them. We address this challenge with a complete system which uses object recognition and pose estimation to initiate object models and trajectories, a dynamic sequential octree structure to allow for full 6DOF tracking through occlusions, and a graph-based semantic representation to distil interactions. We evaluate the proposed method on real scenarios by comparing tracked outputs to ground truth trajectories and we compare the results to Iterative Closest Point and Particle Filter based trackers.
    Schoeler, M. and Papon, J. and Wörgötter, F. (2015).
    Constrained planar cuts - Object partitioning for point clouds. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 5207-5215. DOI: 10.1109/CVPR.2015.7299157.
    BibTeX:
    @inproceedings{schoelerpaponwoergoetter2015,
      author = {Schoeler, M. and Papon, J. and Wörgötter, F.},
      title = {Constrained planar cuts - Object partitioning for point clouds},
      pages = {5207-5215},
      booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
      year = {2015},
      location = {Boston, MA, USA},
      month = {June},
      doi = {10.1109/CVPR.2015.7299157},
      abstract = {While humans can easily separate unknown objects into meaningful parts, recent segmentation methods can only achieve similar partitionings by training on human-annotated ground-truth data. Here we introduce a bottom-up method for segmenting 3D point clouds into functional parts which does not require supervision and achieves equally good results. Our method uses local concavities as an indicator for inter-part boundaries. We show that this criterion is efficient to compute and generalizes well across different object classes. The algorithm employs a novel locally constrained geometrical boundary model which proposes greedy cuts through a local concavity graph. Only planar cuts are considered and evaluated using a cost function, which rewards cuts orthogonal to concave edges. Additionally, a local clustering constraint is applied to ensure the partitioning only affects relevant locally concave regions. We evaluate our algorithm on recordings from an RGB-D camera as well as the Princeton Segmentation Benchmark, using a fixed set of parameters across all object classes. This stands in stark contrast to most reported results which require either knowing the number of parts or annotated ground-truth for learning. Our approach outperforms all existing bottom-up methods (reducing the gap to human performance by up to 50 %) and achieves scores similar to top-down data-driven approaches.}}
    Abstract: While humans can easily separate unknown objects into meaningful parts, recent segmentation methods can only achieve similar partitionings by training on human-annotated ground-truth data. Here we introduce a bottom-up method for segmenting 3D point clouds into functional parts which does not require supervision and achieves equally good results. Our method uses local concavities as an indicator for inter-part boundaries. We show that this criterion is efficient to compute and generalizes well across different object classes. The algorithm employs a novel locally constrained geometrical boundary model which proposes greedy cuts through a local concavity graph. Only planar cuts are considered and evaluated using a cost function, which rewards cuts orthogonal to concave edges. Additionally, a local clustering constraint is applied to ensure the partitioning only affects relevant locally concave regions. We evaluate our algorithm on recordings from an RGB-D camera as well as the Princeton Segmentation Benchmark, using a fixed set of parameters across all object classes. This stands in stark contrast to most reported results which require either knowing the number of parts or annotated ground-truth for learning. Our approach outperforms all existing bottom-up methods (reducing the gap to human performance by up to 50 %) and achieves scores similar to top-down data-driven approaches.
    Papon, J. and Schoeler, M. (2015).
    Semantic Pose using Deep Networks Trained on Synthetic RGB-D. IEEE International Conference on Computer Vision (ICCV), 1-9.
    BibTeX:
    @inproceedings{paponschoeler2015,
      author = {Papon, J. and Schoeler, M.},
      title = {Semantic Pose using Deep Networks Trained on Synthetic RGB-D},
      pages = {1-9},
      booktitle = {IEEE International Conference on Computer Vision (ICCV)},
      year = {2015},
      location = {Santiago, Chile},
      month = {12},
      url = {http://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Papon_Semantic_Pose_Using_ICCV_2015_paper.pdf},
      abstract = {In this work we address the problem of indoor scene understanding from RGB-D images. Specifically, we propose to find instances of common furniture classes, their spatial extent, and their pose with respect to generalized class models. To accomplish this, we use a deep, wide, multi-output convolutional neural network (CNN) that predicts class, pose, and location of possible objects simultaneously. To overcome the lack of large annotated RGB-D training sets (especially those with pose), we use an on-the-fly rendering pipeline that generates realistic cluttered room scenes in parallel to training. We then perform transfer learning on the relatively small amount of publicly available annotated RGB-D data, and find that our model is able to successfully annotate even highly challenging real scenes. Importantly, our trained network is able to understand noisy and sparse observations of highly cluttered scenes with a remarkable degree of accuracy, inferring class and pose from a very limited set of cues. Additionally, our neural network is only moderately deep and computes class, pose and position in tandem, so the overall run-time is significantly faster than existing methods, estimating all output parameters simultaneously in parallel on a GPU in seconds.}}
    Abstract: In this work we address the problem of indoor scene understanding from RGB-D images. Specifically, we propose to find instances of common furniture classes, their spatial extent, and their pose with respect to generalized class models. To accomplish this, we use a deep, wide, multi-output convolutional neural network (CNN) that predicts class, pose, and location of possible objects simultaneously. To overcome the lack of large annotated RGB-D training sets (especially those with pose), we use an on-the-fly rendering pipeline that generates realistic cluttered room scenes in parallel to training. We then perform transfer learning on the relatively small amount of publicly available annotated RGB-D data, and find that our model is able to successfully annotate even highly challenging real scenes. Importantly, our trained network is able to understand noisy and sparse observations of highly cluttered scenes with a remarkable degree of accuracy, inferring class and pose from a very limited set of cues. Additionally, our neural network is only moderately deep and computes class, pose and position in tandem, so the overall run-time is significantly faster than existing methods, estimating all output parameters simultaneously in parallel on a GPU in seconds.
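    To illustrate what a deep, wide, multi-output network for this task looks like structurally, here is a small PyTorch module with a shared trunk over an RGB-D input and separate class, pose and location heads. It is our own toy sketch: layer sizes, the quaternion pose parameterisation and all other choices are assumptions, not the architecture or framework used in the paper.

      import torch
      import torch.nn as nn

      class MultiOutputNet(nn.Module):
          """Toy multi-output network: one shared trunk over an RGB-D input, three
          heads predicting object class, pose (here a quaternion), and 2-D location.
          Layer sizes are arbitrary; the paper's network and training setup differ."""
          def __init__(self, n_classes=10):
              super().__init__()
              self.trunk = nn.Sequential(
                  nn.Conv2d(4, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
                  nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
                  nn.AdaptiveAvgPool2d(4), nn.Flatten(),
                  nn.Linear(64 * 4 * 4, 256), nn.ReLU(),
              )
              self.class_head = nn.Linear(256, n_classes)
              self.pose_head = nn.Linear(256, 4)      # unit quaternion after normalisation
              self.loc_head = nn.Linear(256, 2)       # image-plane location

          def forward(self, x):
              features = self.trunk(x)
              pose = self.pose_head(features)
              pose = pose / pose.norm(dim=1, keepdim=True)
              return self.class_head(features), pose, self.loc_head(features)

      if __name__ == "__main__":
          net = MultiOutputNet()
          rgbd = torch.randn(2, 4, 120, 160)           # batch of 2 RGB-D crops
          cls, pose, loc = net(rgbd)
          print(cls.shape, pose.shape, loc.shape)      # (2, 10) (2, 4) (2, 2)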
