2025
Li, S; Keipour, A; Zhao, S; Rajagopalan, S; Swan, C; Bekris, K. "Learning to Optimize Package Picking for Large-Scale, Real-World Robot Induction." 19th International Symposium on Experimental Robotics (ISER), 2025. https://arxiv.org/abs/2506.09765. Tags: Learning, Manipulation.
Abstract: Warehouse automation plays a pivotal role in enhancing operational efficiency, minimizing costs, and improving resilience to workforce variability. While prior research has demonstrated the potential of machine learning (ML) models to increase picking success rates in large-scale robotic fleets by prioritizing high-probability picks and packages, these efforts primarily focused on predicting success probabilities for picks sampled using heuristic methods. Limited attention has been given, however, to leveraging data-driven approaches to directly optimize sampled picks for better performance at scale. In this study, we propose an ML-based framework that predicts transform adjustments and improves the selection of suction cups for multi-suction end effectors, so as to enhance the success probabilities of sampled picks. The framework was integrated and evaluated in test workcells that resemble the operations of Amazon Robotics' Robot Induction (Robin) fleet, which is used for package manipulation. Evaluated on over 2 million picks, the proposed method achieves a 20% reduction in pick failure rates compared to a heuristic-based pick sampling baseline, demonstrating its effectiveness in large-scale warehouse automation scenarios.
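As a rough illustration of the interface the abstract describes (a model that refines heuristically sampled picks by predicting a pose adjustment and per-cup activations), here is a minimal sketch. This is not the paper's implementation; the network size, feature inputs, and output parameterization are assumptions made for the example.

```python
# Hypothetical sketch: given features of a sampled pick, predict (i) a small pose adjustment
# and (ii) per-cup scores for a multi-suction end effector. Dimensions are illustrative.
import torch
import torch.nn as nn

class PickRefiner(nn.Module):
    def __init__(self, feat_dim=256, num_cups=8):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
        )
        # 3 translation offsets + 3 rotation offsets (axis-angle), in the end-effector frame
        self.delta_pose = nn.Linear(128, 6)
        # one logit per suction cup; sigmoid gives an activation probability
        self.cup_logits = nn.Linear(128, num_cups)

    def forward(self, pick_features):
        h = self.backbone(pick_features)
        return self.delta_pose(h), torch.sigmoid(self.cup_logits(h))

# Usage: refine a batch of candidate picks sampled by a heuristic.
model = PickRefiner()
features = torch.randn(16, 256)        # placeholder features for 16 candidate picks
delta, cup_probs = model(features)     # (16, 6) pose offsets, (16, 8) cup probabilities
```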
Wang, C; Vanbaar, J; Mitash, C; Li, S; Randle, D; Keipour, A; Hussein, M; Bekris, K; Katyal, K. "Demonstrating Item Picking at Scale via Effective Learning of Multimodal Representations." Proceedings of Robotics: Science and Systems (RSS), 2025. https://arxiv.org/abs/2506.10359. Tags: Learning, Manipulation, Planning.
Abstract: This work demonstrates how autonomously learning aspects of robotic operation from sparsely labeled, real-world data of deployed, engineered solutions at industrial scale can yield solutions with improved performance. Specifically, it focuses on multi-suction robot picking and performs a comprehensive study of multimodal visual encoders for predicting the success of candidate robotic picks. Picking diverse items from unstructured piles is an important and challenging task for robot manipulation in real-world settings, such as warehouses. Methods for picking from clutter must work for an open set of items while simultaneously meeting latency constraints to achieve high throughput. The demonstrated approach uses multiple input modalities, such as RGB, depth, and semantic segmentation, to estimate the quality of candidate multi-suction picks. The strategy is trained on real-world item-picking data with a combination of multimodal pretraining and finetuning. The manuscript provides a comprehensive experimental evaluation over a large item-picking dataset, an item-picking dataset targeted to include partial occlusions, and a package-picking dataset that focuses on containers, such as boxes and envelopes, instead of unpackaged items. The evaluation measures performance for different item configurations, pick scenes, and object types. Ablations help to understand the effects of in-domain pretraining, the impact of different modalities, and the importance of finetuning. These ablations reveal both the importance of training over multiple modalities and the ability of models to learn, during pretraining, the relationships between modalities, so that only a subset of them needs to be used as input during finetuning and inference.
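To make the kind of architecture the abstract describes concrete, here is a minimal sketch of per-modality encoders fused into a pick-success score. This is not the authors' model; the channel counts, crop sizes, and fusion-by-concatenation choice are assumptions for illustration.

```python
# Illustrative only: small CNN encoders per modality, concatenated and scored by an MLP.
import torch
import torch.nn as nn

def encoder(in_ch, out_dim=64):
    return nn.Sequential(
        nn.Conv2d(in_ch, 16, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, out_dim),
    )

class PickSuccessModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.rgb = encoder(3)       # RGB crop around the candidate pick
        self.depth = encoder(1)     # depth crop
        self.seg = encoder(1)       # segmentation-mask crop
        self.head = nn.Sequential(nn.Linear(3 * 64, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, rgb, depth, seg):
        z = torch.cat([self.rgb(rgb), self.depth(depth), self.seg(seg)], dim=-1)
        return torch.sigmoid(self.head(z))   # estimated probability that the pick succeeds

scores = PickSuccessModel()(torch.randn(4, 3, 96, 96),
                            torch.randn(4, 1, 96, 96),
                            torch.randn(4, 1, 96, 96))
```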
2024
Bekris, K; Doerr, J; Meng, P; Tangirala, S. "The State of Robot Motion Generation." International Symposium of Robotics Research (ISRR), Long Beach, California, 2024. https://arxiv.org/abs/2410.12172. Tags: Dynamics, Learning, Planning.
Abstract: This paper first reviews the large spectrum of methods for generating robot motion proposed over the 50 years of robotics research, culminating in recent developments. It crosses the boundaries of methodologies that are typically not surveyed together, from those that operate over explicit models to those that learn implicit ones. The paper concludes with a discussion of the current state of the art and the properties of the varying methodologies, highlighting opportunities for integration.
2023
Lu, S; Deng, Y; Boularias, A; Bekris, K. "Self-Supervised Learning of Object Segmentation from Unlabeled RGB-D Videos." IEEE International Conference on Robotics and Automation (ICRA), London, UK, 2023. https://arxiv.org/abs/2304.04325. Tags: Learning, Perception.
Abstract: This work proposes a self-supervised learning system for segmenting rigid objects in RGB images. The proposed pipeline is trained on unlabeled RGB-D videos of static objects, which can be captured with a camera carried by a mobile robot. A key feature of the self-supervised training process is a graph-matching algorithm that operates on the over-segmentation output of the point cloud reconstructed from each video. The graph matching, along with point cloud registration, is able to find recurring object patterns across videos and combine them into 3D object pseudo-labels, even under occlusions or different viewing angles. Projected 2D object masks from the 3D pseudo-labels are used to train a pixel-wise feature extractor through contrastive learning. During online inference, a clustering method uses the learned features to cluster foreground pixels into object segments. Experiments highlight the method's effectiveness on both real and synthetic video datasets, which include cluttered scenes of tabletop objects. The proposed method outperforms existing unsupervised methods for object segmentation by a large margin.
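The pixel-wise feature extractor is trained with contrastive learning over the projected pseudo-labels. Below is a minimal sketch of one such pixel-level contrastive objective (a supervised-contrastive-style loss over sampled pixels); the actual loss used in the paper may differ, and all shapes are assumptions.

```python
# Sketch: pixels sharing a pseudo-object label are pulled together, others pushed apart.
import torch
import torch.nn.functional as F

def pixel_contrastive_loss(features, pseudo_labels, temperature=0.1):
    """features: (N, D) pixel embeddings sampled from an image.
    pseudo_labels: (N,) integer object ids projected from the 3D pseudo-labels."""
    features = F.normalize(features, dim=1)
    sim = features @ features.t() / temperature                 # (N, N) cosine similarities
    same = pseudo_labels.unsqueeze(0) == pseudo_labels.unsqueeze(1)
    eye = torch.eye(len(features), dtype=torch.bool)
    pos = same & ~eye                                            # positives: same pseudo-object
    # log-softmax over all other pixels, averaged over positive pairs
    log_prob = sim - torch.logsumexp(sim.masked_fill(eye, float('-inf')), dim=1, keepdim=True)
    return -(log_prob[pos]).mean()

loss = pixel_contrastive_loss(torch.randn(128, 32), torch.randint(0, 5, (128,)))
```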
2022
McMahon, T; Sivaramakrishnan, A; Kedia, K; Granados, E; Bekris, K. "Terrain-Aware Learned Controllers for Sampling-Based Kinodynamic Planning Over Physically Simulated Terrains." IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2022. https://ieeexplore.ieee.org/document/9982136. Tags: Dynamics, Learning, Planning.
Abstract: This paper explores learning an effective controller for improving the efficiency of kinodynamic planning for vehicular systems navigating uneven terrains. It describes the pipeline for training the corresponding controller and using it for motion planning purposes. The training process uses a soft actor-critic approach with hindsight experience replay to train a model, which is parameterized by the incline of the robot's local terrain. This trained model is then used during the expansion process of an asymptotically optimal kinodynamic planner to generate controls that allow the robot to reach desired local states. It is also used to define a heuristic cost-to-go function for the planner via a wavefront operation that estimates the cost of reaching the global goal. The cost-to-go function is used both for selecting nodes for expansion and for generating local goals for the controller to expand towards. The accompanying experimental section applies the integrated planning solution to models of all-terrain robots in a variety of physically simulated terrains. It shows that the proposed terrain-aware controller and the proposed wavefront function based on the cost-to-go model enable motion planners to find solutions in less time and with lower cost than alternatives. An ablation study emphasizes the benefits of a learned controller parameterized by the incline of the robot's local terrain, as well as of an incremental training process for the controller.
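The wavefront cost-to-go described in the abstract can be illustrated with a simple propagation from the goal over a grid. The sketch below is assumption-laden (a 2D grid, eight-connected neighbors, per-cell traversal costs) and is not the paper's implementation; the resulting values can serve both as a planner heuristic and to pick local goals for a learned controller.

```python
# Dijkstra-style wavefront: costs propagate outward from the goal cell.
import heapq
import math
import numpy as np

def wavefront_cost_to_go(traversal_cost, goal):
    """traversal_cost: (H, W) per-cell cost (e.g., higher on steep inclines); goal: (row, col)."""
    H, W = traversal_cost.shape
    cost_to_go = np.full((H, W), np.inf)
    cost_to_go[goal] = 0.0
    frontier = [(0.0, goal)]
    while frontier:
        c, (r, col) = heapq.heappop(frontier)
        if c > cost_to_go[r, col]:
            continue                      # stale queue entry
        for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1), (-1, -1), (-1, 1), (1, -1), (1, 1)]:
            nr, nc = r + dr, col + dc
            if 0 <= nr < H and 0 <= nc < W:
                step = math.hypot(dr, dc) * traversal_cost[nr, nc]
                if c + step < cost_to_go[nr, nc]:
                    cost_to_go[nr, nc] = c + step
                    heapq.heappush(frontier, (c + step, (nr, nc)))
    return cost_to_go

grid = np.ones((50, 50))                  # placeholder terrain costs
ctg = wavefront_cost_to_go(grid, goal=(45, 45))
```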
Wen, B; Lian, W; Bekris, K; Schaal, S. "You Only Demonstrate Once: Category-Level Manipulation from Single Visual Demonstration." Robotics: Science and Systems (RSS), 2022. Nominated for Best Paper Award. https://www.roboticsproceedings.org/rss18/p044.pdf. Tags: Learning, Manipulation, Perception.
Abstract: Promising results have been achieved recently in category-level manipulation that generalizes across object instances. Nevertheless, it often requires expensive real-world data collection and manual specification of semantic keypoints for each object category and task. Additionally, coarse keypoint predictions and ignoring intermediate action sequences hinder adoption in complex manipulation tasks beyond pick-and-place. This work proposes a novel, category-level manipulation framework that leverages an object-centric, category-level representation and model-free 6-DoF motion tracking. The canonical object representation is learned solely in simulation and then used to parse a category-level task trajectory from a single demonstration video. The demonstration is reprojected to a target trajectory tailored to a novel object via the canonical representation. During execution, the manipulation horizon is decomposed into long-range, collision-free motion and last-inch manipulation. For the latter, a category-level behavior cloning (CatBC) method leverages motion tracking to perform closed-loop control. CatBC follows the target trajectory, projected from the demonstration and anchored to a dynamically selected category-level coordinate frame. The frame is automatically selected along the manipulation horizon by a local attention mechanism. This framework makes it possible to teach different manipulation strategies by providing only a single demonstration, without complicated manual programming. Extensive experiments demonstrate its efficacy in a range of challenging industrial tasks in high-precision assembly, which involve learning complex, long-horizon policies. The process exhibits robustness against uncertainty due to dynamics, as well as generalization across object instances and scene configurations.
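The closed-loop idea of re-anchoring a demonstrated trajectory to the tracked object frame can be sketched with basic homogeneous-transform composition. All names and shapes below are illustrative assumptions, not the paper's code.

```python
# Sketch: express demonstrated waypoints in the category-level object frame, then re-anchor
# them at every control step using the latest 6-DoF track of the object.
import numpy as np

def next_waypoint(T_world_object, trajectory_in_object_frame, step):
    """T_world_object: 4x4 tracked object pose; trajectory_in_object_frame: (N, 4, 4) waypoints."""
    return T_world_object @ trajectory_in_object_frame[step]

T_obj = np.eye(4)                              # placeholder pose from the 6-DoF tracker
demo = np.tile(np.eye(4), (10, 1, 1))          # placeholder demonstrated waypoints
target = next_waypoint(T_obj, demo, step=3)    # 4x4 target end-effector pose in the world frame
```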
2021
Wang, K; Aanjaneya, M; Bekris, K. "Sim2Sim Evaluation of a Novel Data-Efficient Differentiable Physics Engine for Tensegrity Robots." IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2021. https://arxiv.org/abs/2011.04929. Tags: Dynamics, Learning, Soft-Robots.
Abstract: Learning policies in simulation is promising for reducing the human effort needed to train robot controllers. This is especially true for soft robots, which are more adaptive and safe but also more difficult to accurately model and control. The sim2real gap is the main barrier to successfully transferring policies from simulation to a real robot. System identification can be applied to reduce this gap, but traditional identification methods require a lot of manual tuning. Data-driven alternatives can tune dynamical models directly from data but are often data hungry, which again demands human effort for data collection. This work proposes a data-driven, end-to-end differentiable simulator focused on the exciting but challenging domain of tensegrity robots. To the best of the authors' knowledge, this is the first differentiable physics engine for tensegrity robots that supports cable, contact, and actuation modeling. The aim is to develop a reasonably simplified, data-driven simulation that can learn approximate dynamics from limited ground-truth data. The dynamics must be accurate enough to generate policies that can be transferred back to the ground-truth system. As a first step in this direction, the current work demonstrates sim2sim transfer, where the unknown physical model of MuJoCo acts as the ground-truth system. Two different tensegrity robots, a 6-bar and a 3-bar tensegrity, are used for evaluating and learning locomotion policies. The results indicate that, when the differentiable engine is used for training, only 0.25% of the ground-truth data is needed to obtain a policy that works on the ground-truth system, compared to training the policy directly on that system.
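The value of a differentiable engine for data-efficient system identification can be conveyed with a toy example: fit an unknown parameter of a simulated system by backpropagating through the rollout. The stand-in below uses a spring-damper rather than a tensegrity model and is only a sketch of the underlying idea, not the paper's engine.

```python
# Toy sketch: gradients flow through a simple simulator to identify an unknown stiffness.
import torch

def rollout(k, x0=1.0, v0=0.0, dt=0.01, steps=200, damping=0.1):
    x, v, traj = torch.tensor(x0), torch.tensor(v0), []
    for _ in range(steps):
        a = -k * x - damping * v          # spring-damper dynamics (stand-in for cables)
        v = v + dt * a
        x = x + dt * v
        traj.append(x)
    return torch.stack(traj)

observed = rollout(torch.tensor(5.0)).detach()      # "ground-truth" system (unknown k = 5)
k_hat = torch.tensor(1.0, requires_grad=True)       # initial guess for the parameter
opt = torch.optim.Adam([k_hat], lr=0.05)
for _ in range(300):
    opt.zero_grad()
    loss = ((rollout(k_hat) - observed) ** 2).mean()
    loss.backward()                                  # gradients flow through the simulator
    opt.step()
```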
2020
Wen, B; Mitash, C; Ren, B; Bekris, K. "se(3)-TrackNet: Data-Driven 6D Pose Tracking by Calibrating Image Residuals in Synthetic Domains." IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, 2020. http://arxiv.org/abs/2007.13866. Tags: Learning, Perception.
Abstract: Tracking the 6D pose of objects in video sequences is important for robot manipulation. This task, however, introduces multiple challenges: (i) robot manipulation involves significant occlusions; (ii) data and annotations for 6D poses are troublesome and difficult to collect, which complicates machine learning solutions; and (iii) incremental error drift often accumulates in long-term tracking, necessitating re-initialization of the object's pose. This work proposes a data-driven optimization approach for long-term, 6D pose tracking. It aims to identify the optimal relative pose given the current RGB-D observation and a synthetic image conditioned on the previous best estimate and the object's model. The key contribution in this context is a novel neural network architecture, which appropriately disentangles the feature encoding to help reduce domain shift, together with an effective 3D orientation representation via Lie algebra. Consequently, even when trained only with synthetic data, the network works effectively over real images. Comprehensive experiments over benchmarks - existing ones as well as a new dataset with significant occlusions related to object manipulation - show that the proposed approach achieves consistently robust estimates and outperforms alternatives, even those trained with real images. The approach is also the most computationally efficient among the alternatives and achieves a tracking frequency of 90.9 Hz.
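The per-frame pose update that such a tracker performs can be sketched as follows: a small relative transform predicted as a Lie-algebra vector (axis-angle rotation plus translation) is mapped to a rigid transform and composed with the previous estimate. This is an assumed wiring of the update step, not the paper's code.

```python
# Sketch: apply a predicted relative pose increment to the previous 6D estimate.
import numpy as np

def axis_angle_to_matrix(w):
    theta = np.linalg.norm(w)
    if theta < 1e-9:
        return np.eye(3)
    k = w / theta
    K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)   # Rodrigues formula

def apply_relative_pose(T_prev, rot_vec, trans):
    dT = np.eye(4)
    dT[:3, :3] = axis_angle_to_matrix(rot_vec)
    dT[:3, 3] = trans
    return T_prev @ dT                      # compose the increment with the last pose estimate

T = np.eye(4)                               # previous best pose estimate
T = apply_relative_pose(T, rot_vec=np.array([0.0, 0.01, 0.0]), trans=np.array([0.001, 0.0, 0.0]))
```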
2019
Mitash, C; Wen, B; Bekris, K; Boularias, A. "Scene-Level Pose Estimation for Multiple Instances of Densely Packed Objects." Conference on Robot Learning (CoRL), Osaka, Japan, 2019. https://arxiv.org/pdf/1910.04953.pdf. Tags: Learning, Perception.
Abstract: This paper introduces key machine learning operations that allow the realization of robust, joint 6D pose estimation of multiple instances of objects either densely packed or in unstructured piles from RGB-D data. The first objective is to learn semantic and instance-boundary detectors without manual labeling. An adversarial training framework in conjunction with physics-based simulation is used to achieve detectors that behave similarly in synthetic and real data. Given the stochastic output of such detectors, candidates for object poses are sampled. The second objective is to automatically learn a single score for each pose candidate that represents its quality in terms of explaining the entire scene via a gradient boosted tree. The proposed method uses features derived from surface and boundary alignment between the observed scene and the object model placed at hypothesized poses. Scene-level, multi-instance pose estimation is then achieved by an integer linear programming process that selects hypotheses that maximize the sum of the learned individual scores, while respecting constraints, such as avoiding collisions. To evaluate this method, a dataset of densely packed objects with challenging setups for state-of-the-art approaches is collected. Experiments on this dataset and a public one show that the method significantly outperforms alternatives in terms of 6D pose accuracy while trained only with synthetic datasets.
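The hypothesis-selection step can be written as a small integer linear program: maximize the sum of learned scores subject to collision and one-pose-per-instance constraints. The sketch below uses the PuLP library as one possible solver interface, which is our choice rather than the paper's; the scores, collision pairs, and instance groups are placeholder assumptions.

```python
# Sketch of scene-level hypothesis selection as a binary ILP.
import pulp

scores = [0.9, 0.7, 0.6, 0.4, 0.3]          # learned quality score per pose hypothesis
colliding_pairs = [(0, 3), (1, 2)]           # hypotheses whose placements would collide
instance_groups = [[0, 1], [2, 3, 4]]        # hypotheses explaining the same detected instance

prob = pulp.LpProblem("scene_level_selection", pulp.LpMaximize)
x = [pulp.LpVariable(f"h{i}", cat="Binary") for i in range(len(scores))]
prob += pulp.lpSum(s * xi for s, xi in zip(scores, x))       # maximize total learned score
for i, j in colliding_pairs:
    prob += x[i] + x[j] <= 1                                  # selected poses must not collide
for group in instance_groups:
    prob += pulp.lpSum(x[i] for i in group) <= 1              # at most one pose per instance
prob.solve(pulp.PULP_CBC_CMD(msg=False))
selected = [i for i, xi in enumerate(x) if pulp.value(xi) == 1]
```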