
The Other F Word: How Robot Developers Can Do Better By Embracing Failure

Florian Pestoni

Modern robots are awesome, but they’re also infamous for getting into compromising situations. Sometimes it’s hilarious; other times, catastrophic. Robots have been seen steering into ponds, getting stuck next to trash cans, rolling into retail store fitting rooms or suddenly catching fire. Some people may think robots just don’t work, but the reality is more nuanced: Robots work great most of the time, until they don’t.

Traditional robots are designed to complete a single primary task continuously, such as welding a car at an automotive factory. These robots work best in a controlled environment carved out just for them, like in caged working areas along an assembly line. Under these conditions, they’re highly reliable.

Their success has driven automation in manufacturing settings but has historically limited the adoption of robots in other industries. That’s changing now, as autonomous robots attempt to tackle more complex tasks in unstructured environments. No matter how narrow or simple the task may be, anyone working in the robotics space today knows the stories of autonomous robot failure. But this isn’t as bad as it sounds.

When Autonomous Mobile Robots Aren’t So Autonomous

The real world is messy and chaotic, with humans moving unpredictably and environments constantly changing. A new class of autonomous mobile robots (AMRs) is designed specifically to respond to its environment using sensors and artificial intelligence, but even the best robots fall out of autonomy regularly. Something as simple as glare from a reflective surface confusing a sensor, or direct human interference like bumping into a robot, can cause even the most robust AMR to become mislocalized, abort a given mission or stop until human help arrives. This unfortunate situation, in which a robot finds itself just outside of its operating parameters and fails, is called an “autonomy exception.”

The emerging robot marketplace is asking for a lot in terms of productivity, flexibility and efficiency. Robot makers are consequently pushing robots’ capabilities to their limits — and beyond. As manufacturers tackle more of the basic operational functionality, the effort required to approach “perfect autonomy” grows exponentially, resulting in an autonomy gap.


This gap creates operational challenges for companies deploying automation solutions. As automation moves from the pilot phase to scaled deployment, the impact of autonomy exceptions multiplies. The problems a company must solve when dealing with five robots are wildly different from the challenges it faces when dealing with 500 or 5,000.
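To make the scale argument concrete, here is a back-of-the-envelope sketch. The missions per robot, exception rate and operator minutes per exception are all illustrative assumptions, not measured figures, and the model deliberately ignores second-order effects such as one stuck robot blocking others.

```python
# All three constants are assumed values chosen only for illustration.
MISSIONS_PER_ROBOT_PER_DAY = 100
EXCEPTION_RATE = 0.01          # assume 1% of missions hit an autonomy exception
MINUTES_PER_EXCEPTION = 10     # assumed operator time to resolve one exception


def expected_daily_exceptions(fleet_size, missions_per_robot, exception_rate):
    """Expected autonomy exceptions per day across the whole fleet."""
    return fleet_size * missions_per_robot * exception_rate


def operator_hours(exceptions, minutes_per_exception):
    """Human time needed per day to resolve the given number of exceptions."""
    return exceptions * minutes_per_exception / 60


for fleet_size in (5, 500, 5000):
    exceptions = expected_daily_exceptions(
        fleet_size, MISSIONS_PER_ROBOT_PER_DAY, EXCEPTION_RATE
    )
    hours = operator_hours(exceptions, MINUTES_PER_EXCEPTION)
    print(f"{fleet_size:>5} robots: ~{exceptions:.0f} exceptions/day, "
          f"~{hours:.1f} operator-hours/day")
```

Under these assumptions, a five-robot pilot produces a handful of exceptions that one person can absorb in under an hour, while a 5,000-robot fleet produces thousands of them and needs hundreds of operator-hours every day. The per-robot failure rate is identical; what changes is that handling exceptions by hand stops being viable.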

Bridging The Autonomy Gap

Future generations may look back and see these robot failures as a historical record marking the progress of robotics, similar to how we may watch black-and-white film coverage of people’s attempts to build flying machines. A foundation of effective robot operations (or RobOps) can ensure that progress continues. Scaling operations starts with accepting that autonomy exceptions are inevitable and robots will fail.

This is a good type of failure. The nature of iterative progress requires failure for growth. In the case of robotics, understanding the lifecycle of failure allows companies to define concrete operational steps to resolve exceptions, and ultimately reduce these incidents over time.

This challenge is compounded when orchestrating multiple robots that come from different vendors and perform different tasks. Such mixed fleets create complex interoperability challenges and the need to integrate with various line-of-business software.


Lifecycle of Failure

Robust RobOps starts with observability. More than just health monitoring, observability means having the data, logs and metrics needed to understand both that a problem is occurring and where it originates. Robot fleets should be considered true distributed systems, with each robot separate but also part of the whole, just as a data center consists of thousands of host servers.
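As a minimal sketch of what treating the fleet as a distributed system can look like, the snippet below aggregates exception events across robots instead of inspecting each robot in isolation. The `ExceptionEvent` record, its field names and the sample events are hypothetical; a real deployment would pull this data from whatever telemetry pipeline the fleet already uses.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ExceptionEvent:
    """Hypothetical record for one autonomy exception reported by a robot."""
    robot_id: str
    timestamp: float   # seconds since epoch
    kind: str          # e.g. "mislocalized", "mission_aborted"
    detail: str = ""


def count_by_kind(events):
    """Fleet-wide view: which kinds of exception occur most often?"""
    return Counter(e.kind for e in events)


def robots_needing_attention(events, threshold):
    """Per-robot view: robots with at least `threshold` exceptions."""
    per_robot = Counter(e.robot_id for e in events)
    return sorted(r for r, n in per_robot.items() if n >= threshold)


# Made-up sample data, for illustration only.
events = [
    ExceptionEvent("amr-007", 1000.0, "mislocalized", "glare on sensor"),
    ExceptionEvent("amr-007", 1600.0, "mission_aborted"),
    ExceptionEvent("amr-012", 1700.0, "mislocalized", "bumped by a person"),
]

print(count_by_kind(events).most_common())
print(robots_needing_attention(events, threshold=2))
```

The two views answer different questions: the count by kind points at a systemic source (for example, a sensor that struggles with glare), while the per-robot count points at an individual unit that may need service.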

Mitigating problems at scale requires a solid configuration management (CM) solution. Without the ability to deploy auditable changes and updates across a fleet, including support for rollbacks and eventual consistency, robot developers resort to heavy quality assurance and software distribution processes. These processes limit how frequently updates can be released in response to issues that only manifest once code leaves the lab.
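The three properties named above (auditable changes, rollbacks and eventual consistency) can be sketched in a few lines. This is a toy in-memory model under stated assumptions, not any particular CM product: the `FleetConfig` class, its methods and the `max_speed_mps` setting are all hypothetical names invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class FleetConfig:
    """Toy configuration store for a robot fleet (hypothetical design)."""
    history: list = field(default_factory=list)   # append-only audit log
    desired_version: int = 0
    applied: dict = field(default_factory=dict)   # robot_id -> applied version

    def publish(self, author, settings):
        """Record a new version; every change is attributed to an author."""
        version = len(self.history) + 1
        self.history.append((version, author, dict(settings)))
        self.desired_version = version
        return version

    def rollback(self, author, to_version):
        """Republish an old version's settings as a new version, so the
        audit log is never rewritten."""
        _, _, settings = self.history[to_version - 1]
        return self.publish(author, settings)

    def check_in(self, robot_id):
        """A robot converges to the desired version whenever it connects."""
        if not self.history:
            raise RuntimeError("no configuration has been published yet")
        version, _, settings = self.history[self.desired_version - 1]
        self.applied[robot_id] = version
        return dict(settings)

    def out_of_date(self, fleet):
        """Robots that have not yet converged to the desired version."""
        return [r for r in fleet if self.applied.get(r) != self.desired_version]


fleet = ["amr-001", "amr-002"]
cm = FleetConfig()
v1 = cm.publish("alice", {"max_speed_mps": 1.5})
for robot in fleet:
    cm.check_in(robot)

v2 = cm.publish("bob", {"max_speed_mps": 2.0})
cm.check_in("amr-001")                        # amr-002 is offline and stays on v1
v3 = cm.rollback("alice", to_version=v1)      # v2 misbehaves in the field
stale_before = cm.out_of_date(fleet)          # neither robot is on v3 yet
for robot in fleet:
    cm.check_in(robot)                        # both converge once they connect
```

The design choice worth noting is that a rollback is just another published version. The fleet converges on it the same way it converges on any update, and the log still shows who changed what and in which order.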

Failure is worthless if it doesn’t lead to learning and growth. What’s better than detecting and solving problems efficiently? Avoiding them in the first place. Prevention goes hand in hand with safety, security and auditing. As autonomous robotics is still a nascent field, ensuring safe operations in a broad range of situations is critical for building trust. The ultimate goal is for robots to augment what humans can do on their own, resulting in the harmonious integration of robots into the life and work of people.

RobOps: A New Hope

Just as the DevOps community came together, emerging out of a discussion between Andrew Clay Shafer and Patrick Debois in 2008 and then slowly spreading after the DevOpsDays event held in Belgium in 2009, the RobOps community is starting to coalesce. Experts in running robots at scale meet monthly for virtual discussions as part of the non-profit Robot Operations Group (ROG) to define best practices, such as the recently published open-source Manifesto that informed the framework of this article.

Other organizations like MassRobotics, a group of companies passionate about robotics, and an association of car manufacturers in Europe have published some early interoperability standards. Finally, a set of startups is creating a new breed of RobOps tools designed to tackle autonomy exceptions and simplify fleet management.

Together, standards, tools and best practices will bring the lessons from DevOps and the scalability of the cloud to the world of autonomous robots. The approach to the lifecycle of failure described here builds upon itself to create an agile, self-healing system that can adapt and evolve as robot fleets grow. With new applications for smart robots being created daily across every industry, and increasing interest in automation from executives at companies of every size, this is a great time to participate in the next wave of the digitization of labor.

* This article was originally published on