Students in Insper's Computer Engineering program demonstrated that reinforcement learning techniques can be used to train drones to search for shipwrecked people. Students Jorás Custodio Campos de Oliveira, Pedro Andrade, Renato Laffranchi Falcão and Ricardo Ribeiro Rodrigues improved the simulation environment developed by other students in last year's Final Engineering Project (PFE), likewise supervised by professor Fabrício Barth and mentored by Embraer Product Development Engineer José Fernando Basso Brancalion.
Embraer's challenge remained the same: assessing the efficiency of algorithms trained by reinforcement learning, a machine learning method that rewards or penalizes each attempt made by the swarm of drones, a group of small, unmanned aircraft equipped with sensors and actuators. The comparison was against traditional algorithms used in manned aircraft, which establish pre-configured behavior for drone movements, following zigzag, circular or parallel patterns. The difference is that, with reinforcement learning, the vehicles have the autonomy to determine the best behavior to follow in the search for castaways, without human intervention.
What is the idea behind this technique? "Each drone is an agent that interacts with the environment and knows how to perform a set of actions, such as moving to a certain geographic coordinate, going up or down, or searching for the castaway in a certain area," explains Barth. "These actions change the environment, because the drone moves from one position to another during the operation. Through reinforcement learning, for each new action, the drone also receives positive, negative, or neutral reinforcement, depending on the success of its movement."
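The loop Barth describes can be sketched in a few lines of plain Python. This is a hypothetical one-dimensional search grid, not the students' actual simulator; the grid size, actions, and reward values are illustrative assumptions.

```python
import random

# Minimal sketch of the agent-environment loop: the drone (agent) picks
# an action, the environment updates its position, and a positive,
# negative, or neutral reward comes back.
GRID_SIZE = 10
CASTAWAY_CELL = 7  # hypothetical location of the castaway

def step(position, action):
    """Apply an action ('left', 'right', 'search') and return
    (new_position, reward, done)."""
    if action == "left":
        return max(0, position - 1), 0.0, False          # neutral: moved
    if action == "right":
        return min(GRID_SIZE - 1, position + 1), 0.0, False
    # 'search': reward depends on whether the castaway is in this cell
    if position == CASTAWAY_CELL:
        return position, 1.0, True                       # positive: found
    return position, -0.1, False                         # negative: wasted search

random.seed(0)
position, total_reward, done = 0, 0.0, False
while not done:
    action = random.choice(["left", "right", "search"])  # random policy
    position, reward, done = step(position, action)
    total_reward += reward

print(f"episode ended at cell {position} with return {total_reward:.1f}")
```

A learning algorithm would replace the random action choice with a policy that is gradually adjusted to maximize the accumulated reward.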
In the first PFE, an environment was created to simulate this scenario of searching for castaways at sea, and a reinforcement learning algorithm was developed and then compared to the traditional approach. This time, the students enhanced the environment by improving the simulation. "The first team's simulation relied on equations predefined by the user," says Barth. "Now, additional software feeds real ocean-current data into the simulator. The drones are learning from data that is much closer to real data."
The group from the previous semester presented a paper at a defense and security event held at the Instituto Tecnológico da Aeronáutica (ITA) (Aeronautical Technology Institute), where they received recommendations and tips from professionals dedicated to search and rescue operations. This feedback was built into the simulator, which is available as a library (accessible in this link) written in the Python programming language. The library was also accepted as a third-party environment by the Farama Foundation, the organization that inherited OpenAI's reinforcement learning projects.
According to Barth, the second version of the simulation environment runs faster and complies with Farama protocols. Another novelty is that the second group chose to carry out an additional analysis, beyond the initial analysis of direct search for castaways: covering a given area. "Instead of directing the drones to go straight to a certain point and look for the shipwrecked person (an order that often cannot be followed), the new optimization function is as follows: given an area with a high probability of one or more castaways being found, I want the drones to self-organize so that they pass through all the high-probability cells as quickly as possible and check each one to see if the castaway is there," says the professor.
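The coverage objective the professor describes can be illustrated with a toy plan: given a set of high-probability cells, several drones should collectively visit all of them as quickly as possible. The greedy nearest-cell assignment below is a hypothetical heuristic for illustration only; the students used learned policies, not this rule.

```python
def greedy_coverage(drone_positions, cells):
    """Assign each remaining high-probability cell to whichever drone
    can reach it soonest (Manhattan distance), building one route per
    drone until every cell is covered."""
    remaining = set(cells)
    routes = {d: [pos] for d, pos in enumerate(drone_positions)}
    while remaining:
        # pick the (drone, cell) pair with the shortest hop from the
        # drone's current route endpoint
        drone, cell = min(
            ((d, c) for d in routes for c in remaining),
            key=lambda dc: abs(routes[dc[0]][-1][0] - dc[1][0])
                         + abs(routes[dc[0]][-1][1] - dc[1][1]),
        )
        routes[drone].append(cell)
        remaining.remove(cell)
    return routes

# two drones, four cells flagged as high-probability (made-up values)
high_prob_cells = [(2, 3), (2, 4), (5, 1), (6, 1)]
routes = greedy_coverage([(0, 0), (7, 0)], high_prob_cells)
print(routes)
```

The appeal of learning the behavior instead of hand-coding it is that a trained policy can also react to drift, overlapping probability maps, and drone failures, which a fixed heuristic like this cannot.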
The second delivery improved the reinforcement learning algorithms because compliance with the Farama standard allowed the students to use the RLlib library, which implements state-of-the-art algorithms for this machine learning technique. "Among several hypotheses, the group confirmed that reinforcement learning obtains better results than traditional algorithms, both for the direct search problem and for the area coverage problem," says Barth. All material is consolidated in this link.
The professor praises the enthusiasm and commitment of the four students, who were awarded the Falconi-Insper certification for the project, issued by Falconi Consultores de Resultado. "Often, creating a solution from scratch is easier than improving pre-existing software," says Barth. "This semester, the team managed to maintain the pre-existing software and greatly improved on the previous team's work."
The degree of difficulty increased because reinforcement learning is a research field still under development, especially when focused on multi-agent learning. Student Ricardo Rodrigues says that the sources of information are recently published academic articles (the group's main bibliographic reference is from 2024, for example) or papers still under review. "When there was a lack of information, we had to search the Internet, and we could only find it in articles; almost nothing was available in easier-to-follow tutorials, and this took a lot of time," says Rodrigues.
According to Jorás Oliveira, the four of them were still taking the elective course Autonomous Agents and Reinforcement Learning. Everything about this type of machine learning was therefore new to them, despite their knowledge of artificial intelligence. Each challenge required more reading and guidance from professors. "Some articles said it was better to create a single neural network to control all the drones, but others suggested one network per drone," recalls Oliveira. "Each had its own reasons and conclusions. There is no easy source of truth. So we ended up having to join this scientific discussion and run our own tests to understand which solution was best for our case."
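The design question Oliveira raises, one shared network for all drones versus one network per drone, is known in the multi-agent literature as parameter sharing versus independent learners. The toy linear "policy" below is a stand-in for a neural network, used only to make the structural difference concrete; all names and numbers are illustrative.

```python
import random

def make_policy(rng):
    """A toy 'network': random weights mapping a 2-D observation to a score."""
    return [rng.uniform(-1, 1), rng.uniform(-1, 1)]

def act(policy, obs):
    """Score an observation with the policy's weights."""
    return sum(w * x for w, x in zip(policy, obs))

rng = random.Random(42)
n_drones = 3

# Design A: parameter sharing. Every drone holds the SAME policy object,
# so experience from any drone trains one set of weights.
shared = make_policy(rng)
shared_policies = [shared] * n_drones

# Design B: independent learners. Each drone has its own weights and can
# specialize, at the cost of learning from only its own experience.
independent_policies = [make_policy(rng) for _ in range(n_drones)]

obs = (0.5, -0.2)
print([act(p, obs) for p in shared_policies])      # identical outputs
print([act(p, obs) for p in independent_policies]) # generally differ
```

With sharing, identical observations always produce identical actions across drones; independent policies can diverge, which is exactly the trade-off the conflicting papers argued over.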
The work also required reading oceanography texts related to search and rescue operations at sea, as well as a change in programming tools. In class and in most previous experiences with machine learning, the group had used the open-source TensorFlow library, developed by Google. For reinforcement learning, PyTorch, created by Meta AI, is normally used. They highlight the effort of studying, becoming proficient in, and applying the new tool. "Once you've learned the theory, it's more a matter of understanding how to use another tool to achieve the same result," says Oliveira.
The distribution of tasks followed agile methodology, a set of values and project management principles from the technology industry that encourages constant collaboration between work teams and product users. Rodrigues says that each week's tasks were shared on a board, assigned to members, and reviewed at the end of a set period. If everything was ready as planned, the team moved on to the next step.
The PFE enabled the team to interact with industry professionals on a real problem. "This project opened my mind and added a lot to my portfolio of experiences," says student Renato Falcão. "It was an opportunity to get my hands dirty, learning machine learning techniques, coding algorithms and putting agents into practice." Like his colleagues, Falcão says he enjoys working with artificial intelligence and intends to pursue a career in the area.
Their involvement did not end with the PFE’s defense before the panel. “We are working hard to produce the articles suggested by Embraer,” says Oliveira. “Academically speaking, it was fantastic. I found it interesting that several professors asked us if we wanted to pursue an academic career. It seems that the project is opening doors in this direction as well. As there is an entire year to take electives and do an internship, we still have time to think.”