In March 2025, participants in one of the world's most influential pathology conferences, the 2025 USCAP Annual Meeting, organized by the United States and Canadian Academy of Pathology (USCAP), will be introduced to a data structuring solution developed jointly by the department of pathological anatomy at the A.C. Camargo Cancer Center, the organization's data team, and Computer Engineering students from Insper.
They are Felipe Banzato Pinto de Lemos, born in Campinas (SP); João Victor Pazotti Silva, born in São Paulo, who has lived in Rio de Janeiro (RJ) and Uruguay; and Ykaro de Sousa Andrade, born in Ubajara (CE), who completed his schooling in Sobral (CE) and Fortaleza (CE) and, on his second attempt, passed the Insper entrance exam and obtained a full scholarship.
Throughout the first semester of 2024, the three dedicated themselves to developing an algorithm capable of tackling a challenge the health institution faces daily: improving its capacity to produce and analyze data, leading to more precise diagnoses and more efficient process management. The mission took the form of a final project, the CAPSTONE, formerly known as the Final Engineering Project (PFE).
In the jargon of data management and analysis professionals, the documentation generated in medical centers is often unstructured, fragmented, dissociated, and not easily accessible. Each hospital, and often each physician, fills out records and reports differently. They do not follow a single standard, which makes it hard to structure the different pieces of information efficiently.
Founded in 1953, the A.C. Camargo Cancer Center is responsible for training doctors and health professionals in various areas related to oncology. Its stricto sensu graduate program, created in 1997, has already graduated more than 450 master's and 250 doctoral students.
It is, therefore, a reference center in Latin America for cancer diagnosis, treatment, teaching, and research, one that naturally understands the importance of promoting efficiency and the ability to generate relevant insights. It sought a partnership with Insper to develop a natural language processing algorithm for health data.
"The ability to manipulate and analyze pathological anatomy data presents excellent opportunities for the advancement of personalized oncology. However, access to this medical data can be challenging, mainly due to its unstructured nature," says Adriana Passos Bueno, a member of the pathological anatomy department at A.C. Camargo. "Our approach addresses this problem by providing a promising method for managing pathological anatomy report databases, ensuring automated, safe, and effective access to structured data with high-quality pathological anatomy reports."
One of the solutions developed internally is PatoDig, designed to handle pathological anatomy reports. To evaluate how well this tool actually performs, the institution brought the request to Insper.
What brought the three students together was their interest in working with data. All three already work, though not in health: João is a software developer at IBM, Ykaro is an analytics engineer at Itaú, and Felipe has been an entrepreneur for a year and a half, currently leading Balancete AI, a startup that uses artificial intelligence to help companies manage financial documents.
The students were advised by Professor Maciel Calebe Vidal. "The CAPSTONE is a very intense process. In a company, you have more time to onboard onto the project, to learn how to use a particular tool," says the professor. "In this case, the students had two months to establish the focus and start documenting and writing. Often, in projects involving data, it is challenging to visualize how much time will be needed. And the three were efficient and quick in defining the scope."
The work was developed in partnership with the department of pathological anatomy, the medical specialty responsible for providing diagnosis, staging, and molecular profiling of neoplasms through tissue sampling. The department has 19 pathologists, who contribute to establishing diagnoses and molecular profiles of approximately 60,000 exams annually.
It was necessary to define how to tackle the challenge of building an algorithm capable of analyzing patient data in this specialty. There are many different types of exams, and the team had to choose one. At this point, the partnership with the mentor at the A.C. Camargo Cancer Center was very productive, Ykaro points out. "Dr. Adriana was very willing and engaged. She welcomed us in person and introduced us to the blood sample receipt flow. That's when we began to understand the scope."
The group proposed focusing on PD-L1 exams, a molecular test performed in pathological anatomy laboratories that generates a smaller volume of reports than other types of tests. It would, therefore, be a good model for testing the project's performance. Initially, the team used spaCy, a Python natural language processing library, which performed satisfactorily in identifying and extracting specific information from PD-L1 exams.
"However, for cases where the information was irregular or absent, that is, not explicit in the report, SpaCy did not prove scalable," the students point out in the final project report.
"Thus, to meet this need and ensure a more comprehensive and flexible solution, the project migrated to using Large Language Models (LLMs), which are more capable of interpreting and processing a wide variety of textual contexts and styles, fundamental for the progress and expansion of the project's scope. The model brought substantial improvement."
The solution is also viable for exam types that generate a larger volume of reports, such as immunohistochemistry. The next step was to use data visualization tools like Power BI so that medical teams could access the data in an intuitive and agile manner.
"The solution we presented can continue to develop, focusing on other types of exams," says Felipe. "It was a learning process. We had to move away from the classroom mindset, where the next activity ends up bringing useful information for a problem being debated. CAPSTONE put us in the position of solving challenges with greater independence, as happens in the market," reinforces João. "Working in a group represented a great learning opportunity, as well as interacting with the professor and the mentor," points out Ykaro.
"We achieved our goal and went beyond, testing both SpaCy and a LLM model to identify and extract specific information from PD-L1 exams. Starting from the original report, we validated both approaches and compared the results of SpaCy with those of the LLM for each collected entity," says Adriana.
"This detailed examination allowed us to understand how each model operated and behaved. Moreover, we explored ways to improve performance and reduce errors. More importantly, we identified which model would be more suitable for different situations. This thorough mapping of the various model behaviors was crucial for developing future models cost-effectively," she says.
"We are truly satisfied with the students' performance; it was excellent," she says. "Their commitment to the scheduled weekly deliveries really stood out, and they showed a genuine interest in the topic, which went beyond expectations."