ARGO, Automatic Record Generator for Oncology: a natural language process-based tool to capture pathology features from onco-hematological reports

Gian Maria Zaccaria,; Colella, Vito; Colucci, Simona; Clemente, Felice; Pavone, Fabio; Maria Carmela Vegliante,; Esposito, Flavia; Opinto, Giuseppina; Negri, Antonio; Loseto, Giacomo; Minoia, Carla; Rossini, Bernardo; Angiulli, Vito; Angela Maria Quinto,; Schirosi, Laura; Scattone, Anna; Zito, Alfredo; Luigi Alfredo Grieco,; Guarini, Attilio; Ciavarella, Sabino

doi:10.2196/preprints.27295

N.D The unstructured nature of medical data from Real-World (RW) patients and the scarce accessibility for researchers to integrated systems restrain the use of RW information for clinical and translational research purposes. Natural Language Processing (NLP) might help in transposing unstructured reports in electronic health records (EHR), thus prompting their standardization and sharing. Objective: We aimed at designing a tool to capture pathological features directly from hemo-lymphopathology reports and automatically record them into electronic case report forms (eCRFs). Methods: We exploited Optical Character Recognition and NLP techniques to develop a web application, named ARGO (Automatic Record Generator for Oncology), that recognizes unstructured information from diagnostic paper-based reports of diffuse large B-cell lymphomas (DLBCL), follicular lymphomas (FL), and mantle cell lymphomas (MCL). ARGO was programmed to match data with standard diagnostic criteria of the National Institute of Health, automatically assign diagnosis and, via Application Programming Interface, populate specific eCRFs on the REDCap platform, according to the College of American Pathologists templates. A selection of 239 reports (n. 106 DLBCL, n.79 FL, and n. 54 MCL) from the Pathology Unit at the IRCCS - Istituto Tumori “Giovanni Paolo II” of Bari (Italy) was used to assess ARGO performance in terms of accuracy, precision, recall and F1-score. Results: By applying our workflow, we successfully converted 233 paper-based reports into corresponding eCRFs incorporating structured information about diagnosis, tissue of origin and anatomical site of the sample, major molecular markers and cell-of-origin subtype. Overall, ARGO showed high performance (nearly 90% of accuracy, precision, recall and F1-score) in capturing identification report number, biopsy date, specimen type, diagnosis, and additional molecular features. Conclusions: We developed and validated an easy-to-use tool that converts RW paper-based diagnostic reports of major lymphoma subtypes into structured eCRFs. ARGO is cheap, feasible, and easily transferable into the daily practice to generate REDCap-based EHR for clinical and translational research purposes.