How to Improve LLM-Based Applications for Better Medical Coding Solutions

Blog - M47 Labs

Medical coding is a critical process in healthcare, as it directly impacts areas such as billing, resource management, auditing, and epidemiological studies. There are various coding systems, such as the Current Procedural Terminology (CPT) or the Healthcare Common Procedure Coding System (HCPCS), used to code medical procedures and services. However, when it comes to coding diagnoses and diseases, one of the most widely used systems in the United States is the International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM). The ICD-10-CM is an adaptation of the ICD-10 developed by the World Health Organization, clinically modified for use in the United States. It provides detailed codes for diseases, signs and symptoms, abnormal findings, complaints, social circumstances, and external causes of injuries or diseases.

Traditionally, this work has been performed by human coders, which makes it costly, slow, and error-prone. Automating the process has been a research goal for decades, and recent advances in artificial intelligence (AI) have opened up new possibilities for improving the accuracy and efficiency of medical coding.

Several of the studies listed in the references (for example, Soroush et al., 2024) have found that large language models (LLMs), despite their impressive capabilities in other areas, perform poorly at extracting ICD-10-CM codes when evaluated with basic prompts and without supporting techniques such as Retrieval-Augmented Generation (RAG). As general-purpose models without deep knowledge of specialized medical terminology, their effectiveness in medical coding may be limited. The lack of validated, task-specific datasets for rigorous testing also contributes to these shortcomings. Advanced techniques, specialized approaches, and a larger volume of high-quality data could significantly improve these results.

How have we addressed the problem?

An effective coding tool should use current LLMs as an integral part of the solution, but not as the only component. Advanced prompting techniques, information retrieval, the orchestration of specialized services, and traditional machine learning methods, among others, all have a role to play. The main building blocks are described below.

Figure: Pipeline for obtaining ICD-10-CM codes from medical text, proposed by M47 Labs.

Medical Entity Extraction

This stage analyzes the medical text in depth to identify and isolate the relevant medical terms, underlying conditions, and diagnoses present in the input. This process:

Improves accuracy: By focusing on specific medical entities, the system can reduce noise and irrelevant information.
Enhances code matching: The extracted entities can be mapped more directly to ICD-10-CM codes.
Supports standardization: It helps normalize medical terminology for consistent coding.
Facilitates hierarchical coding: It allows the system to accurately capture both primary and secondary diagnoses.

Figure: Information obtained after applying the entity extraction module to a diagnosis.

Challenges: Some entities are expressed as long explanations rather than short terms. The specialist who wrote the text may omit information they consider implicit, or the diagnosis may simply not be entirely clear. It is also difficult to capture accurately the relationships between recognized medical entities, underlying conditions, negated terms, and so on.
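
As a rough illustration of what the output of this stage might look like, the sketch below uses a tiny hand-written term list and a naive negation check. The `MedicalEntity` structure, the term list, and the negation cues are all assumptions made for this example. It is not the extraction module used in the pipeline, which would rely on an LLM or a trained NER model.

```python
import re
from dataclasses import dataclass

# Hypothetical mini-lexicon; a real system would use an LLM or a trained NER model.
KNOWN_TERMS = ["type 2 diabetes", "hypertension", "pneumonia", "asthma"]
# Naive negation cues, checked in a short window before each mention.
NEGATION_CUES = ["no", "denies", "without", "negative for", "ruled out"]


@dataclass
class MedicalEntity:
    text: str      # surface form found in the note
    start: int     # character offset of the mention
    negated: bool  # True if a negation cue precedes the mention


def extract_entities(note: str, window: int = 30) -> list[MedicalEntity]:
    """Find known terms and flag those preceded by a negation cue."""
    lowered = note.lower()
    entities = []
    for term in KNOWN_TERMS:
        for match in re.finditer(re.escape(term), lowered):
            # Look back at most `window` characters, within the same sentence.
            context = lowered[max(0, match.start() - window):match.start()]
            context = re.split(r"[.;]", context)[-1]
            negated = any(
                re.search(rf"\b{re.escape(cue)}\b", context) for cue in NEGATION_CUES
            )
            entities.append(MedicalEntity(term, match.start(), negated))
    return sorted(entities, key=lambda e: e.start)


note = "Patient with type 2 diabetes and hypertension. Denies asthma. No pneumonia."
for entity in extract_entities(note):
    print(entity)
```

Even this toy version shows why negation matters: "asthma" and "pneumonia" are mentioned in the note but should not be coded as present.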

Tools:

Third-party tools
LLMs
Custom named-entity recognition (NER) models

ICD-10-CM Code Retrieval - Entity Matching

This step gives the system access to an up-to-date and comprehensive ICD-10-CM code database. It also narrows the codes under consideration to a small candidate subset, whose size is controlled by hyperparameters such as top K that are tuned from experimental results. This brings improvements in the following areas:

Accuracy: Matches the most current codes, reducing errors caused by outdated information.
Completeness: Ensures all possible codes are considered, including rare or newly added ones, reducing bias in the LLMs.
Cross-referencing: Enables the system to consider related codes and exclusions.
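
One simple way to tune top K experimentally is to measure recall at K on a validation set and pick the smallest K that reaches a target. The sketch below assumes a set of hypothetical validation runs, each pairing a retrieved ranking with expert-assigned codes. The function names and the target value are illustrative choices, not part of the pipeline described here.

```python
def recall_at_k(ranked_codes: list[str], gold_codes: set[str], k: int) -> float:
    """Fraction of gold codes found among the first k retrieved codes."""
    if not gold_codes:
        return 1.0
    return len(gold_codes & set(ranked_codes[:k])) / len(gold_codes)


def smallest_k(runs, target=0.95, max_k=50):
    """Return the smallest k whose mean recall over all runs reaches the target."""
    for k in range(1, max_k + 1):
        mean = sum(recall_at_k(r, g, k) for r, g in runs) / len(runs)
        if mean >= target:
            return k
    return None


# Hypothetical validation runs: (retrieved ranking, expert-assigned codes).
runs = [
    (["E11.9", "E10.9", "I10"], {"E11.9"}),
    (["J45.909", "I10", "J18.9"], {"I10", "J18.9"}),
]
print(smallest_k(runs, target=1.0))
```

A larger K raises recall but passes more candidates, and therefore more noise and tokens, to the later stages.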

Challenges: Each medical term has numerous synonyms, and capturing their meaning through embeddings alone requires additional effort; semantic similarity by itself is not sufficient. Moreover, although synonym databases are accessible, they have some limitations in terms of agreement and size.

Tools:

Synonyms from UMLS
WHO Synonyms API
Hybrid Retrieval Systems (e.g., BM25 + embeddings, ColBERTv2)
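
The sketch below shows one possible shape for hybrid retrieval: a lexical BM25 ranking and a vector-similarity ranking are fused with reciprocal rank fusion, and only the top K codes are kept. Several things here are assumptions made so the example runs without external services: the character-trigram cosine similarity is a stand-in for a real embedding model, the `SYNONYMS` map is a hypothetical substitute for a resource such as UMLS, and reciprocal rank fusion is just one common fusion method.

```python
import math
from collections import Counter

# Small illustrative subset of ICD-10-CM codes and descriptions.
CODES = {
    "E11.9": "Type 2 diabetes mellitus without complications",
    "E10.9": "Type 1 diabetes mellitus without complications",
    "I10": "Essential (primary) hypertension",
    "J45.909": "Unspecified asthma, uncomplicated",
    "J18.9": "Pneumonia, unspecified organism",
}
# Hypothetical synonym map; in practice this could come from UMLS.
SYNONYMS = {"high blood pressure": "hypertension"}


def tokenize(text):
    return [t.strip("(),.") for t in text.lower().split() if t.strip("(),.")]


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Plain BM25 over tokenized code descriptions."""
    tokenized = {code: tokenize(d) for code, d in docs.items()}
    avg_len = sum(len(t) for t in tokenized.values()) / len(tokenized)
    df = Counter(term for toks in tokenized.values() for term in set(toks))
    scores = {}
    for code, toks in tokenized.items():
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (len(docs) - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[code] = score
    return scores


def trigram_vector(text):
    text = f"  {text.lower()} "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def vector_scores(query, docs):
    """Cosine similarity over character trigrams (stand-in for real embeddings)."""
    q = trigram_vector(query)
    q_norm = math.sqrt(sum(v * v for v in q.values()))
    scores = {}
    for code, desc in docs.items():
        d = trigram_vector(desc)
        dot = sum(q[g] * d[g] for g in q)
        norm = q_norm * math.sqrt(sum(v * v for v in d.values()))
        scores[code] = dot / norm if norm else 0.0
    return scores


def hybrid_retrieve(entity, docs=CODES, top_k=3, rrf_k=60):
    """Fuse both rankings with reciprocal rank fusion and keep the top K codes."""
    query = entity.lower()
    for phrase, canonical in SYNONYMS.items():
        query = query.replace(phrase, f"{phrase} {canonical}")
    fused = Counter()
    for scores in (bm25_scores(query, docs), vector_scores(query, docs)):
        ranked = sorted(scores, key=scores.get, reverse=True)
        for rank, code in enumerate(ranked, start=1):
            fused[code] += 1.0 / (rrf_k + rank)
    return [code for code, _ in fused.most_common(top_k)]


print(hybrid_retrieve("high blood pressure"))
print(hybrid_retrieve("type 2 diabetes"))
```

The synonym expansion is what lets "high blood pressure" reach the hypertension code: none of its words appear in that description, so lexical matching alone would miss it.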

Re-Ranking

Once the previous step has retrieved a candidate subset of codes, with top K tuned through experimentation, the candidates need to be reordered and filtered. The goal is to give the LLM that performs the inference as much relevant information as possible while minimizing noise and reducing the number of tokens used in the context window.
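
A minimal sketch of this reorder-and-filter step follows. The relevance score is a simple word-overlap measure used as a stand-in; a production re-ranker would typically be a cross-encoder model. The `min_score` threshold, the `token_budget`, and the word-count token estimate are assumptions chosen for illustration.

```python
def overlap_score(query: str, description: str) -> float:
    """Stand-in relevance score: Jaccard overlap of lowercase word sets."""
    q = set(query.lower().split())
    d = set(description.lower().replace(",", " ").split())
    return len(q & d) / len(q | d) if q | d else 0.0


def rerank(query, candidates, min_score=0.1, token_budget=20):
    """Reorder candidates by relevance, drop weak ones, and respect a token budget."""
    scored = sorted(
        ((overlap_score(query, desc), code, desc) for code, desc in candidates),
        key=lambda item: item[0],
        reverse=True,
    )
    kept, used = [], 0
    for score, code, desc in scored:
        cost = len(desc.split()) + 1  # rough token estimate: words plus the code
        if score < min_score or used + cost > token_budget:
            continue
        kept.append((code, desc, round(score, 2)))
        used += cost
    return kept


candidates = [
    ("J45.909", "Unspecified asthma, uncomplicated"),
    ("E10.9", "Type 1 diabetes mellitus without complications"),
    ("E11.9", "Type 2 diabetes mellitus without complications"),
]
for row in rerank("type 2 diabetes mellitus", candidates):
    print(row)
```

The unrelated asthma code is dropped, and the closest match is placed first, so the LLM sees fewer and better-ordered candidates.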

LLM Zero-Shot

This approach uses an LLM to interpret medical text without specific training in ICD-10-CM coding and without examples in the prompt. This enables:

Flexibility: It can handle a wide range of medical descriptions without prior examples.
Contextual understanding: LLMs can interpret complex medical narratives and nuances that might not be captured in individual medical entities.
Rapid deployment: It does not require extensive fine-tuning on ICD-10-CM-specific datasets or complex pipelines.

Challenges: The model can only draw on the information it was trained on, which can lead to hallucinations in its responses. The prompt might not cover corner cases or scenarios that were not anticipated. This approach should also be avoided when up-to-date information is required.
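
To make the zero-shot setup concrete, the sketch below builds an instruction-only prompt and parses a reply. No LLM is called: the reply is hard-coded, and the prompt wording and function names are assumptions for this example. The regular expression checks only the general shape of an ICD-10-CM code.

```python
import re

# Syntactic shape of an ICD-10-CM code: letter, digit, alphanumeric,
# then an optional dot followed by up to four alphanumerics.
ICD10CM_PATTERN = re.compile(r"\b[A-Z]\d[0-9A-Z](?:\.[0-9A-Z]{1,4})?\b")


def build_zero_shot_prompt(note: str) -> str:
    """Build a prompt with instructions only: no examples, no retrieved codes."""
    return (
        "You are a clinical coding assistant.\n"
        "Assign the ICD-10-CM diagnosis codes that apply to the note below.\n"
        "Return one code per line and nothing else.\n\n"
        f"Note:\n{note}\n\nCodes:"
    )


def parse_codes(response: str) -> list[str]:
    """Keep only strings shaped like ICD-10-CM codes, without duplicates."""
    seen = []
    for code in ICD10CM_PATTERN.findall(response):
        if code not in seen:
            seen.append(code)
    return seen


prompt = build_zero_shot_prompt("Type 2 diabetes mellitus, essential hypertension.")
print(prompt)

# Hypothetical model reply, hard-coded here because no LLM is called.
reply = "E11.9\nI10\nE11.9"
print(parse_codes(reply))
```

A format check like this cannot tell whether a well-formed code actually exists or is current, which is one reason the zero-shot output needs to be validated against a code database.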

LLM Few-Shot

Including relevant, dynamically selected information in the prompt of an LLM has been shown to improve performance. In this case, the prompt is enriched with examples and with information from the official ICD-10-CM coding guidelines related to the medical entity being coded. The benefits are:

Improved accuracy: The examples and information provided help the LLM gain better context regarding the specific requirements of ICD-10-CM coding.
Adherence to guidelines: It ensures that the model follows the official coding rules and conventions.
Handling edge cases: Learning from a few examples can assist with complex or ambiguous coding scenarios, or those the model was not initially trained on.
Adaptability: Supplying information dynamically in the prompt, known as in-context learning (ICL), allows the system to be updated quickly with new examples as guidelines change.

Challenges: The choice of information and examples is particularly important for in-context learning (ICL). Additional retrieval techniques may be needed if bias reduction is desired.

Tools:

ICD-10-CM coding guidelines
Medical texts with their corresponding codes, reviewed by experts in the field
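
The sketch below shows how these two resources might be combined into a few-shot prompt. The example pool, the word-overlap selection, and the prompt layout are assumptions for illustration. The guideline excerpt is left as a placeholder rather than quoted, since in practice it would be retrieved from the official document.

```python
from dataclasses import dataclass


@dataclass
class CodedExample:
    text: str
    codes: list[str]


# Hypothetical expert-reviewed examples; a real pool would be far larger.
EXAMPLE_POOL = [
    CodedExample("Type 2 diabetes mellitus without complications.", ["E11.9"]),
    CodedExample("Essential hypertension, well controlled.", ["I10"]),
    CodedExample("Pneumonia, organism not specified.", ["J18.9"]),
]


def select_examples(entity: str, pool: list[CodedExample], n: int = 2) -> list[CodedExample]:
    """Pick the n examples sharing the most words with the entity."""
    words = set(entity.lower().split())

    def shared(example: CodedExample) -> int:
        cleaned = example.text.lower().replace(".", "").replace(",", "")
        return len(words & set(cleaned.split()))

    return sorted(pool, key=shared, reverse=True)[:n]


def build_few_shot_prompt(entity: str, guideline_excerpt: str, pool=EXAMPLE_POOL) -> str:
    """Assemble guideline text, selected examples, and the entity to code."""
    parts = ["Relevant ICD-10-CM guideline text:", guideline_excerpt, ""]
    for example in select_examples(entity, pool):
        parts += [f"Text: {example.text}", f"Codes: {', '.join(example.codes)}", ""]
    parts += [f"Text: {entity}", "Codes:"]
    return "\n".join(parts)


print(build_few_shot_prompt(
    "essential hypertension",
    "<excerpt retrieved from the official guidelines goes here>",
))
```

Because the examples are selected per entity at request time, updating the pool or the guideline text changes the prompt without retraining anything.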

Reasoning and Classification of Codes

This final step involves making logical decisions to select the most appropriate ICD-10-CM code(s) based on the extracted information and the outputs from the LLM, while also avoiding the "black box" effect that AI-based systems often present.

Improved accuracy: Multiple inputs are combined to make a more informed decision.
Application of rules: Coding rules and guidelines that the LLM alone may not fully capture are enforced explicitly.
Consistency: Coding principles are applied uniformly across different cases.
Explainability: The reasoning behind each code selection is provided, which is crucial for audits and quality control.

Challenges: The LLM may hallucinate, give truncated or incorrect responses, or omit codes it deems unimportant because of biases in its training. Mitigating this may require prompt engineering informed by the analysis of results from a large number of experiment iterations.
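
One way to keep this step explainable is to record a reason for every code that is rejected. The sketch below validates LLM-proposed codes against the retrieved candidate set and applies a single mutual-exclusion rule. The `Decision` structure, the exclusion pair, the scores, and the made-up code `Z99.999` are assumptions for illustration; authoritative rules come from the official ICD-10-CM tabular list and guidelines.

```python
from dataclasses import dataclass, field

# Illustrative mutual-exclusion pairs (by three-character category); the
# authoritative rules live in the official ICD-10-CM tabular list.
EXCLUSIVE_CATEGORIES = [("E10", "E11")]


@dataclass
class Decision:
    accepted: list[str] = field(default_factory=list)
    rejected: dict[str, str] = field(default_factory=dict)  # code -> reason


def classify(llm_codes: list[str], candidates: dict[str, float]) -> Decision:
    """Validate LLM-proposed codes against retrieved candidates and simple rules.

    `candidates` maps each retrieved code to its retrieval or re-ranking score.
    """
    decision = Decision()
    for code in dict.fromkeys(llm_codes):  # de-duplicate, keep order
        if code not in candidates:
            decision.rejected[code] = "not in retrieved candidate set (possible hallucination)"
        else:
            decision.accepted.append(code)
    for first, second in EXCLUSIVE_CATEGORIES:
        a = [c for c in decision.accepted if c.startswith(first)]
        b = [c for c in decision.accepted if c.startswith(second)]
        if a and b:
            # Keep the side with the best-scoring code; reject the other side.
            keep_a = max(candidates[c] for c in a) >= max(candidates[c] for c in b)
            for code in (b if keep_a else a):
                decision.accepted.remove(code)
                decision.rejected[code] = f"conflicts with category {first if keep_a else second}"
    return decision


result = classify(
    llm_codes=["E11.9", "E10.9", "I10", "Z99.999"],  # Z99.999 is a made-up code
    candidates={"E11.9": 0.67, "E10.9": 0.43, "I10": 0.80},
)
print(result.accepted)
print(result.rejected)
```

The recorded reasons give auditors a trace of why each proposed code was kept or dropped, rather than only the final list.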

Conclusion

Automation in medical coding is becoming increasingly important. AI-based tools such as LLMs offer tremendous potential to accelerate and optimize the process, but they also have significant limitations in accuracy and reliability. Used alone, they do not meet the high standards of accuracy required in the medical field.

To overcome these challenges, it is essential to combine LLMs with more specialized approaches, such as advanced prompting techniques, medical entity extraction, and up-to-date information retrieval. Strategies such as few-shot prompting, retrieval, re-ranking, and the application of traditional coding rules help tailor these models to the specific demands of the ICD-10-CM system, enabling greater accuracy and consistency in the results. Challenges remain, such as hallucinations and incomplete sets of generated codes, but a hybrid approach that integrates AI with traditional methods holds promise for significantly improving both the efficiency and accuracy of medical code assignment.

References

Boyle, J. S., Kascenas, A., Lok, P., Liakata, M., & O’Neil, A. Q. (2023). Automated clinical coding using off-the-shelf large language models. Canon Medical Research Europe, Queen Mary University of London, University of Edinburgh, Anglia Ruskin University, The Alan Turing Institute, University of Warwick. Retrieved from arXiv.

Simmons, A., Takkavatakarn, K., McDougal, M., Dilcher, B., Pincavitch, J., Meadows, L., Kauffman, J., Klang, E., Wig, R., Smith, G., Soroush, G. N., Freeman, R., Apakama, D. J., Charney, A. W., Kohli-Seth, R., & Sakhuja, A. (2024). Benchmarking Large Language Models for Extraction of International Classification of Diseases Codes from Clinical Documentation. medRxiv. Retrieved from medRxiv.

Nori, H., Lee, Y. T., Zhang, S., Carignan, D., Edgar, R., Fusi, N., King, N., Larson, J., Li, Y., Liu, W., Luo, R., McKinney, S. M., Ness, R. O., Poon, H., Qin, T., Usuyama, N., White, C., & Horvitz, E. (2023). Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine. Microsoft. Retrieved from arXiv.

Li, R., Wang, X., & Yu, H. (2023). Exploring LLM Multi-Agents for ICD Coding. UMass Amherst, Microsoft, VA Bedford Healthcare System. Retrieved from arXiv.

Centers for Medicare & Medicaid Services (CMS), & National Center for Health Statistics (NCHS). (2024). ICD-10-CM Official Guidelines for Coding and Reporting FY 2024 (April 1, 2024 - September 30, 2024). U.S. Department of Health and Human Services. Retrieved from CMS.

Soroush, A., Glicksberg, B. S., Zimlichman, E., Barash, Y., Freeman, R., Charney, A. W., Nadkarni, G. N., & Klang, E. (2024). Large Language Models Are Poor Medical Coders — Benchmarking of Medical Code Querying. NEJM AI, 1(5). Massachusetts Medical Society. Retrieved from NEJM AI.

Pinecone. (n.d.). Rerankers and Two-Stage Retrieval. Pinecone. Retrieved from Pinecone.
