Routinely collected data: the importance of high-quality diagnostic coding to research
======================================================================================

* Stuart G. Nicholls
* Sinéad M. Langan
* Eric I. Benchimol

See related article at [www.cmajopen.ca/content/5/3/E617](http://www.cmajopen.ca/content/5/3/E617)

KEY POINTS

* Errors in diagnostic coding may occur along the diagnostic or administrative pathways.
* There is an ongoing need for validation work to inform research using health administrative data.
* The quality of reporting is key to the evaluation of research using health administrative data, and authors should conform to standards of best practice and use appropriate reporting guidelines.

Routinely collected health data are data collected for purposes other than research, or without specific a priori research questions developed before collection.1 Examples include clinical information from electronic health records, health administrative data, disease registries and epidemiologic surveillance systems.1 Health data of this type are used widely for clinical, pharmacoepidemiologic and health services research.2 However, the quality of these data remains in question.

Tang and colleagues consider this issue in a linked qualitative research article published in *CMAJ Open*.3 The authors examine barriers to coding of high-quality administrative data and highlight several factors that may lead to a lack of specificity of diagnoses, or to inconsistencies between the medical chart and administrative data. In particular, they point to the role played by the quality of physician notes. Studies mapping the coding process for diagnoses have identified a range of sources of error, both along the patient’s diagnostic trajectory and during the administrative process.4

Tang and colleagues’ research contributes to the growing awareness of the need to develop and validate approaches to accurately identifying patients according to exposure or outcome when using health administrative data for research,5,6 and raises important questions about how best to address inaccuracies. Specifically, it broaches the question of who is responsible for improving accuracy. The short answer is that every link in the chain has a role, because each step may introduce error: the physician and their notes (the focus of the study by Tang and colleagues),3 the hospital and its structure, the International Classification of Diseases (ICD) codes themselves, and the coders and their training. Sources of error often act in concert: physicians recognize new diagnoses, yet the adoption of new diagnoses within the ICD-10 system is slow, potentially creating a mismatch between what physicians document and what coders, trained only in the coding system used by the administration, are able to record. For example, eosinophilic esophagitis was recognized as a clinical entity distinct from reflux esophagitis in 1995,7 but the ICD-10 code adopted in 2015 (K20.0) is still not in use by most agencies that collect data. Furthermore, the physician billing systems of many Canadian provinces continue to use an antiquated and simplified version of ICD-9. This can lead to tremendous confusion and inaccuracies in research that relies upon these data. Health care administrators and policy-makers should be encouraged to adopt the latest diagnostic systems and to train coders and physicians appropriately in their use.
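As a hypothetical illustration of how a lagging code set erodes specificity, the sketch below assigns the same diagnosis a code under two lookup tables. Only K20.0 (the 2015 eosinophilic esophagitis code mentioned above) comes from the text; the table contents, the fallback code and the `assign_code` helper are illustrative assumptions, not any agency’s actual coding dictionary.

```python
# Hypothetical illustration of how an outdated code set erodes diagnostic
# specificity. Only K20.0 (eosinophilic esophagitis, added to ICD-10 in
# 2015) is taken from the text; everything else is assumed for the example.

CURRENT_ICD10 = {
    "eosinophilic esophagitis": "K20.0",  # specific code, adopted 2015
    "reflux esophagitis": "K21.0",        # assumed entry, for contrast
}

LEGACY_ICD10 = {
    # An older table that predates 2015: no entry for eosinophilic esophagitis.
    "reflux esophagitis": "K21.0",
}

FALLBACK = "K20.9"  # assumed nonspecific default ("Esophagitis, unspecified")


def assign_code(diagnosis: str, table: dict[str, str]) -> str:
    """Return the table's code for a diagnosis, or a nonspecific fallback."""
    return table.get(diagnosis.lower(), FALLBACK)


if __name__ == "__main__":
    dx = "Eosinophilic esophagitis"
    print(assign_code(dx, CURRENT_ICD10))  # K20.0: specific
    print(assign_code(dx, LEGACY_ICD10))   # K20.9: specificity lost
```

An algorithm built on the legacy table would lump eosinophilic esophagitis together with all other unspecified esophagitis, precisely the kind of misclassification that the validation work discussed below is designed to detect.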
In using coded routinely collected data, validation of coding accuracy is key. This is important both for the methodologic evaluation of articles (“Can the results be trusted?”)8 and for replication (“Do initial findings hold true in other contexts?”). For example, the accuracy of reports about the prevalence of a disease, or the burden of a disease on the health system, relies on the accuracy of identification algorithms based on diagnostic codes.9 Given the potential for misclassification bias, it is essential that researchers report the results of validation work.5 Despite this need, a recent review of published studies10 showed that reports of research using routinely collected health data often provided inadequate detail about validation work, coding or classification. Furthermore, those who use health administrative data for research, health care administration or quality improvement have their own obligations to use the data responsibly and to report accurately how the data were used.2 For studies using routinely collected data, the REporting of studies Conducted using Observational Routinely-collected Data (RECORD) statement1 sets out reporting standards to which researchers should adhere. The statement includes 13 items specific to studies using routinely collected data and reflects important study components such as methods of selecting the study population; details of any validation of codes or algorithms; linkage of data sources; and a list of the codes used to classify exposures, outcomes and confounders. Adherence to these guidelines will allow the consumer of research to interpret the results accurately, a bare minimum requirement for medical research.
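Validation work of the kind called for above typically compares a code-based case-finding algorithm against a reference standard such as chart review, and reports sensitivity, specificity and predictive values. The following sketch shows that calculation; the `validate` function and the ten-patient example are fabricated toy assumptions, not output from any real validation study.

```python
# Minimal sketch of validating a code-based case-finding algorithm against
# a chart-review reference standard. All data here are fabricated toy
# values for illustration only.

def validate(algorithm_flags: list[bool], chart_review: list[bool]) -> dict[str, float]:
    """Compare the algorithm's case assignments with the reference standard."""
    tp = sum(a and c for a, c in zip(algorithm_flags, chart_review))
    fp = sum(a and not c for a, c in zip(algorithm_flags, chart_review))
    fn = sum(not a and c for a, c in zip(algorithm_flags, chart_review))
    tn = sum(not a and not c for a, c in zip(algorithm_flags, chart_review))
    return {
        "sensitivity": tp / (tp + fn),  # true cases the algorithm finds
        "specificity": tn / (tn + fp),  # non-cases it correctly excludes
        "ppv": tp / (tp + fp),          # flagged patients who are true cases
        "npv": tn / (tn + fn),          # unflagged patients who are non-cases
    }


if __name__ == "__main__":
    # Toy example: 10 patients, algorithm flags vs. chart review.
    algorithm = [True, True, True, False, False, False, True, False, False, False]
    charts    = [True, True, False, True, False, False, True, False, False, False]
    for metric, value in validate(algorithm, charts).items():
        print(f"{metric}: {value:.2f}")
```

Reporting these values alongside the list of codes used, as the RECORD statement recommends, lets readers gauge how much misclassification bias a given algorithm may introduce.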
Although the study by Tang and colleagues3 evaluates current barriers to adequate coding, many questions remain unanswered. The authors suggest improved training to help physicians understand the importance of accurate documentation; however, such interventions have not yet been designed or evaluated. It may be better to target this type of intervention at medical students, so that the importance of coding is understood by trainees early in their careers. In addition, the linked article did not address the changes that might result from widespread implementation of electronic health records in hospitals. The mix of clinical documentation and structured coding in electronic health records may result in improved interpretation by professional coders. The increasing availability of natural language processing and machine learning techniques may lead to automated interpretation of physician documentation, more accurate coding tools, and automated feedback to physicians who provide contradictory diagnoses in the medical record. Although the use of artificial intelligence in health data research is expected to blossom, current research is still largely curated and conducted by people. Therefore, we must strive to improve the quality of coding (through training of physicians and coders, and implementation of the latest coding systems), the quality of research using these data (through validation and advanced methodologies), and the quality of reporting of the research (through the use of reporting guidelines to improve the transparency and reproducibility of research).

## Footnotes

* **Competing interests:** Sinéad Langan has received a grant in the form of a Wellcome Senior Clinical Fellowship in Science. Eric Benchimol is supported by a New Investigator Award from the Canadian Institutes of Health Research, Canadian Association of Gastroenterology, and Crohn’s and Colitis Canada. He is also supported by the Career Enhancement Program of the Canadian Child Health Clinician Scientist Program. No other competing interests were declared.

* **Contributors:** All of the authors contributed to the conception of the work, drafted and revised the manuscript for important intellectual content, provided final approval of the version to be published and agreed to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

* This article was solicited and has not been peer reviewed.

## References

1. Benchimol EI, Smeeth L, Guttmann A, et al. The REporting of studies Conducted using Observational Routinely-collected health Data (RECORD) Statement. PLoS Med 2015;12:e1001885.
2. Nicholls SG, Langan SM, Benchimol EI. Reporting and transparency in big data: the nexus of ethics and methodology. In: Mittelstadt BD, Floridi L, editors. The ethics of biomedical big data. Switzerland: Springer; 2016:339–65.
3. Tang K, Lucyk K, Quan H. Coder perspectives on physician-related barriers to coding high quality administrative data: a qualitative study. CMAJ Open 2017;5:E617–22.
4. O’Malley KJ, Cook KF, Price MD, et al. Measuring diagnoses: ICD code accuracy. Health Serv Res 2005;40:1620–39.
5. Benchimol EI, Manuel DG, To T, et al. Development and use of reporting guidelines for assessing the quality of validation studies of health administrative data. J Clin Epidemiol 2011;64:821–9.
6. De Coster C, Quan H, Finlayson A, et al. Identifying priorities in methodological research using ICD-9-CM and ICD-10 administrative data: report from an international consortium. BMC Health Serv Res 2006;6:77.
7. Kelly KJ, Lazenby AJ, Rowe PC, et al. Eosinophilic esophagitis attributed to gastroesophageal reflux: improvement with an amino acid–based formula. Gastroenterology 1995;109:1503–12.
8. Moher D, Simera I, Schulz KF, et al. Helping editors, peer reviewers and authors improve the clarity, completeness and transparency of reporting health research. BMC Med 2008;6:13.
9. Manuel DG, Rosella LC, Stukel TA. Importance of accurately identifying disease in studies using electronic health records. BMJ 2010;341:c4226.
10. Hemkens LG, Benchimol EI, Langan SM, et al. The reporting of studies using routinely collected health data was often insufficient. J Clin Epidemiol 2016;79:104–11.