
Communications of the ACM

Contributed articles

Identifying Patterns in Medical Records through Latent Semantic Analysis


Mountains of data are constantly being accumulated, including in the form of medical records of doctor visits and treatments. The question is what actionable information can be gleaned from them beyond a one-time record of a specific medical examination. Arguably, if one were to combine the data in a large corpus covering many patients suffering from the same condition, then overall patterns that apply beyond a specific instance of a specific doctor visit might be observed. Such patterns might reveal how medical conditions are related to one another over a broad set of patients, as well as how these conditions might be related to the International Classification of Diseases, 10th Revision, Clinical Modification (ICD-10-CM) of the Centers for Disease Control and Prevention (CDC) Classification of Diseases, Functioning, and Disability codes (henceforth, ICD codes). Conceivably, applying such a method to a large dataset could even suggest new avenues of medical and public health research by identifying new associations, along with their relative strength compared to other associations. It might even be applied to identify possible side effects in phase IV drug testing. That potential would be greater still if the method could also identify indirect connections among terms; that is, connections mediated through other terms.


To address such medical-analysis objectives, this article explores in a preliminary manner the applicability of a method based on latent semantic analysis (LSA) that transforms the term-document [frequency] matrix (TDM) through a singular value decomposition (SVD) into a semantic space. Once such a semantic space is created, the lexical distance among terms and among combinations of terms can be calculated by projecting them onto that space. More important, lexical distance can be calculated even when terms are associated only indirectly; that is, when the terms do not appear together in any one shared document but are related to one another through another set of terms;10,12 see Gefen et al.4 and Holzinger et al.8 for discussions of other text-analysis methods.

Figure. Word cloud of data.

The study demonstrates how LSA can be applied to sanitized medical records dealing with congestive heart failure to identify patterns of association among terms of interest, as well as among those terms and the ICD codes recorded for the same patients. By "sanitized," we mean documents from which protected health information (PHI) has been removed. The data was provided by Independence Blue Cross (IBX), a health insurer.

Some associations revealed through LSA in this study were expected (such as the one between hypertension and obesity). Such associations might be obvious, but identifying them is essential because doing so establishes the credibility of the method. Other associations were less expected by medical experts (such as the infrequent association LSA identified between "hypertension" and "sciatica"). That association might reflect a single-patient issue, highlighting the potential of LSA to reveal cases that require special attention or unexpected, possibly previously unknown, relationships. As we explain in the next section, which describes LSA, associations identified by LSA might also include terms that never appear together in any document but are associated through their joint association with another term.

Past research applying LSA to medical science showed that LSA can identify shared ontologies across scientific papers even when terms have different names,15 and that the degree to which concepts are shared across papers in the Proceedings of the National Academy of Sciences can reveal expected patterns.13 This study adds a new angle to the accumulating literature on LSA in medical contexts by showing its potential contribution to medical science through associating medical terms and ICD codes as applied in practice in medical reports, especially by adding an ordinal scale of how close the terms are to one another compared to other terms. For example, the cosines we use suggest that in this population hypertension is more closely associated with "benign" than with "chronic," and less related still to hypothyroidism. The results also suggest the method could be applied to assist in the management of medical treatment by identifying unusual cases for special attention.


Introduction to LSA

LSA creates a semantic space from which it is possible to derive lexical closeness information; that is, how close terms or documents are to one another in a corpus. LSA creates that space by first creating a TDM from a relevant corpus of documents and then running an SVD on that TDM. The TDM is a frequency matrix that records how often each term appears in each document. Before the TDM is created, the text in the corpus is often stemmed and stop words are excluded. Stop words are words that occur frequently (such as "the" and "or") and thus add little or no semantic information to the documents or to how terms relate to one another. There are default lists of stop words in English and other languages in R and other software packages. Additional words of interest can be added to these lists so they, too, are excluded from the semantic space. It is also common in LSA practice to remove accents, cast the text in lower case, and remove punctuation marks.
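The preprocessing steps above (lowercasing, removing punctuation and stop words, then counting term frequencies per document) can be made concrete in a few lines. The article's analysis uses standard R packages; the following Python sketch, with its deliberately tiny stop-word list and toy documents, only illustrates how a TDM is built.

```python
from collections import Counter
import re

# A tiny illustrative stop-word list; real packages ship lists with hundreds of words.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "in"}

def build_tdm(docs):
    """Return (terms, matrix) where matrix[i][j] counts term i in document j."""
    tokenized = []
    for doc in docs:
        # Lowercase and keep only alphanumeric runs, which also strips punctuation.
        tokens = re.findall(r"[a-z0-9]+", doc.lower())
        tokenized.append([t for t in tokens if t not in STOP_WORDS])
    terms = sorted({t for doc in tokenized for t in doc})
    matrix = [[Counter(doc)[term] for doc in tokenized] for term in terms]
    return terms, matrix

docs = ["Hypertension and obesity.", "Obesity, the chronic condition."]
terms, tdm = build_tdm(docs)
```

In practice, R packages such as tm and lsa handle these steps, including full stop-word lists and optional stemming.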

After the TDM is created, researchers often apply a process of weighting, whereby the frequency numbers are replaced with a transformation that considers the distribution of each term in the document it appears in and across the documents in the corpus. Researchers typically apply both local and global weighting. Local weighting gives more weight to terms that appear more often in a single document. Global weighting gives less weight to terms that appear in many documents. One of the most common weighting transformations is the term frequency-inverse document frequency (TF-IDF) transformation. Some research (such as by Beel et al.1) claims TF-IDF is the most common text-mining transformation; it gives more weight to terms that appear often in a given document but less weight to terms that appear frequently in the corpus as a whole. It is also a recommended type of transformation.14 Stemming, stop-word removal, weighting transformation, and other preparatory steps are standard options in the R packages used to create a semantic space; R is a free and popular statistical language.
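As a concrete illustration of the local-times-global weighting just described, here is a minimal TF-IDF sketch. It uses one common variant, raw count times log(N/df); actual packages often use smoothed or log-scaled variants.

```python
import math

def tfidf(tdm):
    """Weight a term-document count matrix (rows = terms, columns = documents).

    Local weight: the raw count, favoring terms frequent within a document.
    Global weight: log(N/df), shrinking terms that appear in many documents.
    """
    n_docs = len(tdm[0])
    weighted = []
    for row in tdm:
        df = sum(1 for count in row if count > 0)  # document frequency of the term
        idf = math.log(n_docs / df)
        weighted.append([count * idf for count in row])
    return weighted

# A term appearing in every document gets zero weight; a rarer term is boosted.
weighted = tfidf([[1, 1],   # appears once in each of the two documents
                  [2, 0]])  # appears twice, but in only one document
```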


Once the semantic space is created, researchers can project terms and combinations of terms onto it; likewise, documents and parts of documents can be projected on it to produce various measures of semantic closeness.2,12 That degree of semantic closeness is typically a cosine value that provides an ordinal scale of relatedness of terms and documents. Terms and documents that are close to one another in meaning also have higher degrees of closeness, as revealed by LSA.6 This allows researchers to use LSA to identify synonyms.6,9,16 In fact, LSA has been used so successfully in identifying such closeness levels that it has been shown to answer introduction-to-psychology multiple-choice exam questions almost as well as students do12 and score on the Test of English as a Foreign Language exam almost as high as nonnative speakers.11 LSA can also classify articles into core research topics.3 The semantic space created by LSA can be so realistic that LSA has even been applied to identifying how questionnaire items factor together.5
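To make the SVD and projection steps concrete, the following sketch (in Python with numpy rather than the R packages the article refers to; the toy matrix is hypothetical) builds a reduced semantic space and compares term vectors by cosine.

```python
import numpy as np

def lsa_space(tdm, k):
    """Build a k-dimensional semantic space from a (terms x documents) matrix.

    Each row of the result places one term in the reduced space; documents
    could be projected analogously through the right singular vectors.
    """
    U, S, Vt = np.linalg.svd(np.asarray(tdm, dtype=float), full_matrices=False)
    return U[:, :k] * S[:k]

def cosine(u, v):
    """Cosine similarity: near 1.0 means semantically close, near 0.0 unrelated."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical toy matrix: terms 0 and 1 share documents; term 2 does not.
coords = lsa_space([[2, 0, 1],
                    [1, 0, 1],
                    [0, 3, 0]], k=2)
```

In this toy space, terms 0 and 1 end up with a cosine near 1 because they co-occur, while term 2 stays near 0 relative to both.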

Our study sought to show that applying standard LSA packages in R, with only their default settings and without transforming the data beyond the automated transformations the packages introduce, is enough to produce associations among medical terms and ICD codes in electronic health records covering medical visits, including the relative strength of those associations compared to others, as corroborated by subject-matter experts. Applying only the standard transformations is important because it is not practical for researchers to manually correct typos, alternative spellings, shorthand, or optical character recognition (OCR) errors. Processing such "dirty" data without manual correction is a necessity in real-world applications.



The data we used was provided by IBX in a joint research project with Drexel University. It consisted of 32,124 text files obtained by running OCR on the medical transcripts of 1,009 scanned medical charts of 416 distinct patients who suffered or were suspected of suffering from congestive heart failure in 2013 and 2014. IBX removed associated patient identifiers, demographics, and cost data. The IBX Privacy office and the Drexel University medical institutional review board (IRB) both approved the research protocol in advance. Each medical record consisted of the text portions in one file and the ICD medical codes in another file. An artificial patient ID key replaced the actual patient ID in each medical report and each list of ICD codes. The medical reports were combined by that patient ID.

We analyzed the data as is. We did not correct the data for alternative equivalent spellings (such as "catheterization" and "catheterisation"). Nor did we correct the data for obvious spelling mistakes and OCR errors (such as correcting "cardioverterdefibrillator" to "cardioverter-defibrillator"). This was done deliberately so the power of LSA could be shown even when run on untreated raw data. This was important because manually correcting medical reports is both costly and prone to introducing additional error. Manually spot-checking the unrecognized words revealed that most were misspellings.



We created the TDM after casting all words to lower case and removing punctuation and the standard set of stop words. Numbers were not removed from the raw data because they could have represented ICD codes. We then subjected the TDM to a TF-IDF transformation before running an SVD on it, retaining 100 dimensions. There are no standard rules of thumb for how many dimensions to retain because dimensionality depends on context and corpora.13 Adding more SVD dimensions inevitably captures more nuance and variance, but also more noise.
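Since there is no standard rule of thumb for the number of dimensions, one pragmatic check (our suggestion, not a step from the article) is to examine how much of the matrix's total variance the leading singular values capture:

```python
import numpy as np

def variance_captured(matrix, k):
    """Fraction of total squared singular-value 'energy' in the top k dimensions.

    Plotting this for a range of k (a scree plot) helps judge where adding
    dimensions starts buying mostly noise rather than signal.
    """
    s = np.linalg.svd(np.asarray(matrix, dtype=float), compute_uv=False)
    return float(np.sum(s[:k] ** 2) / np.sum(s ** 2))

# Hypothetical example: singular values 4 and 3, so one dimension holds 16/25.
frac = variance_captured([[3.0, 0.0],
                          [0.0, 4.0]], k=1)
```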

Knowing the data concerned congestive heart failure, we identified the closest neighbor terms to "cardiac" and "hypertension" after creating the semantic space. The cosine distances are listed in the link mentioned earlier, omitting terms that appeared in fewer than four patient records, as well as strings with 10 or more numeric digits (such as phone numbers). Figure 1 and Figure 2 show the heat-map clustering for the terms "hypertension" and "cardiac," respectively, and Figure 3 outlines the LSA process. It is important to emphasize that LSA also reveals indirect associations among terms (such as when one term is related to another only through a third term), a key advantage of LSA over manual inspection.

Figure 1. Clustering of the 40 terms and ICD-9-CM codes closest to "hypertension."

Figure 2. Clustering of the 40 terms and ICD-9-CM codes closest to "cardiac."

Figure 3. Outline of the latent semantic analysis process.

A researcher might correctly associate terms that appear together but could miss those related only indirectly. That is beside the obvious advantage of LSA that the analysis can be done semi-automatically and quickly on very large corpora that would require an unrealistic investment of time to analyze manually. Moreover, and crucially important, LSA reveals not only that terms are related but also the degree of that relationship compared to other relationships, as a set of ordinal cosine distance measures. These distances can be used for other analyses (such as clustering terms to determine structures within the text or to compare documents). The ability to run such analyses could be particularly helpful for identifying connections that heretofore had not been identified and thus could feed a predictive model to enable disease screening, early detection, and intervention.

Demonstrating this ability to identify relationships, the cosine distances in the link show that the term "hypertension" is most closely related to "hyperlipidemia," which, according to the American Heart Association, means "high levels of fat particles (lipids) in the blood." Considering that this is a very common condition, estimated to afflict 31 million Americans, that association might be expected. The terms "benign" and "essential" are also close, as is the diagnostic code "4011," or "icd4011," which in the ICD9 code means "benign essential hypertension." Hypertension is also, as expected, closely related to "obesity" and "mellitus" (diabetes), hardening arteries ("atherosclerosis"), acid reflux ("gerd"), and high cholesterol ("hypercholesterolemia"). However, hypertension is also semantically close to other terms and diagnostic codes, including chronic airway obstruction (code "496"), kidney disease ("ckd," "nephropathy," and codes "5852" and "5854"), and prostate issues ("prostatic" and code "60000"). Based on the cosine distances, the terms and codes associated with hypertension in this population are more related to cholesterol- and fat-related terms than to kidney disorders. Our analysis also identified relationships to "sciatica" and to cerebral ischemia ("icd4359"), restricted blood flow in the brain. The heat map in Figure 1 of cosine distances between terms shows how these terms relate to one another. For example, "urged," "sympt," and "assessment" relate to patient interactions and are closely associated with diagnosed mixed hyperlipidemia (code "2722") at the lower left of the figure. Similarly, several kidney-related terms and codes are clustered at the upper right. Corroboratively, the analysis tied "hypertension" to gastroesophageal reflux disease (GERD) in 91 unique cases, two seemingly unrelated health conditions that only as of March 2017 were found to be related.17

The cosines we provide show "cardiac" connecting most strongly to the first obtuse marginal (OM1) and left anterior descending (LAD) arteries, as well as to the saphenous vein graft ("SVG") procedure. Perhaps because it is an adjective, and not a condition, "cardiac" is associated with terms pertaining to body parts and procedures more than diagnoses. The heat map in Figure 2 shows a noticeable cluster of terms relating to catheterization and stents (such as "stent," "stenting," "xience," and "instent"). Clustering identifies measurements of cardiac performance as well, with ventricular activation time ("vat") and left ventricular end-diastolic pressure ("lvedp"). This emphasis, as identified through clustering, is further reflected when compared with the cosine distances of "hypertension" (see footnote b). The nearest terms in the heat map to "cardiac" have greater cosine values, indicating a smaller distance relative to the neighbors of "hypertension." Such close association implies that "cardiac" has a more focused meaning in our texts, whereas "hypertension" is associated with a larger range of disparate terms, in this case, frequently co-occurring conditions and diagnoses. This may also be because cardiac issues are often acute, with specific actions rendered as treatments, while hypertension is more a chronic disease, associated with many related diseases.



The analysis in our study dealt with a relatively small PHI-cleansed sample of medical-records data from a well-defined context. We caution against drawing medical conclusions from such a sample, but the results are indicative of the potential in applying LSA to such contexts. We created a semantic model that identified known relationships among medical terms, relating diagnoses and treatments. We scrubbed the sample of most demographic data; the dataset was in any case too small to allow cross-sectional analysis. Constructing a model from a larger, more detailed dataset could yield substantial potential for medical discovery. Comparing reports across patients could provide even more information, by, say, enabling the creation of a "typical" profile of a care/treatment trajectory, as well as of diagnosis and prognosis as they apply to a disease, condition name, or ICD code. Such a profile could conceivably lead to early detection and allow identifying exceptional cases in need of immediate medical attention.

A typical profile for a condition or ICD code could help create a method to at least partly support phase IV testing of new drugs, which involves long-term monitoring of the effects of drugs following approval by the U.S. Food & Drug Administration. LSA could improve this process not only by automating it but also by identifying a drug's possible indirect effects; that is, effects associated with the drug only through other diagnoses. For example, if drug A is associated with condition B and condition B is associated with condition C, then LSA will identify that A and C might be related. A human examiner might not notice such a link but could be aided by LSA to identify possible connections of interest for the expert to consider; see Holzinger et al.7 for more on interactive machine learning.
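The drug A/condition C scenario above is exactly the kind of indirect link LSA surfaces. In this deliberately tiny sketch (numpy, with hypothetical terms A, B, and C), A and C never co-occur, so their raw term vectors are orthogonal, yet the reduced space relates them through their shared neighbor B:

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Terms A, B, C over two documents: d1 mentions A and B, d2 mentions B and C.
tdm = np.array([[1.0, 0.0],   # A
                [1.0, 1.0],   # B
                [0.0, 1.0]])  # C

raw = cosine(tdm[0], tdm[2])           # 0.0: A and C share no document

U, S, Vt = np.linalg.svd(tdm, full_matrices=False)
coords = U[:, :1] * S[:1]              # rank-1 semantic space
latent = cosine(coords[0], coords[2])  # close to 1.0: linked through B
```

The raw cosine is exactly zero, while the latent cosine is high: the dimensionality reduction has collapsed A and C onto the direction defined by their common association with B.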

Analyzing medical records could also allow comparison of diagnosis and prognosis across populations (such as differentiating between men and women). Accounting for demographics could also indicate the prevalence of diagnoses and prognoses by age and by geographical area, possibly indicating hazardous environmental conditions. Moreover, given the diversity within society, running LSA on medical records could also allow quasi-experimental design studies, as in, say, comparing clinics in areas where unique treatments are allowed against those where they are not. Planning such an experiment would be difficult and IRB approval might not always be forthcoming, but if the treatment conditions already occur in the population, then studying them would not be so contentious and could become routine. LSA could support such an after-the-fact examination.

Combining LSA cosine distances with TDM frequency values may also allow identification of extraordinary cases that are closely related to a condition but very rare. In the data we had, we omitted terms that appeared in fewer than four records, but the relationships between rare diseases and more common ones might suggest new avenues of research for a range of health issues. As a supplement to medical practice, a text-analytic approach might be able to suggest alternative diagnoses based on documented symptoms that might otherwise be attributed to more common conditions.

Above all, a key advantage of LSA is that it allows rank ordering of related terms. Being able to assign numbers, and hence quantify how strongly a condition is related to other conditions, could provide insight into which symptoms, and to what degree relative to others, indicate a problem (such as hypertension).



In a musical parody, lyricist David Lazar wrote in a song called "Dr. Freud" that Sigmund Freud's disciples said, "... by God, there's gold in them thar ills." There certainly was. Maybe there also is in gleaning medical insight from medical records documents through LSA.

References

1. Beel, J., Gipp, B., Langer, S., and Breitinger, C. Research-paper recommender systems: A literature survey. International Journal on Digital Libraries 17, 4 (Nov. 2016), 305–338.

2. Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., and Harshman, R. Indexing by latent semantic analysis. Journal of the American Society for Information Science 41, 6 (1990), 391–407.

3. Evangelopoulos, N., Zhang, X., and Prybutok, V.R. Latent semantic analysis: Five methodological recommendations. European Journal of Information Systems 21, 1 (Jan. 2012), 70–86.

4. Gefen, D., Endicott, J., Fresneda, J., Miller, J., and Larsen, K.R. A guide to text analysis with latent semantic analysis in R with annotated code studying online reviews and the Stack Exchange community. Communications of the Association for Information Systems 41, 21 (Dec. 2017), 450–496.

5. Gefen, D. and Larsen, K. Controlling for lexical closeness in survey research: A demonstration on the technology acceptance model. Journal of the Association for Information Systems 18, 10 (Oct. 2017), 727–757.

6. Gomez, J.C., Boiy, E., and Moens, M.-F. Highly discriminative statistical features for email classification. Knowledge Information Systems 31, 1 (Apr. 2012), 23–53.

7. Holzinger, A., Plass, M., Holzinger, K., Crisan, G.C., Pintea, C.-M., and Palade, V. A glass-box interactive machine learning approach for solving NP-hard problems with the human-in-the-loop;

8. Holzinger, A., Schantl, J., Schroettner, M., Seifert, C., and Verspoor, K. Biomedical text mining: State-of-the-art, open problems and future challenges. Chapter in Interactive Knowledge Discovery and Data Mining Biomedical Informatics, Lecture Notes in Computer Science LNCS 8401, A. Holzinger and I. Jurisica, Eds. Springer, Berlin, Heidelberg, Germany, 2014, 271–300.

9. Islam, A., Milios, E., and Keselj, V. Text similarity using Google tri-grams. In Proceedings of the 25th Canadian Conference on Artificial Intelligence (Toronto, Canada, May 28–30). Springer, Toronto, Canada, 2012, 312–317.

10. Kintsch, W. Predication. Cognitive Science 25, 2 (2001), 173–202.

11. Landauer, T.K. and Dumais, S.T. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review 104, 2 (1997), 211–240.

12. Landauer, T.K., Foltz, P.W., and Laham, D. An introduction to latent semantic analysis. Discourse Processes 25, 2 and 3 (1998), 259–284.

13. Landauer, T.K., Laham, D., and Derr, M. From paragraph to graph: Latent semantic analysis for information visualization. Proceedings of the National Academy of Sciences 101, 1 (Apr. 6, 2004), 5214–5219.

14. Larsen, K.R. and Bong, C.H. A tool for addressing construct identity in literature reviews and meta-analyses. MIS Quarterly 40, 3 (Sept. 2016), 529–551.

15. Larsen, K.R., Michie, S., Hekler, E.B., Gibson, B., Spruijt-Metz, D., Ahern, D., Cole-Lewis, H., Ellis, R.J.B., Hesse, B., Moser, R.P., and Yi, J. Behavior change interventions: The potential of ontologies for advancing science and practice. Journal of Behavioral Medicine 40, 1 (Feb. 2017), 6–22.

16. Valle-Lisboa, J.C. and Mizraji, E. The uncovering of hidden structures by latent semantic analysis. Information Sciences 177, 19 (Oct. 2007), 4122–4147.

17. Zhiwei, H., Meiping, C., Jimin, W., Qing, S., Chao, Y., Xing, D., and Zhonggao, W. Improved control of hypertension following laparoscopic fundoplication for gastroesophageal reflux disease. Frontiers of Medicine 11, 1 (Mar. 2017), 68–73.

Authors

David Gefen is a professor in the Decision Sciences and MIS Department, Academic Director of the Doctorate in Business Administration Program, and Provost Distinguished Research Professor in the LeBow College of Business at Drexel University, Philadelphia, PA, USA.

Jake Miller is an assistant clinical professor in the Management Department in the LeBow College of Business at Drexel University, Philadelphia, PA, USA.

Johnathon Kyle Armstrong is a research scientist at Independence Blue Cross, Philadelphia, PA, USA.

Frances H. Cornelius is a professor in the College of Nursing and Health Professions and Chair of the MSN Advanced Role Department, Complementary and Integrative Health Department, and coordinator of Clinical Nursing Informatics Education in the College of Nursing and Health Professions at Drexel University, Philadelphia, PA, USA.

Noreen Robertson is the Associate Vice Dean for research at Drexel University College of Medicine and a research assistant professor in the Department of Biochemistry & Molecular Biology at Drexel University, Philadelphia, PA, USA.

Aaron Smith-McLallen is the Director of Data Science and Health Care Analytics at Independence Blue Cross, Philadelphia, PA, USA.

Jennifer A. Taylor is an associate professor of environmental and occupational health in the School of Public Health at Drexel University, Philadelphia, PA, USA.





This study was supported by Drexel Grant #282847.

Copyright held by authors. Publication rights licensed to ACM.

The Digital Library is published by the Association for Computing Machinery. Copyright © 2018 ACM, Inc.


Michael Berry

I thought I would provide additional references on LSA that will help readers with the underlying mathematics and usage of LSA in bioinformatics for generating text-based similarity scores similar to p-values from gene expression data. Software for LSA has been available in languages such as C, C++, Java, and Python for many years. M.W. Berry (EECS Department, Univ of Tennessee, Knoxville)

LSA references:

Knowledge-Enhanced Latent Semantic Indexing, D. Guo, M.W. Berry, B.B. Thompson, and S. Bailin, Information Retrieval 6(2), 2003, pp. 225-250.

Mathematical Foundations Behind Latent Semantic Analysis, D.I. Martin and M.W. Berry, in Handbook of Latent Semantic Analysis, T.K. Landauer, D.S. McNamara, S. Dennis, and W. Kintsch (Eds.), Lawrence Erlbaum Associates, 2007, pp. 35-55.

Latent Semantic Indexing, D.I. Martin and M.W. Berry, in Encyclopedia of Library and Information Sciences (ELIS), Third Edition, M.J. Bates and M.N. Maack (Eds.), Taylor & Francis, Oxford, 2010, pp. 3195-3204.

Latent Semantic Indexing of PubMed Abstracts for Identification of Transcription Factor Candidates from Microarray-derived Gene Sets, S. Roy, K. Heinrich, V. Phan, M.W. Berry, and R. Homayouni, BMC Bioinformatics 12(Suppl 10):S19.

Functional Cohesion of Gene Sets Determined by Latent Semantic Indexing of PubMed Abstracts, L. Xu, N. Furlotte, E.O. George, K. Heinrich, M.W. Berry, and R. Homayouni, PLoS ONE 6(4): e18851, 2011.
