Sources of Big Data in Medicine

A simple definition of big data in medicine is “the totality of data related to patient health care and well-being.” But what exactly are these types of data, and where do they come from?

The following is a broad overview of the types and sources of big data of interest to health-care providers, researchers, payers, policymakers, and industry. These categories are not mutually exclusive, because the same data can originate from a variety of sources.

Nor is this list exhaustive, because the practical application of big data analytics will surely continue to expand.

Clinical Information Systems

These are traditional sources of clinical data that health care providers are accustomed to viewing.

  • Electronic health records (EHRs) collect, store, and display information such as demographics, past medical history, active medical problems, immunizations, allergies, medications, vital signs, results from laboratory and radiology tests, pathology reports, progress notes created by health care providers, and administrative and financial documents.
  • Electronic medical records (EMRs) are narrower than EHRs: they usually hold only the data kept by a particular physician or practice.
  • Health information exchanges serve as hubs between disparate clinical information systems.
  • Patient registries, maintained by health care organizations on their own patients, are often linked to the EHR. Other registries track immunizations, cancer, trauma, and other public health issues on a wider geographic scale.
  • Patient portals allow patients to access personal health information stored in a health care organization’s EHR. Some patient portals also allow users to request prescription refills and exchange secure electronic messages with the health care team.
  • Clinical data warehouses aggregate patient-level data from multiple clinical information systems, such as EHRs and other sources listed above.
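As a rough illustration of the aggregation a clinical data warehouse performs, the sketch below merges rows from several source systems into one record per patient. It is a minimal sketch, not how any particular warehouse product works: the function `build_warehouse`, the `patient_id` key, the source names, and the sample rows are all hypothetical and invented for this example.

```python
from collections import defaultdict


def build_warehouse(sources):
    """Merge rows from several source systems into one record per patient.

    `sources` maps a source name (e.g. "ehr") to a list of row dicts.
    Every row is assumed to carry a shared "patient_id" key (an assumption
    made for this sketch; real systems often need record linkage first).
    """
    warehouse = defaultdict(dict)
    for source_name, rows in sources.items():
        for row in rows:
            patient_id = row["patient_id"]
            for field, value in row.items():
                if field != "patient_id":
                    # Prefix each field with its source so origins stay visible.
                    warehouse[patient_id][f"{source_name}.{field}"] = value
    return dict(warehouse)


# Made-up sample rows, not real patient data.
sources = {
    "ehr": [{"patient_id": "p1", "allergy": "penicillin"}],
    "registry": [
        {"patient_id": "p1", "last_flu_shot": "2023-10-01"},
        {"patient_id": "p2", "last_flu_shot": "2022-11-15"},
    ],
}

warehouse = build_warehouse(sources)
print(warehouse["p1"])
```

Prefixing each field with its source name keeps the same data item distinguishable when it arrives from more than one system, which matters because these categories are not mutually exclusive.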

Claims Data From Payers

Public payers (e.g. Medicare) and private payers have large repositories of claims data on their beneficiaries. Some health insurers now also offer incentives to members who share their health data.

Research Studies

Research databases contain information about study participants, experimental treatments, and clinical outcomes. Large studies are usually sponsored by pharmaceutical companies or government agencies. An application of personalized medicine is to match individual patients with effective treatments, based on patterns in data from clinical trials.

This approach moves beyond applying evidence-based medicine principles, by which a health care provider determines whether a patient shares broad characteristics (e.g. age, gender, race, clinical status) with trial participants. With big data analytics, it is possible to select a treatment based on much more granular information, such as the genetic profile of a patient’s cancer (see below).

Clinical decision support systems (CDSS) have also been developing rapidly and now represent a big part of artificial intelligence (AI) in medicine. They use patient data to assist clinicians with their decision-making and are often combined with EHRs.

Genetic Databases

The repository of human genetic information continues to grow at a rapid pace. Since the Human Genome Project was completed in 2003, the cost of human DNA sequencing has fallen a million-fold. The Personal Genome Project (PGP), launched in 2005 by Harvard Medical School, seeks to sequence and publicize the complete genomes of 100,000 volunteers from around the world. The PGP is itself a prime example of a big data project because of the sheer volume and variety of its data. A personal genome contains about 100 gigabytes of data. In addition to sequencing genomes, the PGP is also collecting data from EHRs, surveys, and microbiome profiles.

A number of companies offer direct-to-consumer genetic sequencing for health, personal traits, and pharmacogenetics on a commercial basis. This personal information could be subjected to big data analytics.

23andMe, for example, stopped offering health-related genetic reports to new customers as of November 22, 2013, to comply with the U.S. Food and Drug Administration. However, in 2015, the company started offering certain health components of its genetic saliva test again, this time with the FDA’s approval.

Public Records

The government keeps detailed records of events related to health, such as immigration, marriage, birth, and death. The U.S. Census has collected vast amounts of information every 10 years since 1790. The Census’ statistics website had 370 billion cells as of 2013, with approximately 11 billion more added yearly.

Web Searches

Web search information gathered by Google and other web search providers could provide real-time insights related to a population’s health. However, the value of big data from web search patterns might be improved by combining it with traditional sources of health data.

Social Media

Facebook, Twitter, and other social media platforms generate a rich variety of data around the clock, giving a view into the locations, health behaviors, emotions, and social interactions of users. The application of social media big data to public health has been referred to as digital disease detection or digital epidemiology. Twitter, for example, has been used to analyze influenza epidemics among the general population.

The World Well-Being Project, which started at the University of Pennsylvania, is another example of studying social media to better understand people’s experience and health. The project brings together psychologists, statisticians, and computer scientists who analyze the language people use when interacting online, for instance, when writing status updates on Facebook and Twitter. Scientists are observing how users’ language relates to their health and happiness, and advances in natural language processing and machine learning are helping with this work. A recent publication from the University of Pennsylvania looked at ways of predicting mental illness by analyzing social media. It appears that symptoms of depression and other mental health conditions can be detected by studying how people use the Internet. Scientists hope that, in the future, these methods will better identify and assist at-risk individuals.

The Internet of Things (IoT)

Massive troves of health-related information are also collected and stored on mobile and home devices.

  • Smartphones: Thousands of mHealth apps capture information on the user’s physical activity, nutritional intake, sleep patterns, emotions, and other parameters. Native cell phone apps (e.g. GPS, email, texting) can also give clues about an individual’s health status.
  • Wearable monitors and devices: Pedometers, accelerometers, glasses, watches, and chips embedded under the skin also gather health-related information and can send it to the cloud.
  • Telemedicine devices allow health care providers to monitor patients’ parameters such as blood pressure, heart rate, respiratory rate, oxygenation, temperature, ECG tracings, and weight.

Financial Transactions

Patients’ credit card transactions are included in the predictive models used by Carolinas HealthCare System to identify patients who are at high risk of being readmitted to the hospital. The Charlotte-based health care provider uses big data to divide patients into various groups, for example, based on disease and geographic location.

Ethical and Privacy Implications

In some cases, gathering and accessing health care data carries important ethical and privacy implications. New sources of big data can improve our understanding of what affects individual and population health, but the associated risks need to be carefully considered and monitored. It is now also recognized that data previously deemed anonymous can be re-identified. For example, Professor Latanya Sweeney of Harvard’s Data Privacy Lab reviewed 1,130 volunteers involved in the Personal Genome Project. She and her team were able to correctly name 42% of the participants based on the information they shared (ZIP code, birth date, gender). This knowledge can increase our awareness of potential risks and help us make better data-sharing decisions.
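The re-identification risk described above comes from combinations of ordinary fields being rare or unique. The following is a minimal sketch of one way to gauge that risk in a dataset: it counts the share of records whose combination of ZIP code, birth date, and gender appears only once. It is not the method used in the study cited above; the function name `unique_fraction`, the field names, and the sample records are assumptions invented for illustration.

```python
from collections import Counter


def unique_fraction(records, keys=("zip", "birth_date", "gender")):
    """Return the share of records whose quasi-identifier combination is unique.

    A record with a unique combination is easier to link to a named
    individual in another dataset (e.g. a voter list).
    """
    if not records:
        return 0.0
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    unique = sum(1 for r in records if combos[tuple(r[k] for k in keys)] == 1)
    return unique / len(records)


# Made-up sample records, not real data.
records = [
    {"zip": "02138", "birth_date": "1970-01-01", "gender": "F"},
    {"zip": "02138", "birth_date": "1970-01-01", "gender": "F"},
    {"zip": "02139", "birth_date": "1985-06-15", "gender": "M"},
    {"zip": "02140", "birth_date": "1990-12-31", "gender": "F"},
]

print(unique_fraction(records))  # 0.5: two of the four records are unique
```

Coarsening a field, for example by passing `keys=("zip", "gender")` or by storing only the birth year, usually lowers this fraction, which is one reason shared datasets often generalize such fields.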
