Disclaimer:

The findings from this project are for educational purposes only and should not be used for clinical decision-making.
The analysis scripts linked at the bottom of the page were written for learning and should not be used in production without further testing.

What you will find in this project

2. Introduction to FAERS, the key US database for pharmacovigilance

3. FAERS data analysis methodology

ETL (Extract, Transform, Load) using Python
Data cleaning using PostgreSQL
Descriptive profiling using R (tidyverse, DBI, RPostgres)
Signal detection using R

4. Findings & Conclusion

Descriptive profiling findings
Statistical signal detection findings using R (ggplot2+plotly)
- Heatmaps to explore drug-reaction matrices
- Interactive volcano plots of the Information Component lower bound (IC025) against the Log10 Chi-square statistic
- Interactive GLP-1 Safety Signals forest plot
- Interactive GLP-1 Safety Signals by sub-population

6. Full analysis scripts on GitHub

Introduction

As a pharmacist with an interest in data science, I started learning R and Python and wanted to apply them in a personal project to improve my skills. I chose to work on the FAERS database because it is a rich source of real-world data that can be used to identify drug safety signals. I am also interested in pharmacovigilance, so this project is a perfect match for my interests.

What is pharmacovigilance?

Pharmacovigilance is the science of monitoring the safety of medicines after they reach the market. As part of this, the U.S. Food and Drug Administration (FDA) maintains the Adverse Event Reporting System (FAERS), a massive database tracking reported drug side effects.

What is FAERS database?

The FDA Adverse Event Reporting System (FAERS) is a publicly available, national database containing millions of reports on adverse events (side effects) and medication errors.

These reports are submitted voluntarily by healthcare professionals and consumers, as well as mandatorily by pharmaceutical manufacturers.

What would be expected from this analysis?

FAERS data can uncover hidden safety patterns that weren’t caught during initial clinical trials, for example side effects of drug combinations or rare side effects in specific demographics. This analysis, however, focuses solely on drug-reaction pairs, because I want to start with a simple task that can lead to more complex analyses later on.

Methodology

FAERS Analysis pipeline workflow:

Data source

The raw data comes from the FDA’s FAERS quarterly data releases, provided as massive, deeply nested XML files containing millions of patient and drug records from 2004Q1-2025Q4.

Data available here: FAERS Database

Data preprocessing and cleaning

1. ETL (Extract, Transform, Load)

I started by reading the related documentation and sampling XML files to understand the structure and content of the data. Then, with the help of AI assistants, I developed the `faers_etl.py` script using the `xml.etree.ElementTree` library. I tested it on small subsets of the data first and gradually increased the data size until it could handle the massive XML files without crashing. Finally, I loaded the processed data into a structured PostgreSQL database.
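To illustrate how large XML files can be processed without loading them fully into memory, here is a minimal sketch using `xml.etree.ElementTree.iterparse`. It is not the actual `faers_etl.py`; the function name `iter_reports` is hypothetical, and the tag names (`safetyreport`, `patientsex`, `medicinalproduct`, `reactionmeddrapt`) are assumed from the E2B-style layout of the FAERS XML files and should be checked against the real data.

```python
import io
import xml.etree.ElementTree as ET


def iter_reports(source):
    """Yield one flat dict per <safetyreport> element (hypothetical helper).

    `source` is a file path or a binary file-like object.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "safetyreport":
            continue  # wait until a whole report has been read
        yield {
            "safetyreportid": elem.findtext("safetyreportid"),
            "patientsex": elem.findtext("patient/patientsex"),
            "drugs": [d.findtext("medicinalproduct") for d in elem.iter("drug")],
            "reactions": [r.findtext("reactionmeddrapt") for r in elem.iter("reaction")],
        }
        # Drop the children of the finished report to keep memory use low.
        elem.clear()
```

Because the function is a generator, the rows can be written to PostgreSQL in batches while the file is still being read, which is one way to grow from small test subsets to full quarterly files.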

2. Data cleaning

Utilizing SQL (`faers_clean.sql`), I performed extensive data normalization and cleaning:

Date & Age Standardization: Converted string dates to standard formats, handled low-precision dates, and normalized various age units (months, days) into a single `ageyears` column.
Label Decoding: Translated cryptic numeric codes into readable text (e.g., patient sex, reaction outcomes).
Drug & Indication Normalization: Extracted missing active ingredients from product names using regex, stripped chemical salt suffixes, and standardized medical indications.
Sender Normalization: Consolidated various pharmaceutical company subsidiaries into standardized parent company names.
Severity Reconciliation: Created a reliable master “seriousness” flag to fix logical inconsistencies in the raw FDA data.
Deduplication: Built a clinical “fingerprint” (matching demographics, dates, drugs, and reactions) to identify and remove duplicate reports submitted by multiple sources.
Final Analytical View: Compiled all the cleaned and filtered data into a final view (`v_analysis`) to serve as a reliable source of truth for the next statistical phase.
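Two of the cleaning steps above, age normalization and fingerprint-based deduplication, can be sketched as follows. The real logic lives in `faers_clean.sql`; this Python version is only an illustration. The age-unit codes are assumed to follow the E2B convention (800 = decade, 801 = year, 802 = month, 803 = week, 804 = day, 805 = hour), and the field names in `fingerprint` are hypothetical.

```python
import hashlib

# Assumed E2B age-unit codes mapped to a multiplier that converts to years.
AGE_UNIT_TO_YEARS = {
    "800": 10.0,        # decade
    "801": 1.0,         # year
    "802": 1 / 12,      # month
    "803": 1 / 52.18,   # week
    "804": 1 / 365.25,  # day
    "805": 1 / 8766,    # hour
}


def age_in_years(value, unit_code):
    """Return the age in years, or None if the value or unit is missing/unknown."""
    factor = AGE_UNIT_TO_YEARS.get(str(unit_code))
    if value in (None, "") or factor is None:
        return None
    return round(float(value) * factor, 2)


def fingerprint(report):
    """Hash demographics, date, drugs and reactions into one comparable key."""
    parts = [
        str(report.get("sex")),
        str(report.get("ageyears")),
        str(report.get("event_date")),
        "|".join(sorted(d.upper() for d in report.get("drugs", []))),
        "|".join(sorted(r.upper() for r in report.get("reactions", []))),
    ]
    return hashlib.sha256("||".join(parts).encode("utf-8")).hexdigest()


def deduplicate(reports):
    """Keep the first report seen for each fingerprint."""
    seen, unique = set(), []
    for rep in reports:
        key = fingerprint(rep)
        if key not in seen:
            seen.add(key)
            unique.append(rep)
    return unique
```

Note that an exact-match fingerprint like this treats reports with the same sparse fields as duplicates, so the choice of fields matters.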

Descriptive profiling

I used R (tidyverse, DBI, RPostgres) to plot descriptive profiles and get a better understanding of the dataset. The charts and the key insights from them are listed below:

A pie chart showing the breakdown of reports by patient sex to observe reporting imbalances.
A bar chart illustrating the distribution of adverse event reports across different patient age groups.
A global choropleth map displaying the volume of adverse event reports by country of origin.
A chart detailing the proportion of seriousness in serious adverse event reports.
A ranking of the organizations that submit the highest volume of adverse event reports.
A breakdown of the most common adverse reactions that resulted in a fatal outcome.

Sex Distribution: Reports concerning female patients are considerably more common than reports concerning male patients (47.3% vs 31.0%). A notable portion (21.8%) of reports are missing sex data.
Age Distribution: Among reports with known ages, Adults (18-64) are the largest affected group (31.3%), followed by the Elderly (65-74) and Very elderly (>75). A large proportion (41.9%) of reports lack age data.
Geographic Mapping: Because FAERS is a US database, the vast majority of adverse event reports originate from the United States, with secondary reporting clusters in Europe and East Asia.
Severity Breakdown: While many serious reports fall under a non-specific “Other” category (38.8%), a substantial portion resulted in Hospitalization (18.6%) and Death (6.5%), emphasizing the critical nature of the reported events.
Top Senders: SANOFI and ABBVIE are the top reporting organizations by a wide margin in this dataset. This is probably driven by their large patient volumes and strong market position in biologics, particularly with flagship drugs like Dupixent and Humira.
Fatal Reactions: The most frequent reactions associated with fatal outcomes range from severe acute conditions like “Duodenal ulcer perforation” to broader systemic issues like “Systemic lupus erythematosus”.

Signal detection

In pharmacovigilance, we look for “disproportionate signals”: when a specific side effect is reported for a drug more often than expected by chance.

For readers who are not familiar with pharmacovigilance, here is a quick guide:

Signal detection principle

Important question: Is this drug-event combination reported more than expected?

To answer this, we construct a 2×2 contingency table:

Drug-event pair	Target reaction	Other reactions	Total
Target drug	a	b	a + b
Other drugs	c	d	c + d
Total	a + c	b + d	N

Where:

a: Number of reports for the target drug with the target reaction
b: Number of reports for the target drug with other reactions
c: Number of reports for other drugs with the target reaction
d: Number of reports for other drugs with other reactions
N: Total number of reports (a + b + c + d)

Let’s apply these variables to the metrics for signal detection.

Metrics for signal detection

1. PRR (Proportional Reporting Ratio)

PRR=\frac{a}{(a+b)}\div\frac{c}{(c+d)}

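PRR is the ratio between:

The proportion of the target drug’s reports that mention the target reaction, a/(a+b)
The proportion of all other drugs’ reports that mention the same reaction, c/(c+d)

Interpret as:

PRR = 1: the reaction is reported as often as expected (no disproportionality)
PRR > 1: the reaction is reported more often than expected for this drug (possible signal)
PRR < 1: the reaction is reported less often than expected (no signal; rarely, a protective effect)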

2. ROR (Reporting Odds Ratio)

ROR=\frac{a}{b}\div\frac{c}{d}

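ROR is the cross-product ratio ad/bc: the odds of the target reaction among reports for the target drug, divided by the same odds among reports for other drugs
It looks at the report database from a case-control perspective
Interpret as:
- ROR = 1: no association
- ROR > 1: the reaction is reported more often than expected for this drug (possible signal)
- ROR < 1: the reaction is reported less often than expected (no signal; rarely, a protective effect)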

3. Chi-square χ² (with Yates’ continuity correction): Test of independence

\chi^2=\frac{N\times(|ad-bc|-N/2)^2}{(a+b)(c+d)(a+c)(b+d)}

Tests whether the drug and the reaction are independent in the 2×2 table (null hypothesis)
Interpret as:
- χ² < 4: fail to reject the null hypothesis at roughly the 95% confidence level (the exact critical value for 1 degree of freedom is 3.84) -> the signal is not statistically significant
- χ² ≥ 4: reject the null hypothesis at roughly the 95% confidence level -> the signal is statistically significant

4. Information Component (IC; with Bayesian correction): Test of disproportionality

IC=\log_2\left[(a+0.5)\div\frac{(a+b+0.5)\times(a+c+0.5)}{N+1}\right]

Measures the information gain from the observation (log2 of the ratio between the observed and expected counts of the drug-event pair)
Interpret as:
- IC = 0: no information gain (observed = expected)
- IC > 0: the pair is reported more often than expected (possible signal); IC = 1: 2x more than expected, IC = 2: 4x more than expected, …
- IC < 0: the pair is reported less often than expected (no signal; rarely, a protective effect)

5. IC025: Signal Stability

IC_{var}=\frac{1}{\ln(2)^2}\times\frac{1}{a+0.5}-\frac{1}{N+1}

IC_{025}=IC-1.96\times\sqrt{IC_{var}}

IC025 is the lower bound of the 95% confidence interval of IC
Interpret as:
- IC025 ≤ 0: the signal is not statistically significant at the 95% confidence level
- IC025 > 0: signal detected at the 95% confidence level

To make the selection of signal detection methods more robust, I borrowed methodology from several organizations:

PRR, ROR, and χ² from the European Medicines Agency (EMA)
IC/BCPNN and IC025 from the WHO Uppsala Monitoring Centre (VigiBase)
Thresholds: PRR ≥ 2 and χ² ≥ 4 from the UK MHRA (Medicines and Healthcare products Regulatory Agency); in this analysis I additionally required a ≥ 3 and IC025 > 0

I utilized R to apply these statistical methods to find strong associations.
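The project itself computes these metrics in R. As a language-neutral illustration, the Python sketch below applies the formulas and thresholds exactly as written in this post to a single 2×2 table. The function name `disproportionality` is hypothetical, and the sketch assumes that none of the cells or margins are zero.

```python
import math


def disproportionality(a, b, c, d):
    """Compute PRR, ROR, Yates' chi-square, IC and IC025 for one 2x2 table.

    Assumes all four cells are positive so that no division by zero occurs.
    """
    n = a + b + c + d
    prr = (a / (a + b)) / (c / (c + d))
    ror = (a / b) / (c / d)
    chi2 = (n * (abs(a * d - b * c) - n / 2) ** 2
            / ((a + b) * (c + d) * (a + c) * (b + d)))
    ic = math.log2((a + 0.5) / ((a + b + 0.5) * (a + c + 0.5) / (n + 1)))
    # Variance term written in the same order as the formula in this post.
    ic_var = (1 / math.log(2) ** 2) * (1 / (a + 0.5)) - 1 / (n + 1)
    ic025 = ic - 1.96 * math.sqrt(ic_var)
    # Thresholds used in this analysis: a >= 3, PRR >= 2, chi2 >= 4, IC025 > 0.
    signal = a >= 3 and prr >= 2 and chi2 >= 4 and ic025 > 0
    return {"prr": prr, "ror": ror, "chi2": chi2,
            "ic": ic, "ic025": ic025, "signal": signal}
```

For a made-up table with a = 20, b = 80, c = 100, d = 9800, the PRR is 19.8 and the ROR is 24.5, and all four thresholds are met, so the pair would be flagged.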

Confounding control

A major challenge in health data is “Confounding by Indication”. For example, a diabetes drug will have many reports of high blood sugar simply because the patients have diabetes. I built a filter in R to significantly reduce these logical overlaps so that only unexpected side effects are flagged.
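The exact rules of the R filter are not reproduced here. One simple way to approach the idea, shown as a hypothetical Python sketch, is to drop drug-reaction pairs whose reaction term is identical to one of the indications recorded for that drug. The function name and the dictionary keys (`drug`, `reaction`, `indications`) are assumptions made for this example.

```python
def drop_indication_pairs(pairs):
    """Remove pairs whose reaction term equals a recorded indication.

    `pairs` is an iterable of dicts with the hypothetical keys
    'drug', 'reaction' and 'indications' (a list of indication terms).
    """
    kept = []
    for pair in pairs:
        indications = {term.strip().lower() for term in pair.get("indications", [])}
        if pair["reaction"].strip().lower() not in indications:
            kept.append(pair)
    return kept
```

An exact-match rule like this removes “Type 2 diabetes mellitus” reported as a reaction to a diabetes drug, but not related terms such as “Blood glucose increased”, so it reduces the overlap without eliminating it.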

Visualization & Findings

I created interactive visualizations using R (ggplot2, plotly) to make the findings accessible. These include:

1. Heatmaps of drug-reaction matrices

The Safety Signal Intensity Heatmap visualizes the association strength (IC025) between the top 40 drugs and the top 40 reported adverse reactions. Darker blue cells indicate a higher lower bound of the Information Component (IC), representing a statistically robust signal. Findings from this heatmap include:

GLP-1 Agonist Cluster: The heatmap highlights a distinct gastrointestinal (GI) safety profile for GLP-1 receptor agonists like Semaglutide and Tirzepatide. Both drugs show strong positive associations with Nausea, Vomiting, Diarrhoea, Constipation, and Abdominal pain.
Tirzepatide & Injection Site Reactions: While sharing the GI profile, Tirzepatide stands out with a particularly intense signal for Injection site pain, reflecting its delivery method and potentially higher localized reactivity compared to other substances in the top 40.
Disease-Signal Overlap: The dark cells for Type 2 diabetes mellitus associated with these drugs exemplify ‘Confounding by Indication’, where the underlying condition being treated is reported as an adverse event. Although the ‘Indication Filtering’ step was applied, some signals for underlying conditions still remain.

2. Interactive volcano plots

This visualization displays safety signals by plotting the Information Component lower bound (IC025) against the Log10 Chi-square statistic.

Signal Stability (X-axis): The IC025 represents the Bayesian lower bound of signal strength; values above 0 indicate a stable signal.
Statistical Significance (Y-axis): The Chi-square statistic identifies signals that deviate significantly from expected background reporting.
Magnitude & Risk: Bubble size represents the total case count, while the color gradient (from wheat to indianred) represents the Proportional Reporting Ratio (PRR).
Interactive Filtering: Users can filter signals by WHO ATC Level 1 drug classes using the built-in dropdown menu.

Key Findings: The plot clearly isolates a “Strong Signals” quadrant (top-right) where drug-reaction pairs meet both rigorous Bayesian and Frequentist criteria. By filtering for the “Alimentary tract and metabolism” ATC class, GLP-1 receptor agonists (such as Semaglutide and Tirzepatide) stand out in the upper-right quadrant. Their data points appear as massive, red bubbles representing high case volumes and PRR values for gastrointestinal adverse events. This immediate visual confirmation justifies selecting GLP-1 agonists for a targeted deep-dive analysis.

3. Interactive GLP-1 Safety Signals

The Interactive GLP-1 Safety Signal Forest Plot provides a granular, drug-by-drug comparison of safety signals within the GLP-1 agonist class. It uses the Information Component (IC), a Bayesian measure of disproportionate reporting, together with its 95% confidence interval to illustrate signal stability.

Comparative Profiling: A built-in dropdown menu allows users to toggle between different reactions (e.g., Nausea, Vomiting, Constipation), revealing how drugs like Semaglutide, Tirzepatide, and Dulaglutide perform relative to one another.
Hover Metadata: The interactive Plotly interface allows users to hover over data points to see exact IC values, confidence bounds (IC025 to IC975), and specific case counts, making it a powerful tool for deep-dive safety assessment.
Precision & Volume: For common GI reactions like ‘Nausea’, ‘Diarrhoea’, ‘Vomiting’, and ‘Constipation’, Semaglutide, Liraglutide, Dulaglutide and Tirzepatide show high-intensity, stable signals (IC 1 to 3.5), while some newer agents have too few reports to assess. For injection site pain, Tirzepatide shows the most intense stable signal (IC > 4), followed by Dulaglutide (IC ~ 3), while the other GLP-1 agonists have no reports of this reaction. This finding agrees with the adverse reaction information in Lexidrug, which lists mild injection site pain in 3-8% of Tirzepatide users but no such adverse reaction for Semaglutide and Liraglutide users.

4. Interactive GLP-1 Safety Signals by sub-population

The Sub-population Safety Signal Heatmap is a multi-dimensional tool designed to uncover how safety signals vary across different patient demographics.

Multi-Dimensional Filtering: Three dropdown menus allow users to slice the data by Gender, Age Group (e.g., Adult 18-64 vs. Elderly 65+), and Clinical Indication (e.g., Diabetes vs. Weight Management).
Evidence-Based Visualization: Each cell displays the IC025 value, with visual markers denoting statistical significance levels. (***: IC025>2, **:IC025>1, *:IC025>0)
Demographic Insights:
- Weight Management Cohort: For patients taking medications for weight loss, signals for “Impaired gastric emptying” and “Abdominal pain” are significantly more pronounced compared to those taking the same drugs for diabetes, especially for Dulaglutide.
- Injection-Site Cluster: Tirzepatide displays a uniquely intense and consistent cluster of injection-site reactions (pain, bruising, erythema) that persists across all age and gender filters, distinguishing it from other GLP-1s.
- Indication-Driven Reporting: In the diabetes population, signals like “Blood glucose increased” and “Drug ineffective” often appear, reflecting clinical reporting patterns where uncontrolled underlying disease is flagged as an adverse event.
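The significance markers in the heatmap cells follow the IC025 cut-offs given above (***: IC025>2, **: IC025>1, *: IC025>0). The mapping can be written as a small function; this is a Python sketch with a hypothetical function name, while the actual plot is built in R.

```python
def significance_marker(ic025):
    """Map an IC025 value to the star marker used in the heatmap cells."""
    if ic025 > 2:
        return "***"
    if ic025 > 1:
        return "**"
    if ic025 > 0:
        return "*"
    return ""  # IC025 <= 0: no marker
```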

Conclusion

This FAERS data analysis pipeline effectively demonstrates how statistical pharmacovigilance techniques combined with interactive visualizations can isolate and interpret genuine drug safety signals from background noise.

By applying both Bayesian (IC) and Frequentist methods (PRR, ROR), I identified GLP-1 receptor agonists as a drug class of exceptionally high interest due to their overwhelmingly strong safety signals. Through my targeted deep-dive visualizations, several key clinical insights emerged:

Class-wide Gastrointestinal Signals: GLP-1 agonists (particularly Semaglutide, Tirzepatide, Liraglutide, and Dulaglutide) consistently exhibit high-intensity, stable signals for GI adverse events like nausea, vomiting, diarrhoea, and constipation.
Drug-Specific Variances: While sharing the GI profile, Tirzepatide and Dulaglutide present uniquely intense, robust signals for injection-site reactions (such as pain, bruising, and erythema) that are virtually absent in reports for other GLP-1 drugs, a finding that corroborates established medical literature.
Sub-population Differences: The adverse event profile shifts significantly based on the patient’s clinical indication. Notably, patients utilizing these medications for weight management report pronounced rates of impaired gastric emptying and abdominal pain compared to those treating diabetes, especially with Dulaglutide.
Confounding by Indication: Despite applying logical filtering steps, the persistent overlap of disease symptoms (e.g., increased blood glucose in diabetes patients) being reported as adverse events highlights the inherent complexities of analyzing real-world, post-marketing data.

Ultimately, this project showcases the power of transforming massive, complex raw data into granular, actionable clinical insights that can inform personalized patient care and enhance drug safety monitoring.

Discussion

While this pipeline successfully extracts actionable insights from raw FAERS data, analyzing real-world pharmacovigilance data presents several inherent challenges that leave room for future improvement:

Drug Name Standardization

Raw adverse event reports use tens of thousands of different names, misspellings, or abbreviations for the same drug. In this iteration, I implemented a custom fuzzy matching algorithm to standardize medicinal product names into generic active substances. Initial explorations using external APIs (like OpenFDA and RxNorm) yielded inconsistent results, such as erroneously mapping “0.9 % Normal saline” to “tolnaftate”. Moving forward, integrating flexible methods like Retrieval-Augmented Generation (RAG) or Large Language Models (LLMs) could provide the contextual understanding necessary for highly accurate, automated drug mapping.
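To give an idea of what fuzzy matching of drug names looks like, here is a minimal sketch built on Python’s standard `difflib` module. It is a stand-in for illustration, not the custom algorithm used in this project; the vocabulary list, the function name, and the 0.85 cutoff are all assumptions.

```python
import difflib

# Hypothetical target vocabulary of generic active substances.
KNOWN_SUBSTANCES = ["SEMAGLUTIDE", "TIRZEPATIDE", "LIRAGLUTIDE", "DULAGLUTIDE"]


def standardize_drug_name(raw_name, vocabulary=KNOWN_SUBSTANCES, cutoff=0.85):
    """Return the closest known substance, or None if nothing is similar enough."""
    cleaned = raw_name.strip().upper()
    matches = difflib.get_close_matches(cleaned, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

With this vocabulary, a misspelling such as "semaglutid" maps to "SEMAGLUTIDE", while an unrelated product name such as "0.9 % Normal saline" returns None instead of being forced onto a wrong substance. A strict cutoff trades some recall for fewer false mappings.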

Cross-sender deduplication

The FAERS database frequently contains duplicate reports submitted by different entities (e.g., a physician, a pharmacist, and the manufacturer reporting the same single event). Although this project uses a duplicate flag and clinical fingerprinting to identify and exclude overlapping reports across different senders, this deduplication strategy is not perfect because patient-specific fields such as age and body weight are often missing. Undetected duplicates may inflate case counts and bias the statistical signals. Future enhancements could explore probabilistic record linkage to improve accuracy.

Indication Filtering and Confounding

“Confounding by indication” is a persistent hurdle. While my custom indication filtering successfully reduces direct logical overlaps, some disease-driven signals still occasionally slip through (such as “Blood glucose increased” or “Drug ineffective”). Developing more nuanced clinical ontologies to filter downstream disease complications, or accurately accounting for off-label usage, would help isolate only the truly unexpected adverse drug reactions.

Advanced Signal Detection Methods

The current pipeline utilizes a robust blend of Frequentist (PRR, ROR, Chi-square) and Bayesian (Information Component via BCPNN) methodologies. However, the signal detection capabilities can be further enhanced by incorporating more sophisticated empirical Bayes methods, such as the Multi-item Gamma Poisson Shrinker (MGPS) to calculate the Empirical Bayes Geometric Mean (EBGM). These algorithms, frequently utilized by the FDA, are particularly effective at minimizing false positives in extremely sparse data and detecting complex multi-drug interactions (polypharmacy).

Analysis scripts

My GitHub repo

Thank you for making it this far. This is my first complete health data analysis project, and it taught me many things. I will make my next analysis sharper, so stay tuned for my next project!

Disclaimer (again!?): This project is for educational purposes. Findings should not be used for clinical decision-making.

Learning pharmacovigilance: FAERS Data analysis personal project