Data Viz & Introduction to ggplot2

Jo Hardin

August 27, 2025

Agenda 8/27/25

  1. grammar of graphics
  2. ggplot

Goals of ggplot2

What I will try to do

  • give a tour of ggplot2

  • explain how to think about plots the ggplot2 way

  • prepare/encourage you to learn more later

What I can’t do in one session

  • show every bell and whistle

  • make you an expert at using ggplot2

Getting help

  1. One of the best ways to get started with ggplot is to Google what you want to do along with the word ggplot, then look through the images that come up. More often than not, the associated code is there. There are also ggplot galleries of images; one of them is here: https://plot.ly/ggplot2/

  2. Look at the end of this presentation and the syllabus. There are more help options there.

What are the visual cues on this plot?

  • position
  • length
  • shape
  • area/volume
  • shade/color

The grammar of graphics ggplot

geom: the geometric “shape” used to display data

  • bar, point, line, ribbon, text, etc.

aesthetic: an attribute controlling how geom is displayed with respect to variables

  • x position, y position, color, fill, shape, size, etc.

guide: helps user convert visual data back into raw data (legends, axes)

stat: a transformation applied to data before geom gets it

  • example: histograms work on binned data

Set up

library(mosaic)
data(Births2015)
head(Births2015) 
date        births  wday  year  month  day_of_year  day_of_month  day_of_week
2015-01-01    8068  Thu   2015      1            1             1            5
2015-01-02   10850  Fri   2015      1            2             2            6
2015-01-03    8328  Sat   2015      1            3             3            7
2015-01-04    7065  Sun   2015      1            4             4            1
2015-01-05   11892  Mon   2015      1            5             5            2
2015-01-06   12425  Tue   2015      1            6             6            3

How do we make this plot?

Two Questions:

  1. What do we want R to do? (What is the goal?)

  2. What does R need to know?

How do we make this plot?

  1. Goal: scatterplot = a plot with points

  2. What does R need to know?

    • data source: Births2015

    • aesthetics:

      • date -> x
      • births -> y
    • points

How do we make this plot?

ggplot(data = Births2015, 
       aes(x = date, y = births)) + 
  geom_point() +
  labs(title = "US Births in 2015")

ggplot() +
  geom_point(data = Births2015, 
             aes(x = date, y = births)) +
  labs(title = "US Births in 2015")

Layers: layer 0

ggplot() 

Layers: layer 1

ggplot(data = Births2015, 
       aes(x = date, y = births)) 

Layers: layer 2

ggplot(data = Births2015, 
       aes(x = date, y = births)) + 
  geom_point()

Layers: layer 3

ggplot(data = Births2015, 
       aes(x = date, y = births)) + 
  geom_point() +
  labs(title = "US Births in 2015")
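
Because each `+` returns a new plot object, the steps above can also be stored and inspected one at a time. This is a minimal sketch, assuming Births2015 is available from the mosaicData package (which is attached when mosaic is loaded); the object names p0 to p3 are only illustrative.

```r
library(ggplot2)
library(mosaicData)   # provides Births2015 (also attached by library(mosaic))

p0 <- ggplot()                                   # empty canvas
p1 <- ggplot(data = Births2015,
             aes(x = date, y = births))          # data + axes, no geoms yet
p2 <- p1 + geom_point()                          # add the points
p3 <- p2 + labs(title = "US Births in 2015")     # add a title

# Internally, ggplot2 only counts geoms/stats as layers
length(p1$layers)   # 0
length(p2$layers)   # 1

p3                  # printing the object draws the plot
```

Storing intermediate objects like this makes it easy to try several geoms on top of the same base plot.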

How do we make this plot?

What has changed?

  • new aesthetic: mapping color to day of week

How do we make this plot?

ggplot(data = Births2015,
       aes(x = date,
           y = births, 
           color = wday)) +
  geom_point() +
  labs(title = "US Births in 2015")

How do we make this plot?

lines instead of dots!

ggplot(data = Births2015,
         aes(x = date, 
             y = births,
             color = wday)) +
  geom_line() +
  labs(title = "US Births in 2015")

How do we make this plot?

Now there are two layers: one with points and one with lines

ggplot(data = Births2015,
       aes(x = date,
           y = births,
           color = wday)) + 
  geom_point() +  
  geom_line() +
  labs(title = "US Births in 2015")
  • The layers are placed one on top of the other: the points are below and the lines are above.

  • data and aes specified in ggplot() affect all geoms

What does this code do?

ggplot(data = Births2015,
       aes(x = date, y = births, color = "navy")) + 
  geom_point() +
  labs(title = "US Births in 2015") 

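This maps the color aesthetic to a new variable with only one value (“navy”).
So all the dots get the same color, but it’s not navy: ggplot2 treats “navy” as a category label, assigns it the first color of its default palette, and adds a legend for it.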

Setting vs. Mapping

If we want to set the color to be navy for all of the dots, we do it outside the aes() designation:

ggplot(data = Births2015,
       aes(x = date, y = births)) +   # map variables 
  geom_point(color = "navy")      +   # set attributes
  labs(title = "US Births in 2015")
  • Note that color = "navy" is now outside of the aesthetics list. aes() is how ggplot2 distinguishes between mapping and setting.
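
One way to check the difference between mapping and setting is to look at the colors ggplot2 actually hands to the geom. The sketch below uses `layer_data()` for that; it assumes Births2015 comes from the mosaicData package, and the object names are illustrative.

```r
library(ggplot2)
library(mosaicData)   # provides Births2015

# Mapping: "navy" inside aes() is treated as data (a one-level category)
mapped_plot <- ggplot(Births2015, aes(x = date, y = births, color = "navy")) +
  geom_point()

# Setting: "navy" outside aes() is passed to the geom as a fixed attribute
set_plot <- ggplot(Births2015, aes(x = date, y = births)) +
  geom_point(color = "navy")

# layer_data() returns the data a layer draws, including its colour column
unique(layer_data(mapped_plot)$colour)   # a default palette colour, not "navy"
unique(layer_data(set_plot)$colour)      # "navy"
```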

How do we make this plot?

ggplot(data = Births2015,
       aes(x = date,
           y = births)) + 
  geom_line(aes(color = wday)) +      
  geom_point(color = "navy")  +         
  labs(title = "US Births in 2015")
  • ggplot() establishes the default data and aesthetics for the geoms, but each geom may change these defaults.

  • good practice: put into ggplot() the things that affect all (or most) of the layers; put the rest in geom_*()
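
A geom can override the default data as well as the default aesthetics. The sketch below is one possible illustration, not something from the original plots: the `weekends` subset is a hypothetical helper, and Births2015 is assumed to come from the mosaicData package.

```r
library(ggplot2)
library(mosaicData)   # provides Births2015

# Hypothetical subset for illustration: weekend days only
weekends <- subset(Births2015, wday %in% c("Sat", "Sun"))

p <- ggplot(data = Births2015, aes(x = date, y = births)) +
  geom_line(color = "gray60") +                       # inherits data and aes
  geom_point(data = weekends, aes(color = wday)) +    # own data, extra aesthetic
  labs(title = "US Births in 2015")
p
```

The line layer uses every day of the year, while the point layer draws only the rows in `weekends` but still inherits `x = date` and `y = births` from `ggplot()`.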

Setting vs. Mapping (again)

Information gets passed to the plot via:

  1. map the variable information inside the aes() (aesthetic) function

  2. set the non-variable information outside the aes() (aesthetic) function

Other geoms

apropos("^geom_")
 [1] "geom_abline"            "geom_area"              "geom_ash"              
 [4] "geom_bar"               "geom_bin_2d"            "geom_bin2d"            
 [7] "geom_blank"             "geom_boxplot"           "geom_col"              
[10] "geom_contour"           "geom_contour_filled"    "geom_count"            
[13] "geom_crossbar"          "geom_curve"             "geom_density"          
[16] "geom_density_2d"        "geom_density_2d_filled" "geom_density2d"        
[19] "geom_density2d_filled"  "geom_dotplot"           "geom_errorbar"         
[22] "geom_errorbarh"         "geom_freqpoly"          "geom_function"         
[25] "geom_hex"               "geom_histogram"         "geom_hline"            
[28] "geom_jitter"            "geom_label"             "geom_line"             
[31] "geom_linerange"         "geom_lm"                "geom_map"              
[34] "geom_path"              "geom_point"             "geom_pointrange"       
[37] "geom_polygon"           "geom_qq"                "geom_qq_line"          
[40] "geom_quantile"          "geom_rangeframe"        "geom_raster"           
[43] "geom_rect"              "geom_ribbon"            "geom_rug"              
[46] "geom_segment"           "geom_sf"                "geom_sf_label"         
[49] "geom_sf_text"           "geom_smooth"            "geom_spline"           
[52] "geom_spoke"             "geom_step"              "geom_text"             
[55] "geom_tile"              "geom_tufteboxplot"      "geom_violin"           
[58] "geom_vline"            

Other geoms

help pages will tell you their aesthetics, default stats, etc.

?geom_area             # for example

Let’s try geom_area

ggplot(data = Births2015,
       aes(x = date,
           y = births, 
           fill = wday)) + 
  geom_area() +
  labs(title = "US Births in 2015")

… not a good plot

  • overplotting is hiding much of the data
  • extending y-axis to 0 may or may not be desirable.

Side note: what makes a plot good?

Most (all?) graphics are intended to help us make comparisons

  • How does something change over time?
  • Do my treatments matter? How much?
  • Do treatment and control respond the same way?

Key plot metric

Does my plot make the comparisons I am interested in:

  • easily, and
  • accurately?

Time for some different data

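HELPrct: Health Evaluation and Linkage to Primary care randomized clinical trial. Subjects admitted for treatment for addiction to one of three substances. The plots that follow use a data frame called HELP_data, which appears to be HELPrct with an added children variable; the code that builds it is not shown on these slides.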

head(HELPrct)
age anysubstatus anysub cesd d1 daysanysub dayslink drugrisk e2b female sex g1b homeless i1 i2 id indtot linkstatus link mcs pcs pss_fr racegrp satreat sexrisk substance treat avg_drinks max_drinks hospitalizations
37 1 yes 49 3 177 225 0 NA 0 male yes housed 13 26 1 39 1 yes 25.111990 58.41369 0 black no 4 cocaine yes 13 26 3
37 1 yes 30 22 2 NA 0 NA 0 male yes homeless 56 62 2 43 NA NA 26.670307 36.03694 1 white no 7 alcohol yes 56 62 22
26 1 yes 39 0 3 365 20 NA 0 male no housed 0 0 3 41 0 no 6.762923 74.80633 13 black no 2 heroin no 0 0 0
39 1 yes 15 2 189 343 0 1 1 female no housed 5 5 4 28 0 no 43.967880 61.93168 11 white yes 4 heroin no 5 5 2
32 1 yes 39 12 2 57 0 1 0 male no homeless 10 13 5 38 1 yes 21.675755 37.34558 10 black no 6 cocaine no 10 13 12
47 1 yes 6 1 31 365 0 NA 1 female no housed 4 4 6 29 0 no 55.508991 46.47521 5 black no 5 cocaine yes 4 4 1

Who are the people in the study?

ggplot(data = HELP_data,
       aes(x = substance)) + 
  geom_bar() +
  labs(title = "HELP trial")
  • Hmm. What’s up with y?

    • stat_count() is being applied to the data before the geom_bar() gets to do its thing. Counting the rows in each category creates the y values.
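
To see where the y values come from, the counts that `geom_bar()` draws can be compared with a plain `table()`. This is a sketch using HELPrct from the mosaicData package rather than HELP_data (whose construction is not shown here); the `counts` data frame is an illustrative helper.

```r
library(ggplot2)
library(mosaicData)   # provides HELPrct

p <- ggplot(HELPrct, aes(x = substance)) + geom_bar()

# The bar heights are the counts computed by the layer's stat
layer_data(p)$count
table(HELPrct$substance)

# Same plot with the counts computed by hand: geom_col() uses stat_identity()
counts <- as.data.frame(table(substance = HELPrct$substance))
p_col <- ggplot(counts, aes(x = substance, y = Freq)) + geom_col()
p_col
```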

Who are the people in the study?

ggplot(data = HELP_data,
       aes(x = substance,
           fill = children)) + 
  geom_bar() +
  labs(title = "HELP trial")

Who are the people in the study?

ggplot(HELP_data,
       aes(x = substance,
           fill = children)) + 
  geom_bar(position = "fill") +
  labs(title = "HELP trial",
       y = "actually, proportion")
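
With `position = "fill"` the y-axis runs from 0 to 1. If percent labels are preferred, one option is to relabel the axis with the scales package. The sketch below assumes HELPrct from mosaicData and uses its `homeless` variable as the fill, since HELP_data is not defined on these slides.

```r
library(ggplot2)
library(mosaicData)   # provides HELPrct

p <- ggplot(HELPrct, aes(x = substance, fill = homeless)) +
  geom_bar(position = "fill") +                            # bars rescaled to height 1
  scale_y_continuous(labels = scales::label_percent()) +   # show 0.25 as 25%
  labs(title = "HELP trial", y = "percent of subjects")
p
```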

How old are people in the HELP study?

`stat_bin()` using `bins = 30`. Pick better value `binwidth`.

ggplot(data = HELP_data,
       aes(x = age)) + 
  geom_histogram() +
  labs(title = "HELP trial")

Notice the messages

  • stat_bin: Histograms are not mapping the raw data but binned data.
    stat_bin() performs the data transformation.

  • binwidth: a default binwidth has been selected, but we should really choose our own.

Setting the binwidth manually

ggplot(data = HELP_data,
       aes(x = age)) + 
  geom_histogram(binwidth=2) +
  labs(title = "HELP trial")

How old are people in the HELP study? – Other geoms

ggplot(data = HELP_data,
       aes(x = age)) + 
  geom_freqpoly(binwidth=2) +
  labs(title = "HELP clinical trial at detoxification unit")

ggplot(data = HELP_data,
       aes(x = age)) + 
  geom_density() +
  labs(title = "HELP clinical trial at detoxification unit")

Selecting stat and geom manually

Every geom comes with a default stat

  • for simple cases, the stat is stat_identity() which does nothing
  • we can mix and match geoms and stats however we like
ggplot(data = HELP_data,
       aes(x = age)) + 
  geom_line(stat = "density") +
  labs(title = "HELP clinical trial at detoxification unit")

Selecting stat and geom manually

Every stat comes with a default geom, every geom with a default stat

  • we can specify stats instead of geom, if we prefer
  • we can mix and match geoms and stats however we like
ggplot(data = HELP_data,
       aes(x = age)) + 
  stat_density(geom = "line") +
  labs(title = "HELP clinical trial at detoxification unit")
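
The two ways of writing the layer (geom with a stat, or stat with a geom) can be compared directly. This sketch assumes HELPrct from mosaicData in place of HELP_data; the object names are illustrative.

```r
library(ggplot2)
library(mosaicData)   # provides HELPrct

via_geom <- ggplot(HELPrct, aes(x = age)) + geom_line(stat = "density")
via_stat <- ggplot(HELPrct, aes(x = age)) + stat_density(geom = "line")

# Both layers pair the density stat with the line geom
class(via_geom$layers[[1]]$stat)[1]
class(via_stat$layers[[1]]$geom)[1]

# ... and they draw the same curve
all.equal(layer_data(via_geom)$y, layer_data(via_stat)$y)
```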

More combinations

ggplot(data = HELP_data,
       aes(x = age)) + 
  geom_point(stat = "bin", binwidth=3) + 
  geom_line(stat = "bin", binwidth=3)  +
  labs(title = "HELP clinical trial at detoxification unit")

More combinations

ggplot(data = HELP_data,
       aes(x = age)) + 
  geom_area(stat = "bin", binwidth=3)  +
  labs(title = "HELP clinical trial at detoxification unit")

More combinations

ggplot(data = HELP_data,
       aes(x = age)) + 
  geom_point(stat = "bin", 
             binwidth=3, 
             aes(size = after_stat(count))) +
  geom_line(stat = "bin", binwidth=3) +
  labs(title = "HELP clinical trial at detoxification unit")
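
A stat usually computes several variables, and any of them can be mapped to an aesthetic with `after_stat()` (available in ggplot2 3.3.0 and later). The sketch below, which assumes HELPrct from mosaicData, puts a histogram on the density scale so that a density curve can share its axis.

```r
library(ggplot2)
library(mosaicData)   # provides HELPrct

p <- ggplot(HELPrct, aes(x = age)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 2) +  # density, not count
  geom_density(color = "navy")
p

bins <- layer_data(p, 1)
names(bins)                # includes the computed variables count and density
sum(bins$density * 2)      # bar areas add up to 1 (binwidth = 2)
```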

How much drinking? (i1)

ggplot(data = HELP_data,
       aes(x = i1)) + 
  geom_histogram() +
  labs(title = "HELP clinical trial at detoxification unit")

How much drinking? (i1)

ggplot(data = HELP_data,
       aes(x = i1)) + 
  geom_density() +
  labs(title = "HELP clinical trial at detoxification unit")

How much drinking? (i1)

ggplot(data = HELP_data, 
       aes(x = i1)) + 
  geom_area(stat = "density") +
  labs(title = "HELP clinical trial at detoxification unit")

Covariates: Adding in more variables

Using color and linetype:

ggplot(data = HELP_data,
       aes(x = i1,
           color = substance,
           linetype = children)) + 
  geom_line(stat = "density") +
  labs(title = "HELP clinical trial at detoxification unit")

Using color and facets

ggplot(data = HELP_data,
       aes(x = i1, color = substance)) + 
  geom_line(stat = "density") + 
  facet_grid( . ~ children ) +
  labs(title = "HELP clinical trial at detoxification unit")

ggplot(data = HELP_data,
       aes(x = i1, color = substance)) + 
  geom_line(stat = "density") + 
  facet_grid( children ~ . ) +
  labs(title = "HELP clinical trial at detoxification unit")

Boxplots

Boxplots use stat_boxplot() (five number summary).

Traditionally the quantitative variable goes on y and a grouping variable on x; ggplot2 3.3.0 and later also accept the reverse orientation.

ggplot(data = HELP_data,
       aes(x = substance, y = age, color = children)) + 
  geom_boxplot() +
  labs(title = "HELP clinical trial at detoxification unit")
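
The summary that the boxplot stat computes can be inspected with `layer_data()`. This is a sketch assuming HELPrct from mosaicData in place of HELP_data, without the color grouping to keep the output short.

```r
library(ggplot2)
library(mosaicData)   # provides HELPrct

p <- ggplot(HELPrct, aes(x = substance, y = age)) + geom_boxplot()

# One row per box: whisker ends (ymin, ymax), hinges (lower, upper), median (middle)
layer_data(p)[, c("ymin", "lower", "middle", "upper", "ymax")]

# The middle line matches the group medians computed directly
tapply(HELPrct$age, HELPrct$substance, median)
```

Note that `ymin` and `ymax` are the whisker ends, which stop at the most extreme points within 1.5 IQR of the hinges, so they can differ from the raw minimum and maximum.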

Horizontal boxplots

  • coord_flip() may be used with other plots as well to reverse the roles of x and y on the plot.

ggplot(data = HELP_data,
       aes(x = substance, 
           y = age, 
           color = children)) + 
  geom_boxplot() +
  coord_flip() +
  labs(title = "HELP clinical trial at detoxification unit")

Axes scaling with boxplots

We can scale the continuous axis

ggplot(data = HELP_data,
       aes(x = substance, 
           y = age, 
           color = children)) + 
  geom_boxplot() +
  coord_trans(y = "exp") +
  labs(title = "HELP clinical trial at detoxification unit")
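
There are two places where an axis can be transformed, and they are not equivalent. A scale transformation such as `scale_y_log10()` is applied before the stat computes its summaries, whereas a coord transformation such as `coord_trans()` is applied afterwards and only changes how the finished summaries are drawn. The sketch below illustrates the scale side with a log transformation (a swapped-in choice for illustration), assuming HELPrct from mosaicData.

```r
library(ggplot2)
library(mosaicData)   # provides HELPrct

p_scale <- ggplot(HELPrct, aes(x = substance, y = age)) +
  geom_boxplot() +
  scale_y_log10()      # transform the data first, then compute the boxes
p_scale

# The stat has summarized log10(age), not age
layer_data(p_scale)$middle
tapply(log10(HELPrct$age), HELPrct$substance, median)
```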

Give me some space

We’ve triggered a new feature: dodge (for dodging things left/right). We can control how much if we set the dodge manually.

ggplot(data = HELP_data,
       aes(x = substance, 
           y = age, 
           color = children)) + 
  geom_boxplot(position = position_dodge(width=1)) +
  labs(title = "HELP clinical trial at detoxification unit")

Issues with bigger data

  • Although we can see a generally positive association (as we would expect), the overplotting may be hiding information.
library(NHANES)
dim(NHANES)
[1] 10000    76
head(NHANES)
ID SurveyYr Gender Age AgeDecade AgeMonths Race1 Race3 Education MaritalStatus HHIncome HHIncomeMid Poverty HomeRooms HomeOwn Work Weight Length HeadCirc Height BMI BMICatUnder20yrs BMI_WHO Pulse BPSysAve BPDiaAve BPSys1 BPDia1 BPSys2 BPDia2 BPSys3 BPDia3 Testosterone DirectChol TotChol UrineVol1 UrineFlow1 UrineVol2 UrineFlow2 Diabetes DiabetesAge HealthGen DaysPhysHlthBad DaysMentHlthBad LittleInterest Depressed nPregnancies nBabies Age1stBaby SleepHrsNight SleepTrouble PhysActive PhysActiveDays TVHrsDay CompHrsDay TVHrsDayChild CompHrsDayChild Alcohol12PlusYr AlcoholDay AlcoholYear SmokeNow Smoke100 Smoke100n SmokeAge Marijuana AgeFirstMarij RegularMarij AgeRegMarij HardDrugs SexEver SexAge SexNumPartnLife SexNumPartYear SameSex SexOrientation PregnantNow
51624 2009_10 male 34 30-39 409 White NA High School Married 25000-34999 30000 1.36 6 Own NotWorking 87.4 NA NA 164.7 32.22 NA 30.0_plus 70 113 85 114 88 114 88 112 82 NA 1.29 3.49 352 NA NA NA No NA Good 0 15 Most Several NA NA NA 4 Yes No NA NA NA NA NA Yes NA 0 No Yes Smoker 18 Yes 17 No NA Yes Yes 16 8 1 No Heterosexual NA
51624 2009_10 male 34 30-39 409 White NA High School Married 25000-34999 30000 1.36 6 Own NotWorking 87.4 NA NA 164.7 32.22 NA 30.0_plus 70 113 85 114 88 114 88 112 82 NA 1.29 3.49 352 NA NA NA No NA Good 0 15 Most Several NA NA NA 4 Yes No NA NA NA NA NA Yes NA 0 No Yes Smoker 18 Yes 17 No NA Yes Yes 16 8 1 No Heterosexual NA
51624 2009_10 male 34 30-39 409 White NA High School Married 25000-34999 30000 1.36 6 Own NotWorking 87.4 NA NA 164.7 32.22 NA 30.0_plus 70 113 85 114 88 114 88 112 82 NA 1.29 3.49 352 NA NA NA No NA Good 0 15 Most Several NA NA NA 4 Yes No NA NA NA NA NA Yes NA 0 No Yes Smoker 18 Yes 17 No NA Yes Yes 16 8 1 No Heterosexual NA
51625 2009_10 male 4 0-9 49 Other NA NA NA 20000-24999 22500 1.07 9 Own NA 17.0 NA NA 105.4 15.30 NA 12.0_18.5 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA No NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 4 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
51630 2009_10 female 49 40-49 596 White NA Some College LivePartner 35000-44999 40000 1.91 5 Rent NotWorking 86.7 NA NA 168.4 30.57 NA 30.0_plus 86 112 75 118 82 108 74 116 76 NA 1.16 6.70 77 0.094 NA NA No NA Good 0 10 Several Several 2 2 27 8 Yes No NA NA NA NA NA Yes 2 20 Yes Yes Smoker 38 Yes 18 No NA Yes Yes 12 10 1 Yes Heterosexual NA
51638 2009_10 male 9 0-9 115 White NA NA NA 75000-99999 87500 1.84 6 Rent NA 29.8 NA NA 133.1 16.82 NA 12.0_18.5 82 86 47 84 50 84 50 88 44 NA 1.34 4.86 123 1.538 NA NA No NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 5 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

Overplotting

ggplot(data = NHANES,
       aes(x = Height, y = Weight)) +
  geom_point() + 
  facet_grid( Gender ~ PregnantNow )

Using alpha (opacity)

One way to deal with overplotting is to set the opacity low.

ggplot(data = NHANES,
       aes(x = Height, y = Weight)) +
  geom_point(alpha = 0.01) + 
  facet_grid( Gender ~ PregnantNow )

geom_density2d

Alternatively (or simultaneously) we might prefer a different geom altogether.

ggplot(data = NHANES,
       aes(x = Height, y = Weight)) +
  geom_density2d() + 
  facet_grid( Gender ~ PregnantNow )

Multiple layers

ggplot(data = HELP_data, 
       aes(x = children, y = age)) +
  geom_boxplot(outlier.size = 0) +
  geom_point(alpha = .6) +
  coord_flip() +
  labs(title = "HELP clinical trial at detoxification unit")

ggplot(data = HELP_data,
       aes(x = children, y = age)) +
  geom_boxplot(outlier.size = 0) +
  geom_jitter(alpha = .6, width=0.1) +
  coord_flip() +
  labs(title = "HELP clinical trial at detoxification unit")

Multiple layers

ggplot(data = HELP_data,
       aes(x = children, y = age)) +
  geom_boxplot(outlier.size = 0) +
  geom_point(alpha = .6, 
             position = position_jitter(width = .1, height = 0)) +
  coord_flip() +
  labs(title = "HELP clinical trial at detoxification unit")
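
Jitter is random, so the plot changes slightly on every render. If a reproducible figure is needed, `position_jitter()` accepts a `seed`. The sketch below assumes HELPrct from mosaicData with `homeless` standing in for the grouping variable; the seed value is arbitrary, and `outlier.shape = NA` is used here as one way to hide the boxplot's own outlier points.

```r
library(ggplot2)
library(mosaicData)   # provides HELPrct

p <- ggplot(HELPrct, aes(x = homeless, y = age)) +
  geom_boxplot(outlier.shape = NA) +     # points layer already shows every value
  geom_point(alpha = .6,
             position = position_jitter(width = .1, height = 0, seed = 154)) +
  coord_flip()
p

pts <- layer_data(p, 2)
x_noise <- as.numeric(pts$x) - round(as.numeric(pts$x))
range(x_noise)        # noise along the group axis stays within +/- 0.1
```

Setting `height = 0` keeps the ages themselves untouched; only the position along the categorical axis is perturbed.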

Things I haven’t mentioned (much)

  • coords (coord_flip() is good to know about)

  • themes (for customizing appearance)

  • position (position_dodge(), position_jitterdodge() (for use with points on top of dodged boxplots), position_stack(), etc.)

  • transforming axes

themes

library(ggthemes)
ggplot(Births2015, aes(x = date, y = births)) + 
  geom_point() + 
  theme_wsj()
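
Themes can also be customized without an extra package: start from one of ggplot2's built-in complete themes and then adjust individual elements with `theme()`. This is a sketch with illustrative choices of elements, assuming Births2015 from mosaicData.

```r
library(ggplot2)
library(mosaicData)   # provides Births2015

p <- ggplot(Births2015, aes(x = date, y = births)) +
  geom_point() +
  labs(title = "US Births in 2015") +
  theme_minimal(base_size = 14) +                    # complete built-in theme
  theme(plot.title = element_text(face = "bold"),    # then adjust single elements
        panel.grid.minor = element_blank())
p
```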

jitterdodge()

ggplot(data = HELP_data, 
       aes(x = substance, y = age, color = children)) +
  geom_boxplot(position = position_dodge(width = 1)) +
  geom_point(aes(color = children, 
                 fill = children), 
             position = position_jitterdodge(dodge.width = 1, jitter.width = 0.1), 
             size = 0.5) +
  labs(title = "HELP clinical trial at detoxification unit")

A little bit of everything

ggplot(data = HELP_data, aes(x = substance, y = age, color = children)) +
  geom_boxplot(position = position_dodge(width=1)) +
  geom_point(aes(fill = children), 
             alpha = .5, 
             position = position_jitterdodge(dodge.width = 1, jitter.width = 0.2)) + 
  facet_wrap(~homeless) +
  labs(title = "HELP clinical trial at detoxification unit")

Want to learn more?

What’s around the corner?

shiny

  • interactive graphics / modeling

  • https://shiny.rstudio.com/

plotly

Plotly is an R package for creating interactive web-based graphs via plotly’s JavaScript graphing library, plotly.js. The plotly R library contains the ggplotly function, which will convert ggplot2 figures into a Plotly object. Furthermore, you have the option of manipulating the Plotly object with the style function.

  • https://plot.ly/ggplot2/getting-started/
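
As a rough sketch of the conversion, an existing ggplot2 object can be passed to `ggplotly()`. This assumes the plotly package is installed and that Births2015 comes from mosaicData; the object names are illustrative.

```r
library(ggplot2)
library(plotly)
library(mosaicData)   # provides Births2015

p <- ggplot(Births2015, aes(x = date, y = births, color = wday)) +
  geom_point() +
  labs(title = "US Births in 2015")

interactive_plot <- ggplotly(p)   # convert the static plot to an htmlwidget
interactive_plot                  # hover for values; click legend entries to toggle
```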

gganimate

Examples in the wild

  • Advice
  • Fonts
  • NYT often does data viz quite well
  • W.E.B. Du Bois

Preliminaries

  1. Make the data stand out

  2. Facilitate comparison

  3. Add information

(Nolan and Perrett, 2016)

Fonts matter

image credit: Will Chase RStudio::conf 2020

Advice on plotting, specific

  • Avoid having other graph elements interfere with data
  • Use visually prominent symbols
  • Avoid over-plotting (One way to avoid over plotting: jitter the values)
  • Different values of data may obscure each other
  • Include all or nearly all of the data
  • Fill data region

Advice on plotting, general

  • Eliminate superfluous material
  • Facilitate comparisons
  • Choose the best scale
  • Make the plot data / information rich
  • Use good captions, alt text, conclusions

Simplify

A gif of a barplot which starts out cluttered with labels and slowly becomes simplified with the relevant information highlighted.

image credit: https://www.darkhorseanalytics.com/portfolio-data-looks-better-naked

Simplified

The before and after images with the process of simplifying a barplot.

image credit: https://www.darkhorseanalytics.com/portfolio-data-looks-better-naked

NYT 9/7/21

A scatterplot showing that states with higher vaccination rates have lower COVID case rates. A few states are highlighted in stronger font: NY, CA, MA have low COVID rates and high vaccination rates; SC, GA, ID have high COVID rates and low vaccination rates; TX and USA are in the middle with medium vaccination and medium COVID rates.

One in 5,000, NYT, D. Leonhardt 9/7/21; image credit: https://www.nytimes.com/2021/09/07/briefing/risk-breakthrough-infections-delta.html
  • lighter grid lines
  • no extra information
  • good caption
  • regression line to give context to the trend
  • y-axis label horizontal, not vertical
  • a few states (and the US) are highlighted to draw the reader’s eye

W.E.B. Du Bois

One of the great early data viz pioneers. Remarkable ability to convey information.

Worth a Mention

W.E.B. Du Bois (1868-1963)

  • sociologist
  • data scientist

image of WEB Du Bois

image credit: wikipedia

In 1900 Du Bois contributed approximately 60 data visualizations to an exhibit at the Exposition Universelle in Paris, an exhibit designed to illustrate the progress made by African Americans since the end of slavery (only 37 years prior, in 1863).

Beautiful & Informative Graphics

https://drawingmatter.org/w-e-b-du-bois-visionary-infographics/

figures from Du Bois's 1900 exhibition
