Data Viz & Introduction to ggplot2

August 28 + September 4, 2024

Jo Hardin

Agenda 8/28/24

  1. GitHub
  2. NSSD
  3. grammar of graphics
  4. ggplot

Important

Before next Wednesday, read: Tufte. 1997. Visual and Statistical Thinking: Displays of Evidence for Making Decisions. (Use Google to find it.)

NSSD:

  1. What was Hilary trying to answer in her data collection?

  2. Name two of Hilary’s main hurdles in gathering accurate data.

  3. Which is better: high touch (manual) or low touch (automatic) data collection? Why?

  4. What additional covariates are needed / desired? Any problems with them?

  5. How much data does she need?

Graphics

Grammar of graphics

Yau (2013) gives us nine visual cues, and Wickham (2014) translates them into a language using ggplot2.

  1. Visual Cues: the aspects of the figure where we should focus.
    Position (numerical) where in relation to other things?
    Length (numerical) how big (in one dimension)?
    Angle (numerical) how wide? parallel to something else?
    Direction (numerical) at what slope? In a time series, going up or down?
    Shape (categorical) belonging to what group?
    Area (numerical) how big (in two dimensions)? Beware of improper scaling!
    Volume (numerical) how big (in three dimensions)? Beware of improper scaling!
    Shade (either) to what extent? how severely?
    Color (either) to what extent? how severely? Beware of red/green color blindness.

  2. Coordinate System: rectangular, polar, geographic, etc.

  3. Scale: numeric (linear? logarithmic?), categorical (ordered?), time

  4. Context: in comparison to what (think back to ideas from Tufte)
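
The "beware of improper scaling" warning for Area and Volume can be made concrete with a little arithmetic. The sketch below is only an illustration: the two values are made up, and the symbols are assumed to be circles. It compares scaling the radius by the data value with scaling the area by the data value.

```r
# Two hypothetical data values; the second is twice the first.
value <- c(1, 2)

# Misleading: radius proportional to the value, so area grows with the square.
area_wrong <- pi * value^2
ratio_wrong <- area_wrong[2] / area_wrong[1]

# Faithful: area proportional to the value, so radius grows with the square root.
radius_right <- sqrt(value / pi)
area_right <- pi * radius_right^2
ratio_right <- area_right[2] / area_right[1]

# Ratio of the second symbol's area to the first under each choice.
c(wrong = ratio_wrong, right = ratio_right)
```

When the radius tracks the data, a value that doubles is drawn with four times the ink; when the area tracks the data, the ink doubles as well.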

Pieces of the Graph

Visual Cues of Yau (2013):
Position (numerical)
Length (numerical)
Angle (numerical)
Direction (numerical)
Shape (categorical)
Area (numerical)
Volume (numerical)
Shade (either)
Color (either)

Order Matters

Cues Together

Attributes

Attributes can focus your reader’s attention.

Agenda 9/4/24

  1. Thoughts on plotting
  2. ggplot
  3. Tufte

Advice for Plotting

  • Basic plotting
    • Avoid having other graph elements interfere with data
    • Use visually prominent symbols
    • Avoid over-plotting (one way to avoid over-plotting: jitter the values)
    • Different values of data may obscure each other
    • Include all or nearly all of the data
    • Fill data region

  • Eliminate superfluous material
    • Chart junk & stuff that adds no meaning, e.g. butterflies on top of barplots, background images
    • Extra tick marks and grid lines
    • Unnecessary text and arrows
    • Decimal places beyond the measurement error or the level of difference

  • Facilitate comparisons
    • Put juxtaposed plots on same scale
    • Make it easy to distinguish elements of superposed plots (e.g. color)
    • Emphasize the important difference
    • Comparison: volume, area, height (be careful, volume can seem bigger than you mean it to)

  • Choosing the scale
    • Keep scales on x and y axes the same for both plots to facilitate the comparison
    • Zoom in to focus on the region that contains the bulk of the data
    • Keep the scale the same throughout the plot (i.e. don’t change it mid-axis)
    • Origin need not be on the scale
    • Choose a scale that improves resolution
    • Avoid jiggling the baseline

  • How to make a plot information rich
    • Describe what you see in the caption
    • Add context with reference markers (lines and points) including text
    • Add legends and labels
    • Use color and plotting symbols to add more information
    • Plot the same thing more than once in different ways/scales
    • Reduce clutter

  • Captions should
    • Be comprehensive
    • Be self-contained
    • Describe what has been graphed
    • Draw attention to important features
    • Describe conclusions drawn from graph

  • Good Plot Making Practice
    • Put major conclusions in graphical form
    • Provide reference information
    • Proofread for clarity and consistency
    • Graphing is an iterative process
    • Multiplicity is OK, i.e. two plots of the same variable may provide different messages
    • Make plots data rich

Goals of ggplot2

What I will try to do

  • give a tour of ggplot2

  • explain how to think about plots the ggplot2 way

  • prepare/encourage you to learn more later

What I can’t do in one session

  • show every bell and whistle

  • make you an expert at using ggplot2

Getting help

  1. One of the best ways to get started with ggplot is to Google what you want to do along with the word ggplot, then look through the images that come up. More often than not, the associated code is there. There are also ggplot galleries of images; one of them is here: https://plot.ly/ggplot2/

  2. Look at the end of this presentation and the syllabus. More help options there.

What are the visual cues on this plot?

  • position
  • length
  • shape
  • area/volume
  • shade/color

The grammar of graphics ggplot

geom: the geometric “shape” used to display data

  • bar, point, line, ribbon, text, etc.

aesthetic: an attribute controlling how geom is displayed with respect to variables

  • x position, y position, color, fill, shape, size, etc.

guide: helps user convert visual data back into raw data (legends, axes)

stat: a transformation applied to data before geom gets it

  • example: histograms work on binned data
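
To see how geoms and stats are paired, one option is to ask a few layers which stat they carry by default. This is a minimal sketch, not part of the original slides: `default_stat()` is a hypothetical helper, and it relies on the layer object returned by each `geom_*()` call having a `stat` field.

```r
library(ggplot2)

# Hypothetical helper: report the class name of the stat attached to a layer.
default_stat <- function(layer) class(layer$stat)[1]

# Each geom_*() call returns a layer that records its geom and its stat.
default_stats <- c(
  point     = default_stat(geom_point()),
  bar       = default_stat(geom_bar()),
  histogram = default_stat(geom_histogram()),
  boxplot   = default_stat(geom_boxplot())
)

default_stats
```

Points pass the data through unchanged, while bars, histograms, and boxplots each summarize the data before anything is drawn.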

Set up

library(mosaic)
data(Births78)
head(Births78) 
      date births wday year month day_of_year day_of_month day_of_week
1978-01-01   7701  Sun 1978     1           1            1           1
1978-01-02   7527  Mon 1978     1           2            2           2
1978-01-03   8825  Tue 1978     1           3            3           3
1978-01-04   8859  Wed 1978     1           4            4           4
1978-01-05   9043  Thu 1978     1           5            5           5
1978-01-06   9208  Fri 1978     1           6            6           6

How do we make this plot?

Two Questions:

  1. What do we want R to do? (What is the goal?)

  2. What does R need to know?

How do we make this plot?

  1. Goal: scatterplot = a plot with points

  2. What does R need to know?

    • data source: Births78

    • aesthetics:

      • date -> x
      • births -> y
      • points (!)

How do we make this plot?

ggplot(data = Births78, 
       aes(x = date, y = births)) + 
  geom_point() +
  labs(title = "US Births in 1978")

ggplot() +
  geom_point(data = Births78, 
             aes(x = date, y = births)) +
  labs(title = "US Births in 1978")

Layers

Layer 1

ggplot(data = Births78, 
       aes(x = date, y = births)) 

Layers

Layer 2

ggplot(data = Births78, 
       aes(x = date, y = births)) + 
  geom_point()

Layers

Layer 3

ggplot(data = Births78, 
       aes(x = date, y = births)) + 
  geom_point() +
  labs(title = "US Births in 1978")

How do we make this plot?

What has changed?

  • new aesthetic: mapping color to day of week

How do we make this plot?

ggplot(data = Births78,
       aes(x = date,
           y = births, 
           color = wday)) +
  geom_point() +
  labs(title = "US Births in 1978")

How do we make this plot?

lines instead of dots!

ggplot(data = Births78,
         aes(x = date, 
             y = births,
             color = wday)) +
  geom_line() +
  labs(title = "US Births in 1978")

How do we make this plot?

Now there are two layers: one with points and one with lines

ggplot(data = Births78,
       aes(x = date,
           y = births,
           color = wday)) + 
  geom_point() +  
  geom_line() +
  labs(title = "US Births in 1978")
  • The layers are placed one on top of the other: the points are below and the lines are above.

  • data and aes specified in ggplot() affect all geoms

What does this code do?

ggplot(data = Births78,
       aes(x = date, y = births, color = "navy")) + 
  geom_point() +
  labs(title = "US Births in 1978") 

This is mapping the color aesthetic to a new variable with only one value (“navy”).
So all the dots get set to the same color, but it’s not navy.

Setting vs. Mapping

If we want to set the color to be navy for all of the dots, we do it outside the aes() designation:

ggplot(data = Births78,
       aes(x = date, y = births)) +   # map variables 
  geom_point(color = "navy")      +   # set attributes
  labs(title = "US Births in 1978")
  • Note that color = "navy" is now outside of the aesthetics list. aes() is how ggplot2 distinguishes between mapping and setting.

How do we make this plot?

ggplot(data = Births78,
       aes(x = date,
           y = births)) + 
  geom_line(aes(color = wday)) +      
  geom_point(color = "navy")  +         
  labs(title = "US Births in 1978")
  • ggplot() establishes the default data and aesthetics for the geoms, but each geom may change these defaults.

  • good practice: put into ggplot() the things that affect all (or most) of the layers; put the rest in geom_XXXX()

Setting vs. Mapping (again)

Information gets passed to the plot via:

  1. map the variable information inside the aes() (aesthetic) function

  2. set the non-variable information outside the aes() (aesthetic) function
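
The difference between the two routes can be checked by looking at the colours ggplot2 actually assigns to the points. The sketch below is an illustration rather than the slides' own example: it uses the built-in `mtcars` data so that it runs with only ggplot2 installed, and it uses `layer_data()` to pull out the computed aesthetics.

```r
library(ggplot2)

# Mapping: "navy" inside aes() is treated as data (a variable with one value).
mapped <- ggplot(mtcars, aes(x = wt, y = mpg, colour = "navy")) +
  geom_point()

# Setting: "navy" outside aes() is treated as a literal colour.
set <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(colour = "navy")

# Colours that end up on the points in each plot.
mapped_colour <- unique(layer_data(mapped)$colour)
set_colour <- unique(layer_data(set)$colour)

mapped_colour  # a colour chosen by the default discrete colour scale
set_colour     # the colour that was set directly
```

In the mapped version the colour scale picks the colour (and a legend appears); in the set version the points are drawn in the colour that was requested.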

Other geoms

apropos("^geom_")
 [1] "geom_abline"            "geom_area"              "geom_ash"              
 [4] "geom_bar"               "geom_bin_2d"            "geom_bin2d"            
 [7] "geom_blank"             "geom_boxplot"           "geom_col"              
[10] "geom_contour"           "geom_contour_filled"    "geom_count"            
[13] "geom_crossbar"          "geom_curve"             "geom_density"          
[16] "geom_density_2d"        "geom_density_2d_filled" "geom_density2d"        
[19] "geom_density2d_filled"  "geom_dotplot"           "geom_errorbar"         
[22] "geom_errorbarh"         "geom_freqpoly"          "geom_function"         
[25] "geom_hex"               "geom_histogram"         "geom_hline"            
[28] "geom_jitter"            "geom_label"             "geom_line"             
[31] "geom_linerange"         "geom_lm"                "geom_map"              
[34] "geom_path"              "geom_point"             "geom_pointrange"       
[37] "geom_polygon"           "geom_qq"                "geom_qq_line"          
[40] "geom_quantile"          "geom_rangeframe"        "geom_raster"           
[43] "geom_rect"              "geom_ribbon"            "geom_rug"              
[46] "geom_segment"           "geom_sf"                "geom_sf_label"         
[49] "geom_sf_text"           "geom_smooth"            "geom_spline"           
[52] "geom_spoke"             "geom_step"              "geom_text"             
[55] "geom_tile"              "geom_tufteboxplot"      "geom_violin"           
[58] "geom_vline"            

Other geoms

help pages will tell you their aesthetics, default stats, etc.

?geom_area             # for example

Let’s try geom_area

ggplot(data = Births78,
       aes(x = date,
           y = births, 
           fill = wday)) + 
  geom_area() +
  labs(title = "US Births in 1978")

Let’s try geom_area

ggplot(data = Births78,
       aes(x = date, y = births, fill = wday)) + 
  geom_area() +
  labs(title = "US Births in 1978")

… not a good plot

  • overplotting is hiding much of the data
  • extending y-axis to 0 may or may not be desirable.

Side note: what makes a plot good?

Most (all?) graphics are intended to help us make comparisons

  • How does something change over time?
  • Do my treatments matter? How much?
  • Do treatment and control respond the same way?

Key plot metric

Does my plot make the comparisons I am interested in:

  • easily, and
  • accurately?

Time for some different data

HELPrct: Health Evaluation and Linkage to Primary care randomized clinical trial. Subjects admitted for treatment for addiction to one of three substances.

head(HELPrct)
age anysubstatus anysub cesd d1 daysanysub dayslink drugrisk e2b female sex g1b homeless i1 i2 id indtot linkstatus link mcs pcs pss_fr racegrp satreat sexrisk substance treat avg_drinks max_drinks hospitalizations
37 1 yes 49 3 177 225 0 NA 0 male yes housed 13 26 1 39 1 yes 25.111990 58.41369 0 black no 4 cocaine yes 13 26 3
37 1 yes 30 22 2 NA 0 NA 0 male yes homeless 56 62 2 43 NA NA 26.670307 36.03694 1 white no 7 alcohol yes 56 62 22
26 1 yes 39 0 3 365 20 NA 0 male no housed 0 0 3 41 0 no 6.762923 74.80633 13 black no 2 heroin no 0 0 0
39 1 yes 15 2 189 343 0 1 1 female no housed 5 5 4 28 0 no 43.967880 61.93168 11 white yes 4 heroin no 5 5 2
32 1 yes 39 12 2 57 0 1 0 male no homeless 10 13 5 38 1 yes 21.675755 37.34558 10 black no 6 cocaine no 10 13 12
47 1 yes 6 1 31 365 0 NA 1 female no housed 4 4 6 29 0 no 55.508991 46.47521 5 black no 5 cocaine yes 4 4 1

Who are the people in the study?

ggplot(data = HELP_data,
       aes(x = substance)) + 
  geom_bar() +
  labs(title = "HELP trial")
  • Hmm. What’s up with y?

    • stat_count() is being applied to the data before geom_bar() gets to do its thing. Counting the rows in each category creates the y values.
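
To check where the bar heights come from, the values computed for the bar layer can be compared with a plain table of counts. In current ggplot2 releases geom_bar() is paired with stat_count(). This is a sketch under assumptions: it uses the `mpg` data that ships with ggplot2 rather than the HELP data, and `layer_data()` to extract what the stat computed.

```r
library(ggplot2)

# Values the bar layer computes before drawing (one row per bar).
bar_layer <- layer_data(
  ggplot(mpg, aes(x = class)) + geom_bar()
)

# The same counts computed by hand.
manual_counts <- as.vector(table(mpg$class))

bar_layer$count
manual_counts
```

The y values were never in the data; the stat created them by counting rows in each category.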

Who are the people in the study?

ggplot(data = HELP_data,
       aes(x = substance,
           fill = children)) + 
  geom_bar() +
  labs(title = "HELP trial")

Who are the people in the study?

ggplot(HELP_data,
       aes(x = substance,
           fill = children)) + 
  geom_bar(position = "fill") +
  labs(title = "HELP trial",
       y = "actually, percent")

How old are people in the HELP study?

ggplot(data = HELP_data,
       aes(x = age)) + 
  geom_histogram() +
  labs(title = "HELP trial")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Notice the messages

  • stat_bin: Histograms are not mapping the raw data but binned data.
    stat_bin() performs the data transformation.

  • binwidth: a default binwidth has been selected, but we should really choose our own.
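
The binned data that the histogram draws can be inspected directly. The sketch below is an illustration, not the slides' own example: it assumes the built-in `faithful` data and a binwidth of 5, and uses `layer_data()` to show the table that stat_bin() hands to the geom.

```r
library(ggplot2)

# Binned data behind a histogram with a manually chosen binwidth.
hist_layer <- layer_data(
  ggplot(faithful, aes(x = waiting)) + geom_histogram(binwidth = 5)
)

# Each row is one bar: its left edge, right edge, and height.
hist_layer[, c("xmin", "xmax", "count")]

# Every observation lands in exactly one bin.
total <- sum(hist_layer$count)
total
```

Changing `binwidth` changes this table, which is why the choice of binwidth can change the story the histogram tells.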

Setting the binwidth manually

ggplot(data = HELP_data,
       aes(x = age)) + 
  geom_histogram(binwidth=2) +
  labs(title = "HELP trial")

How old are people in the HELP study? – Other geoms

ggplot(data = HELP_data,
       aes(x = age)) + 
  geom_freqpoly(binwidth=2) +
  labs(title = "HELP clinical trial at detoxification unit")

ggplot(data = HELP_data,
       aes(x = age)) + 
  geom_density() +
  labs(title = "HELP clinical trial at detoxification unit")

Selecting stat and geom manually

Every geom comes with a default stat

  • for simple cases, the stat is stat_identity() which does nothing
  • we can mix and match geoms and stats however we like
ggplot(data = HELP_data,
       aes(x = age)) + 
  geom_line(stat = "density") +
  labs(title = "HELP clinical trial at detoxification unit")

Selecting stat and geom manually

Every stat comes with a default geom, every geom with a default stat

  • we can specify stats instead of geom, if we prefer
  • we can mix and match geoms and stats however we like
ggplot(data = HELP_data,
       aes(x = age)) + 
  stat_density(geom = "line") +
  labs(title = "HELP clinical trial at detoxification unit")
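
The two routes above, a geom with a non-default stat and a stat with a non-default geom, should describe the same curve. A minimal check, assuming the built-in `faithful` data in place of the HELP data and `layer_data()` to extract the computed densities:

```r
library(ggplot2)

base <- ggplot(faithful, aes(x = waiting))

# Route 1: start from the geom and swap in the density stat.
via_geom <- layer_data(base + geom_line(stat = "density"))

# Route 2: start from the stat and swap in the line geom.
via_stat <- layer_data(base + stat_density(geom = "line"))

# Compare the density estimates computed by each route.
same_curve <- isTRUE(all.equal(via_geom$density, via_stat$density))
same_curve
```

Which route to write is mostly a matter of which reads more naturally for the plot at hand.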

More combinations

ggplot(data = HELP_data,
       aes(x = age)) + 
  geom_point(stat = "bin", binwidth=3) + 
  geom_line(stat = "bin", binwidth=3)  +
  labs(title = "HELP clinical trial at detoxification unit")

More combinations

ggplot(data = HELP_data,
       aes(x = age)) + 
  geom_area(stat = "bin", binwidth=3)  +
  labs(title = "HELP clinical trial at detoxification unit")

More combinations

ggplot(data = HELP_data,
       aes(x = age)) + 
  geom_point(stat = "bin", 
             binwidth=3, 
             aes(size = after_stat(count))) +  # after_stat() replaces the deprecated ..count.. notation
  geom_line(stat = "bin", binwidth=3) +
  labs(title = "HELP clinical trial at detoxification unit")

How much drinking? (i1)

ggplot(data = HELP_data,
       aes(x = i1)) + geom_histogram() +
  labs(title = "HELP clinical trial at detoxification unit")

How much drinking? (i1)

ggplot(data = HELP_data,
       aes(x = i1)) + 
  geom_density() +
  labs(title = "HELP clinical trial at detoxification unit")

How much drinking? (i1)

ggplot(data = HELP_data, 
       aes(x = i1)) + 
  geom_area(stat = "density") +
  labs(title = "HELP clinical trial at detoxification unit")

Covariates: Adding in more variables

Using color and linetype:

ggplot(data = HELP_data,
       aes(x = i1,
           color = substance,
           linetype = children)) + 
  geom_line(stat = "density") +
  labs(title = "HELP clinical trial at detoxification unit")

Using color and facets

ggplot(data = HELP_data,
       aes(x = i1, color = substance)) + 
  geom_line(stat = "density") + 
  facet_grid( . ~ children ) +
  labs(title = "HELP clinical trial at detoxification unit")

ggplot(data = HELP_data,
       aes(x = i1, color = substance)) + 
  geom_line(stat = "density") + 
  facet_grid( children ~ . ) +
  labs(title = "HELP clinical trial at detoxification unit")

Boxplots

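Boxplots use stat_boxplot(), which computes the five number summary (and flags outliers).

Conventionally the quantitative variable is y, with an additional (categorical) x variable that defines the groups. Since ggplot2 3.3.0 the orientation is inferred from the aesthetics, so the roles of x and y may also be swapped.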

ggplot(data = HELP_data,
       aes(x = substance, y = age, color = children)) + 
  geom_boxplot() +
  labs(title = "HELP clinical trial at detoxification unit")
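
The summary values behind each box can be inspected in the same way as for histograms. The sketch below is an illustration under assumptions: it uses the built-in `mtcars` data rather than the HELP data, and `layer_data()` to pull out what the boxplot stat computed.

```r
library(ggplot2)

# Summary values computed for each box (one row per group).
box_layer <- layer_data(
  ggplot(mtcars, aes(x = factor(cyl), y = mpg)) + geom_boxplot()
)

# ymin and ymax are the whisker ends, so they exclude any flagged outliers.
box_layer[, c("ymin", "lower", "middle", "upper", "ymax")]

# The middle line of each box is the group median.
manual_median <- as.numeric(tapply(mtcars$mpg, mtcars$cyl, median))
manual_median
```

The raw observations are reduced to a handful of numbers per group, which is what makes boxplots compact and also what they hide.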

Horizontal boxplots

  • coord_flip() may be used with other plots as well to reverse the roles of x and y on the plot.

ggplot(data = HELP_data,
       aes(x = substance, 
           y = age, 
           color = children)) + 
  geom_boxplot() +
  coord_flip() +
  labs(title = "HELP clinical trial at detoxification unit")

Axes scaling with boxplots

We can scale the continuous axis

ggplot(data = HELP_data,
       aes(x = substance, 
           y = age, 
           color = children)) + 
  geom_boxplot() +
  coord_trans(y = "exp") +
  labs(title = "HELP clinical trial at detoxification unit")

Give me some space

We’ve triggered a new feature: dodge (for dodging things left/right). We can control how much if we set the dodge manually.

ggplot(data = HELP_data,
       aes(x = substance, 
           y = age, 
           color = children)) + 
  geom_boxplot(position = position_dodge(width=1)) +
  labs(title = "HELP clinical trial at detoxification unit")

Issues with bigger data

  • Although we can see a generally positive association (as we would expect), the overplotting may be hiding information.
library(NHANES)
dim(NHANES)
[1] 10000    76
head(NHANES)
ID SurveyYr Gender Age AgeDecade AgeMonths Race1 Race3 Education MaritalStatus HHIncome HHIncomeMid Poverty HomeRooms HomeOwn Work Weight Length HeadCirc Height BMI BMICatUnder20yrs BMI_WHO Pulse BPSysAve BPDiaAve BPSys1 BPDia1 BPSys2 BPDia2 BPSys3 BPDia3 Testosterone DirectChol TotChol UrineVol1 UrineFlow1 UrineVol2 UrineFlow2 Diabetes DiabetesAge HealthGen DaysPhysHlthBad DaysMentHlthBad LittleInterest Depressed nPregnancies nBabies Age1stBaby SleepHrsNight SleepTrouble PhysActive PhysActiveDays TVHrsDay CompHrsDay TVHrsDayChild CompHrsDayChild Alcohol12PlusYr AlcoholDay AlcoholYear SmokeNow Smoke100 Smoke100n SmokeAge Marijuana AgeFirstMarij RegularMarij AgeRegMarij HardDrugs SexEver SexAge SexNumPartnLife SexNumPartYear SameSex SexOrientation PregnantNow
51624 2009_10 male 34 30-39 409 White NA High School Married 25000-34999 30000 1.36 6 Own NotWorking 87.4 NA NA 164.7 32.22 NA 30.0_plus 70 113 85 114 88 114 88 112 82 NA 1.29 3.49 352 NA NA NA No NA Good 0 15 Most Several NA NA NA 4 Yes No NA NA NA NA NA Yes NA 0 No Yes Smoker 18 Yes 17 No NA Yes Yes 16 8 1 No Heterosexual NA
51624 2009_10 male 34 30-39 409 White NA High School Married 25000-34999 30000 1.36 6 Own NotWorking 87.4 NA NA 164.7 32.22 NA 30.0_plus 70 113 85 114 88 114 88 112 82 NA 1.29 3.49 352 NA NA NA No NA Good 0 15 Most Several NA NA NA 4 Yes No NA NA NA NA NA Yes NA 0 No Yes Smoker 18 Yes 17 No NA Yes Yes 16 8 1 No Heterosexual NA
51624 2009_10 male 34 30-39 409 White NA High School Married 25000-34999 30000 1.36 6 Own NotWorking 87.4 NA NA 164.7 32.22 NA 30.0_plus 70 113 85 114 88 114 88 112 82 NA 1.29 3.49 352 NA NA NA No NA Good 0 15 Most Several NA NA NA 4 Yes No NA NA NA NA NA Yes NA 0 No Yes Smoker 18 Yes 17 No NA Yes Yes 16 8 1 No Heterosexual NA
51625 2009_10 male 4 0-9 49 Other NA NA NA 20000-24999 22500 1.07 9 Own NA 17.0 NA NA 105.4 15.30 NA 12.0_18.5 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA No NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 4 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
51630 2009_10 female 49 40-49 596 White NA Some College LivePartner 35000-44999 40000 1.91 5 Rent NotWorking 86.7 NA NA 168.4 30.57 NA 30.0_plus 86 112 75 118 82 108 74 116 76 NA 1.16 6.70 77 0.094 NA NA No NA Good 0 10 Several Several 2 2 27 8 Yes No NA NA NA NA NA Yes 2 20 Yes Yes Smoker 38 Yes 18 No NA Yes Yes 12 10 1 Yes Heterosexual NA
51638 2009_10 male 9 0-9 115 White NA NA NA 75000-99999 87500 1.84 6 Rent NA 29.8 NA NA 133.1 16.82 NA 12.0_18.5 82 86 47 84 50 84 50 88 44 NA 1.34 4.86 123 1.538 NA NA No NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 5 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

Overplotting

ggplot(data = NHANES,
       aes(x = Height, y = Weight)) +
  geom_point() + 
  facet_grid( Gender ~ PregnantNow )

Using alpha (opacity)

One way to deal with overplotting is to set the opacity low.

ggplot(data = NHANES,
       aes(x = Height, y = Weight)) +
  geom_point(alpha = 0.01) + 
  facet_grid( Gender ~ PregnantNow )

geom_density2d

Alternatively (or simultaneously) we might prefer a different geom altogether.

ggplot(data = NHANES,
       aes(x = Height, y = Weight)) +
  geom_density2d() + 
  facet_grid( Gender ~ PregnantNow )

Multiple layers

ggplot(data = HELP_data, 
       aes(x = children, y = age)) +
  geom_boxplot(outlier.size = 0) +
  geom_point(alpha = .6) +
  coord_flip() +
  labs(title = "HELP clinical trial at detoxification unit")

ggplot(data = HELP_data,
       aes(x = children, y = age)) +
  geom_boxplot(outlier.size = 0) +
  geom_jitter(alpha = .6, width=0.1) +
  coord_flip() +
  labs(title = "HELP clinical trial at detoxification unit")

Multiple layers

ggplot(data = HELP_data,
       aes(x = children, y = age)) +
  geom_boxplot(outlier.size = 0) +
  geom_point(alpha = .6, 
             position = position_jitter(width = .1, height = 0)) +
  coord_flip() +
  labs(title = "HELP clinical trial at detoxification unit")
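
Setting `height = 0` in position_jitter() means only the categorical axis is perturbed, so the plotted values themselves stay honest. A small check, assuming the built-in `mtcars` data and a fixed `seed` so the jitter is reproducible:

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_point(position = position_jitter(width = 0.1, height = 0, seed = 1))

# Positions after the jitter has been applied.
jittered <- layer_data(p)

# Horizontal shift of each point away from its category position (1, 2, 3).
x_shift <- as.numeric(jittered$x) - round(as.numeric(jittered$x))

range(x_shift)                   # stays within the requested width
range(jittered$y - mtcars$mpg)   # no vertical movement
```

Jittering the quantitative axis as well would move points to values that were never observed, which is usually not what we want.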

Things I haven’t mentioned (much)

  • coords (coord_flip() is good to know about)

  • themes (for customizing appearance)

  • position (position_dodge(), position_jitterdodge() (for use with points on top of dodged boxplots), position_stack(), etc.)

  • transforming axes

themes

library(ggthemes)
ggplot(Births78, aes(x = date, y = births)) + 
  geom_point() + 
  theme_wsj()

jitterdodge()

ggplot(data = HELP_data, 
       aes(x = substance, y = age, color = children)) +
  geom_boxplot(position = position_dodge(width = 1)) +
  geom_point(aes(color = children, 
                 fill = children), 
             position = position_jitterdodge(dodge.width = 1, jitter.width = 0.1), 
             size = 0.5) +
  labs(title = "HELP clinical trial at detoxification unit")

A little bit of everything

ggplot(data = HELP_data, aes(x = substance, y = age, color = children)) +
  geom_boxplot(position = position_dodge(width=1)) +
  geom_point(aes(fill = children), 
             alpha = .5, 
             position = position_jitterdodge(dodge.width = 1, jitter.width = 0.2)) + 
  facet_wrap(~homeless) +
  labs(title = "HELP clinical trial at detoxification unit")

Want to learn more?

What’s around the corner?

shiny

  • interactive graphics / modeling

  • https://shiny.rstudio.com/

plotly

Plotly is an R package for creating interactive web-based graphs via plotly’s JavaScript graphing library, plotly.js. The plotly R library contains the ggplotly function, which will convert ggplot2 figures into a Plotly object. Furthermore, you have the option of manipulating the Plotly object with the style function.

  • https://plot.ly/ggplot2/getting-started/

gganimate

Examples in the wild

  • Advice
  • Tufte – Cholera & Challenger
  • Fonts
  • NYT often does data viz quite well
  • W.E.B. Du Bois

Preliminaries

  1. Make the data stand out

  2. Facilitate comparison

  3. Add information

(Nolan and Perrett, 2016)

Preliminaries

Tufte lists two main motivational steps to working with graphics as part of an argument.

  1. “An essential analytic task in making decisions based on evidence is to understand how things work.”

  2. “Making decisions based on evidence requires the appropriate display of that evidence.”

Tufte

Tufte (1997) Visual and Statistical Thinking: Displays of Evidence for Making Decisions. (Use Google to find it.)

Cholera - a picture tells 1000 words

How many aspects of this graph can you point out which are relevant to figuring out that cholera infection was coming from a single pump? Are there any distracting aspects?

Cholera - difficult to interpret

Why would the outbreak already have begun to decline before the pump handle was removed?

Challenger - Problematic

One of the graphics which was particularly unconvincing in trying to explain that O-rings fail in the cold.

Challenger - Better????

A different graph of the Challenger information, now sorted by temperature

Challenger - Improved

The graphic the engineers should have led with in trying to persuade the administrators not to launch. It is evident that the number of O-ring failures is quite highly associated with the ambient temperature. Note the vital information on the x-axis associated with the large number of launches at warm temperatures that had zero O-ring failures.

Note that the “improved” Challenger graphic was made by Tufte, not by the engineers working on the problem at the time.

Fonts matter

image credit: Will Chase RStudio::conf 2020

Advice on plotting, specific

  • Avoid having other graph elements interfere with data
  • Use visually prominent symbols
  • Avoid over-plotting (One way to avoid over plotting: jitter the values)
  • Different values of data may obscure each other
  • Include all or nearly all of the data
  • Fill data region

Advice on plotting, general

  • Eliminate superfluous material
  • Facilitate comparisons
  • Choose the best scale
  • Make the plot data / information rich
  • Use good captions, alt text, conclusions

Simplify

image credit: https://www.darkhorseanalytics.com/portfolio-data-looks-better-naked

Simplified

The before and after images with the process of simplifying a barplot.

image credit: https://www.darkhorseanalytics.com/portfolio-data-looks-better-naked

NYT 9/7/21

One in 5,000, NYT, D. Leonhardt 9/7/21; image credit: https://www.nytimes.com/2021/09/07/briefing/risk-breakthrough-infections-delta.html
  • lighter grid lines
  • no extra information
  • good caption
  • regression line to give context to the trend
  • y axes labels horizontal, not vertical
  • a few states (and the US) are highlighted to draw the reader’s eye

W.E.B. Du Bois

One of the great early data viz pioneers. Remarkable ability to convey information.

Worth a Mention

W.E.B. Du Bois (1868-1963)

  • sociologist
  • data scientist

image credit: wikipedia

In 1900 Du Bois contributed approximately 60 data visualizations to an exhibit at the Exposition Universelle in Paris, an exhibit designed to illustrate the progress made by African Americans since the end of slavery (only 37 years prior, in 1863).

Beautiful & Informative Graphics

https://drawingmatter.org/w-e-b-du-bois-visionary-infographics/

figures from Du Bois's 1900 exhibition