August 28 + September 4, 2024
Important
Before next Wednesday, read: Tufte. 1997. Visual and Statistical Thinking: Displays of Evidence for Making Decisions. (Use Google to find it.)
What was Hilary trying to answer in her data collection?
Name two of Hilary’s main hurdles in gathering accurate data.
Which is better: high touch (manual) or low touch (automatic) data collection? Why?
What additional covariates are needed / desired? Any problems with them?
How much data does she need?
Yau (2013) gives us nine visual cues, and Wickham (2014) translates them into a language using ggplot2.
Visual Cues: the aspects of the figure where we should focus.
Position (numerical) where in relation to other things?
Length (numerical) how big (in one dimension)?
Angle (numerical) how wide? parallel to something else?
Direction (numerical) at what slope? In a time series, going up or down?
Shape (categorical) belonging to what group?
Area (numerical) how big (in two dimensions)? Beware of improper scaling!
Volume (numerical) how big (in three dimensions)? Beware of improper scaling!
Shade (either) to what extent? how severely?
Color (either) to what extent? how severely? Beware of red/green color blindness.
Coordinate System: rectangular, polar, geographic, etc.
Scale: numeric (linear? logarithmic?), categorical (ordered?), time
Context: in comparison to what (think back to ideas from Tufte)
Visual Cues of Yau (2013):
Position (numerical)
Length (numerical)
Angle (numerical)
Direction (numerical)
Shape (categorical)
Area (numerical)
Volume (numerical)
Shade (either)
Color (either)
Attributes can focus your reader’s attention.
ggplot2
What I will try to do
give a tour of ggplot2
explain how to think about plots the ggplot2 way
prepare/encourage you to learn more later
What I can’t do in one session
show every bell and whistle
make you an expert at using ggplot2
One of the best ways to get started with ggplot is to Google what you want to do along with the word ggplot, then look through the images that come up. More often than not, the associated code is there. There are also ggplot galleries of images; one of them is here: https://plot.ly/ggplot2/
Look at the end of this presentation and the syllabus. More help options there.
ggplot
geom: the geometric “shape” used to display data
aesthetic: an attribute controlling how geom is displayed with respect to variables
guide: helps user convert visual data back into raw data (legends, axes)
stat: a transformation applied to data before geom gets it
date | births | wday | year | month | day_of_year | day_of_month | day_of_week |
---|---|---|---|---|---|---|---|
1978-01-01 | 7701 | Sun | 1978 | 1 | 1 | 1 | 1 |
1978-01-02 | 7527 | Mon | 1978 | 1 | 2 | 2 | 2 |
1978-01-03 | 8825 | Tue | 1978 | 1 | 3 | 3 | 3 |
1978-01-04 | 8859 | Wed | 1978 | 1 | 4 | 4 | 4 |
1978-01-05 | 9043 | Thu | 1978 | 1 | 5 | 5 | 5 |
1978-01-06 | 9208 | Fri | 1978 | 1 | 6 | 6 | 6 |
Two Questions:
What do we want R to do? (What is the goal?)
What does R need to know?
Goal: scatterplot = a plot with points
What does R need to know?
data source: Births78
aesthetics:
date -> x
births -> y
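The answers to the two questions can be combined into one call. This is a minimal sketch, assuming Births78 is the data frame shipped with the mosaicData package (the notes name the data source but not the package).

```r
library(ggplot2)
library(mosaicData)   # assumption: Births78 comes from this package

# data source: Births78; aesthetics: date -> x, births -> y
p_births <- ggplot(data = Births78, aes(x = date, y = births)) +
  geom_point()        # goal: scatterplot = a plot with points

p_births
```

Printing the object (the last line) is what draws the plot.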
What has changed?
Now there are two layers: one with points and one with lines
The layers are placed one on top of the other: the points are below and the lines are above.
data and aes specified in ggplot() affect all geoms.
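A two-layer version of the births plot might look like the sketch below. It assumes the same Births78 data from mosaicData; the data and aesthetics are given once in ggplot() and both geoms inherit them.

```r
library(ggplot2)
library(mosaicData)   # assumption: source of Births78

p_layers <- ggplot(data = Births78, aes(x = date, y = births)) +
  geom_point() +   # first layer, drawn underneath
  geom_line()      # second layer, drawn on top of the points

p_layers
```

Swapping the order of geom_point() and geom_line() swaps which layer ends up on top.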
This is mapping the color aesthetic to a new variable with only one value (“navy”).
So all the dots get set to the same color, but it’s not navy.
If we want to set the color to be navy for all of the dots, we do it outside the aes() designation.
color = "navy" is now outside of the aesthetics list. aes() is how ggplot2 distinguishes between mapping and setting.
ggplot() establishes the default data and aesthetics for the geoms, but each geom may change these defaults.
good practice: put into ggplot() the things that affect all (or most) of the layers; put the rest in geom_XXXX()
Information gets passed to the plot via:
map: the variable information goes inside the aes() (aesthetic) function
set: the non-variable information goes outside the aes() (aesthetic) function
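The difference between mapping and setting is easiest to see side by side. The sketch below reuses the births scatterplot (Births78 from mosaicData is an assumption about where the data live).

```r
library(ggplot2)
library(mosaicData)   # assumption: source of Births78

# Mapping: inside aes(), "navy" is treated as a variable with one value,
# so ggplot2 assigns a color from its default scale and draws a legend.
p_mapped <- ggplot(data = Births78, aes(x = date, y = births)) +
  geom_point(aes(color = "navy"))

# Setting: outside aes(), the value is used literally, so the points are navy.
p_set <- ggplot(data = Births78, aes(x = date, y = births)) +
  geom_point(color = "navy")

p_mapped
p_set
```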
[1] "geom_abline" "geom_area" "geom_ash"
[4] "geom_bar" "geom_bin_2d" "geom_bin2d"
[7] "geom_blank" "geom_boxplot" "geom_col"
[10] "geom_contour" "geom_contour_filled" "geom_count"
[13] "geom_crossbar" "geom_curve" "geom_density"
[16] "geom_density_2d" "geom_density_2d_filled" "geom_density2d"
[19] "geom_density2d_filled" "geom_dotplot" "geom_errorbar"
[22] "geom_errorbarh" "geom_freqpoly" "geom_function"
[25] "geom_hex" "geom_histogram" "geom_hline"
[28] "geom_jitter" "geom_label" "geom_line"
[31] "geom_linerange" "geom_lm" "geom_map"
[34] "geom_path" "geom_point" "geom_pointrange"
[37] "geom_polygon" "geom_qq" "geom_qq_line"
[40] "geom_quantile" "geom_rangeframe" "geom_raster"
[43] "geom_rect" "geom_ribbon" "geom_rug"
[46] "geom_segment" "geom_sf" "geom_sf_label"
[49] "geom_sf_text" "geom_smooth" "geom_spline"
[52] "geom_spoke" "geom_step" "geom_text"
[55] "geom_tile" "geom_tufteboxplot" "geom_violin"
[58] "geom_vline"
help pages will tell you their aesthetics, default stats, etc.
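A listing of this kind can be produced with apropos(), which searches the attached packages for matching names. This is a sketch; with only ggplot2 attached the result will likely be shorter than the list above, which appears to include geoms from add-on packages.

```r
library(ggplot2)

# All attached objects whose names start with "geom_"
geoms <- apropos("^geom_")
geoms

# The help page for a geom lists its aesthetics and default stat, e.g.:
# ?geom_area
```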
geom_area
Most (all?) graphics are intended to help us make comparisons
Key plot metric
Does my plot make the comparisons I am interested in:
HELPrct: Health Evaluation and Linkage to Primary care randomized clinical trial. Subjects admitted for treatment for addiction to one of three substances.
age | anysubstatus | anysub | cesd | d1 | daysanysub | dayslink | drugrisk | e2b | female | sex | g1b | homeless | i1 | i2 | id | indtot | linkstatus | link | mcs | pcs | pss_fr | racegrp | satreat | sexrisk | substance | treat | avg_drinks | max_drinks | hospitalizations |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
37 | 1 | yes | 49 | 3 | 177 | 225 | 0 | NA | 0 | male | yes | housed | 13 | 26 | 1 | 39 | 1 | yes | 25.111990 | 58.41369 | 0 | black | no | 4 | cocaine | yes | 13 | 26 | 3 |
37 | 1 | yes | 30 | 22 | 2 | NA | 0 | NA | 0 | male | yes | homeless | 56 | 62 | 2 | 43 | NA | NA | 26.670307 | 36.03694 | 1 | white | no | 7 | alcohol | yes | 56 | 62 | 22 |
26 | 1 | yes | 39 | 0 | 3 | 365 | 20 | NA | 0 | male | no | housed | 0 | 0 | 3 | 41 | 0 | no | 6.762923 | 74.80633 | 13 | black | no | 2 | heroin | no | 0 | 0 | 0 |
39 | 1 | yes | 15 | 2 | 189 | 343 | 0 | 1 | 1 | female | no | housed | 5 | 5 | 4 | 28 | 0 | no | 43.967880 | 61.93168 | 11 | white | yes | 4 | heroin | no | 5 | 5 | 2 |
32 | 1 | yes | 39 | 12 | 2 | 57 | 0 | 1 | 0 | male | no | homeless | 10 | 13 | 5 | 38 | 1 | yes | 21.675755 | 37.34558 | 10 | black | no | 6 | cocaine | no | 10 | 13 | 12 |
47 | 1 | yes | 6 | 1 | 31 | 365 | 0 | NA | 1 | female | no | housed | 4 | 4 | 6 | 29 | 0 | no | 55.508991 | 46.47521 | 5 | black | no | 5 | cocaine | yes | 4 | 4 | 1 |
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Notice the messages
stat_bin: Histograms are not mapping the raw data but binned data. stat_bin() performs the data transformation.
binwidth: a default binwidth has been selected, but we should really choose our own.
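A histogram that triggers this message, and one that avoids it by choosing a bin width, might look as follows. The variable (age) and the width (2 years) are assumptions chosen for illustration; HELPrct is assumed to come from mosaicData.

```r
library(ggplot2)
library(mosaicData)   # assumption: source of HELPrct

# Default: stat_bin() falls back to 30 bins and says so when the plot is drawn
p_default <- ggplot(data = HELPrct, aes(x = age)) +
  geom_histogram()

# Better: pick a bin width that suits the variable (2-year bins here)
p_binned <- ggplot(data = HELPrct, aes(x = age)) +
  geom_histogram(binwidth = 2)

p_default
p_binned
```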
Every geom comes with a default stat. For simple cases, the stat is stat_identity(), which does nothing.
Every stat comes with a default geom, every geom with a default stat.
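The geom/stat pairing can be seen by building the same histogram from either side, or by overriding the default. This is a sketch using HELPrct (assumed to come from mosaicData) with an illustrative bin width.

```r
library(ggplot2)
library(mosaicData)   # assumption: source of HELPrct

# Starting from the geom: geom_histogram() uses stat_bin() by default
p_geom <- ggplot(HELPrct, aes(x = age)) + geom_histogram(binwidth = 2)

# Starting from the stat: stat_bin() draws bars by default
p_stat <- ggplot(HELPrct, aes(x = age)) + stat_bin(binwidth = 2)

# Overriding the default: the same binned counts drawn as a line
p_line <- ggplot(HELPrct, aes(x = age)) + geom_line(stat = "bin", binwidth = 2)

# geom_point() is one of the simple cases: its default stat is the identity
point_layer <- geom_point()
```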
Using color and linetype:
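One possible plot of this kind maps two categorical variables to color and linetype. The choice of variables (substance and sex from HELPrct) and of a density curve is an assumption, not necessarily the plot shown in the session.

```r
library(ggplot2)
library(mosaicData)   # assumption: source of HELPrct

# One density curve per substance/sex combination:
# color distinguishes substance, linetype distinguishes sex
p_density <- ggplot(HELPrct, aes(x = age, color = substance, linetype = sex)) +
  geom_line(stat = "density")

p_density
```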
Boxplots use stat_boxplot() (five number summary).
The quantitative variable must be y, and there must be an additional x variable.
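A basic boxplot following this rule puts the quantitative variable on y and a grouping variable on x. The variables below (age by substance in HELPrct, assumed to come from mosaicData) are chosen for illustration.

```r
library(ggplot2)
library(mosaicData)   # assumption: source of HELPrct

# Quantitative variable on y, grouping variable on x
p_box <- ggplot(HELPrct, aes(x = substance, y = age)) +
  geom_boxplot()

p_box
```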
coord_flip() may be used with other plots as well to reverse the roles of x and y on the plot.
We can scale the continuous axis.
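Both ideas can be tried on a boxplot. In this sketch the log scale is only an example of rescaling the continuous axis; the session may have used a different scale. HELPrct is assumed to come from mosaicData.

```r
library(ggplot2)
library(mosaicData)   # assumption: source of HELPrct

# Horizontal boxplots: age is still mapped to y, but drawn along the horizontal axis
p_flip <- ggplot(HELPrct, aes(x = substance, y = age)) +
  geom_boxplot() +
  coord_flip()

# Rescaling the continuous (age) axis, here with a log10 scale
p_scaled <- p_flip + scale_y_log10()

p_flip
p_scaled
```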
We’ve triggered a new feature: dodge (for dodging things left/right). We can control how much if we set the dodge manually.
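Dodging appears once a second grouping variable is mapped within each x category. The sketch below assumes that variable is sex (from HELPrct in mosaicData) and sets the dodge width by hand.

```r
library(ggplot2)
library(mosaicData)   # assumption: source of HELPrct

# Two boxes per substance (one per sex), placed side by side automatically
p_auto <- ggplot(HELPrct, aes(x = substance, y = age, color = sex)) +
  geom_boxplot()

# Same plot with the amount of dodging set manually
p_dodge <- ggplot(HELPrct, aes(x = substance, y = age, color = sex)) +
  geom_boxplot(position = position_dodge(width = 1))

p_auto
p_dodge
```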
ID | SurveyYr | Gender | Age | AgeDecade | AgeMonths | Race1 | Race3 | Education | MaritalStatus | HHIncome | HHIncomeMid | Poverty | HomeRooms | HomeOwn | Work | Weight | Length | HeadCirc | Height | BMI | BMICatUnder20yrs | BMI_WHO | Pulse | BPSysAve | BPDiaAve | BPSys1 | BPDia1 | BPSys2 | BPDia2 | BPSys3 | BPDia3 | Testosterone | DirectChol | TotChol | UrineVol1 | UrineFlow1 | UrineVol2 | UrineFlow2 | Diabetes | DiabetesAge | HealthGen | DaysPhysHlthBad | DaysMentHlthBad | LittleInterest | Depressed | nPregnancies | nBabies | Age1stBaby | SleepHrsNight | SleepTrouble | PhysActive | PhysActiveDays | TVHrsDay | CompHrsDay | TVHrsDayChild | CompHrsDayChild | Alcohol12PlusYr | AlcoholDay | AlcoholYear | SmokeNow | Smoke100 | Smoke100n | SmokeAge | Marijuana | AgeFirstMarij | RegularMarij | AgeRegMarij | HardDrugs | SexEver | SexAge | SexNumPartnLife | SexNumPartYear | SameSex | SexOrientation | PregnantNow |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
51624 | 2009_10 | male | 34 | 30-39 | 409 | White | NA | High School | Married | 25000-34999 | 30000 | 1.36 | 6 | Own | NotWorking | 87.4 | NA | NA | 164.7 | 32.22 | NA | 30.0_plus | 70 | 113 | 85 | 114 | 88 | 114 | 88 | 112 | 82 | NA | 1.29 | 3.49 | 352 | NA | NA | NA | No | NA | Good | 0 | 15 | Most | Several | NA | NA | NA | 4 | Yes | No | NA | NA | NA | NA | NA | Yes | NA | 0 | No | Yes | Smoker | 18 | Yes | 17 | No | NA | Yes | Yes | 16 | 8 | 1 | No | Heterosexual | NA |
51624 | 2009_10 | male | 34 | 30-39 | 409 | White | NA | High School | Married | 25000-34999 | 30000 | 1.36 | 6 | Own | NotWorking | 87.4 | NA | NA | 164.7 | 32.22 | NA | 30.0_plus | 70 | 113 | 85 | 114 | 88 | 114 | 88 | 112 | 82 | NA | 1.29 | 3.49 | 352 | NA | NA | NA | No | NA | Good | 0 | 15 | Most | Several | NA | NA | NA | 4 | Yes | No | NA | NA | NA | NA | NA | Yes | NA | 0 | No | Yes | Smoker | 18 | Yes | 17 | No | NA | Yes | Yes | 16 | 8 | 1 | No | Heterosexual | NA |
51624 | 2009_10 | male | 34 | 30-39 | 409 | White | NA | High School | Married | 25000-34999 | 30000 | 1.36 | 6 | Own | NotWorking | 87.4 | NA | NA | 164.7 | 32.22 | NA | 30.0_plus | 70 | 113 | 85 | 114 | 88 | 114 | 88 | 112 | 82 | NA | 1.29 | 3.49 | 352 | NA | NA | NA | No | NA | Good | 0 | 15 | Most | Several | NA | NA | NA | 4 | Yes | No | NA | NA | NA | NA | NA | Yes | NA | 0 | No | Yes | Smoker | 18 | Yes | 17 | No | NA | Yes | Yes | 16 | 8 | 1 | No | Heterosexual | NA |
51625 | 2009_10 | male | 4 | 0-9 | 49 | Other | NA | NA | NA | 20000-24999 | 22500 | 1.07 | 9 | Own | NA | 17.0 | NA | NA | 105.4 | 15.30 | NA | 12.0_18.5 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | No | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 4 | 1 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
51630 | 2009_10 | female | 49 | 40-49 | 596 | White | NA | Some College | LivePartner | 35000-44999 | 40000 | 1.91 | 5 | Rent | NotWorking | 86.7 | NA | NA | 168.4 | 30.57 | NA | 30.0_plus | 86 | 112 | 75 | 118 | 82 | 108 | 74 | 116 | 76 | NA | 1.16 | 6.70 | 77 | 0.094 | NA | NA | No | NA | Good | 0 | 10 | Several | Several | 2 | 2 | 27 | 8 | Yes | No | NA | NA | NA | NA | NA | Yes | 2 | 20 | Yes | Yes | Smoker | 38 | Yes | 18 | No | NA | Yes | Yes | 12 | 10 | 1 | Yes | Heterosexual | NA |
51638 | 2009_10 | male | 9 | 0-9 | 115 | White | NA | NA | NA | 75000-99999 | 87500 | 1.84 | 6 | Rent | NA | 29.8 | NA | NA | 133.1 | 16.82 | NA | 12.0_18.5 | 82 | 86 | 47 | 84 | 50 | 84 | 50 | 88 | 44 | NA | 1.34 | 4.86 | 123 | 1.538 | NA | NA | No | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 5 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
One way to deal with overplotting is to set the opacity low.
Alternatively (or simultaneously) we might prefer a different geom altogether.
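Both remedies for overplotting can be sketched with the NHANES data. The variables (Height and Weight), the opacity value, and the choice of geom_bin_2d() as the alternative geom are assumptions for illustration; the data frame is assumed to come from the NHANES package.

```r
library(ggplot2)
library(NHANES)   # assumption: source of the NHANES data frame

# Option 1: low opacity, so dense regions look darker than sparse ones
p_alpha <- ggplot(NHANES, aes(x = Height, y = Weight)) +
  geom_point(alpha = 0.05)

# Option 2: a different geom that summarizes counts in rectangular bins
p_bins <- ggplot(NHANES, aes(x = Height, y = Weight)) +
  geom_bin_2d()

p_alpha
p_bins
```

Rows with missing Height or Weight are dropped with a warning when the plots are drawn.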
coords (coord_flip() is good to know about)
themes (for customizing appearance)
position (position_dodge(), position_jitterdodge() (for use with points on top of dodged boxplots), position_stack(), etc.)
transforming axes
jitterdodge()
# Dodged boxplots with jittered points drawn on top of each box
ggplot(data = HELP_data,
       aes(x = substance, y = age, color = children)) +
  geom_boxplot(position = position_dodge(width = 1)) +
  geom_point(aes(color = children,
                 fill = children),
             position = position_jitterdodge(dodge.width = 1, jitter.width = 0.1),
             size = 0.5) +
  labs(title = "HELP clinical trial at detoxification unit")

# Same idea, with semi-transparent points and one panel per level of homeless
ggplot(data = HELP_data, aes(x = substance, y = age, color = children)) +
  geom_boxplot(position = position_dodge(width = 1)) +
  geom_point(aes(fill = children),
             alpha = .5,
             position = position_jitterdodge(dodge.width = 1, jitter.width = 0.2)) +
  facet_wrap(~homeless) +
  labs(title = "HELP clinical trial at detoxification unit")
R for Data Science by Hadley Wickham and Garrett Grolemund
shiny
interactive graphics / modeling
https://shiny.rstudio.com/
plotly
Plotly is an R package for creating interactive web-based graphs via plotly’s JavaScript graphing library, plotly.js. The plotly R library contains the ggplotly function, which will convert ggplot2 figures into a Plotly object. Furthermore, you have the option of manipulating the Plotly object with the style function.
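A minimal sketch of that workflow, assuming the plotly package is installed and reusing the births scatterplot (Births78 from mosaicData is an assumption). The hoverinfo change is only one example of what style() can adjust.

```r
library(ggplot2)
library(plotly)
library(mosaicData)   # assumption: source of Births78

# An ordinary static ggplot2 figure
p_static <- ggplot(Births78, aes(x = date, y = births)) +
  geom_point()

# Convert it to an interactive Plotly object
p_interactive <- ggplotly(p_static)

# Adjust the Plotly object: show only the y value when hovering over the first trace
p_styled <- style(p_interactive, hoverinfo = "y", traces = 1)

p_styled
```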
gganimate
Make the data stand out
Facilitate comparison
Add information
(Nolan and Perrett, 2016)
Tufte lists two main motivational steps to working with graphics as part of an argument.
“An essential analytic task in making decisions based on evidence is to understand how things work.”
“Making decisions based on evidence requires the appropriate display of that evidence.”
Tufte (1997) Visual and Statistical Thinking: Displays of Evidence for Making Decisions. (Use Google to find it.)
Note that the “improved” Challenger graphic was made by Tufte, not by the engineers working on the problem at the time.
image credit: https://www.darkhorseanalytics.com/portfolio-data-looks-better-naked
One of the great early data viz pioneers. Remarkable ability to convey information.
W.E.B. Du Bois (1868-1963)
In 1900 Du Bois contributed approximately 60 data visualizations to an exhibit at the Exposition Universelle in Paris, an exhibit designed to illustrate the progress made by African Americans since the end of slavery (only 37 years prior, in 1863).
https://drawingmatter.org/w-e-b-du-bois-visionary-infographics/