Setup

Load Libraries

# install.packages(c("tidycensus", "sf", "tigris"))
library(tidycensus)
library(tigris)

## To enable caching of data, set `options(tigris_use_cache = TRUE)`
## in your R script or .Rprofile.

library(sf)

## Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE

library(ggplot2)
library(gridExtra)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:gridExtra':
## 
##     combine

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)

GIS in R

`sf` and `sp` packages

Note: The following text was one of the first published for spatial data analysis in R, and relied on the sp package.

Bivand, R. S., Pebesma, E. J., & Gomez-Rubio, V. (2008). Applied spatial data analysis with R. Springer.

Two years ago, many parts of sp were be deprecated, and folks moved to sf. Bivand and Pebesma have published an updated eBook here: https://r-spatial.org/book/

sf cheetsheet here: https://github.com/rstudio/cheatsheets/blob/main/sf.pdf

Other packages & resources

The online book Geocomputation in R by Robin Lovelace, Jakob Nowosad, and Jannes Muenchow, covers the sf package as well as the terra, spData, and spDataLarge packages.

The raster package is helpful for raster or “continuous” space data.

Packages such as tmap, leaflet, and maptiles can enhance your spatial data visualization capabilities, particularly for adding contextual topographic information, integration with google maps, and the creating of dynamic map figure files.

Packages such as qgisprocess facilitate an interaction between R and other more standard point-and-click GIS software such as QGIS.

Finally, Python packages such as shapely (interface to GEOS), fiona (interface to GDAL), pyproj, and mapclassify exist for GIS operations and spatial data analysis.

Spatial Data & `R` Objects

Spatial data can come in many forms, which are generalized into the following categories:

Points: GPS locations
Lines: Streets, rivers, transit lines
Polygons: Nations, states, counties, cities, Census tracts, lakes
Pixels (Rasters): Satellite imagine, remote sensing, continuous space

sf handles all of these fairly well, but in sp each type of data was given a separate object type in R (like the difference between a matrix and a data.frame).

`tidycensus` and Polygons

R Package by Kyle Walker.
- Tidy for engaging with US Census and ACS data
- Basic Use Vignette
- Get Census API key
Book: Analyzing US Census Data: Methods, Maps, and Models in R

# vignette("tidycensus")

## Load API key ####
source(paste0(my_dir, "CensusAPIKey.R"))
census_api_key(myKey)

## To install your API key for use in future sessions, run this function with `install = TRUE`.

### Option to cache tidycensus shapefiles ####
options(tigris_use_cache = TRUE)

Variable & table names

In tidycensus variables can be identified in the output of the load_variables() function by their name, label, and concept. A concept such as SEX BY AGE below is shared by many variables each of which are uniquely identified by an alphanumeric string, or their name. Variables are also uniquely identified by the combination of their label and their concept.

In the example below, all variables in the 2021 ACS within the SEX BY AGE concept have a name that begins with B01001. These variables contain estimates and margins of error for population counts by sex (Male, Female) and age categories (0-5, 5-9, 10-14, 15-17, 18-19, 20, 21, 22-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59, 60-61, 62-64, 65-66, 67-69, 70-74, 75-79, 80-84, and 85+.) Totals are provided for each sex and the population overall (e.g. B01001_001 and B01001_002 below.)

load_variables(year = 2021, dataset = "acs1") %>%
  filter(concept == "SEX BY AGE") %>% head()

## # A tibble: 6 × 3
##   name       label                                   concept   
##   <chr>      <chr>                                   <chr>     
## 1 B01001_001 Estimate!!Total:                        SEX BY AGE
## 2 B01001_002 Estimate!!Total:!!Male:                 SEX BY AGE
## 3 B01001_003 Estimate!!Total:!!Male:!!Under 5 years  SEX BY AGE
## 4 B01001_004 Estimate!!Total:!!Male:!!5 to 9 years   SEX BY AGE
## 5 B01001_005 Estimate!!Total:!!Male:!!10 to 14 years SEX BY AGE
## 6 B01001_006 Estimate!!Total:!!Male:!!15 to 17 years SEX BY AGE

A variable’s name is passed as a character string to the variables argument of the get_acs() or get_decennial() functions, e.g. variables = "B01001_001". A character vector can be passed if you would like to select more than one variable in the same API query, e.g. variables = c("B01001_001", "B01001_002").

There is also a table argument that allows you to select all variables from the same concept. For example, to select all variables from the SEX BY AGE concept, one would specify table = "B01001" and not specify anything for the variable argument.

Tips for searching

If you are not sure which variable you want, it’s helpful to save the output of the load_variables() function for searching and filtering concepts and labels.

There are only 6 unique concepts in the 2020 decennial census, so looking through the variables by hand is possible.

dec_vars_2020 <- load_variables(year = 2020, dataset = "pl")

## Unique 2020 decennial concepts
unique(dec_vars_2020$concept)

## [1] "OCCUPANCY STATUS"                                                                           
## [2] "RACE"                                                                                       
## [3] "HISPANIC OR LATINO, AND NOT HISPANIC OR LATINO BY RACE"                                     
## [4] "RACE FOR THE POPULATION 18 YEARS AND OVER"                                                  
## [5] "HISPANIC OR LATINO, AND NOT HISPANIC OR LATINO BY RACE FOR THE POPULATION 18 YEARS AND OVER"
## [6] "GROUP QUARTERS POPULATION BY MAJOR GROUP QUARTERS TYPE"

However, there are 1,200 unique variable concepts in the 2021 ACS data. This cannot be easily searched through by hand.

acs_vars_2021 <- load_variables(year = 2021, dataset = "acs1")

## Number of unique ACS 2021 concepts
length(unique(acs_vars_2021$concept))

## [1] 1200

To find a variable of interest in the ACS, you could begin by using code below to scroll through some of the concepts by hand and get an idea of what types of words are used to describe variable concepts.

acs_vars_2021 %>% 
  select(concept) %>% 
  unique() %>% View()

Once you know some keywords that may be a part of your concept of interest, you can use regular expression and string manipulation functions to filter down the concepts. For example, scrolling through the above output will show you that migration questions almost all contain the phrase “GEOGRAPHICAL MOBILITY”.

acs_vars_2021 %>%
  ## Filter down to concepts containing the string
  ## "GEOGRAPHICAL MOBILITY"
  filter(grepl("GEOGRAPHICAL MOBILITY", concept)) %>% 
  select(concept) %>% 
  ## Remeber concepts cover many variables, 
  ## so we just need to see the unique values to find 
  ## the one we want
  unique()

## # A tibble: 86 × 1
##    concept                                                                      
##    <chr>                                                                        
##  1 GEOGRAPHICAL MOBILITY IN THE PAST YEAR BY AGE FOR CURRENT RESIDENCE IN PUERT…
##  2 GEOGRAPHICAL MOBILITY IN THE PAST YEAR BY AGE FOR CURRENT RESIDENCE IN THE U…
##  3 MEDIAN AGE BY GEOGRAPHICAL MOBILITY IN THE PAST YEAR FOR CURRENT RESIDENCE I…
##  4 MEDIAN AGE BY GEOGRAPHICAL MOBILITY IN THE PAST YEAR FOR CURRENT RESIDENCE I…
##  5 GEOGRAPHICAL MOBILITY IN THE PAST YEAR BY SEX FOR CURRENT RESIDENCE IN PUERT…
##  6 GEOGRAPHICAL MOBILITY IN THE PAST YEAR BY SEX FOR CURRENT RESIDENCE IN THE U…
##  7 GEOGRAPHICAL MOBILITY IN THE PAST YEAR (WHITE ALONE) FOR CURRENT RESIDENCE I…
##  8 GEOGRAPHICAL MOBILITY IN THE PAST YEAR (WHITE ALONE) FOR CURRENT RESIDENCE I…
##  9 GEOGRAPHICAL MOBILITY IN THE PAST YEAR (BLACK OR AFRICAN AMERICAN ALONE) FOR…
## 10 GEOGRAPHICAL MOBILITY IN THE PAST YEAR (BLACK OR AFRICAN AMERICAN ALONE) FOR…
## # ℹ 76 more rows

If we know we are interested in breakdowns by age, we might try the following:

acs_vars_2021 %>%
  ## Filter down to concepts containing the string
  ## "GEOGRAPHICAL MOBILITY" AND "AGE"
  filter(grepl("GEOGRAPHICAL MOBILITY", concept) &
           grepl("AGE", concept)) %>% 
  select(concept) %>% 
  ## Remeber concepts cover many variables, 
  ## so we just need to see the unique values to find 
  ## the one we want
  unique()

## # A tibble: 8 × 1
##   concept                                                                       
##   <chr>                                                                         
## 1 GEOGRAPHICAL MOBILITY IN THE PAST YEAR BY AGE FOR CURRENT RESIDENCE IN PUERTO…
## 2 GEOGRAPHICAL MOBILITY IN THE PAST YEAR BY AGE FOR CURRENT RESIDENCE IN THE UN…
## 3 MEDIAN AGE BY GEOGRAPHICAL MOBILITY IN THE PAST YEAR FOR CURRENT RESIDENCE IN…
## 4 MEDIAN AGE BY GEOGRAPHICAL MOBILITY IN THE PAST YEAR FOR CURRENT RESIDENCE IN…
## 5 GEOGRAPHICAL MOBILITY IN THE PAST YEAR BY AGE FOR RESIDENCE 1 YEAR AGO IN PUE…
## 6 GEOGRAPHICAL MOBILITY IN THE PAST YEAR BY AGE FOR RESIDENCE 1 YEAR AGO IN THE…
## 7 MEDIAN AGE BY GEOGRAPHICAL MOBILITY IN THE PAST YEAR FOR RESIDENCE 1 YEAR AGO…
## 8 MEDIAN AGE BY GEOGRAPHICAL MOBILITY IN THE PAST YEAR FOR RESIDENCE 1 YEAR AGO…

Pulling spatial data

You can pull Census or ACS tables without geometry by specifying geometry = FALSE.
The geography argument has different options for specifications based on what year and data type you are requesting.

kingco_pop <- get_decennial(geography = "tract",
                           # variables = "P1_001N",
                           table = "P1",
                           state = "WA",
                           county = "King",
                           year = 2020,
                           sumfile = "pl",
                           geometry = TRUE)

## Getting data from the 2020 decennial Census

## Using the PL 94-171 Redistricting Data Summary File
## Using the PL 94-171 Redistricting Data Summary File

## Note: 2020 decennial Census data use differential privacy, a technique that
## introduces errors into data to preserve respondent confidentiality.
## ℹ Small counts should be interpreted with caution.
## ℹ See https://www.census.gov/library/fact-sheets/2021/protecting-the-confidentiality-of-the-2020-census-redistricting-data.html for additional guidance.
## This message is displayed once per session.

summary(kingco_pop)

##     GEOID               NAME             variable             value       
##  Length:35145       Length:35145       Length:35145       Min.   :   0.0  
##  Class :character   Class :character   Class :character   1st Qu.:   0.0  
##  Mode  :character   Mode  :character   Mode  :character   Median :   0.0  
##                                                           Mean   : 200.4  
##                                                           3rd Qu.:   8.0  
##                                                           Max.   :8837.0  
##           geometry    
##  MULTIPOLYGON :35145  
##  epsg:4269    :    0  
##  +proj=long...:    0  
##                       
##                       
##

str(kingco_pop)

## sf [35,145 × 5] (S3: sf/tbl_df/tbl/data.frame)
##  $ GEOID   : chr [1:35145] "53033024400" "53033024400" "53033024400" "53033024400" ...
##  $ NAME    : chr [1:35145] "Census Tract 244, King County, Washington" "Census Tract 244, King County, Washington" "Census Tract 244, King County, Washington" "Census Tract 244, King County, Washington" ...
##  $ variable: chr [1:35145] "P1_001N" "P1_002N" "P1_003N" "P1_004N" ...
##  $ value   : num [1:35145] 3053 2722 1891 53 9 ...
##  $ geometry:sfc_MULTIPOLYGON of length 35145; first list element: List of 1
##   ..$ :List of 1
##   .. ..$ : num [1:31, 1:2] -122 -122 -122 -122 -122 ...
##   ..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
##  - attr(*, "sf_column")= chr "geometry"
##  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA
##   ..- attr(*, "names")= chr [1:4] "GEOID" "NAME" "variable" "value"

unique(kingco_pop$variable)

##  [1] "P1_001N" "P1_002N" "P1_003N" "P1_004N" "P1_005N" "P1_006N" "P1_007N"
##  [8] "P1_008N" "P1_009N" "P1_010N" "P1_011N" "P1_012N" "P1_013N" "P1_014N"
## [15] "P1_015N" "P1_016N" "P1_017N" "P1_018N" "P1_019N" "P1_020N" "P1_021N"
## [22] "P1_022N" "P1_023N" "P1_024N" "P1_025N" "P1_026N" "P1_027N" "P1_028N"
## [29] "P1_029N" "P1_030N" "P1_031N" "P1_032N" "P1_033N" "P1_034N" "P1_035N"
## [36] "P1_036N" "P1_037N" "P1_038N" "P1_039N" "P1_040N" "P1_041N" "P1_042N"
## [43] "P1_043N" "P1_044N" "P1_045N" "P1_046N" "P1_047N" "P1_048N" "P1_049N"
## [50] "P1_050N" "P1_051N" "P1_052N" "P1_053N" "P1_054N" "P1_055N" "P1_056N"
## [57] "P1_057N" "P1_058N" "P1_059N" "P1_060N" "P1_061N" "P1_062N" "P1_063N"
## [64] "P1_064N" "P1_065N" "P1_066N" "P1_067N" "P1_068N" "P1_069N" "P1_070N"
## [71] "P1_071N"

A note on `GEOID`s

The GEOID column of output includes the standard Census numeric code for a particular geography. States and counties are determined by their Federal Information Processing Series(FIPS) codes.

The state code is 2 digits long, Washington’s FIPS code is 53, and we see each GEOID above begins with those digits.

The county code within each state is 3 digits long. Adams County’s unique code within Washington is 001. So, the full GEOID for Adams County, Washington is 53001.

The tract code that uniquely identifies a Census tract within a county is 6 digits long. The resulting GEOID for a tract in Adams County, Washington is then 53001TTTTTT, where TTTTTT represents a 6-digit tract code.

The block group code that uniquely identifies each Census block group within a certain Census tract, TTTTTT, within Adams County, Washington is 1 digit long. The resulting GEOID for a block group within tract TTTTTT in Adams County, Washington is 53001TTTTTTG, where G represents a 1 digit tract code.

The block code that uniquely identifies each Census block within a block group, G, within a certain Census tract, TTTTTT, within Adams County, Washington is 3 digits long. The resulting GEOID for a block group within tract TTTTTT and block group G in Adams County, Washington is 53001TTTTTTGBBB, where BBB represents a 3 digit tract code.

Dealing with variable labels

Now that we’ve walked through querying the Census API for the decennial Census and the ACS via tidycensus we need to cover some basics to make this data more usable. First, and foremost, no one will ever be able to remember what every alphanumeric Census variable name is by heart. But, if we use the output of load_variables() coupled with some data-wrangling, we can get neat string labels in no time!

Step 1: Save the output of load_variables()

dec_vars_2020 <- load_variables(year = 2020, dataset = "pl")

Step 2: Join load_variables() output with get_*() output Below, we join dec_vars_2020 onto kingco_pop by the respective variable name column in each dataset. This still leaves the pesky issue of the “!!” and “:” in the label strings.

kingco_pop <- kingco_pop %>% 
  left_join(dec_vars_2020 %>% 
              select(name, label),
            by = c("variable" = "name")) %>% 
  relocate("label", .after = "variable")

head(kingco_pop)

## Simple feature collection with 6 features and 5 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -122.2305 ymin: 47.57434 xmax: -122.2014 ymax: 47.58905
## Geodetic CRS:  NAD83
## # A tibble: 6 × 6
##   GEOID       NAME                variable label value                  geometry
##   <chr>       <chr>               <chr>    <chr> <dbl>        <MULTIPOLYGON [°]>
## 1 53033024400 Census Tract 244, … P1_001N  " !!…  3053 (((-122.2299 47.58662, -…
## 2 53033024400 Census Tract 244, … P1_002N  " !!…  2722 (((-122.2299 47.58662, -…
## 3 53033024400 Census Tract 244, … P1_003N  " !!…  1891 (((-122.2299 47.58662, -…
## 4 53033024400 Census Tract 244, … P1_004N  " !!…    53 (((-122.2299 47.58662, -…
## 5 53033024400 Census Tract 244, … P1_005N  " !!…     9 (((-122.2299 47.58662, -…
## 6 53033024400 Census Tract 244, … P1_006N  " !!…   711 (((-122.2299 47.58662, -…

Step 3: Clean up the variable labels How you want to do this will vary for each concept, this specific example is appropriate for the 2020 decennial “RACE” concept. The result of this code is an extra race column added to the data making it easier to plot and analyze.

 kingco_pop <- kingco_pop %>% 
  ## Replace the starting "!!Total" 
  ## with an empty string everywhere
  mutate(label = gsub("!!Total", "", label),
  ## Replace all remaining !!
         label = gsub("!!", "", label)) %>% 
  ## The only special characters left now are the :
  ## between sex and age groups
  ## we will split the string in each row on the :
  ## and retain the first element for sex and the second for label
  group_by(variable) %>% 
  mutate(race_cat = unlist(strsplit(label, ":"))[2],
         race_alone = unlist(strsplit(label, ":"))[3],
         race_two = unlist(strsplit(label, ":"))[4]) %>% 
  ungroup() %>% 
  ## Fill in "Total" strings where sex or age are missing
  mutate(race_cat = ifelse(race_cat == " ", "Total", race_cat),
         race_alone = ifelse(race_alone == " ", "Total", race_alone),
         race_two = ifelse(race_two == " ", "Total", race_two)) %>% 
  relocate(c("race_cat", "race_alone", "race_two"), .after = "label")

## Check our work
kingco_pop %>% 
  select(variable, label, contains("race_")) %>% 
  head()

## Simple feature collection with 6 features and 5 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -122.2305 ymin: 47.57434 xmax: -122.2014 ymax: 47.58905
## Geodetic CRS:  NAD83
## # A tibble: 6 × 6
##   variable label          race_cat race_alone race_two                  geometry
##   <chr>    <chr>          <chr>    <chr>      <chr>           <MULTIPOLYGON [°]>
## 1 P1_001N  " :"           Total    Total      Total    (((-122.2299 47.58662, -…
## 2 P1_002N  " :Population… Populat… Total      Populat… (((-122.2299 47.58662, -…
## 3 P1_003N  " :Population… Populat… White alo… Total    (((-122.2299 47.58662, -…
## 4 P1_004N  " :Population… Populat… Black or … Total    (((-122.2299 47.58662, -…
## 5 P1_005N  " :Population… Populat… American … Total    (((-122.2299 47.58662, -…
## 6 P1_006N  " :Population… Populat… Asian alo… Total    (((-122.2299 47.58662, -…

Polygon Plots with `ggplot2`

Any plot of spatial data in ggplot2 will require a combination of ggplot() + geom_sf().

The fill argument inside the aesthetic function (aes()) specifies the variable whose numeric or categorical information will determine each polygon’s color.

kingco_pop %>% 
  ## We have many vars in long form data,
  ## i.e. each polygon has multiple rows in the data,
  ## let's filter down to total population,
  ## i.e. race_cat == "Total"
  filter(race_cat == "Total") %>% 
  ggplot() +
  geom_sf(aes(fill = value))

Additional arguments can improve visualizations. * size adjusts lines used to draw polygon boundaries * scale_fill_viridis_c is a non-default color palette for continuous variables

kingco_pop %>% 
  filter(race_cat == "Total") %>% 
  ggplot() +
  ## size makes polygon lines smaller
  geom_sf(aes(fill = value), size = .25) +
  ## Nice palette, and reverse direction
  ## so light is small, dark is large
  scale_fill_viridis_c(direction = -1,
                       name = "Population") + 
  ggtitle("King County Census Tracts (2020)") +
  ## git rid of long lat axes
  theme_void()

Multiple variables with the same continuous legend

The facet_wrap function tells ggplot to make a multi-panel plot according to the unique values of the variables in the facet_wrap formula.

By default, facet_wrap will put all plots on the same legend.

kingco_pop %>% 
  ## Filter down to Population of Single race + those with two or more
  filter((race_cat == "Population of one race" & race_alone != "Total") |
        (race_cat == "Population of two or more races" &
           race_alone == "Total")) %>% 
  ## Make plot-ready labels
  mutate(Race = case_when(grepl("two or", race_cat) ~ "2+",
                          grepl("Black", race_alone) ~ "Black",
                          grepl("Alaska", race_alone) ~ "AI/AN",
                          grepl("Hawaiian", race_alone) ~ "NHOPI",
                          grepl("Some", race_alone) ~ "Other",
                          TRUE ~ gsub(" alone", "", race_alone)),
         ## Create a factor variable to control order of panels
         ## Otherwise, will be alphabetical
         Race = factor(Race, levels = c("AI/AN", "Asian", "Black",
                                        "NHOPI", "Other", "2+",  "White"))) %>% 
  ggplot() +
  geom_sf(aes(fill = value), size = .25) +
  scale_fill_viridis_c(direction = -1,
                       name = "Population") + 
  facet_wrap(~Race) +
  ggtitle("King County Census Tracts (2020)") +
  theme_void()

By hand

kingco_pop_wide <- kingco_pop %>% 
  ## Filter down to Population of Single race + those with two or more
  filter((race_cat == "Population of one race" & race_alone != "Total") |
           (race_cat == "Population of two or more races" &
              race_alone == "Total")) %>% 
  ## Make plot-ready labels
  mutate(Race = case_when(grepl("two or", race_cat) ~ "2+",
                          grepl("Black", race_alone) ~ "Black",
                          grepl("Alaska", race_alone) ~ "AI/AN",
                          grepl("Hawaiian", race_alone) ~ "NHOPI",
                          grepl("Some", race_alone) ~ "Other",
                          TRUE ~ gsub(" alone", "", race_alone)),
         ## Create a factor variable to control order of panels
         ## Otherwise, will be alphabetical
         Race = factor(Race, levels = c("AI/AN", "Asian", "Black",
                                        "NHOPI", "Other", "2+",  "White"))) %>% 
  ## Create one column per Race variable with pop size inside
  pivot_wider(id_cols = c("GEOID", "NAME", "geometry"),
               names_from = "Race",
               values_from = "value")

## What is the range of values for all tracts?
kingco_pop_wide %>% 
  st_drop_geometry() %>% 
  select(-GEOID, -NAME) %>% 
  range()

## [1]    0 6272

## Check quintiles
kingco_pop_wide %>% 
  st_drop_geometry() %>% 
  select(-GEOID, -NAME) %>%
  as.matrix() %>% 
  quantile(probs = seq(0, 1, .1))

##     0%    10%    20%    30%    40%    50%    60%    70%    80%    90%   100% 
##    0.0   13.0   32.0   68.0  129.0  272.0  414.0  592.8  949.2 2136.4 6272.0

## White alone
pop_white <- kingco_pop_wide %>% 
  ## Drop water tracts (tigris fn)
  erase_water() %>% 
  ggplot() +
  geom_sf(aes(fill = White), size = 0.25) +
  scale_fill_viridis_c(direction = -1,
                       name = "Population",
                       breaks = c(0, 50,
                                  100, 250,
                                  500, 1000,
                                  2000, 4000,
                                  6000, 6300),
                       limits = c(0, 6300)) + 
  ggtitle("White alone") +
  theme_void()

## Fetching area water data for your dataset's location...

## Erasing water area...
## If this is slow, try a larger area threshold value.

## White alone
pop_asian <- kingco_pop_wide %>% 
  ## Drop water tracts (tigris fn)
  ## use larger area_threshold arg to ignore smaller bodies of water
  erase_water(area_threshold = 0.99) %>% 
  ggplot() +
  geom_sf(aes(fill = Asian), size = 0.25) +
  scale_fill_viridis_c(direction = -1,
                       name = "Population",
                       breaks = c(0, 50,
                                  100, 250,
                                  500, 1000,
                                  2000, 4000,
                                  6000, 6300),
                       limits = c(0, 6300)) + 
  ggtitle("Asian alone") +
  theme_void()

## Fetching area water data for your dataset's location...
## Erasing water area...
## If this is slow, try a larger area threshold value.

## 2 or more
pop_twoplus <- kingco_pop_wide %>% 
  erase_water(area_threshold = 0.99) %>% 
  ggplot() +
  geom_sf(aes(fill = `2+`), size = 0.25) +
  scale_fill_viridis_c(direction = -1,
                       name = "Population",
                       breaks = c(0, 50,
                                  100, 250,
                                  500, 1000,
                                  2000, 4000,
                                  6000, 6300),
                       limits = c(0, 6300)) + 
  ggtitle("Two or more races") +
  theme_void()

## Fetching area water data for your dataset's location...
## Erasing water area...
## If this is slow, try a larger area threshold value.

grid.arrange(pop_white, pop_asian, pop_twoplus, nrow = 2)

Multiple variables with the same discrete legend

This may be a personal preference, but unless you are treating space continuously, discrete legends are easier for a reader or audience member to digest.

If you are familiar with RColorBrewer, there are a suite of functions designed to work with ggplot that allow you to call upon their palettes.

## To see RColorBrewer palettes
# library(RColorBrewer)
# display.brewer.all()

## Using a 3 color palette, similar to viridis
kingco_pop_racealone <-  kingco_pop %>% 
  ## Filter down to Population of Single race + those with two or more
  filter((race_cat == "Population of one race" & race_alone != "Total") |
           (race_cat == "Population of two or more races" &
              race_alone == "Total")) %>% 
  ## Make plot-ready labels
  mutate(Race = case_when(grepl("two or", race_cat) ~ "2+",
                          grepl("Black", race_alone) ~ "Black",
                          grepl("Alaska", race_alone) ~ "AI/AN",
                          grepl("Hawaiian", race_alone) ~ "NHOPI",
                          grepl("Some", race_alone) ~ "Other",
                          TRUE ~ gsub(" alone", "", race_alone)),
         ## Create a factor variable to control order of panels
         ## Otherwise, will be alphabetical
         Race = factor(Race, levels = c("AI/AN", "Asian", "Black",
                                        "NHOPI", "Other", "2+",  "White"))) %>% 
  ## Remove water
    erase_water(area_threshold = 0.99)

## Fetching area water data for your dataset's location...

## Erasing water area...
## If this is slow, try a larger area threshold value.

kingco_pop_racealone %>% 
  ggplot() +
  geom_sf(aes(fill = value), color = "grey80", size = .05) +
  scale_fill_fermenter(direction = 1, palette = "YlGnBu",
                       name = "Population",
                       breaks = c(0, 50,
                                  100, 250,
                                  500, 1000,
                                  2000, 4000,
                                  6000, 6300),
                       limits = c(0, 6300)) + 
  facet_wrap(~Race) +
  ggtitle("King County Census Tracts (2020)") +
  theme_void()

By hand

pop_white <- kingco_pop_wide %>% 
  erase_water(area_threshold = 0.99) %>% 
  ggplot() +
  geom_sf(aes(fill = White), size = 0.25, show.legend = FALSE) +
  scale_fill_fermenter(direction = 1, palette = "YlGnBu",
                       name = "Population",
                       breaks = c(0, 50,
                                  100, 250,
                                  500, 1000,
                                  2000, 4000,
                                  6000, 6300),
                       limits = c(0, 6300)) + 
  ggtitle("White alone") +
  theme_void()

## Fetching area water data for your dataset's location...

## Erasing water area...
## If this is slow, try a larger area threshold value.

pop_asian <- kingco_pop_wide %>% 
  erase_water(area_threshold = 0.99) %>% 
  ggplot() +
  geom_sf(aes(fill = Asian), size = 0.25) +
  scale_fill_fermenter(direction = 1, palette = "YlGnBu",
                       name = "Population",
                       breaks = c(0, 50,
                                  100, 250,
                                  500, 1000,
                                  2000, 4000,
                                  6000, 6300),
                       limits = c(0, 6300)) + 
  ggtitle("Asian alone") +
  theme_void()

## Fetching area water data for your dataset's location...
## Erasing water area...
## If this is slow, try a larger area threshold value.

pop_twoplus <- kingco_pop_wide %>% 
  erase_water(area_threshold = 0.99) %>% 
  ggplot() +
  geom_sf(aes(fill = `2+`), size = 0.25, show.legend = FALSE) +
  scale_fill_fermenter(direction = 1, palette = "YlGnBu",
                       name = "Population",
                       breaks = c(0, 50,
                                  100, 250,
                                  500, 1000,
                                  2000, 4000,
                                  6000, 6300),
                       limits = c(0, 6300)) + 
  ggtitle("Two or more") +
  theme_void()

## Fetching area water data for your dataset's location...
## Erasing water area...
## If this is slow, try a larger area threshold value.

grid.arrange(pop_white, pop_asian, pop_twoplus, nrow = 2)

Combining spatial data

Now, for the first time, we will load a shapefile using the read_sf function in R.

ESRI (a company one of whose products is ArcGIS) shapefiles consist of at least three files, and this is a common form. The first file .shp contains the geography of each shape. The second file shx is an index file which contains record offsets. The third file .dbf contains feature attributes with one record per feature.
To combine multiple spatial objects, they must be on the same coordinate reference system.

Download Sound Transit Link link rail station point location data: https://www.soundtransit.org/help-contacts/business-information/open-transit-data-otd/otd-downloads.

You’ll notice that the folder, once unzipped, contains many spatial layers. We’re going to load the

link_stns <- read_sf(dsn = "../Data/STPublicData",
                     layer = "LINKStations")

head(link_stns)

## Simple feature collection with 6 features and 6 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 1269619 ymin: 215537.9 xmax: 1271843 ymax: 226693.4
## Projected CRS: NAD83 / Washington North (ftUS)
## # A tibble: 6 × 7
##   LINK_TYPE NAME     STATION SEGMENT STATUS STATUSNOTE                 geometry
##       <int> <chr>    <chr>   <chr>   <chr>  <chr>      <POINT [US_survey_foot]>
## 1         1 Westlak… Westla… C       COMPL… <NA>             (1269619 226693.4)
## 2         1 Univers… Univer… C       COMPL… <NA>             (1269773 225331.4)
## 3         1 Pioneer… Pionee… C       COMPL… <NA>             (1270921 223393.7)
## 4         1 Interna… Intern… C       COMPL… <NA>             (1271685 221842.8)
## 5         1 Stadium… Stadium C       COMPL… <NA>             (1271843 219195.9)
## 6         1 SODO St… SODO    C       COMPL… <NA>             (1271706 215537.9)

link_lines <- read_sf(dsn = "../Data/STPublicData",
                      layer = "LINKLine")
head(link_lines)

## Simple feature collection with 6 features and 6 fields
## Geometry type: MULTILINESTRING
## Dimension:     XY
## Bounding box:  xmin: 1241431 ymin: 91328.47 xmax: 1285395 ymax: 258631.2
## Projected CRS: NAD83 / Washington North (ftUS)
## # A tibble: 6 × 7
##   OBJECTID LINK_TYPE DESCRIPTIO        STATUS   Shape_Leng LINE 
##      <int>     <int> <chr>             <chr>         <dbl> <chr>
## 1       68         1 Central Link      COMPLETE     73476. 1    
## 2      101         4 Airport Link      COMPLETE      9029. 1    
## 3      109         2 Tacoma Link       COMPLETE      9382. T    
## 4      113         6 University Link   COMPLETE     16875. 1    
## 5      130         3 Northgate Link    COMPLETE     19939. 1    
## 6      167         5 S 200th Extension COMPLETE      8998. 1    
## # ℹ 1 more variable: geometry <MULTILINESTRING [US_survey_foot]>

Coordinate Reference Systems

The st_crs function will print out the coordinate reference system for the object passed to the function
The st_transform function allows you to transform the CRS of an sf object. You can either pass the CRS numeric code or the output of an st_crs function to the crs argument.

st_crs(link_stns)

## Coordinate Reference System:
##   User input: NAD83 / Washington North (ftUS) 
##   wkt:
## PROJCRS["NAD83 / Washington North (ftUS)",
##     BASEGEOGCRS["NAD83",
##         DATUM["North American Datum 1983",
##             ELLIPSOID["GRS 1980",6378137,298.257222101,
##                 LENGTHUNIT["metre",1]],
##             ID["EPSG",6269]],
##         PRIMEM["Greenwich",0,
##             ANGLEUNIT["Degree",0.0174532925199433]]],
##     CONVERSION["unnamed",
##         METHOD["Lambert Conic Conformal (2SP)",
##             ID["EPSG",9802]],
##         PARAMETER["Latitude of false origin",47,
##             ANGLEUNIT["Degree",0.0174532925199433],
##             ID["EPSG",8821]],
##         PARAMETER["Longitude of false origin",-120.833333333333,
##             ANGLEUNIT["Degree",0.0174532925199433],
##             ID["EPSG",8822]],
##         PARAMETER["Latitude of 1st standard parallel",47.5,
##             ANGLEUNIT["Degree",0.0174532925199433],
##             ID["EPSG",8823]],
##         PARAMETER["Latitude of 2nd standard parallel",48.7333333333333,
##             ANGLEUNIT["Degree",0.0174532925199433],
##             ID["EPSG",8824]],
##         PARAMETER["Easting at false origin",1640416.66666667,
##             LENGTHUNIT["US survey foot",0.304800609601219],
##             ID["EPSG",8826]],
##         PARAMETER["Northing at false origin",0,
##             LENGTHUNIT["US survey foot",0.304800609601219],
##             ID["EPSG",8827]]],
##     CS[Cartesian,2],
##         AXIS["(E)",east,
##             ORDER[1],
##             LENGTHUNIT["US survey foot",0.304800609601219,
##                 ID["EPSG",9003]]],
##         AXIS["(N)",north,
##             ORDER[2],
##             LENGTHUNIT["US survey foot",0.304800609601219,
##                 ID["EPSG",9003]]]]

st_crs(kingco_pop)

## Coordinate Reference System:
##   User input: NAD83 
##   wkt:
## GEOGCRS["NAD83",
##     DATUM["North American Datum 1983",
##         ELLIPSOID["GRS 1980",6378137,298.257222101,
##             LENGTHUNIT["metre",1]]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["latitude",north,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["longitude",east,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]],
##     ID["EPSG",4269]]

## Transform the link stations to the Census CRS
link_stns <- link_stns %>% 
  st_transform(crs = st_crs(kingco_pop))

st_crs(link_stns)

## Coordinate Reference System:
##   User input: NAD83 
##   wkt:
## GEOGCRS["NAD83",
##     DATUM["North American Datum 1983",
##         ELLIPSOID["GRS 1980",6378137,298.257222101,
##             LENGTHUNIT["metre",1]]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["latitude",north,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["longitude",east,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]],
##     ID["EPSG",4269]]

## Transform the link lines to the Census CRS
link_lines <- link_lines %>% 
  st_transform(crs = st_crs(kingco_pop))

st_crs(link_lines)

## Coordinate Reference System:
##   User input: NAD83 
##   wkt:
## GEOGCRS["NAD83",
##     DATUM["North American Datum 1983",
##         ELLIPSOID["GRS 1980",6378137,298.257222101,
##             LENGTHUNIT["metre",1]]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["latitude",north,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["longitude",east,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]],
##     ID["EPSG",4269]]

Combining layers with `ggplot2`

Let’s add the location of LINK stations to our existing plot of King County’s 2020 decennial tract-level Asian alone population

kingco_pop_wide %>% 
  erase_water(area_threshold = 0.99) %>% 
  ggplot() +
  geom_sf(aes(fill = Asian), size = 0.25) +
  scale_fill_fermenter(direction = 1,
                       palette = "Blues",
                       name = "Population",
                       breaks = c(0, 50,
                                  100, 250,
                                  500, 1000,
                                  2000, 4000,
                                  6000, 6300),
                       limits = c(0, 6300)) + 
  ## Need to specify the data argument when bringing in a second dataset
  geom_sf(data = link_stns, color = "white") +
  ggtitle("Asian alone Population (2020 Census)", 
          subtitle = "Overlayed w/ LINK Stations") +
  theme_void()

## Fetching area water data for your dataset's location...

## Erasing water area...
## If this is slow, try a larger area threshold value.

Now, Link lines.

names(link_lines)

## [1] "OBJECTID"   "LINK_TYPE"  "DESCRIPTIO" "STATUS"     "Shape_Leng"
## [6] "LINE"       "geometry"

link_lines %>%
  st_drop_geometry() %>% 
  group_by(LINE) %>% 
  tally()

## # A tibble: 4 × 2
##   LINE      n
##   <chr> <int>
## 1 1         6
## 2 2         1
## 3 T         2
## 4 <NA>      4

kingco_pop_wide %>% 
  erase_water(area_threshold = 0.99) %>% 
  ggplot() +
  geom_sf(aes(fill = Asian), size = 0.05, color = "white") +
  scale_fill_fermenter(direction = 1,
                       palette = "Blues",
                       name = "Population",
                       breaks = c(0, 50,
                                  100, 250,
                                  500, 1000,
                                  2000, 4000,
                                  6000, 6300),
                       limits = c(0, 6300)) + 
  ## Need to specify the data argument when bringing in a second dataset
  geom_sf(data = link_lines %>% 
            filter(LINE %in% 1:2) %>% 
            mutate(LINE = as.factor(LINE)), aes(color = LINE)) +
  scale_color_manual(name = "LINK Line", 
                     values = c("1" = "deeppink", "2" = "orange")) +
  ggtitle("King County: Asian alone Population (2020)", 
          subtitle = "Overlayed w/ LINK Lines") +
  theme_void()

## Fetching area water data for your dataset's location...

## Erasing water area...
## If this is slow, try a larger area threshold value.

## Combining & comparing spatial objects in sf

sf functions st_intersects and st_join can be used to understand what parts of one spatial object overlap with another
There are many st_*() functions that make other geographic comparisons between two objects: st_contains, st_contains_properly, st_covered_by, st_covers st_crosses, st_disjoint, st_equals_exact, st_equals, st_is_within_distance, st_nearest_feature, st_overlaps, st_touches, st_within. See the Figure 5.8 from the Lovelace textbook for a visual representation of some function’s purpose.
All of these functions take two primary arguments x (the first object in the pipe or first argument) and y (object within the st_* function in the pipe or the second argument). The two objects x and y must have identical coordinate reference systems (CRS)

Assigning Points to Polygons

This example concerns a points (LINK Light rail stations) to polygon (Census tract) operation, but other assignment types like polygon-to-polygon or line-to-polygon are possible. The latter types are slightly more complicated as one must make decisions about what to do with corresponding area-level data when, for example, the x polygon overlaps with two y polygons.

The easiest way to two assign points-to-polygons is to use the st_join function, as it will automatically merge any polygon (y) affiliated data onto the point object (x).

Note: In this specific case, kingco_pop has 71 rows per Census tract GEOID. It is my personal preference to think about which variables from y I want appended to x, and filter y before or during the call to st_join. However, as we see below the largest = TRUE argument is specified, which tells st_join to take the data for the feature in y with the largest overlap with x. When the size of overlap is equal (as for many rows attached to the same polygon), it will return the first row. If largest = FALSE, the returned object will have a row for every feature in y that each feature in x lies within or intersects.

## Assign link station to 2020 Census tracts
## Pull ALL columns from *first* appropriate row of kingco_pop onto link_stns 
link_join <- st_join(x = link_stns, y = kingco_pop,
                     largest = TRUE,
                     suffix = c("_link", "_pop"))

## Warning: attribute variables are assumed to be spatially constant throughout
## all geometries

head(link_join)

## Simple feature collection with 6 features and 14 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -122.3367 ymin: 47.58108 xmax: -122.3271 ymax: 47.61154
## Geodetic CRS:  NAD83
## # A tibble: 6 × 15
##   LINK_TYPE NAME_link                      STATION     SEGMENT STATUS STATUSNOTE
##       <int> <chr>                          <chr>       <chr>   <chr>  <chr>     
## 1         1 Westlake Station               Westlake    C       COMPL… <NA>      
## 2         1 University Street Station      University… C       COMPL… <NA>      
## 3         1 Pioneer Square Station         Pioneer Sq… C       COMPL… <NA>      
## 4         1 International District Station Internatio… C       COMPL… <NA>      
## 5         1 Stadium Station                Stadium     C       COMPL… <NA>      
## 6         1 SODO Station                   SODO        C       COMPL… <NA>      
## # ℹ 9 more variables: geometry <POINT [°]>, GEOID <chr>, NAME_pop <chr>,
## #   variable <chr>, label <chr>, race_cat <chr>, race_alone <chr>,
## #   race_two <chr>, value <dbl>

str(link_join)

## sf [51 × 15] (S3: sf/tbl_df/tbl/data.frame)
##  $ LINK_TYPE : int [1:51] 1 1 1 1 1 1 1 1 1 1 ...
##  $ NAME_link : chr [1:51] "Westlake Station" "University Street Station" "Pioneer Square Station" "International District Station" ...
##  $ STATION   : chr [1:51] "Westlake" "University Street" "Pioneer Square" "International District / Chinatown" ...
##  $ SEGMENT   : chr [1:51] "C" "C" "C" "C" ...
##  $ STATUS    : chr [1:51] "COMPLETE" "COMPLETE" "COMPLETE" "COMPLETE" ...
##  $ STATUSNOTE: chr [1:51] NA NA NA NA ...
##  $ geometry  :sfc_POINT of length 51; first list element:  'XY' num [1:2] -122.3 47.6
##  $ GEOID     : chr [1:51] "53033008102" "53033008102" "53033008102" "53033009300" ...
##  $ NAME_pop  : chr [1:51] "Census Tract 81.02, King County, Washington" "Census Tract 81.02, King County, Washington" "Census Tract 81.02, King County, Washington" "Census Tract 93, King County, Washington" ...
##  $ variable  : chr [1:51] "P1_001N" "P1_001N" "P1_001N" "P1_001N" ...
##  $ label     : chr [1:51] " :" " :" " :" " :" ...
##  $ race_cat  : chr [1:51] "Total" "Total" "Total" "Total" ...
##  $ race_alone: chr [1:51] "Total" "Total" "Total" "Total" ...
##  $ race_two  : chr [1:51] "Total" "Total" "Total" "Total" ...
##  $ value     : num [1:51] 3005 3005 3005 3611 3611 ...
##  - attr(*, "sf_column")= chr "geometry"
##  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA NA ...
##   ..- attr(*, "names")= chr [1:14] "LINK_TYPE" "NAME_link" "STATION" "SEGMENT" ...

st_intersects requires more steps and is a more “under-the-hood” work horse of spatial joins in R. Like st_join the function takes an x and y argument for x-to-y assignment, but returns a list of length nrow(x) whose elements are ALL the rows of y each row of x lies within or intersects.

In our example of a points-to-polygons assignment, x is link_stns and y is kingco_pop. Since we have all the 2020 Census population counts for the P1 Race table for each tract in kingco_pop, this vector elements of the list output by st_intersects is either of length 71 (all variables in the P1 table) or length 0 (the LINK station doesn’t fall in a King County Census tract boundary.)

## return list of length(link_stns)
## whose elements are vectors with the row numbers
## of kingco_pop each link stations latlong lies within
link_int <- st_intersects(link_stns, kingco_pop)

## To access list elements use double brackets [[]]
head(link_int[[1]])

## [1] 8876 8877 8878 8879 8880 8881

## Compare to join output
link_join %>% slice(1)

## Simple feature collection with 1 feature and 14 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -122.3367 ymin: 47.61154 xmax: -122.3367 ymax: 47.61154
## Geodetic CRS:  NAD83
## # A tibble: 1 × 15
##   LINK_TYPE NAME_link        STATION  SEGMENT STATUS   STATUSNOTE
##       <int> <chr>            <chr>    <chr>   <chr>    <chr>     
## 1         1 Westlake Station Westlake C       COMPLETE <NA>      
## # ℹ 9 more variables: geometry <POINT [°]>, GEOID <chr>, NAME_pop <chr>,
## #   variable <chr>, label <chr>, race_cat <chr>, race_alone <chr>,
## #   race_two <chr>, value <dbl>

link_int[[1]]

##  [1] 8876 8877 8878 8879 8880 8881 8882 8883 8884 8885 8886 8887 8888 8889 8890
## [16] 8891 8892 8893 8894 8895 8896 8897 8898 8899 8900 8901 8902 8903 8904 8905
## [31] 8906 8907 8908 8909 8910 8911 8912 8913 8914 8915 8916 8917 8918 8919 8920
## [46] 8921 8922 8923 8924 8925 8926 8927 8928 8929 8930 8931 8932 8933 8934 8935
## [61] 8936 8937 8938 8939 8940 8941 8942 8943 8944 8945 8946

## To access elements of a data.frame use single brackets [,]
## with row index before the column and column index after
## e.g. below will print the row of kingco_pop[row-number, ]
## identified in the first vector element [1] 
## of the first list element [[1]]
## of the link_int list object
kingco_pop[link_int[[1]][1], ]

## Simple feature collection with 1 feature and 8 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -122.3423 ymin: 47.60171 xmax: -122.3277 ymax: 47.61268
## Geodetic CRS:  NAD83
## # A tibble: 1 × 9
##   GEOID       NAME             variable label race_cat race_alone race_two value
##   <chr>       <chr>            <chr>    <chr> <chr>    <chr>      <chr>    <dbl>
## 1 53033008102 Census Tract 81… P1_001N  " :"  Total    Total      Total     3005
## # ℹ 1 more variable: geometry <MULTIPOLYGON [°]>

## Now we need to create a vector of length nrow(link_stns)
## whose elements are **single** row-indices of the intersecting polygon
## instead of repeated for each of the 71 variables in the P1 table
link_tract_rows <- rep(NA, nrow(link_stns))
for(stn in 1:nrow(link_stns)){
  ## Take first match (Total Pop) for stations that fall within a tract
  if(length(link_int[[stn]]) > 0){
    link_tract_rows[stn] <- link_int[[stn]][1]
  }else{
    link_tract_rows[stn] <- NA
  }
}
link_tract_rows

##  [1]  8876  8876  8876 31241 31241 31241 25490 26839  4190 33655 12852 11929
## [13] 19313  6249 15905  9586    NA    NA    NA    NA    NA    NA 14911 31028
## [25] 20449 21088 33158    NA    NA 11645  3622 30957  4829 33229 12568 12568
## [37] 28756  8521  8521  8521 11432 11432 31312 10154    NA    NA    NA    NA
## [49]    NA    NA    NA

## Create new dataset combining link stations to total population of tract

link_pop_df <- link_stns

## Create tract and pop columns
link_pop_df$GEOID <- kingco_pop$GEOID[link_tract_rows]
link_pop_df$value <- kingco_pop$value[link_tract_rows]

head(link_pop_df)

## Simple feature collection with 6 features and 8 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -122.3367 ymin: 47.58108 xmax: -122.3271 ymax: 47.61154
## Geodetic CRS:  NAD83
## # A tibble: 6 × 9
##   LINK_TYPE NAME     STATION SEGMENT STATUS STATUSNOTE             geometry
##       <int> <chr>    <chr>   <chr>   <chr>  <chr>               <POINT [°]>
## 1         1 Westlak… Westla… C       COMPL… <NA>       (-122.3367 47.61154)
## 2         1 Univers… Univer… C       COMPL… <NA>        (-122.336 47.60782)
## 3         1 Pioneer… Pionee… C       COMPL… <NA>       (-122.3312 47.60257)
## 4         1 Interna… Intern… C       COMPL… <NA>       (-122.3279 47.59836)
## 5         1 Stadium… Stadium C       COMPL… <NA>       (-122.3271 47.59111)
## 6         1 SODO St… SODO    C       COMPL… <NA>       (-122.3274 47.58108)
## # ℹ 2 more variables: GEOID <chr>, value <dbl>

names(link_pop_df)

## [1] "LINK_TYPE"  "NAME"       "STATION"    "SEGMENT"    "STATUS"    
## [6] "STATUSNOTE" "geometry"   "GEOID"      "value"

## Create flag in kingco_pop that shows which tracts have stations in them
## unlist() turns link_int into one long vector of the unique
## rows in kingco_pop corresponding to tracts that contain a light rail station
unique_stn_tract_rows <- link_int %>% unlist() %>% unique()
kingco_pop$has_lightrail <- 0
kingco_pop$has_lightrail[unique_stn_tract_rows] <- 1
table(kingco_pop$has_lightrail, useNA = "ifany")

## 
##     0     1 
## 33157  1988

## Plot the tracts with a light rail
kingco_pop %>% 
  ## Filter down to just one row per tract
  filter(race_cat == "Total") %>% 
  erase_water(area_threshold = 0.99) %>% 
  mutate(has_lightrail = factor(has_lightrail, levels = 0:1,
                                labels = c("No", "Yes"))) %>% 
  ggplot() +
  geom_sf(aes(fill = has_lightrail), color = "grey80", size = 0.2) +
  scale_fill_manual(name = "Contains Lightrail",
                    values = c("No" = "white", "Yes" = "firebrick")) +
  ggtitle("2020 King County Census Tracts",
          subtitle = "LINK Light rail Stations") +
  theme_void()

## Fetching area water data for your dataset's location...

## Erasing water area...
## If this is slow, try a larger area threshold value.

Calculating distances

Suppose we want to identify all Census tracts within a certain distance of a LINK Light rail station. While the question seems straight forward, in GIS analysis, it requires clear definitions of what we mean by distance.

Distance between spatial features: * points-to-polygons: distance to the polygon centroid? minimum distance to the boundary? * polygon-to-polygon: centroid to centroid? are they neighbors? minimum number of polygons traversed to get from polygon x to polygon y? * point-to-line: minimum distance? maximum distance? road distance?

Type of distance: * Are we calculating distance on a globe? On a 2-dimensional Mercator (UTM) projection? * Driving distance? Shortest distance without crossing over your route?

Our example is relatively simple, and we will take a simple definition of tracts within a certain distance of a light rail station by calculating distances between station GPS locations and tract polygon centroids.

First, we need to get a spatial object with polygon centroids. st_centroid is one of a suite of sf functions the return manipulations of polygon objects. See ?st_centroid for more information.

## st_centroid retains variable info about the tracts
kingco_pts <- st_centroid(kingco_pop)

## Warning: st_centroid assumes attributes are constant over geometries

str(kingco_pts)

## sf [35,145 × 10] (S3: sf/tbl_df/tbl/data.frame)
##  $ GEOID        : chr [1:35145] "53033024400" "53033024400" "53033024400" "53033024400" ...
##  $ NAME         : chr [1:35145] "Census Tract 244, King County, Washington" "Census Tract 244, King County, Washington" "Census Tract 244, King County, Washington" "Census Tract 244, King County, Washington" ...
##  $ variable     : chr [1:35145] "P1_001N" "P1_002N" "P1_003N" "P1_004N" ...
##  $ label        : chr [1:35145] " :" " :Population of one race:" " :Population of one race:White alone" " :Population of one race:Black or African American alone" ...
##  $ race_cat     : chr [1:35145] "Total" "Population of one race" "Population of one race" "Population of one race" ...
##  $ race_alone   : chr [1:35145] "Total" "Total" "White alone" "Black or African American alone" ...
##  $ race_two     : chr [1:35145] "Total" "Population of one race" "Total" "Total" ...
##  $ value        : num [1:35145] 3053 2722 1891 53 9 ...
##  $ geometry     :sfc_POINT of length 35145; first list element:  'XY' num [1:2] -122.2 47.6
##  $ has_lightrail: num [1:35145] 0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "sf_column")= chr "geometry"
##  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA
##   ..- attr(*, "names")= chr [1:9] "GEOID" "NAME" "variable" "label" ...

Now, we can calculate distances between kingco_pts and link_stns using the st_distance function.

tract_to_stn <- st_distance(x = kingco_pts, y = link_stns)

## Output is nrow(kingco_pts) x nrow(link_stns) matrix
## with elements that are distances in meters
head(tract_to_stn)

## Units: [m]
##          [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]     [,8]
## [1,] 9568.917 9377.761 8861.334 8510.782 8301.559 8244.696 7073.274 6035.596
## [2,] 9568.917 9377.761 8861.334 8510.782 8301.559 8244.696 7073.274 6035.596
## [3,] 9568.917 9377.761 8861.334 8510.782 8301.559 8244.696 7073.274 6035.596
## [4,] 9568.917 9377.761 8861.334 8510.782 8301.559 8244.696 7073.274 6035.596
## [5,] 9568.917 9377.761 8861.334 8510.782 8301.559 8244.696 7073.274 6035.596
## [6,] 9568.917 9377.761 8861.334 8510.782 8301.559 8244.696 7073.274 6035.596
##          [,9]    [,10]    [,11]    [,12]   [,13]    [,14]    [,15]    [,16]
## [1,] 6116.759 6775.303 7994.486 14037.46 15914.9 12949.97 11437.25 8790.705
## [2,] 6116.759 6775.303 7994.486 14037.46 15914.9 12949.97 11437.25 8790.705
## [3,] 6116.759 6775.303 7994.486 14037.46 15914.9 12949.97 11437.25 8790.705
## [4,] 6116.759 6775.303 7994.486 14037.46 15914.9 12949.97 11437.25 8790.705
## [5,] 6116.759 6775.303 7994.486 14037.46 15914.9 12949.97 11437.25 8790.705
## [6,] 6116.759 6775.303 7994.486 14037.46 15914.9 12949.97 11437.25 8790.705
##         [,17]    [,18]    [,19]    [,20]    [,21]    [,22]   [,23]    [,24]
## [1,] 40090.46 39800.18 40430.95 40842.37 41105.92 41359.35 10043.1 16206.18
## [2,] 40090.46 39800.18 40430.95 40842.37 41105.92 41359.35 10043.1 16206.18
## [3,] 40090.46 39800.18 40430.95 40842.37 41105.92 41359.35 10043.1 16206.18
## [4,] 40090.46 39800.18 40430.95 40842.37 41105.92 41359.35 10043.1 16206.18
## [5,] 40090.46 39800.18 40430.95 40842.37 41105.92 41359.35 10043.1 16206.18
## [6,] 40090.46 39800.18 40430.95 40842.37 41105.92 41359.35 10043.1 16206.18
##         [,25]   [,26]    [,27]    [,28]    [,29]    [,30]    [,31]    [,32]
## [1,] 18596.57 19050.6 21842.55 23834.61 26733.54 25356.13 30180.05 22056.49
## [2,] 18596.57 19050.6 21842.55 23834.61 26733.54 25356.13 30180.05 22056.49
## [3,] 18596.57 19050.6 21842.55 23834.61 26733.54 25356.13 30180.05 22056.49
## [4,] 18596.57 19050.6 21842.55 23834.61 26733.54 25356.13 30180.05 22056.49
## [5,] 18596.57 19050.6 21842.55 23834.61 26733.54 25356.13 30180.05 22056.49
## [6,] 18596.57 19050.6 21842.55 23834.61 26733.54 25356.13 30180.05 22056.49
##         [,33]    [,34]    [,35]    [,36]    [,37]    [,38]    [,39]    [,40]
## [1,] 1435.076 6612.954 2123.449 3617.767 4263.031 4832.928 5586.573 6206.654
## [2,] 1435.076 6612.954 2123.449 3617.767 4263.031 4832.928 5586.573 6206.654
## [3,] 1435.076 6612.954 2123.449 3617.767 4263.031 4832.928 5586.573 6206.654
## [4,] 1435.076 6612.954 2123.449 3617.767 4263.031 4832.928 5586.573 6206.654
## [5,] 1435.076 6612.954 2123.449 3617.767 4263.031 4832.928 5586.573 6206.654
## [6,] 1435.076 6612.954 2123.449 3617.767 4263.031 4832.928 5586.573 6206.654
##         [,41]    [,42]    [,43]    [,44]    [,45]    [,46]    [,47]    [,48]
## [1,] 8520.565 9487.346 12521.86 12541.15 39678.95 39387.11 39229.85 41276.94
## [2,] 8520.565 9487.346 12521.86 12541.15 39678.95 39387.11 39229.85 41276.94
## [3,] 8520.565 9487.346 12521.86 12541.15 39678.95 39387.11 39229.85 41276.94
## [4,] 8520.565 9487.346 12521.86 12541.15 39678.95 39387.11 39229.85 41276.94
## [5,] 8520.565 9487.346 12521.86 12541.15 39678.95 39387.11 39229.85 41276.94
## [6,] 8520.565 9487.346 12521.86 12541.15 39678.95 39387.11 39229.85 41276.94
##         [,49]    [,50]    [,51]
## [1,] 40599.56 40123.87 39838.37
## [2,] 40599.56 40123.87 39838.37
## [3,] 40599.56 40123.87 39838.37
## [4,] 40599.56 40123.87 39838.37
## [5,] 40599.56 40123.87 39838.37
## [6,] 40599.56 40123.87 39838.37

dim(tract_to_stn)

## [1] 35145    51

If our threshhold for distance is 5 miles, we can create a vector of length nrow(kingco_pts) that is a binary indicator of whether that tract is within 5 miles of any of the nrow(link_stns) light rail stations.

## 1 mile = 1.609344 km
## 1 km = 1000 m

dist_thresh <- 5*1.609344*1000
## apply will apply a function row-wise (MARGIN = 1)
## or column-wise (MARGIN = 2) to a matrix
tract_within_stn <- apply(tract_to_stn, MARGIN = 1, function(row){
  ifelse(sum(row < dist_thresh) >= 1, 1, 0)
})

table(tract_within_stn, useNA = "ifany")

## tract_within_stn
##     0     1  <NA> 
##  7313 27761    71

## Add indicator of within 5 miles to 
kingco_pop$stn_within5 <- tract_within_stn

kingco_pop %>% 
  ## Filter down to just one row per tract
  filter(race_cat == "Total") %>% 
  erase_water(area_threshold = 0.99) %>% 
  mutate(lightrail = case_when(has_lightrail == 1 ~ "Yes",
                               stn_within5 == 1 ~ "Within 5mi",
                               TRUE ~ "Neither"),
         lightrail = factor(lightrail, levels = c("Yes", "Within 5mi",
                                                  "Neither"))) %>% 
  ggplot() +
  geom_sf(aes(fill = lightrail), color = "grey80", size = 0.2) +
  scale_fill_manual(name = "Contains Lightrail",
                    values = c("Yes" = "dodgerblue",
                               "Within 5mi" = alpha("deeppink", 0.65),
                               "Neither" = "white")) +
  ggtitle("2020 King County Census Tracts",
          subtitle = "LINK Light rail Stations") +
  theme_void()

## Fetching area water data for your dataset's location...

## Erasing water area...
## If this is slow, try a larger area threshold value.

Intro to GIS in R

Jessica Godwin

May 20, 2025

Setup

Load Libraries

GIS in R

`sf` and `sp` packages

Other packages & resources

Spatial Data & `R` Objects

`tidycensus` and Polygons

Variable & table names

Tips for searching

Pulling spatial data

A note on `GEOID`s

Dealing with variable labels

Polygon Plots with `ggplot2`

Multiple variables with the same continuous legend

By hand

Multiple variables with the same discrete legend

By hand

Combining spatial data

Coordinate Reference Systems

Combining layers with `ggplot2`

Assigning Points to Polygons

Calculating distances

Intro to GIS in R

Jessica Godwin

May 20, 2025

Setup

Load Libraries

GIS in R

sf and sp packages

Other packages & resources

Spatial Data & R Objects

tidycensus and Polygons

Variable & table names

Tips for searching

Pulling spatial data

A note on GEOIDs

Dealing with variable labels

Polygon Plots with ggplot2

Multiple variables with the same continuous legend

By hand

Multiple variables with the same discrete legend

By hand

Combining spatial data

Coordinate Reference Systems

Combining layers with ggplot2

Assigning Points to Polygons

Calculating distances

`sf` and `sp` packages

Spatial Data & `R` Objects

`tidycensus` and Polygons

A note on `GEOID`s

Polygon Plots with `ggplot2`

Combining layers with `ggplot2`