2023 Spatial Data Analysis
1 INTRODUCTION
The tidycensus package offers a set of functions for retrieving decennial census and American Community Survey (ACS) data via the Census API. We obtain the data through the get_acs function, which can also return geometry data; here we work with the American Community Survey (2015–2019) dataset.
It may be worthwhile to add a progress_bar = FALSE argument to a get_acs function call, especially when working within an R Markdown or Quarto document. This way, one can avoid the progress bar printing when the document is rendered.
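As a sketch of what such a call might look like (the variable codes, object name, and state are illustrative assumptions, and the call requires a Census API key):

```r
library(tidycensus)

# Illustrative sketch: tract-level White and Hispanic counts for California
# with geometry from the 2015-2019 five-year ACS. The variable codes are
# assumptions; verify them with load_variables(2019, "acs5").
ca_acs <- get_acs(
  geography = "tract",
  variables = c(white = "B03002_003", hispanic = "B03002_012"),
  state = "CA",
  geometry = TRUE,
  year = 2019,
  progress_bar = FALSE  # suppress the progress bar when rendering
)
```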
1.1 EXPLORATORY ANALYSIS: DISSIMILARITY
With the spatial data in hand, we can explore the data more in depth. Here, the segregation package offers a dissimilarity index function (conveniently named dissimilarity). The function returns the total segregation between a group and a unit using the Index of Dissimilarity (Elbers 2023). Importantly, the dissimilarity index considers differences between two distinct groups. As a first step, we compute the dissimilarity index between Hispanic and White residents in San Francisco–Oakland.
see code
stat est
<char> <num>
1: D 0.5135526
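For intuition, the Index of Dissimilarity is D = ½ Σᵢ |aᵢ/A − bᵢ/B|, where aᵢ and bᵢ are the two groups' counts in tract i and A and B are their totals. A minimal base-R sketch on invented tract counts (the numbers are made up purely for illustration, not drawn from the data above):

```r
# Index of Dissimilarity on toy data: D = 0.5 * sum(|a_i/A - b_i/B|),
# where a_i, b_i are group counts in tract i and A, B are the totals.
hispanic <- c(100, 300, 50, 200)   # invented tract counts
white    <- c(400, 100, 250, 150)

D <- 0.5 * sum(abs(hispanic / sum(hispanic) - white / sum(white)))
round(D, 3)  # -> 0.491
```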
To add context to the dissimilarity index above, we can compare it across regions. Below, we split the data by urban-area name, apply the function across those groups, and finally combine the outputs. The approach below is slightly different from that contained in the book, which offers a tidier and more succinct method.
see code
Group_Wise <- split(ca_urban_data, ca_urban_data$urban_name)
Group_Wise <- lapply(Group_Wise, function(x) {
  x |>
    filter(variable %in% c("white", "hispanic")) %>%
    dissimilarity(group = "variable",
                  unit = "GEOID",
                  weight = "estimate")
})
Group_Wise <- do.call(bind_rows, map2(Group_Wise, names(Group_Wise), function(x, y) {
  x |>
    mutate(urban_name = y)
}))
Across urban areas, Los Angeles–Long Beach has the highest dissimilarity index at 0.599. The dissimilarity index ranges from 0 to 1, where 0 represents perfect integration between two groups and 1 represents complete segregation (Walker 2023, 215). The table below provides some context alongside our earlier dissimilarity index value for San Francisco–Oakland.
see code
| urban_name | stat | est |
|---|---|---|
| Los Angeles--Long Beach--Anaheim, CA Urbanized Area (2010) | D | 0.5999229 |
| San Francisco--Oakland, CA Urbanized Area (2010) | D | 0.5135526 |
| San Jose, CA Urbanized Area (2010) | D | 0.4935633 |
| San Diego, CA Urbanized Area (2010) | D | 0.4898184 |
| Riverside--San Bernardino, CA Urbanized Area (2010) | D | 0.4079863 |
| Sacramento, CA Urbanized Area (2010) | D | 0.3687927 |
Among the urban areas, Los Angeles and San Francisco are the most segregated between Hispanic and White residents, while Riverside–San Bernardino and Sacramento show the least segregation. We can expand on the dissimilarity index by considering more than two groups. Again, we rely on the segregation package's implementation of the Mutual Information Index (M) and Theil's Entropy Index (H). These indices measure diversity and segregation across multiple groups (Walker 2023, 217) in California urban areas.
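To make the two indices concrete, here is a minimal base-R sketch on invented counts (not the segregation package's implementation): M is a population-weighted average divergence of each unit's group shares from the overall shares, and H is M normalised by the overall entropy, so it lies in [0, 1].

```r
# Toy multigroup example of the M index and Theil's H.
# counts[u, g] = population of group g in unit u (invented numbers).
counts <- rbind(c(50, 10,  5),
                c(10, 40, 10),
                c( 5, 10, 60))
p_g  <- colSums(counts) / sum(counts)   # overall group shares
t_u  <- rowSums(counts) / sum(counts)   # unit population weights
p_gu <- counts / rowSums(counts)        # within-unit group shares

# M: weighted divergence of unit-level shares from the overall shares
M <- sum(t_u * rowSums(p_gu * log(sweep(p_gu, 2, p_g, "/"))))
E <- -sum(p_g * log(p_g))               # overall entropy
H <- M / E                              # Theil's H, bounded by [0, 1]
```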
see code
mutual_within(data = ca_urban_data,
group = "variable",
unit = "GEOID",
weight = "estimate",
within = "urban_name",
wide = TRUE) |>
arrange(desc(H)) |>
gt() |>
cols_label(
urban_name = "URBAN NAME",
M = "M Index (M)",
H = "H Index (H)",
p = "Proportion of the category (p)",
ent_ratio = "Entropy Ratio"
) |>
gt_theme_espn()
| URBAN NAME | M Index (M) | Proportion of the category (p) | H Index (H) | Entropy Ratio |
|---|---|---|---|---|
| Los Angeles--Long Beach--Anaheim, CA Urbanized Area (2010) | 0.3391033 | 0.50163709 | 0.2851662 | 0.9693226 |
| San Francisco--Oakland, CA Urbanized Area (2010) | 0.2685992 | 0.13945223 | 0.2116127 | 1.0346590 |
| San Diego, CA Urbanized Area (2010) | 0.2290891 | 0.12560720 | 0.2025728 | 0.9218445 |
| San Jose, CA Urbanized Area (2010) | 0.2147445 | 0.07282785 | 0.1829190 | 0.9569681 |
| Sacramento, CA Urbanized Area (2010) | 0.1658898 | 0.07369482 | 0.1426804 | 0.9477412 |
| Riverside--San Bernardino, CA Urbanized Area (2010) | 0.1497129 | 0.08678082 | 0.1408461 | 0.8664604 |
The results of the multigroup indices are largely similar, with Los Angeles remaining the most segregated urban area in California. Los Angeles is a large area; hence, it may be worthwhile to extend the analysis to the local level. Local analysis is a more granular approach to understanding the differences.
see code
la_local_seg <- ca_urban_data %>%
filter(str_detect(urban_name,"Los Angeles")) %>%
mutual_local(
group = "variable",
unit = "GEOID",
weight = "estimate",
wide = TRUE
)
la_tracts_seg <- tracts("CA", cb = TRUE, year = 2019) %>%
inner_join(la_local_seg, by = "GEOID")
tmap_mode("view")
tm_shape(la_tracts_seg) +
tm_borders("black",lwd = .5)+
tm_polygons("ls",
palette = "viridis",
title = "Local\nsegregation index")

California Area Diversity Gradient
2 REGRESSION MODELLING
We proceed to modelling on the dataset, although our area of interest changes to the Dallas–Fort Worth metropolitan area. A good starting point would be fitting an Ordinary Least Squares (OLS) model. Walker (2023, 221) highlights a few glaring issues with this approach, namely collinearity and spatial autocorrelation. Collinearity is likely to emerge from the high correlation among census variables; spatial autocorrelation may occur since spatially proximate observations are not necessarily statistically independent. Recall that both of these features violate OLS assumptions. There are several methods to handle collinearity, such as Principal Component Analysis, Partial Least Squares and other dimension-reduction techniques. We can also employ a number of tests for autocorrelation, later in the post. The remainder of the section will be as follows: i) data collection, ii) data visualisation, iii) feature engineering and iv) spatial regression.
The code snippet below contains the counties of interest along with the variables to be used, the data transformation to the appropriate coordinate reference system and the data visualisation. Coordinate reference systems, like dates, are a special type of data (read: rabbit hole); Walker (2023, 109) and Lovelace, Nowosad, and Münchow (2019, 41–43) provide a decent introduction and R methods for working with coordinate reference systems. As a result, we don't adjust the code in any form.
see code
library(tidycensus)
library(tidyverse)
library(sf)
library(units)
library(patchwork)
library(scales)
dfw_counties <- c("Collin County", "Dallas", "Denton",
"Ellis", "Hunt", "Kaufman", "Rockwall",
"Johnson", "Parker", "Tarrant", "Wise")
variables_to_get <- c(
median_value = "B25077_001",
median_rooms = "B25018_001",
median_income = "DP03_0062",
total_population = "B01003_001",
median_age = "B01002_001",
pct_college = "DP02_0068P",
pct_foreign_born = "DP02_0094P",
pct_white = "DP05_0077P",
median_year_built = "B25037_001",
percent_ooh = "DP04_0046P"
)
dfw_data <- get_acs(
geography = "tract",
variables = variables_to_get,
state = "TX",
county = dfw_counties,
geometry = TRUE,
output = "wide",
year = 2020,
progress_bar = FALSE
) %>%
select(-NAME)
see code
We have explored our outcome variable. The next step is a bit of data processing. First, we create a population density variable over a standard unit of area (people per square kilometre), along with a median structure age variable.
see code
dfw_data_for_model <- dfw_data %>%
mutate(
pop_density = as.numeric(set_units(total_populationE /st_area(.),"1/km2")),
median_structure_age = 2018 - median_year_builtE) %>%
select(!ends_with("M")) %>%
rename_with(.fn = ~str_remove(.x, "E$")) %>%
na.omit()
formula2 <- paste0("log(median_value) ~ median_rooms + pct_college + ",
                   "pct_foreign_born + pct_white + median_age + ",
                   "median_structure_age + percent_ooh + pop_density + ",
                   "total_population")
model2 <- lm(formula2, data = dfw_data_for_model)
dfw_data_for_model$residuals <- model2$residuals
Moran's I sums the deviations of values at a given distance lag from the variable's mean (Fortin and Dale 2009, 93). We can use this test for spatial autocorrelation to determine whether nearby locations affect one another. Effectively, we test the null hypothesis that nearby locations do not affect each other, i.e. they are independent and spatially random. Below, we find positive spatial autocorrelation in the residuals, a violation of the model independence assumption (Walker 2023, 236).
see code
Moran I test under randomisation
data: dfw_data_for_model$residuals
weights: wts
Moran I statistic standard deviate = 14.017, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Moran I statistic Expectation Variance
0.2101039271 -0.0006418485 0.0002260366
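The spatial weights object wts referenced in the test output is not constructed in this excerpt. A sketch of one way to build it with the spdep package, assuming queen-contiguity neighbours (an assumption; k-nearest-neighbour weights are another common choice):

```r
library(spdep)

# Build row-standardised spatial weights from queen contiguity
# (an illustrative assumption, not necessarily the weights used above).
nb  <- poly2nb(dfw_data_for_model, queen = TRUE)
wts <- nb2listw(nb, style = "W", zero.policy = TRUE)

# Moran's I test on the OLS residuals
moran.test(dfw_data_for_model$residuals, wts)
```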
To expand the analysis, we fit a spatial lag model using the spatialreg package. The package's authors have also written a review of software for spatial econometrics (R. Bivand, Millo, and Piras 2021), providing an overview of spatial modelling approaches in R, including spatial panel models, cross-sectional models, spatial lag and spatial error models. Below, we fit a spatial lag model utilising the previously created spatial weights.
Call:
lm(formula = formula2, data = dfw_data_for_model)
Residuals:
Min 1Q Median 3Q Max
-1.91753 -0.15319 -0.00222 0.16192 1.58950
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.101e+01 6.202e-02 177.571 < 2e-16 ***
median_rooms 7.326e-02 9.497e-03 7.714 2.18e-14 ***
pct_college 1.775e-02 4.862e-04 36.507 < 2e-16 ***
pct_foreign_born 4.170e-03 8.284e-04 5.034 5.38e-07 ***
pct_white 4.996e-03 4.862e-04 10.275 < 2e-16 ***
median_age 3.527e-03 1.429e-03 2.468 0.0137 *
median_structure_age 2.831e-05 2.696e-05 1.050 0.2939
percent_ooh -3.888e-03 5.798e-04 -6.705 2.81e-11 ***
pop_density -5.479e-06 6.433e-06 -0.852 0.3945
total_population 9.712e-06 4.658e-06 2.085 0.0372 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2815 on 1549 degrees of freedom
Multiple R-squared: 0.7618, Adjusted R-squared: 0.7604
F-statistic: 550.4 on 9 and 1549 DF, p-value: < 2.2e-16
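The fitting calls matching the spatial model summaries below are not shown in this excerpt. A sketch, assuming the spatialreg package, the formula2 object and the wts weights list from earlier (object names lag_model and error_model are assumptions):

```r
library(spatialreg)

# Spatial lag model: adds a spatially lagged outcome term (rho)
lag_model <- lagsarlm(formula2, data = dfw_data_for_model, listw = wts)

# Spatial error model: models spatial dependence in the error term (lambda)
error_model <- errorsarlm(formula2, data = dfw_data_for_model, listw = wts)

summary(lag_model, Nagelkerke = TRUE)
summary(error_model, Nagelkerke = TRUE)
```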
see code
Call:lagsarlm(formula = formula2, data = dfw_data_for_model, listw = wts)
Residuals:
Min 1Q Median 3Q Max
-2.0648709 -0.1377217 -0.0033412 0.1393351 1.4820558
Type: lag
Coefficients: (asymptotic standard errors)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 7.0145e+00 2.6904e-01 26.0726 < 2.2e-16
median_rooms 6.1989e-02 8.8554e-03 7.0002 2.556e-12
pct_college 1.2854e-02 5.4694e-04 23.5023 < 2.2e-16
pct_foreign_born 2.0050e-03 7.7480e-04 2.5877 0.009661
pct_white 2.7053e-03 4.7181e-04 5.7340 9.810e-09
median_age 3.4313e-03 1.3162e-03 2.6070 0.009135
median_structure_age 2.6056e-05 2.4826e-05 1.0496 0.293915
percent_ooh -3.0388e-03 5.4319e-04 -5.5944 2.213e-08
pop_density -1.3587e-05 5.9343e-06 -2.2896 0.022045
total_population 8.4044e-06 4.2925e-06 1.9579 0.050238
Rho: 0.35357, LR test value: 211.02, p-value: < 2.22e-16
Asymptotic standard error: 0.023382
z-value: 15.122, p-value: < 2.22e-16
Wald statistic: 228.66, p-value: < 2.22e-16
Log likelihood: -125.2087 for lag model
ML residual variance (sigma squared): 0.067169, (sigma: 0.25917)
Nagelkerke pseudo-R-squared: 0.79195
Number of observations: 1559
Number of parameters estimated: 12
AIC: 274.42, (AIC for lm: 483.43)
LM test for residual autocorrelation
test value: 6.8631, p-value: 0.0087993
see code
Call:errorsarlm(formula = formula2, data = dfw_data_for_model, listw = wts)
Residuals:
Min 1Q Median 3Q Max
-1.97995041 -0.13648465 -0.00053395 0.13927656 1.54937147
Type: error
Coefficients: (asymptotic standard errors)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.1098e+01 6.6718e-02 166.3419 < 2.2e-16
median_rooms 8.3008e-02 9.7138e-03 8.5453 < 2.2e-16
pct_college 1.5855e-02 5.7435e-04 27.6047 < 2.2e-16
pct_foreign_born 3.6574e-03 9.6580e-04 3.7869 0.0001526
pct_white 4.6693e-03 6.1196e-04 7.6301 2.354e-14
median_age 3.9280e-03 1.4129e-03 2.7801 0.0054342
median_structure_age 2.6011e-05 2.5450e-05 1.0221 0.3067460
percent_ooh -4.7624e-03 5.6765e-04 -8.3897 < 2.2e-16
pop_density -1.5043e-05 6.8758e-06 -2.1878 0.0286842
total_population 1.0553e-05 4.4676e-06 2.3621 0.0181721
Lambda: 0.46781, LR test value: 164.17, p-value: < 2.22e-16
Asymptotic standard error: 0.031993
z-value: 14.622, p-value: < 2.22e-16
Wald statistic: 213.81, p-value: < 2.22e-16
Log likelihood: -148.6307 for error model
ML residual variance (sigma squared): 0.067876, (sigma: 0.26053)
Nagelkerke pseudo-R-squared: 0.7856
Number of observations: 1559
Number of parameters estimated: 12
AIC: 321.26, (AIC for lm: 483.43)
Unlike the OLS model, the spatial lag model accounts for spatial spillover effects. We can find these differences by considering the model parameters; in addition, the spatial lag model reports a pseudo-R-squared, which can serve as a comparison with the OLS model's R-squared value. The spatial error model, by contrast, considers spatial lag in the model's error term. Both the spatial lag and spatial error models are crucial for accounting for spatial autocorrelation, and selecting the appropriate model depends on the type of analysis conducted (Walker 2023, 239). We can also rely on Moran's I to compare the remaining spatial dependence in our models; in our case, the spatial error model outperforms the rest.
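The comparison below can be produced by applying moran.test to each model's residuals. A sketch, assuming fitted model objects lag_model and error_model (names are assumptions) and the weights list wts:

```r
# Remaining spatial autocorrelation in each spatial model's residuals;
# the list names match the labels in the output below.
lapply(
  list(lag_resid   = residuals(lag_model),
       error_resid = residuals(error_model)),
  function(x) moran.test(x, wts)
)
```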
see code
$lag_resid
Moran I test under randomisation
data: x
weights: wts
Moran I statistic standard deviate = 2.0342, p-value = 0.02097
alternative hypothesis: greater
sample estimates:
Moran I statistic Expectation Variance
0.0299263361 -0.0006418485 0.0002258128
$error_resid
Moran I test under randomisation
data: x
weights: wts
Moran I statistic standard deviate = -1.618, p-value = 0.9472
alternative hypothesis: greater
sample estimates:
Moran I statistic Expectation Variance
-0.0249563142 -0.0006418485 0.0002258232
3 CONCLUSION
In this post, we set out to understand the process of modelling spatial data. Fortunately, Walker (2023) provides a detailed book on this topic. The book and the code it contains are fantastic resources for a novice like myself, and it is freely available online. R. S. Bivand, Pebesma, and Gómez-Rubio (2013), Schabenberger and Gotway (2017), Lovelace, Nowosad, and Münchow (2019), Oyana (2021) and Pebesma and Bivand (2023), among others, offer R titles in spatial analysis. These titles offer a good blend of theory and practical application.