2023 Spatial Data Analysis
1 INTRODUCTION
The tidycensus package offers a set of functions for retrieving decennial census and American Community Survey (ACS) data via the Census API. We obtain the data through the get_acs function, which can also return geometry data; here we work with the American Community Survey (2015–2019) dataset.
It may be worthwhile to add a progress_bar = FALSE argument to a get_acs function call, especially when working within an R Markdown or Quarto document. This way, one can avoid the progress bar printing when the document is rendered.
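As a sketch of what such a call might look like (the variable codes, object name, and state are illustrative assumptions, and the call requires a Census API key):

```r
library(tidycensus)

# Illustrative sketch: tract-level White and Hispanic counts for California
# with geometry from the 2015-2019 five-year ACS. The variable codes are
# assumptions; verify them with load_variables(2019, "acs5").
ca_acs <- get_acs(
  geography = "tract",
  variables = c(white = "B03002_003", hispanic = "B03002_012"),
  state = "CA",
  geometry = TRUE,
  year = 2019,
  progress_bar = FALSE  # suppress the progress bar when rendering
)
```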
1.1 EXPLORATORY ANALYSIS: DISSIMILARITY
With the spatial data in hand, we can explore the data more in depth. Here, the segregation package offers a dissimilarity index function (conveniently named dissimilarity). The function returns the total segregation between a group and a unit using the Index of Dissimilarity (Elbers 2023). Importantly, the dissimilarity index considers differences between two distinct groups. As a first step, we compute the dissimilarity index between Hispanic and White residents in San Francisco–Oakland.
see code
stat est
<char> <num>
1: D 0.5135526
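For intuition, the Index of Dissimilarity is D = ½ Σᵢ |aᵢ/A − bᵢ/B|, where aᵢ and bᵢ are the two groups' counts in tract i and A and B are their totals. A minimal base-R sketch on invented tract counts (the numbers are made up purely for illustration, not drawn from the data above):

```r
# Index of Dissimilarity on toy data: D = 0.5 * sum(|a_i/A - b_i/B|),
# where a_i, b_i are group counts in tract i and A, B are the totals.
hispanic <- c(100, 300, 50, 200)   # invented tract counts
white    <- c(400, 100, 250, 150)

D <- 0.5 * sum(abs(hispanic / sum(hispanic) - white / sum(white)))
round(D, 3)  # -> 0.491
```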
To add context to the dissimilarity index above, we can compare it across regions. Below, we split the data by urban-area name, apply the function across those groups, and finally combine the outputs. The approach below is slightly different from that contained in the book, which offers a tidier and more succinct method.
see code
Group_Wise <- split(ca_urban_data, ca_urban_data$urban_name)
Group_Wise <- lapply(Group_Wise, function(x) {
  x |>
    filter(variable %in% c("white", "hispanic")) %>%
    dissimilarity(group = "variable",
                  unit = "GEOID",
                  weight = "estimate")
})
Group_Wise <- do.call(bind_rows, map2(Group_Wise, names(Group_Wise), function(x, y) {
  x |>
    mutate(urban_name = y)
}))
Across urban areas, Los Angeles–Long Beach has the highest dissimilarity index at 0.599. The dissimilarity index ranges from 0 to 1, where 0 represents perfect integration between two groups and 1 represents complete segregation (Walker 2023, 215). The table below provides some context alongside our earlier dissimilarity index value for San Francisco–Oakland.
see code
| urban_name | stat | est |
|---|---|---|
| Los Angeles--Long Beach--Anaheim, CA Urbanized Area (2010) | D | 0.5999229 |
| San Francisco--Oakland, CA Urbanized Area (2010) | D | 0.5135526 |
| San Jose, CA Urbanized Area (2010) | D | 0.4935633 |
| San Diego, CA Urbanized Area (2010) | D | 0.4898184 |
| Riverside--San Bernardino, CA Urbanized Area (2010) | D | 0.4079863 |
| Sacramento, CA Urbanized Area (2010) | D | 0.3687927 |
Among the urban areas, Los Angeles and San Francisco are the most segregated between Hispanic and White residents, while Riverside–San Bernardino and Sacramento show the least segregation. We can expand on the dissimilarity index by considering more than two groups. Again, we rely on the segregation package's implementation of the Mutual Information Index (M) and Theil's Entropy Index (H). These indices measure diversity and segregation across multiple groups (Walker 2023, 217) in California urban areas.
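To make the two indices concrete, here is a minimal base-R sketch on invented counts (not the segregation package's implementation): M is a population-weighted average divergence of each unit's group shares from the overall shares, and H is M normalised by the overall entropy, so it lies in [0, 1].

```r
# Toy multigroup example of the M index and Theil's H.
# counts[u, g] = population of group g in unit u (invented numbers).
counts <- rbind(c(50, 10,  5),
                c(10, 40, 10),
                c( 5, 10, 60))
p_g  <- colSums(counts) / sum(counts)   # overall group shares
t_u  <- rowSums(counts) / sum(counts)   # unit population weights
p_gu <- counts / rowSums(counts)        # within-unit group shares

# M: weighted divergence of unit-level shares from the overall shares
M <- sum(t_u * rowSums(p_gu * log(sweep(p_gu, 2, p_g, "/"))))
E <- -sum(p_g * log(p_g))               # overall entropy
H <- M / E                              # Theil's H, bounded by [0, 1]
```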
see code
mutual_within(data = ca_urban_data,
group = "variable",
unit = "GEOID",
weight = "estimate",
within = "urban_name",
wide = TRUE) |>
arrange(desc(H)) |>
gt() |>
cols_label(
urban_name = "URBAN NAME",
M = "M Index (M)",
H = "H Index (H)",
p = "Proportion of the category (p)",
ent_ratio = "Entropy Ratio"
) |>
gt_theme_espn()
| URBAN NAME | M Index (M) | Proportion of the category (p) | H Index (H) | Entropy Ratio |
|---|---|---|---|---|
| Los Angeles--Long Beach--Anaheim, CA Urbanized Area (2010) | 0.3391033 | 0.50163709 | 0.2851662 | 0.9693226 |
| San Francisco--Oakland, CA Urbanized Area (2010) | 0.2685992 | 0.13945223 | 0.2116127 | 1.0346590 |
| San Diego, CA Urbanized Area (2010) | 0.2290891 | 0.12560720 | 0.2025728 | 0.9218445 |
| San Jose, CA Urbanized Area (2010) | 0.2147445 | 0.07282785 | 0.1829190 | 0.9569681 |
| Sacramento, CA Urbanized Area (2010) | 0.1658898 | 0.07369482 | 0.1426804 | 0.9477412 |
| Riverside--San Bernardino, CA Urbanized Area (2010) | 0.1497129 | 0.08678082 | 0.1408461 | 0.8664604 |
The results of the multigroup indices are largely similar, with Los Angeles remaining the most segregated urban area in California. Los Angeles is a large area; hence, it may be worthwhile to extend the analysis to the local level. Local analysis is a more granular approach to understanding the differences.
see code
la_local_seg <- ca_urban_data %>%
filter(str_detect(urban_name,"Los Angeles")) %>%
mutual_local(
group = "variable",
unit = "GEOID",
weight = "estimate",
wide = TRUE
)
la_tracts_seg <- tracts("CA", cb = TRUE, year = 2019) %>%
inner_join(la_local_seg, by = "GEOID")
tmap_mode("view")
tm_shape(la_tracts_seg) +
tm_borders("black",lwd = .5)+
tm_polygons("ls",
palette = "viridis",
title = "Local\nsegregation index")

California Area Diversity Gradient
2 REGRESSION MODELLING
We proceed to modelling on the dataset, although our area of interest changes to the Dallas–Fort Worth metropolitan area. A good starting point would be fitting an Ordinary Least Squares (OLS) model. Walker (2023, 221) highlights a few glaring issues with this approach, namely collinearity and spatial autocorrelation. Collinearity is likely to emerge from the high correlation among census variables; spatial autocorrelation may occur since spatially proximate observations are not necessarily statistically independent. Recall that both of these features violate OLS assumptions. There are several methods to handle collinearity, such as Principal Component Analysis, Partial Least Squares and other dimension-reduction techniques. We can also employ a number of tests for autocorrelation, later in the post. The remainder of the section will be as follows: i) data collection, ii) data visualisation, iii) feature engineering and iv) spatial regression.
The code snippet below contains the counties of interest along with the variables to be used, the data transformation to the appropriate coordinate reference system and the data visualisation. Coordinate reference systems, like dates, are a special type of data (read: rabbit hole); Walker (2023, 109) and Lovelace, Nowosad, and Münchow (2019, 41–43) provide a decent introduction and R methods for working with coordinate reference systems. As a result, we don't adjust the code in any form.
see code
library(tidycensus)
library(tidyverse)
library(sf)
library(units)
library(patchwork)
library(scales)
dfw_counties <- c("Collin County", "Dallas", "Denton",
"Ellis", "Hunt", "Kaufman", "Rockwall",
"Johnson", "Parker", "Tarrant", "Wise")
variables_to_get <- c(
median_value = "B25077_001",
median_rooms = "B25018_001",
median_income = "DP03_0062",
total_population = "B01003_001",
median_age = "B01002_001",
pct_college = "DP02_0068P",
pct_foreign_born = "DP02_0094P",
pct_white = "DP05_0077P",
median_year_built = "B25037_001",
percent_ooh = "DP04_0046P"
)
dfw_data <- get_acs(
geography = "tract",
variables = variables_to_get,
state = "TX",
county = dfw_counties,
geometry = TRUE,
output = "wide",
year = 2020,
progress_bar = FALSE
) %>%
select(-NAME)
see code
We have explored our outcome variable. The next step is a bit of data processing. First, we create a population density variable over a standard unit of area (people per square kilometre), along with a median structure age variable.
see code
dfw_data_for_model <- dfw_data %>%
mutate(
pop_density = as.numeric(set_units(total_populationE /st_area(.),"1/km2")),
median_structure_age = 2018 - median_year_builtE) %>%
select(!ends_with("M")) %>%
rename_with(.fn = ~str_remove(.x, "E$")) %>%
na.omit()
formula2 <- paste0("log(median_value) ~ median_rooms + pct_college + ",
                   "pct_foreign_born + pct_white + median_age + ",
                   "median_structure_age + percent_ooh + pop_density + ",
                   "total_population")
model2 <- lm(formula2, data = dfw_data_for_model)
dfw_data_for_model$residuals <- model2$residuals
Moran's I sums the deviations of values at a given distance lag from the variable's mean (Fortin and Dale 2009, 93). We can use this test for spatial autocorrelation to determine whether nearby locations affect one another. Effectively, we test the null hypothesis that nearby locations do not affect each other, i.e. they are independent and spatially random. Below, we find positive spatial autocorrelation in the residuals, a violation of the model independence assumption (Walker 2023, 236).
see code
Moran I test under randomisation
data: dfw_data_for_model$residuals
weights: wts
Moran I statistic standard deviate = 14.017, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Moran I statistic Expectation Variance
0.2101039271 -0.0006418485 0.0002260366
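The spatial weights object wts referenced in the test output is not constructed in this excerpt. A sketch of one way to build it with the spdep package, assuming queen-contiguity neighbours (an assumption; k-nearest-neighbour weights are another common choice):

```r
library(spdep)

# Build row-standardised spatial weights from queen contiguity
# (an illustrative assumption, not necessarily the weights used above).
nb  <- poly2nb(dfw_data_for_model, queen = TRUE)
wts <- nb2listw(nb, style = "W", zero.policy = TRUE)

# Moran's I test on the OLS residuals
moran.test(dfw_data_for_model$residuals, wts)
```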
To expand the analysis, we fit a spatial lag model using the spatialreg package. The package's authors have also written a review of software for spatial econometrics (R. Bivand, Millo, and Piras 2021), providing an overview of spatial modelling approaches in R, including spatial panel models, cross-sectional models, spatial lag and spatial error models. Below, we fit a spatial lag model utilising the previously created spatial weights.
Call:
lm(formula = formula2, data = dfw_data_for_model)
Residuals:
Min 1Q Median 3Q Max
-1.91753 -0.15319 -0.00222 0.16192 1.58950
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.101e+01 6.202e-02 177.571 < 2e-16 ***
median_rooms 7.326e-02 9.497e-03 7.714 2.18e-14 ***
pct_college 1.775e-02 4.862e-04 36.507 < 2e-16 ***
pct_foreign_born 4.170e-03 8.284e-04 5.034 5.38e-07 ***
pct_white 4.996e-03 4.862e-04 10.275 < 2e-16 ***
median_age 3.527e-03 1.429e-03 2.468 0.0137 *
median_structure_age 2.831e-05 2.696e-05 1.050 0.2939
percent_ooh -3.888e-03 5.798e-04 -6.705 2.81e-11 ***
pop_density -5.479e-06 6.433e-06 -0.852 0.3945
total_population 9.712e-06 4.658e-06 2.085 0.0372 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2815 on 1549 degrees of freedom
Multiple R-squared: 0.7618, Adjusted R-squared: 0.7604
F-statistic: 550.4 on 9 and 1549 DF, p-value: < 2.2e-16
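The fitting calls matching the spatial model summaries below are not shown in this excerpt. A sketch, assuming the spatialreg package, the formula2 object and the wts weights list from earlier (object names lag_model and error_model are assumptions):

```r
library(spatialreg)

# Spatial lag model: adds a spatially lagged outcome term (rho)
lag_model <- lagsarlm(formula2, data = dfw_data_for_model, listw = wts)

# Spatial error model: models spatial dependence in the error term (lambda)
error_model <- errorsarlm(formula2, data = dfw_data_for_model, listw = wts)

summary(lag_model, Nagelkerke = TRUE)
summary(error_model, Nagelkerke = TRUE)
```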
see code
Call:lagsarlm(formula = formula2, data = dfw_data_for_model, listw = wts)
Residuals:
Min 1Q Median 3Q Max
-2.0648709 -0.1377217 -0.0033412 0.1393351 1.4820558
Type: lag
Coefficients: (asymptotic standard errors)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 7.0145e+00 2.6904e-01 26.0726 < 2.2e-16
median_rooms 6.1989e-02 8.8554e-03 7.0002 2.556e-12
pct_college 1.2854e-02 5.4694e-04 23.5023 < 2.2e-16
pct_foreign_born 2.0050e-03 7.7480e-04 2.5877 0.009661
pct_white 2.7053e-03 4.7181e-04 5.7340 9.810e-09
median_age 3.4313e-03 1.3162e-03 2.6070 0.009135
median_structure_age 2.6056e-05 2.4826e-05 1.0496 0.293915
percent_ooh -3.0388e-03 5.4319e-04 -5.5944 2.213e-08
pop_density -1.3587e-05 5.9343e-06 -2.2896 0.022045
total_population 8.4044e-06 4.2925e-06 1.9579 0.050238
Rho: 0.35357, LR test value: 211.02, p-value: < 2.22e-16
Asymptotic standard error: 0.023382
z-value: 15.122, p-value: < 2.22e-16
Wald statistic: 228.66, p-value: < 2.22e-16
Log likelihood: -125.2087 for lag model
ML residual variance (sigma squared): 0.067169, (sigma: 0.25917)
Nagelkerke pseudo-R-squared: 0.79195
Number of observations: 1559
Number of parameters estimated: 12
AIC: 274.42, (AIC for lm: 483.43)
LM test for residual autocorrelation
test value: 6.8631, p-value: 0.0087993
see code
Call:errorsarlm(formula = formula2, data = dfw_data_for_model, listw = wts)
Residuals:
Min 1Q Median 3Q Max
-1.97995041 -0.13648465 -0.00053395 0.13927656 1.54937147
Type: error
Coefficients: (asymptotic standard errors)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.1098e+01 6.6718e-02 166.3419 < 2.2e-16
median_rooms 8.3008e-02 9.7138e-03 8.5453 < 2.2e-16
pct_college 1.5855e-02 5.7435e-04 27.6047 < 2.2e-16
pct_foreign_born 3.6574e-03 9.6580e-04 3.7869 0.0001526
pct_white 4.6693e-03 6.1196e-04 7.6301 2.354e-14
median_age 3.9280e-03 1.4129e-03 2.7801 0.0054342
median_structure_age 2.6011e-05 2.5450e-05 1.0221 0.3067460
percent_ooh -4.7624e-03 5.6765e-04 -8.3897 < 2.2e-16
pop_density -1.5043e-05 6.8758e-06 -2.1878 0.0286842
total_population 1.0553e-05 4.4676e-06 2.3621 0.0181721
Lambda: 0.46781, LR test value: 164.17, p-value: < 2.22e-16
Asymptotic standard error: 0.031993
z-value: 14.622, p-value: < 2.22e-16
Wald statistic: 213.81, p-value: < 2.22e-16
Log likelihood: -148.6307 for error model
ML residual variance (sigma squared): 0.067876, (sigma: 0.26053)
Nagelkerke pseudo-R-squared: 0.7856
Number of observations: 1559
Number of parameters estimated: 12
AIC: 321.26, (AIC for lm: 483.43)
Unlike the OLS model, the spatial lag model accounts for spatial spillover effects. We can find these differences by considering the model parameters; in addition, the spatial lag model reports a pseudo-R-squared, which can serve as a comparison with the OLS model's R-squared value. The spatial error model, by contrast, considers spatial lag in the model's error term. Both the spatial lag and spatial error models are crucial for accounting for spatial autocorrelation, and selecting the appropriate model depends on the type of analysis conducted (Walker 2023, 239). We can also rely on Moran's I to compare the remaining spatial dependence in our models; in our case, the spatial error model outperforms the rest.
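The comparison below can be produced by applying moran.test to each model's residuals. A sketch, assuming fitted model objects lag_model and error_model (names are assumptions) and the weights list wts:

```r
# Remaining spatial autocorrelation in each spatial model's residuals;
# the list names match the labels in the output below.
lapply(
  list(lag_resid   = residuals(lag_model),
       error_resid = residuals(error_model)),
  function(x) moran.test(x, wts)
)
```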
see code
$lag_resid
Moran I test under randomisation
data: x
weights: wts
Moran I statistic standard deviate = 2.0342, p-value = 0.02097
alternative hypothesis: greater
sample estimates:
Moran I statistic Expectation Variance
0.0299263361 -0.0006418485 0.0002258128
$error_resid
Moran I test under randomisation
data: x
weights: wts
Moran I statistic standard deviate = -1.618, p-value = 0.9472
alternative hypothesis: greater
sample estimates:
Moran I statistic Expectation Variance
-0.0249563142 -0.0006418485 0.0002258232
3 CONCLUSION
In this post, we set out to understand the process of modelling spatial data. Fortunately, Walker (2023) provides a detailed book on this topic. The book and the code it contains are fantastic resources for a novice like myself, and it is freely available online. R. S. Bivand, Pebesma, and Gómez-Rubio (2013), Schabenberger and Gotway (2017), Lovelace, Nowosad, and Münchow (2019), Oyana (2021) and Pebesma and Bivand (2023), among others, offer R titles in spatial analysis. These titles offer a good blend of theory and practical application.