SOFTWARE (NON-INTERNET/MOBILE) | Application & Data Integration Software
safe.com

## About Safe Software

Safe Software makes FME, a data integration platform with support for spatial data. It connects systems, transforms data, and automates workflows. The company is based in Surrey, British Columbia.

## Safe Software Headquarters Location

9639 137A St. Suite 1200

Surrey, British Columbia, V3T 0M1

## Latest Safe Software News

A century of decoupling size and structure of urban spaces in the United States

Jan 27, 2021

Communications Earth & Environment

Abstract

Most cities in the United States of America are thought to have followed similar development trajectories to evolve into their present form. However, data on spatial development of cities are limited prior to 1970. Here we leverage a compilation of high-resolution spatial land use and building data to examine the evolving size and form (shape and structure) of US metropolitan areas since the early twentieth century. Our analysis of building patterns over 100 years reveals strong regularities in the development of the size and density of cities and their surroundings, regardless of timing or location of development. At the same time, we find that trajectories regarding shape and structure are harder to codify and more complex. We conclude that these discrepant developments of urban size- and form-related characteristics are driven, in part, by the long-term decoupling of these two sets of attributes over time.

Introduction

By 2050, more than two thirds of humans will live in urban areas 1, a striking projection considering that the estimated share of people living in cities was below 30% in 1950 and less than 16% in 1900 2. While we do not know what form cities will take in the future, we can be certain that their organization and structure will increasingly influence climate change, human health, economic development and social inequality 3, 4, 5, 6. Unfortunately, our efforts to envision the cities of the future are constrained by the absence of systematic insight on how today’s cities and their surroundings have changed in the past. Much of this shortfall is attributed to the absence of detailed information on urban spatial change prior to the 1970s. Historical analysis and theory provide conflicting perspectives on long-term urbanization.
While technological change is enabling the creation of increasingly complex cities 7 , debate continues as to whether differences across urban areas are decreasing or if regions are retaining distinctive and enduring properties 8 , 9 . This debate has long occupied historical scholars, who are increasingly skeptical of the notion that universal forces are homogenizing our progressively urban world 10 . The homogenization of cities is, therefore, a continuing discussion point for geographers, social theorists and historians. Urban spatial data are enabling investigation into urban development, including the examination of the bottom-up self-organization of city systems, how the characteristics of cities scale with growth, and how different urban attributes change to create related and potentially predictable patterns (“allometry”) 11 , 12 . These scaling and allometric relationships, in terms of urban size, density, and form 13 , 14 , 15 are highly consequential for a wide range of socio-economic and environmental outcomes 16 , 17 , 18 . Unfortunately, the paucity of long-term urban spatial data generally confined studies to relatively short time horizons or selected geographic contexts 19 , 20 , making it difficult to fully understand urban growth, or (changes in) urban form 21 , 22 , 23 , 24 and similarity 25 , 26 , 27 . As such, much of our knowledge of urban change rests on cross-sectional data and relatively short windows of observation that do not fully encompass the complete development trajectories of cities. Our objective is to shed new light on the question of how the spatial characteristics of cities and their surroundings have changed over the last century. In answering this question, our goal is to determine the consistency and variance of long-term urban spatial development with respect to city size, density, shape and structure. 
We do so by leveraging our recently published Historical Settlement Data Compilation for the US (HISDAC-US) 28 , 29 , 30 , 31 derived from the Zillow Transaction and Assessment dataset (ZTRAX) 32 to examine urban spatial development within contemporary metropolitan statistical areas (hereafter “MSAs”) in terms of size, density, shape, and structure related metrics 33 , 34 . ZTRAX is a compilation of land use, built, valuation and real-estate characteristics for over 200 million US cadastral parcels, and contains rarely examined building attributes, including the year built, indoor area, total housing units, and structural characteristics (e.g., building material, function). We contend that the gridded spatial data layers from HISDAC-US are currently the most compelling source of high-resolution data on the long-term spatial development of the United States. We use HISDAC-US to observe spatial land use changes within the boundaries of MSAs, as defined in 2010. As MSA boundaries delineate the proximate socio-economic influence of urban centers (see Methods section), they provide the most appropriate spatial units for understanding the emergence of the contemporary US urban system. We examine spatiotemporal land use changes for MSAs using our compilation of settlement data (Fig. 1 ) to derive time series of descriptive spatial variables for each MSA in the US at semi-decadal temporal resolution from 1910 to 2010. Using 10 metrics inspired by commonly employed landscape metrics 35 , 36 , 37 to characterize urban areas (Table  1 ; see Methods section for detailed explanations), we utilize descriptive visualizations, statistical methods and cluster analysis to identify common types of urban spatial development over the last century. These metrics are categorized into size-related and form-related (i.e., shape and structure) characteristics that can be used to analyze the interactions between these attributes 11 . Fig. 
1: Gridded surfaces from the historical settlement data compilation for the US (HISDAC-US) used in this study. Built-up intensity depicted for the greater Atlanta metropolitan statistical area (MSA) in a 1910, b 1960 and c 2010. First built-up year composite shown for d greater Philadelphia and e for greater Rochester (New York). All datasets are at a spatial resolution of 250 × 250 m. Yellow lines represent MSA boundaries obtained from the US Census Bureau 51 , used as the primary analytical unit for this study. Our analysis proceeds in four steps. First, we examine nationwide and MSA-specific development trends among the time series of our 10 urban-spatial metrics. After describing these development paths, we perform cluster analysis on temporal cross-sections of our data to analyze historical changes in the spatial properties of MSAs and their movement between different clusters over time. Third, to understand variations in urban development, we decompose these trends spatially for different regions. We conclude our analysis with an ordinal assessment of the relationships between size and form related urban-spatial metrics, in order to investigate whether our numerical analysis findings also hold in an ordinal scale. In this study, we will refer to the term “MSA” as the analytical unit used but will use both “MSAs” and “cities” interchangeably in discussing and interpreting results. From this analysis, we provide evidence of a weakening association between the size-related attributes of MSAs with their respective form-related characteristics since the early twentieth century. That is, we find that the correlation between urban size and urban form in our data has declined substantially from 1910 to 2010. This attenuation appears to be driven by the growing similarity of places with respect to their size attributes over time, but a greater endurance of their form-related attributes across the same period. 
One potential limitation of our analysis is that the underlying ZTRAX data capture the existing building stock of the United States. Thus, we do not observe buildings that have been torn down or urban footprints that have shrunk 38 , 39 . The selective nature of replacement at the building scale 40 , 41 could therefore bias our findings at the metropolitan scale. We find that our main results are highly robust, implying that survivorship bias is not likely to be driving our key results (see Methods section). Nonetheless, the proceeding analysis should be viewed as an examination of buildings that have survived from their initial built year to 2010. Results Nationwide development trends since 1910 We begin by analyzing the trends in the size (Fig. 2 a) and form metrics (Fig. 2 b) across MSAs from 1910 to 2010. With four of the five size-related variables increasing over our study period, the size variables are almost exclusively lower in magnitude in 1910 than in 2010 across all MSAs. Thus, we observe nationwide increases in urban size (BUAREA, NUMBUPROP) and density (BUDENS, NETBUI) over the last century. While this pattern is expected due to the nature of our data, it is consistent with findings from global meta-analyses of city size changes 6 . Fig. 2: Temporal trends of the analyzed urban-spatial metrics across all metropolitan statistical areas. Median (black) and interquartile range (grey) of the variables characterizing a size and density, and b shape and structure, over all analyzed metropolitan areas. Percentages depicted represent the change in median of the respective variable from 1910 to 2010. Also shown is the approximate start of US suburbanization in the late 1950s (vertical dashed lines in red). We calculated Spearman’s rank correlation coefficient between all size and form-related variables in 1910 and in 2010 (Fig. 7 a and b), respectively. These results, along with their corresponding Q-Q plots (Fig. 
7 c and d), reveal further detail about the relationship between size and form attributes. While we observe high levels of association between dispersion measures (i.e., SCAT, NUMPATCH) with most size-related variables in 1910, many of the correlations attenuate by 2010, thus providing additional evidence for the disentanglement of size and form over time. However, we do also observe several notable exceptions to this finding. First, we observe an increasingly positive rank correlation between size characteristics and MAXPATCHPROP, which is driven by heavily developed MSAs that tend to exhibit greater connectedness and spatial contiguity. Secondly, we find stationary (negative) rank correlations between size variables and circularity, most notably between BUAREA and CIRC. This result indicates that large places tended to be less compact (i.e., less circular) than small places in the early 20th century, and that this trend is still evident in 2010. Even though this analysis does not explicitly test allometric relationships, these results are suggestive of such relationships between size and form-related urban-spatial metrics (see also animations in Supplementary Movies 2 and 3); formally testing these relationships over time is an important area for future research. Discussion This study leverages information from HISDAC-US to examine urban spatial development in the United States since 1910. Irrespective of where MSAs are situated or when they started to grow, we find strong congruence in their size-related spatial development. That is, we find that more recently developing MSAs have followed similar paths to early-developing MSAs, particularly in terms of their size and density attributes. Even though urbanizing (i.e., not shrinking) MSAs follow more elaborate development trajectories in terms of structure, we also find evidence for strong grouping effects along these dimensions. 
This finding points to a set of increasingly complex urban forms, which have been emerging across the US. From a broad perspective, we find high degrees of continuity in the spatial development of urbanizing MSAs regarding size-related characteristics, which are quite robust to the timing and location of development. Although urbanizing MSAs do follow similar size-related trajectories, there are some very notable exceptions to these overarching trends. Firstly, MSAs appear to be increasingly stratified, but less extremely distributed based on their size. This pattern is evident from cross-sectional normality tests (i.e., increasing Shapiro-Wilk normality test statistics for size variables over time, Supplementary Fig. 3 ) and from our cluster analysis, which indicates that while differences in city size have narrowed across MSAs over time, they are sorted into an expanding number of clusters or “types.” Furthermore, the form-related attributes of MSAs are growing increasingly independent from their size-related attributes. Secondly, although we find that, irrespective of location or timing of development, most urban spatial attributes are shifting in similar directions, there are some notable deviations across US regions. Most notably, MSAs in the interior census divisions of the United States follow urban forms that are quite different from coastal MSAs. Whether or not this is a temporary transition or indicative of newer and longer-lasting regional differentiation remains to be seen. What do our findings tell us about the existence of systematic development trajectories across cities and their surroundings? Generally, our results suggest that urbanizing MSAs in the US appear to follow similar development trajectories irrespective of whether they developed recently or farther back in time with regards to size attributes. However, our multidimensional analyses reveal that these trajectories can be highly complex. 
While the source of this local heterogeneity is beyond the scope of this paper, the data and approaches put forward here provide a firm basis from which to examine these patterns, albeit subject to the issues associated with retrospective data. The more detailed aspects of our findings also highlight the potential value of applying spatial scientific analysis to questions in urban history. Notably, major urban spatial transitions, such as post-WWII suburbanization 43 , and the increasing formation of interconnected, “megapolitan” city systems 48 are evident throughout our results. Fig. 1 , for example, shows that from approximately the late 1950s, average housing size and total number of properties sharply increased while measures of spatial compactness and circularity began to quickly decrease. These patterns reflect the sprawl of metropolitan areas beyond their central cities and into the emerging suburbs of the time. The emergence of “megapolitan areas” in the US is, for example, reflected in the emergence of the MSA cluster that is characterized by high levels of contiguity (see e.g., Fig. 5 g). Historical developments such as these, which have been and will be the focus of ongoing research efforts 31 , 49 , undergird many of the broader patterns we observe here. While we do not attempt to grapple with these issues in this study, the approaches we articulate provide vast opportunities for enhanced understanding of the complex social, political and economic interactions that have brought about widespread historical spatial change. City boundaries naturally change over time. In our analysis, we rely on fixed, retrospective boundaries of metropolitan statistical areas as they existed in 2010. While they suffer from a certain degree of arbitrariness (cf. Fig. 1 ), they allow for analysis within temporally consistent spatial units. 
However, the time series of certain structure- or density-based measures would be different when using temporally adaptive boundaries, and thus, future work could examine how our picture of urban spatial change would differ if we instead used a changing urban footprint or definition. Moreover, future work should include the analysis of smaller cities and towns defined as Micropolitan Statistical Areas (i.e., cities of less than 50,000 inhabitants) and expansion of this work to a more complete range of large and small cities in the US, and should also focus on the quantification of survivorship bias in the underlying ZTRAX data at the building level, e.g., by integrating auxiliary historical data sources. Furthermore, forthcoming work should include additional spatial and temporal characteristics (e.g., differential measures such as expansion and densification of urban areas 31 ), measures of intra- or peri-urban land use and vegetation, urban road networks, and census data (e.g., population, migration and other socio-economic variables). These data integration approaches in combination with advanced machine learning methods could enable predictive modelling of urban development and population changes. The integration of population data and remote sensing data will also allow for a quantification of the bias introduced into such analyses by the lack of information on building teardowns and urban shrinkage. The shrinkage phenomenon affects certain regions in the US and is typically associated with population decline and land conversion from built-up to less developed land. Such analysis will contribute to a broader understanding of drivers of and interactions between human and environmental processes and urban dynamics. Future analysis can generate valuable knowledge as a foundation for complex simulative models predicting and projecting future development of urban areas, population, and the interactions within socio-environmental systems. 
Moreover, the long-term spatial data used herein will potentially allow for testing urban scaling laws over the long-term, and assessing the impact of cross-sectional versus longitudinal analytical concepts on the outcomes of urban change and scaling analyses. This work provides data and blueprints for broad, long-term investigation into spatial differentiation across urban environments. This study establishes an analytical foundation to measure well-known processes of urbanization, such as sprawl, infilling and densification, over long periods of time and at scales meaningful for analytical and interpretational purposes, both long-standing limitations in urban studies. Moreover, our analytical framework and the underlying data provide valuable insight for city and regional planners, allowing for the identification and forecasting of fine and coarse development trajectories. Knowledge of these trajectories may be particularly valuable in tackling issues related to environmental change, transportation and inequality 49 . We demonstrate how innovative data derivatives can be used to quantitatively assess different urban characteristics and their changes at fine spatial and temporal granularity. Our analysis demonstrates the value of applying data-driven analytical methods to the exploration of large spatial-temporal settlement data in urban geography, which can enhance our understanding of the historical settlement of cities in the US and elsewhere. Methods US metropolitan statistical areas The US Office of Management and Budget defines a Metropolitan Statistical Area (MSA) as a larger commuting area containing at least one urban cluster or urbanized area with a population of at least 50,000. It is comprised of the central county or counties containing the urban core, including adjacent counties characterized by a high degree of social and economic interaction with the central county or counties, which is measured through commuting 50 . 
Hence, MSA boundaries are spatial units containing urban cores, peri-urban areas and the urban fringe and thus constitute a suitable source of spatial zoning data for the analyses presented herein. In 2010, there were 363 MSAs in the conterminous US 51 which were used in this study, and kept temporally consistent over the study period from 1910 to 2010. Gridded spatial layers Spatial layers consist of time series of gridded data layers at a spatial resolution of 250 m. These layers have been generated at semi-decadal temporal resolution for the study period 1910 to 2010, and include: HOUS: Count of housing units per grid cell and year 52 . BUPROP: Count of built-up properties (i.e., unique locations of housing units) per grid cell and year 53 . Built-up intensity (BUI): Indoor area of all buildings per grid cell and year (unit: m2) 54 . First built-up year (FBUY): Temporal composite containing the built year of the oldest structure per grid cell (unit: year) 55 . These layers are available in the Historical Settlement Data Compilation for the US (HISDAC-US 28 , 29 , 30 , see Fig. 1 and Supplementary Fig. 9 ). Time series of descriptive variables per MSA The urban-spatial metrics used herein allow for a multi-faceted spatial perspective on urban systems, given the available data. We reduced an initial set of 50+ urban-spatial metrics to the 10 metrics used herein, based on a visual-analytical assessment involving plausibility, cross-correlation and variability analysis. Each of these variables is derived for each MSA and each 5-year interval for the study period 1910 - 2010, resulting in a multivariate time series characterizing the evolution of each MSA from different perspectives. We grouped these time series into two categories: Size-related variables: Characterizing the horizontal and vertical dimensions of urban systems, as well as the average size of the units comprising an urban system. 
In this category, we included density-based measures since they describe the relationship between horizontal and vertical extension of an urban system. Form-related variables: Characterizing the shape (i.e., geometric properties) of the urban footprint and its morphological structure (i.e., properties of the spatial entities constituting the urban footprint). Similar categorizations are commonly used and suggested in urban studies 11 , 56 and allow for separate analysis, as well as studying interactions between these categories. Below, we describe the urban-spatial metrics used, adjusted to the nature and the volume of our data. Size related variables Based on zonal statistics of the spatial layer series within MSA boundaries (see processing workflow in Supplementary Fig. 9 ), we derived the following MSA-level time series characterizing the horizontal and vertical dimensions of urban systems 57 and their interactions, guided by the information contained in our data: BUAREA: The planar area of the grid cells occupied by at least one building in a given year, as an absolute measure of horizontal extension of urban areas. NETBUI: The net built-up intensity, calculated as the total indoor building area in an MSA polygon in a given year. This measure allows for quantifying the total built-up volume 57 . BUDENS: The net built-up intensity per built-up ground area in an MSA in a given year. This measure relates to the commonly used floor-area-ratio (i.e., the quotient of building indoor area and area built upon, i.e., plot or parcel area 58 ). However, due to the lack of footprint area or multi-temporal parcel area in our data, we used the area of the built-up grid cells as a proxy for plot area. By using the built-up area as denominator, rather than the MSA area, we overcome the issue of arbitrariness of MSA boundaries. NUMBUPROP: The number of built-up properties per MSA and year, an approximate, absolute measure of the size of the building stock in an MSA. 
AVGHUSIZE: The total BUI divided by the number of housing units within an MSA in a given year, as a measure of granularity of the units constituting the built environment. Form-related variables Through segmentation and vectorization of the BUA layer series 59 , which was derived from the FBUY layer (see Supplementary Fig. 9 ), we obtained spatial vector objects of contiguous built-up areas within each MSA per year and derived the following variables: NUMPATCH: The number of built-up patches within an MSA per year, with a patch consisting of at least two grid cells, as a measure of fragmentation 60 . SCAT: The number of spatially isolated built-up grid cells (i.e., none of the 8 adjacent grid cells in a Moore neighborhood being built-up) within an MSA per year, as a measure of spatial scatteredness of the built-up area. This metric was adopted from the "scatter development" 61 metric which represents the count of grid cells with less than 30% built-up area in their neighborhood. Since our BUI data measures total indoor area without specifying the area of building footprints, we chose to set this threshold to 0%. MAXPATCHPROP: The proportion of the largest built-up patch from the total built-up area within an MSA per year, as a percent-based measure of built-up area contiguity and dominance. This measure is related to the "largest patch index" (LPI) 60 . However, while LPI measures the proportion of the largest patch area with respect to the landscape area (i.e., the MSA area), we decided to use the total built-up area as denominator, in order to be independent from the (arbitrary) MSA area. CIRC: The circularity of the ten largest built-up patches per MSA and year. There are several approaches for measuring spatial compactness such as circularity in the geospatial sciences 62 ; for example, the commonly used CIRCLE metric, relying on comparing the area of a shape to the area of its circumcircle 63 . 
Herein, we used the isoperimetric quotient, a commonly used circularity measure 64, 65 defined as the ratio of a polygon’s area A to the area of a circle with the same perimeter p as the polygon, in percent:

$$\mathrm{CIRC} = 100 \cdot \frac{4\pi A}{p^{2}} \qquad (1)$$

which we employed as an alternative measure of circularity, assumed to be computationally inexpensive to process the large amount of built-up patches in this study (>250,000 patches in all MSAs in 2010). Thus, CIRC characterizes the compactness of the largest patches constituting an urban system and is used as a measure of spatial complexity.

CLUST: The clusteredness of built-up grid cells per MSA and year. This measure is the average nearest neighbor index (ANNI) based on the centroids of built-up grid cells. ANNI compares the observed average nearest neighbor distance $\overline{d}_{\mathrm{NN,O}}$ of a set of point locations to the expected average nearest neighbor distance $\overline{d}_{\mathrm{NN,E}}$ in a random point distribution of the same number of points and spatial extent 66 and has been applied for long-term urban change studies and other spatial analyses 67, 68. The ANNI is calculated as:

$$\mathrm{ANNI} = \frac{\overline{d}_{\mathrm{NN,O}}}{\overline{d}_{\mathrm{NN,E}}} \qquad (2)$$

with small values corresponding to highly clustered point patterns. Thus, we calculated our measure of clusteredness CLUST by subtracting the ANNI from the global maximum ANNI across all MSAs and points in time, to obtain large values for highly clustered point patterns:

$$\mathrm{CLUST} = \max(\mathrm{ANNI}) - \mathrm{ANNI} \qquad (3)$$

Data uncertainty handling

External validation of HISDAC-US source data

The quality of built-up areas extracted from the HISDAC-US spatial layer series may suffer from missing housing records or from spatial offsets in the geolocations contained in the ZTRAX database, causing grid cells falsely labelled as "not built-up".
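As an illustration of the form-related metrics defined above, the following is a minimal sketch, not the authors' implementation. It computes CIRC (Eq. 1) and CLUST (Eq. 3) directly from their formulas, and derives NUMPATCH, SCAT and MAXPATCHPROP from a small binary built-up grid. The function names, the nested-list grid layout, and the use of 8-cell (Moore) connectivity for delineating patches are assumptions made for this example; the text states the Moore neighborhood only for SCAT.

```python
import math
from collections import deque


def circ(area, perimeter):
    """Isoperimetric quotient in percent (Eq. 1): 100 * 4*pi*A / p^2."""
    return 100.0 * 4.0 * math.pi * area / perimeter ** 2


def clust(anni_values):
    """Eq. 3: subtract each ANNI from the maximum ANNI in the given collection."""
    top = max(anni_values)
    return [top - a for a in anni_values]


def patch_metrics(grid):
    """Return (NUMPATCH, SCAT, MAXPATCHPROP) for a binary built-up grid.

    Cells are grouped into patches by 8-cell (Moore) connectivity, which is
    an assumption of this sketch. A component of size 1 is an isolated cell.
    """
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    sizes = []
    for r in range(rows):
        for c in range(cols):
            if not grid[r][c] or seen[r][c]:
                continue
            # Breadth-first search over one connected component.
            seen[r][c] = True
            queue = deque([(r, c)])
            size = 0
            while queue:
                y, x = queue.popleft()
                size += 1
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (dy or dx) and 0 <= ny < rows and 0 <= nx < cols:
                            if grid[ny][nx] and not seen[ny][nx]:
                                seen[ny][nx] = True
                                queue.append((ny, nx))
            sizes.append(size)
    numpatch = sum(1 for s in sizes if s >= 2)  # patches need at least two cells
    scat = sum(1 for s in sizes if s == 1)      # cells with no built-up neighbor
    total = sum(sizes)
    maxpatchprop = 100.0 * max(sizes) / total if total else 0.0
    return numpatch, scat, maxpatchprop
```

For a circle, `circ` returns 100, and less compact shapes score lower (a square gives about 78.5). Because MAXPATCHPROP divides by the total built-up cell count rather than the grid size, the result does not depend on how much empty area surrounds the built-up cells, which mirrors the rationale given in the text for avoiding the MSA area as denominator.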
To quantify these positional uncertainties, we conducted a multi-temporal accuracy assessment against a highly accurate reference database created from cadastral parcel records and building footprint data in 31 US counties, yielding F-measures of 0.9 or higher for each evaluated year in the time period 1910 - 2010 29 . Moreover, we conducted a US-wide accuracy assessment of built-up areas in 2016 derived from the HISDAC-US data against a remote-sensing derived building footprint dataset, yielding similarly high accuracy values in urban areas 30 . Furthermore, we cross-compared temporal trajectories of housing unit counts (HOUS) against census-based housing statistics 30 . Incompleteness of temporal information and locational uncertainty As shown in 29 the HISDAC-US inherits issues of data incompleteness from the original ZTRAX data. For example, 82 counties (i.e., approximately 2.5% of the land area) in the conterminous US do not contain any settlement-related ZTRAX data records. Approximately 20% of the built-up areas in the conterminous US are lacking temporal information (i.e., the year when a structure has been built) necessary to map built-up areas at a given point in time. However, the large majority of these areas is located in rural regions 30 . In this study, we excluded MSAs that have an area proportion of more than 5% of a county of known data missingness. Furthermore, we excluded MSAs that have a temporal missingness of more than 50%. Besides data missingness, there are issues due to generalization of the geospatial locations when dummy coordinates are used (i.e., the use of identical coordinates for numerous settlement locations within an MSA). In order to account for this, we computed the ratio between the NUMBUPROP and HOUS layers in 2010. 
We excluded the MSAs exceeding the 95th percentile of this ratio, i.e., MSAs containing extremely high proportions of housing units assigned to an identical geolocation, indicating the presence of dummy coordinates. In total, 31 of 363 MSAs were excluded from this study (see map of MSA-level data completeness, Supplementary Fig. 10).

Time series correction

For the remaining MSAs, we developed a correction procedure applied to the time series that works as follows: We created a spatial layer indicating the indoor area, the number, and the presence of properties per grid cell without built year information 29 and added those grid cell values to the contemporary layers, BUI2015, NUMBUPROP2015, and BUA2015, respectively. The key assumption is that buildings without built year information exist in 2015. Based on these corrected layers, correction terms are calculated and the variables are recomputed for each MSA in 2015. We use the resulting correction terms, proportionally, to adjust the variable values in the previous years, while preserving the relative change between two subsequent points in time (see 31 for details). We evaluated the accuracy of this correction procedure in an experiment in which we artificially removed a random sample of up to 50% of the temporal information from MSAs of originally high built year attribute completeness, and compared the original and the corrected time series per variable and across time. Boxplots in Supplementary Fig. 11 show the relative error distributions over time, based on N = 100 realizations of the described procedure. We found high accuracies in our corrected time series for the MSAs under test for the time period 1910–2010.

Temporal mismatch and survivorship bias

The ZTRAX database contains neither information on teardowns or building replacements nor sufficient information on remodeling activities.
Thus, attributes of built-up structures may refer to a contemporarily existing building and may not reflect the characteristics in the year when a location became first built-up (i.e., the built year information on record). This may result in a selection bias, or survivorship bias, in a sense that information on first settlements, typically small buildings, has been replaced by the characteristics of larger buildings that replaced the original buildings. This issue is also seen as a selection bias, introduced by the non-randomness of building replacements, e.g., the survival of a building depends heavily on its characteristics (size, material), and the time period in which it was built 40 , 41 . This effect likely causes a biased view on the building stock in early years, and cannot be easily quantified nor mitigated, thus remaining the largest source of uncertainty in this study. The lack of information on teardowns and building replacements likely causes an increasing underestimation of the building stock towards early points in time which can, to some degree, be measured by comparing our results to population data. We carried out such a comparative analysis and found that the overall trends reported herein are largely unaffected by this latter issue. To do so, we obtained decadal census population counts per MSA since 1940, and calculated the average population-building ratio (PBR) per MSA and year. We compared these ratios to census-based average population-household ratios (PHR) and observed a similar decreasing trend for both PBR and PHR over time (Supplementary Fig. 2 a), although PBR has much larger values than PHR in earlier years. 
While these dynamics may be a superposed effect of a variety of processes, and the differences may partially be the result of different reference units (i.e., buildings versus households) and geographic coverages (MSAs versus all US including rural areas), we partially attribute the divergence observed in earlier years to underreporting in the ZTRAX data. We used the maximum PBR (PBRmax) per MSA, calculated over all years, as an MSA-level measure of underreporting. While the PBRmax is not a function of absolute MSA size, measured by NETBUI2010, as Supplementary Fig. 2b suggests, we applied a range of thresholds to PBRmax and assessed the sensitivity of the overall trends reported in Fig. 2. Results suggest that these overall trends are mostly unaffected by early-year underreporting in ZTRAX. That is, even MSAs with PBRmax below the 5th percentile, and thus least affected by underreporting, exhibit trends similar to the baseline trend using all MSAs (Supplementary Fig. 2c).

Sensitivity to spatial resolution. The urban spatial metrics used herein may be sensitive to the spatial resolution of the underlying gridded data layers. To assess whether the observed trends hold across different spatial resolutions, we randomly selected 25 of the 300+ MSAs and computed the metrics for a range of cell sizes (i.e., 30 m, 100 m, 400 m, 500 m) besides the cell size of 250 m used herein, and visualized the resulting trends for each cell size (Supplementary Fig. 1). Despite different magnitudes for some metrics, we observe relatively similar trends over time, and find that the chosen resolution of 250 m represents an acceptable tradeoff between spatial generalization (i.e., capturing most components of the urban environment, such as impervious surfaces) and uncertainty, while preserving the characteristic shapes of urban extents.
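The underreporting check described above (PBR per MSA and year, PBRmax per MSA, and a percentile threshold on PBRmax) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the array layout (rows are MSAs, columns are census years), the generic `metric` array, and all numbers are invented for the example.

```python
import numpy as np


def population_building_ratio(population, buildings):
    """Return PBR per MSA (rows) and census year (columns)."""
    population = np.asarray(population, dtype=float)
    buildings = np.asarray(buildings, dtype=float)
    return population / buildings


def least_affected(pbr, percentile):
    """Flag MSAs whose maximum PBR over all years is at or below a percentile cutoff."""
    pbr_max = pbr.max(axis=1)  # PBRmax per MSA
    cutoff = np.percentile(pbr_max, percentile)
    return pbr_max <= cutoff


# Invented toy counts for three hypothetical MSAs and three census years.
population = [[100, 200, 300], [1000, 1100, 1200], [50, 80, 90]]
buildings = [[10, 40, 100], [500, 550, 600], [5, 20, 45]]
# Invented values of some urban metric for the same MSAs and years.
metric = np.array([[1.0, 2.0, 4.0], [2.0, 3.0, 5.0], [0.5, 1.5, 3.5]])

pbr = population_building_ratio(population, buildings)
mask = least_affected(pbr, percentile=34)

# Compare the median metric trend of all MSAs with that of the least-affected subset.
baseline_trend = np.median(metric, axis=0)
subset_trend = np.median(metric[mask], axis=0)
print(baseline_trend, subset_trend)
```

Repeating the comparison for a range of `percentile` values mirrors the sensitivity analysis in spirit; the toy percentile of 34 is chosen only so that the three-row example yields a non-empty subset.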
Analytical methods

Cross-temporal quantitative trajectory analysis. Each MSA at any given year can be represented as a point in a 5-dimensional attribute space defined by the variables derived for each of the two categories. In order to analyze similarity between the MSAs within each category, we used t-distributed stochastic neighbor embedding (t-SNE) 69 to reduce the dimensionality of these attribute spaces. t-SNE maps higher-dimensional data into low-dimensional spaces where the Euclidean distance among nearby data points in the target space represents their similarity in the original space. We performed t-SNE for each category using the described variables extracted for each MSA and each year to visually assess the similarity between MSAs at different points in time in a two-dimensional space. See Fig. 3 and Supplementary Figs. 4, 5.

Variable selection and dimensionality reduction. In order to reduce the complexity of the subsequent analyses and enable legible and comprehensive interpretation of the results, we decreased the number of variables to three per category using principal component analysis applied to the full data (i.e., all MSAs and all years). We retained those three variables per category that showed the highest factor loadings of the first principal component (PC1) (Supplementary Table 1). These variables represent the highest contribution to the variability in the respective data distributions.

Multi-temporal thematic cluster analysis. Whereas the described t-SNE based trajectory visualization method reveals spatio-temporal MSA-level patterns of similarity and variability, it does not inform about the number, homogeneity, characteristics and size of thematic clusters and their variation over time. Thus, we developed a framework for cluster analysis of the temporal cross-sections of MSAs.
For each point in time and each of the two variable categories, we identified clusters of MSAs using the BIRCH clustering algorithm (balanced iterative reducing and clustering using hierarchies), a tree-based clustering algorithm suitable for non-uniformly distributed data 70. BIRCH allowed us to determine the number of clusters based on the data (rather than specifying a fixed number of clusters) and the size of each cluster for each temporal cross-section. We ran BIRCH without specifying a number of clusters, thus using the subclusters returned by BIRCH before the final, global clustering step when the user-specified threshold value is reached (i.e., the so-called tree-BIRCH method 71). We adjusted this threshold empirically for each category of variables. Since the same threshold was used for all temporal cross-sections of the data, the resulting clustering sequences are independent of the threshold and intercomparable. The thresholds used as stopping criteria were 0.15 (size-related variables) and 0.125 (form-related variables). A branching factor of 50 was used for all cluster analyses. The resulting temporal sequence of clusters allowed us to assess how MSAs maintain or switch memberships between clusters in subsequent years. Finally, we registered the MSA representing the medoid of each cluster in the three-dimensional space spanned by the used variables in a given year, in order to derive MSAs typical for a cluster at each point in time (see Fig. 5). The variable magnitudes of these medoid MSAs were RGB color-coded for the visualizations of the clusters (see Fig. 4). Moreover, we visualized these clusters in geographic space for selected years in order to reveal spatial patterns of thematic clusters (see Fig. 5).
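The clustering of one temporal cross-section described above can be sketched as follows. This is a hedged illustration only: the use of scikit-learn's `Birch` is an assumption (the text does not say which implementation was used), the variables are assumed to be scaled to a common range such as [0, 1], and the helper names and toy data are invented. Passing `n_clusters=None` skips the final global clustering step, so the subclusters are returned as they are.

```python
import numpy as np
from sklearn.cluster import Birch


def cluster_cross_section(X, threshold, branching_factor=50):
    """Cluster one temporal cross-section (rows: MSAs, columns: three variables)."""
    # n_clusters=None: no final global clustering, subclusters are used directly.
    model = Birch(threshold=threshold, branching_factor=branching_factor, n_clusters=None)
    return model.fit_predict(X)


def medoid_indices(X, labels):
    """Return, per cluster, the row index of the member closest to all other members."""
    medoids = {}
    for label in np.unique(labels):
        members = np.flatnonzero(labels == label)
        diffs = X[members][:, None, :] - X[members][None, :, :]
        dist_sums = np.linalg.norm(diffs, axis=2).sum(axis=1)
        medoids[int(label)] = int(members[np.argmin(dist_sums)])
    return medoids


# Invented toy cross-section: two tight groups of hypothetical MSAs in a 3-D variable space.
rng = np.random.default_rng(0)
group_a = 0.1 + rng.normal(scale=0.01, size=(10, 3))
group_b = 0.9 + rng.normal(scale=0.01, size=(10, 3))
X = np.vstack([group_a, group_b])

labels = cluster_cross_section(X, threshold=0.15)  # threshold quoted for size-related variables
medoids = medoid_indices(X, labels)
print(len(set(labels.tolist())), medoids)
```

Running the same function with a fixed threshold on every year's cross-section yields a sequence of label vectors whose changes can then be tracked per MSA.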
We assessed clustering validity by generating time series of within-cluster variability and between-cluster variation 72 and, likewise, used these metrics to evaluate distributions of size variables within form-based clusters, and vice versa (Fig. 4d,e). To further assess these interactions, we conducted Dunn's test of pairwise comparisons on size variable distributions within form-based clusters, and vice versa, for each point in time (Supplementary Fig. 7) and assessed within-cluster dispersion (Supplementary Fig. 6, Supplementary Table 3).

Regionalized time series generation. We generated median time series per variable and geographic region in the US, adapted from the US census divisions 73. Moreover, we used Kruskal-Wallis (KW) tests to analyze whether regional median time series exhibit statistically significant differences, and how statistical significance behaves over time. We used non-parametric KW tests since non-normality of most variables can be assumed based on Shapiro-Wilk tests applied to temporal cross-sections of the data (Supplementary Fig. 3). While it is common practice to use the p-value returned by the KW test as a measure of statistical significance, we also analyzed time series of the H-statistic returned by each KW test (Fig. 6d) as a measure of difference between group (i.e., region) distributions. Since the sample sizes per variable and year are identical (i.e., the number of MSAs included in the study), the H-statistic is based on the same degrees of freedom and is thus comparable across variables and over time. Moreover, we conducted post-hoc Dunn's tests of pairwise comparisons per year to analyze between which pairs of regions the identified statistically significant differences occurred (see Fig. 6e,f).
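The per-year Kruskal-Wallis comparison across regions described above can be sketched as follows. This is a minimal sketch assuming SciPy's `scipy.stats.kruskal`; the function name, the nested-dictionary layout, the region labels, and all values are invented for illustration. The post-hoc Dunn's test is not part of SciPy and is left out here.

```python
from scipy.stats import kruskal


def kw_time_series(values_by_year):
    """Run one Kruskal-Wallis test per year; return {year: (H, p)}.

    values_by_year maps year -> {region: list of MSA-level values}.
    """
    results = {}
    for year, groups in sorted(values_by_year.items()):
        h_stat, p_value = kruskal(*groups.values())
        results[year] = (float(h_stat), float(p_value))
    return results


# Invented toy values for one variable, three hypothetical regions, two years.
values_by_year = {
    1950: {  # regions overlap strongly: small H expected
        "A": [1, 4, 7, 10, 13],
        "B": [2, 5, 8, 11, 14],
        "C": [3, 6, 9, 12, 15],
    },
    2000: {  # regions fully separated: large H expected
        "A": [1, 2, 3, 4, 5],
        "B": [11, 12, 13, 14, 15],
        "C": [21, 22, 23, 24, 25],
    },
}

results = kw_time_series(values_by_year)
for year, (h_stat, p_value) in results.items():
    print(year, round(h_stat, 3), round(p_value, 4))
```

Because every year uses the same number of groups, the H values in `results` share the same degrees of freedom and can be read as a time series of between-region difference, alongside the p-values.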
Limitations of this work. Although the results are promising, several limitations of the presented work need to be mentioned. First, the ZTRAX database and the derived HISDAC-US data do not take into account teardowns and building renovations. Thus, built year information may refer to the first existing building, whereas other attributes, such as building size, may refer to the structure existent at present, allowing only for analyzing the "surviving" building stock. Moreover, this limits our analysis to the measurement and quantification of urban growth, and does not take into account the shrinkage of built-up areas, with Detroit as a well-known example. This limitation may slightly distort variables that take into account age and built-up intensity in an integrated manner. However, as Supplementary Fig. 2 suggests, our main findings are largely invariant to these effects. Moreover, buildings on public lands may suffer from lower levels of coverage in the ZTRAX database. Second, some MSAs were excluded in this study due to data incompleteness (Supplementary Fig. 10). In the future, more sophisticated data correction methods (e.g., based on ancillary data and machine-learning based predictive models) will be employed to fill these data gaps more reliably, allowing for a complete analysis of urban evolution in the US using the proposed methods.

Data availability

The HISDAC-US geospatial data layers used for this study have been made publicly available under https://dataverse.harvard.edu/dataverse/hisdacus. Moreover, the urban spatial metrics per MSA and year are available at https://doi.org/10.6084/m9.figshare.13303091 74.

Code availability

References

1. United Nations, Department of Economic and Social Affairs, Population Division. World Urbanization Prospects: The 2018 Revision, Methodology. Tech. Rep. ESA/P/WP.252, United Nations, New York (2018).
2. Klein Goldewijk, K., Beusen, A. & Janssen, P. Long-term dynamic modeling of global population and built-up area in a spatially explicit way: HYDE 3.1. The Holocene 20, 565–573 (2010).
3. Alonso, W. The historic and the structural theories of urban form: their implications for urban renewal. Land Econ. 40, 227–231 (1964).
4. Ewing, R. & Rong, F. The impact of urban form on US residential energy use. Hous. Policy Debate 19, 1–30 (2008).
6. Seto, K. C., Fragkias, M., Güneralp, B. & Reilly, M. K. A meta-analysis of global urban land expansion. PLoS ONE 6, e23777 (2011).
9. Sampson, R. J. Great American city: Chicago and the enduring neighborhood effect (University of Chicago Press, 2012).
10. Geyer, M. & Bright, C. World history in a global age. Am. Hist. Rev. 100, 1034–1060 (1995).
13. Lee, Y. An allometric analysis of the US urban system: 1960–80. Environ. Planning A 21, 463–476 (1989).
15. Longley, P. A., Batty, M. & Shepherd, J. The size, shape and dimension of urban settlements. Trans. Inst. Br. Geogr. 75–94 (1991).
16. Bettencourt, L. M. A., Lobo, J., Helbing, D., Kühnert, C. & West, G. B. Growth, innovation, scaling, and the pace of life in cities. Proc. Natl. Acad. Sci. USA 104, 7301–7306 (2007).
17. Khiali-Miab, A., van Strien, M. J., Axhausen, K. W. & Grêt-Regamey, A. Combining urban scaling and polycentricity to explain socio-economic status of urban regions. PLoS ONE 14, e0218022 (2019).
18. Groffman, P. M. et al. Ecological homogenization of urban USA. Front. Ecol. Environ. 12, 74–81 (2014).
19. Barrington-Leigh, C. & Millard-Ball, A. A century of sprawl in the United States. Proc. Natl. Acad. Sci. USA 112, 8244–8249 (2015).
20. Abarca-Alvarez, F. J., Campos-Sánchez, F. S. & Osuna-Pérez, F. Urban shape and built density metrics through the analysis of European urban fabrics using artificial intelligence. Sustainability 11, 6622 (2019).
21. Huang, J., Lu, X. X. & Sellers, J. M. A global comparative analysis of urban form: applying spatial metrics and remote sensing. Landsc. Urban Plan. 82, 184–197 (2007).
22. Schneider, A. & Woodcock, C. E. Compact, dispersed, fragmented, extensive? A comparison of urban growth in twenty-five global cities using remotely sensed data, pattern metrics and census information. Urban Stud. 45, 659–692 (2008).
Schwarz, N. Urban form revisited-Selecting indicators for characterising European cities. Landsc. Urban Plan. 96, 29–47 (2010).

Acknowledgements

We gratefully acknowledge access to the Zillow Transaction and Assessment Dataset (ZTRAX) through a data use agreement between the University of Colorado Boulder (UCB) and Zillow Group, Inc. Funding for this work was provided by Earth Lab through UCB’s Grand Challenge Initiative, the Cooperative Institute for Research in Environmental Sciences (CIRES) at UCB, the Innovative Seed Grant program at UCB, and NSF’s Humans, Disasters, and the Built Environment program (award no. 1924670 to CU Boulder). Moreover, we would like to thank Amy Frazier, Maxwell Joseph, and Keith Burghardt for their advice. Lastly, support by Safe Software Inc. for providing Feature Manipulation Engine (FME) licenses used for ZTRAX data processing is highly appreciated. Publication of this article was funded by the University of Colorado Boulder Libraries Open Access Fund.

Author information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material.
If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.


