© 2006 American Public Health Association DOI: 10.2105/AJPH.2005.071373
Paul Ong is with the School of Public Affairs and the Ralph and Goldy Lewis Center for Regional Policy Studies, University of California, Los Angeles. Matthew Graham is with Abt Associates, Cambridge, Mass. Douglas Houston is a doctoral student in urban planning at the University of California, Los Angeles. Correspondence: Requests for reprints should be sent to Paul Ong, PhD, UCLA School of Public Affairs, 405 Hilgard Ave, Los Angeles, CA 90095 (e-mail: pmong{at}ucla.edu).
Geographic information systems have proven instrumental in assessing environmental impacts on individual and community health, but numerous methodological challenges are associated with analyses of highly localized phenomena in which spatially misaligned data are used. In a case study based on child care facility and traffic data for the Los Angeles metropolitan area, we assessed the extent of facility misclassification with spatially unreconciled data from 3 different governmental agencies in an attempt to identify child care centers in which young children are at risk from high concentrations of toxic vehicle-exhaust pollutants. Relative to geographically corrected data, unreconciled information produced a modest bias in terms of aggregated number of facilities at risk and a substantial number of false positives and negatives.
GEOGRAPHIC INFORMATION systems (GIS) have proven instrumental in assessing environmental impacts on individual and community health.1–13 Recent studies have begun to systematically address technological limitations associated with GIS by enhancing accuracy of positional, attribute, and temporal data; by tracking demographics and disease as geographic boundaries change over time; by identifying the best household and area measures of socioeconomic status; and by determining appropriate scales for studying links between environmental exposures and health outcomes.14–25 Improved data and advances in techniques have enabled epidemiological and atmospheric researchers to apply GIS to highly localized problems, but such analyses present numerous methodological challenges, especially when data of different pedigrees are not collocated at small geographic scales. We assessed the impact of using geographically unreconciled traffic volume data along with census-based street data in a case study of child care centers whose locations near major roadways could put young children at risk from high concentrations of toxic vehicle exhaust pollutants. Recent epidemiological evidence indicates a heightened prevalence of respiratory morbidity and mortality among people living near high-traffic roadways, and childhood cancer, brain cancer, leukemia, and preterm and low-weight births have been positively associated with traffic density among those living near such roadways.26–31 Although other environmental risk factors may be present in high-traffic areas, air pollution studies point to the significance of high concentrations of vehicle-generated pollutants such as carbon monoxide and ultrafine particles.
Typically, pollutants decline exponentially to near background levels within as little as 150 m of major roadways, with the greatest decrease occurring within 50 m.32–34 Because dispersed monitoring stations are insufficient to determine pollutant concentrations at nonadjacent locations, and given the expense of directly measuring pollutants at multiple sites, researchers conducting epidemiological and distributional studies have used traffic volume line data and census-based line data to approximate exposure to vehicle-related pollutants.30,35–38 However, this method can result in exposure misclassifications if these data sets are not precisely "aligned" with each other in GIS analyses. (Such discrepancies are not uncommon in health-related research, especially when data from different sources are used. Detailed statistics on misalignments in this study are described later.) Such geographic misalignments can result from the underlying data source, data cleaning processes, or the original intended scale of the data. We examined the effects of reassigning attribute data from 1 geographic data source to census-based line data on estimated exposure levels of facilities geocoded via census-based line data. A more general issue not broached in this article is the question of "georeferencing," or determination of spatial accuracy relative to the earth. Previous studies have addressed misalignment problems associated with geographic data sets in different ways.
One approach is to increase buffer areas beyond the ideal criterion distance to avoid false negatives, but this method can produce false positives.37,38 Wilhelm and Ritz addressed such misalignments by transferring traffic count values from the original traffic line geography to census-based line segments, a method similar to that described here, but only for a select set of neighborhoods.30 Green et al.35 and Houston et al.37 did not correct misalignments but assumed that discrepancies between traffic volume and census-based geographies are randomly distributed and do not produce spatial biases.35 Although the value of spatially aligned data was recognized in these studies, none of them included systematic comparisons of results from reconciled and unreconciled data sets. We evaluated the impact of reassigning traffic counts to a census-based geography for Los Angeles County, which is home to 9.5 million people and covers approximately 12 300 km². Our evaluation took the form of a case study designed to identify licensed child care facilities close to major roadways with high traffic volumes. Assessments were made at both the policy level ("What is the prevalence of the problem?") and the programmatic level ("Which facilities are affected?"). Results suggested that use of reconciled data provided valuable methodological enhancements in terms of identification of "at-risk" centers.
Data
Traffic volume data and associated street geography data were obtained from the California Department of Transportation (CalTrans). We gathered child care facility information from California's Department of Social Services, and we geocoded addresses to the Topologically Integrated Geographic Encoding and Referencing (TIGER) street file, a standard geographical reference widely used by public health researchers and social scientists. (We used CalTrans traffic volume data from various years depending on when traffic for each road segment was recorded. Child care facility and TIGER data were from 2000.) We transformed geographic data into a common geographic projection, Universal Transverse Mercator, so that we could construct geographic overlays and make consistent distance calculations. Two collective data sets were assembled: one overlaying the geocoded child care facilities on the original traffic data "as is" (without reconciling spatial misalignments) and one overlaying the facility points onto traffic data assigned to the common TIGER street file, thus eliminating spatial misalignments.
CalTrans Data
TIGER Data
Child Care Facility Data
Data Reconciliation
We created a "reconciled data set" by reassigning the traffic volume data from the CalTrans network to corresponding TIGER street segments. This allowed both the geocoded child care facilities and the traffic data to be spatially referenced to a common street layer. Reconciling these 2 layers required multiple steps. The first, automated step matched streets on the basis of proximity and associated name (or route number). Because this step was automated, secondary errors resulted from (1) miscoded street names, either in TIGER or in CalTrans; (2) large displacements between matching streets; and (3) similarly named segments lying too close to each other. Two more rounds of visual comparisons were performed to manually correct these mismatches. Corrections consisted of reviewing the 2 street layers in ArcMap 8.3 (ESRI, Redlands, Calif), identifying TIGER streets that had not yet been matched or identifying secondary problems, and manually correcting the linking table between CalTrans and TIGER. The time required for this processing was approximately 130 person-hours. A similar matching process has been described at greater length by Wu et al.25
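The automated first pass described above can be sketched roughly as follows. This is a minimal illustration, not the authors' procedure: the function names, the midpoint-to-midpoint distance measure, the crude name normalization, and the 100-m tolerance are all assumptions introduced here. Segments that find no same-named partner within the tolerance are returned for manual review, mirroring the visual correction rounds.

```python
import math

def normalize_name(name):
    """Crude street-name normalization (illustrative assumption only)."""
    return " ".join(name.upper().replace(".", "").split())

def midpoint(seg):
    """Midpoint of a straight segment given as ((x0, y0), (x1, y1)) in meters."""
    (x0, y0), (x1, y1) = seg
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)

def match_segments(caltrans, tiger, max_offset_m=100.0):
    """Link each CalTrans segment to the nearest same-named TIGER segment.

    Both inputs map segment id -> (street name, geometry).
    Returns (links, unmatched): a {caltrans_id: tiger_id} linking table and
    a list of CalTrans ids left for manual review.
    max_offset_m is a hypothetical tolerance, not a value from the study.
    """
    links, unmatched = {}, []
    for c_id, (c_name, c_geom) in caltrans.items():
        cx, cy = midpoint(c_geom)
        best_id, best_dist = None, max_offset_m
        for t_id, (t_name, t_geom) in tiger.items():
            if normalize_name(t_name) != normalize_name(c_name):
                continue  # names disagree: not a candidate
            tx, ty = midpoint(t_geom)
            dist = math.hypot(tx - cx, ty - cy)
            if dist <= best_dist:
                best_id, best_dist = t_id, dist
        if best_id is None:
            unmatched.append(c_id)
        else:
            links[c_id] = best_id
    return links, unmatched

# Toy example with made-up coordinates (meters in a projected system).
caltrans = {
    "c1": ("Main St.", ((0.0, 0.0), (100.0, 0.0))),
    "c2": ("Oak Ave", ((0.0, 500.0), (100.0, 500.0))),
}
tiger = {
    "t1": ("MAIN ST", ((0.0, 20.0), (100.0, 20.0))),
    "t2": ("MAIN ST", ((0.0, 90.0), (100.0, 90.0))),
    "t3": ("Elm Ave", ((0.0, 505.0), (100.0, 505.0))),
}
links, unmatched = match_segments(caltrans, tiger)
```

In this toy case "c2" stays unmatched even though a TIGER street lies only 5 m away, because the names differ; this is the kind of miscoded-name case that the manual rounds were needed to resolve.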
Analytic Method
We used vehicle miles traveled (VMT), defined as the product of AADT (annual average daily traffic) and the length of each street segment, as our measure of traffic volume. Aggregate VMT for a site is the sum of VMT for each segment within a circular buffer; in this study, buffers had a radius of 50 m, 100 m, or 150 m (Figure 1).
Two tests, one based on CalTrans traffic data and one based on reconciled TIGER data, were used to classify facility risk for each combination of threshold and buffer distance. We expected the number of facilities classified as at risk to decrease with increasing threshold and with decreasing buffer distance.
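The buffer calculation and threshold test can be illustrated with a short sketch. Several details are assumptions made here for illustration rather than statements of the study's method: segments are treated as straight lines in projected coordinates (meters), each segment is clipped to the circular buffer before its length is multiplied by AADT, lengths are converted to miles, and a facility is flagged when aggregate VMT is greater than or equal to the threshold. The function names are hypothetical.

```python
import math

METERS_PER_MILE = 1609.344

def clipped_length_m(p0, p1, center, radius_m):
    """Length (m) of the straight segment p0-p1 lying inside the circle."""
    dx, dy = p1[0] - p0[0], p1[1] - p0[1]
    fx, fy = p0[0] - center[0], p0[1] - center[1]
    a = dx * dx + dy * dy
    if a == 0.0:
        return 0.0  # degenerate (zero-length) segment
    b = 2.0 * (fx * dx + fy * dy)
    c = fx * fx + fy * fy - radius_m ** 2
    disc = b * b - 4.0 * a * c
    if disc <= 0.0:
        return 0.0  # the line misses the circle or only touches it
    root = math.sqrt(disc)
    # Parameters (0..1 along the segment) where the segment enters and leaves.
    t_in = max((-b - root) / (2.0 * a), 0.0)
    t_out = min((-b + root) / (2.0 * a), 1.0)
    return max(t_out - t_in, 0.0) * math.sqrt(a)

def aggregate_vmt(facility_xy, segments, radius_m):
    """Sum AADT x in-buffer length (miles) over (p0, p1, aadt) segments."""
    total = 0.0
    for p0, p1, aadt in segments:
        miles = clipped_length_m(p0, p1, facility_xy, radius_m) / METERS_PER_MILE
        total += aadt * miles
    return total

def is_at_risk(facility_xy, segments, radius_m, vmt_threshold):
    """Assumed decision rule: at risk if aggregate VMT >= threshold."""
    return aggregate_vmt(facility_xy, segments, radius_m) >= vmt_threshold

# Made-up example: one road through the facility location, one 150 m away.
segments = [
    ((-200.0, 0.0), (200.0, 0.0), 50000.0),
    ((-200.0, 150.0), (200.0, 150.0), 80000.0),
]
vmt_100 = aggregate_vmt((0.0, 0.0), segments, 100.0)
```

Running the same function once with the original CalTrans geometry and once with traffic counts attached to TIGER geometry would give the two tests compared in this study.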
Spatial misalignments.
Spatial misalignments between CalTrans and TIGER can produce 2 opposite classification errors: a positive classification by the CalTrans test but a negative classification by TIGER, as well as the converse case. Of course, a facility might be correctly classified even when the 2 networks are misaligned if both CalTrans and TIGER tests measure traffic volume either above or below the criterion threshold. We defined 4 classification types according to the scenarios described and the ability of the unreconciled CalTrans data to correctly classify child care facilities at risk: consistent positive, false positive, false negative, and consistent negative. These classifications are described in Table 1.
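The 4 classification types amount to a cross-tabulation of the 2 test results, with the reconciled TIGER test treated as the reference. A minimal sketch, with hypothetical function names and made-up input pairs:

```python
from collections import Counter

def classify(caltrans_at_risk, tiger_at_risk):
    """Label one facility; the reconciled (TIGER) result is the reference."""
    if caltrans_at_risk and tiger_at_risk:
        return "consistent positive"
    if caltrans_at_risk:
        return "false positive"   # flagged only by unreconciled data
    if tiger_at_risk:
        return "false negative"   # missed by unreconciled data
    return "consistent negative"

def tabulate(results):
    """Count classification types over (caltrans_at_risk, tiger_at_risk) pairs."""
    return Counter(classify(c, t) for c, t in results)

# Made-up results for five facilities.
counts = tabulate([
    (True, True),
    (True, False),
    (False, True),
    (False, False),
    (False, False),
])
```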
We evaluated the effects of misclassification resulting from the use of unreconciled data in 2 ways. For policy considerations, prevalence of risk is central, so the key statistic is the ratio of at-risk facilities to all facilities. Both the reconciled and unreconciled data sets involved the same denominator (all geocoded child care facilities) but different numerators (facilities classified as at risk according to each test). If spatial misalignments were randomly distributed with an expected offset distance of zero, false positives and false negatives would net out, and use of unreconciled data would not affect calculated prevalences. Without evidence to support such an assumption, however, the impact of misalignments must be determined empirically. From a programmatic perspective, the absolute and relative numbers of false positives and false negatives are the important statistics. Limited intervention budgets necessitate identifying the facilities truly at risk: false positives would divert intervention resources to sites not actually at risk, whereas false negatives would leave facilities that are truly at risk without intervention. Unlike in the aggregate estimation of prevalence for policy purposes, in which the 2 error types offset each other, the costs associated with false positives and false negatives are cumulative. Consequently, use of unreconciled data may have greater effects at the programmatic level than at the policy level.
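The contrast between offsetting errors (policy level) and cumulative errors (programmatic level) can be made concrete with a small sketch. The counts below are invented for illustration and are not study results; the function name is hypothetical, and expressing both rates relative to the number of facilities at risk under the reconciled test is an assumption made here. The function also assumes at least one facility is at risk under the reconciled test.

```python
def policy_and_program_stats(cp, fp, fn, cn):
    """Summaries from counts of consistent positives (cp), false positives (fp),
    false negatives (fn), and consistent negatives (cn)."""
    total = cp + fp + fn + cn
    reconciled_at_risk = cp + fn      # positives under the reconciled test
    unreconciled_at_risk = cp + fp    # positives under the unreconciled test
    return {
        "prevalence_reconciled": reconciled_at_risk / total,
        "prevalence_unreconciled": unreconciled_at_risk / total,
        # Policy level: the two error types offset each other.
        "net_bias": (fp - fn) / reconciled_at_risk,
        # Programmatic level: the two error types add up.
        "misclassification_rate": (fp + fn) / reconciled_at_risk,
    }

# Invented counts: 1000 facilities, 100 at risk under the reconciled test.
stats = policy_and_program_stats(cp=90, fp=12, fn=10, cn=888)
```

With these invented counts the prevalence estimates differ only slightly (0.100 vs 0.102, a net bias of 2%), yet 22 facilities per 100 truly at risk are either missed or wrongly flagged.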
The results of the analysis of risk classification and prevalence rates are presented in Table 2. The relationship between the number of at-risk facilities and buffer area (πr²) was nonlinear, with the aggregate number of at-risk facilities increasing at a greater rate than the buffer area. Also, doubling the VMT threshold from moderate (4500 VMT) to high (9000 VMT) dramatically reduced the number of at-risk facilities in most cases. Although these patterns are not central to our focus, they strongly suggest that classification criteria have a significant impact on the reported magnitude of the risk associated with close proximity to high levels of traffic.
Policy Concerns
More germane to our focus is a comparison between the relative and aggregate number of at-risk facilities calculated via the 2 street geographies. For most permutations of threshold and buffer size, the reconciled data identified fewer at-risk sites than did the unreconciled data, the exception being the moderate threshold at 50 m. As a percentage of facilities identified by reconciled data, the relative number of discrepancies tended to decrease with increasing buffer size; here the exception was the high threshold for 100 m. This pattern suggests that the offset error between the reconciled and unreconciled data sets was not randomly distributed; otherwise, the relative discrepancies (which ranged from 1% to 30%) would have been close to zero in all cases. The most restrictive buffer (50 m) tended to result in the highest relative errors (an undercount of 9% with the moderate threshold and an overcount of 30% with the high threshold), but such a small buffer is useful only in identifying facilities in immediate proximity to the largest pollution sources (freeways and major thoroughfares). Larger buffers tend to capture more of the at-risk facilities from moderately large or multiple sources. Also, for these larger buffers, the reconciled data showed improvement over the unreconciled data, although this improvement was not as large as that associated with the 50-m buffer. In policy terms, these findings demonstrate that errors due to spatial misalignment are not randomly distributed; consequently, estimation of the number of child care facilities at risk is biased when unreconciled data are used at the specified buffer distances. With a 150-m buffer, the distance at which vehicle-related air pollutants drop close to "background" concentrations, the bias was small, and the unreconciled data may have been sufficiently accurate to gauge the overall magnitude of the problem.
This buffer size, however, identifies facilities that are at a low level of exposure. The finding of the nonlinear increase in the count with greater criterion distance indicates that padding the buffer area to account for spatial misalignment could artificially produce a substantial overcount.
Programmatic Concerns
For the least restrictive risk criteria (150 m, moderate threshold), 12.5% (5.8% as false negative and 6.7% as false positive) of 2202 at-risk facilities (identified through reconciled data) were misclassified when unreconciled data were used. For the 100-m buffer, the resulting misclassifications were 25% with the moderate threshold and 48% with the high threshold. For the smallest buffer, the corresponding misclassification rates were 97% and 110%. Overall, the absolute number of misclassified facilities increased with increasing circle size; however, the misclassification rate (as a percentage of correctly identified at-risk facilities) generally decreased as circle size increased. Essentially, although the unreconciled data involved fewer total misses as buffer size decreased, the odds of facilities being incorrectly identified increased. The results for the disaggregated statistics on classification type revealed that spatial misalignment creates significant discrepancies in risk classification. The problem becomes increasingly severe as criteria become more restrictive. This trend is capped by the case of the high threshold at 50 m, for which the unreconciled data incorrectly identified more sites than they correctly identified. Such instances in which a given list of at-risk sites can be trusted less than half of the time are a significant problem from a programmatic perspective. Even in cases in which the problem of discrepancies is not as large, they still can represent an economically inefficient increase in allocating scarce funds, weakening the effectiveness of targeting limited resources. In fact, considering the case of the moderate threshold for 150 m (a distance at which pollutants are still considered a risk), in which "only" 12.5% of facilities were misclassified, 275 child care sites were either missed or incorrectly identified as at risk, a significant number of locations when limited resources for enforcement and remediation are available. 
Clearly, the reconciled data set makes its greatest contribution at the programmatic level.
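As a quick arithmetic check on the figures reported for the least restrictive criteria (150 m, moderate threshold), the snippet below converts the stated percentages back into approximate facility counts. Because the published percentages are rounded, the derived counts are approximate; the variable names are introduced here for illustration.

```python
reconciled_at_risk = 2202        # at-risk facilities under reconciled data (from the text)
false_negative_share = 0.058     # 5.8% reported as false negatives
false_positive_share = 0.067     # 6.7% reported as false positives

false_negatives = false_negative_share * reconciled_at_risk
false_positives = false_positive_share * reconciled_at_risk

# Programmatic view: the two error types add up.
misclassified = false_negatives + false_positives

# Policy view: the two error types offset each other.
net_overcount = false_positives - false_negatives
net_overcount_pct = 100.0 * net_overcount / reconciled_at_risk

print(round(misclassified), round(net_overcount_pct, 1))
```

The sum reproduces the roughly 275 misclassified sites cited above, whereas the net difference is under 1% of the reconciled count, which illustrates why the same misalignment looks minor in aggregate prevalence but substantial when individual facilities must be targeted.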
With the development of less costly and more user-friendly software, decreased costs of computing power, and increased availability of geographically coded data, GIS is proving to be a useful tool in studying the potential health effects of spatially localized environmental hazards. Such trends will continue into the future, encouraging and facilitating more spatially oriented analyses. An important task for improving the usefulness of this approach is identifying problems associated with merging data. One of the strengths of the technology is its capacity to overlay and analyze information from disparate sources, but this is also a potential weakness in that there is no assurance that data sets are properly aligned. This problem is not new to the broader GIS field, but it is worth exploring in the context of public health research. Our findings reveal some serious discrepancies when data sets are not spatially aligned and suggest that use of unreconciled data has policy and programmatic implications. Of course, it is impossible to say whether our results can be generalized to other data sets and spatial analyses. Moreover, the Los Angeles metropolitan area may have unique land-use and siting rules that affect the number of at-risk child care facilities. Despite these limitations, the discrepancies between the facilities identified by the reconciled and unreconciled data sets are sufficiently large that our findings should raise a red flag for all public health researchers using GIS. Explicitly, we found that by reconciling traffic volume data with other geographical data on siting of child care centers in Los Angeles County, we could improve estimations of the centers at risk from mobile source pollutants. This data reconciliation altered the results from both the policy and the programmatic perspective. 
In terms of policy, we found that the actual numbers of sites considered at risk according to our measures were marginally lower than revealed in a similar analysis involving unreconciled data. From a programmatic perspective, we found that using unreconciled data produced a dramatic miscount of those sites incorrectly classified as at risk as well as those misclassified as safe. We have several recommendations. At a minimum, researchers should assess how well various GIS data sets are spatially referenced to each other. This assessment would include evaluating data from the same agency but for different time periods. Even TIGER geography changes with time as errors are identified and corrected. If data sets are not corrected, it is important to determine whether they should be geographically reconciled. Unfortunately, there is no simple rule for determining when it is necessary to absorb the costly task of eliminating spatiotemporal discrepancies. This issue must be considered on a case-by-case basis. (As one reviewer noted, a potentially important empirical question with policy implications is whether the magnitude of the spatial discrepancy between 2 data sets varies systematically across neighborhoods according to socioeconomic status. If such a pattern exists, any analysis of socioeconomic status disparities in exposure to air pollutants involving unreconciled data would produce systematically biased results. The direction of the bias is an empirical issue that requires additional research.) Our findings do point to 1 guideline: scale matters. The more localized the effect ("pollution footprint"), the more likely it is that an analysis will benefit from such reconciliation. Regardless of the decision, it is important for researchers to explicitly discuss spatial referencing issues related to the data sets they are using, which will provide readers with a sense of any potential limitations of the findings produced. 
Although it is important for individual researchers to seriously consider these issues, there is also a need for the field as a whole to develop and adopt standards for geographical data. High-quality GIS data represent a collective good that would enhance future public health research.
We wish to thank Jun Wu and 3 anonymous reviewers for their helpful suggestions and the Ralph and Goldy Lewis Center for Regional Policy Studies at the University of California, Los Angeles, which supported this research.
Peer Reviewed
Accepted for publication September 24, 2005.
1. Acevedo-Garcia D. Zip code-level risk factors for tuberculosis: neighborhood environment and residential segregation in New Jersey, 1985–1992. Am J Public Health. 2001;91:734–741.
2. Arno PS, Gourevitch MN, Drucker E, et al. Analysis of a population-based Pneumocystis carinii pneumonia index as an outcome measure of access and quality of care for the treatment of HIV disease. Am J Public Health. 2002;92:395–398.
3. Cervero R, Duncan M. Walking, bicycling, and urban landscapes: evidence from the San Francisco Bay Area. Am J Public Health. 2003;93:1478–1483.
4. Cohen D, Spear S, Scribner R, Kissinger P, Mason K, Wildgen J. "Broken windows" and the risk of gonorrhea. Am J Public Health. 2000;90:230–236.
5. Curriero FC, Patz JA, Rose JB, Lele S. The association between extreme precipitation and waterborne disease outbreaks in the United States, 1948–1994. Am J Public Health. 2001;91:1194–1199.
6. Greenberg M, Mayer H, Miller KT, Hordon R, Knee D. Reestablishing public health and land use planning to protect public water supplies. Am J Public Health. 2003;93:1522–1526.
7. Holcomb CA, Lin M-C. Geographic variation in the prevalence of macular disease among elderly Medicare beneficiaries in Kansas. Am J Public Health. 2005;95:75–77.
8. James RC, Mustard CA. Geographic location of commercial plasma donation clinics in the United States, 1980–1995. Am J Public Health. 2004;94:1224–1229.
9. Krieger N, Chen JT, Waterman PD, Rehkopf DH, Subramanian SV. Race/ethnicity, gender, and monitoring socioeconomic gradients in health: a comparison of area-based socioeconomic measures – the Public Health Disparities Geocoding Project. Am J Public Health. 2003;93:1655–1671.
10. Lee RE, Cubbin C. Neighborhood context and youth cardiovascular health behaviors. Am J Public Health. 2002;92:428–436.
11. Maantay J. Zoning, equity, and public health. Am J Public Health. 2001;91:1033–1041.
12. Oyana TJ, Rogerson P, Lwebuga-Mukasa JS. Geographic clustering of adult asthma hospitalization and residential exposure to pollution at a United States–Canada border crossing. Am J Public Health. 2004;94:1250–1257.
13. Pearl M, Braveman P, Abrams B. The relationship of neighborhood socioeconomic characteristics to birthweight among 5 ethnic groups in California. Am J Public Health. 2001;91:1808–1814.
14. Ali M, Park J-K, Thiem V, Canh D, Emch M, Clemens J. Neighborhood size and local geographic variation of health and social determinants. Int J Health Geogr. 2005;4:12.
15. Carretta HJ, Mick SS. Geocoding public health data [letter]. Am J Public Health. 2003;93:699.
16. Diez Roux AV. Investigating neighborhood and area effects on health. Am J Public Health. 2001;91:1783–1789.
17. Dudley G. Scale, aggregation, and the modifiable area unit problem. Operational Geographer. 1991;9:28–33.
18. Elliott P, Wartenberg D. Spatial epidemiology: current approaches and future challenges. Environ Health Perspect. 2004;112:998–1006.
19. Krieger N, Waterman P, Chen JT, Soobader M-J, Subramanian SV, Carson R. Zip code caveat: bias due to spatiotemporal mismatches between zip codes and US census-defined geographic areas – the Public Health Disparities Geocoding Project. Am J Public Health. 2002;92:1100–1102.
20. Krieger N, Waterman P, Chen JT, Soobader M-J, Subramanian SV, Carson R. Krieger et al. respond. Am J Public Health. 2003;93:699–700.
21. Krieger N, Waterman P, Lemieux K, Zierler S, Hogan JW. On the wrong side of the tracts? Evaluating the accuracy of geocoding in public health research. Am J Public Health. 2001;91:1114–1116.
22. Nuckols JR, Ward MH, Jarup L. Using geographic information systems for exposure assessment in environmental epidemiology studies. Environ Health Perspect. 2004;112:1007–1015.
23. Openshaw S, Taylor P. The modifiable area unit problem. In: Wrigley N, Bennett RJ, eds. Quantitative Geography: A British View. London, England: Routledge & Kegan Paul; 1981:60–69.
24. Soobader M, LeClere FB, Hadden W, Maury B. Using aggregate geographic data to proxy individual socioeconomic status: does size matter? Am J Public Health. 2001;91:632–636.
25. Wu J, Funk TH, Lurmann F, Winer A. Improving spatial accuracy of roadway networks and geocoded addresses. Transactions in GIS. In press.
26. Edwards J, Walters S, Griffiths RK. Hospital admissions for asthma in pre-school children – relationship to major roads in Birmingham, United Kingdom. Arch Environ Health. 1994;49:223–227.
27. English P, Neutra R, Scalf R, Sullivan M, Waller L, Zhu L. Examining associations between childhood asthma and traffic flow using a geographic information system. Environ Health Perspect. 1999;107:761–767.
28. Garshick E, Laden F, Hart JE, Caron A. Residence near a major road and respiratory symptoms in US veterans. Epidemiology. 2003;14:728–736.
29. Pearson RL, Wachtel H, Ebi KL. Distance-weighted traffic density in proximity to a home is a risk factor for leukemia and other childhood cancers. J Air Waste Manage Assoc. 2000;50:175–180.
30. Wilhelm M, Ritz B. Residential proximity to traffic and adverse birth outcomes in Los Angeles County, California, 1994–1996. Environ Health Perspect. 2003;111:207–216.
31. Wjst M, Reitmeir P, Dold S, et al. Road traffic and adverse effects on respiratory health in children. BMJ. 1993;307:596–600.
32. Hitchins J, Morawska L, Wolff R, Gilbert D. Concentrations of submicrometre particles from vehicle emissions near a major road. Atmospheric Environment. 2000;34:51–59.
33. Zhu YF, Hinds WC, Kim S, Shen S, Sioutas C. Study of ultrafine particles near a major highway with heavy-duty diesel traffic. Atmospheric Environment. 2002;36:4323–4335.
34. Zhu YF, Hinds WC, Kim S, Sioutas C. Concentration and size distribution of ultrafine particles near a major highway. J Air Waste Manage Assoc. 2002;52:1032–1042.
35. Houston D, Ong PM, Wu J, Winer A. Proximity of licensed childcare to near-roadway vehicle pollution. Am J Public Health. In press.
36. Green RS, Smorodinsky S, Kim JJ, McLaughlin R, Ostro B. Proximity of California public schools to busy roads. Environ Health Perspect. 2004;112:61–66.
37. Gunier RB, Hertz A, Von Behren J, Reynolds P. Traffic density in California: socioeconomic and ethnic differences among potentially exposed children. J Expo Anal Environ Epidemiol. 2003;13:240–246.
38. Houston D, Wu J, Ong P, Winer A. Structural disparities of urban traffic in Southern California: implications for vehicle-related air pollution exposure in minority and high-poverty neighborhoods. J Urban Aff. 2004;26:565–592.
39. California Office of Geographic Information Systems. Functionally classified roads metadata. Available at: http://www.dot.ca.gov/hq/tsip/TSIPGSC/library/libdatalist.htm. Accessed March 11, 2005.
40. US Census Bureau. TIGER FAQ question 22. Available at: http://www.census.gov/cgi-bin/geo/tigerfaq?Q22. Accessed February 25, 2005.
41. Cayo MR, Talbot TO. Positional error in automated geocoding of residential addresses. Int J Health Geogr. 2003;2:10.
42. Whitsel EA, Rose KM, Wood JL, Henley AC, Liao DP, Heiss G. Accuracy and repeatability of commercial geocoding. Am J Epidemiol. 2004;160:1023–1029.