Dr. Pickle is Senior Mathematical Statistician and Coordinator of Geographic Research, Statistical Research and Applications Branch, Division of Cancer Control and Population Sciences, National Cancer Institute, Bethesda, MD.
Dr. Hao is GIS Analyst and Statistician, Department of Epidemiology and Surveillance Research, American Cancer Society, Atlanta, GA.
Dr. Jemal is Strategic Director, Cancer Occurrence, Department of Epidemiology and Surveillance Research, American Cancer Society, Atlanta, GA.
Mr. Zou is Statistical Programmer, Information Management Services, Inc., Silver Spring, MD.
Dr. Tiwari is Mathematical Statistician and Program Director, Statistical Research and Applications Branch, Division of Cancer Control and Population Sciences, National Cancer Institute, Bethesda, MD.
Dr. Ward is Managing Director, Surveillance Research, Department of Epidemiology and Surveillance Research, American Cancer Society, Atlanta, GA.
Mr. Hachey is Statistical Programmer, Information Management Services, Inc., Silver Spring, MD.
Dr. Howe is Executive Director, North American Association of Central Cancer Registries, Inc., Springfield, IL.
Dr. Feuer is Chief, Statistical Research and Applications Branch, Division of Cancer Control and Population Sciences, National Cancer Institute, Bethesda, MD.
This article is available online at http://CAonline.AmCancerSoc.org
| ABSTRACT |
|---|
|
|
|---|
| INTRODUCTION |
|---|
The method to produce the ACS estimates has been refined as more incidence data have become available and statistical methods have improved. Beginning with the 1998 estimates, the statistical projection methods for cancer cases and deaths were changed from linear projections to an autoregressive quadratic time trend model.3 The projection method for deaths was further changed to a state-space model (SSM) beginning with the 2004 estimates, after a study demonstrated that the SSM produced more accurate predictions than the autoregressive quadratic time trend model.4
In order for the methods now used by the ACS to project accurate estimates of new cases and deaths to the current year, long-term data (8 or more years) must be available for all US states or for a subset of states that are representative of the entire United States. Long-term cancer mortality data exist for all US states since 1933, while long-term incidence data are available since 1975 only from the original registries included in the National Cancer Institute's Surveillance, Epidemiology, and End Results (SEER) Program (SEER9), covering about 10% of the population.5 The ACS method projects the total number of cases in the United States to the current year by a two-step process. First, the annual age-specific rates in the 9 oldest SEER areas are applied to the corresponding age-specific population from 1979 to the most current year for which data are available to estimate the number of new cancer cases diagnosed in each of those years. Then, a quadratic autoregressive time series model is applied to these estimates to project 4 years ahead to produce the projected total number of cases in the current year. State estimates are derived by apportioning the total US case estimates by state, based on the distribution of estimated cancer deaths. Underlying assumptions of this method are that age-specific incidence rates from the combined 9 oldest SEER cancer registries are representative of the US population and that the incidence-to-mortality ratios are constant across all states.
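The two-step procedure and the state apportionment described above can be sketched as follows. This is a minimal illustration rather than the ACS implementation: the function names and inputs are hypothetical, and the autoregressive error structure of the time series model is omitted, leaving only a least-squares quadratic trend.

```python
import numpy as np


def estimate_national_cases(rates_by_age, population_by_age):
    """Step 1: apply age-specific rates (cases per person) to the
    matching age-specific populations and sum over age groups."""
    return float(np.dot(rates_by_age, population_by_age))


def project_quadratic(years, counts, target_year):
    """Step 2 (simplified): fit a quadratic trend by least squares and
    extrapolate it to target_year. The autoregressive term is omitted."""
    # Measure time from the first data year to keep the fit well conditioned.
    t = np.asarray(years, dtype=float) - years[0]
    coeffs = np.polyfit(t, np.asarray(counts, dtype=float), deg=2)
    return float(np.polyval(coeffs, target_year - years[0]))


def apportion_by_deaths(us_cases, state_deaths):
    """Split the national case total across states in proportion to each
    state's share of estimated cancer deaths."""
    total_deaths = sum(state_deaths.values())
    return {state: us_cases * deaths / total_deaths
            for state, deaths in state_deaths.items()}
```

The apportionment step makes the constant incidence-to-mortality assumption explicit: a state with 30% of the estimated deaths receives exactly 30% of the projected cases, whatever its actual incidence pattern.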
Cancer registries have now been established in every state and territory in the United States, and high-quality incidence data are available for several years for most, providing the opportunity to improve the ACS case projections by taking geographic variability of incidence rates into account. However, since only about half of states outside the SEER9 areas have incidence data that have met national criteria of high quality and completeness for 8 or more years,6 with no data available at all for some states, a new method for case projection was developed.
The new method uses statistical models of cancer incidence that incorporate potential predictors and spatial and temporal variation of cancer occurrence and that account for delay in case reporting. This paper describes the new method and compares its case projections for 2007 to those using the existing ACS method. Based on evidence that the new method produces more accurate estimates of the number of new cancer cases for years and areas for which data are available for comparison, the ACS has elected to use it to estimate the number of new cancer cases in CFF 2007 and in Cancer Statistics, 2007.7,8
| MATERIALS AND METHODS |
|---|
To validate the proposed methods for estimating the numbers of new cases in 2007, the spatial and temporal components of the method were tested separately. First, the spatial model described above was used to estimate the numbers of new cases in every US state for four major cancer sites (breast, prostate, lung and bronchus, colon and rectum) in each year for which state-specific results were available in the U.S. Cancer Statistics Report (USCS).11–13 USCS reports included the numbers of cases for 25 types of cancer reported by 42 states in 1999 and 2000 and by 44 states in 2001. This test was based on the 17 SEER registries with data available for each test year. Output from this model consisted of the numbers of cases estimated for each state that year; these are either modeled estimates for states that have data or "spatial projections," ie, estimates for states that have no observed data for a given year, based on data available from other registries. For comparison, the numbers of cases were also estimated for each state and year using the previous ACS method. Results from each method were compared with the observed numbers of cases as published in the USCS reports either by the squared deviations (square of the estimated minus observed counts) of the total summed over available states or by the sum of the squared deviations for each state.
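The two comparison criteria differ in whether state-level errors are allowed to cancel. A minimal sketch, assuming estimates and observations are held in dictionaries keyed by state (the function names and data layout are illustrative, not taken from the study):

```python
def squared_deviation_of_total(estimated, observed):
    """Square of (estimated total - observed total), summing counts over
    the states present in both inputs. State errors can cancel here."""
    states = estimated.keys() & observed.keys()
    total_est = sum(estimated[s] for s in states)
    total_obs = sum(observed[s] for s in states)
    return (total_est - total_obs) ** 2


def sum_of_squared_deviations(estimated, observed):
    """Sum over states of (estimated - observed)^2. Every state-level
    error contributes, so over- and underestimates do not cancel."""
    states = estimated.keys() & observed.keys()
    return sum((estimated[s] - observed[s]) ** 2 for s in states)
```

For example, with hypothetical estimates of 110 and 90 cases against observed counts of 100 and 100, the first criterion is 0 while the second is 200, so a method can match the national total well and still fit individual states poorly.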
As a second step in the validation process, output from the spatial projection model applied to each of a number of years was used to find which temporal projection method was best for projecting incidence counts 4 years ahead in time. This study was based on observed numbers of malignant cases from the SEER registries beginning in 1988 (with varying numbers of registries over time as SEER expanded from SEER9 to SEER17).5 Data from 1988 to 1995 were used to predict the 1999 estimated number of new cases, from 1988 to 1996 to predict 2000, and from 1988 to 1997 to predict 2001.
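The design of this temporal validation can be written down compactly: each training window starts in 1988, ends in a different year, and is scored against the year 4 years past its end. The helper below is a hypothetical sketch of that bookkeeping only.

```python
def validation_windows(first_year=1988, last_train_years=(1995, 1996, 1997),
                       horizon=4):
    """Return (training_years, target_year) pairs, one per training window.
    The target year is `horizon` years after the last training year."""
    return [(list(range(first_year, end + 1)), end + horizon)
            for end in last_train_years]
```

With the defaults this yields the three pairs described above: 1988 to 1995 predicting 1999, 1988 to 1996 predicting 2000, and 1988 to 1997 predicting 2001.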
Four different methods for temporal projection of model-based estimates were tested: the previous ACS quadratic time series method (PROC FORECAST [PF]), a state-space method (SSM) currently used to project mortality counts ahead in time for CFF,4 a piecewise linear regression method (joinpoint method [JP])14,15 currently used to describe trends in incidence and mortality in many cancer registry reports,16 and a newly proposed semiparametric Dirichlet process method (DIR).17 Each of these methods was used to determine the time trends in the estimated counts across the available data years, then to project the number of cases 4 years ahead. The projected state-specific numbers of cases from each method were compared with the observed numbers of cases as published in the USCS reports on the basis of the sums of squared deviations.
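To illustrate the idea behind the piecewise linear (JP) projection, the sketch below fits a continuous two-segment line with a single joinpoint chosen by grid search, then extrapolates the final segment. It is a simplified stand-in under stated assumptions: the function names are hypothetical, only one joinpoint is considered, counts are modeled on the original scale, and the procedure for selecting the number of joinpoints is omitted.

```python
import numpy as np


def fit_one_joinpoint(years, counts):
    """Least-squares fit of a continuous piecewise linear trend with one
    joinpoint, chosen among interior data years by minimum squared error."""
    t = np.asarray(years, dtype=float)
    y = np.asarray(counts, dtype=float)
    if t.size < 5:
        raise ValueError("need at least 5 data years")
    best = None
    # Skip the first and last two years so each segment has enough points.
    for knot in t[2:-2]:
        # Columns: intercept, overall slope, change in slope after the knot.
        X = np.column_stack([np.ones_like(t), t - t[0],
                             np.maximum(t - knot, 0.0)])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        sse = float(np.sum((X @ beta - y) ** 2))
        if best is None or sse < best[0]:
            best = (sse, float(knot), beta)
    return best[1], best[2]


def project_joinpoint(years, counts, target_year):
    """Extrapolate the last fitted segment to target_year."""
    knot, beta = fit_one_joinpoint(years, counts)
    x = np.array([1.0, target_year - years[0], max(target_year - knot, 0.0)])
    return float(x @ beta)
```

Because the projection follows only the slope of the last segment, a recent change in trend carries straight through to the projected year, whereas a single quadratic curve must compromise across the whole series.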
Following the determination of the best spatial models and temporal projection method, the model was extended to incorporate time trends over the data period (L. W. P., unpublished data, 2006). The time trend was modeled as a quadratic function, similar to the previous ACS method, but the temporal effect could vary by geographic region or by county characteristic (eg, time trends could differ in urban and rural counties). The model included extra variation due to correlation of the numbers of cases over time and place (county, state, and region) and an additional term to account for any remaining "overdispersion," ie, greater than expected variation in Poisson-distributed counts. This model was implemented using SAS PROC GLIMMIX software with its optional spline-based approximation for spatial and temporal autocorrelation18 (also L. W. P., O. Schabenberger, A. Stephens, unpublished data, 2006). One advantage of this more complex spatio-temporal model is that only a single application of the model to data for the entire time span is required, rather than separate applications of the model to each year's data. More importantly, the spatio-temporal model shares information across nearby points of time and place simultaneously to provide the best results.
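One plausible way to write a model with these ingredients is shown below. The notation is illustrative and is not taken from the paper: $y_{cgt}$ is the number of new cases in county $c$, demographic stratum $g$, and year $t$; $n_{cgt}$ is the corresponding population; $\mathbf{x}_{cgt}$ holds the covariates; $r(c)$ is the region (or county type) of county $c$; $u_{ct}$ is a random effect correlated over space and time; and $\phi$ is an overdispersion parameter, with $\phi = 1$ corresponding to pure Poisson variation.

```latex
\begin{aligned}
\mathrm{E}(y_{cgt} \mid u_{ct}) &= \mu_{cgt}, \qquad
\mathrm{Var}(y_{cgt} \mid u_{ct}) = \phi\,\mu_{cgt}, \\
\log \mu_{cgt} &= \log n_{cgt}
  + \mathbf{x}_{cgt}^{\top}\boldsymbol{\beta}
  + \gamma_{1,r(c)}\,t + \gamma_{2,r(c)}\,t^{2}
  + u_{ct}.
\end{aligned}
```

Under this sketch, the quadratic time trend enters through $\gamma_{1,r(c)}$ and $\gamma_{2,r(c)}$, which are allowed to differ by region or county characteristic, and the sharing of information across nearby times and places comes from the correlation structure assumed for $u_{ct}$.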
The spatial projection component of the model, ie, estimation of numbers of new cases in states without observed data, requires good spatial coverage in all regions of the United States, so utilizing data from a large and geographically dispersed portion of the United States was critical. For the 2007 projection, an incidence database covering 1995 to 2003 was obtained through an agreement with the North American Association of Central Cancer Registries (NAACCR). The data source was the response to the NAACCR Call for Data submissions as of December 2005. US cancer registries reporting data to NAACCR participate in the National Cancer Institute's Surveillance, Epidemiology, and End Results (SEER) Program or the Centers for Disease Control and Prevention (CDC) National Program of Cancer Registries (NPCR), or both, and receive support from the state, province, or territory where they are located. Registries for 40 states, the District of Columbia (DC), and the Detroit metropolitan area (Figure 1) met NAACCR registry certification standards6 as providing complete, accurate, and timely data for at least 3 consecutive years during 1995 to 2003 and agreed to release county-level incidence data for this project. Together, these registries cover 86% of the US population, although not every state included in this modeling effort had data for every year.
The cancer site was coded according to the SEER Program recodes in the same manner used for previous CFF reports.21 Race was grouped as White, Black, and Other. Although the NAACCR file identifies much finer race categories, the numbers of cases observed among Hispanics and Asian American/Pacific Islanders, for example, were too low in most regions of the United States to permit stratification of individual cases beyond three broad categories. However, the percentages of Hispanics, Asian American/Pacific Islanders, and American Indian/Alaskan Natives in each county were included in the model to capture variations in incidence due to different racial mixes of the population. Age at diagnosis was initially coded to age groups 0 to 4 years, 5 to 14 years, 15 to 24 years, 25 to 34 years, 35 to 44 years, 45 to 54 years, 55 to 64 years, 65 to 74 years, 75 to 84 years, and 85+ years; younger age groups were usually aggregated to ensure adequate numbers of cases in each stratum for analysis, typically age 0 to 34 years, depending on the cancer site. Input to the models consisted of numbers of new cases stratified by site, sex, race, age group, county or Health Service Area (HSA) of residence, and year of diagnosis rather than individual case records. Similarly stratified populations were obtained from the Census Bureau.22
Approximately 35 covariates were considered as potential predictors of incidence in the new models. Only age, sex, race, county of residence, and type of cancer were available for the individual cases. All other predictors were population characteristics for the county or HSA, including measures of income, education, housing, racial distribution, urban/rural status, availability of physicians and cancer screening facilities, health insurance coverage, cigarette smoking, obesity, cancer screening rates, and mortality rates. These covariates were available for every US county from a variety of sources, including the Census Bureau, Area Resource File,23 CDC,24 and the National Center for Health Statistics.25 Behavioral risk factor and screening variables from the CDC Behavioral Risk Factor Surveillance System were calculated as mean proportions at the state level for each year. Differences between each county's calculated proportion and its state value for the aggregated period 1994 to 2003 were also calculated to measure within-state variation of the risk and cancer screening behaviors. Annual values for all other covariates were calculated by linear interpolation between available data years and linear extrapolation to 2003 beyond the last available year.
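The interpolation and extrapolation of covariates to annual values can be sketched as below. This is a minimal illustration with a hypothetical function name; it assumes the known years are sorted, that at least two are available, and that extrapolation continues the slope of the last two known points. Values requested before the first known year are simply held at the first known value, which is an assumption of this sketch rather than something stated in the text.

```python
import numpy as np


def annual_values(known_years, known_values, out_years):
    """Linear interpolation between known data years, with linear
    extrapolation past the last known year using the final slope."""
    xs = np.asarray(known_years, dtype=float)
    ys = np.asarray(known_values, dtype=float)
    if xs.size < 2:
        raise ValueError("need at least two known data years")
    out_years = np.asarray(out_years, dtype=float)
    # np.interp interpolates inside the range and clamps outside it.
    out = np.interp(out_years, xs, ys)
    # Replace the clamped values beyond the last year with extrapolated ones.
    last_slope = (ys[-1] - ys[-2]) / (xs[-1] - xs[-2])
    beyond = out_years > xs[-1]
    out[beyond] = ys[-1] + last_slope * (out_years[beyond] - xs[-1])
    return out
```

For a covariate measured only in 1990 and 2000 with hypothetical values 10 and 20, this gives 15 for 1995 and 23 for 2003.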
Results of the spatio-temporal models are cancer- and sex-specific smoothed annual estimates for registries that provided data and annual modeled estimates for registries with missing data for each year (1995 to 2003). The assumed spatial and temporal autocorrelation plus covariates included in the model result in a sharing of information across areas that are similar in location, time, and county characteristics. For example, the number of new cases for registries with no input data at all will be estimated using several years of data from neighboring states and from other states and counties with similar sociodemographic and lifestyle profiles; estimated numbers for a registry with a single missing year of data are based on observations from that registry before and after the missing time point, as well as from states that are neighbors or have similar characteristics.
Model estimates were added over age, race, and county to produce state-year-cancer-specific estimates for the time span of the available incidence data. These estimated numbers were then adjusted to account for the delay expected in reporting cancer cases to the registry.26 The number of new cases reported to the SEER registries in the most recent data year is on average 3.5% to 4.5% below what it eventually will be after case finding by the registry is complete, but the shortfall can be as high as 21% (for leukemia), depending on the type of cancer and the sex, race, and age of the patient. The delay adjustment modifies the observed numbers more in the most recent reporting years to account for future anticipated corrections to the data.27 To date, delay adjustment estimates have only been developed for the long-running SEER9 registries. However, results from all registries, not just SEER9, were delay adjusted, assuming that these SEER-derived factors hold for the entire United States. As longer incidence time series become available from more registries, more appropriate delay factors can be developed. Although the factors used in this new method are not ideal, without any adjustment at all the number of new cases could falsely appear to be trending downward in the most recent years, impacting the projected trend into the future. The delay-adjusted numbers were then projected ahead to 2007.
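The mechanics of a delay adjustment can be sketched as a multiplication of each year's estimated count by a factor that depends on how recent the year is. The function name and the factor values in the usage note are hypothetical placeholders, not the published SEER delay factors, which also vary by cancer site, sex, race, and age.

```python
def delay_adjust(counts_by_year, factor_by_lag, last_data_year):
    """Inflate each year's count by a delay factor indexed by the lag
    (in years) between that year and the most recent data year.
    Years with no listed factor are treated as complete (factor 1.0)."""
    adjusted = {}
    for year, count in counts_by_year.items():
        lag = last_data_year - year
        adjusted[year] = count * factor_by_lag.get(lag, 1.0)
    return adjusted
```

With illustrative factors of 1.04 for the latest year, 1.02 for the year before, and 1.01 for two years back, a flat series of 1,000 cases per year becomes 1,010, 1,020, and 1,040 in the last three years, which shows how the adjustment counteracts an artificial downturn at the end of the series.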
| RESULTS |
|---|
The JP is more flexible than the PF because it fits multiple linear segments to the time series, and thus is more sensitive to sudden changes in trend than the presumed quadratic time trend used by PF. The semiparametric method (DIR) and the SSM apparently require a longer time series than was available in order to project several years ahead and cannot provide state-specific estimates for missing data states. On the basis of this validation study, the JP is the preferred method to project the number of new cases ahead in time, at least until a much longer time series is available for most states. Therefore, the projected numbers of cancer cases in 2007 for each sex/cancer site combination were produced by the following steps:
Table 3 presents the results projected for the total United States by cancer site using the new and old methods for 2007. The new method projects that there will be 1,444,913 new cancer cases among men and women in 2007, which is 1.8% higher than the 2007 projection using the previous ACS method. The total numbers of cases estimated by the old and new methods are quite similar over the period of 1995 to 2003, although the faster increase over time estimated by the new method leads to the slightly higher projected number of cases in 2007 (Figure 2). However, there are substantial differences between the two methods in the number of cases projected by site. Among the 4 most common cancer sites, projections from the new method compared with the old method are 15.3% higher for lung cancer, 3.7% higher for colorectal cancer, 5.5% lower for prostate cancer, and 15.2% lower for female breast cancer (Table 2). Cancer site groupings where the estimates of new cases are more than 10% higher than the previous CFF method predicted are oral cavity and pharynx (+11.1%), with a notable increase in pharyngeal cancer (+29.6%); respiratory system cancers (+17.1%), with notable increases in all 3 cancer site categories; urinary system (+11.7%), with a notable increase in cancer of the kidney and renal pelvis (+23.1%); multiple myeloma (+18.4%); and leukemia (+23.4%), with notable excesses in all 4 major subtypes of leukemia. There were 2 cancer site groupings where the estimates of new cases are more than 10% lower than the previous CFF method predicted; these are bones and joints (–11.1%) and female breast cancer (–15.2%), as previously noted. Estimates more than 10% lower were also observed for some relatively uncommon cancer sites (Table 3).
| DISCUSSION |
|---|
Projections for specific cancer sites vary more substantially than projections of total cases. There are several reasons why the projections from the new method are likely to be more accurate than those from the earlier method:
Like any method for projecting the number of new cancer cases 4 years ahead from observed data, the new method has some limitations. Not all states and cancer sites are predicted equally well. The accuracy of the model results is dependent on inclusion of a sufficient set of covariates to explain the incidence patterns across the United States. The numbers of new cancer cases can be adequately predicted for most states using the new model, even without observations from them, but the presence of unmeasured risk factors or effective cancer control programs can impact the number of cases in ways that cannot be predicted. For example, a model using data from NAACCR 1995 to 2002, which did not include data from Pennsylvania, substantially underestimated the number of new lung cancer cases among males in Pennsylvania, but when Pennsylvania data were included in the expanded dataset used for the 2007 projections, its predicted count was very close to the observed count for 2003. North Carolina, on the other hand, was well estimated whether or not its observed data were included as input to the model.
Another limitation of the new model as implemented for 2007 is its assumption of a quadratic time trend over the short time span of data (1995 to 2003). Although no evidence was seen for a lack of fit, this assumption may impose a curvature onto the time trend that is not present in the observed data and which limits the sensitivity of the model to short-term variations or sudden changes in the trend. In the future, as the time span of the data available from most state registries lengthens, improved time series models can be used.
Inaccurate projections of the numbers of cases to 2007 may result from applying delay-adjustment factors that are based on case finding patterns in SEER registries to all registry data. When additional information on cumulative reporting patterns is available for other areas, more appropriate factors can be used.
Large differences in projections by the old and the new methods for the major cancer sites are of special importance since they have the greatest impact on the cancer burden. The 15.3% increase in estimates of lung cancer cases in the new compared with the old method most likely results from recognized differences in tobacco use patterns between the SEER9 areas and the fuller geographic data set used in the new model. Average annual age-standardized lung cancer incidence rates (1999 to 2003) in the 41 states providing input to the new method are 11% higher for males and 5% higher for females than those in the 9 oldest SEER areas used by the old method. Several other smoking-related cancers showed similar patterns.
The greater number of cases projected for leukemia and all of its subtypes appears to be due to the effect of delay adjustment, which was not included in the previous ACS method. Before projection to 2007, model estimates of the number of leukemia cases in 2003 were inflated by 10% for cases under age 45 years, by 21% for age 45 to 64 years, and by 18% for cases over age 64 years, resulting in a 12% greater total number of leukemia cases estimated in 2003 and 23% greater in 2007 by the new method. These factors have been used for several years to adjust SEER incidence rates that, for leukemia, can result in an apparent increasing trend when the observed rate trend is declining.28 The long estimated delay in case reporting is due to the nature of cancers of the hematopoietic system. Because no surgery is required for diagnosis or treatment of leukemia, many cases are not seen in a hospital, making case finding more difficult for the cancer registry. Also, children and young adults are diagnosed with acute more often than chronic leukemia. These younger cases often initially present with a medical crisis and so are identified by a hospital record more often than older cases with chronic disease. Because of the new adjustment for these expected delays in case finding, the number of cases projected by the new method should better reflect the actual number of new leukemia cases.
For breast cancer, the reasons for the 15.2% decrease in projected cases for 2007 using the new compared with the old method may be somewhat more complex. Age-adjusted rates in SEER9 registries, which are the basis for the previous ACS method, were about 6% higher than similarly adjusted rates in the geographic areas used for the spatio-temporal model (Figure 4), suggesting that use of an expanded registry database is at least partly responsible for the lower projected number of breast cancer cases. Another factor that may contribute to the differences is the uncertainty in projecting ahead in time when the underlying incidence trends appear to be changing. Trends in breast cancer incidence rates in most geographic areas used as input to the spatio-temporal model have shown a recent stabilization, possibly even a downturn, after increasing for several years.16 These changes have been modeled differently by the methods used to project numbers of cases to 2007 (Figure 3), and at the present time it is unclear which method is more accurate. However, for 1999 to 2003, the observed numbers of new cases in the geographic areas whose incidence data were used in the spatio-temporal model were well fit by the new model (Figure 5).
The lower case estimate (by 5.5%) for prostate cancer by the new method is in part due to differences in prostate cancer incidence rates between the regions covered by the 2 methods. Average annual age-standardized prostate cancer incidence rates for 1999 to 2003 in the 41 states providing input to the new method are 8.8% lower than those in the 9 oldest SEER areas used by the old method, perhaps reflecting regional differences in utilization of prostate-specific antigen testing.
Despite some limitations, the new spatio-temporal model plus JP regression for temporal extrapolation appears to provide improved estimates of the numbers of new cases, both for individual states and for the nation, even for the less common cancers. Based on these results, the ACS has decided to use this method to project the number of new cases for CFF 2007.
| REFERENCES |
|---|
|
|
|---|