Biostatistical Methods: The Assessment of Relative Risks
John M. Lachin
John Wiley and Sons, 2000
Thank you for your interest in this book. Throughout the book I describe the implementation of the various methods using SAS procedures, when available. Here I have also included my own SAS programs and macros referred to in the book that were used to perform supplemental computations. My programs are "workmanlike": they get the job done without a lot of bells and whistles. Thus, the programs may require a little study to see how they work. Rather than allowing the user to specify various options and to customize the program to meet the needs of an analysis, my programs do whole sets of computations simultaneously. For example, the program for the analysis of multiple 2x2 tables does all analyses for the risk difference, odds ratio and relative risk scales simultaneously. Other than for exercises, the user should decide a priori exactly which elements of the output are to be employed. It is cheating to inspect all the various computations and then choose the one that provides the most "favorable" results.
The materials are organized by chapter. The individual elements may be downloaded directly to your local computer and then accessed using a basic word processor or by SAS.
One of the main features of the book is the extensive problem sets for each chapter. However, I am sure that many more could have been included. I welcome suggestions for additional Problems (or Examples). Any appropriate suggestions will also be posted on this site (with attribution) so that others may also benefit. Check the supplement with additional examples, problems and data sets. I have also included here some additional programs for computations related to those presented in the text.
In our graduate program, I teach a one semester course from this text (Statistics 225) that is required of all of our MS and PhD students in Biostatistics, and all PhD students in Epidemiology. All students are required to have completed a two semester course in mathematical statistics at the level of Hogg and Craig. See the syllabus for my one semester course.
The following are the reference materials for each chapter, principally SAS programs and macros. This link contains the data sets employed for the analyses presented in the book. These are "flat files" which can be downloaded and accessed directly.
This introductory chapter describes the natural history of diabetic nephropathy and the results of the Diabetes Control and Complications Trial. This link provides, among other information, the complete DCCT bibliography, with links to the abstract of each paper, and in some cases, to the complete manuscript. The following are the links for the specific papers referred to in this book.
- The DCCT Research Group. The Diabetes Control and Complications Trial (DCCT): Update. Diabetes Care, 13:427-433, 1990. Abstract
- The Diabetes Control and Complications Trial Research Group. The effect of intensive treatment of diabetes on the development and progression of long-term complications in insulin-dependent diabetes mellitus. The New England Journal of Medicine, 329:977-986, 1993. Abstract
- The Diabetes Control and Complications (DCCT) Research Group. Effect of intensive therapy on the development and progression of diabetic nephropathy in the Diabetes Control and Complications Trial. Kidney International, 47:1703-1720, 1995. Abstract
- The Diabetes Control and Complications Trial Research Group. The relationship of glycemic exposure (HbA1c) to the risk of development and progression of retinopathy in the Diabetes Control and Complications Trial. Diabetes, 44:968-983, 1995. Abstract
- The Diabetes Control and Complications Trial Research Group. Adverse events and their association with treatment regimens in the Diabetes Control and Complications Trial. Diabetes Care, 18:1415-1427, 1995. Abstract
- The Diabetes Control and Complications Trial Research Group. The absence of a glycemic threshold for the development of long-term complications: The perspective of the Diabetes Control and Complications Trial. Diabetes, 45:1289-1298, 1996. Abstract
- The Diabetes Control and Complications Trial Research Group. Hypoglycemia in the Diabetes Control and Complications Trial. Diabetes, 46:271-286, 1997. Abstract
- Diabetes Control and Complications Trial Research Group (DCCT) (2000). The effect of pregnancy on microvascular complications in the diabetes control and complications trial. Diabetes Care, 23:1084-1091. Abstract
This chapter describes the basic analysis of proportions and a 2x2 table. The methods in Chapter 4 are generalizations of the methods in this chapter to allow for a stratified analysis. The book describes analyses using SAS PROC FREQ. I have also written macros to perform many of these analyses. See the descriptions under Chapter 4. The following programs are provided:
exact2x2.sas: This program takes the exact confidence limits on the odds ratio from StatXact (or SAS) and then computes the corresponding limits on the probabilities, and the exact corresponding limits on the risk difference and the relative risk. For illustration, the exact computations in SAS PROC FREQ are also presented for the odds ratio, as well as the asymptotic limits for the other scales.
Figure21.sas: generates figures relating the odds ratio, relative risk and risk difference when one measure is held constant over a range of values of p2. A version of this was used to generate Figure 2.1 where the risk difference was held constant.
hypr2x2.sas: computes hypergeometric probabilities for all possible 2x2 tables with fixed margins, as a function of a specified odds ratio.
par2x2.sas: computes the population attributable risk and C.I. using Walter's limits, Leung-Kupper limits, and the logit limits derived in the text.
tab2x2.sas: SAS Proc FREQ analysis of basic 2x2 tables.
Table23.sas: SAS Proc FREQ analysis of basic 2x2 tables.
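To illustrate the kind of computation described for hypr2x2.sas, the following is a minimal Python sketch (not the SAS program itself) of the conditional, non-central hypergeometric probabilities for every 2x2 table with fixed margins at a specified odds ratio. The function name, argument names and example margins are hypothetical.

```python
from math import comb

def hypergeometric_2x2(n1, n2, m1, odds_ratio=1.0):
    """Probability of each admissible value of the index cell a.

    n1, n2: the two row totals; m1: the first column total (all fixed).
    Each table is weighted by C(n1, a) * C(n2, m1 - a) * odds_ratio**a
    and the weights are normalized to sum to one.
    """
    lowest = max(0, m1 - n2)   # smallest index cell allowed by the margins
    highest = min(n1, m1)      # largest index cell allowed by the margins
    weights = {
        a: comb(n1, a) * comb(n2, m1 - a) * odds_ratio ** a
        for a in range(lowest, highest + 1)
    }
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

# Hypothetical margins; an odds ratio of 1 gives the central hypergeometric.
probs = hypergeometric_2x2(n1=5, n2=5, m1=4, odds_ratio=1.0)
for a, p in probs.items():
    print(a, round(p, 4))
```

Setting odds_ratio to a value other than 1 shifts probability toward larger or smaller values of the index cell, which is the basis for conditional inference on the odds ratio.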
This chapter describes the relationships between power and sample size in general and for the test for proportions. It also describes Pitman efficiency and the ARE of competing tests. In later chapters, these principles are used for the determination of sample size and the derivation of asymptotically efficient tests. Programs for sample size and power are also presented in later chapters for tests developed in those chapters. Here programs are grouped according to the topic.
Precision of an estimate:
NCon1Pr.SAS: Sample size N to provide a desired precision for a confidence interval for a single probability.
Prec1Prb.SAS: The precision for a confidence interval for a single probability evaluated over the range 0.5-1.0 with plots for a given sample size.
NConProb.SAS: N for a confidence interval for the difference between probabilities for 2 groups.
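As a rough illustration of the precision calculation for a single probability, the sketch below uses the usual large-sample normal approximation, in which the half-width of the interval is z times the square root of p(1-p)/N. Whether NCon1Pr.SAS uses exactly this expression is an assumption; the function and argument names are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def n_for_precision(p, half_width, conf_level=0.95):
    """Smallest N so that the large-sample confidence interval for a single
    probability p has half-width no greater than half_width."""
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    return ceil(z ** 2 * p * (1 - p) / half_width ** 2)

# Worst case p = 0.5 with a desired precision of +/- 0.05.
print(n_for_precision(0.5, 0.05))
```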
Test for proportions:
NPwrProb.SAS: Sample size based on the power function for the test of 2 proportions.
SSNProb.SAS: Sample size based on power for the test of 2 proportions over a specified range of control group probabilities and relative risks.
PowrProb.SAS: Power for the test of 2 proportions over a specified range of control group probabilities and relative risks.
PPwrPlot.SAS: Plot the power function for the test for two proportions as a function of the risk difference, relative risk or odds ratio for different sample sizes.
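The relationship between power and sample size for the test of two proportions can be sketched as follows. This is a standard large-sample formula that uses the pooled variance under the null hypothesis and the unpooled variance under the alternative; it is offered as an illustration and is assumed, not confirmed, to correspond to the computation in NPwrProb.SAS. All names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

def total_n_two_proportions(p1, p2, alpha=0.05, power=0.90, q1=0.5, two_sided=True):
    """Total sample size N for the large-sample Z-test of two proportions.

    q1 is the fraction of the N subjects allocated to group 1.
    """
    q2 = 1.0 - q1
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2) if two_sided else z(1 - alpha)
    z_beta = z(power)
    p_bar = q1 * p1 + q2 * p2   # average probability under the null
    sd_null = sqrt(p_bar * (1 - p_bar) * (1 / q1 + 1 / q2))
    sd_alt = sqrt(p1 * (1 - p1) / q1 + p2 * (1 - p2) / q2)
    return ceil(((z_alpha * sd_null + z_beta * sd_alt) / (p1 - p2)) ** 2)

# Hypothetical design: 0.20 versus 0.10, two-sided 0.05 level, 90% power.
print(total_n_two_proportions(0.20, 0.10))
```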
Test for means:
TpwrPlot.SAS: Plot the power function for the test for two means as a function of the non-centrality factor for different sample sizes. A variation of this was used to generate Figure 3.2.
Non-centrality parameter for proportions
SSNProbK.sas: Used to generate Table 3.1 which presents the non-centrality parameter for the test for two proportions as a function of the probabilities in each group.
Chapter 4: Fixed Effects Model
Freqanal.sas: The principal macro %freqanal that is used for the computation of fixed-effects model estimates and tests in a stratified analysis of 2x2 tables. These include the Mantel-Haenszel estimates and confidence limits (using the Robins, Breslow and Greenland variance), MVLE's and confidence limits, Cochran and Mantel-Haenszel tests, Radhakrishna tests, the Gastwirth MERT, and the Wei-Lachin test of stochastic ordering. The program then calls SAS PROC FREQ to perform a Cochran-Mantel-Haenszel analysis. Then the Cochran test of homogeneity is computed for each scale. This routine should be "submitted" or compiled before running another job, like KTab2x2.sas, to invoke the macro for a specific set of tables.
Freqmh.sas: A version of Freqanal that also computes the Hauck estimates of the variance of the Mantel-Haenszel estimators along with those of Robins, Breslow and Greenland. This also computes the Breslow-Day test of homogeneity of odds ratios and the corrected test of Tarone. This does not compute the MVLE's etc.
homogen.sas: A PROC IML macro that computes the MVLE and a Wald test based on the MVLE and its estimated variance for a parameter estimate within independent strata, and also computes the contrast Wald test of the hypothesis of homogeneity. Here the parameter estimates could be any quantity. The estimate and its variance within each stratum are input using "cards."
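For readers who want to see the core of a stratified analysis in a few lines, the following Python sketch computes the Mantel-Haenszel stratified-adjusted odds ratio and the Mantel-Haenszel chi-square test (without continuity correction) from a set of 2x2 tables. It covers only a small part of what %freqanal is described as doing, and the function name and example frequencies are hypothetical.

```python
def mantel_haenszel(tables):
    """tables: list of (a, b, c, d) frequencies, one 2x2 table per stratum,
    with rows = groups and columns = response (positive, negative).

    Returns the Mantel-Haenszel odds ratio and the Mantel-Haenszel
    chi-square statistic on 1 df (no continuity correction).
    """
    num = den = 0.0    # pieces of the MH odds ratio
    diff = var = 0.0   # sum of (observed - expected) and of variances
    for a, b, c, d in tables:
        n = a + b + c + d
        n1, n2 = a + b, c + d   # row totals
        m1, m2 = a + c, b + d   # column totals
        num += a * d / n
        den += b * c / n
        diff += a - n1 * m1 / n
        var += n1 * n2 * m1 * m2 / (n ** 2 * (n - 1))
    return num / den, diff ** 2 / var

# Two hypothetical strata.
odds_ratio, chi_square = mantel_haenszel([(10, 20, 5, 25), (12, 18, 8, 22)])
print(round(odds_ratio, 3), round(chi_square, 3))
```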
Random Effects Model
freqrand.sas: This is the principal macro %freqrand that computes the random effects model one-step adjusted estimates for multiple 2x2 tables. The macro starts with the computation of the variance component estimate for each scale, followed by the random effects model computations. The macro %freqanal must be invoked before this macro. See Ktab2x2.sas for an example.
randiter.sas: The macro %random that performs a fixed-point iterative random effects model analysis of multiple 2x2 tables. The analysis is performed simultaneously for each of the three scales until convergence on all scales is reached. This is a stand-alone macro and does not require that freqanal or freqrand be run beforehand.
KTab2x2.sas: This job invokes the above macros for the analysis of the data in Examples 4.1, 4.6 and 4.24 using fixed and random effects models.
tab4-4.sas: The SAS program that appears in Table 4.4.
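The one-step random effects computation can be illustrated with the familiar moment (DerSimonian-Laird type) estimate of the variance component. This is a sketch under that assumption; it is not taken from %freqrand, and the names and inputs are hypothetical. The inputs are the stratum-specific estimates on one scale (for example log odds ratios) and their estimated variances.

```python
def one_step_random_effects(estimates, variances):
    """Moment estimate of the between-stratum variance component, followed
    by the re-weighted (random effects) summary estimate and its variance."""
    k = len(estimates)
    w = [1.0 / v for v in variances]   # fixed effects weights
    fixed = sum(wi * ti for wi, ti in zip(w, estimates)) / sum(w)
    # Cochran's homogeneity statistic on k - 1 df.
    q = sum(wi * (ti - fixed) ** 2 for wi, ti in zip(w, estimates))
    denom = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / denom)   # truncated at zero
    w_re = [1.0 / (v + tau2) for v in variances]
    estimate = sum(wi * ti for wi, ti in zip(w_re, estimates)) / sum(w_re)
    return tau2, estimate, 1.0 / sum(w_re)

# Hypothetical log odds ratios and variances from three strata.
print(one_step_random_effects([0.2, 0.9, 0.5], [0.10, 0.08, 0.12]))
```

When the homogeneity statistic is no larger than its degrees of freedom, the variance component is set to zero and the result reduces to the fixed effects estimate.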
Power and Sample Size
CMHTstSN.sas: computes the sample size for the Cochran-Mantel-Haenszel test for stratified 2x2 tables and was used for the computations in Example 4.25 and Table 4.11. The program also computes the sample size from the unconditional marginal parameters, but without an adjustment for bias.
CMHTstpw.sas: is the companion program that computes the power of the Cochran-Mantel-Haenszel test for stratified 2x2 tables for fixed total sample size.
Chapter 5: Retrospective Studies
parretro.sas: A macro %retropar to compute the Population Attributable Risk for a case-control 2x2 Table and its confidence limits.
The frequency matched data presented in Example 5.2 are included as part of Ktab2x2.sas used in Chapter 4.
matchtab.sas: SAS PROC FREQ analysis of matched 2 x 2 tables as described in Table 5.1 and applied to the data from Examples 5.3 and 5.4.
pair2x2.sas: The macro %paired for the analysis of a matched or paired 2x2 table with McNemar's test, the conditional odds ratio and its confidence limits, and the population averaged relative risk and its limits. This is used for Examples 5.7 and 5.8.
Kpair2x2.sas: The macro %Kpaired for the stratified analysis of K paired or matched 2x2 tables. This is used in Examples 5.12 and 5.13.
KpairTab.sas: Analysis of the data in Example 5.12 using the macro %Kpaired for the stratified-adjusted analysis of K matched or paired 2x2 tables.
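The basic matched-pairs quantities can be sketched in Python as follows: McNemar's test and the conditional odds ratio with large-sample limits computed on the log scale, both of which depend only on the two discordant frequencies. This is an illustration, not the macro %paired, and it omits the population averaged relative risk; the names f and g and the example counts are hypothetical.

```python
from math import exp, log, sqrt
from statistics import NormalDist

def mcnemar(f, g, conf_level=0.95):
    """f, g: numbers of discordant pairs of each type (both must be > 0).

    Returns McNemar's chi-square (1 df, no continuity correction), the
    conditional odds ratio f/g, and its large-sample confidence limits.
    """
    chi_square = (f - g) ** 2 / (f + g)
    odds_ratio = f / g
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    se_log_or = sqrt(1 / f + 1 / g)
    lower = exp(log(odds_ratio) - z * se_log_or)
    upper = exp(log(odds_ratio) + z * se_log_or)
    return chi_square, odds_ratio, (lower, upper)

# Hypothetical discordant frequencies.
print(mcnemar(20, 10))
```

A zero discordant frequency makes the odds ratio and its limits undefined in this sketch.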
Breslow-Day Mack et al. Data
These data are employed both in Chapter 5 and again in Chapter 7. Here we only employ matched pairs consisting of each case and the first matched control (among 4). There are different versions of the data set available. That used herein was downloaded from Norman Breslow's web site in the fall of 1999 and is contained in the data sets directory cited above.
MackSetP.sas: generates the SAS data set jml/biostatmethods/datasets/pairs with the information for each pair. The stratification variable is the hypertension status at baseline. Run this job on your platform to generate a saved SAS data set.
MackPair.sas: generates the marginal summary 2x2 table shown in Example 5.8 and calls macro %paired, and also generates the tables within strata defined by the baseline hypertension status as described in Example 5.13. The program calls %Kpaired but the analysis fails due to a zero frequency within one stratum.
Power and Sample Size for McNemar's Test
McNemarN.sas: Sample size calculations for McNemar's test for paired or matched 2x2 tables using the unconditional power function, as illustrated in Example 5.9.
McNemUPw.sas: Power for McNemar's test for paired or matched 2x2 tables using the unconditional power function. This calculation is not illustrated in the book.
McNemCPw.sas: Power for McNemar's test for paired or matched 2x2 tables using the conditional power function, as illustrated in Example 5.10.
Additional programs are provided in the supplement that perform all of the various approaches to the computation of sample size and power described in Lachin (1992b).
Newton.sas: The macro %Newton that provides the iterative solution of a scalar function using either Newton-Raphson or Fisher scoring. See Recomb.sas for the use of this macro to solve for the recombination fraction in the Fisher maize data using either method. You must also specify:
1) A macro %FOFX(x) which evaluates the function of x to be solved and returns the value using the variable name "yofx". FOFX must be such that the value is zero at the solution.
2) A macro %DFOFX(x) which evaluates the derivative of the function at x and returns the value using the variable name "dyofx". For Newton-Raphson the Hessian is used; for Fisher scoring the negative expected information is used.
3) A macro variable ERROR which gives the desired error of the result; the iteration terminates when this error is reached.
4) A starting value in the variable x0 (not a macro variable).
5) A macro variable maxitn that fixes the maximum number of iterations, at which point the program aborts.
Macro variables are defined using a %let statement, e.g. %let error=0.00000001.
Recomb.sas: calls the macro %Newton to obtain an iterative estimate of the recombination fraction for Fisher's maize data presented in Example 6.6. Newton-Raphson and Fisher scoring are illustrated, each starting from the moment estimate starting value and also from the null value. This generates the computations described in Tables 6.1 and 6.2.
Khypr2x2.sas: computes MLE of conditional odds ratio for K 2x2 tables using the Ulcer clinical trial data as presented in Example 6.7. This program uses a modification of the macro %Newton that performs the Newton-Raphson iterative solution for a scalar function. In this version, %FOFX also evaluates the derivative of the function at x and returns the value using the variable name "dyofx". A macro DFOFX is not used.
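The iteration performed by a scalar root finder of this kind can be sketched in Python. The argument names below echo those described for %Newton (fofx, dfofx, x0, error, maxitn), but this is a hypothetical illustration, not a translation of Newton.sas, and the example function is unrelated to the maize data.

```python
def newton(fofx, dfofx, x0, error=1e-8, maxitn=50):
    """Solve fofx(x) = 0 by the iteration x <- x - fofx(x) / dfofx(x).

    For a likelihood, fofx is the score; passing the Hessian as dfofx gives
    Newton-Raphson, while passing the negative expected information gives
    Fisher scoring. Stops when the step is smaller than error in absolute
    value and raises an exception after maxitn iterations without convergence.
    """
    x = x0
    for iteration in range(1, maxitn + 1):
        step = fofx(x) / dfofx(x)
        x -= step
        if abs(step) < error:
            return x, iteration
    raise RuntimeError("no convergence within maxitn iterations")

# Hypothetical example: the positive root of x**2 - 2.
root, iterations = newton(lambda x: x * x - 2, lambda x: 2 * x, x0=1.0)
print(root, iterations)
```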
Chapter 7: Logistic Regression for i.i.d. Observations
CornfIML.sas: applies the binomial logit model to the Cornfield data as shown in Example 7.1 using the SAS PROC IML sample library routine LOGIT. The output was used to generate the iterative solution presented in Table 7.2.
Cornf.sas: The SAS Program for the Framingham CHD Data presented in Table 7.3. The output from this program appears in Table 7.4. This also contains the interaction model described in Example 7.10.
Renal.sas: The SAS program and data set presented in Table 7.5 for the analysis of a subset of the DCCT nephropathy data used in Example 7.4. This is used to generate the output presented in Table 7.6. This also includes the call of PROC GENMOD presented in Example 7.7 and the output presented in Table 7.7. Further it includes the computation of the sums of squares (through Proc Univariate) described in Example 7.14.
This renal data set is a subset of the DCCT nephropathy data set. The complete nephropathy data set is available from the National Technical Information Service (see DCCT link for Chapter 1 above).
DickSton.sas: Program for the logistic regression analysis of the unmatched retrospective study of Dick and Stone (1973) presented in Example 7.5.
LogiScor.sas: computes the score vector and the estimated expected information under the null hypothesis, and computes the model score test as illustrated in Example 7.6. First run Renal.sas to set up the renal data set.
LogiRob.sas: computes the robust information sandwich estimate of the covariance matrix of the coefficient estimates and robust confidence intervals and Wald tests for the renal data as shown in Example 7.8. This also computes the score vector and the information sandwich estimate under the null hypothesis, and the robust efficient score test. First run Renal.sas to set up the renal data set.
Sample Size and Power for Logistic Models
LogiBeta.sas: For a given design matrix and sampling fractions, this routine computes the expected logistic model parameters (betas) using two approaches. The first is to use the iterative maximum likelihood solution using a large set of hypothetical observations. The other is the WLS or GSK method described by Rochon (1989) based on a one-step WLS computation. This is used for Example 7.9.
LogiSSN.sas: computes sample size for logistic regression based on the power of a Wald test for a given design matrix, sample fractions and set of coefficients. This is used for the computations described in Example 7.9. In addition to the 1 df test described in the Example, the determination of sample size for the model Wald test is also illustrated.
LogiPwr.sas: computes the power of a Wald test in logistic regression for a given sample size as described in Example 7.9.
The Cornfield analysis in Example 7.10 is contained in Cornf.sas.
UlcrLogi.sas: fits the interaction model to the ulcer clinical trial data presented in Example 7.11.
RenalMod.sas: fits the logistic regression models with interactions for the renal data as shown in Example 7.12. The job Renal.sas must be run to set up the renal data before running this job.
RenalORs.sas: performed the computations of the odds ratio for a unit increase in HbA1c as a function of the level of systolic blood pressure presented in Figure 7.1 of Example 7.13.
Conditional Logistic Regression Models
HLBwt.sas: Low birthweight data from Hosmer and Lemeshow (1989) and the SAS program presented in Table 7.8 that generates the output presented in Table 7.9. See the documentation in DataSets.
Prostate.sas: includes the prostate cancer data set from Collett (1991) presented in Table 7.11. This data set was obtained (downloaded) from the SAS online data sets for the SAS book Logistic Regression Examples Using the SAS System, where it appears on p. 17-18.
MackLogi.sas: uses the Mack et al. data described in Breslow and Day (1980) from a matched case control study. The data set is provided by Norman Breslow on his web site at the University of Washington Department of Biostatistics. The data set downloaded from his site in the fall of 1999 is contained in bresmack.txt and the documentation provided by Dr. Breslow is provided in bresmack.text. This program computes additional variables used in Problem 7.18.
Chapter 8: Rates and Relative Risks
rate.sas: Two macros for the computation of rates and relative risks with an adjustment for over-dispersion. The macro %rate computes event rates and relative risks, with an adjustment for over-dispersion using the moment estimator for the dispersion variance. The macro %adjrate conducts a stratified analysis of relative risks over strata, computing the MVLE with the adjustment for over-dispersion. The latter is not illustrated in the text. This program must be submitted before calling the macros. Instructions for use of the macros are provided in this program.
rate0.sas: The macro %rate0 that computes the variance of the mean rate within each group under the null hypothesis with an adjustment for over-dispersion within each group, and the z test of the differences between rates.
Hyporrs.sas: The analysis of the DCCT rates of hypoglycemia as in Examples 8.1, 8.2 and 8.3 using the macros %rate and %rate0. The programs for these macros must be submitted before submitting this program, or %include statements must be added. This program also conducts a stratified-adjusted analysis, stratifying by adult versus adolescent, using the macro %adjrate. These results are not shown in text. The analysis uses the data set dccthypo.dat with the path /jml/biostatmethods/datasets/hypoglycemia/dccthypo.dat.
Note that the data in Table 8.1 were not used as the basis for the computations. Rather the event rate for each subject is computed in the data statement, as is the years of follow-up, so that the computations use greater precision for each number for each subject.
This hypoglycemia data set is a subset of the complete DCCT data set that is available from the National Technical Information Service (see DCCT link for Chapter 1 above).
Table81.sas: generates the data set displayed in Table 8.1 derived from the data set dccthypo.dat. Note that all computations are based on the latter data set, not the data summarized in Table81.dat. The latter is not provided as a data set.
RandRate.sas: is a supplemental program that uses a fixed point algorithm to compute convergent estimates of the over-dispersion parameters, the mean rates and their variances, both under the alternative hypothesis and also under the null hypothesis. The latter could be used as the basis for a Z-test of the difference between two groups. These computations are not described in the text, but follow from those described in Section 4.10.2. The program is applied to the DCCT data used in Example 8.3.
Poisson Regression Models
HypoPois.sas: fits the Poisson regression models using the SAS program shown in Table 8.2 that generates the output shown in Tables 8.3, 8.4 and 8.5. The output generated differs slightly from that shown in the tables. This program uses the class variable for treatment group defined using the categories "exp" (experimental) versus "std" (standard), older names for the treatment groups that are labeled as intensive and conventional respectively. Since "std" has the higher alphabetical order, this group is used as the reference category for the model effects shown in Tables 8.4 and 8.5. These group labels in the tables were later changed to intensive and conventional, respectively. Thus, I edited the program output and manually changed "Exp" to "Int" and "Std" to "Conv".
Note that if the statements in the program are changed such as (if grp=1 then group='Int ';) and (if grp=0 then group='Conv';) then in this case the "Int" group has the higher order and it, rather than "Conv", is used as the reference category. In this case, the sign of the group coefficient is opposite that shown in the tables.
Additional computations in the program compute the log likelihoods for the null, full and saturated models, deviances for the null and full model, and additional computations that provide the measures of association described in Example 8.5.
HypoRob.sas: fits the over-dispersed quasi-likelihood robust Poisson regression models described in Example 8.6, and the robust information sandwich analysis using PROC GENMOD that is described in Example 8.7. Additional computations using a PROC IML program, as in LogiRob.sas, provide the robust sandwich estimate plus robust inferences.
GailCnt.sas: This reads the data from Gail, Santner and Brown, 1980 that are used in Problem 8.4. In their paper, the actual times start at 60. The data here start with 60 as time zero. All animals were exposed for 122 days (182-60).
FrCh.sas: reads the data from Frome and Checkoway, 1985, that are used in Problem 8.6.
FHCnt.sas: reads the data file containing the detailed information on the recurrent infections from Fleming and Harrington (1991) and computes the numbers of events and exposure time for each individual. Additional computations can then be performed.
FHCnt.txt: Is a data file constructed from the output of the above program. This can be input or accessed directly using a filename and %include statements.
FHCntAn.sas: reads the file fhcnt.dat and defines additional variables that can be used for analyses of numbers of events using rates and Poisson models as requested in Problem 8.10.
Chapter 9: Survival Function and Tests of Significance
Survanal.sas is a set of survival analysis macros that provides the basic computations of survival probabilities and hazard rates in each of two groups, either for data in continuous time or using the actuarial method, with generalized Mantel-Haenszel or G-rho tests for differences between groups. Instructions for use are given in the program.
SWL-anal.sas uses the macro %survanal to perform survival analyses of the Lagakos squamous cell data in the complete cohort, and within subgroups defined by performance status. The analysis within the non-ambulatory subgroup is presented in Example 9.1 (Tables 9.1-3 and Figure 9.1) and Example 9.4.
DCCTneph.sas. This program uses the data set nephdata.dat with the macros provided in survanal.sas to perform the modified Kaplan-Meier lifetable analyses presented in Tables 9.5-5 of Example 9.2 and to compute the tests presented in Example 9.5. The data set contains additional variables not mentioned in the book and could be used for supplemental exercises.
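As a reminder of the basic computation underlying a survival analysis in continuous time, the following Python sketch computes the ordinary Kaplan-Meier estimate for a single group. It is not the survanal.sas macro, which is described as handling two groups, the actuarial method and tests between groups; the function name and the small data set are hypothetical.

```python
def kaplan_meier(times, events):
    """times: follow-up time for each subject; events: 1 = event observed,
    0 = right censored. Returns (time, survival) at each distinct event time.

    Subjects censored at a time t are counted as at risk for events at t.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = n_leaving = 0
        # Collect every subject whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events:
            survival *= 1 - n_events / at_risk
            curve.append((t, survival))
        at_risk -= n_leaving
    return curve

# Hypothetical data: four subjects, one censored at time 2.
print(kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1]))
```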
Proportional Hazards Model
SWL-PH.sas conducts the various proportional hazards regression analyses presented in Example 9.6, including those with covariate interactions, nested effects, and stratified analyses.
SWL-R2.sas computes the Schemper and Kent-O'Quigley R-square measures of explained variation in the proportional hazards regression analysis of the Lagakos data presented in Example 9.6.
SWL-IS.sas computes the robust information sandwich estimate of the covariance matrix of the coefficient estimates for the proportional hazards regression analysis of the Lagakos data presented in Example 9.6.
SWL-Gof.sas assesses the PH model assumptions using tests of interaction with time, plots of the log(-log(survival)) function within strata, and the Lin (1991) test of the PH assumption using the SAS macro %gofcox. This performs the computations described in Example 9.8 and generates the functions plotted in Figure 9.3.
GofCox.sas is a macro written by Dr. Oliver Bautista to evaluate the proportional hazards assumption for each covariate in the model using the method of Lin (1991). Instructions for use of the macro are provided in the program as comments.
NephA1c.sas fits the PH model for the effect of the current annual mean HbA1c on the risk of developing microalbuminuria in the conventional treatment group of the secondary cohort of the DCCT. This was used to generate the results presented in Example 9.9.
This program uses the data set dcctneph in the datasets directory. The analyses presented in the text used the raw SAS DCCT data file, rather than the "flat" file dcctneph. Thus the numerical results obtained using these data are slightly different from those presented in the text.
Note that the data set also contains additional variables not employed in the text that could be used for supplemental exercises.
Sample Size and Power
exsslogl.sas determines the sample size to provide a desired level of power for a test of the equality of the survival distributions for two groups under an exponential model assuming uniform entry over the period (0,R) and continued follow-up up to T>R years in all subjects. This program uses the log hazard ratio as the basis for the test, rather than difference in hazards. The power function is presented in equation (2.4) of Lachin and Foulkes (1986). The program also provides for losses to follow-up that are also exponentially distributed, using (4.1) of Lachin and Foulkes. This is used for computations presented in Example 9.10.
expwlogl.sas is the companion program to exsslogl.sas that determines the level of power provided by a specified sample size for a test of the equality of the survival distributions for two groups under an exponential model assuming uniform entry over the period (0,R) and continued follow-up up to T>R years in all subjects. This program uses the log hazard ratio as the basis for the test, rather than difference in hazards. The power function is presented in equation (2.4) of Lachin and Foulkes (1986). The program also provides for losses to follow-up that are also exponentially distributed, using (4.1) of Lachin and Foulkes. This is used for computations presented in Example 9.10. Note that the power values stated in the book for a 40% and 33% reduction are those with no losses to follow-up. The power with losses is slightly less than that stated.
The supplemental programs include the program expwplot.sas that plots the power function over a range of relative risks for a given sample size, for a test using the log hazard ratio. The programs exsndiff.sas and expwdiff.sas are comparable to the above programs except that the power function is based on the test of the difference in hazards, rather than the test of the log hazard ratio. These supplemental programs provide power estimates slightly less, and sample size estimates slightly larger, than those provided by the above programs. Two additional programs ssnopdok.sas and pwropdok.sas provide the determination of sample size and power, respectively, for a stratified analysis of differences in hazards as described by Lachin and Foulkes (1986).
KernelSm.sas contains the macro smooth that computes the kernel smoothed estimate of a hazard function or intensity function for a counting process based on possibly recurrent event time data. The macro uses the Epanechnikov kernel. The estimate, its variance and the kernel are presented in equations (9.137)-(9.139). The program allows for computations for two groups of subjects. The usage is described in the macro; see hypokrnl.sas as an example. Macro variables specify the band width, the number of distinct event times and the axis specifications for the plots. The input data must consist of a single observation with 5 arrays of variables containing the successive time values, numbers of events and numbers at risk in each group at each time. The input data set may be constructed using the program hypotime.sas.
hypotime.sas is the program used to generate the data set hytimes that contains a single observation with the numbers at risk and numbers of events at each distinct event time. This data set is then used with the macro smooth in the program kernelsm.sas to generate kernel smoothed estimates of the intensity function over time, or the macro AGtests in AGtests.sas to compute the estimates of the cumulative intensity functions and the Aalen-Gill tests for recurrent event processes. This program uses the input data set hyevents that has one observation for each event for each subject, or a single observation if no events.
AGtimes.sas is a more general macro that can be used to generate the appropriate data set with the arrays for event times, numbers of events and numbers at risk from any data set with recurrent events. In FG-CGDtm.sas, this program is applied to the Fleming and Harrington (1991) CGD data described in Problem 9.18. The macro requires that the data set contain a patient id variable, a group variable (1=experimental, 2=control), both of which are specified as macro variables, and additional variables to represent an event time (etime), the maximum follow-up time (ftime), and an indicator delta to indicate whether an event was observed at the current time (1) or the observation is right censored (0) at that time.
hypokrnl.sas. This program uses the macro smooth in kernelsm.sas to compute the kernel smoothed estimate of the intensity function for the recurrent hypoglycemia counting process described in Example 9.11 and presented in Figure 9.4.
AGtests.sas. This program contains the macro AGtests that computes the estimated cumulative intensity for possibly recurrent event times in two groups as presented in (9.132). These are plotted. The program then computes the Aalen-Gill test statistics for possibly recurrent event times with an allowance for ties, as in (9.149)-(9.150). The logrank and Gehan-Wilcoxon tests are presented. The usage is described in the macro; see hypotest.sas as an example. A macro variable specifies the number of distinct event times. The input data must consist of a single observation with 5 arrays of variables containing the successive time values, numbers of events and numbers at risk in each group at each time. The input data set may be constructed using the program hypotime.sas.
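The two steps can be illustrated with a short Python sketch, which is not the macro AGtests. The cumulative intensity is taken to be the running sum of d_j/n_j. The test is a weighted comparison of observed and expected events in group 1, with weight 1 for the logrank form and weight n_j for the Gehan-Wilcoxon form. The variance used here is the simple martingale form, the sum of w_j^2 d_j n_1j n_2j / n_j^2; this is an assumption, and the allowance for ties in (9.149)-(9.150) is not reproduced. All names and data are hypothetical.

```python
from math import sqrt


def cumulative_intensity(events, at_risk):
    """Running sum of d_j / n_j over the ordered distinct event times."""
    total = 0.0
    out = []
    for d, n in zip(events, at_risk):
        if n > 0:
            total += d / n
        out.append(total)
    return out


def weighted_two_group_z(x1, y1, x2, y2, weight="logrank"):
    """Z statistic comparing event intensities in two groups.

    x1, x2: events per group at each time; y1, y2: numbers at risk.
    weight: "logrank" (w_j = 1) or "gehan" (w_j = n_j).
    """
    num = 0.0
    var = 0.0
    for d1, n1, d2, n2 in zip(x1, y1, x2, y2):
        if n1 == 0 or n2 == 0:
            continue  # no comparison is possible at this time
        n = n1 + n2
        d = d1 + d2
        w = 1.0 if weight == "logrank" else float(n)
        num += w * (d1 - d * n1 / n)           # observed minus expected
        var += w * w * d * n1 * n2 / (n * n)   # assumed variance, no tie term
    if var == 0.0:
        raise ValueError("no informative event times")
    return num / sqrt(var)


# Hypothetical counts at three distinct event times.
x1, y1 = [1, 0, 2], [10, 9, 9]
x2, y2 = [2, 1, 1], [10, 8, 7]
print(cumulative_intensity(x1, y1))
print(weighted_two_group_z(x1, y1, x2, y2, weight="logrank"))
print(weighted_two_group_z(x1, y1, x2, y2, weight="gehan"))
```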
hypotest.sas. This program uses the macro AGtests in AGtests.sas to compute the cumulative intensities and the Aalen-Gill tests for the recurrent hypoglycemia counting process described in Example 9.11.
hypomim.sas fits the multiplicative intensity model to the recurrent hypoglycemia events in the intensive group patients of the secondary cohort of the DCCT as described in Example 9.12. The data set hypomimi in the datasets directory contains the data from the intensive group subjects.
The data set hypomimc contains the data from the conventional group subjects. The latter may be used for additional problems. Both data sets contain a number of additional baseline covariates for each subject.
Table9-8.sas reads the data in Table 9.8 in a format suitable for use with the other macros provided or with SAS procedures.
Prentice.sas reads the data set in Table 9.9 from Prentice (1973) in a format suitable for use with the other macros provided or with SAS procedures.
VACURG85.sas reads the data from the VA Cooperative Urology Research Group study of prostate cancer described by Byar (1985) that are used in Problem 9.17. The variables described in Problem 9.17 are computed from the raw data file VACURG85.dat that is provided in the datasets directory. The data are also available on StatLib as Table 46 of the book edited by Andrews and Herzberg.
FH-CGDph.sas reads the data from Fleming and Harrington (1991) with the times of recurrent infections among children in a clinical trial of interferon versus placebo. This uses the data set FH-cgd.dat that is suitable for use with PHREG to fit multiplicative intensity (proportional intensity) models.
FH-CGDtm.sas uses the macro %timeset in AGtimes.sas to generate the data set CGDtimes in a format required to compute the kernel smoothed intensity estimates and the Aalen-Gill tests. This data set may then be used with the other macros and programs to perform additional analyses as specified in Problem 9.18.
FH-CGDag.sas reads the data from Fleming and Harrington (1991) with the times of recurrent infections among children in a clinical trial of interferon versus placebo. This uses the data set cgdtimes that is suitable for use with the macro AGtests to compute Aalen-Gill test statistics for recurrent events.
Mack, et al. (Breslow-Day)
BresMack.text: is the documentation to Appendix III of Breslow and Day (1980) that presents data from the matched case-control study of endometrial cancer described in Mack et al. (1976). This file was downloaded in the fall of 1999 from Norman Breslow's web site at the Department of Biostatistics of the University of Washington. It describes the variables and corrections to the descriptions in Breslow and Day.
BresMack.txt: is the data set downloaded from Norman Breslow's web site. The programs described in Chapters 5 and 7 use this data set.
Hosmer-Lemeshow Low Birthweight Data
HosLem.sas: generates the data set used as the basis for the analyses in the text, Tables 7.8 and 7.9. The original source of the data was a table in the SAS Technical Report P-229, p. 465-6. The data were scanned and used in this program to generate the data set HLData.dat that is used in the program HLBwt.sas for the analyses shown in Chapter 7. Because the data were scanned from a secondary source, the data and analyses may differ from those shown by Hosmer and Lemeshow.
dccthypo.txt: is a flat (text) file used in Examples 8.1-8.3. This is used as the basis for the data presented in Table 8.1; see the program Table81.sas in Chapter 8.
Fleming-Harrington CGD Data
FH-CGD.txt. Fleming and Harrington (1991) present the data from a clinical trial of gamma interferon versus placebo in the treatment of children with chronic granulomatous disease (CGD) to reduce the incidence of recurrent pyogenic infections. The data set includes multiple records for each subject to record the time of each successive infection or the date of right censoring. The variables include
id the patient ID
RDT the date of randomization into the study, in mmddyy format
IDT either the date of onset of a serious infection, or the date follow-up ended
Z1 treatment group: interferon (1) versus placebo (2)
Z2 Inheritance pattern: X-linked (1) versus autosomal recessive (2)
Z3 Age (years)
Z4 Height (cm)
Z5 Weight (kg)
Z6 Corticosteroid use on entry: yes (1) versus no (2)
Z7 Antibiotic use on entry: yes (1) versus no (2)
Z8 Gender: male (1) versus female (2)
Z9 Type of hospital: NIH (1), other US (2), Amsterdam (3), other European (4)
T1 elapsed time from randomization to the value of IDT in the current observation, i.e. the time to an infection or the number of days of follow-up (IDT-RDT)
T2 the start time for the current interval of risk, either 0 for the first record of each subject, or the time IDT+1 from the previous infection time for that subject
d indicator (1) for an infection at the date IDT or (2) for censoring at that time (end of follow-up)
S the sequence number for the current infection (if any) for this subject. This is used for analyses of recurrent events in Chapter 9.
Note: FHcnt.txt is a data set created by FHcnt.sas that has one record per subject containing the additional variables nevents, the number of severe infections experienced, and futime, the number of days of follow-up. This is used for analyses using Poisson regression in Chapter 8.
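The reduction from multiple records per subject to one record per subject can be sketched in Python as below. This is not FHcnt.sas. It assumes that nevents is the count of records with d = 1 and that futime is the largest value of T1 among a subject's records; how FHcnt.sas actually derives these is not described here. The function name and the records are hypothetical.

```python
def per_subject_summary(records):
    """Collapse multi-record data to one summary per subject.

    records: iterable of (id, t1, d) tuples, where t1 is the elapsed time
    from randomization and d is 1 for an infection or 2 for censoring.
    Returns {id: {"nevents": ..., "futime": ...}}.
    """
    summary = {}
    for pid, t1, d in records:
        row = summary.setdefault(pid, {"nevents": 0, "futime": 0})
        if d == 1:
            row["nevents"] += 1
        # Assumption: follow-up ends at the largest recorded time.
        row["futime"] = max(row["futime"], t1)
    return summary


# Hypothetical records: subject 1 has two infections, subject 2 has none.
records = [
    (1, 100, 1),
    (1, 250, 1),
    (1, 300, 2),
    (2, 400, 2),
]
for pid, row in per_subject_summary(records).items():
    print(pid, row["nevents"], row["futime"])
```

In a Poisson regression of the counts, the logarithm of futime would typically enter as an offset.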
Lagakos Squamous Cell Carcinoma
Lagakos.sas reads the data from Lagakos (1978) and creates a SAS data set that is used for the analyses in Chapter 9. This job should be run on your platform to create the SAS data set. The data set was originally used by Lagakos to describe an approach to the analysis of competing risks, there being two modes or causes of failure (spread of disease) - metastatic versus not. For the analyses herein, however, a single outcome is employed - spread of disease of any cause.
DCCT Nephropathy (Microalbuminuria) Data
nephdata.txt contains data related to the onset of microalbuminuria in the DCCT. These data are used for simple survival analyses as presented in Example 9.2. The data set, however, contains additional variables that could be used for supplemental exercises. See DCCTneph.sas. The variables in the data set are
Patient ID number (a dummy number to mask the patient's identity)
int for intensive (1) versus conventional (0) treatment
primary for primary prevention cohort (1) versus secondary intervention cohort
etdpatb the baseline ETDRS grade of retinopathy severity (see DCCT, 1995)
neur for neuropathy present on entry (1) versus not (0)
aer0 the entry level of albumin excretion rate (mg/24 h)
neph2flg the indicator for the development of microalbuminuria during the study (1) versus censored
neph2vis the quarterly visit number at which microalbuminuria was first observed or the last observation visit
duration the months duration of diabetes on entry
female (1) versus male (0)
age in years
adult (1) (>17 years of age) versus adolescent (0)
bcval5 the entry level of stimulated C-peptide, a measure of residual endogenous insulin secretory function
hbael the baseline level of HbA1c
bmi a measure of obesity calculated as weight/(height**2), and the array of variables
mhba1-mhba9 that represent the current mean HbA1c over the period since randomization up to the current annual visit (1-9).
DCCT Hypoglycemia Recurrent Event Data
Due to their size, the four data sets in this section are provided as a single SAS export file. You can download this file as an uncompressed file (17.6 Mbytes), as a gzip-compressed file (952 Kbytes), or as a zip-compressed file (938 Kbytes). Please run the program impthypo.sas to generate the following SAS data sets on your platform.
hyevents contains one record per hypoglycemia event for each subject. The variables are
ETIME the day number since randomization when an event occurred; missing if no event in this observation
EVENT an indicator for whether an event occurred at this time (1=yes, .=no)
EVENTDAY the calendar date of the event in MMDDYY8. format
EVNUM the cumulative event number since randomization
FTIME the total follow-up time of the subject
INTGROUP an indicator for intensive (1) versus conventional (2) treatment group
NEVENTS the total number of events for this subject
PATIENT the patient ID number (masked)
RANDSAS the calendar date of randomization into the study in MMDDYY6. format.
hytimes contains a single observation with the variable MAXJ and six sets of array variables:
MAXJ is the number of elements in the array that equals the total number of distinct event times in the data set (1565 in this case)
T1-T1565 are the times at which events occurred
XE1-XE1565 are the numbers of events in the intensive (experimental) group at each time
XC1-XC1565 are the numbers of events in the conventional group at each time
YE1-YE1565 are the numbers of subjects at risk in the intensive (experimental) group at each time
YC1-YC1565 are the numbers of subjects at risk in the conventional group at each time
Y1-Y1565 are the total numbers of subjects at risk in both groups at each time.
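For readers who wish to inspect this wide layout outside SAS, the following Python sketch unpacks such a single observation into one row per distinct event time. It assumes the observation has already been read into a dictionary keyed by the variable names above; the function name and the small example (MAXJ = 2 rather than 1565) are hypothetical.

```python
def wide_to_long(obs):
    """Unpack one wide observation into a list of per-time rows.

    obs: dict with key MAXJ and keys T1..T{MAXJ}, XE1.., XC1.., YE1..,
    YC1.. and Y1.. as described for the single-observation data set.
    """
    rows = []
    for j in range(1, int(obs["MAXJ"]) + 1):
        rows.append({
            "t": obs[f"T{j}"],    # event time
            "xe": obs[f"XE{j}"],  # events, intensive group
            "xc": obs[f"XC{j}"],  # events, conventional group
            "ye": obs[f"YE{j}"],  # at risk, intensive group
            "yc": obs[f"YC{j}"],  # at risk, conventional group
            "y": obs[f"Y{j}"],    # at risk, both groups
        })
    return rows


# Hypothetical miniature observation with two distinct event times.
obs = {
    "MAXJ": 2,
    "T1": 3, "T2": 7,
    "XE1": 1, "XE2": 0,
    "XC1": 0, "XC2": 2,
    "YE1": 5, "YE2": 4,
    "YC1": 6, "YC2": 6,
    "Y1": 11, "Y2": 10,
}
rows = wide_to_long(obs)
# Since Y is described as the total at risk in both groups, the two group
# counts should add up to it at every time.
print(all(r["y"] == r["ye"] + r["yc"] for r in rows))
```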
hypomimi contains DCCT intensive group recurrent hypoglycemia event observations with time dependent covariate data as described in Example 9.12. Each observation is defined in terms of start and stop times, the associated time dependent covariate (mhba) and the number of events at the stop time, if any. The covariates in the data set are
ADULT Adult >=18 (0=no/1=yes)
AER00 Albumin Excretion Rate (mg/24hr) at Baseline
AGE Age at entry
BMI Body Mass Index (kg/m**2)
CALORIES Calories (kcal) per day
CPEPTIDE Stimulated C-Peptide(pmol/ml)
DURATION Duration of IDDM (months) at Baseline
EDUCAT Mean Education (Years) - Form 013
FAMIDDM Family History of IDDM (0=no/1=yes)
FEMALE Female (0=no/1=yes)
FULLIQ Full Scale IQ
GROUP treatment group 'EXPERIMENTAL' for intensive or 'STANDARD' for conventional
HBAEL HbA1c at Eligibility
HDL HDL Cholesterol (serum,mg/dl)
HYPOFLG 1 if had a hypoglycemia event at this time
INSULIN Total Insulin Dosage Units/Weight (kg)
LAER00 Log of Baseline AER
LDL LDL Cholesterol (serum,mg/dl)
LHBA1C time dependent Log of the current mean HbA1c since randomization
LHBAEL Log of HbA1c at Eligibility
MARRIED Marital Status (0=NOT Married,1=Married)
MBP Mean Arterial Pressure
NEUR0FLG Clinical Neuropathy at Baseline (0,1)
NPRIOR time dependent cumulative number of hypoglycemia events since randomization prior to the current interval
PATIENT Patient ID number (masked)
PHASE2 Randomization in phase 2 (1) versus phase 3 (0)
PRIMARY Base retinopathy strata (0=Scnd, 1=Prim) for primary versus secondary cohort
PROTEIN Dietary Protein (gm)
RET20FLG Baseline ETDRS 20/20 (0=No,1=Yes)
RET35FLG Baseline ETDRS 35/<=35 (0=No,1=Yes)
RET43FLG Baseline ETDRS 43/<43 + (0=No,1=Yes)
RETBASE Baseline Retinopathy Strata 'PRIM' for primary or 'SCND' for secondary
SMOKER Smoking Status at Baseline (0=No,1=Yes)
STARTS Start of Interval (in study time)
STOPS End of Interval (in study time)
TRG Triglycerides (serum,mg/dl)
WPMEAN Within-Profile Mean Blood Glucose(mg/dl)
All covariates are DCCT baseline values except lhba1c and nprior, which are time dependent covariates.
hypomimc contains DCCT conventional group recurrent hypoglycemia event observations with time dependent covariate data as described in Example 9.12. Each observation is defined in terms of start and stop times, the associated time dependent covariate (mhba) and the number of events at the stop time, if any. The variables in the data set are the same as those described above.
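The start and stop layout can be illustrated with a simplified Python sketch for a single subject. It splits the follow-up period only at the event times and carries the number of prior events forward; how the actual data sets define their intervals (for example, whether they are also split where LHBA1C is updated) is not described here, so this is an assumption. The keys mirror the variable names STARTS, STOPS, HYPOFLG and NPRIOR, while the function name and the data are hypothetical.

```python
def counting_process_rows(event_times, followup):
    """Split (0, followup] into risk intervals ending at each event.

    event_times: distinct event times for one subject (study time).
    followup: total follow-up time for the subject.
    Returns a list of dicts with starts, stops, hypoflg and nprior.
    """
    rows = []
    start = 0
    nprior = 0
    for t in sorted(event_times):
        rows.append(
            {"starts": start, "stops": t, "hypoflg": 1, "nprior": nprior}
        )
        start = t
        nprior += 1  # this event counts as prior for later intervals
    if start < followup:
        # Final interval without an event, ending at the close of follow-up.
        rows.append(
            {"starts": start, "stops": followup, "hypoflg": 0, "nprior": nprior}
        )
    return rows


# Hypothetical subject with events on days 30 and 100, followed for 365 days.
for row in counting_process_rows([30, 100], 365):
    print(row)
```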
Veterans Administration Cooperative Urological Research Group
VACURG85.txt presents the data from the VACURG study of prostate cancer described by Byar in the book edited by Andrews and Herzberg (1985) which gives the variable descriptions. These data have been used by many, including Thall and Lachin (1986). The data are also available from StatLib in a slightly different format as Table46.dat of the file Andrews. The variables included are
patid patient number
rx treatment group 1=placebo, 2=0.2 mg. estrogen, 3=1 mg., 4=5 mg.
mosfu months of follow-up
age in years, with 89 denoting any age above 88
pf performance status 0=normal, 1=<50% time in bed, 2=(50-<100%) time, 3=confined to bed
sbp systolic blood pressure in mm Hg divided by 10 (e.g. 118 recorded as 12)
ekg EKG 0=normal, 1=benign, 2=rhythmic disturbances and electrolyte changes, 3=heart blocks or conduction defects, 4=heart strain, 5=old myocardial infarct (MI), 6=recent MI
sz tumor size in cm**2 (0=none palpable)
stage tumor stage
startm startd starty randomization (m d y)
ap alkaline phosphatase in King-Armstrong units *10, a measure of liver function
status survival status 0=alive, 1=dead from prostate cancer, 2=dead from heart/vascular disease, 3=dead from cerebrovascular disease, 4=dead from pulmonary embolus, 5=dead from other cancer, 6=dead from respiratory disease, 7=dead from other specific non-cancer cause, 8=dead from unspecified non-cancer cause, 9=dead from unknown cause
wt weight index = weight (kg) - height (cm) + 200
hx history of cardiovascular disease 0=no, 1=yes
dbp diastolic BP/10
hg serum hemoglobin in g/100 ml *10
sg combined index of tumor stage and histology grade
bm bone metastases 0=no, 1=yes