Data Sets

Mack, et al. (Breslow-Day) 

BresMack.text : is the documentation to Appendix III of Breslow and Day (1980) that presents data from the matched case-control study of endometrial cancer described in Mack et al. (1976). This file was downloaded in the fall of 1999 from Norman Breslow's web site at the Department of Biostatistics of the University of Washington. It describes the variables and corrections to the descriptions in Breslow and Day. 

BresMack.txt : is the data set downloaded from Norman Breslow's web site. The programs described in Chapters 5 and 7 use this data set. 

Hosmer-Lemeshow Low Birthweight Data : generates the data set used as the basis for the analyses in the text, Tables 7.8 and 7.9. The original source of the data was a table in the SAS Technical Report P-229, pp. 465-466. The data were scanned and used in this program to generate the data set HLData.dat that is used in the program for the analyses shown in Chapter 7. Because the data were scanned from a secondary source, the data and analyses may differ from those shown by Hosmer and Lemeshow. 

DCCT Hypoglycemia 

dccthypo.txt : is a flat (text) file used in Examples 8.1-8.3. It is the basis for the data presented in Table 8.1; see the program in Chapter 8. 

Fleming-Harrington CGD Data 

FH-CGD.txt : Fleming and Harrington (1991) present the data from a clinical trial of gamma interferon versus placebo in the treatment of children with chronic granulomatous disease (CGD) to reduce the incidence of recurrent pyogenic infections. The data set includes multiple records for each subject to record the time of each successive infection or the date of right censoring. The variables include

id the patient ID
RDT the date of randomization into the study, in mmddyy format
IDT either the date of onset of a serious infection, or the date follow-up ended
Z1 treatment group: interferon (1) versus placebo (2)
Z2 Inheritance pattern: X-linked (1) versus autosomal recessive (2)
Z3 Age (years)
Z4 Height (cm)
Z5 Weight (kg)
Z6 Corticosteroid use on entry: yes (1) versus no (2)
Z7 Antibiotic use on entry: yes (1) versus no (2)
Z8 Gender: male (1) versus female (2)
Z9 Type of hospital: NIH (1), other US (2), Amsterdam (3), other European (4)
T1 elapsed time from randomization to the value of IDT in the current observation, i.e. the time to an infection or the number of days of follow-up (IDT-RDT)
T2 the start time for the current interval of risk, either 0 for the first record of each subject, or the time IDT+1 from the previous infection time for that subject
d indicator (1) for an infection at the date IDT or (2) for censoring at that time (end of follow-up)
S the sequence number for the current infection (if any) for this subject. This is used for analyses of recurrent events in Chapter 9.
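
The elapsed time T1 is described as the difference IDT-RDT in days. A minimal Python sketch of that calculation follows. It assumes both dates are held as six-digit mmddyy strings (the description states this format only for RDT), and the dates shown are made up for illustration, not taken from FH-CGD.txt.

```python
from datetime import datetime

def days_between(rdt, idt):
    """Return the number of days from randomization (rdt) to idt.

    Both arguments are assumed to be six-digit mmddyy strings.
    """
    start = datetime.strptime(rdt, "%m%d%y")
    stop = datetime.strptime(idt, "%m%d%y")
    return (stop - start).days

# Hypothetical dates, not values from the data set
t1 = days_between("010189", "031589")
print(t1)
```

If the dates are read as numbers, leading zeros may be lost, so the values would need to be zero-padded to six digits before parsing.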


Note:  FHcnt.txt  is a data set created by  that has one record per subject containing the additional variables nevents, the number of severe infections experienced, and futime, the number of days of follow-up. It is used for the Poisson regression analyses in Chapter 8. 
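
As a rough illustration of how a one-record-per-subject summary such as nevents and futime can be derived from the multiple-record layout, here is a Python sketch. It is not the program that produced FHcnt.txt; it assumes that futime is the largest T1 recorded for a subject and that nevents counts the records with d equal to 1, and the sample records are invented.

```python
from collections import defaultdict

def summarize_subjects(records):
    """Collapse multi-record data to one row per subject.

    Each record is a dict with keys 'id', 'T1' (days since randomization)
    and 'd' (1 = infection, 2 = censored), as in the variable list above.
    """
    nevents = defaultdict(int)
    futime = defaultdict(int)
    for rec in records:
        sid = rec["id"]
        nevents[sid] += 1 if rec["d"] == 1 else 0
        # Assumption: total follow-up is the last (largest) T1 for the subject
        futime[sid] = max(futime[sid], rec["T1"])
    return [
        {"id": sid, "nevents": nevents[sid], "futime": futime[sid]}
        for sid in sorted(futime)
    ]

# Invented records: subject 1 has two infections, subject 2 has none
records = [
    {"id": 1, "T1": 100, "d": 1},
    {"id": 1, "T1": 250, "d": 1},
    {"id": 1, "T1": 400, "d": 2},
    {"id": 2, "T1": 300, "d": 2},
]
summary = summarize_subjects(records)
```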

Lagakos Squamous Cell Carcinoma  reads the data from Lagakos (1978) and creates a SAS data set that is used for the analyses in Chapter 9. This job should be run on your platform to create the SAS data set. The data set was originally used by Lagakos to describe an approach to the analysis of competing risks, there being two modes or causes of failure (spread of disease) - metastatic versus not. For the analyses herein, however, a single outcome is employed - spread of disease of any cause. 

DCCT Nephropathy (Microalbuminuria) Data 

nephdata.txt  contains data related to the onset of microalbuminuria in the DCCT. These data are used for simple survival analyses as presented in Example 9.2. The data set, however, contains additional variables that could be used for supplemental exercises. See The variables in the data set are

Patient ID number (a dummy number to mask the patient's identity)
int for intensive (1) versus conventional (0) treatment
primary for primary prevention cohort (1) versus secondary intervention cohort
etdpatb the baseline ETDRS grade of retinopathy severity (see DCCT, 1995)
neur for neuropathy present on entry (1) versus not (0)
aer0 the entry level of albumin excretion rate (mg/24 h)
neph2flg the indicator for the development of microalbuminuria during the study (1) versus censored
neph2vis the quarterly visit number at which microalbuminuria was first observed or the last observation visit
duration the duration of diabetes on entry, in months
female (1) versus male (0)
age in years
adult (1) (>17 years of age) versus adolescent (0)
bcval5 the entry level of stimulated C-peptide, a measure of residual endogenous insulin secretory function
hbael the baseline level of HbA1c
bmi a measure of obesity calculated as weight/(height**2), and the array of variables
mhba1-mhba9 that represent the current mean HbA1c over the period since randomization up to the current annual visit (1-9).
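
To make the idea of a "current mean" concrete, the following Python sketch computes a running mean over a sequence of HbA1c values. It is only an illustration: it assumes a simple unweighted mean of one value per annual visit, which may not match how mhba1-mhba9 were actually derived (for example, which measurements enter the mean), and the values are invented.

```python
def running_means(values):
    """Return the cumulative mean after each successive value."""
    means = []
    total = 0.0
    for k, value in enumerate(values, start=1):
        total += value
        means.append(total / k)
    return means

# Invented HbA1c values at annual visits 1-3
hba1c = [8.0, 7.0, 9.0]
mhba = running_means(hba1c)
print(mhba)  # [8.0, 7.5, 8.0]
```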

DCCT Hypoglycemia Recurrent Event Data 

Due to their size, the four data sets in this section are provided as a single SAS export file. You can download this file as an  uncompressed file  (17.6 Mbytes), as a  gzip-compressed file  (952 Kbytes), or as a  zip-compressed file  (938 Kbytes). Please run the program  to generate the following SAS data sets on your platform. 

Dataset hyevents 

Contains one record per hypoglycemia event for each subject. The variables are

ETIME the day number since randomization when an event occurred; missing if no event in this observation
EVENT an indicator for whether an event occurred at this time (1=yes, .=no)
EVENTDAY the calendar date of the event in MMDDYY8. format
EVNUM the cumulative event number since randomization
FTIME the total follow-up time of the subject
INTGROUP an indicator for intensive (1) versus conventional (2) treatment group
NEVENTS the total number of events for this subject
PATIENT the patient ID number (masked)
RANDSAS the calendar date of randomization into the study in MMDDYY6. format

Dataset hytimes 

This data set contains a single observation with six sets of array variables:

MAXJ is the number of elements in the array that equals the total number of distinct event times in the data set (1565 in this case)
T1-T1565 are the times at which events occurred
XE1-XE1565 are the numbers of events in the intensive (experimental) group at each time
XC1-XC1565 are the numbers of events in the conventional group at each time
YE1-YE1565 are the numbers of subjects at risk in the intensive (experimental) group at each time
YC1-YC1565 are the numbers of subjects at risk in the conventional group at each time
Y1-Y1565 are the total numbers of subjects at risk in both groups at each time.
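
The relationship between these arrays can be sketched in Python as follows. This is an illustration only, not the program that built hytimes. It assumes that a subject counts as at risk at time t whenever the subject's total follow-up time is at least t, and that subjects stay at risk after an event because events are recurrent. The input layout and the four subjects are invented.

```python
def risk_set_arrays(subjects):
    """Build event-count and at-risk arrays over the distinct event times.

    subjects: list of dicts with 'group' ('E' = intensive, 'C' = conventional),
    'ftime' (total follow-up time) and 'etimes' (list of event times).
    """
    times = sorted({t for s in subjects for t in s["etimes"]})
    out = {"MAXJ": len(times), "T": times,
           "XE": [], "XC": [], "YE": [], "YC": [], "Y": []}
    for t in times:
        xe = sum(s["etimes"].count(t) for s in subjects if s["group"] == "E")
        xc = sum(s["etimes"].count(t) for s in subjects if s["group"] == "C")
        # Assumption: at risk at t if still under follow-up at t
        ye = sum(1 for s in subjects if s["group"] == "E" and s["ftime"] >= t)
        yc = sum(1 for s in subjects if s["group"] == "C" and s["ftime"] >= t)
        out["XE"].append(xe)
        out["XC"].append(xc)
        out["YE"].append(ye)
        out["YC"].append(yc)
        out["Y"].append(ye + yc)
    return out

# Invented subjects
subjects = [
    {"group": "E", "ftime": 10, "etimes": [3, 7]},
    {"group": "E", "ftime": 5, "etimes": []},
    {"group": "C", "ftime": 8, "etimes": [3]},
    {"group": "C", "ftime": 6, "etimes": [6]},
]
arrays = risk_set_arrays(subjects)
```

With these four subjects the distinct event times are 3, 6 and 7, so MAXJ is 3 and each array has three elements.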

Dataset hypomimi 

Contains DCCT intensive group recurrent hypoglycemia event observations with time dependent covariate data as described in Example 9.12. Each observation is defined in terms of start and stop times, the associated time dependent covariate (mhba) and the number of events at the stop time, if any. The covariates in the data set are

ADULT Adult >=18 (0=no/1=yes)
AER00 Albumin Excretion Rate (mg/24hr) at Baseline
AGE Age at entry
BMI Body Mass Index (kg/m**2)
CALORIES Calories (kcal) per day
CPEPTIDE Stimulated C-Peptide(pmol/ml)
DURATION Duration of IDDM (months) at Baseline
EDUCAT Mean Education (Years) - Form 013
FAMIDDM Family History of IDDM (0=no/1=yes)
FEMALE Female (0=no/1=yes)
FULLIQ Full Scale IQ
GROUP treatment group 'EXPERIMENTAL' for intensive or 'STANDARD' for conventional
HBAEL HbA1c at Eligibility
HDL HDL Cholesterol (serum,mg/dl)
HYPOFLG 1 if had a hypoglycemia event at this time
INSULIN Total Insulin Dosage Units/Weight (kg)
LAER00 Log of Baseline AER
LDL LDL Cholesterol (serum,mg/dl)
LHBA1C time dependent Log of the current mean HbA1c since randomization
LHBAEL Log of HbA1c at Eligibility
MARRIED Marital Status (0=NOT Married,1=Married)
MBP Mean Arterial Pressure
NEUR0FLG Clinical Neuropathy at Baseline (0,1)
NPRIOR time dependent cumulative number of hypoglycemia events since randomization prior to the current interval
PATIENT Patient ID number (masked)
PHASE2 Randomization in phase 2 (1) versus phase 3 (0)
PRIMARY Base retinopathy strata (0=Scnd, 1=Prim) for primary versus secondary cohort
PROTEIN Dietary Protein (gm)
RET20FLG Baseline ETDRS 20/20 (0=No,1=Yes)
RET35FLG Baseline ETDRS 35/<=35 (0=No,1=Yes)
RET43FLG Baseline ETDRS 43/<43 + (0=No,1=Yes)
RETBASE Baseline Retinopathy Strata 'PRIM' for primary or 'SCND' for secondary
SMOKER Smoking Status at Baseline (0=No,1=Yes)
STARTS Start of Interval (in study time)
STOPS End of Interval (in study time)
TRG Triglycerides (serum,mg/dl)
WPMEAN Within-Profile Mean Blood Glucose(mg/dl)

All covariates are DCCT baseline values except lhba1c and nprior, which are time dependent covariates.

Dataset hypomimc 

Contains DCCT conventional group recurrent hypoglycemia event observations with time dependent covariate data as described in Example 9.12. Each observation is defined in terms of start and stop times, the associated time dependent covariate (mhba) and the number of events at the stop time, if any. The variables in the data set are the same as those described above. 
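
The start/stop layout used by hypomimi and hypomimc can be illustrated with a short Python sketch. It is a simplification under stated assumptions, not the code that produced these data sets: intervals are split only at event times (the real data may also be split where the time dependent covariate is updated), event times within a subject are assumed distinct, HYPOFLG is set to 0 for an interval that ends without an event, and the example subject is invented.

```python
def counting_process_rows(patient, event_times, ftime):
    """Split one subject's follow-up into (STARTS, STOPS] intervals.

    Each interval ends at an event or at the end of follow-up, and carries
    NPRIOR, the number of events before the interval began.
    """
    rows = []
    start = 0
    nprior = 0
    for t in sorted(event_times):
        rows.append({"PATIENT": patient, "STARTS": start, "STOPS": t,
                     "HYPOFLG": 1, "NPRIOR": nprior})
        start = t
        nprior += 1
    if start < ftime:
        # Final interval ends at the close of follow-up without an event
        rows.append({"PATIENT": patient, "STARTS": start, "STOPS": ftime,
                     "HYPOFLG": 0, "NPRIOR": nprior})
    return rows

# Invented subject: events on days 30 and 90, followed for 120 days
rows = counting_process_rows(7, [30, 90], 120)
for row in rows:
    print(row)
```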

Veterans Administration Cooperative Urological Research Group 

VACURG85.txt  presents the data from the VACURG study of prostate cancer described by Byar in the book edited by Andrews and Herzberg (1985) which gives the variable descriptions. These data have been used by many, including Thall and Lachin (1986). The data are also available from StatLib in a slightly different format as Table46.dat of the file Andrews. The variables included are

patid patient number
rx treatment group 1=placebo, 2=0.2 mg. estrogen, 3=1 mg., 4=5 mg.
mosfu months of follow-up
age in years (89 denotes older than 88)
pf performance status 0=normal, 1=<50% time in bed, 2=(50-<100%) time, 3=confined to bed
sbp systolic blood pressure (mm Hg) divided by 10 (e.g. 118 recorded as 12)
ekg EKG 0=normal, 1=benign, 2=rhythmic disturbances and electrolyte changes, 3=heart blocks or conduction defects, 4=heart strain, 5=old myocardial infarct (MI), 6=recent MI
sz tumor size in cm**2 (0=none palpable)
stage tumor stage
startm startd starty randomization (m d y)
ap alkaline phosphatase in King-Armstrong units *10, a measure of liver function
status survival status 0=alive, 1=dead from prostate cancer, 2=dead from heart/vascular disease, 3=dead from cerebrovascular disease, 4=dead from pulmonary embolus, 5=dead from other cancer, 6=dead from respiratory disease, 7=dead from other specific non-cancer cause, 8=dead from unspecified non-cancer cause, 9=dead from unknown cause
wt weight index = weight (kg) - height (cm) + 200
hx history of cardiovascular disease 0=no, 1=yes
dbp diastolic BP/10
hg serum hemoglobin in g/100 ml *10
sg combined index of tumor stage and histology grade
bm bone metastases 0=no, 1=yes