The purpose of this chapter is to describe key aspects of the design and analysis of phase III clinical trials focusing on the treatment of breast cancer. Phase III clinical trials are 1 of 5 basic hierarchical phases of human research activity designed to ascertain information regarding the biological processing, safety, and efficacy of a new treatment.1,2 Details regarding phase III trials are provided later in the chapter. To put the information gained from a phase III clinical trial in perspective, we first present a synopsis of all of the phases of clinical trial research before turning to the design and analysis of phase III trials. We use “new treatment” as a general term for one or several methods of treating a disease or condition with a new surgical technique, a new medical device, or a new form of chemical, physical, or biological agent.
Phase 0 trials are a relatively newly defined type of trial involving the administration of a subtherapeutic dose of a new treatment to a single group of usually 10 to 15 human subjects. The purpose of this type of trial is to determine early in the development process whether a new treatment has the pharmacokinetic and pharmacodynamic properties in humans that would be anticipated from the findings of laboratory and animal studies. Because the dosing is subtherapeutic, these trials are also referred to as microdosing trials and are not intended to provide relevant information on safety or efficacy. Discussions on the methodology and issues in phase 0 trials are presented by Takimoto, Murgo and associates, and others.2-5
Phase I trials also involve the administration of a new treatment to a single group of human subjects, usually 20 or more but rarely over 100. The purpose of this phase of research is to obtain some initial information regarding the therapeutic dosing of the treatment. Until the point of initiating a phase I trial, information has been limited to that from laboratory investigations, studies among animals, and perhaps the pharmacokinetics in humans of subtherapeutic doses. The objectives of a phase I trial can include obtaining information on the best mode of treatment delivery, determining adequate dosing levels, describing the nature of side effects from treatment, and detecting some signal that the new treatment has activity on the disease or condition for which the treatment is planned to be used. Eisenhauer and coworkers, Miller, Horstmann and colleagues, and others present discussions regarding methodology and issues in phase I trials.6-11
If the findings from phase I research show some promise and are within safety parameters, phase II trial research is initiated. This type of trial typically involves 100 or more participants and can, but does not always, include a control group with randomization of treatment assignment to the control treatment or new treatment. The goals of phase II studies are to learn more about the method of delivery, best dosing levels, safety, and effectiveness of the new treatment. Herson and associates, Fleming and colleagues, Simon and coworkers, and several others provide more discussion regarding the design and analysis issues in phase II trials.12-25
A new treatment will be taken into a phase III trial only if the results from phase I and phase II research demonstrate that the new treatment is within safety parameters and has good potential to be an effective treatment. This type of trial involves many hundreds or even many thousands of human subjects, and always includes a control group with randomized treatment assignment. The control group could be one receiving the treatment that represents the current standard of care or, if there is no standard, one receiving a placebo. If possible, the treatments are administered in a double-blinded fashion, where neither the participant nor the treating health care provider is aware of which treatment (new or control) the participant is receiving. The primary goals of a phase III trial are to definitively determine the efficacy of the new therapy and refine safety information. In some circumstances, the goal regarding efficacy is to show that the new treatment is better than the current treatment. In other circumstances, the goal is to show that the new therapy is as effective as the current treatment, but may be less costly, have fewer or less intense side effects, and/or be easier to administer. Results from phase III trials are the primary information used by the U.S. Food and Drug Administration (FDA) and other similar agencies throughout the world to approve a new treatment for general use among individuals who have a specific disease or condition. Methodologic issues involved with phase III studies are the focus of this chapter.
Phase IV trials are conducted after a new treatment has been approved by the FDA and are often referred to as post-marketing surveillance trials. This type of trial is usually conducted by the company that developed the new treatment. The primary purpose of phase IV research is to obtain long-term information on side effects of the new treatment or to learn more about its rare side effects. Sometimes phase IV research involves the continued follow-up of the participants in a phase III trial that was used as the basis for FDA approval of the new treatment. Phase IV research can also involve the establishment of a surveillance system to collect data from the general population who begin taking the new treatment, or the establishment of other activities needed to collect information to refine the risk/benefit profile of the treatment or to identify subgroups for whom treatment should be limited or withheld.
Defining the primary hypothesis of a trial is the first step in designing the trial. The primary hypothesis is meant to reflect the main objective of the trial. Most often the primary objective is to compare the effects of 2 treatments. In this case, the primary hypothesis would be to evaluate the efficacy of a new treatment as compared to the control treatment in terms of some undesirable health outcome. However, trials can involve the comparison of more than 2 treatments, and in this circumstance the hypothesis would identify more than one comparison of a new treatment to a control treatment and/or comparisons among several of the new treatments. The undesirable health outcome that is used as the basis for comparing treatments is referred to as the primary end point. When dealing with phase III breast cancer treatment trials, possible end points include breast cancer recurrence-free interval, invasive disease-free survival, overall survival, or one of several others.26 Once the primary end point has been selected, one must define the specific types of events that are to be included as part of the end point. For example, events under disease-free survival would include cancer recurrence at the original anatomic site of diagnosis, recurrence at a different anatomic site, diagnosis of a second primary cancer, and death without evidence of recurrence at any anatomic site or a second primary cancer. A lack of standardization of the specific types of events included as part of a particular end point has caused difficulty when comparing results across trials evaluating similar treatments. In an effort to rectify this situation, there have been recent harmonization efforts to standardize the types of events included as part of end points used in cancer clinical trials.26,27 The recommendations from these efforts should be used as the basis for defining the health outcomes selected for trials.
Most often, the anticipated result from testing the study hypothesis is that the new treatment would be superior to the control treatment. This type of hypothesis is referred to as a superiority hypothesis. In some circumstances, one may not wish to demonstrate the superiority of a new treatment, but rather to demonstrate that the new treatment is no less effective than a control treatment. This type of hypothesis is known as a non-inferiority hypothesis. It would be considered when it is anticipated that, within a given tolerance, the new and control treatments may have similar efficacy but the new treatment would be more desirable because it may have fewer or less severe side effects, may be less costly, and/or may be easier to administer.28-31
It is rare that there is only one objective of a phase III clinical trial. Thus, as must be accomplished for the primary hypothesis, end points for secondary objectives must also be defined. This should be accomplished using the same considerations as those for the primary objective.
Once the primary hypothesis has been defined, there are several key parameters that must be established. Taken together, these parameters become the basis for determining the sample size of participants needed to perform a statistically adequate test of the primary hypothesis. There are 2 types of parameters that must be established: those that are statistical in nature and those that are operational in nature.
The statistical parameters that must be established before the trial is initiated are the α-level, the statistical power, the baseline rate of the primary end point for the control group, and the effect size. The α-level is also referred to as the type I error rate. It is the threshold p-value that is used for determining statistical significance in hypothesis testing. It represents the likelihood of concluding that there is a difference between the new and control treatments when there really is no difference. As one wants to minimize the probability of such an error, one usually chooses a low α-level. Traditionally, the α-level that is used for hypothesis testing is set at 0.05. In situations where there are more than 2 treatments being compared or formal testing of secondary hypotheses is planned, the α-level for any particular comparison would be reduced or the testing would be performed in a hierarchical, conditional manner to maintain the experiment-wise α-level at the 0.05 level.32 The statistical power for the test of hypothesis is the likelihood of concluding that there is a difference between the new and control treatments when a difference of a specific magnitude really does exist. As one wants to maximize this probability, one chooses a high statistical power. Practically speaking, the statistical power is usually set in the range of 0.8 to 0.9. A value of less than 0.8 is not recommended. The baseline rate of the primary end point is the anticipated rate of the end point among those receiving the control treatment. This can be determined from reports in the literature for populations similar to that which is likely to be accrued to the trial. The effect size is the magnitude of the difference in rates of the primary end point that is anticipated between the new and control treatments.
For example, if the hypothesis involves evaluating the rate of breast cancer recurrence and one anticipates that the rate among those receiving the new treatment will be 25% less than the rate among those receiving the control treatment, then the effect size is a 25% reduction. The effect size that is chosen must be one that is biologically meaningful and also one that can be justified as a reasonable magnitude of effect that could be achieved with the new treatment.
The key operational parameters that must be established include the anticipated loss to follow-up rate, nonadherence rate, and patient accrual pattern. To the extent possible, these can be determined from prior experience in trials of a similar nature. The loss to follow-up rate is the proportion of those randomized who formally withdraw their consent to participate, cannot be located, or decide to discontinue returning for clinical assessments. The nonadherence rate is the proportion of those randomized who continue participation in clinical assessments, but discontinue the study treatment to which they were randomly assigned before the full planned course is completed. The determination of these 2 rates is important, as these parameters could affect the observed rate of the study outcome and therefore increase the sample size needed to be studied. The projected patient accrual pattern is the anticipated number of individuals who would be randomized over a specified unit of time. Consideration should be given to the time-dependent nature of accrual: accrual for a clinical trial will often start out at a less than maximum level, and it may be several months before it reaches a peak level, after which it may stay constant or drop.
A fundamental part of the design of any clinical trial is the consideration of the number of individuals who are required to appropriately conduct the trial. Once the trial parameters are set, the investigators then calculate a sample size based on the power requirement. In a few cases, due to cost constraints, ethical considerations, and so forth, one can only obtain a fixed sample size for a study. In this latter case, one may wish to calculate the power of a test, given the prespecified sample size.
An important distinction that must be made in calculating sample size and power is whether or not the investigators wish to frame their results in terms of confidence intervals or in terms of hypothesis testing. Another consideration is whether or not the investigators hope to establish that one treatment is more efficacious than other treatments or, alternatively, whether or not 2 or more treatments are equivalent. In the latter case, one would hope that one of the treatments or interventions would have fewer side effects than the other.
In this section, we focus on issues related to sample size and power calculations when hypothesis testing is employed and when the question of interest is whether or not efficacy is different between 2 treatments or interventions. Also, we focus on issues related to time to event outcomes, although we recognize that one must also consider power for other types of end points. However, in most phase III trials, the primary interest is to determine whether one treatment as compared to another increases the time to some event (eg, death, relapse, adverse reaction). Such studies are analyzed using “survival” analysis. In most such studies, at the time of the definitive analysis, the outcome (time to event) is not observed for all of the individuals. Those observations for which no event is observed are considered to be “censored.” In analyzing such data, each individual is still “at risk” up until the time he or she was observed to have an event or censoring has occurred. In survival analysis, the power and sample size are determined by the number of events observed in the study. From the timing of these events, one can estimate hazard rates. The hazard rate is related to the instantaneous failure rate in a time interval (t, t + Δt) given that an individual is still at risk at the beginning of the interval (ie, at time t). In studies of chronic diseases, it is often the case that the hazard rate, denoted λ, is constant over time; that is, the risk of failure for those who have not yet had the event of interest does not change as time goes on. In such studies, the hazard rate for a cohort can be approximated simply by taking the number of events and dividing through by the total number of person-years in the cohort, that is,
\[ \hat{\lambda} = \frac{\text{number of events}}{\text{total person-years of follow-up}} \]
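Under the constant-hazard assumption, the estimate is the number of events divided by the total person-years of follow-up. The following is a minimal Python sketch of that calculation; the function name and the six-patient cohort are hypothetical and serve only to illustrate the arithmetic.

```python
def constant_hazard_rate(follow_up_years, event_observed):
    """Estimate a constant hazard rate as events / total person-years.

    follow_up_years: time at risk for each individual, in years
    event_observed:  True if the event occurred, False if censored
    """
    if len(follow_up_years) != len(event_observed):
        raise ValueError("inputs must have the same length")
    person_years = sum(follow_up_years)
    if person_years <= 0:
        raise ValueError("total person-years must be positive")
    n_events = sum(1 for observed in event_observed if observed)
    return n_events / person_years


# Hypothetical cohort: 3 events and 3 censored observations.
follow_up_years = [2.0, 5.0, 3.5, 5.0, 1.5, 4.0]
event_observed = [True, False, True, False, True, False]

hazard = constant_hazard_rate(follow_up_years, event_observed)
print(f"Estimated hazard: {hazard:.4f} events per person-year")
```

Censored individuals contribute person-time to the denominator but no events to the numerator, which is how the estimate uses the information from those who have not yet had an event.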
Notational Note: $Z_{\alpha}$ denotes the value of a standard normal random variable Z having the property that $\Pr(Z < Z_{\alpha}) = \alpha$. For example, $\Pr(Z < Z_{1-\alpha/2}) = 1 - \alpha/2$. In the case that α = 0.05, $Z_{1-\alpha/2} = 1.96$. Also, “Φ” represents the cumulative standard normal distribution, that is, Φ(z) = Pr(Z ≤ z), where Z follows a standard normal distribution, that is, Z ~ N(0,1). For example, Φ(1.645) = 0.95, Φ(0) = 0.50, Φ(−∞) = 0 and Φ(∞) = 1.
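The quantities in this notational note can be checked numerically. The sketch below uses `NormalDist` from the Python standard library; the helper names `z_quantile` and `phi` are illustrative choices and do not come from the chapter.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal distribution, N(0, 1)


def z_quantile(prob):
    """Return the value z such that Pr(Z < z) = prob for Z ~ N(0, 1)."""
    return std_normal.inv_cdf(prob)


def phi(z):
    """Cumulative standard normal distribution, Pr(Z <= z)."""
    return std_normal.cdf(z)


alpha = 0.05
z_two_sided = z_quantile(1 - alpha / 2)  # critical value for a 2-sided test

print(f"Z(1 - alpha/2) = {z_two_sided:.2f}")
print(f"Phi(1.645)     = {phi(1.645):.2f}")
print(f"Phi(0)         = {phi(0.0):.2f}")
```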
In a clinical trial where the outcome of interest is time to an event, the power is driven by the number of events. Hence, for example, the power associated with observing 300 events will be the same regardless of whether the trial has 5000 or 500 subjects. Of course, there would be a big difference in the cost of the trial depending on how many individuals are accrued. Consequently, it is important to be able to estimate the hazard or failure rate of at least the control arm of the trial. For comparing the event-free rates in 2 groups, where the hazard rates, λ1 and λ2, are assumed to be constant, George and Desu33 and Piantadosi34 showed that the total number of events required in the study, d, is given by
\[ d = \frac{(Z_{1-\alpha/2} + Z_{1-\beta})^2}{p\,q\,[\ln(\Delta)]^2} \qquad [1] \]
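where p and q are the proportions of patients in each treatment arm, “ln” denotes the natural logarithm, Δ = λ2/λ1, and the α-level and power, 1 – β, are prespecified. From equation [1], we can derive a formula for the power, given a total of d events, as
\[ 1 - \beta = \Phi\left( \sqrt{d\,p\,q}\;|\ln(\Delta)| - Z_{1-\alpha/2} \right) \qquad [2] \]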
Example 1 In a breast cancer trial among women who have at least one axillary node with cancer, an experimental agent is only expected to reduce the hazard rate of mortality by 25%. Thus, Δ = λ2/λ1 = 0.75. Often the mortality rates in such a population are approximately exponential. In order to have 80% power to detect the above-mentioned 25% reduction in hazard rates for a 2-sided test with α = 0.05 and equal allocation to the 2 arms (p = q = 0.5), one must observe
\[ d = \frac{4\,(1.96 + 0.8416)^2}{[\ln(0.75)]^2} \approx 379.4 \]
Consequently, in order to ensure 80% power to detect such a difference, one would need to observe 380 deaths in the study.
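The number of events in Example 1 can be reproduced with a short calculation. The sketch below assumes the usual George and Desu form of the events formula, in which the required number of events is (Z for α plus Z for power) squared, divided by p, q, and the squared log hazard ratio, together with the power expression obtained by solving that formula for the power. The function names and default arguments are illustrative assumptions.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def events_george_desu(hazard_ratio, alpha=0.05, power=0.80, p=0.5):
    """Total events needed for a 2-sided test, assuming constant hazards.

    hazard_ratio: Delta = lambda2 / lambda1
    p:            proportion of patients allocated to one arm (q = 1 - p)
    """
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = _STD_NORMAL.inv_cdf(power)
    q = 1 - p
    return (z_alpha + z_beta) ** 2 / (p * q * math.log(hazard_ratio) ** 2)


def power_george_desu(n_events, hazard_ratio, alpha=0.05, p=0.5):
    """Power of a 2-sided test given the total number of observed events."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    q = 1 - p
    signal = math.sqrt(n_events * p * q) * abs(math.log(hazard_ratio))
    return _STD_NORMAL.cdf(signal - z_alpha)


d = events_george_desu(hazard_ratio=0.75)
required_events = math.ceil(d)  # round up to a whole number of deaths: 380

print(f"Events required: {d:.1f}, rounded up to {required_events}")
print(f"Power with {required_events} events: "
      f"{power_george_desu(required_events, 0.75):.3f}")
```

Rounding up rather than to the nearest integer keeps the achieved power at or above the 80% target.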
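In cases where there is no assumption about the event time distributions but where the hazards are assumed to be proportional, that is, Δ = λ2(t)/λ1(t) is a constant for all t, Freedman35 derived an expression for the total number of deaths which, for equal allocation to the 2 treatment arms, is
\[ d = \frac{(Z_{1-\alpha/2} + Z_{1-\beta})^2\,(1 + \Delta)^2}{(1 - \Delta)^2} \qquad [3] \]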
This latter expression gives a slightly more conservative estimate than the expression in equation [1] for determining the number of events needed to achieve a specified power.
Example 1 (Continued) In the breast cancer clinical trial described above, one would need to observe
\[ d = \frac{(1.96 + 0.8416)^2\,(1 + 0.75)^2}{(1 - 0.75)^2} \approx 384.6 \]
so that 385 deaths would have to be observed to achieve 80% power to detect a 25% mortality reduction.
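The Freedman calculation for the same example can be sketched in a few lines. The code assumes the equal-allocation form of the formula, in which the required events are (Z for α plus Z for power) squared, multiplied by (1 + Δ) squared over (1 − Δ) squared; the function name is a hypothetical label used only for this illustration.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def events_freedman(hazard_ratio, alpha=0.05, power=0.80):
    """Total events for a 2-sided test under proportional hazards.

    Assumes equal allocation to the 2 treatment arms.
    """
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = _STD_NORMAL.inv_cdf(power)
    ratio_term = (1 + hazard_ratio) ** 2 / (1 - hazard_ratio) ** 2
    return (z_alpha + z_beta) ** 2 * ratio_term


d_freedman = events_freedman(hazard_ratio=0.75)
required_events = math.ceil(d_freedman)  # 385

print(f"Events required: {d_freedman:.1f}, rounded up to {required_events}")
```

Because (1 + Δ)/(1 − Δ) changes only in sign when Δ is replaced by 1/Δ, the result does not depend on which arm's hazard is placed in the numerator of the ratio.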
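From equation [3], an expression for the power, given a total of d deaths, can be derived as
\[ 1 - \beta = \Phi\left( \sqrt{d}\;\frac{|1 - \Delta|}{1 + \Delta} - Z_{1-\alpha/2} \right) \qquad [4] \]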
The above equations are written with the assumption that the alternative hypotheses are 2-sided. For a 1-sided alternative hypothesis, $Z_{1-\alpha/2}$ is replaced by $Z_{1-\alpha}$, which results in increased power but must be further justified based on scientific and ethical considerations.
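To illustrate the effect of a 1-sided versus a 2-sided alternative, the sketch below evaluates the power implied by the equal-allocation Freedman events formula for the 385 deaths of Example 1. The function name and the `two_sided` flag are assumptions made for this illustration.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def power_freedman(n_events, hazard_ratio, alpha=0.05, two_sided=True):
    """Power under proportional hazards with equal allocation.

    For a 1-sided test the critical value Z(1 - alpha/2) is replaced
    by Z(1 - alpha).
    """
    tail = alpha / 2 if two_sided else alpha
    z_crit = _STD_NORMAL.inv_cdf(1 - tail)
    signal = (math.sqrt(n_events) * abs(1 - hazard_ratio)
              / (1 + hazard_ratio))
    return _STD_NORMAL.cdf(signal - z_crit)


power_2_sided = power_freedman(385, 0.75, two_sided=True)
power_1_sided = power_freedman(385, 0.75, two_sided=False)

print(f"2-sided power with 385 deaths: {power_2_sided:.3f}")
print(f"1-sided power with 385 deaths: {power_1_sided:.3f}")
```

With the same number of deaths, the 1-sided test has noticeably higher power, which is the gain that must be weighed against the scientific and ethical justification for a 1-sided alternative.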
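To get a quick estimate of the number of patients needed to achieve one’s power goals assuming a fixed follow-up time for each patient, one can use a formula from Freedman35:
\[ N_{p,i} = \frac{(Z_{1-\alpha/2} + Z_{1-\beta})^2\,(1 + \Delta)^2}{(2 - p_1 - p_2)\,(1 - \Delta)^2} \qquad [5] \]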
where $N_{p,i}$ is the total number of patients in group i, i = 1,2; p1 is the proportion of patients in group 1 who are event free at a given time t; p2 is the proportion of patients in group 2 who are event free at t; and Δ = ln(p1)/ln(p2).
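Example 2 Suppose that the projected proportion of patients alive in the control group at 5 years is 77.9% based on an average annual mortality rate of 5%. Suppose also that we wish to test a 2-sided hypothesis with α = 0.05 and that we wish to calculate the number of patients needed per group to have 80% power to detect a difference if the hazard rate is reduced by 25%. A 25% reduction corresponds to an annual mortality rate of 3.75% and a 5-year survival proportion of exp(−0.1875) ≈ 82.9%, so that Δ = ln(0.779)/ln(0.829) ≈ 1.33. The number of patients per group needed in this trial is approximately:
\[ N_{p,i} = \frac{(1.96 + 0.8416)^2\,(1 + 1.33)^2}{(2 - 0.779 - 0.829)\,(1 - 1.33)^2} \approx 981 \]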
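The calculation in Example 2 can be sketched as follows. The code assumes the per-group patient count equals the equal-allocation Freedman events requirement divided by (2 − p1 − p2), and it derives the 5-year survival proportions from constant annual hazards of 0.05 and 0.05 × 0.75; the function and variable names are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def patients_per_group_freedman(p1, p2, alpha=0.05, power=0.80):
    """Approximate patients per group for a fixed follow-up time.

    p1, p2: proportions event free at the end of follow-up in groups 1, 2
    Assumes proportional hazards and equal allocation to the 2 groups.
    """
    delta = math.log(p1) / math.log(p2)  # hazard ratio implied by p1, p2
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = _STD_NORMAL.inv_cdf(power)
    events = (z_alpha + z_beta) ** 2 * (1 + delta) ** 2 / (1 - delta) ** 2
    return events / (2 - p1 - p2)


years = 5
control_hazard = 0.05                   # average annual mortality rate
treated_hazard = control_hazard * 0.75  # 25% reduction in hazard

p_control = math.exp(-control_hazard * years)  # about 0.779
p_treated = math.exp(-treated_hazard * years)  # about 0.829

n_per_group = patients_per_group_freedman(p_control, p_treated)
print(f"Patients per group: {math.ceil(n_per_group)}")
```

This quick estimate ignores staggered accrual, loss to follow-up, and nonadherence, each of which would increase the number of patients required.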