In any hypothesis test, there are four possible outcomes. The table below lists all four.
Decisions
Accepting H₀ when it is true; good decision (p = 1 − α)
Accepting H₀ when it is false; Type II error (p = β)
Rejecting H₀ when it is true; Type I error (p = α, the significance level)
Rejecting H₀ when it is false; good decision (p = 1 − β, the power of the test)
What should every good hypothesis test ensure? Ideally, it should make the probabilities of both a Type I error and a Type II error very small. The probability of a Type I error is denoted α, and the probability of a Type II error is denoted β.
Recall that in every test, a significance level is set, normally α = 0.05. In other words, one is willing to accept a probability of 0.05 of being wrong when rejecting the null hypothesis. This is the α risk that one is willing to take, and setting α at 0.05, or 5 percent, means one is willing to be wrong 5 out of 100 times when one rejects H₀. Hence, once the significance level is set, there is really nothing more that can be done about α.
Suppose the null hypothesis is false. One would want the hypothesis test to reject it every time. Unfortunately, no test is foolproof, and there will be cases where the null hypothesis is in fact false but the test fails to reject it. In that case, a Type II error is made. The probability of making a Type II error is β, and it should be as small as possible. Consequently, 1 − β is the probability of correctly rejecting a null hypothesis that is in fact false, and this number should be as large as possible.
Rejecting a null hypothesis when it is false is what every good hypothesis test should do. Having a high value for 1 − β (near 1.0) means it is a good test, and having a low value (near 0.0) means it is a bad test. Hence, 1 − β is a measure of how good a test is, and it is known as the "power of the test."
The power of the test is the probability that the test will reject H₀ when in fact it is false. Conventionally, a test with a power of 0.8 is considered good.
Consider the following when doing a power analysis:
The computation of power depends on the test used. One of the simplest examples for power computation is the t-test. Assume a hypothesized population mean of μ = 20; a sample of n = 44 is collected, with a sample mean of x̄ = 22 and a sample standard deviation of s = 4. Did this sample come from a population with mean μ = 20, given α = 0.05?
H₀: μ = 20
Hₐ: μ ≠ 20
α = 0.05, two-tailed test
The next step is testing an effect size of |x̄ − μ| = |22 − 20| = 2. Since this is an absolute value, it needs to be standardized into a t-value using the standard error of the mean, s/√n = 4/√44 = 0.603, giving t = 2/0.603 = 3.32.
The critical value of t at 0.05 (two-tailed) for DF = 43 is 2.0167 (using spreadsheet software [e.g., Excel], TINV(0.05, 43) = 2.0167). Since t = 3.32 is greater than the critical value, the null hypothesis is rejected. But how powerful was this test?
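As a cross-check, the same arithmetic can be scripted; below is a minimal sketch in Python using scipy (the variable names are illustrative, not from any particular package):

```python
from math import sqrt
from scipy import stats

mu0, xbar, s, n, alpha = 20, 22, 4, 44, 0.05

se = s / sqrt(n)                               # standard error: 4/sqrt(44) = 0.603
t_stat = (xbar - mu0) / se                     # 2 / 0.603 = 3.32
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)  # 2.0167, same as TINV(0.05, 43)

# |t| exceeds the critical value, so H0 is rejected
print(abs(t_stat) > t_crit)  # True
```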
The following figure illustrates this graphically.
A t of ±2.0167 corresponds, in the hypothesized distribution (μ = 20), to x̄ = 20 ± 0.603(2.0167), that is, 20 + 0.603(2.0167) = 21.216 and 20 − 0.603(2.0167) = 18.784.
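The same cutoffs fall out of a couple of lines of Python (a sketch, reusing the quantities above):

```python
from math import sqrt
from scipy import stats

se = 4 / sqrt(44)                          # standard error = 0.603
t_crit = stats.t.ppf(0.975, df=43)         # 2.0167
print(20 + t_crit * se, 20 - t_crit * se)  # 21.216 and 18.784
```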
The next figure shows an alternative distribution with μ = 22 and s = 4. This is the original distribution shifted two units to the right.
What is the probability of being less than 21.216 in this alternative distribution? That probability is β, the probability of accepting H₀ when in fact it is false. This is because any value within that region would, under the original distribution, lead one to accept H₀. How does one find this β? First, find the t-value of 21.216 in the alternative distribution: t = (21.216 − 22)/0.603 = −1.30.
What is the corresponding probability of being less than t = −1.30? From the t-tables, using one-tailed, DF = 43, t = 1.3, one finds 0.10026 (using spreadsheet software, TDIST(1.3, 43, 1) = 0.10026). Hence β = 0.10026 and 1 − β = 0.9, which is the power of the test in this example.
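The β calculation can likewise be scripted (a sketch under the same assumptions; like the hand calculation, it ignores the negligible probability of falling below 18.784 in the alternative distribution):

```python
from math import sqrt
from scipy import stats

mu0, mu_alt, s, n = 20, 22, 4, 44
se = s / sqrt(n)                        # 0.603
t_crit = stats.t.ppf(0.975, df=n - 1)   # 2.0167
cutoff = mu0 + t_crit * se              # 21.216, upper edge of the acceptance region
t_alt = (cutoff - mu_alt) / se          # -1.30 in the alternative distribution
beta = stats.t.cdf(t_alt, df=n - 1)     # P(T < -1.30) ≈ 0.100
print(beta, 1 - beta)                   # beta ≈ 0.10, power ≈ 0.90
```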
Below is the statistical software output (Minitab) for the same example:

[Minitab power output: power ≈ 0.90 for this example]
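If Minitab isn't at hand, the same number can be reproduced in Python with statsmodels (a sketch; it expresses the difference as a standardized effect size, Cohen's d = 2/4 = 0.5):

```python
from statsmodels.stats.power import TTestPower

# One-sample t-test power: effect size in SD units = (22 - 20) / 4 = 0.5
power = TTestPower().power(effect_size=0.5, nobs=44, alpha=0.05,
                           alternative='two-sided')
print(round(power, 3))  # ~0.90, matching the hand calculation above
```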
Three key factors affect the power of the test.
The difference, or effect size, affects power. If the difference one was trying to detect was not 2 but 1, the overlap between the original distribution and the alternative distribution would have been greater. Hence, β would increase and 1 − β, the power, would decrease.
Hence, as effect size increases, power will also increase.
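To see this numerically, one can recompute the power for a difference of 1 versus 2 (a sketch using statsmodels; the effect sizes are the raw differences divided by s = 4):

```python
from statsmodels.stats.power import TTestPower

for diff in (1, 2):
    p = TTestPower().power(effect_size=diff / 4, nobs=44, alpha=0.05,
                           alternative='two-sided')
    print(diff, round(p, 3))  # power is far lower for the smaller difference
```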
Significance level, or α, affects power. Imagine using a significance level of 0.1 in the example instead. What would happen? The table below compares the two settings, and the sketch after the table shows the effect on power.
Significance Level   DF   Critical t   Value in Original Distribution
0.05                 43   2.0167       20 ± 0.603(2.0167) = 18.784 and 21.216
0.10                 43   1.6811       20 ± 0.603(1.6811) = 18.986 and 21.014
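Raising α to 0.1 shrinks the acceptance region, so β shrinks and power grows. A sketch that wraps the earlier calculation in a helper (one_sample_power is a name introduced here for illustration, not a library routine) makes the comparison concrete:

```python
from math import sqrt
from scipy import stats

def one_sample_power(mu0, mu_alt, s, n, alpha):
    """Approximate power of a two-tailed one-sample t-test,
    ignoring the negligible far tail of the alternative."""
    se = s / sqrt(n)
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
    cutoff = mu0 + t_crit * se          # upper edge of the acceptance region
    beta = stats.t.cdf((cutoff - mu_alt) / se, df=n - 1)
    return 1 - beta

print(one_sample_power(20, 22, 4, 44, alpha=0.05))  # ~0.90
print(one_sample_power(20, 22, 4, 44, alpha=0.10))  # ~0.94: larger alpha, more power
```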