R provides several methods for testing the independence of categorical variables. In this tutorial, I will cover three tests: the chi-square test of independence, the Fisher exact test, and the Cochran-Mantel-Haenszel test.
The Chi-Square test is a statistical method used to determine whether two categorical variables are significantly associated. Both variables are measured on the same population, and each one takes values from a set of categories such as Male/Female or True/False.
The function chisq.test() is used to perform this test. I will show an example using the built-in Arthritis data from the vcd package. You can also import your own data into R from a CSV, Excel or SPSS data file. We will also see how to interpret the results of the Chi-Square test.
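As a minimal sketch of the import route, the code below writes a small CSV file to a temporary location, reads it back with read.csv(), and runs the test on the resulting table. The survey data, the column names gender and smoker, and the file itself are hypothetical and exist only for illustration. Excel or SPSS files typically need an add-on package such as readxl or haven.

```r
# Hypothetical survey data, written to a temporary CSV file
# so that the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
survey <- data.frame(
  gender = rep(c("Male", "Female"), each = 50),
  smoker = c(rep("Yes", 30), rep("No", 20),   # 50 hypothetical males
             rep("Yes", 15), rep("No", 35))   # 50 hypothetical females
)
write.csv(survey, csv_path, row.names = FALSE)

# Import the CSV file and cross-tabulate the two categorical columns.
dat <- read.csv(csv_path, stringsAsFactors = TRUE)
tab <- table(dat$gender, dat$smoker)
print(tab)

# chisq.test() also accepts a contingency table directly.
# For a 2 x 2 table it applies Yates' continuity correction by default.
result <- chisq.test(tab)
print(result)
```

Passing a table instead of two vectors gives the same test; it is simply convenient when the counts are already tabulated.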
Hypotheses of Chi-Square test
Null hypothesis – Assumes that there is no association between the two variables.
Alternative hypothesis – Assumes that there is an association between the two variables.
Let us see an example now.
Example
To install the vcd package, use the command install.packages("vcd"). Then use the following code to perform the Chi-Square test in R on two different pairs of variables and to see when to reject the null hypothesis and when not to.
> library(vcd)
> chisq.test(Arthritis$Treatment,Arthritis$Improved)

	Pearson's Chi-squared test

data:  Arthritis$Treatment and Arthritis$Improved
X-squared = 13.055, df = 2, p-value = 0.001463

> chisq.test(Arthritis$Improved,Arthritis$Sex)

	Pearson's Chi-squared test

data:  Arthritis$Improved and Arthritis$Sex
X-squared = 4.8407, df = 2, p-value = 0.08889

Warning message:
In chisq.test(Arthritis$Improved, Arthritis$Sex) :
  Chi-squared approximation may be incorrect
From the result of chisq.test(Arthritis$Treatment,Arthritis$Improved), there appears to be a relationship between treatment received and level of improvement. We come to this conclusion because the p-value (0.001463) is less than 0.01, i.e., p < 0.01. Hence, we reject the null hypothesis in favour of the alternative hypothesis.
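To see where these numbers come from, the statistic can be recomputed by hand. The sketch below assumes the vcd package is installed and uses the standard Pearson formula: expected counts under independence are (row total × column total) / grand total, and the statistic sums (observed − expected)² / expected over all cells. The variable names are my own.

```r
library(vcd)  # provides the Arthritis data set

# Observed counts: treatment (rows) by level of improvement (columns).
observed <- table(Arthritis$Treatment, Arthritis$Improved)

# Expected counts if treatment and improvement were independent.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson chi-square statistic.
x_squared <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1).
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# p-value: upper tail of the chi-square distribution.
p_value <- pchisq(x_squared, df = df, lower.tail = FALSE)

print(c(X.squared = x_squared, df = df, p.value = p_value))
```

The three values should agree with the X-squared, df and p-value lines that chisq.test() printed for this pair of variables.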
But the result of chisq.test(Arthritis$Improved,Arthritis$Sex) gives no evidence of a relationship between patient sex and improvement, because the p-value (0.08889) is greater than 0.05, i.e., p > 0.05. Hence, we fail to reject the null hypothesis. Note that this does not prove the two variables are independent; it only means the data do not provide enough evidence of an association.
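The decision rule used in both cases can be written down explicitly. This is a small sketch, assuming a significance level of 0.05 and that the vcd package is installed; the helper decide() is a hypothetical name introduced here, not part of R or vcd.

```r
library(vcd)  # provides the Arthritis data set

alpha <- 0.05  # assumed significance level; choose it before running the test

res_treatment <- chisq.test(Arthritis$Treatment, Arthritis$Improved)
# suppressWarnings() hides the small-expected-count warning for this sketch only.
res_sex <- suppressWarnings(chisq.test(Arthritis$Improved, Arthritis$Sex))

# Hypothetical helper: compare the p-value stored in the test result with alpha.
decide <- function(res, alpha = 0.05) {
  if (res$p.value < alpha) "reject H0" else "fail to reject H0"
}

print(decide(res_treatment, alpha))
print(decide(res_sex, alpha))
```

The object returned by chisq.test() is a list, so the p-value, the statistic and the degrees of freedom are available as res$p.value, res$statistic and res$parameter.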
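The warning message is produced because one of the six cells in the table (male, some improvement) has an expected count of less than five, which may invalidate the chi-square approximation. To check this, inspect the expected counts with chisq.test(Arthritis$Improved,Arthritis$Sex)$expected; head(Arthritis) only previews the first rows of the data.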
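The check can be sketched as follows, again assuming the vcd package is installed. The code pulls the expected counts out of the test result and flags cells below five. It then runs two alternatives that base R offers for small expected counts: the Fisher exact test mentioned at the start of this tutorial, and a simulated p-value. Whether either is the better choice depends on your data.

```r
library(vcd)  # provides the Arthritis data set

tab <- table(Arthritis$Improved, Arthritis$Sex)

# Expected counts under independence (the warning is suppressed here on purpose).
expected <- suppressWarnings(chisq.test(tab))$expected
print(round(expected, 2))

# TRUE marks cells with an expected count below five.
print(expected < 5)

# Alternative 1: the Fisher exact test does not rely on the chi-square approximation.
fisher_res <- fisher.test(tab)
print(fisher_res)

# Alternative 2: a Monte Carlo p-value for the chi-square statistic.
set.seed(1)  # arbitrary seed, only so that the simulation is repeatable
sim_res <- chisq.test(tab, simulate.p.value = TRUE, B = 10000)
print(sim_res)
```

The only cell flagged should be the male, some improvement cell, whose expected count is 25 × 14 / 84, roughly 4.17.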
So, this is how you can perform a Chi-Square test in R and interpret the result.