Chi-square with R

R can do all the usual chi-square tests, with either raw data or tables of counts. We start with the heart attack data and the Goodness-of-Fit test.

> names(heartatk)
[1] "Patient"   "DIAGNOSIS" "SEX"       "DRG"       "DIED"      "CHARGES"  
[7] "LOS"       "AGE"   
> attach(heartatk)

> table(SEX)
   F    M 
5065 7779 

> chisq.test(table(SEX))

        Chi-squared test for given probabilities

data:  table(SEX) 
X-squared = 573.4815, df = 1, p-value < 2.2e-16

The command chisq.test(table(SEX)) does a chi-square goodness of fit test on the table for the SEX variable. The default is to test for equal expected counts in every cell. That is hardly the case here. 2.2e-16 means 2.2X10-16 = 0.00000000000000022, which is small.

If you do not want equal proportions, you need to give a set of proportions for each cell. For example, genetic theory predicts that certain fruit flies will fall into four categories in proportions 9:3:3:1. Data showed counts of 59, 20, 11 and 10.

> chisq.test(c(59,20,11,10),p=c(9/16,3/16,3/16,1/16))

        Chi-squared test for given probabilities

data:  c(59, 20, 11, 10) 
X-squared = 5.6711, df = 3, p-value = 0.1288

We would not reject the theoretical hypothesis with these data.

You can make a cross-tabulation of two categorical variables with table and do a test of independence or homogeneity with chisq.test. (We return to the heart attack data.)

> table(SEX,DIED)
SEX    0    1
  F 4298  767
  M 7136  643
> chisq.test(table(SEX,DIED))

        Pearson's Chi-squared test with Yates' continuity correction

data:  table(SEX, DIED) 
X-squared = 147.7612, df = 1, p-value < 2.2e-16

We have seen this p-value before! It is probably the smallest non-zero number R can handle and hence not very accurate. However, p is definitely small! Hence we reject the hypothesis that the mortality rate is the same for men and women. Looking at the data, it is higher for men.

With 12,844 observations, getting the table is a lot more work than computing chi-square, and it is best to let the computer do it. If you have an existing table, R will analyze it -- but not without putting up a fight. You need to enter the table one row at a time and use rbind to combine the rows into a table. Here is some data on hepatitis C incidence (yes,no) and tattoo "status": from a parlor, elsewhere, and no tattoo.

> hep=rbind(c(17,35),c(8,53),c(22,491))
> hep
     [,1] [,2]
[1,]   17   35
[2,]    8   53
[3,]   22  491
> chisq.test(hep)

        Pearson's Chi-squared test

data:  hep 
X-squared = 57.9122, df = 2, p-value = 2.658e-13

Warning message:
Chi-squared approximation may be incorrect in: chisq.test(hep) 

The warning message probably refers to the small expected count the cell where the 8 appears. You could overcome that by pooling the two tattoo sources together. Either way, getting tattoos seems to greatly increase the risk of hepatitis C.

© 2006 Robert W. Hayden