# Math 263, Sections 001 and 003 - Excel/R Assignment 5

Last updated on December 3, 12:58PM.

## Excel/R Assignment 5

In this assignment you will:
• Read data from a CSV file using R.
• Cross-tabulate the data using two factors.
• Conduct the $$\chi^2$$-test for various null hypotheses.

## Software used

The statistical package R. RExcel may be used, but the recommended way is to use R on its own, without RExcel.

## The data file

The data file is Dataset5.csv. This file uses a semicolon as the column separator! You may need to specify this when your spreadsheet program reads the file. You can view it with Excel or with a text editor such as Notepad. There is an alternative version, Dataset5.txt, which can be viewed in a browser.
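As a minimal sketch of reading a semicolon-separated file in R, the snippet below writes a tiny stand-in file and reads it back with 'read.csv' and `sep = ";"`. The three data rows are made up for illustration and are not taken from Dataset5.csv; with the real data you would pass the path to Dataset5.csv instead of the temporary file.

```r
# Hypothetical stand-in for Dataset5.csv: three made-up rows, semicolon-separated.
tmp <- tempfile(fileext = ".csv")
writeLines(c("Gender;Grade;Goals",
             "boy;5;Sports",
             "girl;6;Grades",
             "girl;4;Popular"), tmp)

# sep = ";" tells R that columns are separated by semicolons, not commas.
x <- read.csv(tmp, sep = ";")
str(x)

# For the assignment, a call of this form should work once the working
# directory contains the data file (shown as a comment only):
# x <- read.csv("Dataset5.csv", sep = ";")
```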

## About the dataset

The dataset is a well-known one from CMU: a survey of children about their personal goals. The sample consisted of 478 subjects. Please read the description of the dataset at the original website.

## Sample script

A sample script is included. It uses the $$\chi^2$$-test to address the question of whether girls and boys have different personal goals. The script is here. You need to understand how the script works and you need to be able to run it with R. The basic procedure for running a script with R is this:
• Download the script and save it in a local file on your computer.
• Start R.
• Type in the R console:
	  source("script.R", echo = TRUE)

You should replace "script.R" with the full path of your file on your computer, e.g.
 C:\\tmp\\script.R
if you are running R on Windows and the file has been saved in the folder
 C:\tmp
Note the double backslashes, which are necessary when you write a Windows path in R (forward slashes, as in C:/tmp/script.R, also work). Alternatively, you can use the File menu to change the working folder to the folder where the file resides. If you do so, you do not need to change the command above.
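The procedure can be tried out without the assignment script. The sketch below creates a small throwaway script in a temporary folder (its two lines are hypothetical, chosen only for illustration) and runs it with 'source'.

```r
# Create a small throwaway script (hypothetical content) in a temporary folder.
script <- file.path(tempdir(), "script.R")
writeLines(c("answer <- 6 * 7", "print(answer)"), script)

# source() runs the file; echo = TRUE prints each command before its output.
source(script, echo = TRUE)

# On Windows, both spellings below name the same file (shown as comments only):
# source("C:\\tmp\\script.R", echo = TRUE)
# source("C:/tmp/script.R", echo = TRUE)
```

Building the path with 'file.path' avoids typing separators by hand, which sidesteps the backslash issue.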

## Answer the following questions using $$\chi^2$$-test

### Instructions for each question

Perform a sequence of $$\chi^2$$ tests, with different data.

For each test, include the output of the command 'table', which creates the cross-tabulated data (a 2-way table). Alternatively, you may use the command 'xtabs', which uses the formula syntax, as in the examples below. The latter is a more powerful cross-tabulation command than the former.

For each test, include the output of the command 'chisq.test', which prints the $$\chi^2$$-statistic, the degrees of freedom, and the P-value.

For every question, state the null and alternative hypotheses precisely. State whether the null hypothesis is rejected, based on the test conducted. Use the 95% confidence level, that is, reject the null hypothesis when the P-value is below 0.05.
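A minimal sketch of the decision step, assuming a small table of made-up counts (not taken from Dataset5.csv). The null hypothesis here is that the row and column factors are independent; the names `tab`, `res`, `alpha` and `reject_null` are chosen only for this illustration.

```r
# Made-up counts for illustration only (not from Dataset5.csv).
tab <- matrix(c(30, 20, 10,
                15, 25, 20),
              nrow = 2, byrow = TRUE,
              dimnames = list(Gender = c("boy", "girl"),
                              Goals  = c("Grades", "Popular", "Sports")))

res <- chisq.test(tab)
res$statistic   # the chi-squared statistic
res$parameter   # degrees of freedom: (2 - 1) * (3 - 1)
res$p.value     # P-value

# Decision rule at the 95% confidence level:
# reject the null hypothesis of independence when the P-value is below 0.05.
alpha <- 0.05
reject_null <- res$p.value < alpha
reject_null
```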

### Important!

Sometimes R will print this message:
	Warning message:
	In chisq.test(table(Age, Money)) :
	  Chi-squared approximation may be incorrect

This means that the $$\chi^2$$ approximation may be unreliable for this data because a "rule of thumb" is violated (typically, some expected cell counts are below 5). In this case you can call chisq.test as follows:
	chisq.test(table(Age, Money), simulate.p.value = TRUE, B = 100000)

In this case, R finds the P-value by a Monte Carlo simulation instead. The parameter B is the number of random tables (replicates) generated in the simulation. It defaults to 2000, and raising it increases the accuracy of the simulated P-value. However, very large values slow down the computation, so keep B below a million.
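The sketch below illustrates this on a small table of made-up counts (not from Dataset5.csv). It first inspects the expected counts that trigger the warning, then requests a simulated P-value. The seed value and the choice `B = 10000` are arbitrary assumptions made for this example.

```r
# Made-up table with small counts (illustration only, not from Dataset5.csv).
tab <- matrix(c(3, 1, 2,
                1, 4, 2,
                2, 1, 5),
              nrow = 3, byrow = TRUE)

# Expected counts under independence; values below 5 are what trigger
# the "approximation may be incorrect" warning.
expected <- suppressWarnings(chisq.test(tab))$expected
round(expected, 2)

set.seed(263)  # fix the random seed so the simulated P-value is reproducible
sim <- chisq.test(tab, simulate.p.value = TRUE, B = 10000)
sim$p.value
```

Without 'set.seed', the simulated P-value changes slightly from run to run; a larger B makes those fluctuations smaller.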

### The questions

• Is the opinion on the importance of money independent of gender?
• Is the opinion on the importance of money independent of school?
• Do girls and boys have the same attitude about the importance of good looks (being pretty/handsome)?
• Do the personal goals change with Grade (4, 5, or 6)?
• Are the personal goals of students the same in urban and rural schools?

### Your own question?

There are 55 possible questions to be asked, one for each pair of the 11 variables. I suggest you investigate a question or two that interests you about this dataset. Extra credit may be awarded for a creative question or discussion.
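The count of 55 can be checked in R, and the pairs themselves can be listed. This is a small sketch that uses the column names printed by 'names(x)' in the Troubleshooting section; the variable names `vars` and `var_pairs` are chosen only for this illustration.

```r
# Column names as printed by names(x) in the Troubleshooting section.
vars <- c("Gender", "Grade", "Age", "Race", "Urban.Rural", "School",
          "Goals", "Grades", "Sports", "Looks", "Money")

choose(length(vars), 2)       # number of unordered pairs of variables

var_pairs <- combn(vars, 2)   # 2-row matrix, one column per pair
head(t(var_pairs))            # first few pairs, one per row
```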

## Calculate marginal distributions

Calculate the distribution of boys and girls conditioned upon their attitude towards money, that is, the proportions of boys and girls within each level of Money. You should have 4 distributions, each consisting of two complementary probabilities.

In this part, you may follow the script presented in class: chi_alt.R

## An example - using 'xtabs' and 'chisq.test' together

A quick way to combine the cross-tabulation of two (of the 11) variables with the test is presented below. Note that there are ${11 \choose 2} = 55$ distinct ways to build a two-way table out of 11 variables by cross-tabulation. For example, let's decide whether the goals of students depend on the school. We do the following:
      > chisq.test(print(xtabs(~ School + Goals)))
                            Goals
      School                 Grades Popular Sports
        Brentwood Elementary     40      17     10
        Brentwood Middle         47      25     12
        Brown Middle             22      14     16
        Elm                       4      11      6
        Main                     45      12     11
        Portage                  30      20     11
        Ridge                    16      18     14
        Sand                     15       7      6
        Westdale Middle          28      17      4

              Pearson's Chi-squared test

      data:  print(xtabs(~School + Goals))
      X-squared = 34.5069, df = 16, p-value = 0.00464

      Warning message:
      In chisq.test(print(xtabs(~School + Goals))) :
        Chi-squared approximation may be incorrect

Note that we inserted 'print' between 'xtabs' (the cross-tabulation function of R) and 'chisq.test' (the $$\chi^2$$ testing function of R). This has the effect of printing the cross-tabulated data before the test is run on it.
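If you want to experiment with this example without the data file, the table can be entered by hand. The sketch below copies the counts from the School by Goals table above into a matrix (the name `counts` is chosen only for this illustration) and runs the same test on it.

```r
# Counts copied from the School x Goals table printed above.
counts <- matrix(c(40, 17, 10,
                   47, 25, 12,
                   22, 14, 16,
                    4, 11,  6,
                   45, 12, 11,
                   30, 20, 11,
                   16, 18, 14,
                   15,  7,  6,
                   28, 17,  4),
                 ncol = 3, byrow = TRUE,
                 dimnames = list(
                   School = c("Brentwood Elementary", "Brentwood Middle",
                              "Brown Middle", "Elm", "Main", "Portage",
                              "Ridge", "Sand", "Westdale Middle"),
                   Goals = c("Grades", "Popular", "Sports")))

sum(counts)                                  # total number of subjects
res <- suppressWarnings(chisq.test(counts))  # same test, warning silenced
res$statistic
res$parameter                                # (9 - 1) * (3 - 1) degrees of freedom
min(res$expected)                            # smallest expected cell count
```

The smallest expected count is below 5, which is the reason R prints the warning for this table.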

## An example - computing a marginal distribution

This is how we can compute the distribution of gender within each school:
      > round(prop.table(xtabs(~ School + Gender),1),2)
                            Gender
      School                  boy girl
        Brentwood Elementary 0.63 0.37
        Brentwood Middle     0.56 0.44
        Brown Middle         0.50 0.50
        Elm                  0.24 0.76
        Main                 0.46 0.54
        Portage              0.43 0.57
        Ridge                0.48 0.52
        Sand                 0.43 0.57
        Westdale Middle      0.31 0.69

Here, we first cross-tabulate with 'xtabs', then we convert the counts to proportions within each row with 'prop.table' (the second argument, 1, selects rows), and finally we round the result to two decimal places with 'round', all in one line of code!
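The second argument of 'prop.table' decides which totals the counts are divided by. A minimal sketch, using made-up counts for two hypothetical schools "A" and "B" (not from Dataset5.csv):

```r
# Made-up counts for illustration only (not from Dataset5.csv).
tab <- matrix(c(12,  8,
                 5, 15),
              nrow = 2, byrow = TRUE,
              dimnames = list(School = c("A", "B"),
                              Gender = c("boy", "girl")))

by_row <- prop.table(tab, 1)   # proportions within each school (rows sum to 1)
by_col <- prop.table(tab, 2)   # proportions within each gender (columns sum to 1)

round(by_row, 2)
rowSums(by_row)
colSums(by_col)
```

Checking that the rows (or columns) sum to 1 is a quick way to confirm that the proportions were taken in the intended direction.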

## Troubleshooting

Below I address some issues that were reported to me.

### The "/" in the header of Dataset5.txt causes error

I cannot confirm that this is an issue, but if you suspect it is, edit Dataset5.txt after downloading and replace "/" in "Urban/Rural" with a period. I am able to do this:
      > x <- read.table("Dataset5.txt", header=T)
      > names(x)
       [1] "Gender"      "Grade"       "Age"         "Race"        "Urban.Rural"
       [6] "School"      "Goals"       "Grades"      "Sports"      "Looks"
      [11] "Money"
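The period in "Urban.Rural" comes from R itself: by default 'read.table' converts column names into syntactically valid R names. The sketch below demonstrates this on a small stand-in file whose rows are made up for illustration (it is not Dataset5.txt).

```r
# Hypothetical small stand-in for Dataset5.txt with a "/" in the header.
tmp <- tempfile(fileext = ".txt")
writeLines(c("Gender Urban/Rural",
             "boy Rural",
             "girl Urban"), tmp)

# By default (check.names = TRUE) read.table makes the names syntactically
# valid, which turns "Urban/Rural" into "Urban.Rural".
x <- read.table(tmp, header = TRUE)
names(x)

# check.names = FALSE keeps the header exactly as written in the file.
y <- read.table(tmp, header = TRUE, check.names = FALSE)
names(y)
```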


### Working directory

If you have problems reading data files or sourcing scripts, your working directory is probably not set to the location of the data/script files. You can use the command 'getwd' to find out the current working directory, and 'setwd' to set it. Here is an example:
      > getwd()
      [1] "/home/marek/public_html/math263/ExcelAssignments/Assignment5"
      > setwd("/home/marek/Desktop")
      > getwd()
      [1] "/home/marek/Desktop"

This was done on Linux and would be similar on a Mac. On Windows, the location of the files would look like this (depending on your Windows version/configuration):
C:/Documents And Settings/Marek/Desktop

or
C:/Users/Marek/Desktop

However, to avoid typing these typically long folder names, you can use the R console menu ('File/Change dir' on Windows and 'Misc/Change Working Directory' on Mac OS X).
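A minimal sketch of working with the working directory from a script: it switches to a temporary folder, shows that relative file names now resolve there, and then restores the original folder. The file name "demo.R" is hypothetical and used only for this illustration.

```r
old <- getwd()                  # remember the current working directory
setwd(tempdir())                # switch to another folder (a temporary one here)

writeLines("x <- 1", "demo.R")  # create a file in the new working directory
file.exists("demo.R")           # relative names now resolve against that folder
list.files(pattern = "\\.R$")   # list the R scripts visible in this folder

setwd(old)                      # restore the original working directory
```

Running 'list.files()' after 'setwd' is a quick check that R can actually see the data file or script before you try to read or source it.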