Math 263, Section 001 and 003 - Excel/R Assignment 5
Last updated on December 3, 12:58PM.
In this assignment you will:
- Read data from a CSV file using R.
- Cross-tabulate the data using two factors.
- Conduct the \(\chi^2\)-test for various null hypotheses.
Software used
The statistical package R. Although RExcel may be used, the best way is to use R without RExcel.
The data file
The data file is Dataset5.csv. This file uses a semicolon as the column separator! You may need to specify this when your spreadsheet program reads the file. You can view it with Excel or with a text editor such as Notepad. There is an alternative version, Dataset5.txt, which can be viewed with a browser.
About the dataset
The dataset is a famous dataset from CMU, a survey of kids about their personal goals. The sample consisted of 478 subjects. Please read the description of the dataset at the original website.
Sample script
There is a sample script included which uses the \(\chi^2\)-test to try to answer the question of whether girls and boys have different personal goals. The script is here. You need to understand how the script works, and you need to be able to run it with R. The basic procedure for running a script with R is this:
- Download the script and save it in a local file on your computer.
- Start R.
- Type in the R console:
source("script.R", echo = TRUE)
You should replace "script.R" with the full path of your file on your computer, e.g.
C:\\tmp\\script.R
if you run R on Windows and the file has been saved in the folder
C:\tmp
Note the double backslashes, necessary when you write a path inside a quoted string in R under Windows. Alternatively, you can use the File menu to change the working folder to the folder where the file resides. If you succeed in doing so, you do not need to change the command above.
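The procedure above can be tried end to end without touching any real files. The following is only a minimal sketch: the file name demo_script.R and the two commands inside it are made up for illustration, and the script is written to R's temporary folder so that the example is self-contained.

```r
# Write a tiny, hypothetical script to a temporary folder.
script_path <- file.path(tempdir(), "demo_script.R")
writeLines(c("x <- c(2, 4, 6)",
             "m <- mean(x)"), script_path)

# Run the script by its full path; echo = TRUE prints each command as it runs.
source(script_path, echo = TRUE)

# Objects created by the script are now available in the workspace.
print(m)
```

On Windows, a path inside a quoted string can be written either with doubled backslashes, "C:\\tmp\\script.R", or with forward slashes, "C:/tmp/script.R".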
Answer the following questions using the \(\chi^2\)-test
Instructions for each question
Perform a sequence of \(\chi^2\)-tests, each with different data.
For each test, include the output of the command 'table', which creates the cross-tabulated data (2-way table). Alternatively, you may use the command 'xtabs', which uses the formula syntax, as in the examples below. The latter is a more powerful cross-tabulation command than the former.
For each test, include the output of the command 'chisq.test', which prints the \(\chi^2\)-statistic, degrees of freedom, and P-value.
For every question, formulate precisely the null and alternative hypotheses. State whether the null hypothesis is rejected, based on the test conducted. Use the 95% confidence level, i.e. reject the null hypothesis when the P-value is below 0.05.
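To see how the two cross-tabulation commands relate, here is a small sketch. The data frame 'survey' and its eight rows are invented for illustration and are not taken from Dataset5; only the calls to 'table' and 'xtabs' mirror what the instructions ask for.

```r
# Hypothetical answers from 8 students (not taken from Dataset5).
survey <- data.frame(
  Gender = c("boy", "girl", "girl", "boy", "girl", "boy", "girl", "boy"),
  Goals  = c("Sports", "Grades", "Popular", "Grades",
             "Grades", "Sports", "Popular", "Popular")
)

# 'table' takes the two factors themselves.
t1 <- table(survey$Gender, survey$Goals)

# 'xtabs' takes a formula and a data frame, and keeps the variable names.
t2 <- xtabs(~ Gender + Goals, data = survey)

print(t1)
print(t2)

# Both commands produce the same counts.
same_counts <- all(as.vector(t1) == as.vector(t2))
print(same_counts)
```

The printed 'xtabs' table is labelled with the variable names Gender and Goals, which makes the output easier to read when it is pasted into a report.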
Important!
Sometimes R will print this message:
Warning message:
In chisq.test(table(Age, Money)) :
Chi-squared approximation may be incorrect
This means that the \(\chi^2\) approximation may be unreliable
for this data because a "rule of thumb" is violated
(R issues the warning when some expected cell count is below 5).
In this case you can call chisq.test as follows:
chisq.test(table(Age, Money), simulate.p.value=TRUE, B=100000)
In this case, R will modify the method by
performing a Monte Carlo
simulation to find the P-value. The parameter
B sets the number of random tables generated in
the Monte Carlo simulation. It defaults
to 2000, but we may increase the accuracy
of the simulation by raising this value.
However, very large values will slow
down the computation, so keep B below a million.
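The following sketch shows both steps on a small table: first inspecting the expected counts that trigger the warning, then rerunning the test with a simulated P-value. The counts in 'tab' are invented for illustration (they are not from Dataset5), and the seed and the value of B are arbitrary choices.

```r
# Hypothetical 3 x 4 table of counts with several small cells (not from Dataset5).
tab <- matrix(c(3, 1, 4, 2,
                5, 9, 2, 6,
                5, 3, 5, 8),
              nrow = 3, byrow = TRUE,
              dimnames = list(Age = c("9", "10", "11"),
                              Money = c("1", "2", "3", "4")))

# The plain test still runs, but R warns; the expected counts show why.
plain <- suppressWarnings(chisq.test(tab))
print(round(plain$expected, 2))
small_cells <- sum(plain$expected < 5)
print(small_cells)

# Monte Carlo version: B random tables with the same row and column totals.
set.seed(263)  # arbitrary seed, only to make the run repeatable
mc <- chisq.test(tab, simulate.p.value = TRUE, B = 10000)
print(mc$p.value)
```

Note that the simulated test reports no degrees of freedom (df = NA), because the P-value no longer comes from the \(\chi^2\) distribution.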
The questions
- Is the opinion on the importance of money independent of gender?
- Is the opinion on the importance of money independent of school?
- Do girls and boys have the same attitude about the importance of good looks (being pretty/handsome)?
- Do the personal goals change with Grade (4, 5 or 6)?
- Are personal goals of students the same in urban and rural schools?
Your own question?
There are 55 possible questions to be asked. I suggest you investigate a question or two that interest you about this dataset. Extra credit may be awarded for a creative question/discussion.
Calculate marginal distributions
Calculate the distribution of boys and girls conditioned upon their attitude towards money, that is, the proportions of boys and girls within each of the 4 levels of Money. You should have 4 distributions, each consisting of two complementary probabilities.
In this part, you may follow the script presented in class: chi_alt.R
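As a rough illustration of what such a result looks like, here is a sketch with invented counts. The matrix 'counts' is hypothetical and not taken from Dataset5, and this is not the chi_alt.R script; it only shows how 'prop.table' with margin = 2 yields one distribution per column.

```r
# Hypothetical counts of Gender by Money rank (not from Dataset5).
counts <- matrix(c(20, 30, 40, 10,
                   30, 30, 20, 20),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Gender = c("boy", "girl"),
                                 Money = c("1", "2", "3", "4")))

# margin = 2 divides each column by its total, giving the distribution
# of Gender within each level of Money.
cond <- prop.table(counts, margin = 2)
print(round(cond, 2))

# Each of the 4 columns holds two complementary probabilities.
print(colSums(cond))
```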
An example - using 'xtabs' and 'chisq.test' together
A quick way to combine cross-tabulation of two (of the 11) variables with the \(\chi^2\)-test is presented below. Note that there are \[ {11 \choose 2} = 55 \] distinct ways to build a two-way table out of 11 variables by cross-tabulation. For example, let's decide whether the goals of students depend on the school. We do the following:
> chisq.test(print(xtabs(~ School + Goals)))
                      Goals
School                 Grades Popular Sports
  Brentwood Elementary     40      17     10
  Brentwood Middle         47      25     12
  Brown Middle             22      14     16
  Elm                       4      11      6
  Main                     45      12     11
  Portage                  30      20     11
  Ridge                    16      18     14
  Sand                     15       7      6
  Westdale Middle          28      17      4
        Pearson's Chi-squared test
data:  print(xtabs(~School + Goals))
X-squared = 34.5069, df = 16, p-value = 0.00464
Warning message:
In chisq.test(print(xtabs(~School + Goals))) :
  Chi-squared approximation may be incorrect
Note that we inserted 'print' between 'xtabs' (cross-tabulation function of R)
and 'chisq.test' (the \(\chi^2\) testing function of R). This has the
effect of printing the cross-tabulated data.
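The decision step for this example can also be written out in code. The sketch below retypes the School by Goals counts from the table printed above and tests the null hypothesis that Goals is independent of School; the names 'counts', 'alpha' and 'reject_null' are chosen here for illustration only.

```r
# Counts copied from the School x Goals table printed above.
counts <- matrix(c(40, 17, 10,
                   47, 25, 12,
                   22, 14, 16,
                    4, 11,  6,
                   45, 12, 11,
                   30, 20, 11,
                   16, 18, 14,
                   15,  7,  6,
                   28, 17,  4),
                 ncol = 3, byrow = TRUE,
                 dimnames = list(
                   School = c("Brentwood Elementary", "Brentwood Middle",
                              "Brown Middle", "Elm", "Main", "Portage",
                              "Ridge", "Sand", "Westdale Middle"),
                   Goals = c("Grades", "Popular", "Sports")))

# The warning about the approximation is expected here, so it is silenced.
result <- suppressWarnings(chisq.test(counts))

# Degrees of freedom are (rows - 1) * (columns - 1) = 8 * 2 = 16.
print(result$parameter)

# Decision at the 5% significance level (95% confidence).
alpha <- 0.05
reject_null <- result$p.value < alpha
print(reject_null)
```

Because the warning appears for this table, the conclusion is better confirmed by repeating the test with simulate.p.value=TRUE before it is reported.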
An example - computing a marginal distribution
This is how we can compute the distribution of gender within each school:
> round(prop.table(xtabs(~ School + Gender),1),2)
                      Gender
School                  boy girl
  Brentwood Elementary 0.63 0.37
  Brentwood Middle     0.56 0.44
  Brown Middle         0.50 0.50
  Elm                  0.24 0.76
  Main                 0.46 0.54
  Portage              0.43 0.57
  Ridge                0.48 0.52
  Sand                 0.43 0.57
  Westdale Middle      0.31 0.69
Here, we first cross-tabulate with 'xtabs', then we convert each row of counts to proportions with 'prop.table' (the second argument, 1, means "within each row"),
and finally we round the result to two decimal places with 'round', all in one line of code!
Troubleshooting
Below I address some issues that were reported to me.
The "/" in the header of Dataset5.txt causes an error
I cannot confirm that this is an issue, but if you suspect it is, edit Dataset5.txt after downloading and replace "/" in "Urban/Rural" with a period. I am able to do this:
> x <- read.table("Dataset5.txt", header=TRUE)
> names(x)
 [1] "Gender"      "Grade"       "Age"         "Race"        "Urban.Rural"
 [6] "School"      "Goals"       "Grades"      "Sports"      "Looks"
[11] "Money"
>
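For the semicolon-separated version of the file, a similar call might look like the sketch below. It assumes that Dataset5.csv has the same header as Dataset5.txt, which is not confirmed here; to stay self-contained it reads a small stand-in file, demo_semicolon.csv, whose two data rows are made up.

```r
# Two made-up rows in the column layout listed above (not real survey records).
csv_path <- file.path(tempdir(), "demo_semicolon.csv")
writeLines(c(
  "Gender;Grade;Age;Race;Urban/Rural;School;Goals;Grades;Sports;Looks;Money",
  "boy;5;11;White;Rural;Elm;Sports;1;2;4;3",
  "girl;6;12;Other;Urban;Main;Grades;2;1;3;4"
), csv_path)

# sep = ";" tells R that columns are separated by semicolons.
x <- read.csv(csv_path, sep = ";")

# By default R converts "Urban/Rural" into the valid name "Urban.Rural".
print(names(x))
print(dim(x))
```

With the real file, replacing csv_path by "Dataset5.csv" should be enough, provided the working directory is set to the folder that holds the file.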
Working directory
If you have problems reading data files or sourcing scripts, your working directory is probably not set to the location of the data/script files. You can use the command 'getwd' to find out the current working directory, and 'setwd' to set it. Here is an example:
> getwd()
[1] "/home/marek/public_html/math263/ExcelAssignments/Assignment5"
> setwd("/home/marek/Desktop")
> getwd()
[1] "/home/marek/Desktop"
>
This is done on Linux and would be similar on Mac.
On Windows, the location of the files would look like
this (depending on your Windows version/configuration)
C:/Documents And Settings/Marek/Desktop
or
C:/Users/Marek/Desktop
However, to avoid typing these typically long folder names,
you should use the R console menu ('File/Change dir' on Windows
and 'Misc/Change Working Directory' on Mac OS X).
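If you prefer to set the working directory from code, the following sketch shows one safe pattern: remember the old folder, switch, and switch back. It uses R's temporary folder as a stand-in for the folder that holds your files, and the file name demo.csv is made up.

```r
# Remember the current working directory so it can be restored.
old_dir <- getwd()

# Switch to a folder that is known to exist (R's temporary folder here;
# in practice this would be the folder holding the data and script files).
setwd(tempdir())
print(getwd())

# Files in the working directory can now be named without a full path.
writeLines("a;b", "demo.csv")
found <- file.exists("demo.csv")
print(found)

# Go back to where we started.
setwd(old_dir)
```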