Information about Midterm 3 (now updated for Fall 2012)
NOTE: Please make sure that the mathematical formulas display in standard mathematical notation. I have received a report that some of the computers on campus do not display the formulas, but rather the underlying code. This means that JavaScript is disabled in those browsers. Please enable JavaScript or use a different computer/browser.
General Information
- Midterm 3 will primarily cover topics in Chapters 4-7:
- Chapter 4: 4.3, 4.4, 4.5.
- Chapter 5: 5.1, 5.2.
- Chapter 6: 6.1, 6.2.
- Chapter 7: 7.1, 7.2.
- There is a big practice test (over 50 problems) for Midterm 3 and the Final Exam. Only a portion is relevant to Midterm 3. The problems in this practice test will be used as the basis for the real test. The answers to the practice test WILL NOT be posted. While practicing for Midterm 3, you should omit problems not yet covered in class or on WebAssign.
- The test will consist of exactly 30 multiple choice questions. Grading criteria are similar to other midterms.
- Please review the old tests (2010 Midterm 3 and a portion of 2010 Midterm 2, in this folder) for a sample of problems that may be similar to some test questions. Note that the material of the old Midterm 3 will not correspond exactly to this exam. Additional practice problems may be posted for the current test.
Specific information
This is a discussion of various topics related to the material of the course, and it is meant to complement the practice test in various ways. It is not meant to be a precise list of topics covered by the test.
Sample spaces
- Be able to identify sample spaces in specific examples.
- Know the distinction between sets, ordered pairs, and sequences.
- Use proper notation for sets, ordered pairs, and sequences:
- \(\{H,T\}\) or \(A=\{HH,HT,TH,TT\}\) denote sets because curly braces are used
- \((H,T)\) is an ordered pair; it may denote the sequence of two coin tosses, with the first resulting in a head and the second in a tail. Order matters!
- \((H,T,T,T,H)\) is a sequence of 5 elements. It may denote the sequence of results of 5 coin tosses, in which the first and last toss resulted in a head, and the remaining tosses resulted in a tail. Order matters!
- Note that sequences are often abbreviated. Thus \(HH\) really means \((H,H)\) and \(HHTTH\) means \((H,H,T,T,H)\) but not \(\{H,H,T,T,H\}\).
- Note that \(\{H,H,T,T,H\}=\{H,T\}\). Listing a set element multiple times is allowed, but it does not change the set.
- There are \(2^n\) sequences of \(n\) heads and tails. There are \(6^n\) sequences of \(n\) numbers 1-6.
- There are \[{7\choose3}=\frac{7\cdot 6\cdot 5}{3\cdot 2\cdot 1}=35\] 3-element subsets of numbers between 1 and 7. This is "7 choose 3". You could list them in a systematic manner: \[\{\{1,2,3\},\{1,2,4\},\ldots,\{5,6,7\}\}\] Note that there are \(7^3\) sequences of 3 numbers between 1 and 7: \[\{(1,1,1),(1,1,2),\ldots,(7,7,7)\}\] Thus, numbers in a sequence may repeat.
- An alternative notation for "n choose k" is: \[ _nC_k = { n\choose k } \] and is used in the PowerPoint slides. C stands for combinations. A combination is another name for a subset.
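To make the counting facts above concrete, here is a small computational check. (This is an optional illustration in Python; Python is not part of the course materials, and the code simply enumerates the objects being counted.)

```python
from itertools import combinations, product
from math import comb

# 3-element subsets of {1,...,7}: order does not matter, no repeats
subsets = list(combinations(range(1, 8), 3))
print(len(subsets), comb(7, 3))   # both print 35, i.e. "7 choose 3"

# sequences of length 3 with entries from 1..7: order matters, repeats allowed
sequences = list(product(range(1, 8), repeat=3))
print(len(sequences), 7 ** 3)     # both print 343
```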
Random variables
- A random variable is a function on the sample space, with numerical values. Thus, if \(S\) is a sample space, any function \(X:S\to \reals \) is a random variable on the sample space \(S\).
- Review the terminology of functions. Thus, if \(X:S\to U\) then \(S\) is called the domain of \(X\) and \(U\) is called the range of \(X\). Thus, for random variables the range is a subset of \(\reals \) the real numbers. Note: \[\reals =(-\infty,\infty)\] \[\reals = ]-\infty,\infty[ \] using two different conventions about denoting intervals.
- The set of values of a random variable is the set of numbers \(X(s)\) where \(s\) is an outcome (an element of \(S\)). It is denoted \(X(S)\) ("X of the entire sample space"). The set of values of a random variable is important: for instance, it determines whether the random variable is discrete or continuous.
- Know the definition of a discrete random variable: a function \(X:S\to\reals \) is a discrete random variable if the set of values of \(X\) is either finite or countably infinite. Know that the definition in the book is incorrect, as it disallows an infinite set of values. An example of a discrete random variable with an infinite set of values is \(X\), the number of tosses of a coin before you see the first head. Thus,
- If in the first toss you get \(Head\), \(X=0\).
- If the first toss is a \(Tail\) but you get a \(Head\) on the second toss, \(X=1\).
- Thus, \(X\) may be \(0, 1, 2, \ldots\).
- It may be shown (using, for instance, tree-based calculations), that \[ P(X=k) = \frac{1}{2^{k+1}} \] for \(k=0,1,2,\ldots\).
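If you want to check the formula \(P(X=k)=1/2^{k+1}\) empirically, here is a minimal simulation sketch. (Optional illustration in Python, not part of the course materials; it simply tosses a virtual fair coin many times.)

```python
import random
from collections import Counter

def tosses_before_first_head():
    """Toss a fair coin until the first head; return the number of tails seen."""
    k = 0
    while random.random() < 0.5:   # probability 1/2 of a tail on each toss
        k += 1
    return k

counts = Counter(tosses_before_first_head() for _ in range(100_000))
for k in range(5):
    print(k, counts[k] / 100_000, 1 / 2 ** (k + 1))  # empirical vs. theoretical
```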
- Know the definition of the probability function (the same as probability distribution) for discrete random variables. The probability function for a random variable assuming values \(x_1,x_2,\ldots,x_n\) with probabilities \(p_1,p_2,\ldots,p_n\) is: \[ f(x_i) = p_i \] The probability table of \(X\) is simply a table that lists the values of this function: \[ \begin{array}{c|ccccc} \hline\\ x & x_1 & x_2 & x_3 & \ldots & x_n \\ \hline\\ P(X=x) & p_1 & p_2 & p_3 & \ldots & p_n\\ \hline \end{array} \] Since the set of values may be infinite (but countable), the table may be infinite, and it may be necessary to give a formula rather than listing values, as in the previous example.
- Know the definition of a continuous random variable. The set of values of such a variable is an uncountable set such as an interval \([a,b]\) or \([0,\infty[\). The probability \[ P(X=x)=0 \] of a particular value \(x\) is always zero for a continuous random variable. Therefore, calculating probabilities related to continuous random variables requires knowledge of the probability density function ("density curve"). If the formula for the density curve is \(y = f(x) \) then the formula allowing us to compute the probability is: \[ P(a \le X \le b) = \int_{a}^b f(x)\,dx \] Thus, the probability is the area under the curve.
- Note that \[ P(a \le X \le b) = P(a < X \le b)= P(a \le X < b) = P(a < X < b) \] for continuous random variables. This is definitely not the case for discrete random variables (why?).
- Probability distribution functions, probability tables, cumulative distributions, expected value, variance, standard deviation. Note that the cumulative distribution function of a random variable \(X\) is defined by: \[ F(x) = P(X \le x) \] This formula is valid, regardless of whether \(X\) is discrete or continuous. Note that this is the left or lower tail of the density curve. However, the calculation is different in these two cases: \[ F(x) = \sum_{y\leq x} P(X=y) \] for discrete variables, where the summation is over \(y\) which \(X\) actually assumes. For continuous random variables, \[F(x) = \int_{-\infty}^x f(y) \,dy \] where \(f(x)\) is the "density curve". Thus, this is the area under the curve \(w=f(y)\) and to the left of the line \(y=x\). Note that Table A tabulates \(F(x)\) for the normal density curve: \[ f(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2} x^2} \]
- Calculation of the random variable expected value (also called the mean). Be familiar with the formula: \[ \expect X = \mu_X = \sum_{i=1}^n x_i p_i = \sum_{x} x\, P(X = x)\] where the summation extends over all values \(x\) of \(X\). Thus, \(n\) is the number of values, and it could be infinite. I gave the two notations, \(\mu_X\) and \(\expect X\) for the same thing, the mean (value) of a random variable. \(\expect\) is the first letter of "expected value" which is just a different name. The set of values could be an infinite, but countable set, such as natural numbers. In this case, the formula is an infinite series: \[ \expect X = \mu_X = \sum_{i=1}^\infty x_i p_i = \sum_{x} x\, P(X = x) \] i.e. formally \(n=\infty\). For example, the number of tosses before the first head is seen in a potentially infinite sequence of coin tosses is: \[ \sum_{k=0}^\infty k \cdot \frac{1}{2^{k+1}} = 1.\] Obtaining this result requires some familiarity with infinite series.
- Calculation of variance of a random variable (not sample variance!): \[ \sigma^2_X = \sum_{i=1}^\infty (x_i -\mu)^2 p_i = \sum_{x} (x-\mu)^2\, P(X = x) \] where \(\mu = \mu_X\).
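A quick way to check a by-hand computation of \(\mu_X\) and \(\sigma_X^2\) from a probability table is to code the two sums directly. The sketch below is an optional Python illustration; the probability table in it is made up purely for the example.

```python
# Hypothetical probability table: values x_i with probabilities p_i (must sum to 1)
values = [0, 1, 2, 3]
probs  = [0.1, 0.2, 0.3, 0.4]

mean = sum(x * p for x, p in zip(values, probs))                 # mu_X = sum x_i p_i
var  = sum((x - mean) ** 2 * p for x, p in zip(values, probs))   # sigma_X^2
sd   = var ** 0.5

print(mean, var, sd)   # 2.0 1.0 1.0
```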
- Know that \(X+Y\) is only defined if \(X\) and \(Y\) share the same sample space. In calculus, you have learnt that two functions may be added only if they have the same domain. This is the same principle.
- The rule for the mean of a sum of random variables: \[ \mu_{X+Y} = \mu_{X}+\mu_{Y} \] Know that \(X\) and \(Y\) do not have to be independent for this rule to hold.
- The rule for the variance of independent random variables: \[ \sigma_{X+Y}^2 = \sigma_X^2 + \sigma_Y^2 \] if variables \(X\) and \(Y\) are independent.
- Another variance formula: \[ \sigma_{a\,X+b}^2 = b^2\,\sigma_{X}^2\] where \(a\) and \(b\) are constants.
- Combined rule for expected values: \[ \mu_{a\,X+b} = a\,\mu_X + b \] where \(a\) and \(b\) are numbers.
- Combined rule for variances: \[ \sigma_{a\,X+b}^2 = a^2\,\sigma_X^2\] where \(a\) and \(b\) are numbers.
- Know the meaning of independence of random variables: for all \(x\) and \(y\) \[ P(X=x\;\text{and}\;Y=y) = P(X=x)\,P(Y=y) \]
- Familiarity with correlation for random variables: \[ \rho = \rho_{X,Y} = \corr (X,Y)=\expect \left(\left(\frac{X-\mu_X}{\sigma_X}\right) \left(\frac{Y-\mu_Y}{\sigma_Y}\right)\right) = \sum_{i=1}^m\sum_{j=1}^n \left(\frac{x_i-\mu_X}{\sigma_X}\right) \left(\frac{y_j-\mu_Y}{\sigma_Y}\right)\cdot p_{ij} \] where \[ p_{ij} = P(X=x_i\;\text{and}\;Y=y_j) \] (NOTE: Cannot use product rule because \(X\) and \(Y\) possibly are not independent; if they were, \(\rho_{X,Y}=0\)) Here, we avoided excessive subscripts by using common notation for the expected value of a variable: \[ \expect X = \mu_X \]
- Know the meaning of the formula: \[ \sigma_{X+Y}^2 = \sigma_{X}^2 + \sigma_{Y}^2 + 2\,\rho_{X,Y}\,\sigma_X\sigma_Y \]
- Note that the formula for \(\rho\) is not in the book,
although it appears to be used in several examples. Be able
to use it when \(X\) and \(Y\) assume very few values (2 or
3). Here is an example: For a certain basketball team it
was determined that the probability of scoring a hit in the
first free shot is \(0.8\). However, the probability of
scoring on the second shot depends on whether the player
scored on the first shot. If the player scored on the first
shot, the probability of scoring on the second shot remains
\(0.8\). However, if the player missed on the first shot,
the probability of scoring on the second shot is only
\(0.7\). This lower performance is known in sports as
"choking". Let \(X\) be a random variable which represents
the number of points scored on the first shot (0 or 1), and
let \(Y\) be the number of points scored on the second
shot. Please answer the following questions:
- Find the four probabilities: \[p_{ij} = P(X=i\;\text{and}\;Y=j) \] for \(i,j=0,1\). That is, fill out the following table: \[ \begin{array}{c|c|c} i\backslash j & 0 & 1 \\ \hline\hline\\ 0 & p_{00} & p_{01}\\ 1 & p_{10} & p_{11}\\ \hline\hline \end{array} \] The above table is the joint probability function or joint distribution of \(X\) and \(Y\).
- Find \(\mu_X\) and \(\mu_Y\).
- Find \(\sigma_X\) and \(\sigma_Y\).
- Find \(\corr (X,Y)\). Note that in this case: \[ \begin{eqnarray} \corr (X,Y) &=&\frac{1}{\sigma_X\,\sigma_Y} \times \\ &&\bigg[ (0-\mu_X)(0-\mu_Y)\,p_{00}\\ &+&(0-\mu_X)(1-\mu_Y)\,p_{01}\\ &+&(1-\mu_X)(0-\mu_Y)\,p_{10}\\ &+&(1-\mu_X)(1-\mu_Y)\,p_{11}\bigg] \end{eqnarray} \]
- Find the probability function (table) for the random variable \(Z\), the total score of two free throws:\(Z=X+Y\). Find \(\mu_Z\) and \(\sigma_Z\) directly.
- Verify the equation \[ \sigma_{Z}^2 = \sigma_{X}^2+\sigma_{Y}^2+2\rho_{X,Y}\,\sigma_X\,\sigma_Y \]
- Please do finish the calculations above!
- You may also use the above example to practice conditional probabilities and the Bayes' formula. For example, if it is known that a player scored on the second throw, what is the probability that she/he missed on the first throw?
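If you want to check your by-hand answers to the free-throw example, here is a sketch of the whole computation. (Optional Python illustration; the probabilities 0.8, 0.8 and 0.7 are taken from the example above, and the last line answers the Bayes question about missing the first shot given a score on the second.)

```python
# Joint distribution of X (first shot) and Y (second shot) from the example:
# P(X=1)=0.8; P(Y=1|X=1)=0.8; P(Y=1|X=0)=0.7.
p = {(1, 1): 0.8 * 0.8,   # scored both
     (1, 0): 0.8 * 0.2,   # scored first, missed second
     (0, 1): 0.2 * 0.7,   # missed first, scored second
     (0, 0): 0.2 * 0.3}   # missed both

mu_X = sum(x * q for (x, y), q in p.items())
mu_Y = sum(y * q for (x, y), q in p.items())
var_X = sum((x - mu_X) ** 2 * q for (x, y), q in p.items())
var_Y = sum((y - mu_Y) ** 2 * q for (x, y), q in p.items())
sd_X, sd_Y = var_X ** 0.5, var_Y ** 0.5

# Correlation via the double sum over the joint table
rho = sum((x - mu_X) * (y - mu_Y) * q for (x, y), q in p.items()) / (sd_X * sd_Y)

# Distribution of Z = X + Y, and its mean and variance computed directly
pZ = {z: sum(q for (x, y), q in p.items() if x + y == z) for z in (0, 1, 2)}
mu_Z = sum(z * q for z, q in pZ.items())
var_Z = sum((z - mu_Z) ** 2 * q for z, q in pZ.items())

print(mu_X, mu_Y, sd_X, sd_Y, rho)
print(var_Z, var_X + var_Y + 2 * rho * sd_X * sd_Y)   # the two numbers agree

# Bayes: probability the player missed the first shot given a score on the second
print(p[(0, 1)] / (p[(0, 1)] + p[(1, 1)]))
```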
- Know that independence of random variables \(X\) and \(Y\) implies \(\corr (X,Y)=0\). Know that \(\corr (X,Y)=0 \) does not imply independence of \(X\) and \(Y\). However, \(\corr (X,Y)=\pm 1\) implies that \(Y=a\,X+b\) for some constants \(a\) and \(b\). Moreover the sign of \(a\) is the same as the sign of \(\corr (X,Y)\). This is perfect linear dependence. NOTE: A similar result was true for sample correlations.
- Note that \(\corr (X,Y)\) is only defined when \(\sigma_X>0\) and \(\sigma_Y>0\). In particular, \(X\) and \(Y\) must assume at least two different values.
- Know the distinction between the mean \(\mu_X\) of a random variable and the sample mean \[\bar{x}=\frac{1}{n}\sum_{i=1}^n x_i.\] The former is a parameter of the population, and the latter is a statistic (a property of the sample).
- Similarly, the variance \(\sigma_X^2\) is a parameter, while the sample variance \[s_x^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i -\bar{x})^2 \] is a statistic.
- Also, the correlation \(\rho_{X,Y}\) is a parameter, while the sample correlation \(r=r_{xy}\) is a statistic. Recall the formula for the sample correlation: \[ r=r_{xy}= \frac{1}{n-1}\sum_{i=1}^n \frac{x_i-\bar{x}}{s_x}\frac{y_i-\bar{y}}{s_y} \] Compare with the formula for \(\corr (X,Y)\), which involves double summation.
Conditional probability
- Know the meaning of \( P(A\,|\,B)\)
- Know the formula: \[ P(A\,|\,B) = \frac{P(A\cap B)}{P(B)} \]
- Know the rules of probability for conditional probabilities. The function \( P'(A) = P(A|B) \) satisfies all Probability Rules for fixed \(B\). For example, \[ P(A\cup C|B) = P(A|B) + P(C|B) - P(A\cap C|B) \] Thus, if you learned a rule for ordinary, non-conditional probability, there is a corresponding rule for the conditional probability.
- Know the Law of Alternatives also known as the Total Probability Formula. Let \(C_1,C_2,\ldots,C_n\) be mutually disjoint events ("causes"): \[C_i\cap C_j = \emptyset\quad\text{when \(i\neq j\)}\] and exhaustive events: \[ C_1\cup C_2\cup \ldots \cup C_n = S \] Then for every event \(A\) ("consequence"): \[ P(A) = \sum_{i=1}^n P(A|C_i)\,P(C_i) \]
- Know the Bayes' Formula: \[ P(A|B) = \frac{P(B|A)P(A)}{P(B)} \]
- An alternative form of the Bayes' Formula for the probability of the cause, given a known consequence: \[ P(C_i|A) = \frac{P(A|C_i) P(C_i)}{\sum_{j=1}^nP(A|C_j) P(C_j)} \]
- Know how to apply Bayes' Formula to common examples discussed by the book and slides.
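As a small illustration of the Law of Alternatives and Bayes' Formula, here is a sketch with made-up numbers (three hypothetical causes; optional Python illustration, not an example from the book):

```python
# Hypothetical causes C1, C2, C3 with prior probabilities summing to 1,
# and the conditional probabilities P(A | C_i) of a consequence A.
prior = [0.5, 0.3, 0.2]          # P(C_i)
cond  = [0.10, 0.40, 0.80]       # P(A | C_i)

# Law of Alternatives (Total Probability Formula): P(A) = sum P(A|C_i) P(C_i)
p_A = sum(pa * pc for pa, pc in zip(cond, prior))

# Bayes' Formula for each cause: P(C_i | A) = P(A|C_i) P(C_i) / P(A)
posterior = [pa * pc / p_A for pa, pc in zip(cond, prior)]

print(p_A)          # 0.33
print(posterior)    # the posteriors sum to 1
```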
- You may find it useful to read the following article on Bayes' Theorem
- The Monty Hall Problem provides an interesting example of an application of conditional probability. This example is often used in job interviews.
A note on notation
I will use the notation \[ \expect{X} = \mu_X \] for the expected value (mean) of a random variable \(X\). Note that in some browsers the bar in \(\bar{X}\) will not show if \(\bar{X}\) is in the subscript. If you don't see a bar below, you should watch out for this problem: \[ \expect{\bar{x}} = \mu_{\bar{x}} \] Similarly, we will use two notations as synonymous \[ \var{X} = \sigma^2_{X} \] for the variance of the random variable \(X\).
List of Chapter 5 topics covered
The binomial distribution (Section 5.1)
- Be able to calculate probability from the formula for \( P(X=k)\) assuming \(B(n,p)\). \[ P(X=k) = {n \choose k} p^k (1-p)^{n-k} \]
- Know the mean and variance formulas. If \(X\) assumes values \(x_1,x_2,\ldots,x_n\) with probabilities \(p_1,p_2,\ldots,p_n\) (\( p_k = P(X=x_k) \)) then the mean and variance are defined by: \[ \mu_X = \sum_{i=1}^n x_i\, p_i \] \[ \sigma_X^2 = \sum_{i=1}^n (x_i-\mu_X)^2\,p_i \]
- Know when to apply (drawing with vs. without replacement).
- Ability to use normal approximation: if \(X\) has binomial distribution \(B(n, p)\) then \[ \mu_X = n\cdot p \] \[ \sigma_X = \sqrt{n\cdot p \cdot (1-p)} \] \[ P\left(a \le \frac{X-\mu_X}{\sigma_X} < b \right) \approx F(b) - F(a) \] where \(F(x)\) is the c.d.f. of N(0,1).
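Here is a sketch comparing the exact binomial probability with the normal approximation. (Optional Python illustration; \(n\) and \(p\) are made up, and the scipy library is assumed for the normal c.d.f.)

```python
from math import comb, sqrt
from scipy.stats import norm

n, p = 100, 0.3
mu, sigma = n * p, sqrt(n * p * (1 - p))

# Exact probability P(25 <= X <= 35) from the binomial formula
exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(25, 36))

# Normal approximation: P(a <= (X - mu)/sigma < b) ~ F(b) - F(a)
a, b = (25 - mu) / sigma, (35 - mu) / sigma
approx = norm.cdf(b) - norm.cdf(a)

print(exact, approx)   # the two values are close for large n
```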
Proportions \(\hat{p}\)
- Know the meaning of a population with a fraction \(p\) having a binary characteristic (is a person male or not, is the person a Republican or not, etc.).
- The proportion \(\hat{p}\) represents the fraction of a sample that has the characteristic.
- Understand the usefulness of the special random variable \[ X(u) = \begin{cases} 1 &\text{if \(u\) has the characteristic,} \\ 0 &\text{otherwise.} \end{cases} \]
- Understand perfectly that \(X\) above satisfies these two equations: \[ \mu_{X} = p \] \[ \sigma_{X}^2 = p(1-p) \]
- Know the meaning of the formula \[\hat{p} = \frac{X}{n} = \frac{1}{n}\left(\sum_{i=1}^n X_i\right)\] where \(X\) is \(B(n,p)\)-distributed and \(n\) is the sample size and \(X_i\) is a Bernoulli trial: a random variable that assumes values \(\{0,1\}\) and \(P(X_i=1) = p \). Know that \(X_i\) are independent.
- Understand the following equations perfectly: \[ \mu_{\hat{p}}=\mu_{\bar{X}} = \mu_{X_1} = p\] \[ \sigma_{\hat{p}}^2=\sigma_{\bar{X}}^2 = \frac{1}{n^2}\sum_{i=1}^n\sigma_{X_i}^2 = \frac{1}{n^2}\,n\,\sigma_{X_1}^2=\frac{p(1-p)}{n} \] \[ \sigma_{X_i}^2=\sigma_{X_1}^2 = p(1-p) \]
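The equations for \(\mu_{\hat{p}}\) and \(\sigma_{\hat{p}}^2\) can also be checked by simulation. A minimal sketch (optional Python illustration; the values of \(n\) and \(p\) are made up):

```python
import random

n, p, reps = 50, 0.3, 20_000

# Draw many samples of n Bernoulli trials and record the proportion p-hat in each
phats = [sum(random.random() < p for _ in range(n)) / n for _ in range(reps)]

mean_phat = sum(phats) / reps
var_phat = sum((x - mean_phat) ** 2 for x in phats) / reps

print(mean_phat, p)                 # close to p
print(var_phat, p * (1 - p) / n)    # close to p(1-p)/n
```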
Chapter 6 and Chapter 7
The main difference between Chapter 6 and Chapter 7 is that in Chapter 6 the standard deviation is given (somewhat mysteriously, it is known), while in Chapter 7 we estimate it using the sample variance. This results in the change from the z-score for \(\bar{x}\), \[ z = \frac{\bar{x}-\mu}{\sigma/\sqrt{n}}, \] whose sampling distribution is normal, to the t-statistic \[ t = \frac{\bar{x}-\mu}{SEM} \] where \[ SEM = \frac{s}{\sqrt{n}} \] is the standard error of the mean and is a statistic.
Sampling distribution
- The term sampling distribution refers to the probability distribution of any statistic (a statistic is a random variable!). Often, the statistic is \(\bar{x}\), as the primary goal of many statistical methods is to estimate the mean \(\mu\) of the population, and \(\bar{x}\) is an unbiased estimator of \(\mu\), which, by definition, is expressed by the equation: \[ \mu_{\bar{x}} = \mu. \] In plain English: the expected value of the sample mean is the population mean.
- The expected value and variance of the sample mean. Thus, a sample consists of \(n\) individuals. Moreover:
- A sample is either an SRS, or otherwise guaranteed to be random.
- This gives rise to \(n\) random variables \(x_1,x_2,\ldots,x_n\). I would prefer capitals, to be consistent with Chapter 4, in which random variables are denoted by capital letters. However, in most of the book, the notation uses lower case. Thus, \(x_i\) represents a measurement of a variable (such as height, weight, etc.) on the \(i\)-th individual of the sample. Hence,
- \(x_i\) has the same distribution as the entire population.
- Random variables \(x_1,x_2,\ldots, x_n\) are assumed to be independent.
- For the curious: the precise meaning of independence of random variables \(X_1,X_2,\ldots,X_n\) is this. For every event \(A\) expressed as a collection of inequalities \[ a_1 < X_1 \leq b_1 \;\mathrm{and}\; a_2 < X_2 \leq b_2 \;\mathrm{and}\;\ldots\;\mathrm{and}\; a_n < X_n \leq b_n \] where \(a_i, b_i\) are arbitrary numbers, the probability \(P(A)\) is the product of probabilities \[ P(a_1 < X_1\leq b_1) \cdot P(a_2 < X_2\leq b_2) \cdot \ldots \cdot P(a_n < X_n\leq b_n)\] This kind of independence is called joint independence of random variables \(X_i\). There is also pairwise independence, which requires that pairs \(X_i,X_j\) be (jointly) independent for each pair \((i,j)\) where \(1\leq i\neq j \leq n\).
- Since \(x_i\) is a random variable, it is a function \(x_i: S\to \reals\) on a sample space. The sample space is hardly ever specified, but be clear that it can be precisely defined. Let the population sample space be \(S=S_{population}\). That is, this is the sample space consisting of all outcomes (individuals). Since a sample consists of \(n\) individuals, it is an \(n\)-tuple. Thus \[ S_{sample} = \{ (s_1,s_2,\ldots,s_n) \;\text{where}\; s_i \in S_{population} \}. \]
Example 1: If we draw an SRS of 3 people (with replacement) out of 100, the sample space \(S_{sample}\) consists of \(100^3\) (a million) triples (3-tuples) \[ (person_1,person_2,person_3). \] Since we draw with replacement, these persons do not have to be distinct; that is, we may draw the same person twice.
Example 2: If the situation is as in Example 1 but if we draw without replacement, the persons would have to be distinct. In this case, the number of elements in \(S_{sample}\) is \[ 100\cdot 99 \cdot 98 = \frac{100!}{(100-3)!}\] which is slightly smaller than a million, but still a very large number: \[ 970200\]
Thus, the sample space is large even in relatively simple situations, such as picking 3 candidates for a job, when we have 100 applicants.
- Here is how to count the number of outcomes of drawing \(k\) objects out of \(n\) objects (\(k\leq n\)). The number \[ _nP_k = n\cdot(n-1)\cdot\cdots\cdot(n-k+1)=\frac{n!}{(n-k)!} \] is the number of all sequences (variations, or permutations) of \(k\) elements of an \(n\)-element set. The difference between permutations and combinations is that the order matters for permutations but not for combinations. We have a formula: \[ _nP_k = k!\,_nC_k \] which expresses the fact that we can take a combination of \(k\) elements and permute its elements to form all variations of length \(k\). There are \(k!\) ways to do so, so the number of variations is \(k!\) times larger than the number of combinations of the same length.
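The relation \(_nP_k = k!\,_nC_k\) and the count \(970200\) from Example 2 are easy to check numerically (optional Python illustration):

```python
from math import comb, perm, factorial

n, k = 100, 3
print(perm(n, k))                                # 970200 sequences (order matters)
print(comb(n, k))                                # 161700 subsets (order ignored)
print(perm(n, k) == factorial(k) * comb(n, k))   # True: nPk = k! * nCk
```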
- Clarity about the sample mean vs. the mean (expected value). Be clear why the sample mean is a random variable while the mean is not (it is a parameter of the population), and why the sample variance is a random variable while the variance is not. Be clear why it makes sense to ask for the expected value of the sample variance. The formula for the sample mean: \[ \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i \] Thus, \(\bar{x}\) is the arithmetic mean (average) of \(n\) random variables \(x_i\). Similarly, the sample variance is \[ s^2 = \frac{1}{n-1}\sum_{i=1}^n \left(x_i - \bar{x}\right)^2 \] and it depends on the random variables \(x_1,x_2,\ldots,x_n\). Note that it also involves a random variable (statistic) \( \bar{x} \), but this random variable is expressed in terms of the \(x_i\), so it introduces no new dependency.
- Understand what it means that the sample mean is an unbiased estimator of the expected value. This means that: \[ \expect{\bar{x}}=\mu_{\bar{x}} = \mu_{x_1} = \mu \] where \(\mu\) always means the expected value (synonymous with mean, but not sample mean) of the underlying variable such as height, weight, etc. Note that \[ \mu = \sum_{u} u \cdot P(x_1 = u) \] where the summation extends over all \(u\) which are possible results of measuring the underlying variable (height, weight, etc.). Note that this applies to discrete random variables. For continuous random variables the analogous formula is: \[ \mu = \int_{-\infty}^\infty x\cdot f(x)\,dx\] where \(f(x)\) is the probability density function of the underlying variable (height, weight, etc.). Note that \(f(x)\,\Delta x\) represents the probability of the variable assuming a value in a small interval \((x,x+\Delta x)\). Thus, the integral is the limit value, as \(\Delta x\to 0\), of the Riemann sums: \[ \sum_{i} u_i \cdot P(u_i < x_1 < u_i + \Delta x) = \sum_{i} u_i f(u_i)\Delta x \] where the sum is over evenly and densely spaced numbers \(u_i\) spread across all possible values of the random variable, for instance \(u_i = i \cdot \Delta x\) where \(i\) assumes all integer values (both positive and negative). Thus, the Riemann sum may involve an infinite number of terms, and thus is an infinite series. Note that this cannot be avoided for normally distributed populations, because all real values are assumed. Thus, computing \(\mu\) often involves infinite summation or improper integrals.
- Note the main point of the above discussion: \(\bar{x}\) and \(\mu\) are computed in a completely different manner. The former requires knowing the values \(x_i\) computed from a random sample, and thus is a statistic. The latter requires the knowledge of exact probabilities or the exact probability density function for the entire population. The calculation of \(\mu\) is often through a theoretical reasoning involving higher math, such as finding a sum of an (infinite) series or (improper) integral.
- Similarly, note the difference between sample variance \(s^2\) and variance of the population \(\sigma^2\). Here are the contrasting formulas: \[ s^2 = \frac{1}{n-1}\sum_{i=1}^n \left(x_i - \bar{x}\right)^2 \] \[ \sigma^2 = \sum_{u} (u-\mu)^2 P(x_1 = u) \] where the summation extends over all values of the random variable \(x_1\). As you may have seen, or guessed, for continuous variables: \[ \sigma^2 = \int_{-\infty}^\infty (x-\mu)^2 f(x)\,dx \] where \(f(x)\) is the probability density function of \(x_1\).
- The fact that we chose \(x_1\) in the above formulas, i.e. the measurement of the underlying variable (height, weight, etc.) on the first individual of the sample drawn, is not important. Each of the random variables \(x_i\) with \(1\leq i \leq n\) can be used. In particular, these variables are identically distributed, that is, for each real \(t\) and \(1\leq i,j \leq n\): \[ P(x_i \le t) = P(x_j \le t).\] In addition, the \(x_i\) are jointly independent. Such a collection of random variables is called IID (independent, identically distributed, or "eye-eye-dee").
- The formula for the variance for the normal distribution \(N(\mu,\sigma)\) relies upon the fact that \[ \frac{1}{\sqrt{2\,\pi}\sigma}\,\int_{-\infty}^\infty (x-\mu)^2\,e^{-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}}\,dx = \sigma^2\] for all \(\mu\) and \(\sigma>0\). This is a calculus fact as much as it is a probability fact, and is typically taught in an advanced vector calculus class, such as Math 223 at the University of Arizona.
- Understand what it means that the sample variance is an unbiased estimator of the variance. By definition, this means: \[ \expect{s^2} = \mu_{s^2} = \sigma^2 \] That is, the expected value of the sample variance is the population variance (compare with \( \expect{\bar{x}} = \mu_{\bar{x}} = \mu \), which says that the expected value of the sample mean is the population mean). Note that this would not be true if we had the coefficient \(1/n\) in the expression for \(s^2\). This is perhaps the most important reason to have a coefficient of \(1/(n-1)\) in the formula for \(s^2\).
- Note that the sample standard deviation \[ s=\sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_i-\bar{x})^2}\] is not an unbiased estimator of \(\sigma\), the population standard deviation. For a particular distribution, the normal distribution, the unbiased estimator of \(\sigma\) is denoted by \(\hat{\sigma}\) and it may be found in this Wikipedia article along with full discussion of this topic. Note that \(s\) is asymptotically unbiased, that is, the bias disappears for large samples. The bias simply means that \(s\) will tend to systematically underestimate \(\sigma\). One can use the following estimator: \[ \hat{\sigma}=\sqrt{\frac{1}{n-0.5}\sum_{i=1}^n (x_i-\bar{x})^2}\] which typically has smaller bias than \(s\) (rule of thumb).
- Ability to calculate the expected value of the sample mean and variance. That is, \[ \expect{\bar{x}} = \mu_{\bar{x}} = \mu \] However, please note the distinction between the two quantities: the expected value of the sample variance \[ \expect{s^2} = \sigma^2 \] and the variance of the sample mean \(\sigma^2_{\bar{x}}\). We recall this derivation: \[ \sigma^2_{\bar{x}} = \var \left(\frac{1}{n}\sum_{i=1}^n x_i \right) = \left( \frac{1}{n}\right)^2 \sum_{i=1}^n\var(x_i) = \frac{1}{n^2} n\,\var(x_1) = \frac{1}{n^2} n\,\sigma^2 = \frac{\sigma^2}{n}\] which is one of the most important derivations in statistics, and yet simple enough to understand. It is based on known properties of the variance. Thus, the standard deviation of the sample mean \(\bar{x}\) is: \[\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \] Note that this is a parameter of the population. In contrast, the sample mean \(\bar{x}\) is a statistic, and also a random variable.
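The facts \(\expect{\bar{x}}=\mu\), \(\expect{s^2}=\sigma^2\) and \(\sigma_{\bar{x}}=\sigma/\sqrt{n}\) can all be checked with a short simulation. Here is a sketch (optional Python illustration; a normal population with made-up \(\mu\) and \(\sigma\) is used only as an example):

```python
import random
import statistics

mu, sigma, n, reps = 10.0, 2.0, 25, 20_000

xbars, s2s = [], []
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    xbars.append(statistics.mean(sample))
    s2s.append(statistics.variance(sample))    # uses the 1/(n-1) definition

print(statistics.mean(xbars), mu)                    # E[x-bar] is close to mu
print(statistics.mean(s2s), sigma ** 2)              # E[s^2] is close to sigma^2
print(statistics.pstdev(xbars), sigma / n ** 0.5)    # sd of x-bar ~ sigma/sqrt(n)
```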
- Note that WebAssign sometimes uses the notation \(s_{\bar{x}}\). This notation is not used in the book or on the PowerPoint slides, and is confusing. There is no such thing as "sample standard deviation of the sample mean" in any context discussed in this course. Typically, the confusion can be resolved by assuming that this notation means SEM, or equivalently, \(s/\sqrt{n}\).
- Remember: Every statistic is a random variable. It is a special random variable which is defined on the sample space \(S_{sample}\), discussed elsewhere in these notes.
- Note that we need to approximate \(\sigma_{\bar{x}}\), the standard error of the mean (SEM), in applications such as the single-sample t-test. The idea of the t-statistic comes from the z-score for the sample mean (the random variable \(\bar{x}\)): \[ z = \frac{\bar{x}-\mu_{\bar{x}}}{\sigma_{\bar{x}}} = \frac{\bar{x}-\mu}{\frac{\sigma}{\sqrt{n}}}\] The main idea of approximation in statistics is this: we replace a population parameter with its estimator, preferably unbiased, or asymptotically unbiased. This immediately leads to the t-statistic, derived from the z-statistic by replacing the unknown population parameter \(\sigma\) with the statistic \(s\). This leads to the formula \[ t = \frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}} \]
- Note that SEM (standard error of the mean) is defined as follows: \[ SEM = \frac{s}{\sqrt{n}} \] Thus, SEM is a statistic. Using SEM, we may write the t-statistic as: \[ t = \frac{\bar{x}-\mu_{\bar{x}}}{SEM} \]
- Be clear on how sample size affects the sampling distribution, in particular, its expected value and variance. Thus, the mean of the sampling distribution is the mean of the population independently of the sample size. The spread of the sampling distribution decreases with sample size. For a sample of size \(n\) it is exactly \(\sqrt{n}\) times smaller than the spread of the population distribution.
Confidence intervals
- Be clear on what a confidence interval is and is NOT. Be clear about the implications.
- Thus, the confidence interval is \( (\bar{x}-m, \bar{x}+m) \) for a single sample \((x_1,x_2,\ldots,x_n)\), where \(m\) is the margin of error. The true population mean \(\mu\) belongs to the confidence interval with probability equal to the confidence level \(C\). Thus, if we draw new samples over and over again (with replacement) and \(C=.95\) (95% confidence interval), then 95% of the time the true population mean \(\mu\) will be in the confidence interval.
- It is not true that \(\bar{x}\), the sample mean of a sample belongs to the confidence interval with probability \(C\) - the confidence level. Note that a particular confidence interval is determined from a single sample, and thus may not be quite typical: its center may be quite far from the population mean. Thus, a particular confidence interval does not provide any information about where sample means of repeatedly drawn samples will fall.
- Be able to calculate when the variance of the population is known (Chapter 6) and unknown (Chapter 7).
- Understand the confidence level \(C\).
- Know the margin of error \(m\) and the three formulas for the margin of error in different contexts.
- The margin of error for a single sample, when \(\sigma\) is known, is \[ m = z^* \frac{\sigma}{\sqrt{n}} \] where \(z^*\) is the z-score for which the two-tailed probability is \(1-C\), where \(C\) is the confidence level. Alternatively, if we use Table A, we need to look up the z-value (inverse lookup) for \[ p = \frac{1-C}{2} \] because Table A lists the probability of the left tail. After that we need to flip the sign: \[ z^* = -z \]
- The margin of error for a single sample, when \(\sigma\) is unknown, is \[ m = t^* \frac{s}{\sqrt{n}} \] where \(t^*\) is the \(t\) for which the two-tailed probability of the \(t\)-distribution with degrees of freedom \(df=n-1\) is \(1-C\), where \(C\) is the confidence level. Alternatively, if we use Table D, \(t^*\) is the value of \(t\) for which the confidence level is \(C\) (the confidence level is in the bottom row). Note that this is an inverse lookup (the value of \(t\) for a given value of \(P\)). Because Table D lists the RIGHT tail, we do not need to flip the sign of the looked-up value of \(t\): \[t^* = t \]
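Software can replace the inverse lookups in Table A and Table D. Here is a sketch that returns \(z^*\) and \(t^*\) directly (optional Python illustration; the scipy library is assumed, and the sample size is made up):

```python
from scipy.stats import norm, t

C = 0.95              # confidence level
alpha = 1 - C

# z*: flip the sign of the z-value whose left-tail probability is (1-C)/2,
# or equivalently take the (1 - alpha/2) quantile directly.
z_star = -norm.ppf(alpha / 2)
print(z_star, norm.ppf(1 - alpha / 2))   # both about 1.96

# t*: same idea, for the t-distribution with df = n - 1
n = 15
t_star = t.ppf(1 - alpha / 2, df=n - 1)
print(t_star)                            # about 2.14 for df = 14
```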
- Understand the difference between single-sample and two-sample confidence intervals and hypothesis testing. Thus, if we have two samples \[ x_1 = (x_{11},x_{12},\ldots, x_{1n_1}) \] of size \(n_1\) and a second sample \[ x_2 = (x_{21},x_{22},\ldots, x_{2n_2}) \] of size \(n_2\) then we have two sample means: \[ \bar{x}_1 = \frac{1}{n_1} \sum_{i=1}^{n_1} x_{1i} \] and \[ \bar{x}_2 = \frac{1}{n_2} \sum_{i=1}^{n_2} x_{2i}. \] Therefore, we have the difference of the means \[ D = \bar{x}_2 - \bar{x}_1. \] \(D\) is a statistic and a random variable. The variance of the difference: \[ \sigma^2_D = \sigma^2_{\bar{x}_1} + \sigma^2_{\bar{x}_2} = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2} \] where \(\sigma_1^2\) and \(\sigma_2^2\) are the variances of the two populations from which samples \(x_1\) and \(x_2\) come. (Note that if \(x_1\) and \(x_2\) are two SRSs drawn from the same population then \(\sigma_1 = \sigma_2\).) The addition of the variances is due to the independence of \(\bar{x}_1\) and \(\bar{x}_2\). Note that this independence is a consequence of the independence between the two groups of random variables \(x_{1i}\) and \(x_{2j}\). Since only the first group contributes to \(\bar{x}_1\) and only the second group to \(\bar{x}_2\), this yields the independence of \(\bar{x}_1\) and \(\bar{x}_2\). While intuitively clear, the full derivation of this property is not given here. Thus we have the formula \[ \sigma_D = \sigma_{\bar{x}_2-\bar{x}_1} = \sqrt{\frac{\sigma^2_1}{n_1} + \frac{\sigma^2_2}{n_2}} \]
- The two-sample z-statistic is the z-score for the random variable \(D\). Thus \[ z = \frac{D - \mu_D}{\sigma_D} = \frac{(\bar{x}_2-\bar{x}_1) - (\mu_2-\mu_1)}{\sqrt{\frac{\sigma^2_1}{n_1} + \frac{\sigma^2_2}{n_2}}}. \]
- As usual, the t-statistic for a two-sample situation is obtained by replacing the unknown \(\sigma\)'s with \(s\)'s. If the two populations are different, this leads to the formula \[ \sigma_D \approx SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \] and the resulting t-statistic is \[ t = \frac{(\bar{x}_2-\bar{x}_1) - (\mu_2-\mu_1)}{\sqrt{\frac{s^2_1}{n_1} + \frac{s^2_2}{n_2}}}. \] However, when the two samples \(x_1\) and \(x_2\) come from populations with the same standard deviation, \(\sigma_1 = \sigma_2 = \sigma\) is approximated by the pooled sample standard deviation, obtained by pooling the two samples: \[ s_p^2 = \frac{(n_1-1)\,s_1^2 + (n_2-1)\,s_2^2}{n_1+n_2-2}. \] (Each observation is compared with its own sample mean, so this is not simply the sample variance of the combined sample \( (x_{11},\ldots,x_{1n_1},x_{21},\ldots,x_{2n_2}) \) of size \(n_1+n_2\).) If \(s = s_p\) is the pooled sample standard deviation then we form the t-statistic as follows: \[ t = \frac{(\bar{x}_2-\bar{x}_1) - (\mu_2-\mu_1)}{\sqrt{\frac{s^2}{n_1} + \frac{s^2}{n_2}}} =\frac{(\bar{x}_2-\bar{x}_1) - (\mu_2-\mu_1)}{s\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}. \] We note that for the pooled t-statistic \(df = n_1 + n_2 - 2\), and the t-distribution is exact if the populations are normal.
- When calculating the confidence interval for a two-sample situation, we modify the formulas for the margin of error accordingly. When the variances of the two populations are not the same: \[ m = t^* \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \] and when they are the same \[ m = t^* s \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \] where \(s\) is the pooled sample standard deviation.
- To the curious: like every random variable, the two-sample statistics above (such as \(D\) and the t-statistic) are functions on a sample space. The sample space in this case consists of pairs of samples \((x_1,x_2)\) as above, of fixed sizes \(n_1\) and \(n_2\). These samples may come from the same or different populations. The samples are assumed to be independently chosen, which implies that the random variables \(x_{1i}\) and \(x_{2j}\) (measurements on individuals of the two samples) are independent. The consequence of independence is that we may apply the Product Rule for Probability of Independent Events: \(P(A\cap B) = P(A)\cdot P(B)\) if \(A\) depends on sample \(x_1\) and \(B\) depends on sample \(x_2\).
- Remember: when doing an inverse lookup, we never multiply \(t\)-value or \(z\)-value by 2. We only perform such calculations on probabilities (P-values) as required by one-sided or two-sided nature of the calculation. Always draw the picture of the tails if unsure whether probability should be divided or multiplied by 2.
Hypothesis testing
- Know the mechanics of hypothesis testing. Null hypothesis, alternative hypothesis, one-sided vs. two-sided tests.
- Be able to formulate null and alternative hypotheses.
- Understand the meaning of P-value, be able to compute it when the variance of the population is known (Chapter 6) and unknown (Chapter 7).
- Be fluent in various lookups (straight and inverse) based on Table D (and review Table A), e.g.
- Given \(t\), degrees of freedom and confidence level, look-up P-value range (straight lookup).
- Given one of the standard confidence levels (i.e. one present in the bottom row), look up \(t\) (inverse lookup).
- Be able to translate between the confidence level \(C\) and the significance level \(\alpha\). It is always true that \(C + \alpha = 1 \). However, in a two-sided test \(\alpha\) is split between the two tails, while in a one-sided test \(\alpha\) is the probability of one tail.
- Be clear what the significance level \(\alpha\) is. Not to be confused with confidence level \(C\)!
- Understand the difference between rejecting null hypothesis vs. accepting alternative hypothesis. Note that we can never accept null hypothesis as truth, using statistics. Statistics is used to reject incorrect models of the world. The reason for rejecting a model is that the observed data (sample) is too extreme, assuming that the model is true.
- Understand the definition of type I and type II errors:
\[ \begin{array}{c|c|c} & H_0\;\text{is true} & H_0\;\text{is false} \\ \hline \text{Reject}\;H_0 & \text{Type I error} & \text{Correct outcome} \\ \text{Fail to reject}\;H_0 & \text{Correct outcome} & \text{Type II error} \\ \hline \end{array} \]
- Type I error is the situation when we reject the null hypothesis when it is actually true. This happens with probability \(\alpha\), the significance level.
- Type II error occurs when we do not reject the null hypothesis when it is false. The probability of rejecting a false null hypothesis is called the power of the test. If the probability of committing a type II error is \(\beta\) then the power of the test is \(1-\beta\).
The Student t-distribution
- Know when to apply (when the variance of the population is unknown).
- Be able to calculate t-statistic.
- Be able to compute the number of degrees of freedom
- Be able to use Table D of the book to look up the critical values of the t-statistic and figure out the P-value.
- Know how the t-distribution changes with degrees of freedom. Know that it becomes normal for large df.
- Be able to conduct a t-test in the scope of 7.1. (single-sample).
- Be able to conduct a two-sample t-test in the scope of 7.2. (two-sample).
- Know the conservative estimate for the degrees of freedom for two samples drawn from two distinct populations for which there is no reason to believe that they have the same standard deviation: \[ df = \min(n_1-1,n_2-1) \] However, as discussed before, when the standard deviations are the same (for instance, because the two samples are SRSs drawn from the same population), we use the pooled t-statistic discussed before, and the degrees of freedom are: \[ df = n_1 + n_2 - 2. \]
- The more accurate formula for the degrees of freedom when variances of the populations are not equal is the Welch-Satterthwaite formula. \[df = {{\left( {s_1^2 \over n_1} + {s_2^2 \over n_2}\right)^2 } \over {{s_1^4 \over n_1^2 \cdot \nu_1}+{s_2^4 \over n_2^2 \cdot \nu_2}}}={{\left( {s_1^2 \over n_1} + {s_2^2 \over n_2}\right)^2 } \over {{s_1^4 \over n_1^2 \cdot \left({n_1-1}\right)}+{s_2^4 \over n_2^2 \cdot \left({n_2-1}\right)}}} .\,\] (Based on Wikipedia article on the Welch's test with the Welch-Satterthwaite formula for the number of degrees of freedom.) We used the notation \(\nu_1=n_1-1\) and \(\nu_2=n_2-1\). In calculations by hand, this formula is not very practical, but fortunately easy to do in software. A two-sample test with R is presented here. The formula is implemented there, in a reusable fashion.
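The linked R code is not reproduced here, but the Welch-Satterthwaite formula is just as easy to code in Python. The sketch below is an optional illustration; the helper name welch_df and the two small samples are made up, and scipy's ttest_ind is used only as a cross-check.

```python
from statistics import variance
from scipy.stats import ttest_ind

def welch_df(s1_sq, n1, s2_sq, n2):
    """Welch-Satterthwaite approximation to the degrees of freedom."""
    num = (s1_sq / n1 + s2_sq / n2) ** 2
    den = s1_sq**2 / (n1**2 * (n1 - 1)) + s2_sq**2 / (n2**2 * (n2 - 1))
    return num / den

# Two small made-up samples, purely for illustration
x1 = [5.1, 4.9, 6.2, 5.8, 5.5, 6.0]
x2 = [4.2, 4.8, 4.5, 5.0, 4.4, 4.6, 4.9]

df = welch_df(variance(x1), len(x1), variance(x2), len(x2))
print(df)                                   # between min(n1-1, n2-1) and n1+n2-2

print(ttest_ind(x1, x2, equal_var=False))   # Welch two-sample t-test
print(ttest_ind(x1, x2, equal_var=True))    # pooled two-sample t-test
```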
- Know the reason why such a complicated formula for \(df\) is needed. This is because for \(n_1\neq n_2\) (unequal sample sizes) the distribution of the two-sample t-statistic is not the t-distribution! The idea is to still approximate it with the t-distribution, with a carefully chosen number of degrees of freedom. Some mathematical considerations show that the Welch-Satterthwaite formula is more accurate than the conservative estimate.
- In the case of equal, but unknown variances, we use the pooled t-statistic, and this one is exactly t-distributed when the original population is normal.
The list of Chapter 8 topics covered
Inference for single proportion
- Remember: In Chapter 8 we never need the t-statistic. It is all based on normal distribution (Table A).
- Know the single proportion z-statistic: \[ z = \frac{\hat{p} - p_0 }{\sqrt{\frac{p_0(1-p_0)}{n}}} \] Note that in the denominator we use \(p_0\), not \(\hat{p}\): under the null hypothesis \(H_0:\;p=p_0\), the standard deviation of \(\hat{p}\) is \(\sqrt{p_0(1-p_0)/n}\), which is completely known, so no estimation is needed. (Contrast this with the confidence interval below, where the unknown \(p\) is replaced by its unbiased estimator \(\hat{p}\); remember that \(\hat{p}\) is unbiased because \(\mu_{\hat{p}} = p\).)
- Know the meaning of the denominator which depends on the knowledge of proportions as random variables from earlier Chapters. Thus, \[ \mu_{\hat{p}} = p \] and \[ \sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}} \]
- Know the Central Limit Theorem (CLT) and its implication for proportions: the z-statistic is approximately normally distributed according to N(0,1).
- Know the large sample approximation. To apply the CLT we need to have the following:
- The population must be at least 10 times larger than the sample
- Both the number of failures and successes must be at least 10.
- Know the margin of error for calculating confidence intervals for proportions: \[ m = z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \] The confidence interval is \[ (\hat{p}-m, \hat{p}+m) \]
- Know how to conduct a test using the above statistic. The general methodology of tests is applicable, but in the proportion context we may test the null hypothesis \(H_0:\;p=p_0\), i.e. whether the proportion of the population having a certain characteristic is \(p_0\). For instance, we may test whether a coin is fair, by using the proportion \(\hat{p}\) from \(n=100\) coin tosses, with \(p_0 = 0.5\).
- Know the plus 4 correction. That is, in order to obtain a better estimate of the confidence interval, we may compensate for the error due to the finite sample size by using a new statistic \(\tilde{p}\) instead of \(\hat{p}\), computed just like \(\hat{p}\) but with the sample size artificially increased to \(n+4\). Thus \[\tilde{p} = \frac{X + 2}{n + 4}\] where \(X\) is the count of successes. The numerator \(X+2\) is explained by the fact that 4 is evenly split between the successes and failures.
- We also have a "plus 4" rule for the margin of error: \[ m = z^*\sqrt{\frac{\tilde{p}(1-\tilde{p})}{n+4}} \] Note that we used the corrected estimator \(\tilde{p}\), not \(\hat{p}\).
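Here is a sketch that computes both the large-sample and the "plus 4" confidence intervals for a single proportion (optional Python illustration; the counts are made up and scipy is assumed for \(z^*\)):

```python
from math import sqrt
from scipy.stats import norm

X, n, C = 44, 100, 0.95          # hypothetical: 44 successes in 100 trials
z_star = norm.ppf(1 - (1 - C) / 2)

# Large-sample interval based on p-hat
p_hat = X / n
m = z_star * sqrt(p_hat * (1 - p_hat) / n)
print(p_hat - m, p_hat + m)

# "Plus 4" interval: add 2 successes and 2 failures
p_tilde = (X + 2) / (n + 4)
m4 = z_star * sqrt(p_tilde * (1 - p_tilde) / (n + 4))
print(p_tilde - m4, p_tilde + m4)
```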
- Finally, please remember that \[ \hat{p} = \frac{X}{n} \] where \(X\) has a binomial distribution. When the samples are too small (say, fewer than 15 failures or successes; see the precise rules of thumb in the book covering single-sample and two-sample situations, means and proportions) to use the large sample approximation, or the "plus 4" rule, you may need to resort to using the binomial distribution (Table C). Thus, if you are testing a coin for fairness with \(n = 4\) tosses, the probability that you will get \(X=0,1,2,3,4\) successes (heads) is determined by B(4,p), i.e. \[ P(X=k) = {4\choose k}\cdot p^k\cdot(1-p)^{4-k} \] The null hypothesis would be that \(p=0.5\). Suppose you got just 1 success. The P-value for a two-sided test of significance would be the probability of getting 1 success or worse, i.e. 0, 1, 3 or 4 successes. Hence, the P-value would be \[ P(|X-2|\ge 1) = 1- P(X=2) = 1 - {4 \choose 2}\cdot 0.5^2\cdot 0.5^2 = 1-6\cdot 0.5^4 = 0.625 \] Hence, your confidence level is \(1-0.625=.375\) or 37.5%, which by all standards does not lead to rejecting the null hypothesis. As you can see, in practice small samples are too small for testing, and the accuracy of the binomial distribution does not help here.
- However, the binomial distribution should be used when extreme bias happens, such as \(\hat{p} = 0.6\) when we are testing the null hypothesis \(p = 0.2\) against the alternative \(p > 0.2\) with \(n=20\). This is equivalent to stating that there were 12 successes and 8 failures because \(\hat{p}\cdot n = 0.6\cdot 20 = 12\). Below is an example of a test of significance based on the binomial distribution. The P-value can only be reliably found with the binomial distribution, and the large sample approximation cannot be used by the rules of thumb stated in the book. Note that the expected value of the count \(X\) is \(n\,p = 20\cdot 0.2 = 4\), assuming the null hypothesis \(H_0: p = 0.2\). The probability of \(\hat{p}\) of \(0.6\) or higher is \[P(X\ge 0.6\cdot 20) = P(X\ge 12) = \sum_{k=12}^{20} {20 \choose k}\cdot (0.2)^k\cdot (0.8)^{20-k} \] This P-value may be evaluated with software (or Table C) and is \[P = 1.0172876515704846\cdot 10^{-4} \approx 0.01\% \] and thus is highly significant.
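The P-value quoted above can be reproduced directly from the binomial distribution (optional Python check; scipy is assumed for the second, equivalent computation):

```python
from math import comb
from scipy.stats import binom

n, p0, successes = 20, 0.2, 12

# P(X >= 12) under H0: p = 0.2, summed from the binomial formula ...
p_value = sum(comb(n, k) * p0**k * (1 - p0)**(n - k) for k in range(successes, n + 1))
print(p_value)                          # about 1.02e-4, i.e. roughly 0.01%

# ... or equivalently via the survival function 1 - F(11)
print(binom.sf(successes - 1, n, p0))
```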
Inference for 2 proportions
- When we have two samples from two populations with true proportions \(p_1\) and \(p_2\) of individuals with a certain characteristic of interest, we may wish to compare the proportions. The z-statistic used is: \[ z = \frac{(\hat{p}_1-\hat{p}_2) - (p_1-p_2)} {\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}} \]
- Know the resulting formula for margin of error for the difference \(p_1-p_2\) is: \[ m = z^* \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}} \]
- Know the "plus 4" rule for two proportions. This becomes "plus 2" rule for each individual proportion, i.e. \[\tilde{p}_1 = \frac{X_1+1}{n_1+2}\] \[\tilde{p}_2 = \frac{X_2+1}{n_2+2}\] where \(X_1\) and \(X_2\) are the success counts for both samples. Similarly, the margin of error formula is modified accordingly: \[ m = z^* \sqrt{ \frac{\tilde{p}_1(1-\tilde{p}_1)}{n_1+2} + \frac{\tilde{p}_2(1-\tilde{p}_2)}{n_2+2} } \] Again, please note that \(\tilde{p}_1\) and \(\tilde{p}_2\) are used, not \(\hat{p}_1\) and \(\hat{p}_2\). This correction should be used when the confidence level \(C\ge 90\%\) and both sample sizes are at least 5.