This Assignment is compulsory, and contributes 10% towards your final grade in Math2302, 8.5% in Math7308. The due date for the assignment is 10am on Monday 25 October, 2021 (see the ECP for information about requesting an extension). You should submit your assignment electronically. Prepare your assignment as a PDF file, either by typing it or by scanning your handwritten work. Ensure that your name, student number and tutorial group number appear on the first page of your submission. Check that your PDF file is legible and that the file size is not excessive. Files that are poorly scanned and/or illegible may not be marked. Upload your submission using the assignment submission link.

1. STAT2203: Probability Models and Data Analysis for Engineering Week 12 Exercises

- According to Hubble’s law, the relative velocity $v$ (km/s) of any two galaxies separated by a distance $D$ (megaparsecs; 1 parsec is $3.09 \times 10^{13}$ km) is given by

$$

v=H_{0} D,

$$

where $H_{0}$ is Hubble’s constant. If the expansion of the universe were linear, then $1 / H_{0}$ (the Hubble time) would give the age of the universe. The velocities and distances of 24 galaxies containing Cepheid stars are given in the file hubble.xlsx.

The MATLAB commands for fitting a linear model can be found in the lecture notes.

(a) Fit the linear regression model $V=\beta_{0}+\beta_{1} D+\varepsilon$, where $\varepsilon \sim \mathrm{N}\left(0, \sigma^{2}\right)$ to the hubble data. Assess the suitability of the linear regression model with diagnostic plots.

Solution: After setting the path to include the directory which contains our dataset hubble.xlsx:

>> Hubblefit = fitlm(hubble, 'velocity ~ distance')

Hubblefit =

Linear regression model:

velocity ~ 1 + distance

From the residual vs fitted value plot, there is no discernible trend. The red line attempts to make any possible trend apparent, but the departure from the horizontal line here is probably due to only a few points. There is a noticeable increase […] is ok for now, the normal probability plot looks close to a straight line, which would be consistent with the normality assumption. There is one residual (observation 15) which appears somewhat larger (negative) than we might have expected.
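The diagnostic plots discussed above could be produced along the following lines. This is a minimal sketch, not the commands from the lecture notes: the data here are simulated stand-ins (not the Hubble measurements) so that the script runs on its own, and `fitlm` and `plotResiduals` assume the Statistics and Machine Learning Toolbox is available.

```matlab
% Simulated stand-in data (NOT the Hubble measurements), used only so the
% sketch is self-contained. All numbers below are hypothetical.
rng(1);                                       % reproducible random numbers
distance = linspace(2, 22, 24)';              % 24 hypothetical distances (Mpc)
velocity = 75*distance + 150*randn(24, 1);    % hypothetical linear trend plus noise
hubble   = table(velocity, distance);

% Fit V = beta0 + beta1*D + error by least squares.
Hubblefit = fitlm(hubble, 'velocity ~ distance');

% Residuals against fitted values: look for trend or changing spread.
figure;
plotResiduals(Hubblefit, 'fitted');

% Normal probability plot of the residuals: look for a roughly straight line.
figure;
plotResiduals(Hubblefit, 'probability');
```

With the real data, the simulated table would presumably be replaced by something like `hubble = readtable('hubble.xlsx')`, assuming the spreadsheet columns are named `velocity` and `distance`.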

(b) Assuming the linear regression model is appropriate, is the data consistent with $\beta_{0}=0$?

Solution: We want to test $H_{0}: \beta_{0}=0$ against the alternative $H_{1}: \beta_{0} \neq 0$. The reported test statistic from the output (Intercept line) is $0.053$ and the reported p-value is $0.958$. There is no evidence against $H_{0}$, so the data are consistent with $\beta_{0}=0$.
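As a quick check, the reported p-value can be recomputed from the reported test statistic and the $t_{22}$ reference distribution. The sketch below uses illustrative variable names and assumes `tcdf` from the Statistics and Machine Learning Toolbox.

```matlab
t_obs = 0.053;     % reported test statistic for the intercept
dfe   = 24 - 2;    % residual degrees of freedom (24 galaxies, 2 coefficients)

% Two-sided p-value: twice the lower-tail probability at -|t_obs|.
p_value = 2 * tcdf(-abs(t_obs), dfe);
fprintf('p-value = %.3f\n', p_value);
```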

(c) In the linear regression model, Hubble’s constant is $\beta_{1}$. Construct a $99 \%$ confidence interval for the Hubble constant.

Solution: Recall confidence intervals have the form

$$

\text { estimate } \pm \text { (critical value) } \times \text { s.e.(estimate). }

$$

We have the estimate and the standard error from the output. We only need to determine the critical value. The reference distribution is the $t$-distribution with $24-2=22$ degrees of freedom. As we want a $99 \%$ confidence interval, $1-\alpha=0.99$, so $\alpha=0.01$ and we need the $0.995$ quantile of the $t_{22}$-distribution:

>> tinv(0.995, 22)

ans =

2.8188

So the confidence interval is

$$

76.127 \pm 2.8188 \times 9.4935

$$

We are $99 \%$ confident the true value of $\beta_{1}$ lies between $49.3667$ and $102.8873$ km/s per megaparsec.
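The interval arithmetic above can be reproduced with a short script. This is a minimal sketch using the reported estimate and standard error; the variable names are illustrative. The last few lines convert the point estimate to a Hubble time, using the conversion 1 parsec = $3.09 \times 10^{13}$ km from the problem statement and an assumed year length of 365.25 days.

```matlab
b1    = 76.127;    % reported slope estimate (km/s per megaparsec)
se_b1 = 9.4935;    % reported standard error of the slope
dfe   = 24 - 2;    % residual degrees of freedom
alpha = 0.01;      % for a 99% confidence interval

% estimate +/- (critical value) x s.e.(estimate)
tcrit = tinv(1 - alpha/2, dfe);
ci    = b1 + [-1 1] * tcrit * se_b1;
fprintf('99%% CI for beta1: [%.4f, %.4f]\n', ci(1), ci(2));

% Hubble time 1/H0: convert megaparsecs to km so the length units cancel.
km_per_Mpc     = 3.09e13 * 1e6;                        % 1 Mpc = 10^6 parsecs
hubble_time_s  = km_per_Mpc / b1;                      % seconds
hubble_time_yr = hubble_time_s / (365.25 * 24 * 3600); % assumed year length
fprintf('Hubble time: %.3g years\n', hubble_time_yr);
```

With the fitted model object available, `coefCI(Hubblefit, 0.01)` should return the 99% intervals for both coefficients without the manual arithmetic.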

(d) Construct a $95 \%$ confidence interval for the mean relative velocity of two galaxies separated by 10 megaparsecs.

Solution: The estimated mean relative velocity of two galaxies separated by 10 megaparsecs is

$$

6.6963+76.127 \times 10=767.9663

$$

The estimated covariance matrix of the coefficient estimators is

>> Hubblefit.CoefficientCovariance

ans =

1.0e+04 *

1.6017 -0.1086

-0.1086 0.0090

Set $x_{\text {new }}=\left[\begin{array}{ll}1 & 10\end{array}\right]$. To get $s \sqrt{x_{\text {new }}\left(X^{T} X\right)^{-1} x_{\text {new }}^{T}}$ :

>> xnew = [1 10];

>> sqrt(xnew*Hubblefit.CoefficientCovariance*xnew')

ans =

57.4513

We still use the $t_{22}$-distribution as reference.

>> tinv(0.975, 22)

ans =

2.0739

So the $95 \%$ confidence interval for the mean relative velocity of two galaxies separated by 10 megaparsecs is

$$

767.966 \pm 2.0739 \times 57.4513

$$

We are $95 \%$ confident the true mean relative velocity of two galaxies separated by 10 megaparsecs is between $648.8192$ km/s and $887.1128$ km/s.

Alternatively, this could be done in MATLAB by

>> [yhat, ci] = predict(Hubblefit, [nan 10], 'Alpha', 0.05)

yhat =

767.9658

ci =

648.8190 887.1126

For some reason MATLAB wants an additional column even though only one explanatory variable is used in the model.

- Polychlorinated biphenyls (PCBs) were once used in industry but were banned in the 1970s because of concerns about their toxicity. Despite the ban, PCBs can still be detected in most people because they are persistent in the environment. A team of researchers recorded the amount of PCBs detected in maternal milk from mothers who had eaten fish from a particular lake considered to be contaminated with PCBs. They subsequently administered an IQ test to the children when they were 11 years old. The data from 14 mothers and their eldest child are shown in the following scatter plot along with the least-squares line fitting a linear relationship between the two variables:

A regression analysis produced the following edited summary:

Coefficients:

| | Estimate | Std. Error |
|---|---|---|
| (Intercept) | 127.937156 | 6.962961 |
| PCB | -0.023631 | 0.007529 |

(a) Briefly interpret the value $-0.023631$.

Solution: An increase of one ng/g of PCB in maternal milk is associated with an estimated decrease of about $0.0236$ points in the mean IQ of the child.

(b) Carry out a t-test to assess whether there is evidence of an association between maternal milk PCB levels and IQ outcome. Show your working and state your conclusion.

Solution: Let $\beta_{1}$ be the true slope of the linear relationship between PCB in maternal milk and mean IQ of child. We want to test $H_{0}: \beta_{1}=0$ against $H_{1}: \beta_{1} \neq 0$. The test statistic is

$$
\begin{aligned}
t &=\frac{\text { estimate }-\text { hypothesised }}{\text { s.e. }(\text { estimate })} \\
&=\frac{-0.023631-0}{0.007529}=-3.138664
\end{aligned}
$$

We compare the test statistic with the $t$-distribution with $14-2=12$ degrees of freedom. The p-value is $2 \times \min \left\{\mathbb{P}\left(T_{12} \leqslant-3.138664\right), \mathbb{P}\left(T_{12} \geqslant-3.138664\right)\right\}$.

>> tcdf(-3.138664, 12)

ans =

0.0043

>> tcdf(-3.138664, 12, 'upper')

ans =

0.9957

>> 2*tcdf(-3.138664, 12)

ans =

0.0086

The p-value is $0.0086$. This is strong evidence against the null hypothesis, suggesting an association between PCB levels in maternal milk and the IQ of the child.
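The whole test can also be run as one short script from the two numbers in the coefficient table. This is a sketch with illustrative variable names; it uses `2*tcdf(-abs(t), dfe)`, which equals the `min` expression above because the $t$-distribution is symmetric about zero.

```matlab
b1    = -0.023631;   % reported slope estimate for PCB
se_b1 =  0.007529;   % reported standard error
dfe   = 14 - 2;      % 14 mother-child pairs, 2 coefficients

% Test statistic for H0: beta1 = 0.
t_obs = (b1 - 0) / se_b1;

% Two-sided p-value using the symmetry of the t-distribution.
p_value = 2 * tcdf(-abs(t_obs), dfe);
fprintf('t = %.4f, p-value = %.4f\n', t_obs, p_value);
```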

(c) Construct a $95 \%$ confidence interval for the intercept of the regression line.

Solution: The confidence interval has the form

$$

\text { estimate } \pm(\text { critical value }) \times \text { s.e. }(\text { estimate })

$$

where the critical value is the $0.975$ quantile of the $t_{12}$-distribution.

>> tinv(0.975, 12)

ans =

2.1788

The confidence interval is

$$

127.937156 \pm 2.1788 \times 6.962961

$$

We are $95 \%$ confident the true value of the intercept is between $112.7663$ and 143.1081 (IQ points).
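The endpoints can be checked with a few lines of MATLAB. This is a minimal sketch from the reported estimate and standard error, with illustrative variable names.

```matlab
b0    = 127.937156;   % reported intercept estimate (IQ points)
se_b0 = 6.962961;     % reported standard error
dfe   = 14 - 2;       % residual degrees of freedom

% 0.975 quantile of the t-distribution gives a 95% two-sided interval.
tcrit = tinv(0.975, dfe);
ci    = b0 + [-1 1] * tcrit * se_b0;
fprintf('95%% CI for the intercept: [%.4f, %.4f]\n', ci(1), ci(2));
```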

(d) Based on the coefficients for the least-squares line provided by MATLAB, estimate the mean IQ of children if their mothers had a maternal milk PCB measurement of $1400 \mathrm{ng} / \mathrm{g}$. If the child has an IQ of 102, what is the residual?

Solution: The estimated mean IQ of a child whose mother had a PCB level of $1400 \mathrm{ng} / \mathrm{g}$ is

$$

127.937156-0.023631 \times 1400=94.85376

$$

(units in IQ points). For a child with IQ 102, the residual is

$$
\begin{aligned}
\text { Residual } &=(\text { observed value })-(\text { estimate from regression model }) \\
&=102-94.85376 \\
&=7.14624
\end{aligned}
$$
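The same prediction and residual can be computed directly from the reported coefficients. The variable names below are illustrative only.

```matlab
b0 = 127.937156;   % reported intercept (IQ points)
b1 = -0.023631;    % reported slope (IQ points per ng/g of PCB)

pcb    = 1400;     % maternal milk PCB level (ng/g)
iq_obs = 102;      % observed IQ of the child

iq_hat   = b0 + b1 * pcb;     % estimated mean IQ from the fitted line
residual = iq_obs - iq_hat;   % observed minus fitted
fprintf('fitted = %.5f, residual = %.5f\n', iq_hat, residual);
```

A positive residual means the observed IQ lies above the fitted line.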

(e) State the assumptions underlying linear regression.

Solution:

- Linear relationship between mean response and explanatory variable.
- Response variable has constant variance and normal distribution.
- Observations are independent.

(f) The plot below shows the residuals of the linear regression against the theoretical quantiles of the standard normal distribution. Comment on the validity of the assumptions of the linear regression model.

Solution: We cannot use the plots to assess independence. There is no obvious trend in the residuals vs fitted values plot so the linear relationship between mean response and explanatory variable is ok. There is no obvious change in spread in this plot either so the constant variance assumption is ok. The plot of residuals against the quantiles of the normal distribution looks fairly straight which is consistent with the assumption of a normal distribution for the response.

# Probability Models & Data Analysis (STAT2203)

## Course level

Undergraduate

## Faculty

Engineering, Architecture & Information Technology

## School

Mathematics & Physics School

## Units

2

## Duration

One Semester

## Class contact

3 Lecture hours, 1 Practical or Laboratory hour

## Incompatible

ECON1310, ECON1320, ENVM2000, STAT1201, STAT1301, STAT2003, STAT2201, STAT2202

## Prerequisite

MATH1051 or MATH1071

## Assessment methods

Assignments, tutorial reports, exam

## Course coordinator

Dr Ross McVinish

## Study Abroad

This course is pre-approved for Study Abroad and Exchange students.