STOR 455 Introduction to logistic regression

“Why shouldn’t I just use ordinary least squares?” Good question.
Consider the linear probability (LP) model:
$$Y=a+B X+e$$
where

• $Y$ is a dummy dependent variable, $=1$ if the event happens, $=0$ if the event doesn’t happen,
• $a$ is the coefficient on the constant term,
• $B$ is the coefficient(s) on the independent variable(s),
• $X$ is the independent variable(s), and
• $e$ is the error term.

The LP model runs into trouble. There are three problems with using it:
1. The error terms are heteroskedastic (heteroskedasticity occurs when the variance of the dependent variable differs across values of the independent variables): $\operatorname{var}(e)=p(1-p)$, where $p$ is the probability that the event occurs ($Y=1$). Since $p$ depends on $X$, the “classical regression assumption” that the error term does not depend on the Xs is violated.
2. $e$ is not normally distributed because $Y$ takes on only two values, violating another “classical regression assumption”.
3. The predicted probabilities can be greater than 1 or less than 0, which can be a problem if the predicted values are used in a subsequent analysis. Some people try to solve this problem by setting probabilities that are greater than (less than) 1 (0) to be equal to 1 (0). This amounts to an interpretation that a high probability of the Event (Nonevent) occurring is considered a sure thing.
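The first and third problems can be seen in a small numerical sketch. The Python below is only an illustration (the course itself uses R), and the ten data points are made up for the purpose, not taken from any course data set. It fits the LP model by ordinary least squares using the closed-form formulas for a single predictor, lists the fitted values that fall outside [0, 1], and then evaluates $p(1-p)$ at a few probabilities to show that the error variance is not constant.

```python
# Toy data (hypothetical): the event never happens for x < 5 and always for x >= 5.
x = list(range(10))
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

def ols_fit(x, y):
    """Closed-form least squares for one predictor: returns (a, B)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    B = sxy / sxx
    a = y_bar - B * x_bar
    return a, B

a, B = ols_fit(x, y)
fitted = [a + B * xi for xi in x]

# Problem 3: some fitted "probabilities" are below 0 or above 1.
out_of_range = [round(f, 3) for f in fitted if f < 0 or f > 1]
print("LP fit: a = %.4f, B = %.4f" % (a, B))
print("fitted values outside [0, 1]:", out_of_range)

# Problem 1: var(e) = p(1 - p) changes with p, and therefore with X.
def bernoulli_variance(p):
    return p * (1 - p)

for p in (0.1, 0.5, 0.9):
    print("p = %.1f  var(e) = %.2f" % (p, bernoulli_variance(p)))
```

With these toy data the fitted line dips below 0 at the smallest values of x and rises above 1 at the largest, which is exactly the situation that truncating to 0 or 1 tries to patch.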
The logistic regression model
The “logit” model solves these problems:
$$\begin{aligned} &\ln [p /(1-p)]=a+B X+e \text { or } \\ &[p /(1-p)]=\exp (a+B X+e) \end{aligned}$$
where:
• $\ln$ is the natural logarithm, $\log _{\exp }$, where $\exp =2.71828 \ldots$
• $p$ is the probability that the event $Y$ occurs, $p(Y=1)$
• $p/(1-p)$ is the odds of the event (often loosely called the “odds ratio”)
• $\ln[p/(1-p)]$ is the log odds, or “logit”
• all other components of the model are the same.
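Setting the error term aside and solving the logit equation for $p$ gives $p=\exp(a+BX)/[1+\exp(a+BX)]$, which always lies strictly between 0 and 1. The short Python sketch below illustrates that mapping; the function names `logit` and `inv_logit` are chosen here for illustration and are not part of the course materials.

```python
import math

def logit(p):
    """Log odds ln[p / (1 - p)] for a probability strictly between 0 and 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return math.log(p / (1.0 - p))

def inv_logit(z):
    """Map a log odds value z back to a probability in (0, 1)."""
    # Two algebraically equal forms, chosen to avoid overflow in exp().
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Any value of a + BX, however extreme, maps to a valid probability.
for z in (-3.0, 0.0, 3.0):
    print("log odds %5.1f -> p = %.4f" % (z, inv_logit(z)))
```

Because the linear predictor is passed through this transformation, no truncation of fitted values is needed, unlike in the LP model.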

The common brushtail possum of the Australia region is a
bit cuter than its distant cousin, the American opossum (see Figure 7.5 on page 334). We consider
104 brushtail possums from two regions in Australia, where the possums may be considered a
random sample from the population. The first region is Victoria, which is in the eastern half of
Australia and traverses the southern coast. The second region consists of New South Wales and
Queensland, which make up eastern and northeastern Australia.
We use logistic regression to differentiate between possums in these two regions. The outcome
variable, called population, takes value 1 when a possum is from Victoria and 0 when it is from
New South Wales or Queensland. We consider five predictors: sex_male (an indicator for a
possum being male), head_length, skull_width, total_length, and tail_length. Each variable
is summarized in a histogram. The full logistic regression model and a reduced model after variable
selection are summarized in the table.
[Figure: histograms (frequency counts) of sex_male (0 = Female, 1 = Male), head_length, skull_width (in mm), total_length (in cm), tail_length (in cm), and population (0 = Not Victoria, 1 = Victoria).]
                          Full Model                            Reduced Model
              Estimate       SE      Z  Pr(>|Z|)    Estimate      SE      Z  Pr(>|Z|)
(Intercept)    39.2349  11.5368   3.40    0.0007     33.5095  9.9053   3.38    0.0007
sex_male       -1.2376   0.6662  -1.86    0.0632     -1.4207  0.6457  -2.20    0.0278
head_length    -0.1601   0.1386  -1.16    0.2480
skull_width    -0.2012   0.1327  -1.52    0.1294     -0.2787  0.1226  -2.27    0.0231
total_length    0.6488   0.1531   4.24    0.0000      0.5687  0.1322   4.30    0.0000
tail_length    -1.8708   0.3741  -5.00    0.0000     -1.8057  0.3599  -5.02    0.0000
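The Z and Pr(>|Z|) columns appear to be Wald statistics: the estimate divided by its standard error, with a two-sided p-value from the standard normal distribution. That reading is an assumption about how the table was produced, although it matches the usual output of logistic regression software. The Python sketch below (illustrative only; the helper names are made up) checks it against one row of the table.

```python
import math

def wald_z(estimate, se):
    """Wald statistic: coefficient estimate divided by its standard error."""
    return estimate / se

def two_sided_p(z):
    """Two-sided p-value under a standard normal reference distribution."""
    # 2 * (1 - Phi(|z|)) written with the complementary error function.
    return math.erfc(abs(z) / math.sqrt(2.0))

# Reduced-model row for sex_male, copied from the table.
z = wald_z(-1.4207, 0.6457)
p = two_sided_p(z)
print("Z = %.2f, Pr(>|Z|) = %.4f" % (z, p))
```

If the assumption holds, the printed values should agree with the tabulated Z and Pr(>|Z|) for sex_male up to rounding.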
(a) Examine each of the predictors. Are there any outliers that are likely to have a very large
influence on the logistic regression model?
(b) The summary table for the full model indicates that at least one variable should be eliminated
when using the p-value approach for variable selection: head_length. The second component
of the table summarizes the reduced model following variable selection. Explain why the
remaining estimates change between the two models.
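To see how the fitted model is used, the reduced-model estimates from the table can be turned into an estimated probability for a single animal. The Python sketch below is only an illustration: the possum's measurements are hypothetical values picked to lie inside the ranges shown in the histograms, and they are not a row of the data set.

```python
import math

# Reduced-model estimates copied from the table.
coef = {
    "(Intercept)": 33.5095,
    "sex_male": -1.4207,
    "skull_width": -0.2787,
    "total_length": 0.5687,
    "tail_length": -1.8057,
}

# Hypothetical possum, made up for illustration (not a row of the data set).
possum = {"sex_male": 1, "skull_width": 63.0, "total_length": 83.0, "tail_length": 37.0}

# Linear predictor a + BX on the log odds scale.
log_odds = coef["(Intercept)"] + sum(coef[k] * v for k, v in possum.items())

# Invert the logit to get the estimated probability that population = 1 (Victoria).
p_victoria = 1.0 / (1.0 + math.exp(-log_odds))
print("log odds = %.4f, estimated P(Victoria) = %.4f" % (log_odds, p_victoria))
```

For this hypothetical animal the log odds are strongly negative, so the model assigns it a probability of being from Victoria of well under 1%.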


Materials and Websites for the Class:

Textbook: Graybill and Iyer, REGRESSION ANALYSIS: Concepts and Applications. Available for
free at http://www.stat.colostate.edu/%7Ehari/regression_book/index.html
There will be a tab for the textbook on Sakai, and other materials will also be provided during the course.
Gradescope: All homework will be handed in on Gradescope, which you can access through the
Piazza: Piazza is a forum where students can ask questions of me and each other and get responses in a
timely fashion. I have not previously used it myself, but several colleagues (including Dr. Cunningham,
who is teaching the parallel STOR 455.2) have used it and highly recommend it. However, Piazza is
moving to a for-payment model with a more limited free option. I need to find out what other people are
doing before making a firm decision myself. I will get back to you once I have a recommendation on how to
proceed.
Programming Requirement: Throughout the course, we will be taking advantage of the R
programming language. Before the course, you should download R, RStudio, and R Markdown, all of
which are free. If needed, I will provide further references for use of R.
Prerequisites: STOR 155 or equivalent. Some familiarity with matrix algebra is recommended, but
not required.
Final Grade: 30% HW, including Case Studies and Projects; 25% midterm 1, 25% midterm 2; 20%
final exam at noon on Saturday, May 8.
Pass-Fail Option: Similar to Fall 2020. Students have the option of switching to P/F grading if they
file the registrar’s office form by Wednesday, May 5.
HW Assignments:
Homework due dates will be made clear throughout the course. In general, there will always be some
problems to be working on although the due dates will vary. Please watch for class announcements for
more details of the homework schedule. Every assignment will be posted on Gradescope at least one
week before the assignment is due.
• On homework, it is OK to work with others, but the work you turn in should be yours alone. Do
not copy-paste code from other students; this is easily detected and defeats the purpose of the homework.