In this problem, you will use different models to classify compounds that are
designated as either crossing the blood-brain barrier or not crossing it. The data
set “bbb.train” consists of 280 compounds and their descriptors (such as MOE
descriptors or LCALC descriptors, 32 descriptors in total; the first column is the label).
The data set “labels.train” records whether each compound can cross
the blood-brain barrier or not (“c” for crossing and “nc” for not crossing).
Similarly, the test data can be found in bbb.test and labels.test.csv.

First, we are going to use logistic regression. We observe $n$ independent
observations $\left(y_{i}, x_{1 i}, \ldots, x_{p i}\right)$ for $i=1, \ldots, n$, where $y_{i} \in\{nc, c\}$. The data come from
the following logistic model:
$$y_{i} \sim \text { Bernoulli }\left(\pi_{i}\right) \text { with } P\left(y_{i}=n c\right)=\pi_{i}=1-P\left(y_{i}=c\right)$$
The parameter $\pi_{i}$ satisfies, with $x_{i}=\left(x_{1 i}, \ldots, x_{p i}\right)^{T}$:
$$\operatorname{logit}\left(\pi_{i}\right)=\log \left(\frac{\pi_{i}}{1-\pi_{i}}\right)=\beta_{0}+\beta^{T} x_{i}$$
or equivalently:
$$\pi_{i}=\frac{e^{\beta_{0}+\beta^{T} x_{i}}}{1+e^{\beta_{0}+\beta^{T} x_{i}}}$$
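The step from the logit form to the explicit expression for $\pi_{i}$ is a short piece of algebra. Writing $\eta_{i}=\beta_{0}+\beta^{T} x_{i}$ for the linear predictor of observation $i$ (the symbol $\eta_{i}$ is introduced here only as shorthand), exponentiate both sides of the logit equation and solve for $\pi_{i}$:
$$\frac{\pi_{i}}{1-\pi_{i}}=e^{\eta_{i}} \;\Longrightarrow\; \pi_{i}=e^{\eta_{i}}\left(1-\pi_{i}\right) \;\Longrightarrow\; \pi_{i}\left(1+e^{\eta_{i}}\right)=e^{\eta_{i}} \;\Longrightarrow\; \pi_{i}=\frac{e^{\eta_{i}}}{1+e^{\eta_{i}}}$$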
The estimates of the parameters $\beta$ in the logistic regression model are obtained by
maximum likelihood. In this case, with $y_{i}$ coded as 1 for $nc$ and 0 for $c$, the likelihood function is:
$$L(\beta)=\prod_{i=1}^{n}\left(1-\pi_{i}\right)^{1-y_{i}}\left(\pi_{i}\right)^{y_{i}}$$
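As an illustration of the model, the following minimal Python sketch evaluates the log of this likelihood for a given coefficient vector. It uses only the standard library. The function names (`log1pexp`, `log_likelihood`) and the three toy observations are hypothetical and are not taken from the bbb data. Labels are coded as 1 for "nc" and 0 for "c", matching $\pi_{i}=P\left(y_{i}=nc\right)$.

```python
import math


def log1pexp(eta):
    """Return log(1 + exp(eta)) without overflow for large |eta|."""
    return max(eta, 0.0) + math.log1p(math.exp(-abs(eta)))


def log_likelihood(beta0, beta, X, y):
    """Log of the Bernoulli likelihood of the logistic model.

    Uses the identity
        y*log(pi) + (1 - y)*log(1 - pi) = y*eta - log(1 + exp(eta)),
    where eta = beta0 + beta^T x is the linear predictor.
    """
    total = 0.0
    for x_i, y_i in zip(X, y):
        eta = beta0 + sum(b * x for b, x in zip(beta, x_i))
        total += y_i * eta - log1pexp(eta)
    return total


# Toy data (hypothetical): three observations with two descriptors each.
X = [[0.5, 1.2], [-1.0, 0.3], [2.0, -0.7]]
labels = ["nc", "c", "nc"]
y = [1 if lab == "nc" else 0 for lab in labels]  # 1 = "nc", 0 = "c"

# With all coefficients at zero, every pi_i equals 0.5.
print(log_likelihood(0.0, [0.0, 0.0], X, y))
```

Maximum likelihood estimation amounts to searching for the `beta0` and `beta` that make this value as large as possible. In practice a statistical package performs that search.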

parameters $\beta$
b) Run a logistic regression to classify the compounds from the blood-brain
barrier data on the test set. Present the coefficients obtained and the confusion
matrix for the test set.
c) Use ridge logistic regression to classify the compounds from the blood-brain
barrier data on the test set. First, write the loss function that you will minimize.
Second, explain why we use ridge regression. Third, run the model. Present
the optimal tuning parameter, obtained using cross-validation, the coefficients
for this parameter, and the confusion matrix for the test set.
d) Use lasso logistic regression to classify the compounds from the blood-brain
barrier data on the test set. First, write the loss function that you will minimize.
Second, explain why we use lasso regression. Third, run the model. Present the
optimal tuning parameter, obtained using cross-validation, the coefficients for
this parameter, and the confusion matrix for the test set.
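The quantities requested in parts b) to d) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference solution. The penalized losses use one common form: the negative log-likelihood plus `lam` times the penalty, with the intercept left unpenalized. Software packages scale the likelihood and the tuning parameter differently. All function names and the toy labels below are hypothetical.

```python
import math


def neg_log_likelihood(beta0, beta, X, y):
    """Negative log-likelihood of the logistic model (y coded 1 = "nc", 0 = "c")."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        eta = beta0 + sum(b * x for b, x in zip(beta, x_i))
        # log(1 + exp(eta)), computed without overflow
        softplus = max(eta, 0.0) + math.log1p(math.exp(-abs(eta)))
        total -= y_i * eta - softplus
    return total


def ridge_loss(beta0, beta, X, y, lam):
    """Negative log-likelihood plus lam * sum of squared coefficients.

    Assumption: the intercept beta0 is not penalized.
    """
    return neg_log_likelihood(beta0, beta, X, y) + lam * sum(b * b for b in beta)


def lasso_loss(beta0, beta, X, y, lam):
    """Negative log-likelihood plus lam * sum of absolute coefficients.

    Assumption: the intercept beta0 is not penalized.
    """
    return neg_log_likelihood(beta0, beta, X, y) + lam * sum(abs(b) for b in beta)


def confusion_matrix(actual, predicted, classes=("c", "nc")):
    """Rows are actual classes, columns are predicted classes."""
    counts = {(a, p): 0 for a in classes for p in classes}
    for a, p in zip(actual, predicted):
        counts[(a, p)] += 1
    return [[counts[(a, p)] for p in classes] for a in classes]


# Toy labels (hypothetical), only to show the layout of the matrix.
actual = ["c", "nc", "nc", "c", "nc"]
predicted = ["c", "nc", "c", "c", "nc"]
print(confusion_matrix(actual, predicted))  # [[2, 0], [1, 2]]
```

The squared penalty shrinks all coefficients toward zero, while the absolute-value penalty can set some of them exactly to zero. In both cases `lam` is the tuning parameter that cross-validation selects.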
