## Faculty of Science, Technology, Engineering and Mathematics

M248 Analysing data

M248 - TMA 04

Please read the Student guidance for preparing and submitting TMAs on the M248 website before beginning work on a TMA. You can submit a TMA either by post or electronically using the University’s online TMA/EMA service.

Question 1, which covers topics in Unit 7, and Question 2, which covers topics in Unit 8, form M248 TMA 04. Question 1 is marked out of 23; Question 2 is marked out of 27.

Question 1 - 23 marks
You should be able to answer this question after working through Unit 7.
(a) Let X and Y be independent random variables both with the same mean µ ≠ 0. Define a new random variable W = aX + bY where a and b are constants.
(i) Obtain an expression for E(W). [2]
(ii) What constraint is there on the values of a and b so that W is an
unbiased estimator of µ? Hence write all unbiased versions of W as
a formula involving a, X and Y only (and not b). [3]
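
The linearity argument behind part (a) can be checked numerically. The sketch below is illustrative only: the two p.m.f.s are hypothetical (chosen so that both have mean 2), and the helper names are not part of the module materials. It computes E(aX + bY) exactly from the joint p.m.f. of independent X and Y, once with weights that sum to 1 and once with weights that do not.

```python
from fractions import Fraction

def expectation(pmf):
    """Mean of a discrete distribution given as {value: probability}."""
    return sum(Fraction(x) * p for x, p in pmf.items())

def expectation_of_w(a, b, pmf_x, pmf_y):
    """E(aX + bY), summing over the joint p.m.f. of independent X and Y."""
    return sum(
        (a * x + b * y) * px * py
        for x, px in pmf_x.items()
        for y, py in pmf_y.items()
    )

# Hypothetical distributions, chosen only so that both have mean 2.
pmf_x = {1: Fraction(1, 2), 3: Fraction(1, 2)}
pmf_y = {0: Fraction(1, 3), 3: Fraction(2, 3)}
mu = expectation(pmf_x)

a = Fraction(3, 10)
unbiased = expectation_of_w(a, 1 - a, pmf_x, pmf_y)  # weights sum to 1
biased = expectation_of_w(a, a, pmf_x, pmf_y)        # weights sum to 3/5
print(mu, unbiased, biased)
```

Exact fractions are used so that the comparison with µ is not blurred by rounding.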
(b) An otherwise fair six-sided die has been tampered with in an attempt to
cheat at a dice game. The effect is that the 1 and 6 faces have different
probability of occurring than the 2, 3, 4 and 5 faces.
Let θ be the probability of obtaining a 1 on this biased die. Then, the
outcomes of rolling the biased die have the following probability mass
function.
Table 1 The p.m.f. of outcomes of rolls of a biased die:

(i) By consideration of the p.m.f. in Table 1, explain why it is necessary for θ to be such that 0 < θ < 1/2. [2]
(ii) The value of θ is unknown. Data from which to estimate the value of θ were obtained by rolling the biased die 1000 times. The result of this experiment is shown in Table 2.

Table 2 Outcomes of 1000 independent rolls of a biased die

Show that the likelihood of θ based on these data is
$L(\theta) = C\,\theta^{395}(1 - 2\theta)^{605}$, where C is a positive constant, not dependent on θ. [5]
(iii) Show that $L'(\theta) = C\,\theta^{394}(1 - 2\theta)^{604}(395 - 2000\theta)$. [4]

(iv) What is the value of the maximum likelihood estimate, θ̂, of θ based on these data? Justify your answer. What does the value of θ̂ suggest about the value of θ for this biased die compared with the value of θ associated with a fair, unbiased die? [4]
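
A numerical check can complement the calculus in parts (iii) and (iv). The sketch below is a rough check rather than the required justification. It uses only the likelihood stated in part (ii) and works on the log scale, dropping the constant C. The grid step of 0.0001 and the function names are arbitrary choices made for illustration.

```python
import math

def log_likelihood(theta):
    """log L(theta) up to the additive constant log C."""
    return 395 * math.log(theta) + 605 * math.log(1 - 2 * theta)

def score(theta):
    """Derivative of the log-likelihood with respect to theta."""
    return 395 / theta - 2 * 605 / (1 - 2 * theta)

# Grid search over the open interval (0, 1/2) in steps of 0.0001.
grid = [k / 10000 for k in range(1, 5000)]
theta_hat = max(grid, key=log_likelihood)

print(theta_hat)
print(score(0.19), score(theta_hat), score(0.21))  # sign change around the maximum
```

The score is positive to the left of the grid maximiser and negative to the right, which is the numerical counterpart of the sign argument for L′(θ).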

(c) Studies of the size and range of wild animal populations often involve tagging observed individual animals and recording how many times each is caught in a trap (from which it is then released back into the wild). The dataset presented in Table 3 consists of the numbers of times each of n = 334 wood mice were caught in a particular trap (over a two-year time period). The data are also provided in the Minitab file wood-mice.mtw.

Table 3 Numbers of trappings of wood mice

The geometric distribution with parameter p is a good model for these data.
(i) What is the maximum likelihood estimator of p for a geometric model? [1]
(ii) What is the maximum likelihood estimate of p for the data in Table 3? You are recommended to use Minitab to help you to answer this part of the question. [2]
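
For readers working outside Minitab, the calculation in part (c) can be sketched from a frequency table. This assumes the geometric distribution is defined on 1, 2, 3, ... with p.m.f. (1 − p)^(x−1) p, for which the maximum likelihood estimator is one over the sample mean. The counts below are hypothetical and are not the Table 3 data.

```python
def geometric_mle(frequencies):
    """MLE of p for a geometric model on 1, 2, 3, ...: one over the sample mean.

    `frequencies` maps each observed value to the number of times it occurred.
    """
    n = sum(frequencies.values())
    total = sum(value * count for value, count in frequencies.items())
    return n / total

# Hypothetical counts for illustration only; these are NOT the Table 3 data.
example = {1: 50, 2: 25, 3: 15, 4: 10}
p_hat = geometric_mle(example)
print(round(p_hat, 4))
```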

Question 2 - 27 marks
You should be able to answer this question after working through Unit 8.
(a) In this part of the question, you should calculate the required confidence interval by hand, using tables, and show your working. (You may use Minitab to check your answers, if you wish.)

Modern aircraft cockpit windscreens are complex items, comprising several layers of material and a heating system. Such windscreens are replaced upon damage to any of their components. A dataset was collected on the times to replacement of n = 84 windscreens of a particular modern airliner. The sample mean windscreen replacement time was 23 515 hours of flight. The sample standard deviation of windscreen replacement times was 5168 hours of flight.

(i) Obtain an approximate 90% confidence interval for the mean replacement time of this type of aircraft windscreen. What property of the dataset justifies using this type of confidence interval, and why? [6]
(ii) Interpret the particular confidence interval that you found in part (a)(i) in terms of repeated experiments. [3]
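
The hand calculation in part (a)(i) can be cross-checked in code. The sketch below assumes a large-sample z-interval is the intended method and takes the normal quantile from Python's standard library instead of printed tables, so the last digit may differ slightly from a table-based answer. The function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(mean, sd, n, level):
    """Large-sample confidence interval for a population mean."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # 0.95-quantile for a 90% interval
    margin = z * sd / sqrt(n)
    return mean - margin, mean + margin

# Summary statistics quoted in the question (hours of flight).
lower, upper = z_interval(mean=23515, sd=5168, n=84, level=0.90)
print(lower, upper)
```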

(b) In this part of the question, you should calculate the required confidence interval by hand, using tables, and show your working. (You may use Minitab to check your answers, if you wish.)
In a large study of patients who were being treated for hypertension (high blood pressure), 148 out of 5493 patients receiving the conventional treatment for hypertension later suffered a stroke. Also, 192 out of 5492 patients receiving an alternative drug to treat their hypertension later suffered a stroke.

(i) Obtain an approximate 95% confidence interval for the difference in proportions between the number of conventionally treated hypertension patients who later suffered a stroke and the number of hypertension patients treated with the alternative drug who later suffered a stroke. (You are advised to work with proportions rounded to four decimal places throughout; also, you may assume that the numbers involved are large enough that your approximation is a good one.) [5]

(ii) Some clinicians had suggested that the proportions of hypertension patients who suffered a stroke would not depend on which treatment they were being given. Are the data consistent with that
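
The interval asked for in part (b)(i) can be cross-checked as follows. This sketch assumes the usual large-sample interval for a difference of two proportions with unpooled standard error. Unlike the hand calculation, it does not round the proportions to four decimal places, so small discrepancies in the final digit are to be expected.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_interval(x1, n1, x2, n2, level=0.95):
    """Approximate large-sample interval for p1 - p2."""
    p1, p2 = x1 / n1, x2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    diff = p1 - p2
    return diff - z * se, diff + z * se

# Conventional treatment first, alternative drug second.
lower, upper = two_proportion_interval(148, 5493, 192, 5492)
print(lower, upper)
```

Whether zero lies inside the resulting interval is what bears on the clinicians' suggestion in part (b)(ii).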

(c) In various places in this module, data on the silver content of coins minted in the reign of the twelfth-century Byzantine king Manuel I Comnenus have been considered. The full dataset is in the Minitab file coins.mtw. The dataset includes, amongst others, the values of the silver content of nine coins from the first coinage (variable Coin1) and seven from the fourth coinage (variable Coin4) which was produced a number of years later. (For the purposes of this question, you can ignore the variables Coin2 and Coin3.) In particular, in Activity 8 and Exercise 2 of Computer Book B, it was argued that the silver contents in both the first and the fourth coinages can be assumed to be normally distributed. The question of interest is whether there were differences in the silver content of coins minted early and late in Manuel’s reign. You are about to investigate this question using a two-sample t-interval.

(i) Using Minitab, find either the sample standard deviations of the two variables Coin1 and Coin4, or their sample variances. Hence, check for equality of variances using the rule of thumb given in Subsection 4.4 of Unit 8. [3]

(ii) Whatever the outcome of part (c)(i), use Minitab to obtain a 90% two-sample t-interval for the difference E(X1) − E(X4) where X1 denotes the mean silver content in coins of the first coinage and X4 denotes the mean silver content in coins of the fourth coinage.

State that interval and comment briefly on what it tells us about the silver content of coins in the earlier and later coinages. [3]

(iii) Name the distribution used in constructing the confidence interval in part (c)(ii), state the value of its parameter and show why the parameter takes the value that it does. [2]
(iv) What would have been the outcome if you had obtained a 90% two-sample t-interval for E(X4) − E(X1) instead of for E(X1) − E(X4)? Justify your conclusion in terms of the derivative of the parameter transformation involved. [3]
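
The mechanics of part (c) can be sketched from summary statistics. The code below assumes the pooled (equal-variance) two-sample t-interval and takes the t quantile as an input read from tables. The sample sizes 9 and 7 come from the question, but the means and variances are hypothetical and are not taken from coins.mtw. The variance-ratio helper only computes the ratio; the threshold to compare it with is the one stated in Subsection 4.4 of Unit 8.

```python
from math import sqrt

def pooled_t_interval(mean1, var1, n1, mean2, var2, n2, t_crit):
    """Two-sample t-interval for E(X1) - E(X2), assuming equal variances.

    `t_crit` is the quantile read from t tables on n1 + n2 - 2 degrees of freedom.
    """
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
    se = sqrt(pooled_var * (1 / n1 + 1 / n2))
    diff = mean1 - mean2
    return diff - t_crit * se, diff + t_crit * se, df

def variance_ratio(var1, var2):
    """Larger sample variance divided by the smaller, for a rule-of-thumb check."""
    return max(var1, var2) / min(var1, var2)

# Hypothetical summary statistics; these are NOT taken from coins.mtw.
m1, v1, n1 = 10.0, 4.0, 9
m4, v4, n4 = 8.0, 2.0, 7
t_crit = 1.761  # 0.95-quantile of t(14), as used for a 90% interval

lo_14, hi_14, df = pooled_t_interval(m1, v1, n1, m4, v4, n4, t_crit)
lo_41, hi_41, _ = pooled_t_interval(m4, v4, n4, m1, v1, n1, t_crit)
print(df, variance_ratio(v1, v4))
print((lo_14, hi_14), (lo_41, hi_41))
```

Swapping the order of the two samples negates both endpoints and exchanges them, which is the behaviour part (c)(iv) asks about.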


## Data Analytics using Regression Model

Suppose that a resource allocation decision is being faced whereby one must decide how many computer servers a service facility should purchase to optimize the firm’s costs of running the facility. The more servers they have, the fewer workers are needed. Too many servers will result in over-capacity and wasted resources. The firm’s predictive analytics effort has shown a growth trend. A new facility is called for if costs can be minimized. The firm has a history of setting up large and small service facilities and has collected 40 data points. Let’s consider the following linear model and estimate it using the data.

Linear Model

where COST = the total cost to maintain a service facility, and
X = the number of servers installed in each service facility.

Copy the Excel data into Minitab and answer the following questions.

1) Estimate the model, copy and paste the results, and explain the meanings of the estimated coefficients.
2) Find TSS, RSS, ESS and R-squared, and carefully explain their meanings.
3) Using a t test, determine whether the estimated coefficient b is significant.
4) What is the elasticity of total cost with respect to the number of servers when there are 20 servers, and when there are 40 servers?
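
The quantities in questions 1) to 4) can be illustrated with a small least-squares sketch. It assumes the linear model has the form COST = a + bX + e and follows this document's convention that RSS is the regression sum of squares and ESS the error sum of squares. The five data points are hypothetical and are not the assignment's 40 observations.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y = a + b*x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b

def sums_of_squares(xs, ys, a, b):
    """Return (TSS, RSS, ESS) with RSS = regression SS and ESS = error SS."""
    y_bar = sum(ys) / len(ys)
    fitted = [a + b * x for x in xs]
    tss = sum((y - y_bar) ** 2 for y in ys)
    rss = sum((f - y_bar) ** 2 for f in fitted)
    ess = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    return tss, rss, ess

def point_elasticity(a, b, x):
    """Elasticity of fitted cost with respect to x in a linear model."""
    return b * x / (a + b * x)

# Hypothetical data, not the assignment's 40 observations.
servers = [10, 20, 30, 40, 50]
cost = [120, 95, 80, 70, 65]
a, b = fit_line(servers, cost)
tss, rss, ess = sums_of_squares(servers, cost, a, b)
r_squared = rss / tss
print(a, b, r_squared)
print(point_elasticity(a, b, 20), point_elasticity(a, b, 40))
```

In a linear model the elasticity changes with X, which is why the question asks for it at two different server counts.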

5) Let’s consider the following log-linear model

Explain the meaning of the coefficient b and find the elasticity of total cost with respect to the number of servers.

6) Linear and Nonlinear Polynomial Models (1 point each)

a. Estimate each model, copy and paste the results, and perform the F test for each model.
b. Compare the two models, linear vs. nonlinear. In terms of goodness of fit, which one fits better? Carefully explain.
c. According to each model, what is the total cost to maintain the facility if you install 10, 20 or 50 servers?
d. Choose the best regression model in terms of goodness of fit, and find the number of servers that minimizes the total cost of the service facility.
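
Parts c and d can be sketched once coefficient estimates are available. The code below assumes the nonlinear polynomial model is a quadratic, COST = a + bX + cX², which is a guess because the model equation is not reproduced here. The coefficient values are hypothetical and are not estimates from the assignment data.

```python
def quadratic_cost(a, b, c, x):
    """Predicted cost from the assumed model COST = a + b*X + c*X**2."""
    return a + b * x + c * x ** 2

def cost_minimising_servers(b, c):
    """Stationary point -b / (2c); it is a minimum only when c > 0."""
    if c <= 0:
        raise ValueError("the fitted quadratic has no interior minimum")
    return -b / (2 * c)

# Hypothetical coefficient estimates, not results from the assignment data.
a, b, c = 500.0, -12.0, 0.2
predictions = {x: quadratic_cost(a, b, c, x) for x in (10, 20, 50)}
best_x = cost_minimising_servers(b, c)
print(predictions, best_x)
```

The guard on the sign of c matters: a fitted quadratic with c ≤ 0 has no interior cost minimum, so the best choice would lie at the edge of the observed range.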

II. Single Family House Sales in Chicago
We obtained house sales data from the local Multiple Listing Service (MLS), which provides up-to-date real estate listing prices. The following variables were obtained for properties listed in Chicago in 2015.
BEDROOM : Number of bedrooms
BATHROOM : Number of bathrooms
SQFT : Square feet of living area
GARAGE : Number of cars in garage
AGEBLD : Age of building
FIREPLACE : Number of fireplaces
ZIP : Zip code
PRICE : Listing price

1. Find the descriptive statistics of Listing Price (PRICE) for the two zip codes separately, and compare their central tendency and variance using the following hypothesis tests: (1 point each)
2. Simple Regression Model (Estimate a separate model for each zip code)
Let’s consider the following simple regression model:

1) Estimate the simple regression model, copy and paste the Minitab regression output, and explain the meaning of the coefficients from each model. (1 point)

2) Using the simple regression output find the following statistics. (0.5 point each)

| Statistics | ZIP CODE 1 = | ZIP CODE 2 = |
| --- | --- | --- |
| a. Estimated intercept | | |
| b. Estimated slope coefficient | | |
| c. Total Sum of Squares (TSS) | | |
| d. Regression Sum of Squares (RSS) | | |
| e. Error Sum of Squares (ESS) | | |
| f. R² | | |
| h. Variance and standard error of b1 | | |
| i. Correlation coefficient between listing price (PRICE) and square feet (SQFT) | | |
| j. Variance of e_t | | |
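
Items h to j above follow from the same sums that give the slope. The sketch below assumes the simple regression is PRICE = b0 + b1·SQFT + e_t and estimates the variance of e_t as ESS/(n − 2). The data pairs are hypothetical (prices in thousands of dollars) and are not the MLS listings.

```python
from math import sqrt

def slope_diagnostics(xs, ys):
    """Slope, its variance and standard error, t statistic, residual variance, correlation."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    ess = sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))
    resid_var = ess / (n - 2)  # estimate of Var(e_t)
    var_b1 = resid_var / sxx
    return {
        "b1": b1,
        "var_b1": var_b1,
        "se_b1": sqrt(var_b1),
        "t": b1 / sqrt(var_b1),
        "resid_var": resid_var,
        "corr": sxy / sqrt(sxx * syy),
    }

# Hypothetical (SQFT, PRICE in $1000s) pairs, not the MLS data.
sqft = [1000, 1500, 2000, 2500, 3000]
price = [210, 290, 405, 490, 610]
diagnostics = slope_diagnostics(sqft, price)
print(diagnostics)
```

In a simple regression the squared correlation equals R², which gives a quick consistency check between items f and i.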

3. Nonlinear Model (Estimate a separate model for each zip code, 1 point each)
Let’s consider the following log transformed model.

1) Estimate the model and copy and paste the results, and explain the meanings of the estimated slope coefficients from each regression model.

2) Compare and explain the elasticity of square feet to price between two zip codes, which is

4. Multiple Regression Model (Estimate a separate model for each zip code, 1 point each)

1) Estimate the model using Minitab, and copy and paste the results.
2) Explain the meanings of the estimated coefficients.
3) Perform the t tests to find which variables are significant. List all significant variables at 5% and 10% significance levels.
4) Perform the F test for each regression model and explain your verdict from the test.
5) Compare the simple regression and the multiple regression models for each zip code. Carefully explain which is better.

6) Challenging model (3 points)
Now let’s find the best model to explain the listing price using the given variables. Any combination or any different functional forms are allowed. Find the best possible model. After deciding your final model, justify why your model is better than the other models.
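
When comparing models with different numbers of regressors, as in the F tests and model comparisons above, the overall F statistic and adjusted R-squared can both be computed from R-squared alone. The sketch below uses hypothetical R-squared values and a hypothetical sample size. It is one common way to compare fits, not necessarily the criterion the assignment intends.

```python
def overall_f_statistic(r_squared, n, k):
    """F statistic for H0: all k slope coefficients are zero, from R-squared."""
    return (r_squared / k) / ((1 - r_squared) / (n - k - 1))

def adjusted_r_squared(r_squared, n, k):
    """R-squared penalised for the number of regressors k."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# Hypothetical fits on n = 60 listings: SQFT only versus six regressors.
n = 60
fits = {"simple": (0.70, 1), "multiple": (0.72, 6)}
summary = {
    name: (overall_f_statistic(r2, n, k), adjusted_r_squared(r2, n, k))
    for name, (r2, k) in fits.items()
}
for name, (f_stat, adj_r2) in summary.items():
    print(name, round(f_stat, 2), round(adj_r2, 4))
```

With these made-up numbers the multiple regression has the higher R-squared but the lower adjusted R-squared, which shows why raw R-squared alone is a weak basis for choosing between models of different size.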