STA 9700: Homework 6
As with the previous assignment, this is a first draft of a section of your semester project.
Lecture Notes 6: The Matrix Approach to Regression
- Simple Linear Regression in Matrix Terms
In this exercise, you will use the matrix approach to do a simple linear regression of y on x for just one of your x-variables. Use a modest subset of rows, say, n=8, which is the sample size I used in the Excel example.
(a) Show the X matrix using your x-variable and the necessary column of 1's.
(b) Show the y-vector, composed of your observed y-values.
(c) In Excel, use matrix operations to compute the b-vector, Hat matrix, and the y-hat vector. Show your Excel work, using the posted Excel file as a guide.
(d) Using R or Proc Reg, run the regression of y on x and show the output. Check that the b-vector agrees with the values found using the matrix approach in Excel. If the values do not agree, remain calm. Just state, "There was a problem," and go on. You can fix it later.
(e) On page 7 of STA 9700 Lecture Notes 6, it is shown that for simple regression we can compute the entries of the hat matrix directly from the data. Compute the value of h2,3 by the method shown, and check that your result matches the entry in the second row, third column of the Excel hat matrix. If the result does not match, don't drive yourself crazy: double-check your work once, then just say, "There was a problem," and go on.
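As a cross-check on the Excel work, the matrix steps in parts (a) through (e) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the x and y values are made up (replace them with your own eight rows), the helper names are hypothetical, and the closed form used for h(i, j) is the standard one for simple regression, h_ij = 1/n + (x_i - x_bar)(x_j - x_bar)/Sxx, which may or may not be written the same way in the Lecture Notes.

```python
# Hypothetical data (n = 8); replace with your own x and y columns.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
n = len(x)

# (a) X matrix: a column of 1's next to the x column.
X = [[1.0, xi] for xi in x]
# (b) y-vector, stored as an n-by-1 column.
y_col = [[yi] for yi in y]

def transpose(A):
    return [list(row) for row in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(u * v for u, v in zip(row, col)) for col in Bt] for row in A]

def inverse_2x2(M):
    (p, q), (r, s) = M
    det = p * s - q * r
    return [[s / det, -q / det], [-r / det, p / det]]

# (c) b = (X'X)^-1 X'y,  H = X (X'X)^-1 X',  y-hat = H y
Xt = transpose(X)
XtX_inv = inverse_2x2(matmul(Xt, X))
b = matmul(XtX_inv, matmul(Xt, y_col))
H = matmul(matmul(X, XtX_inv), Xt)
y_hat = matmul(H, y_col)

# (e) Closed form for a hat-matrix entry in simple regression (1-based i, j).
x_bar = sum(x) / n
Sxx = sum((xi - x_bar) ** 2 for xi in x)

def h(i, j):
    return 1.0 / n + (x[i - 1] - x_bar) * (x[j - 1] - x_bar) / Sxx

print("b0, b1:", b[0][0], b[1][0])
print("h(2,3) from closed form:", h(2, 3))
print("h(2,3) from hat matrix: ", H[1][2])
```

The two printed values of h(2,3) should agree up to rounding, which is the same comparison part (e) asks you to make against the Excel hat matrix.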
Multiple Linear Regression in Matrix Terms
In this exercise, add a second regressor variable to the data set used above.
(a) Show the new X matrix.
(b) Use R or SAS Proc IML to fit the model, following the example given in the Lecture Notes (which is also contained in the posted SAS example file). Show the output of that program.
(c) Using Proc Reg in SAS, check the values in the Proc IML b-vector. If the output does not match, just say so. We will get it corrected later!
(d) Show the Proc IML program you used.
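The same matrix calculation extends to two regressors. The sketch below is written in Python rather than Proc IML, purely as an independent check, and it solves the normal equations (X'X) b = X'y by Gauss-Jordan elimination instead of forming the inverse explicitly. The data values and the function name solve are hypothetical; substitute your own columns.

```python
# Hypothetical data (n = 8) with two regressors; substitute your own columns.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
y = [4.2, 4.9, 8.1, 8.0, 12.3, 16.0, 14.1, 18.2]

# New X matrix: column of 1's, then x1, then x2.
X = [[1.0, u, v] for u, v in zip(x1, x2)]
p = 3  # number of columns in X

# Normal equations: (X'X) b = X'y
XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]

def solve(A, rhs):
    """Solve A b = rhs by Gauss-Jordan elimination with partial pivoting."""
    m = len(rhs)
    M = [list(A[i]) + [rhs[i]] for i in range(m)]  # augmented matrix
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        piv = M[col][col]
        M[col] = [v / piv for v in M[col]]
        for r in range(m):
            if r != col:
                f = M[r][col]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[col])]
    return [M[i][m] for i in range(m)]

b = solve(XtX, Xty)  # [b0, b1, b2]
y_hat = [sum(bj * xj for bj, xj in zip(b, row)) for row in X]
residuals = [yi - fi for yi, fi in zip(y, y_hat)]

print("b-vector:", b)
```

A quick sanity check is that the residuals are orthogonal to every column of X; if they are not, the b-vector is wrong.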
Download the Excel dataset named online_buying. The dataset consists of purchases made by consumers from a company/firm across two channels: online and the physical store (multi-channel purchases). More specifically, the dataset has the following columns:
1) Customer: The Id of the customer.
2) Online_buy: This takes the value of 1 if the consumer buys online and 0 if he/she buys in the physical store.
3) distance: The distance in miles from the customer's home/residence to the physical store.
4) experience: The number of months that the customer has been shopping with the firm. Thus, this can be interpreted as the experience the customer has with the firm.
5) c_rank: This is the convenience rank. Customers were asked how convenient it would be for them to shop online versus the physical store on a scale from 1 (least convenient) to 4 (most convenient).
- Formulate and estimate a logistic model to understand how these factors affect the likelihood of a customer shopping online. Comment on the model parameter estimates; you can also calculate the odds ratios. What are the managerial implications of your findings?
- For each customer, calculate the probability of him/her shopping online. Do this from the model estimates and the data, using the formula for p. You can do this either in Excel or in SAS.
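The probability calculation in the second task can be sketched as follows. The sketch is in Python rather than Excel or SAS, and it makes several assumptions: the coefficient values are invented placeholders (not estimates from online_buying), c_rank is treated as a numeric predictor, and "the formula for p" is taken to be the usual logistic form p = 1 / (1 + exp(-(b0 + b1*distance + b2*experience + b3*c_rank))).

```python
import math

# Hypothetical coefficient values, NOT estimates from the online_buying data.
coef = {"intercept": -2.0, "distance": 0.15, "experience": 0.03, "c_rank": 0.60}

def prob_online(distance, experience, c_rank):
    """Logistic probability of buying online for one customer."""
    eta = (coef["intercept"]
           + coef["distance"] * distance
           + coef["experience"] * experience
           + coef["c_rank"] * c_rank)
    return 1.0 / (1.0 + math.exp(-eta))

# Odds ratio for a one-unit increase in each predictor: exp(coefficient).
odds_ratios = {k: math.exp(v) for k, v in coef.items() if k != "intercept"}

# Example customer: 5 miles away, 12 months of experience, convenience rank 3.
print("p =", prob_online(5.0, 12.0, 3))
print("odds ratios:", odds_ratios)
```

An odds ratio above 1 means the odds of buying online rise as that predictor increases, holding the others fixed; below 1 means they fall.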
Download the Excel dataset brand_choice. You have disaggregate data on the choice of brands (denoted by 1, 2 and 3) in the beverage category by 50 consumers. Let’s say brands 1, 2 and 3 are Fuze, Honestea and Suja respectively. The data set consists of the following:
1) Customer_id: This denotes/indexes the customer who is making the purchase.
2) Price: This indicates the price per 12 oz. of the brand.
3) Brand: If the particular brand is chosen by the customer, then this takes the value of 1; if the brand is not chosen then this takes the value of 0.
4) Product: This number tells you which brand it is: brands 1, 2 and 3 are Fuze, Honestea and Suja respectively.
The objective is to model the consumer’s choice of beverage brands using brands’ pricing information. Estimate the model and interpret the results of the model.
Note: You should use proc mdc to do this.
You can modify the code given below and use it. You should first generate the intercept terms in the data for each of the products.
You can do this as follows.
· When product=1, set int1=1; when product=2 or product=3, set int1=0.
· When product=2, set int2=1; when product=1 or product=3, set int2=0.
· When product=3, set int3=1; when product=1 or product=2, set int3=0.
So int1=1 for product 1 only and 0 otherwise; int2=1 for product 2 only and 0 otherwise; int3=1 for product 3 only and 0 otherwise. You can do the above in Excel or SAS.
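The indicator rule above amounts to one comparison per column. A minimal sketch, written in Python rather than Excel or SAS and using a few made-up rows (the dictionary layout is an assumption, not the actual file format):

```python
# Hypothetical rows: one customer facing the three products.
rows = [
    {"customer_id": 1, "product": 1},
    {"customer_id": 1, "product": 2},
    {"customer_id": 1, "product": 3},
]

# intK = 1 when product == K, and 0 otherwise.
for row in rows:
    for k in (1, 2, 3):
        row[f"int{k}"] = 1 if row["product"] == k else 0

for row in rows:
    print(row)
```

Exactly one of int1, int2 and int3 equals 1 on every row, which is an easy thing to verify before running the model.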
/* group rows by customer */
proc sort data=brand_choice;
   by customer_id;
run;

/* int3 omitted: brand 3 is the baseline */
proc mdc data=brand_choice;
   model brand = int1 int2 price / type=clogit nchoice=3;
   id customer_id;
run;
Please read the article attached (“Retailers’ Emails Are Misfires for Many Holiday Shoppers”) and answer the following questions.
1) In your own words, what does it mean to 'personalize' digital messaging? What tools, technologies and methods are necessary to personalize digital messaging?
2) The article says “the ugly truth is that most retailers haven’t done the unsexy work of understanding how to use the data”. How can retailers use the data available to them to better design personalized emails? What should they do? How can customers be targeted with the right product?
Disaggregate Choice Model project
You have disaggregate data of choice of brands (denoted by 1, 2 and 3) in the cereal category by 50 consumers. Let’s say brands 1, 2 and 3 are Kellogg’s, Post and General Mills respectively.
The details of the dataset are as follows (a snap-shot is provided below):
customer_id  price   brand  Brand_number  int1  int2  int3
1            1.6481  0      1             1     0     0
2            1.5123  0      1             1     0     0
3            1.9469  0      1             1     0     0
4            1.8847  0      1             1     0     0
5            1.2578  0      1             1     0     0
6            1.1513  1      1             1     0     0
7            1.0651  1      1             1     0     0
8            0.8359  1      1             1     0     0
9            1.1679  1      1             1     0     0
10           2.3237  0      1             1     0     0
11           1.3236  0      1             1     0     0
12           2.0052  0      1             1     0     0
13           1.8917  0      1             1     0     0
Customer_id: This denotes/indexes the customer who is making the purchase.
Price: This indicates the price per pound of the brand.
Brand: If the particular brand is chosen by the customer, then this takes the value of 1; if the brand is not chosen then this takes the value of 0.
Brand_number: This number tells you which brand it is: brands 1, 2 and 3 are Kellogg’s, Post and General Mills respectively.
Int1, int2 and int3: These are the brand-specific intercept (indicator) terms; each equals 1 when Brand_number matches its index and 0 otherwise.
The objective of this assignment is to model the consumer’s choice of cereal brands using brands’ pricing information. Estimate the model and interpret the results of the model.
You should use proc mdc to do this. See the class notes for the appropriate code.
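When interpreting the estimates, it can help to see how a conditional logit model turns intercepts and a price coefficient into choice probabilities: P(brand j) = exp(a_j + b*price_j) / sum over k of exp(a_k + b*price_k). The Python sketch below illustrates this with invented coefficient values (not estimates from the cereal data) and assumes the third brand's intercept is fixed at zero as the baseline; any one brand could serve as the baseline instead.

```python
import math

# Hypothetical estimates, NOT results from the cereal data.
alpha = {1: 0.5, 2: 0.2, 3: 0.0}  # brand intercepts; brand 3 is the baseline
beta_price = -1.2                 # common price coefficient

def choice_probabilities(prices):
    """Conditional logit probabilities for one customer's choice set.

    prices maps brand number -> price faced by that customer.
    """
    utilities = {j: alpha[j] + beta_price * prices[j] for j in prices}
    m = max(utilities.values())  # subtract the max for numerical stability
    expu = {j: math.exp(u - m) for j, u in utilities.items()}
    total = sum(expu.values())
    return {j: e / total for j, e in expu.items()}

probs = choice_probabilities({1: 1.65, 2: 1.50, 3: 1.95})
print(probs)
```

With a negative price coefficient, raising one brand's price lowers its own choice probability and raises the others', which is the pattern to look for when interpreting the fitted model.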
Scenario: Melbourne Property Prices
An important issue to many young Australians is housing affordability with recent booms in property prices in Australia’s largest cities making it more difficult for first-time homebuyers to enter the market. Sellers and buyers alike are interested in what drives property prices, and buyers are generally interested in knowing where bargains can be found.
We will consider data on more than 23,000 properties listed on domain.com.au in the Melbourne metropolitan area between January 2016 and September 2017.
These data include key geographic information (e.g., suburb, address, distance from Melbourne CBD), property information (e.g., property type, number of rooms, land and building area), and selling information (e.g., agent, sale method, price).
For this project, there are three primary questions to be investigated:
(a) Does property price increase the closer you get to the Melbourne CBD, and does the relationship between distance from the Melbourne CBD and property price change depending on the property type?
(b) What factors are most relevant in helping to predict property prices, and which general region (REGION NAME) appears to be the best bargain for houses (i.e. excluding other property types) based on what you would predict house prices to be for that region?
(c) Are there certain (non-geographic) attributes of properties that characterize a general region (REGION NAME)? In other words, is (non-geographic) information on the property sufficient to allow a buyer to understand where a property is likely to be located?
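One possible way to formalize question (a) is a linear regression with a distance-by-property-type interaction. The specification below is only a sketch under assumptions: the symbols are introduced here for illustration, Type_{ik} stands for indicator variables for the non-baseline property types, and whether to model price or a transformation of it (such as its logarithm) is a choice the descriptive statistics should inform.

```latex
\text{Price}_i = \beta_0 + \beta_1\,\text{Distance}_i
  + \sum_{k} \gamma_k\,\text{Type}_{ik}
  + \sum_{k} \delta_k\,\bigl(\text{Distance}_i \times \text{Type}_{ik}\bigr)
  + \varepsilon_i
```

Under this form, a negative \beta_1 would indicate that price rises as distance to the CBD falls for the baseline property type, and a nonzero \delta_k would indicate that the distance slope differs for property type k.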
Methods and Analysis:
Provides a description of the data and the statistical analyses that will be performed. If a linear regression, principal component analysis, or linear discriminant analysis is to be carried out, this section should provide an explanation of and motivation for the variables that are included in the model. This section should also include descriptive statistics (statistics, tables, graphs) that are useful in describing the data and providing a glimpse of what you might expect from your statistical analyses. A good deal of thought should go into your descriptive statistics, as these must clearly show some relevance to your questions of interest, and you must explain what you can derive from these.
Results:
Provides a thorough description of the results of the analyses you described in the previous section. Include tables with relevant output. If analyses are carried out that involve the estimation of parameters, this should include an interpretation of the parameters for the variables of interest. Any issues with significant violations of the requirements/assumptions needed to perform the analyses carried out must be addressed.
R code and summary output should not be pasted into the document, but instead relevant results should be presented in nicely formatted tables.