
STA 9700: Homework 6

As with the previous assignment, this is a first draft of a section of your semester project.

Lecture Notes 6 The Matrix Approach to Regression

  1. Simple Linear Regression in Matrix Terms

In this exercise, you will use the matrix approach to do a simple linear regression of y on x for just one of your x-variables. Use a modest subset of rows, say, n=8, which is the sample size I used in the Excel example.

(a) Show the X matrix using your x-variable and the necessary column of 1's.
(b) Show the y-vector, composed of your observed y-values.

(c) In Excel, use matrix operations to compute the b-vector, Hat matrix, and the y-hat vector. Show your Excel work, using the posted Excel file as a guide.

(d) Using R or Proc Reg, run the regression of y on x and show the output. Check that the b-vector agrees with the values found using the matrix approach in Excel. If the values do not agree, remain calm. Just state, "There was a problem," and go on. You can fix it, later.

(e) On page 7 of STA 9700 Lecture Notes 6, it is shown that for simple regression we can compute the entries of the hat matrix directly from the data. Compute the value of h_{2,3} by the method shown, and check that your result matches the value in the second row, third column of the Excel hat matrix. If the result does not match, don't drive yourself crazy: double-check your work once, and then just say, "There was a problem" and go on.
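The matrix steps in parts (c) through (e) can be sketched in a few lines of Python as a cross-check on the Excel work. This is only an illustration under stated assumptions: the x and y values are made up, the helper functions are hypothetical, and the direct hat-matrix formula used here is the standard identity h_ij = 1/n + (x_i - xbar)(x_j - xbar)/Sxx, which is assumed (not confirmed) to be the method shown on page 7 of the notes.

```python
# Hypothetical data (n = 8); replace with your own x and y values.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
n = len(x)

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def inverse_2x2(M):
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

X = [[1.0, xi] for xi in x]          # column of 1's, then the x-variable
Y = [[yi] for yi in y]               # y as a column vector
Xt = transpose(X)
XtX_inv = inverse_2x2(matmul(Xt, X))

b = matmul(XtX_inv, matmul(Xt, Y))   # b = (X'X)^-1 X'y
H = matmul(matmul(X, XtX_inv), Xt)   # hat matrix H = X (X'X)^-1 X'
y_hat = matmul(H, Y)                 # fitted values: y-hat = H y

# Direct formula for one hat-matrix entry in simple regression:
# h_ij = 1/n + (x_i - xbar)(x_j - xbar) / Sxx
xbar = sum(x) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
h23_direct = 1 / n + (x[1] - xbar) * (x[2] - xbar) / Sxx

print("b0, b1 =", b[0][0], b[1][0])
print("h[2,3] from H:", H[1][2], " direct:", h23_direct)
```

Python indexes from 0, so the entry in the second row and third column is `H[1][2]`.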

  2. Multiple Linear Regression in Matrix Terms

In this exercise, add a second regressor variable to the data set used above.

(a) Show the new X matrix.

(b) Use R or SAS Proc IML to fit the model, following the example given in the Lecture Notes (which is also contained in the posted SAS example file). Show the output of that program.

(c) Using Proc Reg in SAS, check the values in the Proc IML b-vector. If the output does not match, just say so. We will get it corrected, later!

(d) Show the Proc IML program you used.
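For comparison with the Proc IML output, the same b-vector can be obtained by solving the normal equations (X'X)b = X'y. The sketch below is not the Lecture Notes program: the data are hypothetical, and `solve` is an illustrative helper written for this example.

```python
# Hypothetical data: n = 8 rows, two regressors x1 and x2.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0]
y = [3.1, 3.9, 7.2, 7.8, 11.1, 11.9, 15.2, 15.8]

X = [[1.0, a, b] for a, b in zip(x1, x2)]   # columns: 1's, x1, x2
p = len(X[0])

# Build the normal equations (X'X) b = X'y.
XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]

def solve(A, rhs):
    """Solve A b = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    M = [A[i][:] + [rhs[i]] for i in range(n)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [mr - factor * mc for mr, mc in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

b = solve(XtX, Xty)                          # b-vector: b0, b1, b2
residuals = [yi - sum(bj * xj for bj, xj in zip(b, row)) for row, yi in zip(X, y)]
print("b =", b)
```

A quick sanity check on any least-squares fit is that the residuals are orthogonal to every column of X, including the column of 1's.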


Excel data file

Disaggregate Choice Model project
You have disaggregate data of choice of brands (denoted by 1, 2 and 3) in the cereal category by 50 consumers. Let’s say brands 1, 2 and 3 are Kellogg’s, Post and General Mills respectively.

The details of the dataset are as follows (a snap-shot is provided below):

customer_id    price     brand    Brand_number    int1    int2    int3
 1             1.6481      0           1            1       0       0
 2             1.5123      0           1            1       0       0
 3             1.9469      0           1            1       0       0
 4             1.8847      0           1            1       0       0
 5             1.2578      0           1            1       0       0
 6             1.1513      1           1            1       0       0
 7             1.0651      1           1            1       0       0
 8             0.8359      1           1            1       0       0
 9             1.1679      1           1            1       0       0
10             2.3237      0           1            1       0       0
11             1.3236      0           1            1       0       0
12             2.0052      0           1            1       0       0
13             1.8917      0           1            1       0       0

Customer_id: This denotes/indexes the customer who is making the purchase.
Price: This indicates the price per pound of the brand.
Brand: If the particular brand is chosen by the customer, then this takes the value of 1; if the brand is not chosen then this takes the value of 0.
Brand_number: This number tells you which brand it is: brands 1, 2 and 3 are Kellogg’s, Post and General Mills respectively.
Int1, int2 and int3: these are the intercept terms that are used.

The objective of this assignment is to model the consumer’s choice of cereal brands using brands’ pricing information. Estimate the model and interpret the results of the model.

You should use proc mdc to do this. See the class notes for the appropriate code.
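One model that fits this setup is the conditional logit, in which each brand's utility is a brand intercept plus a price coefficient times that brand's price, and the choice probabilities are a softmax over the utilities. The sketch below only illustrates how fitted coefficients turn into choice probabilities. It is an assumption that this is the specification used in the class notes, and every number in it is hypothetical rather than an estimate from the data.

```python
import math

def choice_probabilities(prices, intercepts, beta_price):
    """Conditional logit: P(j) = exp(V_j) / sum_k exp(V_k), with V_j = a_j + beta * price_j."""
    utilities = [a + beta_price * p for a, p in zip(intercepts, prices)]
    m = max(utilities)                        # subtract the max for numerical stability
    expu = [math.exp(v - m) for v in utilities]
    total = sum(expu)
    return [e / total for e in expu]

# Hypothetical values, for illustration only (not estimates from the data).
prices = [1.65, 1.40, 1.90]        # price per pound of brands 1, 2, 3
intercepts = [0.5, 0.2, 0.0]       # brand 3 treated as the baseline (intercept 0)
beta_price = -2.0                  # negative: a higher price lowers utility

probs = choice_probabilities(prices, intercepts, beta_price)
print([round(p, 3) for p in probs])
```

Only differences in utility matter in this model, so with three brands at most two brand intercepts can be estimated; one brand serves as the baseline.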

Scenario: Melbourne Property Prices

An important issue to many young Australians is housing affordability, with recent booms in property prices in Australia’s largest cities making it more difficult for first-time homebuyers to enter the market. Sellers and buyers alike are interested in what drives property prices, and buyers are generally interested in knowing where bargains can be found.

We will consider data on more than 23,000 properties listed in the Melbourne metropolitan area between January 2016 and September 2017.

These data include key geographic information (e.g., suburb, address, distance from Melbourne CBD), property information (e.g., property type, number of rooms, land and building area), and selling information (e.g., agent, sale method, price).

For this project, there are three primary questions to be investigated:

(a) Does property price increase the closer you get to the Melbourne CBD, and does the relationship between distance from the Melbourne CBD and property price change depending on the property type?

(b) What factors are most relevant in helping to predict property prices, and which general region (REGION NAME) appears to be the best bargain for houses (i.e. excluding other property types) based on what you would predict house prices to be for that region?

(c) Are there certain (non-geographic) attributes of properties that characterize a general region (REGION NAME)? In other words, is (non-geographic) information on the property sufficient to allow a buyer to understand where a property is likely to be located?
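Question (a) asks whether the price-distance relationship differs by property type, which is commonly expressed as a regression with an interaction term. One possible formulation is sketched below; the symbols are illustrative assumptions, with D_{ik} a dummy variable equal to 1 when property i is of type k (one type is left out as the reference), and the response could equally be a transformation of price.

```latex
\text{price}_i = \beta_0 + \beta_1\,\text{distance}_i
  + \sum_{k} \gamma_k\, D_{ik}
  + \sum_{k} \delta_k\,\bigl(\text{distance}_i \times D_{ik}\bigr)
  + \varepsilon_i
```

Under this sketch the distance slope is \beta_1 for the reference type and \beta_1 + \delta_k for type k, so a nonzero \delta_k indicates that the relationship changes with property type.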

Methods and Analysis:
Provides a description of the data and the statistical analyses that will be performed. If a linear regression, principal component analysis, or linear discriminant analysis is to be carried out, this section should provide an explanation of and motivation for the variables that are included in the model. This section should also include descriptive statistics (statistics, tables, graphs) that are useful in describing the data and providing a glimpse of what you might expect from your statistical analyses. A good deal of thought should go into your descriptive statistics, as these must clearly show some relevance to your questions of interest, and you must explain what you can derive from these.

Results:
Provides a thorough description of the results of the analyses you described in the previous section. Include tables with relevant output. If analyses are carried out that involve the estimation of parameters, this should include an interpretation of the parameters for the variables of interest. Any issues with significant violations of the requirements/assumptions needed to perform the analyses carried out must be addressed.

R code and summary output should not be pasted into the document, but instead relevant results should be presented in nicely formatted tables.



Technographics Dataset

The Forrester North American Technographics Benchmark survey is conducted every year by Forrester Research Co. to track trends in the usage and purchase of a variety of high-tech products (like computers) and services among tens of thousands of US and Canadian consumers. The survey is the largest in the world available for studying consumer attitudes toward and use of technology, and is commonly used by consumer and technology marketing companies for “product planning and go-to-market strategy assessments” (Business Wire 2008).

Download the SAS dataset named techno. The dataset (which is a sample of the original survey) has information on brand of PC bought by a panel of consumers, how they bought the PC and characteristics of the PC they bought. More specifically, the dataset has the following columns:
1) serial: consumer id
2) techimpress: Response to a Likert scale question “I like to impress people with my lifestyle. To what extent does this statement describe your attitude” with 1 being “Does not describe your attitudes at all” and 10 being “Describes your attitude completely”. Thus, a higher score here means that the person wants to impress people with their lifestyle.
3) techimp: Response to a Likert scale question “Technology is important to me. To what extent does this statement describe your attitude” with 1 being “Does not describe your attitudes at all” and 10 being “Describes your attitude completely”. Thus, a higher score here means that technology is important to the person.
4) competitive: Response to a Likert scale question “I am very competitive when it comes to my career. To what extent does this statement describe your attitude” with 1 being “Does not describe your attitudes at all” and 10 being “Describes your attitude completely”. Thus, a higher score means more competitive.
5) fun: Response to a Likert scale question “Having fun is the whole point of life. To what extent does this statement describe your attitude” with 1 being “Does not describe your attitudes at all” and 10 being “Describes your attitude completely”. Thus, a higher score means more fun loving.
6) liketech: Response to a Likert scale question “I like technology. To what extent does this statement describe your attitude” with 1 being “Does not describe your attitudes at all” and 10 being “Describes your attitude completely”.
7) age: respondent’s age
8) gender: =1 if the respondent is a female and = 2 if the respondent is a male
9) income: respondent’s income in $.
10) dma: id for geographic location.
11) price: Price the consumer paid for the PC (in $)
12) brand: The brand of the PC bought (Apple, Dell etc.).
13) Instorepurchase: “Yes” if the consumer bought the PC by physically visiting the store and “No” if the consumer bought the PC remotely online
14) graphicscard: “Yes” if the PC has a graphics card and “No” otherwise
15) broadband: “Yes” if the PC has a broadband adapter and “No” otherwise
16) cdrom: “Yes” if the PC has a CDROM Drive and “No” otherwise
17) externalstorage: “Yes” if the PC has external storage memory and “No” otherwise
18) cdwrite: “Yes” if the PC has a CD Writer and “No” otherwise
19) newbie: “Yes” if the consumer is buying a PC for the first time (ever) and “No” otherwise. So “Yes” means the consumer is a newbie.

  1. Calculate the shares of the two types of marketing channels (instore vs. online). Calculate the average prices of computers across the two channels.
  2. Summarize how many newbies buy their PC from a physical (brick-and-mortar) store and how many buy online. Comment on the results.
  3. Summarize the shares of the various brands at the aggregate level (across all regions) and at the individual regional level.
  4. Is there a correlation between the price the consumer paid and their income?
  5. Which is the costliest brand?
  6. Is there a correlation between the variables techimpress and techimp and the price paid by consumers? What do you conclude?
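The questions above reduce to a few summary computations: group shares, group averages, and a correlation coefficient. The sketch below shows their shape in Python rather than SAS. The five rows are hypothetical stand-ins for the techno data, and `pearson` is an illustrative helper, not a function from the course materials.

```python
from collections import defaultdict
import math

# Hypothetical rows (instorepurchase, price, income); real values come from techno.
rows = [
    ("Yes", 900.0, 40000.0),
    ("No", 1200.0, 65000.0),
    ("Yes", 700.0, 30000.0),
    ("Yes", 1500.0, 90000.0),
    ("No", 1000.0, 55000.0),
]

# Channel shares and average price per channel.
prices_by_channel = defaultdict(list)
for channel, price, _income in rows:
    prices_by_channel[channel].append(price)

shares = {c: len(v) / len(rows) for c, v in prices_by_channel.items()}
avg_price = {c: sum(v) / len(v) for c, v in prices_by_channel.items()}

def pearson(u, v):
    """Sample Pearson correlation coefficient between two equal-length lists."""
    n = len(u)
    ubar, vbar = sum(u) / n, sum(v) / n
    suv = sum((a - ubar) * (b - vbar) for a, b in zip(u, v))
    suu = sum((a - ubar) ** 2 for a in u)
    svv = sum((b - vbar) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

r = pearson([p for _, p, _ in rows], [inc for _, _, inc in rows])
print(shares, avg_price, round(r, 3))
```

The same grouping pattern applies to brand shares by region: group on the (region, brand) pair instead of the channel.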

BONUS POINTS: Formulate and estimate a regression model to understand how the different channels, and the different PC Characteristics affect the price of the PCs. Write the model formulation, estimate the model, and then report/analyze the results of the model.


STA 9700: Homework 2

Reading Assignment

Read STA 9700 Lecture Notes 2; read them again and write questions in the margins. (There is some related material in Kutner, pp. 2-27.)
STA 9708 LN 5 (Expectation and variance of random variables)

Questions based on STA 9700 Lecture Notes 2
2.1 Looking at Fig. 2.1 in Lecture Notes 2, we see that there is a general rise in the NetWt of the bags as the Count increases. While the phrase "general rise" is not clearly defined, it is certainly better than the following commonplace description, "Bags with more M&M's are heavier." That statement is far too simplistic!

(a) The data for Fig. 2.1 is shown on pages 15-17 of Lecture Notes 2. Using the data, give several examples of pairs of bags for which the statement "Bags with more M&M's are heavier" is false.

(b) Having shown that not all bags with more M&M's are heavier than all bags with fewer M&M's, consider this next vague description, "The average bag containing 18 M&M's weighs more than the average bag containing 17 M&M's." What is vague about that statement? Hint: which bag is the average bag? What is the definition of the average bag? (That is as hard as defining or locating the average American, which should be easy because we hear about that dude every day on the news.)

(c) Critique this statement: “Since on page 12 the sample slope is 1.276 when regressing net weight on count for the 192 bags, then the sample average for bags with Count=18 must be higher than for Count=17.” Then find a counterexample in the data set itself!

(d) What statement are we struggling to make here about the relationship between the sub-populations of Net Weights and their Count?
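The search for counterexamples in part 2.1(a) can be automated: list every pair of bags in which the bag with the higher Count is not the heavier one. The sketch below uses a handful of made-up (Count, NetWt) pairs; the real values are the ones printed in Lecture Notes 2.

```python
# Hypothetical (Count, NetWt) pairs; the real values are in Lecture Notes 2.
bags = [(17, 46.2), (18, 45.8), (18, 48.9), (19, 47.5), (17, 44.9)]

# A counterexample is a pair where the bag with more M&M's is NOT heavier.
counterexamples = [
    (a, b)
    for a in bags
    for b in bags
    if a[0] > b[0] and a[1] <= b[1]
]

for more, fewer in counterexamples:
    print(f"Count {more[0]} weighs {more[1]}, but Count {fewer[0]} weighs {fewer[1]}")
```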

2.2 Putting together the BigMM SAS program and the following Proc Reg routine, we can create a SAS program that computes the sample slope, the sample intercept, and the root mean square error for each of the 8 groups of bags of M&M's (there are 24 bags per group), outputs those statistics to a SAS file, and prints the file.

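               proc reg outest=LTatum;        /* save the estimates to LTatum */
                  model NetWt=Count;
                  by Group;                   /* one regression per group; data must be sorted by Group */
               run; quit;
               proc print data=LTatum; run;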

The Proc Reg option "outest=LTatum" instructs SAS to save the regression statistics (or "estimates") into a SAS file named "LTatum." The output is shown below in case you have difficulties with SAS, but I would be delighted if you are able to produce it yourself! The sample slopes are in the Count column.

Obs    Group    _MODEL_    _TYPE_    _DEPVAR_     _RMSE_    Intercept     Count      Wt

 1        2     MODEL1     PARMS      NetWt      1.52202     25.2154     1.28176     -1
 2        3     MODEL1     PARMS      NetWt      0.94023     27.2769     1.16531     -1
 3        5     MODEL1     PARMS      NetWt      0.96081     17.6571     1.65238     -1
 4        6     MODEL1     PARMS      NetWt      1.01435     19.1121     1.59741     -1
 5        7     MODEL1     PARMS      NetWt      1.53226     26.1459     1.22875     -1
 6        8     MODEL1     PARMS      NetWt      1.09972     28.7744     1.11778     -1
 7        9     MODEL1     PARMS      NetWt      0.99709     22.1760     1.42708     -1
 8       10     MODEL1     PARMS      NetWt      1.10568     26.5912     1.18456     -1

(a) You now have 8 different sample slopes, or 8 different values for b1. These can be viewed as 8 values drawn from what population? (Hint: You need The Story of Many Possible Samples.)
(b) Imagine that for our production run of 10,000 bags of Peanut M&M's that we regressed the 10,000 net weights on their respective 10,000 counts. What would we call the resulting intercept and slope? Show the answer in words and Greek letters.
(c) Using The Story of Many Possible Samples, explain what it would mean to say that b1 is an unbiased estimator.
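The Story of Many Possible Samples can be acted out by simulation: fix a population line, draw many samples of 24 bags, and compute the sample slope for each one. The sketch below is illustrative only. The population intercept, slope, noise level, and range of counts are hypothetical choices, not values taken from the M&M data.

```python
import random

random.seed(0)

# Hypothetical population line and noise level (not estimates from the M&M data).
BETA0, BETA1, SIGMA = 25.0, 1.25, 1.0

def sample_slope(n=24):
    """Draw one sample of n bags and return its least-squares slope b1."""
    counts = [random.randint(16, 24) for _ in range(n)]
    weights = [BETA0 + BETA1 * c + random.gauss(0.0, SIGMA) for c in counts]
    cbar = sum(counts) / n
    wbar = sum(weights) / n
    sxy = sum((c - cbar) * (w - wbar) for c, w in zip(counts, weights))
    sxx = sum((c - cbar) ** 2 for c in counts)
    return sxy / sxx

slopes = [sample_slope() for _ in range(5000)]
mean_slope = sum(slopes) / len(slopes)

print("first 8 sample slopes:", [round(s, 3) for s in slopes[:8]])
print("average of 5000 sample slopes:", round(mean_slope, 3), "vs BETA1 =", BETA1)
```

Individual slopes differ from sample to sample, as the 8 group slopes in the table do, while their long-run average settles near the population slope.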

2.3 Refer to the SAS output on page 12, for the regression using all 192 bags.
(a) Compute the value for count=18.
(b) What is estimated by b1?
(c) How is the value related to ?

Expected Value and Variance Review Questions

2.4 For a roll of a fair die with 4 sides, numbered 1 to 4, find the expected value and the variance.
2.5 Find the probability distribution for the average of two rolls of a fair die with four sides. Then, compute the expected value and variance of the average from the distribution.
2.6 How were the answers to question 2.5 related to those of question 2.4?
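Hand calculations for questions 2.4 to 2.6 can be checked by enumerating the outcomes exactly. The sketch below uses Python's `fractions` module so that no rounding occurs; `mean_var` is an illustrative helper written for this example.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

faces = [1, 2, 3, 4]
p = Fraction(1, len(faces))          # each face is equally likely

def mean_var(dist):
    """Expected value and variance of a {value: probability} distribution."""
    mu = sum(v * pr for v, pr in dist.items())
    var = sum((v - mu) ** 2 * pr for v, pr in dist.items())
    return mu, var

one_roll = {Fraction(f): p for f in faces}

# Distribution of the average of two independent rolls: enumerate all 16 outcomes.
avg_two = defaultdict(Fraction)
for a, b in product(faces, repeat=2):
    avg_two[Fraction(a + b, 2)] += p * p

print("one roll:", mean_var(one_roll))
print("average of two rolls:", mean_var(avg_two))
```

Comparing the two printed pairs is the point of question 2.6: for independent rolls, the average keeps the same expected value while its variance is the single-roll variance divided by the number of rolls.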

2.7 Generic Calculus Questions; warming up to least squares: Find the derivative with respect to x of the following functions:

(a) y = x^2
(b) y = (4x + 3)^2
(c) y = (-3x^2 + x)
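The reason these derivatives are a warm-up for least squares is the chain rule applied to a squared linear expression. As a sketch, with a and b denoting generic constants and b_0, b_1 the candidate intercept and slope:

```latex
\frac{d}{dx}\,(ax+b)^2 = 2a\,(ax+b),
\qquad
\frac{\partial}{\partial b_1}\sum_{i=1}^{n}\bigl(y_i - b_0 - b_1 x_i\bigr)^2
  = -2\sum_{i=1}^{n} x_i\,\bigl(y_i - b_0 - b_1 x_i\bigr).
```

Setting the second expression (and the matching derivative with respect to b_0) equal to zero gives the equations that the least-squares estimates must satisfy.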

2.8 The R function

     lm(y ~ x)

will regress y on x, and the function

     summary(lm(y ~ x))

produces output similar to the SAS regression output. For BigMM, see if you can get output with similar values as those given by SAS on page 16. Locate the estimate of the variance of epsilon.
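In regression output, the estimate of the variance of epsilon is the mean squared error, SSE/(n - 2) for simple regression. SAS reports its square root as "Root MSE" and R reports it as "Residual standard error", so squaring either value gives the variance estimate. The sketch below computes these quantities from scratch; the eight data points are hypothetical, not the BigMM values.

```python
import math

# Hypothetical (Count, NetWt) data; substitute the BigMM values.
count = [17, 18, 18, 19, 20, 20, 21, 22]
netwt = [46.5, 47.9, 48.6, 49.1, 50.8, 51.5, 52.0, 53.9]
n = len(count)

xbar = sum(count) / n
ybar = sum(netwt) / n
sxx = sum((x - xbar) ** 2 for x in count)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(count, netwt))

b1 = sxy / sxx                 # sample slope
b0 = ybar - b1 * xbar          # sample intercept

residuals = [y - (b0 + b1 * x) for x, y in zip(count, netwt)]
sse = sum(e ** 2 for e in residuals)
mse = sse / (n - 2)            # estimate of the variance of epsilon
root_mse = math.sqrt(mse)      # SAS "Root MSE"; R "Residual standard error"

print(f"b0={b0:.4f}  b1={b1:.4f}  MSE={mse:.4f}  Root MSE={root_mse:.4f}")
```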