R Statistics Help

Question 13.

metro <- read.csv("MetroMedian.csv", header = TRUE)
This reads the file from the working directory and stores it in the data frame “metro”. Setting header = TRUE keeps the variable names from the first row of the file.

The “reshape” package is installed so that its melt() function can be used to create molten (long-format) data from the data frame just read. The “data.table” package is also loaded because it makes the data frame easier to manipulate. Load data.table after the “reshape” package so that data.table's melt() masks the reshape version.

tidyMetro <- melt(metro,id.vars=c("RegionName","State","SizeRank"),variable.name="date",na.rm=TRUE)

Use the melt function from the data.table library to convert the data frame from wide to long format. Its inputs are the object to be converted, the identifier columns (id.vars) and the name given to the new variable column (variable.name); na.rm = TRUE drops the empty cells.

Use the R function mean() to average the values selected from the tidyMetro data frame where State == "NY".
regionMean <- function(valueFrame,searchRegion) {


The function above is stored in the variable regionMean. Its inputs are the data frame (valueFrame) and the searchRegion. Inside the function, the R function mean() is applied to the values of the variable of interest in the rows where the state name matches the searchRegion passed to the function.
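The body of regionMean is cut off in the listing above. The following is only a minimal sketch of how it might be completed from the description, assuming the melted column keeps melt's default name value and the state column is called State. The toy data frame and its numbers are made up for illustration.

```r
# Toy long-format data standing in for tidyMetro (values are invented)
toyMetro <- data.frame(
  RegionName = c("New York", "Buffalo", "Los Angeles"),
  State      = c("NY", "NY", "CA"),
  value      = c(400, 100, 500)
)

# Hypothetical completion: average `value` over the rows whose State matches
regionMean <- function(valueFrame, searchRegion) {
  mean(valueFrame$value[valueFrame$State == searchRegion], na.rm = TRUE)
}

nyMean <- regionMean(toyMetro, "NY")
print(nyMean)  # 250
```

Calling regionMean(tidyMetro, "NY") on the real data would work the same way, provided the column names match these assumptions.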
Question 14.

beaches <- read.csv("BeachWaterQuality.csv", header = TRUE)
Because the data is stored as a .csv file (a plain-text format that Excel can open), use the function read.csv() to read it, taking the column names as the variable names.
beaches$Results[is.na(beaches$Results)] <- 0
Select the Results variable from the beaches data frame and, wherever it is NA, assign 0 to the empty value.
Check the format of the data frame.
new.Date <- strptime(as.character(beaches$Sample.Date),"%m/%d/%Y")
Create a variable new.Date in R's date-time format using the R function strptime(). as.character() converts the date variable to character type so that its format can be parsed. The format "%m/%d/%Y" says that the dates in the file are written as month/day/year, with the year in four digits. In this file the month and day do not contain a leading 0, which %m and %d still accept.

beaches$new.date <- new.Date
Add the newly created variable new.Date to the beaches data frame under the column name new.date.
beachPlot <- function(beachData, beachName, sampleLocation) {
  beaches2 <- subset(beachData, Beach.Name == beachName & Sample.Location == sampleLocation)
  plot(beaches2$new.date, beaches2$Results, ylab = "Bacterial Count", xlab = "Date", main = c('Bacterial count for', beachName, sampleLocation))
  # Connect the points in date order; xlim/ylim are dropped because lines() does not use them
  idx <- order(beaches2$new.date)
  lines(beaches2$new.date[idx], beaches2$Results[idx], col = "red")
}

Create a function named beachPlot. Its inputs are the data frame, beachName and sampleLocation. The function uses these inputs to subset the input data frame, keeping only the rows that match the given beach name and sample location, and stores the result in the beaches2 data frame. The plot() function then creates the plot from the x-axis variable, the y-axis variable, the y-axis label, the x-axis label and the main title, which is entered as a vector so that it can include the function inputs.
lines() then adds a red line for Results against date, with the points connected in date order.
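The cleaning steps for this question can be tried on a small made-up data frame before running them on the real file. This is only a sketch: the three rows are invented, and as.Date() is swapped in for strptime() on the assumption that only the calendar date matters, since a Date column is simple to store and sort inside a data frame.

```r
# Invented sample rows using the same column names as the beaches data
toyBeaches <- data.frame(
  Sample.Date = c("6/5/2015", "12/25/2015", "7/14/2015"),
  Results     = c(12, NA, 30),
  stringsAsFactors = FALSE
)

# Replace missing results with 0, as in the solution above
toyBeaches$Results[is.na(toyBeaches$Results)] <- 0

# Parse month/day/four-digit-year text; single-digit months and days are accepted
toyBeaches$new.date <- as.Date(toyBeaches$Sample.Date, format = "%m/%d/%Y")

# Sort by date so that a line drawn through the points runs left to right
toyBeaches <- toyBeaches[order(toyBeaches$new.date), ]
print(toyBeaches)
```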

Question 15.

mileage <- read.csv("Insight (3).csv", header = TRUE)
The data set in the directory is stored as a .csv file named Insight, so use the read.csv() function to read the file and store the variables in a data frame called mileage.
Check the first rows of the data frame using the head() function.

plot(MPG~Avg.Temp, data = mileage, ylab="",xlab="Average Temperature",main="MPG against Avg.Temp and Car Said",col="blue")

The plot() function creates the plot. The tilde means y ~ x, and the values for the x and y axes are taken from the mileage data frame. Leave the y-axis label empty because a second series will be added afterwards. Label the x-axis because both variables are plotted against the same x-variable. Add a title using main="" and set the colour of this plot to blue.

abline(lm(MPG~Avg.Temp,data = mileage),col="blue")
Add a trend line to the plot. The line is the line of best fit from a linear model of the response variable (MPG) on the predictor (Avg.Temp), with both variables taken from the mileage data frame. Set the color of the trend line to blue using col="blue".


par(new = TRUE)
par() is an R function that sets graphical parameters. Setting new = TRUE tells R not to clear the current plot, so the next plot is drawn on top of the existing one.
plot(Car.Said~Avg.Temp, data=mileage,col="red",ylab="MPG/Car Said",xlab="",axes=FALSE)
Use the plot() function to draw the second set of points over the existing plot. Set the color to red and add the y-axis label. axes = FALSE suppresses the axes, so the second plot does not redraw them.

e. Adding a Red Line That Fits the Linear Model

abline(lm(Car.Said~Avg.Temp,data = mileage),col="red")
Create a linear model of Car.Said on Avg.Temp and add its trend line for the second plot. Set the colour to red.

legend("topleft",legend=c("Measured MPG","Car Reported MPG"),


Add a legend to the plot and place it at the top left. The legend labels are “Measured MPG” and “Car Reported MPG”; pch sets the plotting symbol, and the entries are colored “blue” and “red” respectively, matching the two plots.
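One thing to watch with par(new = TRUE) is that the second plot() call chooses its own y-axis range, so the red and blue points are not necessarily drawn on the same scale. The sketch below shows one alternative, not the method used in the solution: it fixes a shared ylim and adds the second series with points(). The data is simulated, and the legend arguments are an assumption because the original legend() call is cut off.

```r
set.seed(1)
# Simulated stand-in for the mileage data (values are invented)
toyMileage <- data.frame(Avg.Temp = seq(20, 90, by = 5))
toyMileage$MPG      <- 35 + 0.15 * toyMileage$Avg.Temp + rnorm(nrow(toyMileage))
toyMileage$Car.Said <- toyMileage$MPG + 1.5

fitMeasured <- lm(MPG ~ Avg.Temp, data = toyMileage)
fitReported <- lm(Car.Said ~ Avg.Temp, data = toyMileage)

pdf(NULL)  # null device so the sketch runs without opening a window
yRange <- range(toyMileage$MPG, toyMileage$Car.Said)  # one scale for both series
plot(MPG ~ Avg.Temp, data = toyMileage, col = "blue", ylim = yRange,
     xlab = "Average Temperature", ylab = "MPG / Car Said",
     main = "MPG against Avg.Temp and Car Said")
abline(fitMeasured, col = "blue")
points(Car.Said ~ Avg.Temp, data = toyMileage, col = "red")
abline(fitReported, col = "red")
legend("topleft", legend = c("Measured MPG", "Car Reported MPG"),
       col = c("blue", "red"), pch = 1, lty = 1)
invisible(dev.off())
```

Remove the pdf(NULL) and dev.off() lines to see the plot on screen.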

Guidelines for the Empirical Analysis

Analysis in R or STATA

  1. Plot the cross-sectional average of deposits/assets and Non-Deposit Debt/assets across time.
    You can calculate Non-Deposit Debt = Assets – Deposits – Equity. How have the averages
    evolved? How would you interpret your results?
  2. Run OLS regressions of quarterly loan growth on non-deposit debt/assets (one-quarter lagged value), controlling for bank size (i.e. the one-quarter lagged natural logarithm of total assets) and
    profitability (i.e. the one-quarter lagged return on assets), for the sub-sample of your data during the
    financial crisis (i.e. 2008Q1 – 2010Q1). You should have bank and time fixed effects in your
    regression. What is the sign and magnitude of the coefficient on non-deposit debt/assets? Is the
    coefficient significant? How will you interpret the coefficient? Justify your findings.
  3. Compute two measures / (ex-post) proxies for bank risk
    a. Risk-weighted assets divided by total assets
    b. Non-performing loans divided by total loans
  4. Plot the cross-sectional average of the above two measures across time. How have the averages
    evolved across years? How would you interpret your results?
  5. Run OLS regressions of the two ex-post measures of bank risk on equity over assets (one-quarter
    lagged values). Control for bank size (i.e. the one-quarter lagged natural logarithm of total
    assets) and profitability (i.e. the one-quarter lagged return on assets) on the entire sample. You
    should have bank and time fixed effects in your regression. What is the sign and magnitude of
    the coefficient on equity/assets? Is the coefficient significant? How will you interpret the coefficient? Justify your findings.
    Note: Make sure you winsorize all your variables (per quarter at the 1st and 99th percentile) to
    remove outliers.
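The steps in these guidelines can be sketched in base R. Everything below is an illustration under stated assumptions: the panel is simulated, the column names (bank, quarter, assets, deposits, equity, roa, loanGrowth) are hypothetical, and the fixed effects are entered as factor() dummies in lm(), which is one common way to obtain them. Restricting the sample to 2008Q1 – 2010Q1 would be a subset() step before fitting.

```r
set.seed(42)
banks <- 20
quarters <- 12
# Simulated bank-quarter panel, sorted by bank then quarter
panel <- expand.grid(bank = 1:banks, quarter = 1:quarters)
panel <- panel[order(panel$bank, panel$quarter), ]
n <- nrow(panel)
panel$assets     <- exp(rnorm(n, 10, 1))
panel$deposits   <- panel$assets * runif(n, 0.5, 0.7)
panel$equity     <- panel$assets * runif(n, 0.05, 0.15)
panel$roa        <- rnorm(n, 0.01, 0.005)
panel$loanGrowth <- rnorm(n, 0.02, 0.05)

# Non-Deposit Debt = Assets - Deposits - Equity, scaled by assets
panel$ndd_assets <- (panel$assets - panel$deposits - panel$equity) / panel$assets
panel$ln_assets  <- log(panel$assets)

# Winsorize within each quarter at the 1st and 99th percentiles
winsorize <- function(x, lower = 0.01, upper = 0.99) {
  q <- unname(quantile(x, c(lower, upper), na.rm = TRUE))
  pmin(pmax(x, q[1]), q[2])
}
for (v in c("ndd_assets", "ln_assets", "roa", "loanGrowth")) {
  panel[[v]] <- ave(panel[[v]], panel$quarter, FUN = winsorize)
}

# Cross-sectional average per quarter (the series to plot over time)
avgByQuarter <- aggregate(ndd_assets ~ quarter, data = panel, FUN = mean)

# One-quarter lag within each bank
lag1 <- function(x) c(NA, head(x, -1))
for (v in c("ndd_assets", "ln_assets", "roa")) {
  panel[[paste0("L_", v)]] <- ave(panel[[v]], panel$bank, FUN = lag1)
}

# OLS with bank and time fixed effects entered as dummy variables
fit <- lm(loanGrowth ~ L_ndd_assets + L_ln_assets + L_roa +
            factor(bank) + factor(quarter), data = panel)
print(coef(summary(fit))["L_ndd_assets", ])
```

Because the data is random, the printed coefficient has no economic meaning; the point is the order of the steps (construct ratios, winsorize per quarter, lag within bank, then regress).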

Solve using Excel & Minitab (Do not use formula)

The following questions are probability and statistics problems that were previously solved by our statistics experts using Minitab. If you are a student looking for help with similar questions, you can contact us through our MyMathLab homework help service, and we will provide similar solutions using either the Excel data analysis tools or the latest Minitab software. The solution to each question is attached for your confirmation.

  1. A die is tossed 3 times. What is the probability of
    (a) No fives turning up?
    (b) 1 five?
    (c) 3 fives?
    Probability Solution for Question 1
    Output for Question 1 (p = 1/6 entered as 0.17):

    Probability Density Function
    Binomial with n = 3 and p = 0.17

    x   P( X = x )
    0   0.571787    (a) no fives
    1   0.351339    (b) 1 five
    3   0.004913    (c) 3 fives
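The Minitab output uses p = 0.17, which is 1/6 rounded to two decimals. As a cross-check, the sketch below computes the same three probabilities in R with dbinom(), once with the rounded p and once with the exact 1/6; the exact values differ slightly from the output shown.

```r
# Number of fives in 3 tosses: Binomial(n = 3, p)
x <- c(0, 1, 3)

rounded <- dbinom(x, size = 3, prob = 0.17)   # matches the Minitab output
exact   <- dbinom(x, size = 3, prob = 1 / 6)  # unrounded probability of a five

print(round(rbind(rounded, exact), 6))
# exact values are 125/216, 75/216 and 1/216
```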
  2. Hospital records show that of patients suffering from a certain disease, 75% die of it. What is the probability that of 6 randomly selected patients, 4 will recover?

    Question 2
    Probability of 4 recoveries
    Probability Density Function
    Binomial with n = 6 and p = 0.25
    x P( X = x )
    4 0.0329590
  3. The ratio of boys to girls at birth in Singapore is quite high at 1.09:1.
    What proportion of Singapore families with exactly 6 children will have at least 3 boys? (Ignore the probability of multiple births.)

    Question 3
    Probability of at least 3 boys
    Cumulative Distribution Function
    Binomial with n = 6 and p = 0.5219
    x P( X ≤ x )
    2 0.303638
    P(X >= 3) = 1 - P(X <= 2) = 1 - 0.3036 = 0.6964 (with the unrounded p = 1.09/2.09, about 0.5215, the result is about 0.6957)
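Questions 2 and 3 can be cross-checked in R. This is a sketch using dbinom() and pbinom(); for Question 3 it assumes the probability of a boy is taken directly from the ratio as 1.09 / (1.09 + 1) rather than the 0.5219 entered in Minitab.

```r
# Question 2: recovery probability is 1 - 0.75 = 0.25; exactly 4 of 6 recover
q2 <- dbinom(4, size = 6, prob = 0.25)

# Question 3: P(at least 3 boys) = 1 - P(X <= 2)
pBoy <- 1.09 / (1.09 + 1)
q3 <- pbinom(2, size = 6, prob = pBoy, lower.tail = FALSE)

print(round(c(question2 = q2, question3 = q3), 4))
```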
  4. A manufacturer of metal pistons finds that on the average, 12% of his pistons are rejected because they are either oversize or undersize. What is the probability that a batch of 10 pistons will contain
    (a) no more than 2 rejects? (b) at least 2 rejects?

    Question 4

    a) Probability of not more than 2
    Cumulative Distribution Function
    Binomial with n = 10 and p = 0.12
    x P( X ≤ x )
    2 0.891318

    b) Probability of at least 2 = 1 - P(X <= 1)
    Cumulative Distribution Function
    Binomial with n = 10 and p = 0.12
    x P( X ≤ x )
    1 0.658275
    P(X >= 2) = 1 - 0.6583 = 0.3417
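The same two results for Question 4 can be obtained in R with pbinom(); the sketch below is a cross-check of the Minitab output, not a replacement for it.

```r
# Rejects in a batch of 10 pistons: Binomial(n = 10, p = 0.12)
q4a <- pbinom(2, size = 10, prob = 0.12)                      # no more than 2 rejects
q4b <- pbinom(1, size = 10, prob = 0.12, lower.tail = FALSE)  # at least 2 rejects

print(round(c(a = q4a, b = q4b), 4))
```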
  5. A die is rolled 240 times. Find the mean, variance and standard deviation for the number of 3s that will be rolled.
  6. If there are 200 typographical errors randomly distributed in a 500 page manuscript, find the probability that a given page contains exactly 3 errors.

    Question 6

    exactly 3 errors.
    Results for: Q6.MTW
    Probability Density Function
    Poisson with mean = 0.4
    x P( X = x )
    3 0.0071501
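No output is shown for Question 5, so the sketch below works it from the standard binomial formulas (mean np, variance np(1 - p)) and also reproduces the Poisson result for Question 6, where the mean of 0.4 comes from 200 errors spread over 500 pages.

```r
# Question 5: number of 3s in 240 rolls is Binomial(n = 240, p = 1/6)
n <- 240
p <- 1 / 6
q5 <- c(mean = n * p, variance = n * p * (1 - p), sd = sqrt(n * p * (1 - p)))
print(round(q5, 4))

# Question 6: errors per page are Poisson with mean 200 / 500 = 0.4
lambda <- 200 / 500
q6 <- dpois(3, lambda)
print(round(q6, 7))
```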

  7. A sales firm receives on average 3 calls per hour on its toll-free number. For any given hour, find the probability that it will receive a. At most 3 calls; b. At least 3 calls; and c. Five or more calls.

Question 7

a. At most 3 calls
P(X <= 3)
Cumulative Distribution Function
Poisson with mean = 3
x P( X ≤ x )
3 0.647232
b. At least 3 calls = 1 - P(X <= 2)
Cumulative Distribution Function
Poisson with mean = 3
x P( X ≤ x )
2 0.423190
P(X >= 3) = 1 - 0.42319 = 0.5768
c. Probability of five or more calls
= 1 - P(X <= 4)
Cumulative Distribution Function
Poisson with mean = 3
x P( X ≤ x )
4 0.815263
P(X >= 5) = 1 - 0.815263 = 0.1847
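The three parts of Question 7 follow the same pattern in R: ppois() gives the cumulative probability, and lower.tail = FALSE gives the complement directly. A short sketch:

```r
lambda <- 3  # average calls per hour

q7a <- ppois(3, lambda)                      # at most 3 calls
q7b <- ppois(2, lambda, lower.tail = FALSE)  # at least 3 calls = 1 - P(X <= 2)
q7c <- ppois(4, lambda, lower.tail = FALSE)  # five or more calls = 1 - P(X <= 4)

print(round(c(a = q7a, b = q7b, c = q7c), 4))
```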

  8. A life insurance salesman sells on the average 3 life insurance policies per week. Calculate the probability that in a given week he will sell
    a. Some policies
    b. 2 or more policies but less than 5 policies.
    c. Assuming that there are 5 working days per week, what is the probability that in a given day he will sell one policy?

A solution to this probability question has been prepared by our experts; you may contact us if you need help with this question.
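For readers who want to attempt this question themselves, one possible setup is sketched below. It is not the experts' solution: it assumes weekly sales follow a Poisson distribution with mean 3, and that the daily mean is 3/5 when the week has 5 working days.

```r
weekly <- 3        # assumed Poisson mean per week
daily  <- 3 / 5    # assumed Poisson mean per working day

q8a <- 1 - dpois(0, weekly)                  # some policies: P(X >= 1)
q8b <- ppois(4, weekly) - ppois(1, weekly)   # 2 or more but fewer than 5: P(2 <= X <= 4)
q8c <- dpois(1, daily)                       # exactly one policy in a given day

print(round(c(a = q8a, b = q8b, c = q8c), 4))
```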

  9. Twenty sheets of aluminum alloy were examined for surface flaws. The frequency of the number of sheets with a given number of flaws per sheet was as follows:

Number of flaws   Number of sheets
0 4
1 3
2 5
3 2
4 4
5 1
6 1
What is the probability of finding a sheet chosen at random which contains 3 or more surface flaws?
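This is an empirical probability: the number of sheets with 3 or more flaws divided by the total number of sheets. A minimal sketch using the frequencies from the table:

```r
flaws  <- 0:6
sheets <- c(4, 3, 5, 2, 4, 1, 1)  # frequencies from the table above

# Relative frequency of sheets with 3 or more flaws
pThreeOrMore <- sum(sheets[flaws >= 3]) / sum(sheets)
print(pThreeOrMore)  # 0.4
```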

  10. Find the area right of z = 1.11

You can solve this question using either the Excel data analysis tools or Minitab. In Excel, the function =NORM.S.DIST(1.11,TRUE) returns 0.8665, which is the area to the left; the area to the right is therefore 1 - 0.8665 = 0.1335.
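The R counterpart of this Excel function is pnorm(), shown here only as a cross-check of the figures above.

```r
leftArea  <- pnorm(1.11)                      # area to the left of z = 1.11
rightArea <- pnorm(1.11, lower.tail = FALSE)  # area to the right

print(round(c(left = leftArea, right = rightArea), 4))
```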

  11. Find the area left of z = -1.93
    A similar approach to the one above can be used for this question.
  12. Find the area within ±1, ±2, ±3, ±4, ±5 and ±6 standard deviations of the mean.
  13. Find the z value such that the area under the normal distribution curve between 0 and the z value is 0.2123
  14. A study on recycling shows that in a certain city, each household accumulates an average of 14 pounds of newspaper each month to be recycled. The standard deviation is 2 pounds. If a household is selected at random, find the probability it will accumulate the following:
    a. Between 13 and 17 pounds of newspaper for a month.
    b. More than 16.2 pounds of newspaper for one month.

Questions of this type have been solved many times under our MyMathLab homework help services.

  15. A standardized achievement test has a mean of 50 and a standard deviation of 10. The scores are normally distributed. If the test is administered to 800 selected people, approximately how many will score between 48 and 62?
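The remaining normal-distribution questions can all be set up with pnorm() and qnorm(). The sketch below is one way to do so and assumes, as the questions state or imply, that the quantities are normally distributed.

```r
# Area left of z = -1.93
leftOfZ <- pnorm(-1.93)

# Area within +/- k standard deviations of the mean, k = 1..6
k <- 1:6
withinK <- pnorm(k) - pnorm(-k)

# z such that the area between 0 and z is 0.2123, i.e. P(Z <= z) = 0.5 + 0.2123
zValue <- qnorm(0.5 + 0.2123)

# Recycling: mean 14 pounds, standard deviation 2 pounds
between13and17 <- pnorm(17, mean = 14, sd = 2) - pnorm(13, mean = 14, sd = 2)
moreThan16.2   <- pnorm(16.2, mean = 14, sd = 2, lower.tail = FALSE)

# Achievement test: expected count out of 800 scoring between 48 and 62
expectedCount <- 800 * (pnorm(62, mean = 50, sd = 10) - pnorm(48, mean = 50, sd = 10))

print(round(leftOfZ, 4))
print(round(withinK, 6))
print(round(zValue, 2))
print(round(c(between13and17, moreThan16.2), 4))
print(round(expectedCount))
```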

Hypothesis Testing Assignments

  1. The Florida Department of Labor and Employment Security reported the state mean annual wage was $26,133. A hypothesis test of wages by county can be conducted to see whether the mean annual wage for a particular county differs from the state mean.
    a. Formulate the hypothesis that can be used to determine whether the mean annual wage in Baker county differs from the state annual mean wage of $26,133.
    b. A sample of 550 people in Baker County showed a sample mean of $25,457. Assume a population standard deviation of $7600. What is the p-value? Use a significance level of 5%. What is your conclusion?
  2. Glow toothpaste maintains that their tubes have always contained a mean of 12 ounces. The production group believes that the mean weight has changed. The weight in ounces for a sample of 15 tubes of toothpaste had an average value of 12.09 ounces and a standard deviation of 0.20 ounces. Use an appropriate hypothesis test to determine if the data show evidence of change in the mean weight. Use a 90% confidence level.
  3. Enumerate the 36 possible outcomes from rolling a pair of dice, and compute the probability of rolling each of the numbers from 2 to 12.
  4. The Excel file contains mean temperatures for January and July and average annual precipitation for selected cities across the U.S. Construct 90% confidence intervals for the mean temperatures and precipitation.
  5. Based on a sample size of 100, a political candidate found that 59 people would vote for her in a two-person race. What is the 95% confidence interval for her expected proportion of the vote?
  6. The Excel file contains the list of all the 76 items McDonald's serves, classified as sandwiches, fries, chicken pieces, salads, breakfasts, and desserts/shakes. Each record contains the serving size, calories, fat, cholesterol, sodium, and carb contents. For the entire month of September 1-30, you ate one of the sandwiches picked randomly from the list for dinner. Now you are sick of eating sandwiches. You decided to eat a salad for the entire month of October 1-31, picked in the same way you did in September. Your task is to analyze the data, summarize your experience and compare the differences between September and October. You have in your possession some very powerful statistical tools:
    a. Descriptive Statistics;
    b. Sampling;
    c. Confidence Interval;
    d. Hypothesis Testing;
    e. Graphical display.
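Questions 1, 2, 3 and 5 above can be set up from the summary figures alone. The sketch below shows one way to do this in R; it assumes a two-sided z-test with known standard deviation for the Baker County wages, a two-sided one-sample t-test for the toothpaste tubes, and the usual normal-approximation interval for the proportion.

```r
# Question 1: two-sided z-test, H0: mu = 26133 vs H1: mu != 26133
zBaker <- (25457 - 26133) / (7600 / sqrt(550))
pBaker <- 2 * pnorm(-abs(zBaker))

# Question 2: two-sided t-test, H0: mu = 12 vs H1: mu != 12, df = 15 - 1
tGlow <- (12.09 - 12) / (0.20 / sqrt(15))
pGlow <- 2 * pt(-abs(tGlow), df = 14)

# Question 3: enumerate the 36 outcomes of two dice and tabulate the sums
sums <- outer(1:6, 1:6, "+")
diceProb <- table(sums) / 36

# Question 5: 95% confidence interval for a proportion of 59 out of 100
pHat <- 59 / 100
halfWidth <- qnorm(0.975) * sqrt(pHat * (1 - pHat) / 100)
ciVote <- c(lower = pHat - halfWidth, upper = pHat + halfWidth)

print(round(c(z = zBaker, p.value = pBaker), 4))  # p below 0.05: reject H0
print(round(c(t = tGlow, p.value = pGlow), 4))    # p above 0.10: do not reject H0
print(round(diceProb, 4))
print(round(ciVote, 4))
```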

BSB123 Data Analysis Assessment Item 2 Research Report (2017 S1)

The file: Birthweights.xlsx contains data on the following variables for a sample of 1000 births recorded in a large local hospital in 2015:

Variable               Description
Birthweight            Birthweight in grams
Gestation              Length of pregnancy in days
Smoke                  Whether the mother is a smoker or not
Pre-pregnancy weight   Mother’s pre-pregnancy weight in kilograms
Height                 Mother’s height in centimetres
Status                 Mother’s indigenous status
Age                    Mother’s age in years

Management at the hospital is interested in being able to better manage room allocations and bookings in their maternity ward. They are keen to identify mothers at risk of having low birth weight babies who may require additional hospital resources during their stay in the hospital.

The hospital has collected data for a number of previous births at the hospital. The data contains information on the variables outlined in the table above. As a consultant, they have approached you and asked if you could analyse this dataset.


Part 1 - Analysis (80%)

  1. Past records (2004) show that the average birthweight was 3500 grams. Test at 5% if the average birthweight in 2015 has increased with the improvement in general nutrition.
    (Include all six steps for hypothesis testing.) (2 marks)
  2. Perform a two-sample t-test for each of the following tasks. (Include all six steps for hypothesis testing in each.)
    (a) Determine if there is evidence that on average the weight of a baby of a mother who smokes is less than that of a mother who does not. (α = 5%) (2 marks)
    (b) Determine if being indigenous is a disadvantage in terms of birthweight. (α = 5%) (2 marks)
    The hospital management is particularly interested in whether you can develop a regression model to help them to predict the birthweight of a baby based on the variables in the data supplied. The model could then be used to predict birthweight to identify babies at risk in future.
  3. By using the forward stepwise method, develop a multiple regression model to predict the birthweight.

    Step 1: Gestation only
    Step 2: Gestation and Smoke
    Step 3: Gestation, Smoke and Pre-pregnancy Weight
    Step 4: Gestation, Smoke, Pre-pregnancy Weight and Height
    Step 5: Gestation, Smoke, Pre-pregnancy Weight, Height and Status
    Step 6: Gestation, Smoke, Pre-pregnancy Weight, Height, Status and Age
    (a) Interpret the regression coefficients of all six (6) independent variables in the model obtained in Step 6, and comment on the statistical significance of each. (3 marks)
    (b) Use Excel to obtain the correlation matrix for the following variables: Gestation, Pre-pregnancy Weight, Height, Age and Birthweight. Do you think multi-collinearity is a problem in the regression model? Are the correlation coefficients consistent with the regression coefficients obtained in the model in Step 6? Discuss briefly. (3 marks)
    (c) Focusing on Steps 3 and 4, discuss fully how the introduction of Height in Step 4 affects the regression coefficient of Pre-pregnancy Weight. (3 marks)
    (d) Based on the results in (a) to (c), explain which independent variables should be included or excluded to formulate the final model. State the final model. (2 marks)
    (e) Comment on the overall adequacy of the final model. (2 marks)
    (f) Consider an indigenous mother who is a smoker, 20 years of age, and 160cm tall with a pre-pregnancy weight of 58kg and gestational age of 267 days. What is the expected weight of the child, using the final model you have developed in (d)? (2 marks)
  4. Compute the difference in the average birthweight of babies of indigenous and non-indigenous mothers (called the birthweight difference, for simplicity). Discuss fully if there is any discrepancy between the regression coefficient of Status obtained in the regression model and the birthweight difference. (3 marks)
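The assessment asks for Excel output, but the same analysis can be cross-checked in R. The sketch below is only an outline on simulated data: the column names (Birthweight, Gestation, Smoke, PrePregWeight) and every number in it are assumptions, and only the first three forward steps are shown.

```r
set.seed(123)
n <- 1000
# Simulated stand-in for Birthweights.xlsx (all values invented)
births <- data.frame(
  Gestation     = round(rnorm(n, 275, 12)),
  Smoke         = rbinom(n, 1, 0.2),
  PrePregWeight = rnorm(n, 62, 9)
)
births$Birthweight <- 3500 + 12 * (births$Gestation - 275) -
  200 * births$Smoke + 5 * (births$PrePregWeight - 62) + rnorm(n, 0, 400)

# Task 1: one-sample, upper-tailed t-test against the 2004 mean of 3500 g
t1 <- t.test(births$Birthweight, mu = 3500, alternative = "greater")

# Task 2(a): is the mean for non-smokers (group 0) greater than for smokers (group 1)?
t2 <- t.test(Birthweight ~ Smoke, data = births, alternative = "greater")

# Task 3: fit the first three forward steps and compare adjusted R-squared
steps <- list(
  Birthweight ~ Gestation,
  Birthweight ~ Gestation + Smoke,
  Birthweight ~ Gestation + Smoke + PrePregWeight
)
fits  <- lapply(steps, lm, data = births)
adjR2 <- sapply(fits, function(f) summary(f)$adj.r.squared)
print(round(adjR2, 4))

# Task 3(f)-style prediction from the largest model fitted here
newMother <- data.frame(Gestation = 267, Smoke = 1, PrePregWeight = 58)
predicted <- predict(fits[[3]], newdata = newMother)
print(round(predicted))
```

With the real data, the remaining steps (Height, Status and Age) would be added to the steps list in the same way.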

    Part 2 – Report (20%)
    You are required to submit a concise report (word limit: 400) presenting any important features or relationships in the data. The content of your report should be based on, but not restricted to, insights gleaned from your analyses conducted in Part 1. (6 marks)


Part 1 - Analysis

• For presentation and ease of marking, it is advisable to include relevant Excel output in your answer to each question in this part instead of placing them in appendices.
• There is no word limit in Part 1.
Part 2 - Report
• The report is primarily based on the data provided. If, however, you wish to include, and refer to, additional information, you can use any referencing system as long as it is used consistently.
• You can include relevant charts and Excel objects in your report.
• Use 1 & ½ spacing and font size of 11.
• The word limit of 400 (with a tolerance of 10%) is exclusive of words in tables, appendices and reference list (if any).

• You should submit your response to both parts as a single pdf document saved in the format:
BSB123 Report_StudentName.pdf
• After uploading your research report, it is your responsibility to go back to the Assignment Upload page to check that your report was properly uploaded.
• Due: 11:59 pm 28 May 2017 (Sunday) via Blackboard

For any assistance with this project, contact MyMathLab homework Help