Dataset search and Analysis using R

For this assignment, find a dataset that contains at least 100 observations. A variety of repositories on the internet host large datasets. If you’re totally stuck, try https://data.world/

Once you have identified your dataset, determine an interesting plot you can make from it. This can be any kind of chart you want (scatter, line, pie, etc.) and can be built using base R or ggplot2, as you prefer.

Now build an R Markdown document with parameters that generates a report from your dataset and can be customized by setting those parameters. This follows the same basic approach as the beach water quality example.

For instance, the parameters might be a start date and an end date, with the plot limited to that date range. Or the parameter might be a state or a region that appears in the data file, with the plot showing data for that state or region only.
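
The date-range idea can be sketched in plain R as follows. In a parameterized R Markdown document the `params` list is built from the `params:` field of the YAML header; here it is created by hand so that the snippet runs on its own. The data frame, its column names, and the parameter names `start_date` and `end_date` are all made up for illustration.

```r
# Hypothetical stand-in for the dataset: 120 daily observations.
set.seed(1)
obs <- data.frame(
  date  = seq(as.Date("2023-01-01"), by = "day", length.out = 120),
  value = rnorm(120, mean = 50, sd = 10)
)

# In an R Markdown report, `params` comes from the YAML header.
# It is defined by hand here, with assumed parameter names.
params <- list(
  start_date = as.Date("2023-02-01"),
  end_date   = as.Date("2023-02-28")
)

# Keep only the rows inside the requested date range.
in_range    <- obs$date >= params$start_date & obs$date <= params$end_date
report_data <- obs[in_range, ]

# Plot the subset; the title records which parameters were used.
pdf(NULL)  # null device; remove this line to see the plot on screen
plot(report_data$date, report_data$value, type = "l",
     xlab = "Date", ylab = "Value",
     main = paste("Values from", params$start_date, "to", params$end_date))
invisible(dev.off())
```

Changing the two values in `params` is then enough to regenerate the plot for a different period, which is the behaviour the report parameters are meant to provide.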

R Statistics Help

Question 13.
a.

Read the file from the folder and store it in the data frame “metro”, with header=T so that the variable names in the file are retained.
b.
install.packages("reshape2")
install.packages("data.table")
library(reshape2)    # load reshape2 first
library(data.table)  # then load data.table

The “reshape2” package is installed so that its melt() function can be used to create molten (long-format) data from the table that was read. The “data.table” package is loaded so that the data frame can be transformed for easier manipulation. Load data.table after reshape2.

tidyMetro <- melt(metro,id.vars=c("RegionName","State","SizeRank"),variable.name="date",na.rm=TRUE)

Use the melt function from the data.table library to convert the data frame from wide to long format. Its inputs are the object to be converted, the identifier columns (id.vars) and the name of the new column that holds the former column headings (variable.name). Setting na.rm=TRUE drops the empty cells.
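
To make the wide-to-long step concrete, here is a small base-R sketch of the kind of transformation that melt() performs. It deliberately avoids both packages so that it runs anywhere, so it is an illustration of the idea rather than the melt() implementation. The toy data frame and its date column names are invented; only the three identifier columns are taken from the call above.

```r
# Toy wide table: one row per region, one column per month (made-up values).
metro <- data.frame(
  RegionName = c("Region A", "Region B"),
  State      = c("NY", "CA"),
  SizeRank   = 1:2,
  "2020-01"  = c(100, NA),
  "2020-02"  = c(110, 210),
  check.names = FALSE
)

id_cols   <- c("RegionName", "State", "SizeRank")
date_cols <- setdiff(names(metro), id_cols)

# Repeat the identifier columns once per date column and stack the values.
row_index <- rep(seq_len(nrow(metro)), times = length(date_cols))
tidyMetro <- data.frame(
  metro[row_index, id_cols],
  date  = rep(date_cols, each = nrow(metro)),
  value = unlist(metro[date_cols], use.names = FALSE)
)

# Equivalent of na.rm = TRUE: drop rows whose value is missing.
tidyMetro <- tidyMetro[!is.na(tidyMetro$value), ]
rownames(tidyMetro) <- NULL
print(tidyMetro)
```

Each remaining row holds one region, one date and one value, which is the shape the later mean() calls rely on.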
c.

mean(tidyMetro\$value[tidyMetro\$State=="NY"])
Use the R function mean() on the values selected from the tidyMetro data frame where State=="NY".
d.
regionMean <- function(valueFrame,searchRegion) {
  mean(valueFrame\$value[valueFrame\$RegionName==searchRegion])
}

The function above is stored in the variable regionMean. Its inputs are the data frame (valueFrame) and the region to search for (searchRegion). Inside the function, the R function mean() is applied to the values of the variable of interest in the rows where RegionName is the same as the searchRegion passed to the function.
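
A short usage sketch of the function, run against made-up long-format data. The region names and values are invented; only the column names match the ones used above.

```r
# Made-up long-format data with the same column names as tidyMetro.
toy <- data.frame(
  RegionName = c("Albany", "Albany", "Buffalo"),
  State      = c("NY", "NY", "NY"),
  value      = c(100, 200, 300)
)

regionMean <- function(valueFrame, searchRegion) {
  # Mean of `value` over the rows whose RegionName matches searchRegion.
  mean(valueFrame$value[valueFrame$RegionName == searchRegion])
}

regionMean(toy, "Albany")   # 150
regionMean(toy, "Nowhere")  # NaN: no rows match, so the mean is undefined
```

The second call shows that a misspelled region name does not raise an error, so it may be worth checking the result with is.nan() before using it.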
Question 14.
a.

Because the data is stored in .csv format, use the function read.csv() to read the file, using the column names as the variable names.
b.
beaches\$Results[is.na(beaches\$Results)] <- 0
Select the variable Results from the beaches data frame and, wherever it is NA, assign 0 to the empty value.
Then check the format of the data frame.
c.
new.Date <- strptime(as.character(beaches\$Sample.Date),"%m/%d/%Y")
Create a variable new.Date that holds R date-time (POSIXlt) values, using the R function strptime(). as.character() converts the date variable to character type so that its format can be parsed. The format string "%m/%d/%Y" says that the dates in the file are written as month/day/year, with the year in four digits. The month and day in the file do not have a leading 0.

beaches\$new.date <- new.Date
Add the newly created variable new.Date to the beaches data frame under the column name new.date.
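
The parsing step can be tried on a few made-up date strings before it is applied to the whole column. This is a minimal sketch: the sample strings are invented, and the time zone is fixed to UTC only so that the result does not depend on the machine it runs on.

```r
# Made-up sample dates in month/day/four-digit-year form.
raw_dates <- c("7/4/2016", "07/04/2016", "12/25/2015")

# strptime() returns POSIXlt values.
parsed <- strptime(raw_dates, format = "%m/%d/%Y", tz = "UTC")

# Leading zeros are optional on input, so the first two strings agree.
format(parsed, "%Y-%m-%d")

# A string that does not match the format becomes NA rather than an error.
is.na(strptime("2016-07-04", format = "%m/%d/%Y", tz = "UTC"))

# as.Date() is a lighter alternative when the time of day is not needed.
as.Date(raw_dates, format = "%m/%d/%Y")
```

Counting the NA values after parsing is a quick way to spot rows whose dates are in a different format.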
d.
beachPlot <- function(beachData,beachName,sampleLocation){
  # keep only the rows for the requested beach and sample location
  beaches2 <- subset(beachData, Beach.Name==beachName & Sample.Location==sampleLocation)
  plot(beaches2\$new.date,beaches2\$Results, ylab = "Bacterial Count", xlab = "Date",main=c('Bacterial count for', beachName,sampleLocation))
  # connect the points in date order
  dateOrder <- order(beaches2\$new.date)
  lines(beaches2\$new.date[dateOrder], beaches2\$Results[dateOrder],col="red")
}

Create a function and assign it the name beachPlot. The inputs to the function are the data frame, beachName and sampleLocation. The function uses these inputs to subset the input data frame, keeping the rows that match both the beach name and the sample location, and stores the result in beaches2. The plot() function then creates the plot from the x-axis variable, the y-axis variable, the y-axis label, the x-axis label and the main title, which is entered as a vector so that it can include the function inputs.
lines() then adds a red line for Results against date, with the points sorted by date. The line is drawn on the axes already set up by plot(), so it needs no axis limits of its own.

Question 15.
a.

The data set in the directory is stored in .csv format under the name Insight. Use the read.csv() function to read the data file and store the variables in a data frame called mileage.
Check the first rows of the data frame using the head() function.
b.

plot(MPG~Avg.Temp, data = mileage, ylab="",xlab="Average Temperature",main="MPG against Avg.Temp and Car Said",col="blue")

Use the plot() function to create the plot. The tilde means y~x, and the values for the x and y axes come from the mileage data frame. Leave the y-axis label empty because a second series will be added afterwards. Label the x-axis here because both variables are plotted against the same x variable. Add a title using main="" and set the colour for this plot to blue.
c.

abline(lm(MPG~Avg.Temp,data = mileage),col="blue")
Add a trend line to the plot. The line is the line of best fit from a linear model of the response (MPG) on the predictor (Avg.Temp), with both variables taken from the mileage data frame. Set the color of the trend line to blue using col="blue".

d.

par(new = TRUE)
par() is an R function that sets graphical parameters. Setting new=TRUE tells R not to clear the existing plot before the next high-level plotting call, so the new plot is drawn on top of it.
plot(Car.Said~Avg.Temp, data=mileage,col="red",ylab="MPG/Car Said",xlab="",axes=FALSE)
Use the plot() function to add the second series to the existing plot. Set the color to red and add the y-axis label. axes=FALSE suppresses the axes for this second plot, which is drawn on its own y scale.

e. Adding a red line that fits the linear model

abline(lm(Car.Said~Avg.Temp,data = mileage),col="red")
Create a linear model and add a trend line for the second plot. Set the colour to red.
par(new=FALSE)

legend("topleft",legend=c("Measured MPG","Car Reported MPG"),
       text.col=c("blue","red"),pch=c(16,16),col=c("blue","red"))

Add a legend to the plot and place it at the top left. The legend labels are “Measured MPG” and “Car Reported MPG”, and the text colors are blue and red. pch sets the plotting symbol (16 is a filled circle), and col colors the symbols blue and red respectively.
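
Because par(new = TRUE) draws the second plot on its own y scale, the two series above do not share a common axis. One alternative, sketched here as a suggestion rather than as the original answer, is to give the first plot a ylim that covers both series and add the second with points(). The numbers are made up; only the column names come from the code above.

```r
# Made-up data with the column names used above.
mileage <- data.frame(
  Avg.Temp = c(30, 40, 50, 60, 70, 80),
  MPG      = c(24.1, 25.0, 26.2, 27.1, 27.9, 29.0),
  Car.Said = c(25.5, 26.1, 27.8, 28.4, 29.9, 30.6)
)

# One y-range that covers both series, so they share a single axis.
y_range <- range(mileage$MPG, mileage$Car.Said)

pdf(NULL)  # null device; remove this line to see the plot on screen
plot(MPG ~ Avg.Temp, data = mileage, col = "blue", pch = 16,
     ylim = y_range, xlab = "Average Temperature", ylab = "MPG / Car Said",
     main = "MPG against Avg.Temp and Car Said")
points(Car.Said ~ Avg.Temp, data = mileage, col = "red", pch = 16)

# Fit both linear models once and reuse them for the trend lines.
fit_measured <- lm(MPG ~ Avg.Temp, data = mileage)
fit_reported <- lm(Car.Said ~ Avg.Temp, data = mileage)
abline(fit_measured, col = "blue")
abline(fit_reported, col = "red")

legend("topleft", legend = c("Measured MPG", "Car Reported MPG"),
       col = c("blue", "red"), pch = 16, text.col = c("blue", "red"))
invisible(dev.off())
```

With a shared axis, the vertical gap between the two trend lines can be read directly as the difference between the measured and the reported figures.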

Guidelines for the Empirical Analysis

Analysis in R or STATA

1. Plot the cross-sectional average of deposits/assets and non-deposit debt/assets across time. You can calculate Non-Deposit Debt = Assets – Deposits – Equity. How have the averages evolved? How would you interpret your results?
2. Run OLS regressions of quarterly loan growth on non-deposit debt/assets (one-quarter lagged value), controlling for bank size (i.e. the one-quarter lagged natural logarithm of total assets) and profitability (i.e. the one-quarter lagged return on assets), for the sub-sample of your data during the financial crisis (i.e. 2008Q1 – 2010Q1). You should have bank and time fixed effects in your regression. What is the sign and magnitude of the coefficient on non-deposit debt/assets? Is the coefficient significant? How will you interpret the coefficient? Justify your findings.
3. Compute two measures / (ex-post) proxies for bank risk:
a. Risk-weighted assets divided by total assets
b. Non-performing loans divided by total loans
4. Plot the cross-sectional average of the above two measures across time. How have the averages evolved across years? How would you interpret your results?
5. Run OLS regressions of the two ex-post measures of bank risk on equity over assets (one-quarter lagged values). Control for bank size (i.e. the one-quarter lagged natural logarithm of total assets) and profitability (i.e. the one-quarter lagged return on assets) on the entire sample. You should have bank and time fixed effects in your regression. What is the sign and magnitude of the coefficient on equity/assets? Is the coefficient significant? How will you interpret the coefficient? Justify your findings.
Note: Make sure you winsorize all your variables (per quarter, at the 1st and 99th percentiles) to remove outliers.
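
The mechanics of steps 1 and 2 and of the winsorizing note can be sketched in base R as follows. Everything here is an assumption made for illustration: the panel is simulated, the column names are invented, and the fixed effects are entered as dummy variables through factor(), which is only one of several ways to estimate such a model.

```r
set.seed(42)

# Simulated bank-quarter panel (all values are made up).
panel <- expand.grid(bank = paste0("bank", 1:20), quarter = 1:12)
n <- nrow(panel)
panel$assets      <- exp(rnorm(n, mean = 10, sd = 1))
panel$deposits    <- panel$assets * runif(n, 0.5, 0.7)
panel$equity      <- panel$assets * runif(n, 0.05, 0.15)
panel$roa         <- rnorm(n, 0.01, 0.005)
panel$loan_growth <- rnorm(n, 0.02, 0.05)

# Non-deposit debt as defined in step 1, scaled by assets.
panel$ndd_assets <- (panel$assets - panel$deposits - panel$equity) / panel$assets
panel$log_assets <- log(panel$assets)

# Clamp a vector to its 1st and 99th percentiles.
winsorize <- function(x, lower = 0.01, upper = 0.99) {
  q <- quantile(x, probs = c(lower, upper), na.rm = TRUE, names = FALSE)
  pmin(pmax(x, q[1]), q[2])
}

# Winsorize within each quarter, as the note requires.
for (v in c("ndd_assets", "log_assets", "roa", "loan_growth")) {
  panel[[v]] <- ave(panel[[v]], panel$quarter, FUN = winsorize)
}

# One-quarter lag within each bank (rows sorted by bank, then quarter).
panel <- panel[order(panel$bank, panel$quarter), ]
lag1 <- function(x) c(NA, head(x, -1))
for (v in c("ndd_assets", "log_assets", "roa")) {
  panel[[paste0("lag_", v)]] <- ave(panel[[v]], panel$bank, FUN = lag1)
}

# Cross-sectional average per quarter (step 1).
avg_by_quarter <- aggregate(ndd_assets ~ quarter, data = panel, FUN = mean)

# OLS with bank and time fixed effects entered as dummy variables (step 2).
fit <- lm(loan_growth ~ lag_ndd_assets + lag_log_assets + lag_roa +
            factor(bank) + factor(quarter), data = panel)
coef(summary(fit))["lag_ndd_assets", ]
```

With real data, the crisis sub-sample in step 2 would be selected before calling lm(), and the lag should be checked against gaps in the quarterly sequence, since lag1() simply shifts rows within each bank.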

Solve using Excel & Minitab (Do not use formula)

The following probability and statistics questions were previously solved using Minitab. Similar solutions can be produced with Excel data analysis tools or with the latest Minitab software. The solution to each question is attached for your confirmation.

1. A die is tossed 3 times. What is the probability of
(a) No fives turning up?
(b) 1 five?
(c) 3 fives?
Probability Solution for Question 1
Output for Question 1:
Probability Density Function

Binomial with n = 3 and p = 0.17

x P( X = x )
0 0.571787
1 0.351339
3 0.004913

Note that the output uses p = 0.17, a rounding of the exact value 1/6 ≈ 0.1667, so these probabilities differ slightly from the exact ones.
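
As a cross-check in R, the same three probabilities can be computed with dbinom(). The Minitab output above was produced with p = 0.17; the sketch below uses the exact value 1/6 for a fair die and then repeats the calculation with 0.17 for comparison.

```r
p_five <- 1 / 6  # exact probability of a five on one toss of a fair die

# P(X = 0), P(X = 1), P(X = 3) for X ~ Binomial(3, 1/6)
exact <- dbinom(c(0, 1, 3), size = 3, prob = p_five)
round(exact, 6)

# The same call with p rounded to 0.17 reproduces the output above.
rounded <- dbinom(c(0, 1, 3), size = 3, prob = 0.17)
round(rounded, 6)
```

The exact values are 125/216, 75/216 and 1/216, so the rounding of p changes the answers in the third decimal place.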
2. Hospital records show that of patients suffering from a certain disease, 75% die of it. What is the probability that of 6 randomly selected patients, 4 will recover?

Question 2
Probability of 4 recoveries
Probability Density Function
Binomial with n = 6 and p = 0.25
x P( X = x )
4 0.0329590
3. The ratio of boys to girls at birth in Singapore is quite high at 1.09:1.
What proportion of Singapore families with exactly 6 children will have at least 3 boys? (Ignore the probability of multiple births.)
Question 3
Probability of at least 3 boys
Cumulative Distribution Function
Binomial with n = 6 and p = 0.5219
x P( X ≤ x )
2 0.303638
P(X ≥ 3) = 1 − P(X ≤ 2) = 1 − 0.3036 = 0.6964 (using the unrounded p = 1.09/2.09 ≈ 0.5215 gives about 0.6957)
4. A manufacturer of metal pistons finds that, on average, 12% of his pistons are rejected because they are either oversize or undersize. What is the probability that a batch of 10 pistons will contain
(a) no more than 2 rejects? (b) at least 2 rejects?

Question 4

a) Probability of not more than 2
Cumulative Distribution Function
Binomial with n = 10 and p = 0.12
x P( X ≤ x )
2 0.891318

b) probability of at least 2 = 1-P(x <= 1)
Cumulative Distribution Function
Binomial with n = 10 and p = 0.12
x P( X ≤ x )
1 0.658275
p(x>=2) = 1-0.6583 = 0.3417
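
Questions 2 to 4 can be cross-checked in R with dbinom() and pbinom(). This is a sketch that uses only the figures stated in the questions; for Question 3 it converts the 1.09:1 ratio to a probability without rounding, so its result can differ in the third or fourth decimal place from output produced with a rounded p.

```r
# Question 2: P(exactly 4 of 6 recover) with recovery probability 0.25.
q2 <- dbinom(4, size = 6, prob = 0.25)

# Question 3: P(at least 3 boys in 6 children).
p_boy <- 1.09 / (1.09 + 1)   # ratio 1.09:1 converted to a probability
q3 <- 1 - pbinom(2, size = 6, prob = p_boy)

# Question 4: 12% reject rate, batch of 10.
q4a <- pbinom(2, size = 10, prob = 0.12)       # no more than 2 rejects
q4b <- 1 - pbinom(1, size = 10, prob = 0.12)   # at least 2 rejects

round(c(q2 = q2, q3 = q3, q4a = q4a, q4b = q4b), 4)
```

In each "at least" case the complement is taken of the cumulative probability one step below the threshold, which mirrors the 1 - P(X <= k) lines in the Minitab solutions.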
5. A die is rolled 240 times. Find the mean, variance and standard deviation of the number of 3s that will be rolled.
6. If there are 200 typographical errors randomly distributed in a 500 page manuscript, find the probability that a given page contains exactly 3 errors.

Question 6

exactly 3 errors.
Results for: Q6.MTW
Probability Density Function
Poisson with mean = 0.4
x P( X = x )
3 0.0071501

7. A sales firm receives an average of 3 calls per hour on its toll-free number. For any given hour, find the probability that it will receive a. at most 3 calls; b. at least 3 calls; and c. five or more calls.

Question 7

a. At most 3 calls
P (X <= 3)
Cumulative Distribution Function
Poisson with mean = 3
x P( X ≤ x )
3 0.647232
b. At least 3 calls = 1-P(X<=2)
Cumulative Distribution Function
Poisson with mean = 3
x P( X ≤ x )
2 0.423190
p(X>=3) = 1-0.42319 = 0.5768
c. Probability of five or more calls
= 1- p(X<=4)
Cumulative Distribution Function
Poisson with mean = 3
x P( X ≤ x )
4 0.815263
p(x>=5) = 1-0.815263 = 0.1847
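
The Poisson results for Questions 6 and 7 can be cross-checked in R with dpois() and ppois(). The sketch uses only the rates given in the questions; lower.tail = FALSE asks ppois() for P(X > k) directly, which avoids the manual subtraction from 1.

```r
# Question 6: 200 errors over 500 pages gives a mean of 0.4 errors per page.
lambda_page <- 200 / 500
q6 <- dpois(3, lambda = lambda_page)             # exactly 3 errors on a page

# Question 7: calls arrive at a mean rate of 3 per hour.
q7a <- ppois(3, lambda = 3)                      # at most 3 calls
q7b <- ppois(2, lambda = 3, lower.tail = FALSE)  # at least 3 calls, P(X > 2)
q7c <- ppois(4, lambda = 3, lower.tail = FALSE)  # five or more calls, P(X > 4)

round(c(q6 = q6, q7a = q7a, q7b = q7b, q7c = q7c), 4)
```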

8. A life insurance salesman sells on average 3 life insurance policies per week. Calculate the probability that in a given week he will sell
a. Some policies
b. 2 or more policies but less than 5 policies.
c. Assuming that there are 5 working days per week, what is the probability that in a given day he will sell one policy?


9. Twenty sheets of aluminum alloy were examined for surface flaws. The frequency of the number of sheets with a given number of flaws per sheet was as follows:

Number of flaws   Frequency
0 4
1 3
2 5
3 2
4 4
5 1
6 1
What is the probability of finding a sheet chosen at random which contains 3 or more surface flaws?

10. Find the area to the right of z = 1.11

You can solve this question using either Excel data analysis tools or Minitab. In Excel, the function =NORM.S.DIST(1.11,TRUE) returns 0.8665, which is the area to the left of z. The area to the right is therefore 1-0.8665 = 0.1335.
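
The same calculation can be cross-checked in R, where pnorm() plays the role of NORM.S.DIST. This is a minimal sketch for the z value in the question.

```r
z <- 1.11

left_area  <- pnorm(z)                      # area to the left of z
right_area <- pnorm(z, lower.tail = FALSE)  # area to the right, no subtraction needed

round(c(left = left_area, right = right_area), 4)
```

Replacing z with another value, or passing a negative z to pnorm(), covers the other standard normal area questions in this set.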

11. Find the area to the left of z = -1.93
The same approach as above can be applied to this question.
12. Find the area within ±1, 2, 3, 4, 5 and 6 standard deviations of the mean.
13. Find the z value such that the area under the normal distribution curve between 0 and the z value is 0.2123
14. A study on recycling shows that in a certain city, each household accumulates an average of 14 pounds of newspaper each month to be recycled. The standard deviation is 2 pounds. If a household is selected at random, find the probability it will accumulate the following:
a. Between 13 and 17 pounds of newspaper for a month.
b. More than 16.2 pounds of newspaper for one month.


15. A standardized achievement test has a mean of 50 and a standard deviation of 10. The scores are normally distributed. If the test is administered to 800 selected people, approximately how many will score between 48 and 62?

Hypothesis Testing Assignments

1. The Florida Department of Labor and Employment Security reported the state mean annual wage was \$26,133. A hypothesis test of wages by county can be conducted to see whether the mean annual wage for a particular county differs from the state mean.
a. Formulate the hypotheses that can be used to determine whether the mean annual wage in Baker County differs from the state annual mean wage of \$26,133.
b. A sample of 550 people in Baker County showed a sample mean of \$25,457. Assume a population standard deviation of \$7600. What is the p-value? Use a significance level of 5%. What is your conclusion?
2. Glow toothpaste maintains that their tubes have always contained a mean of 12 ounces. The production group believes that the mean weight has changed. The weight in ounces for a sample of 15 tubes of toothpaste had an average value of 12.09 ounces and a standard deviation of 0.20 ounces. Use an appropriate hypothesis test to determine if the data show evidence of a change in the mean weight. Use a 90% confidence level.
3. Enumerate the 36 possible outcomes from rolling a pair of dice, and compute the probability of rolling each of the numbers from 2 to 12.
4. The Excel file contains mean temperatures for January and July and average annual precipitation for selected cities across the U.S. Construct 90% confidence intervals for the mean temperatures and precipitation.
5. Based on a sample of 100 people, a political candidate found that 59 would vote for her in a two-person race. What is the 95% confidence interval for her expected proportion of the vote?
6. The Excel file contains the list of all 76 items McDonald's serves, classified as sandwiches, fries, chicken pieces, salads, breakfasts, and desserts/shakes. Each record contains the serving size, calories, fat, cholesterol, sodium, and carb contents. For the entire month of September (1-30), you ate one of the sandwiches, picked randomly from the list, for dinner. Now you are sick of eating sandwiches. You decided to eat a salad for the entire month of October (1-31), picked in the same way as in September. Your task is to analyze the data, summarize your experience and compare the differences between September and October. You have in your possession some very powerful statistical tools:
a. Descriptive Statistics;
b. Sampling;
c. Confidence Interval;
d. Hypothesis Testing;
e. Graphical display.
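
For question 1(b) of this set, the known population standard deviation suggests a one-sample z test. A minimal sketch in R, using only the figures stated in the question and assuming a two-sided alternative because the question asks whether the county mean "differs" from the state mean:

```r
mu0   <- 26133   # state mean annual wage (null value)
xbar  <- 25457   # Baker County sample mean
sigma <- 7600    # assumed population standard deviation
n     <- 550     # sample size
alpha <- 0.05    # significance level

# Test statistic for H0: mu = mu0 against H1: mu != mu0.
z <- (xbar - mu0) / (sigma / sqrt(n))

# Two-sided p-value from the standard normal distribution.
p_value <- 2 * pnorm(-abs(z))

c(z = z, p_value = p_value)
p_value < alpha   # TRUE means H0 is rejected at the 5% level
```

For question 2, where only a sample standard deviation from 15 tubes is available, the same structure would apply with a t statistic and pt() in place of pnorm().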