
Overview

As part of continuing your work on your evidence-based project proposal, identify one to three outcomes of interest. That is, identify what you hope to change or improve through the implementation of your project (your project's dependent variable). For example, if you are designing an intervention to reduce hospital re-admission rates for patients with heart failure, the outcome is re-admission rates. Another example would be a mindfulness-based intervention for critical care nurses to reduce burnout; in this case, burnout is the outcome of interest.

Once you have identified the outcomes of interest in your project, you need to determine how each outcome will be measured. Consider the examples above. How might you measure re-admission rates? (For example, you might measure the percentage of patients with heart failure who are re-admitted with a diagnosis of heart failure within 90 days of being discharged.) How about measuring burnout in critical-care nurses? (Would you use the Maslach Burnout Inventory, or some other tool?) As you can see, there are different ways to measure outcomes.

The Maslach Burnout Inventory is an example of a measurement tool. Many tools such as this exist to measure a variety of phenomena such as resilience, moral distress, self-efficacy, and many others! These tools can include surveys or questionnaires that have been used in the literature to evaluate similar evidence-based practice projects. Many tools may be available to you depending on your topic. This assignment involves you searching the databases to learn about how your topic has been evaluated in the past.

To find tools and determine how outcomes can be measured, start by reading the literature. What tools are frequently used to assess the variables or outcomes of interest? Some are very commonly used. When you find a tool, you'll want to review the original primary source, the published (or revised and updated) book or article where the tool was first described. Evaluate how the tool was developed and whether it was found to be reliable and valid. It is very interesting to read instrument-development articles, so please do if you get the chance! Often the titles of these articles contain the terms "development and psychometric testing of the _ scale/tool." Please note that when actually conducting research, there are many considerations for the selection and use of measurement tools, including permission from the researcher to use the tool.

Assignment Instructions

For this assignment, select one to three outcomes and identify how each outcome will be measured. How many outcomes you have depends on your individual project.

As discussed, you might need to determine exactly how certain outcomes will be measured (such as re-admission rate). It is also possible that there is no tool available to measure an outcome of interest in your study. For example, if I wanted to assess “knowledge” of some topic, I would need to create questions to obtain data about this outcome. In either case, include the following information:

Clearly state the outcome and how, specifically, you will measure it.
Describe why you selected the measurement method and how you plan to use it in the project.

If you are able to identify an appropriate measurement tool that already exists (such as the Maslach Burnout Inventory) for an outcome of interest, include the following information:

The outcome and the name of the tool that will be used to measure the outcome.
A brief description of the tool. How many items are there? How are items scored? What do scores mean?
An explanation of why you selected this tool and how you plan to use it in your project.
The validity and reliability of the tool.

Here is an example:

Self-Efficacy. Self-efficacy will be measured using the General Self-Efficacy Scale (GSE) (Schwarzer & Jerusalem, 1995).

The scale was developed in 1979 and subsequently revised and adapted to 26 languages, and consists of 10 items, scored on a scale of 1 (not at all true) to 4 (exactly true), with a score range of 10 to 40 (where lower scores indicate lower self-efficacy and higher scores indicate higher self-efficacy) (The General Self-Efficacy Scale, n.d.).

In a sample of 747 early-career nurses, the scale had a Cronbach's alpha of 0.884 (Wang et al., 2017). This supports the scale's reliability in that sample (early-career nurses).

The GSE scale was selected because it offers a general overview of the concept of self-efficacy and is not specific to nursing practice. In the proposed study, the GSE scale will be given to participants before and after the intervention.

After you identify between two and five peer-reviewed tools, describe in a Microsoft Word document, in 300 to 500 words, why you have selected them and how you plan to use them in your proposal. Include the validity and reliability of the tools (as reported in journal articles). Submit the names of the tools along with your 300- to 500-word justification, and ensure that you use APA format.

Please refer to the Grading Rubric for details on how this activity will be graded.

The goal of this lab is to get you comfortable with the Rguroo point-and-click interface and with using the software to visualize and interpret data.

Part I. Eye Color Dataset

For this part of the lab, we will explore the graphical features of Rguroo using the dataset called HairEyeColor. This dataset can be found on Titanium. Download the dataset to your desktop. In Rguroo, in the left-hand column, select the Data dropdown, then select Data Import. Within Data Import, select Data Frame, then choose the file and select Upload.

Question #1 Once you have imported the data, double-click on the dataset name to display the raw data. Right-clicking on the dataset name reveals many features, one of which is the summary function. Using these features, answer the following questions.
(a) How many variables are there in this dataset?
(b) Are the variables quantitative or categorical?
(c) Specifically name one of the variables, and state what values it can take.
(d) How many cases are in this dataset?

Question #2 Now let's look at only the Eye color variable and obtain a barplot of its values. Do this by clicking on the Create Plot drop-down menu and selecting Barplot. First select the dataset by clicking the Select a Dataset drop-down menu and choosing HairEyeColor. Switch from the Numerical/Freq tab to the Categorical tab. Select the Factor 1 drop-down menu and click on Eye. Now click on the Relative Frequency selection. Fill in the Labels for the Title, X-Axis, and Y-Axis. Click on the eye icon to view the bar graph.
Copy the Barplot and paste it below.
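For reference, roughly the same plot can be produced outside Rguroo in base R, assuming the imported dataset has one row per person with an Eye column (the file name below is hypothetical):
hair.eye <- read.csv("HairEyeColor.csv", header = TRUE)  # hypothetical file name
eye.rel <- table(hair.eye$Eye) / nrow(hair.eye)          # relative frequencies
barplot(eye.rel, main = "Eye Color", xlab = "Eye Color", ylab = "Relative Frequency")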

Question #3 You can add the specific percentage of each category as well as other features by clicking on the Details tab. To add the specific percentages, go to Bar, Value Labels, Error Bars and select Add Value Labels. Press the eye icon to see the change.
Copy the new more detailed Barplot and paste it below.

Question #4 Based on your Barplot in the previous question, which category has the most people? Which has the least?

Question #5 We can also look at Eye color with Sex as a second factor. This allows us to visually compare the distribution of eye color for males and females. To do this, click on the Basics tab, select Sex in the Factor 2 box, and select the eye icon.
Copy the Barplot with eye color and gender and paste it below.
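Continuing the base R sketch above, a side-by-side barplot of eye color by sex could be drawn as follows (this assumes the dataset also has a Sex column):
barplot(table(hair.eye$Sex, hair.eye$Eye), beside = TRUE, legend.text = TRUE,
        main = "Eye Color by Sex", xlab = "Eye Color", ylab = "Frequency")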

Question #6 Which eye color is most prevalent for females, and which for males?

Statistics Homework in R Studio

  1. Write an R function called Cleaner that accepts a single vector of numbers that may contain NA entries and returns a vector where the NA’s have been replaced with -1.
  2. Write an R function that accepts three parameters: a lower bound, an upper bound, and an increment. Then use a repeat loop to generate a vector of the numbers from the lower bound to the upper bound by increment.
    For example, if my function was called counter
    answer <- counter(2,10,2)
    answer
    [1] 2 4 6 8 10
    i.e. the numbers from 2 to 10 in increments of 2
    answer <- counter(2,10,3)
    answer
    [1] 2 5 8
  3. Assuming I have three variables called lower, upper, and increment, how could I produce the same thing as number 2 with a single R statement that does not employ a loop?
  4. Write an R function that accepts two parameters: a vector of strings and a single search character. The function will then return a vector that contains the input strings that contain the search character.

For example, if my function was called searcher
names <- c("Bob", "Bill", "William", "Tom")
answer <- searcher(names, "i")
answer
[1] "Bill" "William"

  5. Write an R function that accepts three parameters: a vector of strings, a single search character, and a single replacement character. The function will return the vector of strings, but with all instances of the search character replaced with the replacement character.
    For example, if my function was called replacer
    names <- c("Bob", "Bill", "William", "Tom")
    answer <- replacer(names, "o", "O")
    answer
    [1] "BOb" "Bill" "William" "TOm"
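One possible set of solutions for the five exercises above is sketched below. These are example implementations, not the only correct approach; the function and argument names simply follow the examples given.

# 1. Replace NA entries with -1.
Cleaner <- function(x) {
  x[is.na(x)] <- -1
  x
}

# 2. Build the sequence with a repeat loop.
counter <- function(lower, upper, increment) {
  result <- c()
  current <- lower
  repeat {
    if (current > upper) break
    result <- c(result, current)
    current <- current + increment
  }
  result
}

# 3. The same result without a loop, given lower, upper, and increment.
seq(from = lower, to = upper, by = increment)

# 4. Keep only the strings that contain the search character.
searcher <- function(strings, ch) {
  strings[grepl(ch, strings, fixed = TRUE)]
}

# 5. Replace every instance of the search character with the replacement character.
replacer <- function(strings, search, replacement) {
  gsub(search, replacement, strings, fixed = TRUE)
}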

R Statistics Help

Question 13.
a.

read.csv()
metro <- read.csv("MetroMedian.csv", header = T)
This reads the file from the working directory and stores it in the data frame metro; header = T keeps the variable names from the first row of the file.
b.
install.packages("reshape2")
install.packages("data.table")

The reshape2 package is installed so that its melt() function can be used to reshape (melt) the imported data. The data.table package is installed so that the data frame can be transformed more easily. Load data.table after the reshape2 package.
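Note that installing a package does not load it into the session; before melt() is called, the packages also need to be attached, for example:
library(reshape2)
library(data.table)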

tidyMetro <- melt(metro,id.vars=c("RegionName","State","SizeRank"),variable.name="date",na.rm=TRUE)

Use the melt() function to convert the data frame from wide to long format. Its inputs are the object to be converted (metro), the id variables (the factors), and a name for the new variable; na.rm = TRUE drops the rows with missing values.
c.

mean(tidyMetro$value[tidyMetro$State=="NY"])
Use the R function mean(): select value from the tidyMetro data frame for the rows where State == "NY", and compute the mean of those values.
d.
regionMean <- function(valueFrame, searchRegion) {
  mean(valueFrame$value[valueFrame$RegionName == searchRegion])
}

The function above is stored in the variable regionMean. Its inputs are the data frame (valueFrame) and the region to search for (searchRegion). Inside the function, the R function mean() is applied to the value column for the rows whose RegionName matches the searchRegion passed to the function.
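A call to the function might look like the following (the region name here is hypothetical and must match a value in the RegionName column):
regionMean(tidyMetro, "New York")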
Question 14
a.

beaches <- read.csv("BeachWaterQuality.csv", header = T)
Because the data is stored in comma-separated (.csv) format, use the read.csv() function to read the file, with header = T so that the column names become the variable names.
b.
beaches$Results[is.na(beaches$Results)] <- 0
Select the Results variable from the beaches data frame, find the entries that are NA, and assign 0 to those missing values.
head(beaches)
Use head() to check the format of the data frame.
c.
new.Date <- strptime(as.character(beaches$Sample.Date),"%m/%d/%Y")
Create a variable new.Date that stores the dates in R's date-time format using the R function strptime(); as.character() converts the Sample.Date variable to character type so that its format can be parsed. The format string "%m/%d/%Y" says the dates in the file are written as month/day/four-digit year.

beaches$new.date <- new.Date
Add the newly created variable new.Date to the beaches data frame as a column named new.date.
d.
beachPlot <- function(beachData, beachName, sampleLocation) {
  beaches2 <- subset(beachData, Beach.Name == beachName & Sample.Location == sampleLocation)
  plot(beaches2$new.date, beaches2$Results, ylab = "Bacterial Count", xlab = "Date",
       main = c('Bacterial count for', beachName, sampleLocation))
  lines(beaches2$new.date[order(beaches2$new.date)], beaches2$Results[order(beaches2$new.date)],
        xlim = range(beaches2$new.date), ylim = range(beaches2$Results), col = "red")
}

Create a function and assign it the name beachPlot. The inputs to the function are the data frame (beachData), a beach name (beachName), and a sample location (sampleLocation). The inputs are used to subset the data frame: the rows matching the given beachName and sampleLocation are selected and stored in the beaches2 data frame. The plot() function then creates a scatterplot of Results against new.date, with the y-axis label, x-axis label, and main title supplied; the title is entered as a vector so that it incorporates the function's inputs.
lines() then adds a red line for Results against date, with the points ordered by date so the line is drawn from left to right.
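A call might look like the following (the beach name and sample location are hypothetical and must match values in the Beach.Name and Sample.Location columns):
beachPlot(beaches, "Orchard Beach", "Center")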

Question 15.
a.

mileage <- read.csv("Insight (3).csv", header = T)
head(mileage)
The data set in the working directory is stored in .csv format under the name Insight. Use the read.csv() function to read the data file and store the variables in a data frame called mileage.
Check the structure of the data frame using the head() function.
b.

plot(MPG~Avg.Temp, data = mileage, ylab="",xlab="Average Temperature",main="MPG against Avg.Temp and Car Said",col="blue")

The plot() function creates the plot. The tilde in MPG~Avg.Temp means y ~ x, and data = mileage supplies the values for the x and y axes from the mileage data frame. The y-axis label is left empty because a second series will be added later, and the x-axis label is set because both variables are plotted against the same x variable. A title is added with main, and col = "blue" sets the colour of this plot to blue.
c.

abline(lm(MPG~Avg.Temp,data = mileage),col="blue")
abline() adds a trend line to the plot: the line of best fit from a linear model of the response (MPG) on the explanatory variable (Avg.Temp), with the variables taken from the mileage data frame. col = "blue" sets the colour of the trend line to blue.

d.

par(new = TRUE)
par() is an R function for setting graphical parameters. Setting new = TRUE tells R not to clear the current plotting device, so the next plot is drawn on top of the existing one.
plot(Car.Said~Avg.Temp, data=mileage,col="red",ylab="MPG/Car Said",xlab="",axes=FALSE)
Use the plot() function to draw the second series (Car.Said against Avg.Temp) on top of the existing plot. The colour is set to red and the y-axis label is set to "MPG/Car Said"; axes = FALSE suppresses drawing a second set of axes.

e. Adding a red line that fits the linear model

abline(lm(Car.Said~Avg.Temp,data = mileage),col="red")
Create a linear model of Car.Said on Avg.Temp and add its trend line for the second plot. Set the colour to red.
par(new=FALSE)
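par(new = FALSE) then resets this setting so that any later plots start on a fresh device instead of being drawn over this one.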

legend("topleft",legend=c("Measured MPG","Car Reported MPG"),

   text.col=c("blue","red"),pch=c(16,16),col=c("blue","red"))

Add a legend to the plot, placed in the top-left corner. The legend labels are "Measured MPG" and "Car Reported MPG", the text colours are blue and red respectively, pch = 16 sets the plotting symbol to a filled circle, and col colours the symbols blue and red to match the two series.

Working With R Files

Save your work to a .r file. Name your file like _<First Name>_HW4.
Upload your files to DropBox.

I have provided you with a .csv file called Insight.csv. It contains the log of my Honda Insight’s mileage since I bought it. There is a header row at the top of the spreadsheet. The columns are mostly self-explanatory, but just in case:

Date: the date of the fill up
Miles: the number of miles on the odometer when I filled up
Gallons: the number of gallons I put in the car
Price.per.Gallon: the price per gallon of gas
Total.cost: the total cost of the fillup (i.e., Price.Per.Gallon * Gallons)
Grade: the grade of gasoline purchased
MPG: the calculated miles per gallon for that tank (Miles/Gallons)
Price.Per.Mile: Total.cost/Miles
Cumulative.Miles: cumulative miles driven by the car
Cumulative.Gallons: cumulative gallons of gas purchased
Cumulative.Cost: cumulative cost of gas purchased
Cumulative.MPG: Cumulative.Miles/Cumulative.Gallons
Cumulative.Price.per.Mile: Cumulative.Cost/Cumulative.Miles
Gas.Source: the chain of gas station the gas came from
Car.Said: the mileage for that tank that the car’s on-board computer reported
Delta: the difference between the mileage the car said and the mileage I calculated
Average.Price.of.Gas: cumulative average price paid per gallon of gas
Avg.Temp: average temperature for the period since the last fill-up, taken from a website at UD

Given this data, write an R script that will produce a SEPARATE WINDOW with this output.

(See the image r statistics output.png for the required output.)

Note that the bottom-right chart may not display all the labels on the x-axis until the window is maximized; that's okay.

IMPORTANT NOTE: There are several gas stations that only have one or a small number of entries. Those should be INCLUDED in the data for the first two charts, but EXCLUDED when producing the bottom two charts. The following code will quickly accomplish this for you:

Assuming you have read the data into a dataframe called mileage:

tbl <- table(mileage$Gas.Source)
new.Mileage <- droplevels(mileage[mileage$Gas.Source %in% names(tbl)[tbl>10],,drop=FALSE])

The data frame new.Mileage now contains the data minus any entries from a Gas.Source with 10 or fewer observations.

Your script should begin by reading in the data. You can assume that your script and the data file are both in the working directory, so you only need to reference the filename (Insight.csv) and not worry about the file path.
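As a starting point, a minimal skeleton for the script might look like the following. It assumes the required output is a two-by-two grid of charts drawn in a separate window; the specific charts themselves are defined by the image above.

# Read the data from the working directory.
mileage <- read.csv("Insight.csv", header = TRUE)

# Keep only gas stations with more than 10 entries (used for the bottom two charts).
tbl <- table(mileage$Gas.Source)
new.Mileage <- droplevels(mileage[mileage$Gas.Source %in% names(tbl)[tbl > 10], , drop = FALSE])

# Open a separate plotting window and split it into a 2 x 2 grid.
dev.new()
par(mfrow = c(2, 2))

# Draw the four charts here: use mileage for the top two charts
# and new.Mileage for the bottom two charts.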