Posts tagged with r statistics help

Constructing a Confidence Interval for the Difference between Two Population Proportions
In order to determine if a new instructional technology improves students' scores, a professor wants to know if a larger percentage of students using the instructional technology passed the class than the percentage of students who did not use the new technology. Records show that 45 out of 50 randomly selected students who were in classes that used the instructional technology passed the class and 38 out of 51 randomly selected students who were in classes that did not use the instructional technology passed the class. Construct a 95% confidence interval for the true difference between the proportion of students using the technology who passed and the proportion of students not using the technology who passed.


We are going to show how to construct the confidence interval first without a TI-83/84 Plus calculator and then with one.
Step 1: Find the point estimate.

First, we'll let Population 1 be those students who used the new technology and Population 2 be those students who did not. Next, we need to calculate the sample proportions. The sample proportion for Sample 1 (using instructional technology) is calculated as follows.

$\hat{p}_1 = \frac{x_1}{n_1} = \frac{45}{50} = 0.9$

The sample proportion for Sample 2 (without the instructional technology) is found as follows.

$\hat{p}_2 = \frac{x_2}{n_2} = \frac{38}{51} \approx 0.745098$

Now that we have the sample proportions, we can calculate the point estimate.

$\hat{p}_1 - \hat{p}_2 = 0.9 - 0.745098 = 0.154902$

Step 2: Find the margin of error.

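Notice that the samples are indeed independent of one another. Because they are two separate groups of students, they are not connected in any way. We can assume that the other necessary conditions are met to allow us to use the standard normal distribution to calculate the margin of error. The level of confidence is $c = 0.95$, so $\alpha = 1 - c = 0.05$ and the critical value is $z_{\alpha/2} = z_{0.05/2} = z_{0.025} = 1.96$. Substituting the values into the formula gives us the following.

$E = z_{\alpha/2}\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}} = 1.96\sqrt{\frac{0.9(0.1)}{50} + \frac{0.745098(0.254902)}{51}} \approx 0.145675$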

Step 3: Subtract the margin of error from and add the margin of error to the point estimate.

Subtracting the margin of error from the point estimate and then adding the margin of error to the point estimate gives us the following endpoints of the confidence interval.

Lower endpoint: $(\hat{p}_1 - \hat{p}_2) - E = 0.154902 - 0.145675 \approx 0.009$

Upper endpoint: $(\hat{p}_1 - \hat{p}_2) + E = 0.154902 + 0.145675 \approx 0.301$

Thus, the 95% confidence interval for the difference between the two population proportions ranges from 0.009 to 0.301. The confidence interval can be written mathematically using either inequality symbols or interval notation, as shown below.

$0.009 < p_1 - p_2 < 0.301$

$(0.009, 0.301)$

Therefore, we are 95% confident that the percentage of students who passed the class is between 0.9 and 30.1 percentage points higher for the population of students who used the new instructional technology (Population 1) than for the population of students who did not use the technology (Population 2). Because the entire interval lies above 0, the professor can conclude, with 95% confidence, that the new instructional technology improves students' scores.
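The hand calculation can also be checked in R. The following is a minimal sketch that uses only the counts from the problem statement; the variable names are illustrative. It uses `qnorm()` for the critical value instead of the rounded 1.96, so the margin of error differs from the hand calculation in the sixth decimal place.

```r
# Counts from the problem statement
x1 <- 45; n1 <- 50   # Sample 1: used the instructional technology
x2 <- 38; n2 <- 51   # Sample 2: did not use the technology
conf_level <- 0.95

# Step 1: sample proportions and point estimate
p1 <- x1 / n1
p2 <- x2 / n2
point_estimate <- p1 - p2

# Step 2: critical value z_{alpha/2} and margin of error
z <- qnorm(1 - (1 - conf_level) / 2)
E <- z * sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# Step 3: endpoints of the confidence interval
ci <- c(lower = point_estimate - E, upper = point_estimate + E)
print(round(ci, 3))
```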

To calculate the confidence interval for the difference between two proportions on the calculator, we don't need to find the individual sample proportions; we just need to enter the number of successes and the sample size for each sample, as well as the level of confidence. Press STAT, scroll to TESTS, and then choose option B:2-PropZInt. x1 is the number of successes from the first sample and n1 is the first sample's size. Similarly, x2 is the number of successes from the second sample and n2 is the second sample's size. As usual, C-Level is the confidence level, which must be entered as a decimal. The data should be entered as shown in the first screenshot below. After you select Calculate and press ENTER, the results will be displayed on the screen as shown in the second screenshot below.
[Screenshot: 2-PropZInt data entry screen with x1 = 45, n1 = 50, x2 = 38, n2 = 51, and C-Level = .95.]
[Screenshot: 2-PropZInt results screen showing the interval (.00923, .30057), p̂1 = .9, p̂2 = .7450980392, n1 = 50, and n2 = 51.]

Notice that the calculator gives the same interval but with more decimal places. The interpretation of the confidence interval is still the same. The proportion of students passing the class was higher for the population of students who used the new instructional technology than for the population of students who did not use the technology.
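For readers working in R rather than on a calculator, base R's `prop.test()` accepts the same inputs (successes and sample sizes). This is a sketch, not part of the original walkthrough; `correct = FALSE` turns off the continuity correction so that the interval is the same normal-approximation interval the calculator reports.

```r
# Successes and sample sizes for the two samples
result <- prop.test(x = c(45, 38), n = c(50, 51),
                    conf.level = 0.95, correct = FALSE)

# Confidence interval for p1 - p2
print(result$conf.int)
```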

The goal of this lab is to start getting you comfortable using the Rguroo point-and-click interface and using the software to help visualize and interpret data.

Part I. Eye Color Dataset

For this part of the lab, we will explore the graphical features of Rguroo using the dataset called HairEyeColor. This dataset can be found on Titanium. Download the dataset to your desktop. In Rguroo, in the left-hand column, select the Data dropdown, then select Data Import. Within Data Import, select Data Frame, then select the file and select Upload.

Question #1 Once you have imported the data, double-click the dataset name to display the raw data. Right-clicking the dataset name brings up a menu of features, one of which is the summary function. Use these features to answer the following questions.
(a) How many variables are there in this dataset?
(b) Are the variables quantitative or categorical?
(c) Specifically name one of the variables, and state what values it can take.
(d) How many cases are in this dataset?

Question #2 Now let’s look at only the variable of Eye color and obtain a barplot of the values. Do this by clicking on the drop menu for Create Plot and select Barplot. We first need to select the dataset by clicking the drop down menu of Select a Dataset; choose HairEyeColor. Switch from Numerical/Freq tab to Categorical tab. Select the Factor 1 drop down menu and click on Eye. Now, click on the Relative Frequency selection. Fill in the Labels for the Title, X-Axis, and Y-Axis. Click on the eye icon to view the bar graph.
Copy the Barplot and paste it below.

Question #3 You can add the specific percentage of each category as well as other features by clicking on the Details tab. To add the specific percentages, go to Bar, Value Labels, Error Bars and select Add Value Labels. Press the eye icon to see the change.
Copy the new more detailed Barplot and paste it below.

Question #4 Based on your Barplot in the previous question, which category has the most people? Which has the least?

Question #5 We can also look at the Eye as a Factor of Gender. This would allow us to visually compare the distribution of Eye Color of males and females. To do this click on the Basics tab, and select Sex for the Factor 2 box and select the eye icon.
Copy the Barplot with eye color and gender and paste it below.

Question #6 Which color is most prevalent for females; which color for males?
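The same plots can be approximated in plain R as a cross-check on the Rguroo output. This sketch assumes that the lab's file holds the same data as R's built-in `HairEyeColor` table (counts by Hair, Eye, and Sex), which may not be exactly true of the file on Titanium; the plot titles and axis labels are placeholders.

```r
# Assumption: the built-in HairEyeColor table matches the lab's dataset
eye_counts <- margin.table(HairEyeColor, margin = 2)  # totals by eye color
eye_prop <- prop.table(eye_counts)                    # relative frequencies

# Relative-frequency barplot with percentage labels above each bar
bp <- barplot(eye_prop, main = "Eye Color", xlab = "Eye color",
              ylab = "Relative frequency", ylim = c(0, 0.5))
text(x = bp, y = as.numeric(eye_prop),
     labels = sprintf("%.1f%%", 100 * as.numeric(eye_prop)), pos = 3)

# Eye color by sex: proportions computed within each sex
eye_by_sex <- margin.table(HairEyeColor, margin = c(2, 3))
eye_by_sex_prop <- prop.table(eye_by_sex, margin = 2)
barplot(t(eye_by_sex_prop), beside = TRUE, legend.text = TRUE,
        main = "Eye Color by Sex", xlab = "Eye color",
        ylab = "Relative frequency within sex")
```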

Data Analysis Using Excel

For each of the following problems, save your work to a .r file. Name your files like
<Last Name>_<First Name>_HW3_<Problem Number>.r
So my file for problem 2 would be Hendrix_Jeremy_HW3_2.r
Upload your five files to DropBox.

I have provided you with an Excel spreadsheet called Last_FM_data_shuffled.xlsx. It contains the log of all the music I have listened to on my phone since I began using the website. As the name implies however, I have shuffled the entries so that they are no longer in chronological order. There is a header row at the top of the spreadsheet, and there are four columns of data: Band, Album, Song, and Date.

  1. Assuming you are not using packages that let you read from Excel, what must you do first in order to prepare this data to import to an R dataframe? What command will you use to import it?
    For this problem, submit a .r file where the first line is a comment telling me what you have to do, and the second line is the R command to import the data. Remember that # is the comment character.
  2. What is a single R command that can be used to count how many different bands are represented in the data file?
  3. Write an R script that will sort the data back into chronological order and store it in a new dataframe.
  4. Recall that the table() function can be used to quickly summarize data. As an example, assuming I have attached the dataframe with the song data, I can type

    table(Song)

and get the following output:

(Song For My) Sugar Spun Sister                     1901                       45
                              2                        1                        2

         50 Ways to Say Goodbye     6th Avenue Heartache               8:02:00 PM
                              1                        2                        1

Each song title appears as a column heading, and the number underneath it is the number of times the song appears in the Song column of the dataframe.
Using this, what is the R command to determine the name of the song that has been played the most times? What is the R command to determine how many times that song has been played?
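To make the structure of a `table()` result concrete, here is a small sketch on made-up data. The `songs` vector and its titles are hypothetical stand-ins for the Song column, not values from the spreadsheet. The result behaves like a named vector of counts, so the usual vector functions apply to it.

```r
# Hypothetical toy data standing in for the Song column
songs <- c("Song A", "Song B", "Song A", "Song C", "Song B", "Song A")

counts <- table(songs)   # named counts, one entry per distinct title
print(counts)

# The names are the titles and the values are the counts
most_played <- names(counts)[which.max(counts)]
times_played <- max(counts)
```

Note that `which.max()` returns only the first maximum, so ties would need separate handling.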

  5. Using R, determine the average number of songs I listened to per day over the time period in the dataset.

Statistics Homework in R Studio

  1. Write an R function called Cleaner that accepts a single vector of numbers that may contain NA entries and returns a vector where the NA’s have been replaced with -1.
  2. Write an R function that accepts three parameters: a lower bound, an upper bound, and an increment. Then use a repeat loop to generate a vector of the numbers from the lower bound to the upper bound by increment.
    For example, if my function was called counter
    answer <- counter(2,10,2)
    [1] 2 4 6 8 10
    i.e. the numbers from 2 to 10 in increments of 2
    answer <- counter(2,10,3)
    [1] 2 5 8
  3. Assuming I have three variables called lower, upper, and increment, how could I produce the same thing as number 2 with a single R statement that does not employ a loop?
  4. Write an R function that accepts two parameters: a vector of strings and a single search character. The function will then return a vector that contains the input strings that contain the search character.

    For example, if my function was called searcher
    names <- c("Bob", "Bill", "William", "Tom")
    answer <- searcher(names, "i")
    [1] "Bill" "William"

  5. Write an R function that accepts three parameters: a vector of strings, a single search character, and a single replacement character. The function will return the vector of strings, but with all instances of the search character replaced with the replacement character.
    For example, if my function was called replacer
    names <- c("Bob", "Bill", "William", "Tom")
    answer <- replacer(names, "o", "O")
    [1] "BOb" "Bill" "William" "TOm"
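The behaviours described in the examples above can be sketched as follows. This is one possible approach under stated assumptions, not the expected solution: `counter` assumes a positive increment, and the string functions treat the search character literally (`fixed = TRUE`) rather than as a regular expression.

```r
# Replace NA entries in a numeric vector with -1
Cleaner <- function(x) {
  x[is.na(x)] <- -1
  x
}

# Count from lower to upper by increment using a repeat loop
# (assumes increment > 0, otherwise the loop would not terminate)
counter <- function(lower, upper, increment) {
  result <- c()
  value <- lower
  repeat {
    if (value > upper) break
    result <- c(result, value)
    value <- value + increment
  }
  result
}

# Keep only the strings that contain the search character
searcher <- function(strings, ch) {
  strings[grepl(ch, strings, fixed = TRUE)]
}

# Replace every occurrence of the search character
replacer <- function(strings, ch, replacement) {
  gsub(ch, replacement, strings, fixed = TRUE)
}

names <- c("Bob", "Bill", "William", "Tom")
print(Cleaner(c(1, NA, 3)))
print(counter(2, 10, 3))
print(seq(2, 10, by = 3))   # loop-free equivalent of counter(2, 10, 3)
print(searcher(names, "i"))
print(replacer(names, "o", "O"))
```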

Dataset search and Analysis using R

For this, you need to find a dataset that contains at least 100 observations. There are a variety of repositories on the internet that contain large data sets. If you’re totally stuck, try

Once you have identified your dataset, determine an interesting plot you can make from it. This can be any kind of chart you want (scatter, line, pie, etc) and can be built using base R or ggplot2 as you prefer.

Now build an R Markdown document with parameters that can be used to generate a report from your dataset and can be customized by setting the parameters. This will follow the same basic approach as the beach water quality example.

So for instance, the parameters might be a start date and an end date and the plot would be limited to that subset. Or they might be a state or a region that is in the data file and plots data for that state.
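In a parameterized R Markdown document, values declared under `params:` in the YAML header are available inside code chunks as `params$<name>`. The sketch below shows only the subsetting step that a start date and end date would drive. The helper `filter_by_date`, the column names, and the toy data are all hypothetical; in a real report the two dates would come from something like `params$start_date` and `params$end_date`.

```r
# Hypothetical helper: keep rows whose date falls within [start, end]
filter_by_date <- function(data, start, end, date_col = "date") {
  dates <- as.Date(data[[date_col]])
  keep <- dates >= as.Date(start) & dates <= as.Date(end)
  data[keep, , drop = FALSE]
}

# Hypothetical toy data standing in for a real dataset
toy <- data.frame(
  date = as.Date("2024-01-01") + 0:9,
  value = 1:10
)

# In the report these two values would be read from params
subset_rows <- filter_by_date(toy, "2024-01-03", "2024-01-05")

plot(subset_rows$date, subset_rows$value, type = "b",
     xlab = "Date", ylab = "Value", main = "Values in the selected range")
```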