
Dataset search and Analysis using R

For this, you need to find a dataset that contains at least 100 observations. There are a variety of repositories on the internet that contain large datasets. If you’re totally stuck, try

Once you have identified your dataset, determine an interesting plot you can make from it. This can be any kind of chart you want (scatter, line, pie, etc.) and can be built using base R or ggplot2, whichever you prefer.

Now build an R Markdown document with parameters that generates a report from your dataset and can be customized by setting those parameters. This will follow the same basic approach as the beach water quality example.

So, for instance, the parameters might be a start date and an end date, and the plot would be limited to that subset. Or the parameter might be a state or a region that appears in the data file, and the report would plot the data for that state or region.
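
As a rough sketch of the date-range idea, the R code below filters a data frame by two parameters and plots the result. The `readings` data frame, its column names, and the parameter names `start` and `end` are made up for illustration. In an actual R Markdown document the `params` list would come from the `params:` field in the YAML header rather than being built by hand.

```r
# In a knitted R Markdown document, `params` is created from the YAML header:
#   params:
#     start: "2020-06-01"
#     end: "2020-06-30"
# It is defined by hand here only so that this sketch runs on its own.
params <- list(start = "2020-06-01", end = "2020-06-30")

# Hypothetical data standing in for your own dataset.
readings <- data.frame(
  date  = as.Date(c("2020-05-15", "2020-06-10", "2020-06-20", "2020-07-04")),
  value = c(12, 30, 18, 25)
)

# Keep only the rows that fall inside the requested date range.
in_range <- readings$date >= as.Date(params$start) &
  readings$date <= as.Date(params$end)
report_data <- readings[in_range, ]

# Plot the subset with base R.
plot(report_data$date, report_data$value, type = "b",
     xlab = "Date", ylab = "Value",
     main = paste("Readings from", params$start, "to", params$end))
```

Other reports can then be generated from the same document by passing different values, for example `rmarkdown::render("report.Rmd", params = list(start = "2020-01-01", end = "2020-03-31"))`, where `report.Rmd` is a placeholder file name.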

I have provided you with an Excel spreadsheet called Last_FM_data_shuffled.xlsx. It contains the log of all the music I have listened to on my phone since I began using the website. As the name implies, however, I have shuffled the entries so that they are no longer in chronological order. There is a header row at the top of the spreadsheet, and there are four columns of data: Band, Album, Song, and Date.

  1. Assuming you are not using packages that let you read from Excel, what must you do first to prepare this data for import into an R dataframe? What command will you use to import it?
    For this problem, submit a .r file where the first line is a comment telling me what you have to do, and the second line is the R command to import the data. Remember that # is the comment character.
  2. What is a single R command that can be used to count how many different bands are represented in the data file?
  3. Write an R script that will sort the data back into chronological order and store it in a new dataframe.
  4. Recall that the table() function can be used to quickly summarize data. As an example, assuming I have attached the dataframe with the song data, I can type

     table(Song)

and get the following output:

(Song For My) Sugar Spun Sister                  1901          45
                              2                     1           2
         50 Ways to Say Goodbye  6th Avenue Heartache  8:02:00 PM
                              1                     2           1

Each song title appears as a column heading, and the number underneath it represents the number of times the song appears in the Song column of the dataframe.
Using this, what is the R command to determine the name of the song that has been played the most times? What is the R command to determine how many times that song has been played?
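
One possible way to read these answers off a table() result is sketched below. The short `Song` vector is invented for illustration and is not the Last.fm data; with the real dataframe attached, `Song` would refer to its Song column instead.

```r
# Made-up song log, for illustration only.
Song <- c("1901", "45", "1901", "6th Avenue Heartache", "45", "1901")

# Count how many times each title appears.
plays <- table(Song)

# Name of the most-played song (the first one found if there is a tie).
top_song <- names(which.max(plays))

# Number of times that song was played.
top_count <- max(plays)

print(top_song)
print(top_count)
```

Note that `which.max()` reports only the first maximum, so a tie between two songs would need a different check, such as `names(plays)[plays == max(plays)]`.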

  5. Using R, determine the average number of songs I listened to per day over the time period in the dataset.
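
The questions above about importing, counting bands, restoring chronological order, and averaging per day can be sketched on a toy file. This is only one possible approach under stated assumptions: the spreadsheet is assumed to have been saved as CSV from Excel, the rows are invented, and the date format `%m/%d/%Y %H:%M` is a guess that would need to match whatever the real Date column contains.

```r
# Toy stand-in for the spreadsheet after saving it as CSV from Excel.
# The rows and the date format are assumptions, not the real data.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Band,Album,Song,Date",
  "Band A,Album X,Song Z,03/02/2012 09:15",
  "Band B,Album Y,Song V,03/01/2012 18:40",
  "Band C,Album Q,Song U,03/04/2012 07:05",
  "Band A,Album X,Song W,03/01/2012 08:00"
), csv_path)

# Import the CSV into a dataframe, keeping text columns as character.
songs <- read.csv(csv_path, stringsAsFactors = FALSE)

# Count how many different bands appear.
n_bands <- length(unique(songs$Band))

# Parse the dates (assumed format) and sort into chronological order.
songs$When <- as.POSIXct(songs$Date, format = "%m/%d/%Y %H:%M", tz = "UTC")
songs_sorted <- songs[order(songs$When), ]

# Average plays per day over the whole span, counting both end days.
days <- as.Date(songs_sorted$When)
n_days <- as.numeric(max(days) - min(days)) + 1
avg_per_day <- nrow(songs_sorted) / n_days

print(n_bands)
print(avg_per_day)
```

Dividing by the full span counts days with no listening at all. Dividing by `length(unique(days))` instead would average only over days on which something was played, so which one is wanted depends on how the question is read.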