Posts under category Elementary Statistics

MyMathLab Homework 5.5 Answers
A drug tester claims that a drug cures a rare skin disease 78% of the time. The claim is checked by testing the drug on 100 patients. If at least 73 patients are cured, the claim will be accepted.

Find the probability that the claim will be rejected assuming that the manufacturer's claim is true. Use the normal distribution to approximate the binomial distribution if possible.

Solving the question

The first step is to check that the conditions for using the normal approximation to the binomial distribution are met:
np >= 5 and nq >= 5

Here n = 100, p = 0.78 and q = 1 - p = 1 - 0.78 = 0.22, so the conditions are met:
np = 100 * 0.78 = 78 and nq = 100 * 0.22 = 22
To use the normal approximation, we need the mean and standard deviation of the binomial distribution.
Mean: np = 100 * 0.78 = 78. Standard deviation: sqrt(npq) = sqrt(100 * 0.78 * 0.22) = 4.142

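The claim is rejected if fewer than 73 patients are cured, so we are interested in P(X < 73).
We'll use a continuity correction to improve the accuracy of the approximation.
Because X is discrete, P(X < 73) = P(X <= 72), which is approximated by P(X < 72.5) under the normal curve.

The corresponding z-score is
z = (72.5 - 78)/4.142 = -1.33
P(Z < -1.33) = 0.0921 (computed from the unrounded z of about -1.328; a z-table read at -1.33 gives 0.0918)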

Note that you can also use Excel to answer this question.
The function =NORM.DIST(72.5,78,4.142,TRUE) returns this cumulative probability directly.
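
The same calculation can be reproduced outside Excel. The following is a minimal Python sketch using only the standard library; the variable names are illustrative, and the exact binomial sum is added only as a sanity check on the approximation, not as part of the method the question asks for.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p, threshold = 100, 0.78, 73   # values taken from the question
q = 1 - p

mean = n * p
sd = sqrt(n * p * q)

# Normal approximation with continuity correction: P(X < 73) is taken as P(Y < 72.5)
approx = NormalDist(mean, sd).cdf(threshold - 0.5)

# Exact binomial probability P(X <= 72), for comparison only
exact = sum(comb(n, k) * p**k * q**(n - k) for k in range(threshold))

print(f"mean = {mean:.0f}, sd = {sd:.3f}")
print(f"normal approximation: {approx:.4f}")
print(f"exact binomial:       {exact:.4f}")
```

The two printed probabilities should be close, which is what the np >= 5 and nq >= 5 check is meant to ensure.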

You are free to contact MyMathLab Homework Help in case you need help in answering similar questions.

Doing Tests of Assumptions

Qn1. Download the file designtime.csv from the course materials. This file describes a study in which designers used Adobe Illustrator or Adobe InDesign to create a benchmark set of classic children’s illustrations. The amount of time they took was recorded, in minutes. How many subjects took part in this study?

Qn2. Create a boxplot of the task time data for each tool. At a glance, which of the following conclusions seems to be most likely?

- Illustrator and InDesign have similar median task times, with similar variances.
- Illustrator has a higher median task time than InDesign, with similar variances.
- Illustrator has a higher median task time than InDesign, with dissimilar variances.
- InDesign has a higher median task time than Illustrator, with similar variances.
- InDesign has a higher median task time than Illustrator, with dissimilar variances.

Qn3. Conduct a Shapiro-Wilk test on the time response for each of the tools. To the nearest ten-thousandth (four digits), what is the p-value of this test for Illustrator?

Qn4. Conduct a Shapiro-Wilk normality test on the residuals of Time by Tool. To the nearest ten-thousandth (four digits), what is the W value displayed? Hint: use aov to fit a model and then run shapiro.test on the model residuals.

Qn5. In light of your normality tests, would you conclude the data does or does not violate normality?

- The data does violate normality
- The data does not violate normality

Qn6. Conduct a Brown-Forsythe test of homoscedasticity. To the nearest hundredth (two digits), what is the F statistic for the test? Hint: use the car library and its leveneTest function with center=median.

Qn7. Fit a lognormal distribution to the Time response of each of the design tools. Conduct a Kolmogorov-Smirnov goodness-of-fit test. To the nearest ten-thousandth (four digits), what is the exact p-value of the test for the Illustrator data? Hint: use the MASS library and its fitdistr function with "lognormal" to acquire a fit estimate. Then use ks.test with "plnorm", passing the acquired fit values as meanlog and sdlog. Request an exact fit.

Qn8. Create a new column that is the log-transformed Time response. Compute the mean of this log-transformed response for each drawing tool. To the nearest hundredth (two digits), what is the mean of the log-transformed response for InDesign?

Qn9. Conduct an independent-samples t-test on the log-transformed Time response. Use the Welch version for unequal variances. To the nearest hundredth (two digits), what is the t statistic for the test?

Qn10. As an alternative to log-transforming the Time response, leave Time as it is and conduct an exact nonparametric Mann-Whitney U test on it. To the nearest ten-thousandth (four digits), what is the z statistic that results from this test? Hint: use the coin library and its wilcox_test function with distribution="exact".
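
The questions expect R, but the arithmetic behind two of the statistics (the Brown-Forsythe F in Qn6 and the Welch t on log-transformed times in Qn9) can be sketched in plain Python. This is only an illustration under stated assumptions: the function names are made up for this sketch, and the task times are hypothetical values, not the contents of designtime.csv, so the printed numbers are not answers to the questions.

```python
from math import log, sqrt
from statistics import mean, median, variance

def brown_forsythe_f(*groups):
    """One-way ANOVA F statistic on absolute deviations from each group median."""
    z = [[abs(x - median(g)) for x in g] for g in groups]
    n_total = sum(len(g) for g in z)
    grand = sum(sum(g) for g in z) / n_total
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in z)
    ss_within = sum((x - mean(g)) ** 2 for g in z for x in g)
    df_between = len(z) - 1
    df_within = n_total - len(z)
    return (ss_between / df_between) / (ss_within / df_within)

def welch_t(a, b):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical task times in minutes (NOT the values in designtime.csv)
illustrator = [195, 210, 280, 340, 190, 260]
indesign = [120, 135, 150, 110, 145, 160]

print("Brown-Forsythe F:", round(brown_forsythe_f(illustrator, indesign), 2))

# Log-transform the times, then compare the group means with the Welch t-test
log_ill = [log(x) for x in illustrator]
log_ind = [log(x) for x in indesign]
t, df = welch_t(log_ill, log_ind)
print("Welch t:", round(t, 2), "df:", round(df, 2))
```

Centering on the median rather than the mean is what distinguishes the Brown-Forsythe variant from the original Levene test, which is why the R hint passes center=median.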

Assessment Item 2 Research Report
BSB123 Data Analysis

Assessment Item 2 Research Report (2017 S1)

The file: Birthweights.xlsx contains data on the following variables for a sample of 1000 births recorded in a large local hospital in 2015:

Variable                Description
Birthweight             Birthweight in grams
Gestation               Length of pregnancy in days
Smoke                   Whether the mother is a smoker or not
Pre-pregnancy weight    Mother’s pre-pregnancy weight in kilograms
Height                  Mother’s height in centimetres
Status                  Mother’s indigenous status
Age                     Mother’s age in years

Management at the hospital is interested in being able to better manage room allocations and bookings in their maternity ward. They are keen to identify mothers at risk of having low birth weight babies who may require additional hospital resources during their stay in the hospital.

The hospital has collected data for a number of previous births at the hospital. The data contains information on the variables outlined in the table above. As a consultant, they have approached you and asked if you could analyse this dataset.

Part 1 - Analysis (80%)

  1. Past records (2004) show that the average birthweight was 3500 grams. Test at 5% if the average birthweight in 2015 has increased with the improvement in general nutrition.
    (Include all six steps for hypothesis testing.)
    (2 marks)

  2. Perform a two-sample t-test for each of the following tasks. (Include all six steps for hypothesis testing in each.)
    (a) Determine if there is evidence that on average the weight of a baby of a mother who smokes is less than that of a mother who does not. (α = 5%)
    (2 marks)
    (b) Determine if being indigenous is a disadvantage in terms of birthweight. (α = 5%)
    (2 marks)

The hospital management is particularly interested in whether you can develop a regression model to help them to predict the birthweight of a baby based on the variables in the data supplied. The model could then be used to predict birthweight to identify babies at risk in future.

  3. By using the forward stepwise method, develop a multiple regression model to predict the birthweight.
    Step 1: Gestation only
    Step 2: Gestation and Smoke
    Step 3: Gestation, Smoke and Pre-pregnancy Weight
    Step 4: Gestation, Smoke, Pre-pregnancy Weight and Height
    Step 5: Gestation, Smoke, Pre-pregnancy Weight, Height and Status
    Step 6: Gestation, Smoke, Pre-pregnancy Weight, Height, Status and Age
    (a) Interpret the regression coefficients of all six (6) independent variables in the model obtained in Step 6, and comment on the statistical significance of each.
    (3 marks)
    (b) Use Excel to obtain the correlation matrix for the following variables: Gestation, Pre-pregnancy Weight, Height, Age and Birthweight. Do you think multi-collinearity is a problem in the regression model? Are the correlation coefficients consistent with the regression coefficients obtained in the model in Step 6? Discuss briefly.
    (3 marks)
    (c) Focusing on Steps 3 and 4, discuss fully how the introduction of Height in Step 4 affects the regression coefficient of Pre-pregnancy Weight.
    (3 marks)
    (d) Based on the results in (a) to (c), explain which independent variables should be included or excluded to formulate the final model. State the final model.
    (2 marks)
    (e) Comment on the overall adequacy of the final model.
    (2 marks)
    (f) Consider an indigenous mother who is a smoker, 20 years of age, and 160cm tall with a pre-pregnancy weight of 58kg and gestational age of 267 days. What is the expected weight of the child, using the final model you have developed in (d)?
    (2 marks)
  4. Compute the difference in the average birthweight of babies of indigenous and non-indigenous mothers (called the birthweight difference, for simplicity). Discuss fully if there is any discrepancy between the regression coefficient of Status obtained in the regression model and the birthweight difference.
    (3 marks)
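
The assignment asks for Excel output, but the test statistic in question 1 is simple enough to sketch in Python as a cross-check. The sketch below is an illustration only: the function name is made up, the five birthweights are hypothetical values (not taken from Birthweights.xlsx), and the normal critical value is used as a large-sample stand-in for the t critical value, which is reasonable for the real n = 1000 but not for a five-value demo.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def one_sample_t(sample, mu0):
    """t statistic for H0: mu = mu0, using the sample standard deviation."""
    return (mean(sample) - mu0) / (stdev(sample) / sqrt(len(sample)))

# Hypothetical birthweights in grams (NOT from Birthweights.xlsx)
weights = [3400, 3600, 3500, 3700, 3550]
t_stat = one_sample_t(weights, 3500)

# Upper-tail 5% critical value; with n = 1000 the t value is close to this normal value
z_crit = NormalDist().inv_cdf(0.95)

print(f"t = {t_stat:.3f}, approximate critical value = {z_crit:.3f}")
print("reject H0" if t_stat > z_crit else "do not reject H0")
```

The test is upper-tailed because the alternative hypothesis is that the 2015 average has increased above 3500 grams.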

Part 2 - Report (20%)
You are required to submit a concise report (word limit: 400) presenting any important features or relationships in the data. The content of your report should be based on, but not restricted to, insights gleaned from your analyses conducted in Part 1.
(6 marks)

Part 1 - Analysis
• For presentation and ease of marking, it is advisable to include relevant Excel output in your answer to each question in this part instead of placing them in appendices.
• There is no word limit in Part 1.
Part 2 - Report
• The report is primarily based on the data provided. If, however, you wish to include, and refer to, additional information, you can use any referencing system as long as it is used consistently.
• You can include relevant charts and Excel objects in your report.
• Use 1 & ½ spacing and font size of 11.
• The word limit of 400 (with a tolerance of 10%) is exclusive of words in tables, appendices and reference list (if any).

• You should submit your response to both parts as a single pdf document saved in the format:
BSB123 Report_StudentName.pdf
• After uploading your research report, it is your responsibility to go back to the Assignment Upload page to check that your report was properly uploaded.
• Due: 11:59 pm 28 May 2017 (Sunday) via Blackboard

ECON1280 HW1, Semester 2 2022-23

The University of Hong Kong, School of Business

• Homework caught copying others’ work or past solutions will receive a zero.
• No proof is needed if using formulas taught in class, otherwise provide your justification.
• Penalty may apply if students fail to type their assignment. Questions about the penalty and assignment
in general can be directed toward TAs.

Question 1 (10 points) Tourists visiting Croatia are asked to participate in a survey, consisting of
various questions regarding their experience during their trip, which have been provided below. For each
question, describe the type of data obtained based on (1) the type and amount of information contained
in the data, and (2) levels of measurement.

a. Which of the following areas did you visit?
• (i) Coast; (ii) Islands; (iii) Mountains; (iv) Zagreb (Croatia’s capital).
b. What are the last four digits of your phone number?
c. Was the average amount you spent on food per day greater than USD100?
• (i) Yes; (ii) No.
d. What is the optimal number of days you would recommend a tourist spends in Croatia?
e. How often would you recommend visiting Croatia?
• Every year; Once every five years; Once in a lifetime; Never.

Question 2 (10 points) For predicting credit default, a sample of financial data analysts was asked to
provide forecasts of credit scores for next year. The results are summarized in the following table:
Forecast (£ 000)      Number of Clients
11.95 < 12.45         5
12.45 < 13.95         18
13.95 < 14.45         16
14.45 < 15.95         23
15.95 < 16.45         11

a. Estimate the sample mean forecast. Answers without proper calculation and/or justification will receive 0.
b. Estimate the sample standard deviation. Answers without proper calculation and/or justification will receive 0.
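
For grouped data like the table in Question 2, the usual estimate treats every observation in a class as sitting at the class midpoint. The Python sketch below illustrates that approach under this assumption; the variable names are illustrative, and the n - 1 divisor reflects that the question asks for a sample standard deviation.

```python
from math import sqrt

# (lower bound, upper bound, frequency) for each class in the Question 2 table
classes = [
    (11.95, 12.45, 5),
    (12.45, 13.95, 18),
    (13.95, 14.45, 16),
    (14.45, 15.95, 23),
    (15.95, 16.45, 11),
]

n = sum(f for _, _, f in classes)

# Represent each class by its midpoint
mids = [((lo + hi) / 2, f) for lo, hi, f in classes]

# Frequency-weighted mean of the midpoints
grouped_mean = sum(m * f for m, f in mids) / n

# Sample variance of the grouped data (divide by n - 1)
grouped_var = sum(f * (m - grouped_mean) ** 2 for m, f in mids) / (n - 1)
grouped_sd = sqrt(grouped_var)

print(f"n = {n}, mean = {grouped_mean:.4f}, sd = {grouped_sd:.4f}")
```

Note that the classes in the table do not all have the same width, so the midpoints should be computed from each class's own bounds rather than by adding a fixed step.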

Question 3 (10 points) A random sample of 70 business majors was asked a series of demographic
questions including major, gender, age, year in school, and current grade point average (GPA). Other
questions were also asked for their levels of satisfaction with campus parking, campus housing, and
campus dining. Responses to these satisfaction questions were measured on a scale from 1 to 5, with
5 being the highest level of satisfaction. Finally, these students were asked if they planned to attend
graduate school within 5 years of their college graduation (0: no; 1: yes). These data are contained in
the data file Finstad and Lie Study n70 23sp.csv.

a. Construct a cluster bar chart of the respondents’ major and gender.
b. Construct a pie chart of their majors.

Question 4 (20 points) The demand for bottled water increases during the hurricane season in Florida.
The manager at a plant that bottles drinking water wants to be sure that the process to fill 1-gallon
bottles (approximately 3.785 liters) is operating properly. Currently, the company is testing the volumes
of 1-gallon bottles. A random sample of 50 bottles is tested. Study the filling process for this product
and submit a report of your findings to the operations manager. The data are stored in the data file
Water n50 23sp.cvs.

a. Construct a frequency distribution and cumulative frequency distribution.
b. Construct a histogram.
c. Incorporate these graphs into a well-written summary. How could we apply statistical thinking in
this situation?

d. Find the range, variance, and standard deviation of the volumes. Answers without screenshots of
your process will receive 0.
e. Find and interpret the interquartile range for the data. Answers without screenshots of your
process will receive 0.
f. Find the value of the coefficient of variation. Answers without screenshots of your process will
receive 0.
Question 5 (10 points) Bishop’s supermarket records the actual price for consumer food products and
the weekly quantities sold. The data are stored in the data file Bishop n30 23sp.cvs.
a. Obtain the scatter plot for the actual price of a gallon of orange juice and the weekly quantities
sold at that price.
b. Does the scatter plot follow the pattern from economic theory? Explain briefly.
Question 6 (20 points) Search the FRED database for the "Real Personal Income" (Series ID: RPI)
and download the data series from 2011-12-1 to 2021-12-1.
a. Compute the population mean. Answers without screenshots of your process will receive 0.
b. Compute the population median. Answers without screenshots of your process will receive 0.
c. Find the five-number summary for this data. Answers without screenshots of your process will receive 0.
d. Compute and comment on symmetry or skewness. Answers without proper justification will receive 0.
Question 7 (10 points) In April 2018, a study was conducted to understand the role of Twitter. The
study showed that Twitter helps business by illustrating the practical implications of a decision, connecting businesses to prominent technologies and trends - eWom, big data, and smart cities - and identifying
possible future directions. The practical implications of the study have a mean of 686 results and a
standard deviation of 66.
a. It can be guaranteed that 75% of businesses being connected to prominent technologies and trends
will be in what interval? Answers without proper calculation and/or justification will receive 0.
b. Using the empirical rule, it can be estimated that approximately 95% of businesses being connected
to prominent technologies and trends will be in what interval? Answers without proper calculation
and/or justification will receive 0.
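
Parts (a) and (b) of Question 7 rest on two different rules: Chebyshev's inequality, which guarantees at least 1 - 1/k^2 of any distribution within k standard deviations of the mean, and the empirical rule, which estimates about 95% within 2 standard deviations for bell-shaped data. A minimal Python sketch of both, with illustrative function names:

```python
from math import sqrt

mean, sd = 686, 66   # values given in the question

def chebyshev_interval(mean, sd, proportion):
    """Interval guaranteed to hold at least `proportion` of the data."""
    # Solve 1 - 1/k^2 = proportion for k
    k = 1 / sqrt(1 - proportion)
    return mean - k * sd, mean + k * sd

def empirical_interval(mean, sd, k):
    """Interval of mean +/- k standard deviations (k = 2 for roughly 95%)."""
    return mean - k * sd, mean + k * sd

print("Chebyshev, at least 75%:", chebyshev_interval(mean, sd, 0.75))
print("Empirical rule, about 95%:", empirical_interval(mean, sd, 2))
```

Both intervals use k = 2 here, so the endpoints coincide; the difference lies in the claim attached to them, a guarantee for any distribution versus an estimate that assumes an approximately normal shape.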
Question 8 (10 points) Management of a local retail outlet in the U.S. wants to predict weekly online
shopping sales. It administers an aptitude test to all sales people, where test scores range from 0 to 10
with greater scores indicating a higher aptitude. Test scores and weekly sales (in Thousands of USD)
are as follows:

Test Score (x) 10 9 8 8 9
Weekly Sales (y) 23 61 17 29 20

a. Calculate the sample covariance without the aid of programming languages. Answers without proper calculation and/or justification will receive 0.
b. Compute the sample correlation between test scores and weekly sales without the aid of programming languages. Answers without proper calculation and/or justification will receive 0.
c. Based on your answers to parts (a) and (b), state your understanding of the relation between
aptitude test scores and weekly sales. Explain briefly.
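
Question 8 asks for hand calculation, so the following Python sketch is meant only as a way to check manual working, not as a substitute for it. It applies the sample covariance and correlation formulas with the n - 1 divisor to the five data pairs in the table; the variable names are illustrative.

```python
from math import sqrt

x = [10, 9, 8, 8, 9]        # test scores from the table
y = [23, 61, 17, 29, 20]    # weekly sales in thousands of USD
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Sample covariance: sum of cross-deviations divided by n - 1
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

# Sample standard deviations
s_x = sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
s_y = sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))

# Sample correlation coefficient
r = s_xy / (s_x * s_y)

print(f"covariance = {s_xy:.4f}, correlation = {r:.4f}")
```

A positive covariance paired with a correlation near zero would point to a weak linear relationship, which is the kind of reading part (c) asks for.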