## Data Analysis

BSB123 Data Analysis

Assessment Item 2 Research Report (2017 S1)

Data
The file Birthweights.xlsx contains data on the following variables for a sample of 1000 births recorded in a large local hospital in 2015:

| Variable | Description |
| --- | --- |
| Birthweight | Birthweight in grams |
| Gestation | Length of pregnancy in days |
| Smoke | Whether the mother is a smoker or not |
| Pre-pregnancy weight | Mother’s pre-pregnancy weight in kilograms |
| Height | Mother’s height in centimetres |
| Status | Mother’s indigenous status |
| Age | Mother’s age in years |

Background
Management at the hospital is interested in being able to better manage room allocations and bookings in their maternity ward. They are keen to identify mothers at risk of having low birth weight babies who may require additional hospital resources during their stay in the hospital.

The hospital has collected data for a number of previous births at the hospital. The data contain information on the variables outlined in the table above. Management has approached you, as a consultant, and asked if you could analyse this dataset.

Part 1 - Analysis (80%)

1. Past records (2004) show that the average birthweight was 3500 grams. Test at 5% if the average birthweight in 2015 has increased with the improvement in general nutrition.
(Include all six steps for hypothesis testing.)

(2 marks)
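As a rough illustration of the test statistic behind Question 1, the sketch below computes a one-sample t statistic for H0: μ = 3500 against H1: μ > 3500. The `weights` list is made-up placeholder data, not values from Birthweights.xlsx, and the helper name `one_sample_t` is hypothetical. The critical value uses the standard normal as a large-sample stand-in for the t distribution, which is an assumption that is reasonable for n = 1000 but not for a sample as small as the placeholder.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def one_sample_t(sample, mu0):
    """Return the t statistic for H0: population mean == mu0."""
    n = len(sample)
    return (mean(sample) - mu0) / (stdev(sample) / sqrt(n))


# Placeholder values only; replace with the Birthweight column.
weights = [3620, 3410, 3555, 3700, 3480, 3590, 3530, 3660]

t_stat = one_sample_t(weights, mu0=3500)
# Upper-tail critical value at alpha = 0.05 (normal approximation).
z_crit = NormalDist().inv_cdf(0.95)
reject_h0 = t_stat > z_crit
print(f"t = {t_stat:.3f}, critical value = {z_crit:.3f}, reject H0: {reject_h0}")
```

In Excel the same comparison would normally use the t critical value with n - 1 degrees of freedom; the decision rule (reject H0 when the statistic exceeds the upper-tail critical value) is unchanged.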
2. Perform a two-sample t-test for each of the following tasks. (Include all six steps for hypothesis testing in each.)
(a) Determine if there is evidence that on average the weight of a baby of a mother who smokes is less than that of a mother who does not. (α = 5%)
(2 marks)

(b) Determine if being indigenous is a disadvantage in terms of birthweight. (α = 5%)
(2 marks)
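The two-sample comparisons in Question 2 can be sketched in the same way. The example below computes an unequal-variance (Welch) t statistic for a lower-tail test; whether the unit expects the pooled or unequal-variance version is not stated in the brief, so this choice is an assumption. The two lists are placeholder values, the helper name `welch_t` is hypothetical, and the critical value again uses the normal approximation.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def welch_t(a, b):
    """t statistic for H0: mean(a) == mean(b), unequal variances allowed."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


# Placeholder values only; split the Birthweight column by Smoke instead.
smokers = [3100, 3250, 2980, 3320, 3150, 3050]
non_smokers = [3400, 3550, 3300, 3620, 3480, 3510]

t_stat = welch_t(smokers, non_smokers)
# Lower-tail test (H1: smokers' mean is smaller) at alpha = 0.05,
# using the normal approximation to the t distribution.
z_crit = NormalDist().inv_cdf(0.05)
print(f"t = {t_stat:.3f}, reject H0: {t_stat < z_crit}")
```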
The hospital management is particularly interested in whether you can develop a regression model to help them to predict the birthweight of a baby based on the variables in the data supplied. The model could then be used to predict birthweight to identify babies at risk in future.

3. By using the forward stepwise method, develop a multiple regression model to predict the birthweight.
Step 1: Gestation only
Step 2: Gestation and Smoke
Step 3: Gestation, Smoke and Pre-pregnancy Weight
Step 4: Gestation, Smoke, Pre-pregnancy Weight and Height
Step 5: Gestation, Smoke, Pre-pregnancy Weight, Height and Status
Step 6: Gestation, Smoke, Pre-pregnancy Weight, Height, Status and Age
(a) Interpret the regression coefficients of all six (6) independent variables in the model obtained in Step 6, and comment on the statistical significance of each.
(3 marks)
(b) Use Excel to obtain the correlation matrix for the following variables: Gestation, Pre-pregnancy Weight, Height, Age and Birthweight. Do you think multi-collinearity is a problem in the regression model? Are the correlation coefficients consistent with the regression coefficients obtained in the model in Step 6? Discuss briefly.
(3 marks)
(c) Focusing on Steps 3 and 4, discuss fully how the introduction of Height in Step 4 affects the regression coefficient of Pre-pregnancy Weight.
(3 marks)
(d) Based on the results in (a) to (c), explain which independent variables should be included or excluded to formulate the final model. State the final model.
(2 marks)
(e) Comment on the overall adequacy of the final model.
(2 marks)
(f) Consider an indigenous mother who is a smoker, 20 years of age, and 160cm tall with a pre-pregnancy weight of 58kg and gestational age of 267 days. What is the expected weight of the child, using the final model you have developed in (d)?
(2 marks)
4. Compute the difference in the average birthweight of babies of indigenous and non-indigenous mothers (called the birthweight difference, for simplicity). Discuss fully if there is any discrepancy between the regression coefficient of Status obtained in the regression model and the birthweight difference.
(3 marks)

Part 2 – Report (20%)
You are required to submit a concise report (word limit: 400) presenting any important features or relationships in the data. The content of your report should be based on, but not restricted to, insights gleaned from your analyses conducted in Part 1.
(6 marks)

Notes:
Part 1 - Analysis
• For presentation and ease of marking, it is advisable to include relevant Excel output in your answer to each question in this part instead of placing them in appendices.
• There is no word limit in Part 1.
Part 2 - Report
• The report is primarily based on the data provided. If, however, you wish to include, and refer to, additional information, you can use any referencing system as long as it is used consistently.
• You can include relevant charts and Excel objects in your report.
• Use 1 & ½ spacing and font size of 11.
• The word limit of 400 (with a tolerance of 10%) is exclusive of words in tables, appendices and reference list (if any).

Submission
• You should submit your response to both parts as a single pdf document saved in the format:
BSB123 Report_StudentName.pdf
• Due: 11:59 pm 28 May 2017 (Sunday) via Blackboard

## The University of Hong Kong, School of Business ECON 1280 HW

ECON1280 HW1, Semester 2 2022-23

Instructions:
• Homework caught copying others’ work or past solutions will receive a zero.
• No proof is needed if using formulas taught in class, otherwise provide your justification.
• Penalty may apply if students fail to type their assignment. Questions about the penalty and assignment
in general can be directed toward TAs.

Question 1 (10 points) Tourists visiting Croatia are asked to participate in a survey, consisting of
various questions regarding their experience during their trip, which have been provided below. For each
question, describe the type of data obtained based on (1) the type and amount of information contained
in the data, and (2) levels of measurement.

a. Which of the following areas did you visit?
• (i) Coast; (ii) Islands; (iii) Mountains; (iv) Zagreb (Croatia’s capital).
b. What are the last four digits of your phone number?
c. Was the average amount you spent on food per day greater than USD100?
• (i) Yes; (ii) No.
d. What is the optimal number of days you would recommend a tourist spends in Croatia?
e. How often would you recommend visiting Croatia?

• Every year; Once every five years; Once in a lifetime; Never
Question 2 (10 points) For predicting credit default, a sample of financial data analysts was asked to
provide forecasts of credit scores for next year. The results are summarized in the following table:
| Forecast (£ 000) | Number of Clients |
| --- | --- |
| 11.95 < 12.45 | 5 |
| 12.45 < 13.95 | 18 |
| 13.95 < 14.45 | 16 |
| 14.45 < 15.95 | 23 |
| 15.95 < 16.45 | 11 |

a. Estimate the sample mean forecast. Answers without proper calculation and/or justification will receive 0.
b. Estimate the sample standard deviation. Answers without proper calculation and/or justification will receive 0.
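For grouped data such as the table in Question 2, the usual approximation treats every observation in a class as sitting at the class midpoint. The sketch below applies that approach to the class limits and frequencies exactly as printed in the table; the variable names are hypothetical, and the n - 1 divisor is assumed because the question asks for a sample standard deviation.

```python
from math import sqrt

# (lower bound, upper bound, frequency) for each class in the table.
classes = [
    (11.95, 12.45, 5),
    (12.45, 13.95, 18),
    (13.95, 14.45, 16),
    (14.45, 15.95, 23),
    (15.95, 16.45, 11),
]

n = sum(f for _, _, f in classes)
midpoints = [(lo + hi) / 2 for lo, hi, _ in classes]
freqs = [f for _, _, f in classes]

# Grouped mean: frequency-weighted average of the class midpoints.
grouped_mean = sum(f * m for f, m in zip(freqs, midpoints)) / n
# Grouped sample variance: weighted squared deviations divided by n - 1.
grouped_var = sum(f * (m - grouped_mean) ** 2 for f, m in zip(freqs, midpoints)) / (n - 1)
grouped_sd = sqrt(grouped_var)

print(f"n = {n}, mean = {grouped_mean:.3f}, sd = {grouped_sd:.3f}")
```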

Question 3 (10 points) A random sample of 70 business majors was asked a series of demographic
questions including major, gender, age, year in school, and current grade point average (GPA). Other
questions were also asked for their levels of satisfaction with campus parking, campus housing, and
campus dining. Responses to these satisfaction questions were measured on a scale from 1 to 5, with
5 being the highest level of satisfaction. Finally, these students were asked if they planned to attend
graduate school within 5 years of their college graduation (0: no; 1: yes). These data are contained in
the data file Finstad and Lie Study n70 23sp.csv.

a. Construct a cluster bar chart of the respondents’ major and gender.
b. Construct a pie chart of their majors.

Question 4 (20 points) The demand for bottled water increases during the hurricane season in Florida.
The manager at a plant that bottles drinking water wants to be sure that the process to fill 1-gallon
bottles (approximately 3.785 liters) is operating properly. Currently, the company is testing the volumes
of 1-gallon bottles. A random sample of 50 bottles is tested. Study the filling process for this product
and submit a report of your findings to the operations manager. The data are stored in the data file
Water n50 23sp.cvs.

a. Construct a frequency distribution and cumulative frequency distribution.
b. Construct a histogram.
c. Incorporate these graphs into a well-written summary. How could we apply statistical thinking in
this situation?

d. Find the range, variance, and standard deviation of the volumes. Answers without screenshots of your process will receive 0.
e. Find and interpret the interquartile range for the data. Answers without screenshots of your process will receive 0.
f. Find the value of the coefficient of variation. Answers without screenshots of your process will receive 0.
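The summary measures in parts d to f can be sketched with Python's standard library. The `volumes` list below is placeholder data, not the contents of the course data file, and the quartiles use the `statistics.quantiles` default ("exclusive") method, which may differ slightly from the convention taught in class or used by Excel.

```python
from statistics import mean, quantiles, stdev, variance

# Placeholder volumes in liters; replace with the 50 values from the data file.
volumes = [3.78, 3.79, 3.77, 3.80, 3.785, 3.76, 3.79, 3.78]

data_range = max(volumes) - min(volumes)
sample_var = variance(volumes)        # sample variance, divides by n - 1
sample_sd = stdev(volumes)            # square root of the sample variance
q1, _, q3 = quantiles(volumes, n=4)   # quartiles, default "exclusive" method
iqr = q3 - q1
cv_percent = sample_sd / mean(volumes) * 100  # coefficient of variation in %

print(f"range = {data_range:.3f}, variance = {sample_var:.6f}, sd = {sample_sd:.4f}")
print(f"IQR = {iqr:.4f}, CV = {cv_percent:.3f}%")
```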
Question 5 (10 points) Bishop’s supermarket records the actual price for consumer food products and
the weekly quantities sold. The data are stored in the data file Bishop n30 23sp.cvs.
a. Obtain the scatter plot for the actual price of a gallon of orange juice and the weekly quantities
sold at that price.
b. Does the scatter plot follow the pattern from economic theory? Explain briefly.
Question 6 (20 points) Search the FRED database for the "Real Personal Income" (Series ID: RPI)
c. Find the five-number summary for this data. Answers without screenshots of your process will receive 0.
d. Compute and comment on symmetry or skewness. Answers without proper justification will receive 0.
Question 7 (10 points) In April 2018, a study was conducted to understand the role of Twitter. The
study showed that Twitter helps business by illustrating the practical implications of a decision, connecting businesses to prominent technologies and trends - eWom, big data, and smart cities - and identifying
possible future directions. The practical implications of the study have a mean of 686 results and a
standard deviation 66.
a. It can be guaranteed that 75% of businesses being connected to prominent technologies and trends
will be in what interval? Answers without proper calculation and/or justification will receive 0.
b. Using the empirical rule, it can be estimated that approximately 95% of businesses being connected
to prominent technologies and trends will be in what interval? Answers without proper calculation and/or justification will receive 0.
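Both parts of Question 7 reduce to an interval of the form mean ± k standard deviations. The sketch below, with hypothetical function names, derives k from Chebyshev's inequality (at least 1 - 1/k² of any distribution lies within k standard deviations) and uses k = 2 for the empirical rule's approximate 95% band, which assumes a roughly bell-shaped distribution.

```python
from math import sqrt

mean_results = 686
sd_results = 66


def chebyshev_interval(mu, sigma, coverage):
    """Interval that Chebyshev guarantees holds at least `coverage` of the data."""
    k = 1 / sqrt(1 - coverage)  # solve 1 - 1/k**2 = coverage for k
    return mu - k * sigma, mu + k * sigma


def empirical_95_interval(mu, sigma):
    """Approximate 95% interval for bell-shaped data: mean +/- 2 sd."""
    return mu - 2 * sigma, mu + 2 * sigma


cheb = chebyshev_interval(mean_results, sd_results, 0.75)
emp = empirical_95_interval(mean_results, sd_results)
print(cheb, emp)
```

The two intervals coincide because 75% coverage under Chebyshev also gives k = 2; the difference lies in the claim attached to them (a guaranteed minimum of 75% for any distribution versus approximately 95% for bell-shaped data).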
Question 8 (10 points) Management of a local retail outlet in the U.S. wants to predict weekly online
shopping sales. It administers an aptitude test to all sales people, where test scores range from 0 to 10
with greater scores indicating a higher aptitude. Test scores and weekly sales (in Thousands of USD)
are as follows:

Test Score (x) 10 9 8 8 9
Weekly Sales (y) 23 61 17 29 20

a. Calculate the sample covariance without the aid of programming languages. Answers without proper calculation and/or justification will receive 0.
b. Compute the sample correlation between test scores and weekly sales without the aid of programming languages. Answers without proper calculation and/or justification will receive 0.
c. Based on your answers to parts (a) and (b), state your understanding of the relation between
aptitude test scores and weekly sales. Explain briefly.
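For reference, the hand calculation in Question 8 rests on the standard sample covariance and correlation formulas, written out here with the five data pairs from the table ($\bar{x} = 8.8$, $\bar{y} = 30$). The arithmetic below is a worked sketch and is worth checking against the formulas taught in class.

```latex
s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}
       = \frac{(-8.4) + 6.2 + 10.4 + 0.8 + (-2.0)}{5 - 1}
       = \frac{7.0}{4} = 1.75

s_x^2 = \frac{2.8}{4} = 0.7, \qquad s_y^2 = \frac{1280}{4} = 320

r = \frac{s_{xy}}{s_x \, s_y} = \frac{1.75}{\sqrt{0.7 \times 320}} \approx 0.117
```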

## Example question for estimating sample size

The state education commission wants to estimate the fraction of tenth-grade students who have reading skills at or below the eighth-grade level. In an earlier study, the population proportion was estimated to be 0.16.
How large a sample would be required in order to estimate the fraction of tenth graders reading at or below the eighth grade level at the 99% confidence level with an error of at most 0.03? Round your answer up to the next integer.
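A minimal sketch of the calculation, assuming the usual formula n = z² · p(1 − p) / E² with the earlier estimate p = 0.16 as the planning value. The function name is hypothetical, and the z value comes from Python's `statistics.NormalDist` rather than a printed table, so a table-rounded z (for example 2.58) can shift the result by a few units.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_proportion(p, margin, confidence):
    """Smallest n with z * sqrt(p * (1 - p) / n) <= margin."""
    # Two-sided critical value, e.g. about 2.576 for 99% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(z ** 2 * p * (1 - p) / margin ** 2)


n_required = sample_size_for_proportion(p=0.16, margin=0.03, confidence=0.99)
print(n_required)  # 991
```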

## MEDICARE OVERBILLING ANALYSIS

Your company is running a Medicare audit on Sleaze Hospital. Because Sleaze has a history of overbilling, the focus of your audit is on checking whether the billing amounts are correct. Assume that each invoice is for too high an amount with probability 0.06 and for too low an amount with probability 0.01 (so that the probability of a correct billing is 0.93). Also, assume that the outcome for any invoice is probabilistically independent of the outcomes for other invoices.

## Probability question using Excel

1. For this Assignment, reflect on the case presented. Think about what
strategies you might use to calculate associated probabilities for
Sleaze Hospital, and then address the series of questions for the
completion of the Assignment.
2. If you randomly sample 200 of Sleaze's invoices, what is the
probability that you will find at least 15 invoices that overcharge
the customer? What is the probability you won't find any that
undercharge the customer?
3. Find an integer, k, such that the probability is at least 0.99 that
you will find at least k invoices that overcharge the customer.
(Hint: Use trial and error with the BINOMDIST function to find k.)

Suppose that when Sleaze overcharges Medicare, the amount overcharged
(expressed as a percentage of the correct billing amount) is normally
distributed with mean 15% and standard deviation 4%.

4. What percentage of overbilled invoices are at least 10% more than
the legal billing amount? What percentage of all invoices are at
least 10% more than the legal billing amount?
5. If your auditing company samples 200 randomly chosen invoices, what
is the probability that it will find at least five where Medicare
was overcharged by at least 10%?
6. Submit your answers and embedded Excel analysis as a Microsoft Word
management report.
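The probabilities in items 2 to 5 can be cross-checked outside Excel. The sketch below uses only Python's standard library, with hypothetical helper names, and reads item 3 as asking for the largest k that still satisfies the 0.99 requirement (an assumption, since any smaller k also qualifies).

```python
from math import comb
from statistics import NormalDist


def binom_pmf(k, n, p):
    """P(X == k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def binom_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return 1 - sum(binom_pmf(i, n, p) for i in range(k))


n = 200
p_over, p_under = 0.06, 0.01

# Item 2: at least 15 overcharges, and no undercharges at all.
p_15_or_more_over = binom_at_least(15, n, p_over)
p_no_under = binom_pmf(0, n, p_under)

# Item 3: largest k with P(X >= k) >= 0.99 (mirrors the trial-and-error search).
k = max(j for j in range(n + 1) if binom_at_least(j, n, p_over) >= 0.99)

# Item 4: share of overbilled invoices, then of all invoices, at least 10% too high.
p_10_given_over = 1 - NormalDist(mu=15, sigma=4).cdf(10)
p_10_all = p_over * p_10_given_over

# Item 5: at least five of 200 invoices overcharged by at least 10%.
p_5_or_more_big = binom_at_least(5, n, p_10_all)

print(p_15_or_more_over, p_no_under, k, p_10_given_over, p_10_all, p_5_or_more_big)
```

In Excel, `binom_at_least(k, n, p)` corresponds to `1 - BINOMDIST(k - 1, n, p, TRUE)`, and the normal tail corresponds to `1 - NORMDIST(10, 15, 4, TRUE)`.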