Data Science

SESSION 2 FORMAL EXAMINATIONS - NOVEMBER 2020
EXAMINATION DETAILS:

Unit Code: COMP2200/COMP6200
Unit Name: Data Science
Duration of exam: 3 hours in a 6 hour window
Total number of questions: 8
Total number of pages: 5 (incl. this cover sheet)
Total number of marks: 100

INSTRUCTIONS:
Answer ALL questions in a single word processor file and upload your answers to the provided Turnitin
submission page by the due time. You can upload a Word or PDF file.
Collaboration with others in completing this exam is not allowed. The work you submit should be your
own. Any evidence of copying or collusion will be referred to the Faculty Discipline Committee. Note
that your submissions will be passed through Turnitin to identify copying from the Internet or from other
students.
1. (10 marks) You are working as a Data Scientist in a big retail store, say Woolworths, and your task is
to optimise various retail processes such as inventory management, product placement, and customised
offers. Using the CRISP-DM model, can you explain what you will do in each stage of the data science
project life cycle, what your input will be, and what you will deliver at each stage? (Write no more
than 500 words in total)
2. The following graph (see footnote 1) shows the relationship between US spending on science and the number
of suicides (by hanging, strangulation, and suffocation). Based on this graph, answer the following
questions.
(a) (5 marks) What does the correlation mean in this context? What does the R² value mean?
(Write no more than 200 words in total)
(b) (5 marks) One of your friends, Mr. Citizen, thinks that this correlation is caused by the increasing
pressure on researchers to continuously produce output. How would you evaluate this explanation? Looking at the numbers in the data displayed, can you determine whether this explanation
could account for the effect shown? (Write no more than 200 words in total)
3. (a) (5 marks) For the following data scenarios, which chart should you use to visualise? Justify your
answers. (Write no more than 200 words in total)
(1) Bureau of Meteorology data having average monthly rainfall in Sydney from 2016 to 2020.
(2) Hospital data having systolic pressure and weight of 2000 patients.
(3) Australian Bureau of Statistics data having yearly household expenses (grocery, transport,
education, rent/mortgage, and entertainment) for the Australian population.
(4) Australian Bureau of Statistics providing Census data showing population density for each
suburb across New South Wales.
(5) Bureau of Meteorology weather data having multiple weather conditions in Sydney with
features including date, precipitation, max temperature, min temperature, wind speed, and
weather (drizzle, rain, sunny, snow, and fog).
Footnote 1 (graph in Question 2): Data sources: U.S. Office of Management and Budget and Centers for Disease Control & Prevention
(b) (5 marks) You are working on a project that analyses the census data provided by the Australian
Bureau of Statistics. Table 1 shows a sample dataset. What data cleaning and normalisation
techniques should you apply to this data so that you can apply unsupervised learning methods?
(Write no more than 200 words in total)
Table 1: Sample Census dataset from Australian Bureau of Statistics

Census Code   Suburb           State             Area (sq km)
CED101        Berowra          NSW               78644.32
CED101        wentworthville   New South Wales   89232.53645
CED101        north sydney     nsw               10324.45
CED101        mt. druitt                         10583.12
CED105        st. Kilda        Vic.              8524.96762
CED105        South melb.      vic               45321.87
CED105        gelong           Victoria          24534.2534
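To illustrate the kind of inconsistency present in Table 1, the following minimal Python sketch standardises the state labels, suburb capitalisation, and area precision for a few of its rows. The names `STATE_ALIASES` and `clean_row` are hypothetical, chosen only for this example, and the sketch shows one possible cleaning step rather than a model answer.

```python
# Hypothetical alias table: maps the state spellings seen in Table 1 to one code.
STATE_ALIASES = {
    "nsw": "NSW",
    "new south wales": "NSW",
    "vic": "VIC",
    "vic.": "VIC",
    "victoria": "VIC",
}

def clean_row(suburb, state, area_sqkm):
    """Return a row with consistent casing, state code, and rounded area."""
    suburb_clean = " ".join(word.capitalize() for word in suburb.split())
    state_key = state.strip().lower()
    # Unknown or missing states become None so they can be imputed later.
    state_clean = STATE_ALIASES.get(state_key)
    return suburb_clean, state_clean, round(float(area_sqkm), 2)

# A few rows copied from Table 1 (the state is empty for "mt. druitt").
rows = [
    ("Berowra", "NSW", "78644.32"),
    ("wentworthville", "New South Wales", "89232.53645"),
    ("mt. druitt", "", "10583.12"),
    ("st. Kilda", "Vic.", "8524.96762"),
]
cleaned = [clean_row(*row) for row in rows]
```

Misspelt or abbreviated suburb names (for example "gelong" or "South melb.") are not handled here; they would need a reference list of valid suburb names.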
4. (a) (5 marks) I have data on different laptops from different brands with features for weight (grams),
size (cm), RAM (GB), Hard Drive (GB), Processor (Intel core i5, Intel core i7, Intel core i3, AMD
Ryzen, AMD Athlon, etc), and price (Australian Dollars). I want to cluster similar laptops based
on their specifications. Discuss your approach to applying a clustering algorithm on this data.
What transformations would be needed before you could work with this data and why? (Write
no more than 200 words in total)
(b) (5 marks) You built a regression model to predict baby length based on mother’s height and
mother’s age. After fitting the regression model on the training data, the model coefficients
for mother’s height and mother’s age are [0.2539, -0.0075] and the intercept is 4.7623. What is your
interpretation of these coefficient and intercept values? Can you determine how a change in each
variable affects the baby’s length? (Write no more than 200 words in total)
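The fitted model in part (b) can be written out as a small Python function. The sketch below is illustrative only: the input values are hypothetical, and the question does not state the units of height or length, so none are assumed.

```python
# Coefficients and intercept as stated in the question.
INTERCEPT = 4.7623
COEF_HEIGHT = 0.2539
COEF_AGE = -0.0075

def predict_baby_length(mother_height, mother_age):
    """Linear model: intercept + coef_height * height + coef_age * age."""
    return INTERCEPT + COEF_HEIGHT * mother_height + COEF_AGE * mother_age

# Hypothetical inputs (units are not stated in the question).
base = predict_baby_length(160, 30)
taller = predict_baby_length(161, 30)
older = predict_baby_length(160, 31)

height_effect = taller - base  # change for one extra unit of height, age fixed
age_effect = older - base      # change for one extra unit of age, height fixed
```

Here `height_effect` equals the height coefficient (0.2539) and `age_effect` equals the age coefficient (-0.0075), up to floating-point rounding, because each coefficient is the change in the prediction per unit change in its variable with the other held constant.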
5. You plan to build a machine learning model to predict whether a patient in a hospital is "healthy" or
"not healthy" based on the patient’s medical measurements. The dataset is highly imbalanced:
"not healthy" individuals outnumber "healthy" individuals.
(a) (5 marks) To evaluate the performance of a trained model, you can create a confusion matrix
comparing the predicted results with the class labels of the testing data. From the
confusion matrix, you calculated the accuracy score. Explain why reporting the accuracy score on such a
dataset is not indicative of the model’s true performance. What measures should you take to mitigate
any inflated results? What other metrics can you derive from the confusion matrix that are truly
indicative of the model’s robust performance? (Write no more than 200 words in total)
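As a rough illustration of the relationship between a confusion matrix and the metrics derived from it, the sketch below computes accuracy, precision, recall, and F1 from the four cells of a binary confusion matrix. The counts are hypothetical and are not taken from any dataset in this exam; "healthy" is treated as the positive (minority) class by assumption.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute common metrics from the four cells of a binary confusion matrix."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical test set: 950 "not healthy" and 50 "healthy" patients.
# The model predicts "not healthy" for everyone, so no positives are found.
metrics = classification_metrics(tp=0, fp=0, fn=50, tn=950)
```

With these hypothetical counts the accuracy is 0.95 while recall and F1 for the minority class are both 0.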
(b) (5 marks) If the training data size is very big (e.g., 1 billion data instances) and the testing dataset
has 1000 instances, which model do you prefer to use, KNN (k-Nearest Neighbors) classifier or
Naïve Bayes classifier? Justify your answer. (Write no more than 200 words in total)
6. There is a robot in an animal shelter which needs to learn to discriminate between dogs and cats based on
the fur and colour features. You are required to train the robot with classification models on the
following dataset (Table 2) and make a prediction on a testing data instance. The feature Fur takes
one of two possible values (Coarse and Fine), and Colour also takes one of two possible values
(Brown and Black). For notational convenience, you can use X1 and X2 to represent the two features
respectively, and Y to represent the prediction target during the inference.
Table 2: Animal Data

Index   Fur      Colour   class
#1      Coarse   Brown    Dog
#2      Fine     Black    Cat
#3      Coarse   Black    Cat
#4      Coarse   Black    Dog
#5      Fine     Brown    Cat

(a) (5 marks) You are required to build a KNN (k-Nearest Neighbors) classification model and predict
the class label for the following data instance (#6 in Table 3). You can randomly choose k from its
possible value range to consider the k-nearest neighbors. The distance between two data instances
is calculated as the number of features having different values. For example, the distance between
the 1st and the 2nd data instances is 2 because they differ from each other on both features ‘Fur’
and ‘Colour’. Specify the value of k you will use, and show the details of learning and prediction.
Table 3: Testing Dataset

Index Fur Colour class
#6 Fine Brown
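
The distance defined in part (a), the number of features on which two instances differ, can be sketched in Python as follows. The function names are hypothetical, and breaking distance ties by table order is an assumption of this sketch (the question does not specify a tie-breaking rule).

```python
from collections import Counter

# Training data from Table 2 as (Fur, Colour, class).
TRAIN = [
    ("Coarse", "Brown", "Dog"),
    ("Fine", "Black", "Cat"),
    ("Coarse", "Black", "Cat"),
    ("Coarse", "Black", "Dog"),
    ("Fine", "Brown", "Cat"),
]

def mismatch_distance(a, b):
    """Number of features on which two instances differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

def knn_predict(query, k):
    """Majority vote among the k training instances closest to query.

    sorted() is stable, so equally distant instances keep table order
    (a tie-breaking assumption of this sketch).
    """
    ranked = sorted(TRAIN, key=lambda row: mismatch_distance(row[:2], query))
    votes = Counter(label for _, _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Reproduces the example in the question: instances #1 and #2 differ on both features.
d12 = mismatch_distance(TRAIN[0][:2], TRAIN[1][:2])  # 2

prediction = knn_predict(("Fine", "Brown"), k=1)
```

With k = 1 the only training instance at distance 0 from #6 is #5, so no tie-breaking is involved in that case.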

(b) (10 marks) You are required to build a Naïve Bayes classifier from the dataset and predict
the class label for the data instance #6, using the Laplacian correction technique if the zero-probability issue occurs. Show the details of learning and prediction.
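
A minimal sketch of a Naïve Bayes calculation with add-one (Laplacian) correction on the Table 2 data is given below. It makes two assumptions that the question leaves open: the correction is applied to every conditional probability (not only the one that is zero), and the class priors are left unsmoothed. Exact fractions are used so the intermediate values can be compared with a hand calculation.

```python
from fractions import Fraction

# Training data from Table 2 as (Fur, Colour, class).
TRAIN = [
    ("Coarse", "Brown", "Dog"),
    ("Fine", "Black", "Cat"),
    ("Coarse", "Black", "Cat"),
    ("Coarse", "Black", "Dog"),
    ("Fine", "Brown", "Cat"),
]
# Possible values of each feature, in column order (Fur, Colour).
FEATURE_VALUES = [("Coarse", "Fine"), ("Brown", "Black")]

def smoothed_likelihood(feature_idx, value, label):
    """P(feature = value | class = label) with add-one (Laplace) smoothing."""
    in_class = [row for row in TRAIN if row[2] == label]
    matches = sum(1 for row in in_class if row[feature_idx] == value)
    n_values = len(FEATURE_VALUES[feature_idx])
    return Fraction(matches + 1, len(in_class) + n_values)

def class_score(query, label):
    """Unnormalised posterior: class prior times the smoothed likelihoods."""
    score = Fraction(sum(1 for row in TRAIN if row[2] == label), len(TRAIN))
    for idx, value in enumerate(query):
        score *= smoothed_likelihood(idx, value, label)
    return score

query = ("Fine", "Brown")  # instance #6 from Table 3
scores = {label: class_score(query, label) for label in ("Dog", "Cat")}
prediction = max(scores, key=scores.get)
```

Under these assumptions the unnormalised scores are 1/20 for Dog and 18/125 for Cat. A different smoothing convention would change the numbers.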
7. (a) (5 marks) The linear regression model can be regarded as a simple type of artificial neural network. From the perspective of artificial neural networks, what activation function corresponds
to the linear regression model? Specify the mathematical form of the activation function. Is it a
good idea to build multi-layer neural network models with this activation function? Justify your
answer. (Write no more than 200 words in total)
(b) (10 marks) As the gradient descent method can be used to learn model parameters in neural
network models, you can use it to estimate the parameters in a linear regression model. You
are required to perform the initial steps of gradient descent on the following dataset (Table 4) to
estimate the parameters $w_0$ and $w_1$ for the linear regression model $y = w_0 + w_1 x$. The sum of
squared errors is used for the loss function. Concretely, you need to formulate the loss function
$L(w_0, w_1)$ and derive its gradient $\nabla L(w_0, w_1) = \left( \frac{\partial L(w_0, w_1)}{\partial w_0}, \frac{\partial L(w_0, w_1)}{\partial w_1} \right)$. Then, pick a pair of values
randomly to initialize $w_0$ and $w_1$, and evaluate the gradient at those $w_0$ and $w_1$ values. Show the
key steps of inference and calculation.
Table 4: 2-Dimensional Data

Index   X   Y
#1      1   1
#2      2   3

(c) (5 marks) Based on the gradient obtained in the above step, update the estimate for w0 and w1.
Assume that the learning rate η is 0.5. Show the key steps of inference and calculation.
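One way to check a hand calculation for parts (b) and (c) is the short Python sketch below. It rests on two assumptions: Table 4 is read as the points (x, y) = (1, 1) and (2, 3), and the starting point w0 = w1 = 0 is an arbitrary choice, since the question allows any initial values. The function names are hypothetical.

```python
# Table 4 read as the points (x, y) = (1, 1) and (2, 3); this reading is an assumption.
DATA = [(1.0, 1.0), (2.0, 3.0)]

def loss(w0, w1):
    """Sum of squared errors for the model y = w0 + w1 * x."""
    return sum((y - (w0 + w1 * x)) ** 2 for x, y in DATA)

def gradient(w0, w1):
    """Partial derivatives of the loss with respect to w0 and w1."""
    residuals = [(x, y - (w0 + w1 * x)) for x, y in DATA]
    d_w0 = -2.0 * sum(r for _, r in residuals)
    d_w1 = -2.0 * sum(x * r for x, r in residuals)
    return d_w0, d_w1

def gradient_step(w0, w1, eta):
    """One gradient descent update: w <- w - eta * gradient."""
    d_w0, d_w1 = gradient(w0, w1)
    return w0 - eta * d_w0, w1 - eta * d_w1

w0, w1 = 0.0, 0.0  # arbitrary starting point (assumption)
grad = gradient(w0, w1)
new_w0, new_w1 = gradient_step(w0, w1, eta=0.5)
```

From this starting point the gradient is (-8, -14) and the updated estimate is (4, 7). The loss rises from 10 to 325 after this single step, which suggests that a learning rate of 0.5 overshoots for this data and initialisation.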
8. The following dataset (Table 5) describes COVID-19 testing records for 5 people. You want to build
a decision tree classification model from the dataset to predict if a person suffers from COVID-19 or
not according to the two symptoms Cough and Fever. Both features, Cough and Fever, take one of
two possible values: yes (having the symptom) and no (not having the symptom). The target attribute
COVID-19 also takes one of two possible values: yes (infected) and no (normal). For notational
convenience, you can use X1 and X2 to represent the two features respectively, and Y to represent the
prediction target.
Table 5: COVID-19 Data

Index Cough Fever COVID-19
#1 no no no
#2 yes yes yes
#3 no yes yes
#4 no yes no
#5 yes no no

(a) (10 marks) You are required to build a decision tree with the Gini impurity heuristic. Show the
key steps of inference and calculation.
(b) (5 marks) Which issue might the decision tree model built above suffer from, overfitting or underfitting? Propose two different strategies to mitigate the possible issue with justification. (Write
no more than 200 words in total)
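The Gini impurity heuristic referred to in part (a) can be sketched in Python as follows. The code computes the impurity of the full Table 5 dataset and the size-weighted impurity of a split on each feature; it covers only the choice of the root split, not the whole tree, and the function names are hypothetical.

```python
from collections import Counter

# Table 5 as (Cough, Fever, COVID-19).
RECORDS = [
    ("no", "no", "no"),
    ("yes", "yes", "yes"),
    ("no", "yes", "yes"),
    ("no", "yes", "no"),
    ("yes", "no", "no"),
]

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def weighted_gini(feature_idx):
    """Size-weighted Gini impurity of the groups produced by splitting on one feature."""
    groups = {}
    for row in RECORDS:
        groups.setdefault(row[feature_idx], []).append(row[2])
    return sum(len(g) / len(RECORDS) * gini(g) for g in groups.values())

root_gini = gini([row[2] for row in RECORDS])
split_scores = {"Cough": weighted_gini(0), "Fever": weighted_gini(1)}
```

For Table 5 the root impurity is 0.48, and the weighted impurities are 7/15 (about 0.467) for a split on Cough and 4/15 (about 0.267) for a split on Fever. The feature with the lower weighted impurity gives the purer split.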