Week 1. Introduction to Data Mining

Test 1

1. Which one is not a description of data mining?
a. Extraction of interesting patterns or knowledge
b. Exploration and analysis by automatic or semi-automatic means
c. Discovering meaningful patterns from large quantities of data
d. Appropriate statistical analysis methods to analyze the data collected
2. Which one describes the right process of knowledge discovery?
a. Selection - preprocessing - transformation - data mining - interpretation/evaluation
b. Preprocessing - transformation - data mining - selection - interpretation/evaluation
c. Data mining - selection - interpretation/evaluation - preprocessing - transformation
d. Transformation - data mining - selection - preprocessing - interpretation/evaluation
3. Which one does not belong to the process of KDD?
a. Data mining
b. Data description
c. Data cleaning
d. Data selection
4. Which one is not a right alternative name for data mining?
a. Knowledge extraction
b. Data archeology
c. Data dredging
d. Data harvesting
5. Which of the following are nominal variables?
a. Occupation
b. Education
c. Age
d. Color
6. Which one is wrong about classification and regression?
a. Regression analysis is a statistical methodology that is most often used for numeric prediction.
b. We can construct classification models (functions) without any training examples.
c. Classification predicts categorical (discrete, unordered) labels.
d. Regression models predict continuous-valued functions.
7. Which one is wrong about clustering and outliers?
a. Clustering belongs to supervised learning.
b. Principles of clustering include maximizing intra-class similarity and minimizing inter-class similarity.
c. Outlier analysis can be useful in fraud detection and rare-events analysis.
d. An outlier is a data object that does not comply with the general behavior of the data.
8. About data processing, which one is wrong?
a. When making data discrimination, we compare the target class with one or a set of comparative classes (the contrasting classes).
b. When making data classification, we predict categorical labels, excluding unordered ones.
c. When making data characterization, we summarize the data of the class under study (the target class) in general terms.
d. When making data clustering, we group data to form new categories.
9. Outlier mining, such as the density-based method, belongs to supervised learning.
10. Support vector machines can be used for classification and regression.
Week 2. Data Pre-processing

Test 2

1. Which is not a reason we need to preprocess the data?
a. To save time
b. To avoid unreliable output
c. To eliminate noise
d. To make the results meet our hypothesis
2. How is a new feature space constructed by PCA?
a. The new feature space is constructed by choosing the features you think are most important.
b. The new feature space is constructed by normalizing the input data.
c. The new feature space is constructed by selecting features randomly.
d. The new feature space is constructed by eliminating the weak components, reducing the size of the data.
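As an aside, a minimal sketch of option d with scikit-learn: project standardized data onto the strongest principal components and drop the weak ones. The data here is purely illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Illustrative data: 100 samples, 10 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))

# PCA assumes centered (often standardized) input.
X_std = StandardScaler().fit_transform(X)

# Keep only enough components to explain 95% of the variance;
# the weak components are eliminated, reducing the size of the data.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_std)
print(X_reduced.shape, pca.explained_variance_ratio_)
```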
3. Which one is right about wavelet transforms?
a. Wavelet transforms store large fractions of the strongest wavelet coefficients.
b. Wavelet transforms are completely different from the discrete Fourier transform (DFT).
c. They can be used for reducing data and smoothing data.
d. Wavelet transforms are applied to pairs of data, resulting in two sets of data of length L.
4. Which one is wrong about methods for discretization?
a. Histogram analysis and binning are both unsupervised methods.
b. Clustering analysis only belongs to top-down splitting.
c. Interval merging by χ² (chi-square) analysis can be applied recursively.
d. Decision-tree analysis is entropy-based discretization.
5. Which one is wrong about equal-width (distance) partitioning and equal-depth (frequency) partitioning?
a. Equal-width partitioning is the most straightforward, but outliers may dominate the presentation.
b. Equal-depth partitioning divides the range into N intervals, each containing approximately the same number of samples.
c. The intervals of the former are not equal.
d. The number of tuples per interval is the same when using the latter.
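A minimal sketch contrasting the two schemes with pandas: pd.cut gives equal-width bins, pd.qcut gives equal-depth (equal-frequency) bins. The data reuses the age values from assignment 2 below; the bin count of 3 is arbitrary.

```python
import pandas as pd

ages = pd.Series([13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25,
                  25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45,
                  46, 52, 70])

# Equal-width: intervals of equal length; the outlier 70 stretches the bins.
equal_width = pd.cut(ages, bins=3)

# Equal-depth: each bin holds roughly the same number of samples.
equal_depth = pd.qcut(ages, q=3)

print(equal_width.value_counts().sort_index())
print(equal_depth.value_counts().sort_index())
```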
6. Which one is not a correct way to normalize data?
a. Min-max normalization
b. Simple scaling
c. Z-score normalization
d. Normalization by decimal scaling
7. Which are the major tasks in data preprocessing?
a. Cleaning
b. Integration
c. Transition
d. Reduction
8. Which are the right ways to fill in missing values?
a. Smart mean
b. Probable value
c. Ignore
d. Falsify
9. Which are the right ways to handle noisy data?
a. Regression
b. Clustering
c. WT (wavelet transform)
d. Manual
10. Which are the commonly used ways of sampling?
a. Simple random sampling without replacement
b. Simple random sampling with replacement
c. Stratified sampling
d. Cluster sampling
11. Discretization means dividing the range of a continuous attribute into intervals.
Assignment 2

1. Suppose you obtained a dataset that has some missing values. How will you deal with these missing values?
2. Given the following data (in increasing order) for the attribute age: 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
(a) Use min-max normalization to transform the value 35 for age onto the range [0.0, 1.0].
(b) Use z-score normalization to transform the value 35 for age, where the standard deviation of age is 12.94 years.
(c) Use normalization by decimal scaling to transform the value 35 for age.
(d) Comment on which method you would prefer to use for the given data, giving reasons why.
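A minimal sketch of the three normalization formulas applied to this data; the numbers follow directly from the definitions, so treat the output as a check on your hand calculation rather than an answer key.

```python
import numpy as np

age = np.array([13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25,
                25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70])
v = 35

# (a) Min-max normalization onto [0.0, 1.0]: (v - min) / (max - min)
print((v - age.min()) / (age.max() - age.min()))   # 22/57 ≈ 0.386

# (b) Z-score normalization with the given std of 12.94: (v - mean) / std
print((v - age.mean()) / 12.94)                    # ≈ 0.39

# (c) Decimal scaling: divide by 10**j, where j is the smallest integer
# such that all scaled values have absolute value below 1 (max is 70, so j=2).
print(v / 10**2)                                   # 0.35
```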
Week 3. Instance-Based Learning

Test 3

1. What is the difference between an eager learner and a lazy learner?
a. Eager learners generate a model for classification, while lazy learners do not.
b. Eager learners classify a tuple based on its similarity to the stored training tuples, while lazy learners do not.
c. Eager learners simply store the data (or do only a little minor processing), while lazy learners do not.
d. Lazy learners generate a model for classification, while eager learners do not.
2. How do we choose the optimal value for k?
a. Cross-validation can be used to determine a good value, by using an independent dataset to validate the k values.
b. Low values for k (like k=1 or k=2) can be noisy and subject to the effects of outliers.
c. A large k value can reduce the overall noise, so the value for k can be as big as possible.
d. Historically, the optimal k for most datasets has been between 3 and 10.
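As a side note to option a, a minimal sketch of selecting k by cross-validation with scikit-learn; the iris dataset and the candidate range 3-10 are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Score each candidate k with 5-fold cross-validation and keep the best.
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X, y, cv=5).mean()
          for k in range(3, 11)}
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```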
3. What are the major components of kNN?
a. How to measure similarity?
b. How to choose k?
c. How are class labels assigned?
d. How to decide the distance?
4. Which of the following ways can be used to obtain attribute weights for attribute-weighted kNN?
a. Prior knowledge / experience.
b. PCA, FA (factor analysis).
c. Information gain.
d. Gradient descent, simplex methods, and genetic algorithms.
5. At the learning stage, kNN would find the k closest neighbors and then decide the class from the k identified nearest labels.
6. At the classification stage, kNN would store all instances or some typical ones.
7. Normalizing the data can solve the problem that different attributes have different value ranges.
8. With Euclidean distance or Manhattan distance, we can calculate the distance between two instances.
9. Data normalization before measuring distance is generally done to avoid errors caused by different dimensions, self-variation, or large numerical differences.
10. The way to obtain a regression value for a new instance from the k nearest neighbors is to calculate the average value of the k neighbors.
11. The way to obtain a classification for a new instance from the k nearest neighbors is to take the majority class of the k neighbors.
12. The way to obtain instance weights for distance-weighted kNN is to calculate the reciprocal of the squared distance between the object and each neighbor.
Assignment 3

1. You are required to build a kNN model with the given data sets. In a section of a highway, 19 sensors are set up to collect the speed and volume of vehicles at each point. The travel time required to pass this section is also captured, so each instance contains 41 attributes: serial number, time tag, speed and volume at 19 positions, and travel time. There are 1605 instances in total. We generated 5 files in xlsx format using different random sample proportions. Four of the files consist of two sheets: the sheet named 'train' is the training set and 'test' is the testing set. One data set consists of a single sheet named 'cv-data', generated for cross-validation. You need to finish the following tasks.
Input: speed1, volume1, speed2, volume2, speed3, volume3, ..., speed19, volume19
Output: travel time
Note: 1) Different attributes have different value ranges, so normalization needs to be done first, before distances are calculated. 2) In tasks 3 and 4, you are required to apply DW-KNNA (the distance-weighted k-nearest neighbor algorithm) to predict travel time. The definition of DW-KNNA can be found in the following paper.
[1] Song J, Zhao J, Dong F, et al. A novel regression modeling method for PMSLM structural design optimization using a distance-weighted KNN algorithm. IEEE Transactions on Industry Applications, 2018, 54(5): 4198-4206.
Tasks:
(1) Use different values of k (k=3, 4, 6), the number of neighbors, to predict travel time. How good is their prediction accuracy? (Use the data sets from the file named 60% for training and 40% for testing_knn.xlsx.)
(2) Use hold-one-out cross-validation to select the best k value. What is it? Plot the scatter diagram of predicted travel time versus measured travel time. (Use the data sets from the file named hold-one-out_cv_knn.xlsx.)
(3) Use different values of k (k=3, 4, 6), the number of neighbors, to predict travel time. How good is their prediction accuracy when using DW-KNNA? (Use the data sets from the file named 60% for training and 40% for testing_dw-knna.xlsx.)
(4) Use different proportions of the training set (60%, 70%, 80%). How good is their prediction accuracy when using DW-KNNA (k=10)? (Use the data sets from the files named 60% for training and 40% for testing_dw-knna.xlsx, 70% for training and 30% for testing_dw-knna.xlsx, and 80% for training and 20% for testing_dw-knna.xlsx.)
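A minimal sketch of the distance-weighted kNN regression the assignment asks for, with min-max normalization done before distances are computed. The file and sheet names follow the assignment text; the column names ("speed1", "volume1", ..., "travel-time") and the error metric are assumptions to verify against the actual data, and the 1/d² weighting follows statement 12 of the test, so check the cited paper for the exact DW-KNNA definition.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import MinMaxScaler

# Sheets 'train' and 'test' as described in the assignment.
fname = "60% for training and 40% for testing_dw-knna.xlsx"
train = pd.read_excel(fname, sheet_name="train")
test = pd.read_excel(fname, sheet_name="test")

# Assumed column layout: speed1..speed19, volume1..volume19, travel-time.
X_cols = [c for c in train.columns if c.startswith(("speed", "volume"))]
scaler = MinMaxScaler().fit(train[X_cols])  # normalize before distances

for k in (3, 4, 6):
    # weights="distance" would give 1/d; the 1/d**2 instance weighting of
    # DW-KNNA is passed as a callable (assumption based on statement 12).
    knn = KNeighborsRegressor(n_neighbors=k,
                              weights=lambda d: 1.0 / (d**2 + 1e-12))
    knn.fit(scaler.transform(train[X_cols]), train["travel-time"])
    pred = knn.predict(scaler.transform(test[X_cols]))
    mape = np.mean(np.abs(pred - test["travel-time"]) / test["travel-time"])
    print(k, mape)
```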
Week 4. Decision Trees

Test 4

1. Which descriptions are right about nodes in a decision tree?
a. Internal nodes test the value of particular features.
b. Leaf nodes specify the class.
c. Branch nodes decide the result.
d. Root nodes decide the start point.
2. Computing the information gain for a continuous-valued attribute A when using ID3 consists of the following procedure:
a. Sort the values of A in increasing order.
b. Consider the midpoint between each pair of adjacent values as a possible split point.
c. Select the point with the minimum expected information requirement as the split point.
d. Split.
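A minimal sketch of that procedure: sort, take midpoints of adjacent values, and pick the split with the minimum expected information requirement. The toy data is illustrative.

```python
import numpy as np
from collections import Counter

def entropy(labels):
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_split(values, labels):
    order = np.argsort(values)                 # step a: sort the values of A
    v, y = np.asarray(values)[order], np.asarray(labels)[order]
    best = (None, np.inf)
    for i in range(len(v) - 1):
        if v[i] == v[i + 1]:
            continue
        mid = (v[i] + v[i + 1]) / 2            # step b: midpoint split point
        left, right = y[:i + 1], y[i + 1:]
        # Expected information requirement Info_A(D) for this split.
        info = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        if info < best[1]:                     # step c: minimum Info_A(D)
            best = (mid, info)
    return best

print(best_split([25, 32, 40, 47, 52], ["no", "no", "yes", "yes", "no"]))
```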
3. Which are typical algorithms for generating trees?
a. ID3
b. C4.5
c. CART
d. PCA
4. Which ones are right about underfitting and overfitting?
a. Underfitting means poor accuracy both for training data and unseen samples.
b. Overfitting means high accuracy for training data but poor accuracy for unseen samples.
c. Underfitting implies the model is too simple, so we need to increase the model complexity.
d. Overfitting occurs when there are too many branches, so we need to decrease the model complexity.
5. Which ones are right about pre-pruning and post-pruning?
a. Both of them are methods to deal with the overfitting problem.
b. Pre-pruning does not split a node if this would result in the goodness measure falling below a threshold.
c. Post-pruning removes branches from a "fully grown" tree.
d. There is no need to choose an appropriate threshold when doing pre-pruning.
6. Post-pruning in CART consists of the following procedure:
a. First, consider the cost complexity of a tree.
b. Then, for each internal node N, compute the cost complexity of the subtree at N.
c. Also compute the cost complexity of the subtree at N if it were to be pruned.
d. Finally, compare the two values. If pruning the subtree at node N would result in a smaller cost complexity, the subtree is pruned; otherwise, it is kept.
7. The cost-complexity pruning algorithm used in CART evaluates cost complexity by the number of leaves in the tree and the error rate.
8. Gain ratio is used as the attribute selection measure in C4.5, and the formula is GainRatio(A) = Gain(A) / SplitInfo(A).
9. A rule is created for each path from the root to a leaf node.
10. ID3 uses information gain as its attribute selection measure, and the attribute with the lowest information gain is chosen as the splitting attribute for node N.
Assignment 4

1. Calculation of the information gain for a traffic conflict problem. The questions can be seen in the file assignment 4.docx.
Week 5. Support Vector Machine

Test 5

1. What are the features of SVM?
a. Extremely slow, but highly accurate.
b. Much less prone to overfitting than other methods.
c. A black-box model.
d. Provides a compact description of the learned model.
2. Which are the typical common kernels?
a. Linear
b. Polynomial
c. Radial basis function (Gaussian kernel)
d. Sigmoid kernel
3. What adaptations can be made to allow SVM to deal with the multiclass classification problem?
a. One versus rest (OVR)
b. One versus one (OVO)
c. Error-correcting input codes (ECIC)
d. Error-correcting output codes (ECOC)
4. What are the problems of OVR?
a. It is sensitive to the accuracy of the confidence figures produced by the classifiers.
b. The scale of the confidence values may differ between the binary classifiers.
c. The binary classification learners see unbalanced distributions.
d. Only when the class distribution is balanced can balanced distributions be attained.
5. Which ones are right about the advantages of SVM?
a. SVMs are accurate in high-dimensional spaces.
b. SVMs are memory efficient.
c. The algorithm is not prone to overfitting compared to other classification methods.
d. The support vectors are the essential or critical training tuples.
6. The kernel trick is used to avoid costly computation and to deal with the mapping problem.
7. There is no structured way and there are no golden rules for setting the parameters in SVM.
8. Error-correcting output codes (ECOC) is a kind of problem transformation technique.
9. Regression formulas include three types: linear, nonlinear, and general form.
10. If you have a big dataset, SVM is suitable for efficient computation.
Assignment 5

1. SVM for incident duration prediction. In this exercise, you are required to use support vector machines (SVMs) to predict incident duration. Through this assignment you can gain a deep understanding of how to use SVMs. The data comes from the national incident management center for towing operations. These data were provided by towing officers, police, and Rijkswaterstaat road inspectors who perform incident handling. The data was collected from 1st May to 13th September 2005 in the region of Utrecht. You can find 1853 registrations of incidents in total in incidentduration.csv. test_set.csv extracts 50% of the data and is used to test the SVM, while train_set.csv contains the remainder and is used to train the SVM. The attributes are as follows:
1. Incident type (stopped vehicle, lost load, accident)
2. Kind of vehicles involved (passenger cars, trucks, n/a)
3. Police required (yes, no)
4. Track research (yes, no)
5. Ambulance required (yes, no)
6. Fire brigades required (yes, no)
7. Repair service required (yes, no)
8. Tow truck required (yes, no)
9. Road inspector (yes, no)
10. Lane closure (yes, no)
11. Road repair required (yes, no)
12. Fluid to be cleaned (yes, no)
13. Damage to road equipment (yes, no)
14. Number of vehicles involved (single, two, more)
15. Type of bergings (towing) task (onb, cmi, cmv)
16. Day of the week (workdays, weekend)
17. Start and end time (during peak hour, off peak hour)
18. Duration (short, long)
Build prediction models with SVM to complete the following tasks.
a. Use train_set.csv as the training data set to build a model, and test on test_set.csv. Report the accuracy of the train and test models you get. (Suggestion: in data preprocessing, you can deal with independent variables that are nominal by using a one-hot representation, so that the corresponding value of each feature is guaranteed to be 0 or 1. This is easy to implement with the get_dummies(data) function in the pandas package.)
b. Use incidentduration.csv to build a model using 10-fold cross-validation. Report the accuracy of the train and test models you get.
c. Build a prediction model again after feature reduction (keep 80% of the variance). Report the accuracy of the train and test models you get.
d. Which model gives the highest accuracy on the test set? Why? Give your explanation.
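A minimal sketch of tasks a-c with pandas and scikit-learn. The file names and the get_dummies suggestion come from the assignment text; the target column name "duration" and the exact CSV layout are assumptions to verify against the data.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Task a: train/test split provided as two files.
train = pd.read_csv("train_set.csv")
test = pd.read_csv("test_set.csv")

# One-hot encode the nominal independent variables, as suggested.
X_train = pd.get_dummies(train.drop(columns=["duration"]))
X_test = pd.get_dummies(test.drop(columns=["duration"]))
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)

clf = SVC(kernel="rbf").fit(X_train, train["duration"])
print("train acc:", clf.score(X_train, train["duration"]))
print("test acc:", clf.score(X_test, test["duration"]))

# Task b: 10-fold cross-validation on the full data set.
full = pd.read_csv("incidentduration.csv")
X_full = pd.get_dummies(full.drop(columns=["duration"]))
print("cv acc:", cross_val_score(SVC(), X_full, full["duration"], cv=10).mean())

# Task c: feature reduction keeping 80% of the variance, then refit.
X_red = PCA(n_components=0.80).fit_transform(X_full)
print("cv acc (PCA):",
      cross_val_score(SVC(), X_red, full["duration"], cv=10).mean())
```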
Week 6. Outlier Mining

Test 6

1. Which descriptions correctly describe outliers?
a. Outliers caused by measurement error
b. Outliers reflecting ground truth
c. Outliers caused by equipment failure
d. Outliers that always need to be dropped
2. What are application cases of outlier mining?
a. Traffic incident detection
b. Credit card fraud detection
c. Network intrusion detection
d. Medical analysis
3. Which ones are methods to detect outliers?
a. Statistics-based approach
b. Distance-based approach
c. Bulk-based approach
d. Density-based approach
4. How do we pick the right k by a heuristic method for the density-based outlier mining method?
a. k should be at least 10 to remove unwanted statistical fluctuations.
b. Picking 10 to 20 appears to work well in general.
c. Pick the upper bound value for k as the maximum number of "close by" objects that can potentially be global outliers.
d. Pick the upper bound value for k as the maximum number of "close by" objects that can potentially be local outliers.
5. Which ones are right about the three methods of outlier mining?
a. The statistics-based approach is simple and fast, but it is difficult for it to deal with periodic data and categorical data.
b. The efficiency of the distance-based approach is low for large data sets in high-dimensional space.
c. The distance-based approach cannot be used on multidimensional data sets.
d. The density-based approach spends little cost on searching the neighborhood.
6. Distance-based outlier mining is not suitable for data sets that do not fit any standard distribution model.
7. The statistics-based method requires knowing the distribution of the data and the distribution parameters in advance.
8. When identifying outliers with a discordancy test, a data point is considered an outlier if it falls within the confidence interval.
9. Mahalanobis distance accounts for the relative dispersions and inherent correlations among vector elements, which makes it different from Euclidean distance.
10. An outlier is a data object that deviates significantly from the rest of the objects, as if it were generated by a different mechanism.
Assignment 6

1. You are required to use outlier mining methods to detect the outliers in the given data sets. In a section of a city road, several cameras are set up to collect the license plates of vehicles from 2017-06-09 to 2017-06-12, as well as the date and time when they pass the start point and the finish point. Travel time is calculated afterwards. The time serial is another form of the start time. So each instance contains 8 attributes: serial number, license plate number, date and time passing the start/end point, time serial, and travel time. There are 4977 instances in total. You need to finish the following tasks.
Tasks:
(1) Use the statistics-based approach to detect the outliers of travel time. Calculate the mean value and the variance of travel time. Write out the confidence interval. Take the time serial as the x-axis and the travel time as the y-axis. Plot the scatter diagram and mark the outliers you have recognized.
(2) Use the distance-based approach to detect the outliers of travel time. An object o in data set D is defined as an outlier with parameters r and π, written DB(r, π): if the fraction of the objects in D that lie at a distance less than r from o is less than π, then o is an outlier. Let the parameter r vary from 0.1 to 0.3 with a step of 0.1, and π vary from 30 to 90 with a step of 30; find the outliers and the number of outliers. You can use the Euclidean distance.
(3) Use the density-based approach to detect the outliers of travel time. With different k (from 3 to 400 with a step of 5), the number of neighbors, calculate the LOF for each data point. Set 2.0 as the threshold for LOF: an object is labeled as an outlier if its LOF exceeds 2.0. First, take the k value as the x-axis and the number of outliers as the y-axis, and plot the line chart. Second, calculate the LOF for each data point and give the top 4 outliers; use k=350 and the Euclidean distance.
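A minimal sketch of the three approaches with scikit-learn. The file and column names are illustrative; since the assignment's π ranges over 30-90, the DB(r, π) sketch treats π as a count of neighbors (if it is meant as a fraction or percentage, divide the count by the data size).

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor

df = pd.read_csv("travel_times.csv")       # illustrative file name
t = df["travel_time"].to_numpy(dtype=float)

# (1) Statistics-based: flag points outside mean ± 3 standard deviations.
mu, sigma = t.mean(), t.std()
stat_outliers = (t < mu - 3 * sigma) | (t > mu + 3 * sigma)
print("interval:", (mu - 3 * sigma, mu + 3 * sigma), stat_outliers.sum())

# (2) Distance-based DB(r, π): o is an outlier if fewer than π objects
# lie within distance r of o (1-D Euclidean distance; count includes o).
r, pi = 0.2, 60
within = np.array([(np.abs(t - x) < r).sum() for x in t])
print("DB(r, pi) outliers:", (within < pi).sum())

# (3) Density-based: LOF with k neighbors; LOF > 2.0 marks an outlier.
lof = LocalOutlierFactor(n_neighbors=350)
lof.fit(t.reshape(-1, 1))
scores = -lof.negative_outlier_factor_     # sklearn stores LOF negated
print("LOF outliers:", (scores > 2.0).sum())
```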
Week 7. Ensemble Learning

Test 7

1. How do we deal with imbalanced data in 2-class classification?
a. Oversampling
b. Undersampling
c. Threshold-moving
d. Ensemble techniques
2. Which one is right when dealing with the class-imbalance problem?
a. Oversampling works by decreasing the number of minority positive tuples.
b. Undersampling works by increasing the number of majority negative tuples.
c. The SMOTE algorithm adds synthetic tuples that are close to the minority tuples in tuple space.
d. Threshold-moving and ensemble methods were empirically observed to outperform oversampling and undersampling.
3. Which steps are necessary when constructing an ensemble model?
a. Creating multiple data sets
b. Constructing a set of classifiers from the training data
c. Combining the predictions made by multiple classifiers to obtain the final class label
d. Finding the best-performing prediction to obtain the final class label
4. Ensembles tend to yield better results when there is significant diversity among the base models.
5. Ensemble methods cannot be parallelized because not every base classifier can be allocated to a different CPU.
6. To generate the single classifiers, different models may be used to deal with different data subsets.
7. In random forest, a random selection of attributes at each node is used to determine the split.
8. Forest-RI creates new attributes that are a linear combination of the existing attributes.
9. The principle of threshold-moving is to move the decision threshold so that the rare-class tuples are easier to classify.
10. Neural network classifiers can be used as classifiers in the threshold-moving approach.
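A minimal sketch of threshold-moving with a probabilistic classifier. The imbalanced data and the moved threshold of 0.2 are illustrative; any classifier that returns class probabilities, such as a neural network, works the same way.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Imbalanced 2-class data: roughly 5% positives (the rare class).
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

# Default threshold 0.5 vs. a moved threshold that favors the rare class.
for threshold in (0.5, 0.2):
    pred = (proba >= threshold).astype(int)
    recall = pred[y_te == 1].mean()   # fraction of rare-class tuples caught
    print(threshold, "recall on rare class:", round(recall, 3))
```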
Assignment 7

1. Suppose you have trained three support vector machines, h1, h2, and h3, returning binary classifications (1 or −1). The observed accuracy of each of the hypotheses is 70%. Assuming that the errors of these hypotheses are independent, what is the predicted accuracy of an ensemble hypothesis H?
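One way to check the reasoning, as a minimal sketch: a majority vote of three independent hypotheses is correct exactly when at least two of them are correct, which is a binomial sum. Confirm against your own derivation.

```python
from math import comb

p, n = 0.7, 3
# Majority vote is correct if at least 2 of the 3 hypotheses are correct.
acc = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(2, n + 1))
print(acc)  # 3 * 0.7**2 * 0.3 + 0.7**3 = 0.784
```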
2. Suppose you have done one iteration of AdaBoost to produce classifier h1, and you find that h1 has the following results on the training data. (Assume the initial weights on the training data are uniform.)
Coursework: Analysis of Driving Behavior

1. In this coursework, you are required to use data mining techniques to study abnormal driving behavior. Please download the attachment and read the detailed information about the coursework in the coursework.docx file. You need to choose one of task 1 and task 2 to do, and then choose one of task 3 and task 4. We hope you gain a good understanding after completing this course.