Data Mining Course After-class Review 2 (HW2) — 中科院, instructor 刘莹

HW2 Due Date: Nov. 23

Part I: Written Assignment

1. a) Compute the Information Gain for Gender, Car Type and Shirt Size.

The class attribute has two values, C0 and C1, with 10 records each out of 20, so the expected information is

$I(C_0, C_1) = I(10, 10) = 1$

For Gender:

$\mathrm{info}_{Gender}(D) = \frac{10}{20}I(6,4) + \frac{10}{20}I(4,6) = \frac{10}{20}\left(-\frac{6}{10}\log_2\frac{6}{10} - \frac{4}{10}\log_2\frac{4}{10}\right) + \frac{10}{20}\left(-\frac{4}{10}\log_2\frac{4}{10} - \frac{6}{10}\log_2\frac{6}{10}\right) = 0.971$

$\mathrm{Gain}(Gender) = I(C_0, C_1) - \mathrm{info}_{Gender}(D) = 1 - 0.971 = 0.029$

For Car Type:

$\mathrm{info}_{CarType}(D) = \frac{4}{20}I(1,3) + \frac{8}{20}I(8,0) + \frac{8}{20}I(1,7) = \frac{4}{20}\left(-\frac{1}{4}\log_2\frac{1}{4} - \frac{3}{4}\log_2\frac{3}{4}\right) + 0 + \frac{8}{20}\left(-\frac{1}{8}\log_2\frac{1}{8} - \frac{7}{8}\log_2\frac{7}{8}\right) = 0.3797$

$\mathrm{Gain}(CarType) = I(C_0, C_1) - \mathrm{info}_{CarType}(D) = 1 - 0.3797 = 0.6203$

For Shirt Size:

$\mathrm{info}_{ShirtSize}(D) = \frac{5}{20}I(3,2) + \frac{7}{20}I(3,4) + \frac{4}{20}I(2,2) + \frac{4}{20}I(2,2) = \frac{5}{20}\left(-\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5}\right) + \frac{7}{20}\left(-\frac{3}{7}\log_2\frac{3}{7} - \frac{4}{7}\log_2\frac{4}{7}\right) + \frac{8}{20}\left(-\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2}\right) = 0.9876$

$\mathrm{Gain}(ShirtSize) = I(C_0, C_1) - \mathrm{info}_{ShirtSize}(D) = 1 - 0.9876 = 0.0124$

b) Construct a decision tree with Information Gain.

① By (a), Car Type has the largest information gain, so it is chosen as the first splitting attribute. Its values are Sports (this branch is pure, since all its records belong to C0, so it needs no further splitting), Family and Luxury.

② Splitting the Luxury branch further (1 record in C0, 7 in C1):

$I(C_0, C_1) = I(1, 7) = 0.5436$

$\mathrm{info}_{Gender}(D) = \frac{1}{8}I(1,0) + \frac{7}{8}I(1,6) = 0 + \frac{7}{8}\left(-\frac{1}{7}\log_2\frac{1}{7} - \frac{6}{7}\log_2\frac{6}{7}\right) = 0.5177$

$\mathrm{Gain}(Gender) = 0.5436 - 0.5177 = 0.0259$

$\mathrm{info}_{ShirtSize}(D) = \frac{2}{8}I(0,2) + \frac{3}{8}I(0,3) + \frac{2}{8}I(1,1) + \frac{1}{8}I(0,1) = 0.25$

$\mathrm{Gain}(ShirtSize) = 0.5436 - 0.25 = 0.2936$

Shirt Size is therefore chosen to split this branch.

③ Splitting the Family branch further (1 record in C0, 3 in C1):

$I(C_0, C_1) = I(1, 3) = 0.811$

$\mathrm{Gain}(Gender) = 0.811 - I(1,3) = 0$ (splitting on Gender leaves the class distribution unchanged)

$\mathrm{Gain}(ShirtSize) = 0.811 - \left(\frac{1}{4}I(1,0) + \frac{1}{4}I(0,1) + \frac{1}{4}I(0,1) + \frac{1}{4}I(0,1)\right) = 0.811$

Shirt Size is therefore chosen to split this branch.

④ From the computations above, the resulting decision tree is as follows. [Decision tree figure omitted from this extract.]

2. (a) Design a multilayer feed-forward neural network (one hidden layer) for the data set in Q1. Label the nodes in the input and output layers.

From the attributes of the data, the input layer has 8 nodes:

x1: Gender (M: x1 = 1; F: x1 = 0)
x2: Car Type = Sports (Y = 1; N = 0)
x3: Car Type = Family (Y = 1; N = 0)
x4: Car Type = Luxury (Y = 1; N = 0)
x5: Shirt Size = Small (Y = 1; N = 0)
x6: Shirt Size = Medium (Y = 1; N = 0)
x7: Shirt Size = Large (Y = 1; N = 0)
x8: Shirt Size = Extra Large (Y = 1; N = 0)

The hidden layer has three nodes, x9, x10 and x11. Since this is a two-class problem, the output layer has a single node, x12 (C0 = 1; C1 = 0). [Network figure omitted. Wij denotes the weight from input node i to hidden node j; for ease of computation, the weights from input node i to nodes 9, 10 and 11 are set equal. W9-12, W10-12 and W11-12 denote the weights from the hidden nodes to the output node.]

c) Using the neural network obtained above, show the weight values after one iteration of the back propagation algorithm, given the training instance "(M, Family, Small)". Indicate your initial weight values and biases and the learning rate used.

The instance (M, Family, Small) has class label C0, so its training tuple is {1, 0, 1, 0, 1, 0, 0, 0} with target output T = 1.
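Tables 1–4 below carry out this single training step by hand, using the sigmoid output $O_j = 1/(1 + e^{-I_j})$ and the standard back-propagation error and update rules. As a cross-check, here is a minimal NumPy sketch of the same step (the initial weights, biases and learning rate are those of Table 1; the variable names are mine); its printed values should match the tables.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Training tuple for (M, Family, Small); target class C0 -> T = 1
x = np.array([1, 0, 1, 0, 1, 0, 0, 0], dtype=float)
T = 1.0
lr = 0.9                                                    # learning rate

# Initial weights: input i feeds hidden nodes 9/10/11 with the same weight
w_in = np.array([0.1, 0.2, 0.1, 0.2, 0.1, 0.2, 0.3, -0.1])  # W_i9 = W_i10 = W_i11
W = np.tile(w_in[:, None], (1, 3))                          # shape (8, 3)
v = np.array([0.1, 0.2, -0.1])                              # W9-12, W10-12, W11-12
theta_h = np.array([0.1, 0.1, -0.1])                        # biases of units 9, 10, 11
theta_o = 0.2                                               # bias of unit 12

# Forward pass
I_h = x @ W + theta_h          # net inputs of hidden units -> [0.4, 0.4, 0.2]
O_h = sigmoid(I_h)             # -> [0.599, 0.599, 0.550]
I_o = O_h @ v + theta_o        # net input of output unit 12 -> 0.325
O_o = sigmoid(I_o)             # -> 0.581

# Backward pass: Err_j = O_j (1 - O_j) * downstream error
err_o = O_o * (1 - O_o) * (T - O_o)     # -> 0.102
err_h = O_h * (1 - O_h) * err_o * v     # -> [0.0025, 0.0049, -0.0025]

# Updates: w += lr * Err_j * O_i, bias += lr * Err_j
W += lr * np.outer(x, err_h)
v += lr * err_o * O_h
theta_h += lr * err_h
theta_o += lr * err_o

print(O_h, O_o, err_o, err_h)
print(W[0], v, theta_h, theta_o)        # W[0] -> [0.102, 0.104, 0.098]
```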
Table 1: Initial inputs, weights, biases and learning rate (the weight Wij from input node i is the same for each hidden node j = 9, 10, 11)

x1 x2 x3 x4 x5 x6 x7 x8:  1  0  1  0  1  0  0  0
W1j W2j W3j W4j:          0.1  0.2  0.1  0.2
W5j W6j W7j W8j:          0.1  0.2  0.3  -0.1
W9-12 W10-12 W11-12:      0.1  0.2  -0.1
θ9 θ10 θ11 θ12:           0.1  0.1  -0.1  0.2
learning rate l:          0.9

Table 2: Net input and net output

Unit j | Net input $I_j$                                    | Output $O_j = 1/(1+e^{-I_j})$
9      | 1(0.1) + 1(0.1) + 1(0.1) + 0.1 = 0.4               | 0.599
10     | 1(0.1) + 1(0.1) + 1(0.1) + 0.1 = 0.4               | 0.599
11     | 1(0.1) + 1(0.1) + 1(0.1) - 0.1 = 0.2               | 0.550
12     | 0.599(0.1) + 0.599(0.2) + 0.550(-0.1) + 0.2 = 0.325 | 0.581

Table 3: Error at each node ($Err_{12} = O_{12}(1-O_{12})(T-O_{12})$; for hidden unit j, $Err_j = O_j(1-O_j)\,Err_{12}\,W_{j\text{-}12}$)

Unit j | $Err_j$
12     | 0.581(1-0.581)(1-0.581) = 0.102
11     | 0.550(1-0.550)(0.102)(-0.1) = -0.0025
10     | 0.599(1-0.599)(0.102)(0.2) = 0.0049
9      | 0.599(1-0.599)(0.102)(0.1) = 0.0025

Table 4: Weight and bias updates ($W_{ij} \leftarrow W_{ij} + l\,Err_j\,O_i$; $\theta_j \leftarrow \theta_j + l\,Err_j$). Weights fed by the zero inputs x2, x4, x6, x7, x8 are unchanged.

W19 = 0.1 + 0.9(0.0025)(1) = 0.102; W110 = 0.1 + 0.9(0.0049)(1) = 0.104; W111 = 0.1 + 0.9(-0.0025)(1) = 0.098
W39, W310, W311 = 0.102, 0.104, 0.098 (same updates as W1j, since x3 = 1)
W59, W510, W511 = 0.102, 0.104, 0.098 (same updates as W1j, since x5 = 1)
W29, W210, W211 = 0.2; W49, W410, W411 = 0.2; W69, W610, W611 = 0.2; W79, W710, W711 = 0.3; W89, W810, W811 = -0.1 (all unchanged)
W912 = 0.1 + 0.9(0.102)(0.599) = 0.155
W1012 = 0.2 + 0.9(0.102)(0.599) = 0.255
W1112 = -0.1 + 0.9(0.102)(0.550) = -0.050
θ9 = 0.1 + 0.9(0.0025) = 0.102; θ10 = 0.1 + 0.9(0.0049) = 0.104; θ11 = -0.1 + 0.9(-0.0025) = -0.102; θ12 = 0.2 + 0.9(0.102) = 0.292

3. a) Suppose the fraction of undergraduate students who smoke is 15% and the fraction of graduate students who smoke is 23%. If one-fifth of the college students are graduate students and the rest are undergraduates, what is the probability that a student who smokes is a graduate student?

Write U for undergraduate, G for graduate, and S for smoking. Then $P(S|U) = 0.15$, $P(S|G) = 0.23$, $P(G) = 0.2$, $P(U) = 0.8$. By Bayes' theorem,

$P(G|S) = \frac{P(S|G)P(G)}{P(S)} = \frac{P(S|G)P(G)}{P(S|U)P(U) + P(S|G)P(G)} = \frac{0.23 \times 0.2}{0.15 \times 0.8 + 0.23 \times 0.2} = 0.277$

b) Given the information in part (a), is a randomly chosen college student more likely to be a graduate or undergraduate student?

Since $P(U) > P(G)$, an undergraduate student.

c) Suppose 30% of the graduate students live in a dorm but only 10% of the undergraduate students live in a dorm. If a student smokes and lives in the dorm, is he or she more likely to be a graduate or undergraduate student? You can assume independence between students who live in a dorm and those who smoke.

Write D for living in a dorm, so $P(D|U) = 0.1$ and $P(D|G) = 0.3$. Using the independence assumption, it suffices to compare the unnormalized posteriors:

$P(G|D \cap S)\,P(D \cap S) = P(D \cap S|G)\,P(G) = P(D|G)\,P(S|G)\,P(G) = 0.3 \times 0.23 \times 0.2 = 0.0138$

$P(U|D \cap S)\,P(D \cap S) = P(D \cap S|U)\,P(U) = P(D|U)\,P(S|U)\,P(U) = 0.1 \times 0.15 \times 0.8 = 0.012$

Since $0.0138 > 0.012$, we have $P(G|D \cap S) > P(U|D \cap S)$, so the student is more likely to be a graduate student.
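These posteriors are straightforward to sanity-check numerically; here is a short plain-Python sketch, with all numbers taken from the problem statement:

```python
# Priors and conditionals from Q3
P_G, P_U = 0.2, 0.8           # graduate / undergraduate priors
P_S_G, P_S_U = 0.23, 0.15     # P(smokes | G), P(smokes | U)
P_D_G, P_D_U = 0.3, 0.1       # P(dorm | G),   P(dorm | U)

# (a) Bayes' theorem: P(G | S)
P_S = P_S_G * P_G + P_S_U * P_U
print(P_S_G * P_G / P_S)               # -> 0.277

# (c) Compare unnormalized posteriors, assuming D and S independent given class
print(P_D_G * P_S_G * P_G)             # -> 0.0138 (graduate wins)
print(P_D_U * P_S_U * P_U)             # -> 0.0120
```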
4. (a) The three cluster centers after the first round of execution.

Round 1: the initial centers are A1 = (4, 2, 5), B1 = (1, 1, 1) and C1 = (11, 9, 2). The ten points are:

Point | x  | y | z
A1    | 4  | 2 | 5
A2    | 10 | 5 | 2
A3    | 5  | 7 | 8
B1    | 1  | 1 | 1
B2    | 2  | 3 | 2
B3    | 3  | 6 | 9
C1    | 11 | 9 | 2
C2    | 1  | 4 | 6
C3    | 9  | 1 | 7
C4    | 5  | 6 | 7

① Compute the Euclidean distance from each point to each center (d(p, A1) denotes the distance from point p to A1, and likewise for B1 and C1; the centers themselves are omitted).

Table 1: Distance from each point to the initial centers

Point | d(p, A1) | d(p, B1) | d(p, C1)
A2    | 7.35     | 9.90     | 4.12
A3    | 5.92     | 10.05    | 8.72
B2    | 3.74     | 2.45     | 10.82
B3    | 5.74     | 9.64     | 11.05
C2    | 3.74     | 5.83     | 11.87
C3    | 5.48     | 10.00    | 9.64
C4    | 4.58     | 8.77     | 8.37

② Assigning each point to its nearest center gives:

Cluster 1: A1, A3, B3, C2, C3, C4
Cluster 2: B1, B2
Cluster 3: C1, A2

(b) The final three clusters.

Round 2: compute the mean of each cluster:

Cluster 1: M1 = (4.5, 4.33, 7.0)
Cluster 2: M2 = (1.5, 2, 1.5)
Cluster 3: M3 = (10.5, 7, 2)

① Distance from each point to the new cluster centers:

Table 2: Distance from each point to the round-1 cluster means

Point | d(p, M1) | d(p, M2) | d(p, M3)
A1    | 3.11     | 4.30     | 8.73
A2    | 7.46     | 9.03     | 2.06
A3    | 2.89     | 8.92     | 8.14
B1    | 7.70     | 1.22     | 11.28
B2    | 5.75     | 1.22     | 9.39
B3    | 3.01     | 8.63     | 10.31
C1    | 9.44     | 11.81    | 2.06
C2    | 3.66     | 4.95     | 10.74
C3    | 5.60     | 9.35     | 7.95
C4    | 1.74     | 7.65     | 7.50

② The clusters after the second assignment are:

Cluster 1: A1, A3, B3, C2, C3, C4
Cluster 2: B1, B2
Cluster 3: C1, A2

③ The second-round clusters are identical to the first-round clusters, so the algorithm stops.

Part II: Lab

Question 1

1. Build a decision tree using data set "transactions" that predicts milk as a function of the other fields. Set the "type" of each field to "Flag", set the "direction" of "milk" as "out", set the "type" of COD as "Typeless", select "Expert" and set the "pruning severity" to 65, and set the "minimum records per child branch" to be 95. Hand-in: a figure showing your tree. [Tree figure omitted from this extract.]

2. Use the model (the full tree generated by Clementine in step 1 above) to make a prediction for each of the 20 customers in the "rollout" data to determine whether the customer would buy milk. Hand-in: your prediction for each of the 20 customers.

From the program output, customers 2, 3, 4, 5, 9, 10, 13, 14, 17 and 18 would buy milk; the remaining customers would not.

3. Hand-in: rules for positive (yes) prediction of milk purchase identified from the decision tree (up to the fifth level; the root is considered as level 1). Compare with the rules generated by Apriori in Homework 1, and submit your brief comments on the rules (e.g., pruning effect).

Association rules derived from the decision tree:

Table 1: Rules derived from the decision tree

Consequent | Antecedent 1       | Antecedent 2
milk       | Juice              |
milk       | Juice              | water
milk       | pasta              |
milk       | Juice              | pasta
milk       | Tomato source      |
milk       | Juice              | Tomato source
milk       | biscuits           |
milk       | Juice              | biscuits
milk       | Yoghurt            |
milk       | Yoghurt            | water
milk       | Yoghurt            | biscuits
milk       | Brioches           |
milk       | Yoghurt            | Brioches
milk       | beer               |
milk       | beer               | biscuits
milk       | rice               |
milk       | beer               | rice
milk       | Frozen vegetables  |
milk       | Frozen vegetables  | biscuits

Table 2: Rules generated by Apriori. [Table omitted from this extract.]

The association rules produced by the decision tree are broadly similar to those produced by Apriori. A few rules are missing from the decision tree because they sit at level six, level seven and below, where the tree was pruned.

Question 2: Churn Management

1. Perform decision tree classification on the training data set. Select all the input variables except state, area_code, and phone_number (since they are only informative for this analysis). Set the "Direction" of class as "out" and its "type" as "Flag". Then specify the "minimum records per child branch" as 40 and the "pruning severity" as 70, and click "use global pruning". Hand-in: the confusion matrix for validation data.

2. Perform a neural network on the training data set using default settings. Again, select all the input variables except state, area_code, and phone_number. Hand-in: the confusion matrix for validation data.

3. Perform logistic regression on the training data set using default settings. Again, select all the input variables except state, area_code, and phone_number. Hand-in: the confusion matrix for validation data.

4. Hand-in: your observations on the model quality for decision tree, neural network and logistic regression using the confusion matrices.
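Clementine is GUI-driven, so steps 1–3 have no code of their own. For readers reproducing the lab without it, here is a rough scikit-learn sketch of the same workflow. The file name churn_train.csv is a placeholder; only the field names state, area_code, phone_number and class come from the assignment, and Clementine's pruning options have no exact scikit-learn counterparts (min_samples_leaf is a loose stand-in for "minimum records per child branch").

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Hypothetical file name; drop the purely informative fields and the target
df = pd.read_csv("churn_train.csv")
X = pd.get_dummies(df.drop(columns=["state", "area_code", "phone_number", "class"]))
y = df["class"]

# Hold out a validation split (mirroring Clementine's train/validation partition)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    # min_samples_leaf=40 loosely mirrors "minimum records per child branch" = 40
    "decision tree": DecisionTreeClassifier(min_samples_leaf=40, random_state=0),
    "neural network": MLPClassifier(max_iter=1000, random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, confusion_matrix(y_val, model.predict(X_val)), sep="\n")
```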
Comparing the confusion matrices, the decision tree has the highest accuracy. The three models predict 48, 30 and 44 customers, respectively, as likely churners, so if a retention promotion is sent to every predicted churner, the neural network's predictions incur the lowest campaign cost. The business also cares about customers who are predicted to be loyal but actually churn; these misclassifications number 25, 9 and 33 for the three models, so here too the neural network performs best.
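To make the cost argument concrete, here is a small sketch. Only the counts 48/30/44 and 25/9/33 come from the confusion matrices above; the two unit-cost constants are made-up placeholders, not figures from the assignment.

```python
# Counts read off the validation confusion matrices (DT, NN, LR)
predicted_churners = {"decision tree": 48, "neural network": 30, "logistic regression": 44}
missed_churners    = {"decision tree": 25, "neural network": 9,  "logistic regression": 33}

# Hypothetical unit costs: contacting a predicted churner vs. losing a missed one
CONTACT_COST, CHURN_LOSS = 1.0, 10.0   # placeholder values

for name in predicted_churners:
    cost = CONTACT_COST * predicted_churners[name] + CHURN_LOSS * missed_churners[name]
    print(f"{name}: total cost {cost:.0f}")
# With these placeholder costs the neural network is cheapest (120 vs. 298 and 374),
# matching the observation above.
```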