YOLOv3: An Incremental Improvement

Joseph Redmon, Ali Farhadi
University of Washington

Abstract

We present some updates to YOLO! We made a bunch of little design changes to make it better. We also trained this new network that's pretty swell. It's a little bigger than last time but more accurate. It's still fast though, don't worry. At 320 × 320 YOLOv3 runs in 22 ms at 28.2 mAP, as accurate as SSD but three times faster. When we look at the old .5 IOU mAP detection metric YOLOv3 is quite good. It achieves 57.9 AP50 in 51 ms on a Titan X, compared to 57.5 AP50 in 198 ms by RetinaNet, similar performance but 3.8× faster. As always, all the code is online at https://pjreddie.com/yolo/.

1. Introduction

Sometimes you just kinda phone it in for a year, you know? I didn't do a whole lot of research this year. Spent a lot of time on Twitter. Played around with GANs a little. I had a little momentum left over from last year [12] [1]; I managed to make some improvements to YOLO. But, honestly, nothing like super interesting, just a bunch of small changes that make it better. I also helped out with other people's research a little.

Actually, that's what brings us here today. We have a camera-ready deadline [4] and we need to cite some of the random updates I made to YOLO but we don't have a source. So get ready for a TECH REPORT!

The great thing about tech reports is that they don't need intros, y'all know why we're here. So the end of this introduction will signpost for the rest of the paper. First we'll tell you what the deal is with YOLOv3. Then we'll tell you how we do. We'll also tell you about some things we tried that didn't work. Finally we'll contemplate what this all means.

2. The Deal

So here's the deal with YOLOv3: We mostly took good ideas from other people. We also trained a new classifier network that's better than the other ones. We'll just take you through the whole system from scratch so you can understand it all.

Method               mAP    time (ms)
B  SSD321            28.0    61
C  DSSD321           28.0    85
D  R-FCN             29.9    85
E  SSD513            31.2   125
F  DSSD513           33.2   156
G  FPN FRCN          36.2   172
   RetinaNet-50-500  32.5    73
   RetinaNet-101-500 34.4    90
   RetinaNet-101-800 37.8   198
   YOLOv3-320        28.2    22
   YOLOv3-416        31.0    29
   YOLOv3-608        33.0    51

Figure 1. We adapt this figure from the Focal Loss paper [9]. YOLOv3 runs significantly faster than other detection methods with comparable performance. Times from either an M40 or Titan X, they are basically the same GPU. (Plot of COCO AP vs. inference time in ms; the underlying data points are listed above.)

2.1. Bounding Box Prediction

Following YOLO9000 our system predicts bounding boxes using dimension clusters as anchor boxes [15]. The network predicts 4 coordinates for each bounding box, tx, ty, tw, th. If the cell is offset from the top left corner of the image by (cx, cy) and the bounding box prior has width and height pw, ph, then the predictions correspond to:

    bx = σ(tx) + cx
    by = σ(ty) + cy
    bw = pw · e^tw
    bh = ph · e^th

During training we use sum of squared error loss. If the ground truth for some coordinate prediction is t̂* our gradient is the ground truth value (computed from the ground truth box) minus our prediction: t̂* − t*. This ground truth value can be easily computed by inverting the equations above.

Figure 2. Bounding boxes with dimension priors and location prediction. We predict the width and height of the box as offsets from cluster centroids. We predict the center coordinates of the box relative to the location of filter application using a sigmoid function. This figure blatantly self-plagiarized from [15].
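To make the box decoding above concrete, here is a minimal NumPy sketch (not the authors' Darknet code) that turns one raw prediction (tx, ty, tw, th) into a box; the cell offset, prior size, and input values are made-up examples.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_box(t, cell_offset, prior):
    """Apply bx = sigmoid(tx) + cx, by = sigmoid(ty) + cy,
    bw = pw * exp(tw), bh = ph * exp(th).
    All values are in grid-cell units; multiply by the stride of the
    scale to get pixel coordinates."""
    tx, ty, tw, th = t
    cx, cy = cell_offset   # offset of the cell from the image's top-left corner
    pw, ph = prior         # width and height of the bounding box prior
    bx = sigmoid(tx) + cx
    by = sigmoid(ty) + cy
    bw = pw * np.exp(tw)
    bh = ph * np.exp(th)
    return bx, by, bw, bh

# Made-up example: a 3.6 x 5.2 prior anchored at grid cell (7, 4).
print(decode_box(t=(0.2, -0.1, 0.3, 0.05), cell_offset=(7, 4), prior=(3.6, 5.2)))
```

Because the sigmoid keeps σ(tx) and σ(ty) in (0, 1), the predicted center can never leave the cell that made the prediction, which is part of what keeps training stable (compare the linear x, y predictions discussed in Section 4).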
YOLOv3 predicts an objectness score for each bounding box using logistic regression. This should be 1 if the bounding box prior overlaps a ground truth object by more than any other bounding box prior. If the bounding box prior is not the best but does overlap a ground truth object by more than some threshold we ignore the prediction, following [17]. We use the threshold of .5. Unlike [17] our system only assigns one bounding box prior for each ground truth object. If a bounding box prior is not assigned to a ground truth object it incurs no loss for coordinate or class predictions, only objectness.

2.2. Class Prediction

Each box predicts the classes the bounding box may contain using multilabel classification. We do not use a softmax as we have found it is unnecessary for good performance, instead we simply use independent logistic classifiers. During training we use binary cross-entropy loss for the class predictions.

This formulation helps when we move to more complex domains like the Open Images Dataset [7]. In this dataset there are many overlapping labels (i.e. Woman and Person). Using a softmax imposes the assumption that each box has exactly one class which is often not the case. A multilabel approach better models the data.

2.3. Predictions Across Scales

YOLOv3 predicts boxes at 3 different scales. Our system extracts features from those scales using a similar concept to feature pyramid networks [8]. From our base feature extractor we add several convolutional layers. The last of these predicts a 3-d tensor encoding bounding box, objectness, and class predictions. In our experiments with COCO [10] we predict 3 boxes at each scale so the tensor is N × N × [3 × (4 + 1 + 80)] for the 4 bounding box offsets, 1 objectness prediction, and 80 class predictions.

Next we take the feature map from 2 layers previous and upsample it by 2×. We also take a feature map from earlier in the network and merge it with our upsampled features using concatenation. This method allows us to get more meaningful semantic information from the upsampled features and finer-grained information from the earlier feature map. We then add a few more convolutional layers to process this combined feature map, and eventually predict a similar tensor, although now twice the size.

We perform the same design one more time to predict boxes for the final scale. Thus our predictions for the 3rd scale benefit from all the prior computation as well as fine-grained features from early on in the network.

We still use k-means clustering to determine our bounding box priors. We just sort of chose 9 clusters and 3 scales arbitrarily and then divided up the clusters evenly across scales. On the COCO dataset the 9 clusters were: (10×13), (16×30), (33×23), (30×61), (62×45), (59×119), (116×90), (156×198), (373×326).

2.4. Feature Extractor

We use a new network for performing feature extraction. Our new network is a hybrid approach between the network used in YOLOv2, Darknet-19, and that newfangled residual network stuff. Our network uses successive 3×3 and 1×1 convolutional layers but now has some shortcut connections as well and is significantly larger. It has 53 convolutional layers so we call it... wait for it... Darknet-53!
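To make "successive 3×3 and 1×1 convolutional layers ... with some shortcut connections" concrete, here is a rough PyTorch-style sketch of one such residual block; the batch normalization, the leaky-ReLU slope of 0.1, and the halve-then-restore channel pattern are assumptions for illustration rather than details stated in the text above.

```python
import torch
import torch.nn as nn

class DarknetResidual(nn.Module):
    """One shortcut block: a 1x1 conv that halves the channels, a 3x3 conv
    that restores them, and an additive skip connection. Batch norm and
    leaky ReLU after each conv are assumptions in this sketch."""
    def __init__(self, channels):
        super().__init__()
        self.reduce = nn.Sequential(
            nn.Conv2d(channels, channels // 2, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels // 2),
            nn.LeakyReLU(0.1),
        )
        self.expand = nn.Sequential(
            nn.Conv2d(channels // 2, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.1),
        )

    def forward(self, x):
        return x + self.expand(self.reduce(x))

# The block keeps the spatial size and channel count unchanged.
y = DarknetResidual(64)(torch.randn(1, 64, 52, 52))
print(y.shape)  # torch.Size([1, 64, 52, 52])
```

Groups of such blocks are stacked between strided 3×3 convolutions that halve the resolution, as laid out in Table 1 below.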
     Type            Filters   Size      Output
     Convolutional   32        3×3       256×256
     Convolutional   64        3×3 / 2   128×128
1×   Convolutional   32        1×1
     Convolutional   64        3×3
     Residual                            128×128
     Convolutional   128       3×3 / 2   64×64
2×   Convolutional   64        1×1
     Convolutional   128       3×3
     Residual                            64×64
     Convolutional   256       3×3 / 2   32×32
8×   Convolutional   128       1×1
     Convolutional   256       3×3
     Residual                            32×32
     Convolutional   512       3×3 / 2   16×16
8×   Convolutional   256       1×1
     Convolutional   512       3×3
     Residual                            16×16
     Convolutional   1024      3×3 / 2   8×8
4×   Convolutional   512       1×1
     Convolutional   1024      3×3
     Residual                            8×8
     Avgpool                   Global
     Connected                 1000
     Softmax

Table 1. Darknet-53.

This new network is much more powerful than Darknet-19 but still more efficient than ResNet-101 or ResNet-152. Here are some ImageNet results:

Backbone          Top-1   Top-5   Bn Ops   BFLOP/s   FPS
Darknet-19 [15]   74.1    91.8     7.29     1246     171
ResNet-101 [5]    77.1    93.7    19.7      1039      53
ResNet-152 [5]    77.6    93.8    29.4      1090      37
Darknet-53        77.2    93.8    18.7      1457      78

Table 2. Comparison of backbones. Accuracy, billions of operations, billion floating point operations per second, and FPS for various networks. Each network is trained with identical settings and tested at 256×256, single crop accuracy. Run times are measured on a Titan X at 256×256.

Thus Darknet-53 performs on par with state-of-the-art classifiers but with fewer floating point operations and more speed. Darknet-53 is better than ResNet-101 and 1.5× faster. Darknet-53 has similar performance to ResNet-152 and is 2× faster.

Darknet-53 also achieves the highest measured floating point operations per second. This means the network structure better utilizes the GPU, making it more efficient to evaluate and thus faster. That's mostly because ResNets have just way too many layers and aren't very efficient.

2.5. Training

We still train on full images with no hard negative mining or any of that stuff. We use multi-scale training, lots of data augmentation, batch normalization, all the standard stuff. We use the Darknet neural network framework for training and testing [14].

3. How We Do

YOLOv3 is pretty good! See table 3. In terms of COCO's weird average mean AP metric it is on par with the SSD variants but is 3× faster. It is still quite a bit behind other models like RetinaNet in this metric though.

                            backbone                   AP    AP50  AP75  APS   APM   APL
Two-stage methods
Faster R-CNN+++ [5]         ResNet-101-C4              34.9  55.7  37.4  15.6  38.7  50.9
Faster R-CNN w FPN [8]      ResNet-101-FPN             36.2  59.1  39.0  18.2  39.0  48.2
Faster R-CNN by G-RMI [6]   Inception-ResNet-v2 [21]   34.7  55.5  36.7  13.5  38.1  52.0
Faster R-CNN w TDM [20]     Inception-ResNet-v2-TDM    36.8  57.7  39.2  16.2  39.8  52.1
One-stage methods
YOLOv2 [15]                 DarkNet-19 [15]            21.6  44.0  19.2   5.0  22.4  35.5
SSD513 [11, 3]              ResNet-101-SSD             31.2  50.4  33.3  10.2  34.5  49.8
DSSD513 [3]                 ResNet-101-DSSD            33.2  53.3  35.2  13.0  35.4  51.1
RetinaNet [9]               ResNet-101-FPN             39.1  59.1  42.3  21.8  42.7  50.2
RetinaNet [9]               ResNeXt-101-FPN            40.8  61.1  44.1  24.1  44.2  51.2
YOLOv3 608×608              Darknet-53                 33.0  57.9  34.4  18.3  35.4  41.9

Table 3. I'm seriously just stealing all these tables from [9], they take soooo long to make from scratch. Ok, YOLOv3 is doing alright. Keep in mind that RetinaNet has like 3.8× longer to process an image. YOLOv3 is much better than SSD variants and comparable to state-of-the-art models on the AP50 metric.

However, when we look at the "old" detection metric of mAP at IOU = .5 (or AP50 in the chart) YOLOv3 is very strong. It is almost on par with RetinaNet and far above the SSD variants. This indicates that YOLOv3 is a very strong detector that excels at producing decent boxes for objects. However, performance drops significantly as the IOU threshold increases, indicating YOLOv3 struggles to get the boxes perfectly aligned with the object.
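Since the argument above (and Section 4 below) hinges on what an IOU threshold actually measures, here is a small self-contained sketch of intersection over union for two axis-aligned boxes; the (x1, y1, x2, y2) corner format and the example boxes are illustrative assumptions, not something prescribed by the paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as
    (x1, y1, x2, y2) corners."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle (may be empty).
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# Shifting a 100x100 box right by a quarter of its width still gives IOU 0.6,
# comfortably above the .5 threshold, which is why AP50 forgives loose boxes.
print(iou((0, 0, 100, 100), (25, 0, 125, 100)))  # 0.6
```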
In the past YOLO struggled with small objects. However, now we see a reversal in that trend. With the new multi-scale predictions we see YOLOv3 has relatively high APS performance. However, it has comparatively worse performance on medium and larger size objects. More investigation is needed to get to the bottom of this.

When we plot accuracy vs speed on the AP50 metric (see figure 3) we see YOLOv3 has significant benefits over other detection systems. Namely, it's faster and better.

4. Things We Tried That Didn't Work

We tried lots of stuff while we were working on YOLOv3. A lot of it didn't work. Here's the stuff we can remember.

Anchor box x, y offset predictions. We tried using the normal anchor box prediction mechanism where you predict the x, y offset as a multiple of the box width or height using a linear activation. We found this formulation decreased model stability and didn't work very well.

Linear x, y predictions instead of logistic. We tried using a linear activation to directly predict the x, y offset instead of the logistic activation. This led to a couple point drop in mAP.

Focal loss. We tried using focal loss. It dropped our mAP about 2 points. YOLOv3 may already be robust to the problem focal loss is trying to solve because it has separate objectness predictions and conditional class predictions. Thus for most examples there is no loss from the class predictions? Or something? We aren't totally sure.

Method               mAP-50  time (ms)
B  SSD321            45.4     61
C  DSSD321           46.1     85
D  R-FCN             51.9     85
E  SSD513            50.4    125
F  DSSD513           53.3    156
G  FPN FRCN          59.1    172
   RetinaNet-50-500  50.9     73
   RetinaNet-101-500 53.1     90
   RetinaNet-101-800 57.5    198
   YOLOv3-320        51.5     22
   YOLOv3-416        55.3     29
   YOLOv3-608        57.9     51

Figure 3. Again adapted from [9], this time displaying the speed/accuracy tradeoff on the mAP at .5 IOU metric. You can tell YOLOv3 is good because it's very high and far to the left. Can you cite your own paper? Guess who's going to try, this guy [16]. Oh, I forgot, we also fix a data loading bug in YOLOv2, that helped by like 2 mAP. Just sneaking this in here to not throw off layout. (Plot of COCO mAP-50 vs. inference time in ms; the underlying data points are listed above.)

Dual IOU thresholds and truth assignment. Faster R-CNN uses two IOU thresholds during training. If a prediction overlaps the ground truth by .7 it is a positive example, by [.3−.7] it is ignored, and if it overlaps less than .3 for all ground truth objects it is a negative example. We tried a similar strategy but couldn't get good results.

We quite like our current formulation, it seems to be at a local optimum at least. It is possible that some of these techniques could eventually produce good results, perhaps they just need some tuning to stabilize the training.

5. What This All Means

YOLOv3 is a good detector. It's fast, it's accurate. It's not as great on the COCO average AP between .5 and .95 IOU metric. But it's very good on the old detection metric of .5 IOU.

Why did we switch metrics anyway? The original COCO paper just has this cryptic sentence: "A full discussion of evaluation metrics will be added once the evaluation server is complete". Russakovsky et al. report that humans have a hard time distinguishing an IOU of .3 from .5! "Training humans to visually inspect a bounding box with IOU of 0.3 and distinguish it from one with IOU 0.5 is surprisingly difficult." [18] If humans have a hard time telling the difference, how much does it matter?
But maybe a better question is: "What are we going to do with these detectors now that we have them?" A lot of the people doing this research are at Google and Facebook. I guess at least we know the technology is in good hands and definitely won't be used to harvest your personal information and sell it to... wait, you're saying that's exactly what it will be used for? Oh.

Well the other people heavily funding vision research are the military and they've never done anything horrible like killing lots of people with new technology oh wait.....

I have a lot of hope that most of the people using computer vision are just doing happy, good stuff with it, like counting the number of zebras in a national park [13], or tracking their cat as it wanders around their house [19]. But computer vision is already being put to questionable use and as researchers we have a responsibility to at least consider the harm our work might be doing and think of ways to mitigate it. We owe the world that much.
