YOLOv3: An Incremental Improvement

Joseph Redmon, Ali Farhadi
University of Washington

Abstract

We present some updates to YOLO! We made a bunch of little design changes to make it better. We also trained this new network that's pretty swell. It's a little bigger than last time but more accurate. It's still fast though, don't worry. At 320 × 320 YOLOv3 runs in 22 ms at 28.2 mAP, as accurate as SSD but three times faster. When we look at the old .5 IOU mAP detection metric YOLOv3 is quite good. It achieves 57.9 AP50 in 51 ms on a Titan X, compared to 57.5 AP50 in 198 ms by RetinaNet, similar performance but 3.8× faster. As always, all the code is online.

1. Introduction

Sometimes you just kinda phone it in for a year, you know? I didn't do a whole lot of research this year. Spent a lot of time on Twitter. Played around with GANs a little. I had a little momentum left over from last year [12, 1];
I managed to make some improvements to YOLO. But, honestly, nothing like super interesting, just a bunch of small changes that make it better. I also helped out with other people's research a little.

Actually, that's what brings us here today. We have a camera-ready deadline [4] and we need to cite some of the random updates I made to YOLO but we don't have a source. So get ready for a TECH REPORT!

The great thing about tech reports is that they don't need intros, y'all know why we're here. So the end of this introduction will signpost for the rest of the paper. First we'll tell you what the deal is with YOLOv3. Then we'll tell you how we do. We'll also tell you about some things we tried that didn't work. Finally we'll contemplate what this all means.

2. The Deal

So here's the deal with YOLOv3: We mostly took good ideas from other people. We also trained a new classifier network that's better than the other ones. We'll just take you through the whole system from scratch so you can understand it all.

Figure 1. We adapt this figure from the Focal Loss paper [9]. YOLOv3 runs significantly faster than other detection methods with comparable performance. Times from either an M40 or Titan X, they are basically the same GPU. (Plot: inference time in ms vs. COCO AP.)

    Method             | mAP  | time (ms)
    B  SSD321          | 28.0 | 61
    C  DSSD321         | 28.0 | 85
    D  R-FCN           | 29.9 | 85
    E  SSD513          | 31.2 | 125
    F  DSSD513         | 33.2 | 156
    G  FPN FRCN        | 36.2 | 172
    RetinaNet-50-500   | 32.5 | 73
    RetinaNet-101-500  | 34.4 | 90
    RetinaNet-101-800  | 37.8 | 198
    YOLOv3-320         | 28.2 | 22
    YOLOv3-416         | 31.0 | 29
    YOLOv3-608         | 33.0 | 51

2.1. Bounding Box Prediction

Following YOLO9000 our system predicts bounding boxes using dimension clusters as anchor boxes [15]. The network predicts 4 coordinates for each bounding box: tx, ty, tw, th. If the cell is offset from the top left corner of the image by (cx, cy) and the bounding box prior has width and height pw, ph, then the predictions correspond to:

    bx = σ(tx) + cx
    by = σ(ty) + cy
    bw = pw * e^tw
    bh = ph * e^th

During training we use sum of squared error loss. If the ground truth for some coordinate prediction is t̂* our gradient is the ground truth value (computed from the ground truth box) minus our prediction: t̂* - t*. This ground truth value can be easily computed by inverting the equations above.

YOLOv3 predicts an objectness score for each bounding box using logistic regression. This should be 1 if the bounding box prior overlaps a ground truth object by more than any other bounding box prior. If the bounding box prior is not the best but does overlap a ground truth object by more than some threshold we ignore the prediction, following [17]. We use the threshold of .5. Unlike [17] our system only assigns one bounding box prior for each ground truth object. If a bounding box prior is not assigned to a ground truth object it incurs no loss for coordinate or class predictions, only objectness.

Figure 2. Bounding boxes with dimension priors and location prediction. We predict the width and height of the box as offsets from cluster centroids. We predict the center coordinates of the box relative to the location of filter application using a sigmoid function. This figure blatantly self-plagiarized from [15].
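A minimal sketch of the box decoding above and its inverse for building training targets (plain Python; the function names are mine, not from the paper's code):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Map raw predictions (tx, ty, tw, th) to a box, per the equations above.

    (cx, cy) is the cell's offset from the image's top-left corner;
    (pw, ph) is the prior's width and height.
    """
    bx = sigmoid(tx) + cx
    by = sigmoid(ty) + cy
    bw = pw * math.exp(tw)
    bh = ph * math.exp(th)
    return bx, by, bw, bh

def encode_box(bx, by, bw, bh, cx, cy, pw, ph):
    """Invert the equations: compute ground-truth targets t* for a box."""
    logit = lambda p: math.log(p / (1.0 - p))  # inverse of the sigmoid
    return (logit(bx - cx), logit(by - cy),
            math.log(bw / pw), math.log(bh / ph))
```

Round-tripping a box through `encode_box` and then `decode_box` recovers it, which is why the ground truth values are "easily computed" from the ground truth box.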
2.2. Class Prediction

Each box predicts the classes the bounding box may contain using multilabel classification. We do not use a softmax as we have found it is unnecessary for good performance; instead we simply use independent logistic classifiers. During training we use binary cross-entropy loss for the class predictions.

This formulation helps when we move to more complex domains like the Open Images Dataset [7]. In this dataset there are many overlapping labels (i.e. Woman and Person). Using a softmax imposes the assumption that each box has exactly one class, which is often not the case. A multilabel approach better models the data.

2.3. Predictions Across Scales

YOLOv3 predicts boxes at 3 different scales. Our system extracts features from those scales using a similar concept to feature pyramid networks [8]. From our base feature extractor we add several convolutional layers.
The last of these predicts a 3-d tensor encoding bounding box, objectness, and class predictions. In our experiments with COCO [10] we predict 3 boxes at each scale so the tensor is N × N × [3 × (4 + 1 + 80)] for the 4 bounding box offsets, 1 objectness prediction, and 80 class predictions.

Next we take the feature map from 2 layers previous and upsample it by 2×. We also take a feature map from earlier in the network and merge it with our upsampled features using concatenation. This method allows us to get more meaningful semantic information from the upsampled features and finer-grained information from the earlier feature map. We then add a few more convolutional layers to process this combined feature map, and eventually predict a similar tensor, although now twice the size.

We perform the same design one more time to predict boxes for the final scale. Thus our predictions for the 3rd scale benefit from all the prior computation as well as fine-grained features from early on in the network.

We still use k-means clustering to determine our bounding box priors. We just sort of chose 9 clusters and 3 scales arbitrarily and then divide up the clusters evenly across scales. On the COCO dataset the 9 clusters were: (10×13), (16×30), (33×23), (30×61), (62×45), (59×119), (116×90), (156×198), (373×326).

2.4. Feature Extractor

We use a new network for performing feature extraction. Our new network is a hybrid approach between the network used in YOLOv2, Darknet-19, and that newfangled residual network stuff. Our network uses successive 3×3 and 1×1 convolutional layers but now has some shortcut connections as well and is significantly larger. It has 53 convolutional layers so we call it... wait for it... Darknet-53!

        Type           Filters   Size
        Convolutional  32        3×3
        Convolutional  64        3×3 / 2
    1×  Convolutional  32        1×1
        Convolutional  64        3×3
        Residual
        Convolutional  128       3×3 / 2
    2×  Convolutional  64        1×1
        Convolutional  128       3×3
        Residual
        Convolutional  256       3×3 / 2
    8×  Convolutional  128       1×1
        Convolutional  256       3×3
        Residual
        Convolutional  512       3×3 / 2
    8×  Convolutional  256       1×1
        Convolutional  512       3×3
        Residual
        Convolutional  1024      3×3 / 2
    4×  Convolutional  512       1×1
        Convolutional  1024      3×3
        Residual
        Avgpool                  Global
        Connected                1000
        Softmax
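Putting the numbers from Section 2.3 together, here is a small sketch of the per-scale output shapes. Only the 3-scale, 3-anchor, (4 + 1 + 80)-channel layout comes from the text; the 416 × 416 input size and the strides of 32/16/8 are my assumptions for illustration:

```python
# The 9 COCO priors from the text, split 3 per scale (smallest to largest).
ANCHORS = [(10, 13), (16, 30), (33, 23),       # finest scale
           (30, 61), (62, 45), (59, 119),      # middle scale
           (116, 90), (156, 198), (373, 326)]  # coarsest scale

NUM_CLASSES = 80  # COCO

def output_shape(grid_n, anchors_per_scale=3, num_classes=NUM_CLASSES):
    """Shape of one detection tensor: N x N x [anchors * (4 + 1 + classes)]."""
    return (grid_n, grid_n, anchors_per_scale * (4 + 1 + num_classes))

# Assumed: a 416x416 input with strides 32, 16, 8 at the three scales.
# Each upsampling by 2 doubles N, so each tensor is twice the previous size.
shapes = [output_shape(416 // stride) for stride in (32, 16, 8)]
```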
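A sketch of the Darknet-53 layer schedule described above, enumerated in plain Python. The per-stage repeat counts (1, 2, 8, 8, 4) follow the published Darknet-53 design; the helper name is mine:

```python
def darknet53_conv_layers():
    """List Darknet-53's convolutional layers as (filters, kernel, stride).

    Each stage: one stride-2 3x3 downsampling conv, then `reps` residual
    blocks of a 1x1 bottleneck conv followed by a 3x3 conv, with a shortcut
    connection adding the block's input to its output.
    """
    layers = [(32, 3, 1)]  # stem
    for filters, reps in [(64, 1), (128, 2), (256, 8), (512, 8), (1024, 4)]:
        layers.append((filters, 3, 2))           # downsample
        for _ in range(reps):
            layers.append((filters // 2, 1, 1))  # 1x1 reduce
            layers.append((filters, 3, 1))       # 3x3 expand (+ shortcut)
    return layers

convs = darknet53_conv_layers()
# 52 convolutions by this count; the final 1000-way connected layer is
# commonly what brings the total to the "53" in the name.
```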