Study Notes: Hunting Down YOLOv5 Training Problems, plus a Recommended Read of "Stop Feeding Garbage To Your Model!"
Preface: the model I previously trained with YOLOv5 performs poorly, so I set out to locate the problem and look for improvements.
Task: understand the problems with the first two training runs
Status: mostly complete
Finding the problem through hands-on use:
Solving a problem begins with stating and locating it, and every problem is discovered through use, so I started by trying out my own model.
First, I ran real-time detection on the webcam. The result was disappointing: many gestures were not detected at all, and the ones that were detected were often misclassified, frequently as "door" even when they were nothing of the sort, while non-gesture objects were often detected as gestures. Still images behaved the same way (I skipped separate video tests); my own pictures failed constantly.
Then I tried images from the purchased source dataset, and detection on that dataset's val split was excellent, with no mistakes.
Taken together: the model only works on that dataset's own images and generalizes very poorly to mine. It is clearly overfitting.
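For reference, a minimal sketch of how this kind of spot check can be reproduced in Python, loading custom weights through torch.hub as documented by the YOLOv5 repo (the weight path and image file name below are hypothetical placeholders):

```python
import torch

# Load custom YOLOv5 weights via torch.hub; 'best.pt' stands in for
# wherever the trained weights actually live.
model = torch.hub.load('ultralytics/yolov5', 'custom', path='best.pt')

# Compare one of my own photos against an image from the seller's val set.
results = model('my_gesture_photo.jpg')  # hypothetical file name
results.print()  # detected classes and confidences
results.save()   # annotated images written under runs/detect/
```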
Hypothesis and verification:
I went back over the seller's images and noticed that the dataset consists almost entirely of gestures in front of white walls and black clothing. Could this environment be the cause of the terrible generalization?

Since the gestures in that dataset all sit directly against black clothing, I changed my local background to black and ran detection again. Both the detection rate and the precision shot up: the proportion of cases where a gesture was found and classified correctly increased greatly. Sure enough, the uniform background had taught my gesture model to treat the color black as a key feature of a gesture. The seller's dataset turns out to be a rather poor one.
Takeaway: by pinning down the cause, namely the background environment of the seller's gesture images, I came to understand what a "feature" means in object detection. A feature is an important structural component of the target; it is what distinguishes the target from everything else.
Tip: I find it helps to reframe the question from "why is a person detected as door" to "why is the gesture not detected", i.e. to keep the analysis firmly anchored on the gesture, since this is gesture recognition. That keeps the reasoning clear.
Since my understanding of what makes a dataset good or bad is still vague and incomplete, the next step is to learn what a good dataset looks like.
Learning what makes a good dataset: "Stop Feeding Garbage To Your Model!":
Source: "Stop Feeding Garbage To Your Model! — The 6 biggest mistakes with datasets and how to avoid them" https://hackernoon.com/stop-feeding-garbage-to-your-model-the-6-biggest-mistakes-with-datasets-and-how-to-avoid-them-3cb7532ad3b7 (requires a VPN from mainland China). Truly, every sentence is a gem.
Introduction
If you haven’t heard it already, let me tell you a truth that you should, as a data scientist, always keep in a corner of your head:
“Your results are only as good as your data.”
Many people make the mistake of trying to compensate for their ugly dataset by improving their model. This is the equivalent of buying a supercar because your old car doesn’t perform well with cheap gasoline. It makes much more sense to refine the oil instead of upgrading the car. In this article, I will explain how you can easily improve your results by enhancing your dataset.
A memorable line: "your results are only as good as your data". This pretty much tells me where the center of gravity of the current work lies. The dataset, in a sense, means everything.
1. Not enough data
If your dataset is too small, your model doesn’t have enough examples to find discriminative features that will be used to generalize. It will then overfit your data, resulting in a low training error but a high test error.
Solution #1: gather more data. You can try to find more from the same source as your original dataset, or from another source if the images are quite similar or if you absolutely want to generalize.
Caveats: This is usually not an easy thing to do, at least without investing time and money. Also, you might want to do an analysis to determine how much additional data you need. Compare your results with different dataset sizes, and try to extrapolate.
Solution #2: augment your data by creating multiple copies of the same image with slight variations. This technique works wonders and it produces tons of additional images at a really low cost. You can try to crop, rotate, translate or scale your image. You can add noise, blur it, change its colors or obstruct parts of it. In all cases, you need to make sure the data is still representing the same class.

All these images still represent the "cat" category
This can be extremely powerful, as stacking these effects gives exponentially numerous samples for your dataset. Note that this is still usually inferior to collecting more raw data.

Combined data augmentation techniques. The class is still “cat” and should be recognized as such.
Caveats: not all augmentation techniques may be usable for your problem. For example, if you want to classify Lemons and Limes, don't play with the hue, as it would make sense that color is important for the classification.

This type of data augmentation would make it harder for the model to find discriminating features.
Not enough data means the model cannot find the structural, discriminative features it needs in order to generalize: training-set error is low, val-set error is high, i.e. overfitting.
Approach 1: obtain more image data. This costs time and money and may be infeasible, considering the quantities involved (think tens or hundreds of thousands of images to reach the goal). If you do go this route, first estimate how many extra images you need by extrapolating from training performance at different dataset sizes.
Approach 2: data augmentation, which I mentioned earlier. Scale, crop, rotate, add noise to, blur, or occlude an image to obtain copies that keep the same label. This is very effective and yields an exponentially large amount of data at low cost, though it is usually still inferior to collecting more raw data; see the sketch after the tip below.
Tip: make sure the augmentation does not destroy the features used to tell the classes apart. Lemons and limes, for instance, are distinguished mainly by color, so augmenting hue there is just asking for trouble.
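As a concrete illustration of Approach 2, here is a minimal classification-style augmentation sketch using torchvision on a single image. This is for intuition only, not a drop-in for detection training (YOLOv5 performs its own box-aware augmentation internally), and 'gesture.jpg' is a hypothetical sample:

```python
from PIL import Image
from torchvision import transforms

# A stack of label-preserving augmentations: crop, rotate, translate,
# lighting change, blur, and occlusion. Each copy should still clearly
# show the same class.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),       # crop + scale
    transforms.RandomRotation(15),                             # rotate
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),  # translate
    transforms.ColorJitter(brightness=0.2, contrast=0.2),      # lighting
    transforms.GaussianBlur(kernel_size=3),                    # blur
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.5),                           # occlusion
])

img = Image.open('gesture.jpg')            # hypothetical sample image
copies = [augment(img) for _ in range(8)]  # 8 augmented tensor copies
```

Note that hue is deliberately left out of the ColorJitter call, in line with the lemon/lime caveat above.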
2. Low quality classes
It’s an easy one, but take time to go through your dataset if possible, and verify the label of each sample. This might take a while, but having counter-examples in your dataset will be detrimental to the learning process.
Also, choose the right level of granularity for your classes. Depending on the problem, you might need more or fewer classes. For example, you can classify the image of a kitten with a global classifier to determine it's an animal, then run it through an animal classifier to determine it's a kitten. A huge model could do both, but it would be much harder.

Two stage prediction with specialized classifiers.
First, if at all possible, go through your dataset image by image and check that each label is correct; it is time well spent.
Also, choose the right granularity for your classes. For cat classification, say: first decide that the image shows an animal, then run it through an animal classifier to get "cat". A two-stage pipeline of specialized classifiers like this is much easier than one huge general-purpose model; a sketch follows below.
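A minimal sketch of that two-stage pipeline (both classifiers and their label lists are hypothetical placeholders, assumed to return logits for a batch of one):

```python
import torch
import torch.nn as nn

def two_stage_predict(x, global_clf, animal_clf, global_labels, animal_labels):
    """Coarse-to-fine prediction: a global classifier decides the broad
    category first; a specialized classifier refines it only for animals."""
    coarse = global_labels[global_clf(x).argmax(dim=1).item()]
    if coarse != 'animal':
        return coarse                      # e.g. 'vehicle', 'plant', ...
    return animal_labels[animal_clf(x).argmax(dim=1).item()]  # e.g. 'kitten'

# Toy demo with untrained linear "classifiers" on a flattened input.
x = torch.randn(1, 16)
print(two_stage_predict(x,
                        nn.Linear(16, 3), nn.Linear(16, 2),
                        ['animal', 'vehicle', 'plant'],
                        ['kitten', 'dog']))
```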
3. Low quality data
As said in the introduction, low quality data will only lead to low quality results.
You might have samples in your dataset that are too far from what you want to use. These might be more confusing for the model than helpful.
Solution: remove the worst images. This is a lengthy process, but will improve your results.

Sure, these three images represent cats, but the model might not be able to work with it.
Another common issue is when your dataset is made of data that doesn’t match the real world application. For instance if the images are taken from completely different sources.
Solution: think about the long term application of your technology, and which means will be used to acquire data in production. If possible, try to find/build a dataset with the same tools.

Using data that doesn’t represent your real world application is usually a bad idea. Your model is likely to extract features that won’t work in the real world.
Low quality data only leads to low quality results. Remove the worst images; it is a tedious job, but it improves the training results.
The three cat pictures shown are unsuitable for classification. (Personally I would describe them as interference with the feature structure.)
Also, choose your dataset with the long-term production application in mind: in what form will the data be acquired in production? If possible, build the dataset with the same tools.
4. Unbalanced classes
If the number of samples per class isn't roughly the same for all classes, the model might have a tendency to favor the dominant class, as it results in a lower error. We say that the model is biased because the class distribution is skewed. This is a serious issue, and also why you need to take a look at precision, recall or confusion matrices.
Solution #1: gather more samples of the underrepresented classes. However, this is often costly in time and money, or simply not feasible.
Solution #2: over/under-sample your data. This means that you remove some samples from the over-represented classes, and/or duplicate samples from the under-represented classes. Better than duplication, use data augmentation as seen previously.

We need to augment the under-represented class (cat) and leave aside some samples from the over-represented class (lime). This will give a much smoother class distribution.
If the classes in the training set do not contain roughly the same number of images, the model may develop a preference for a particular class, because guessing the dominant class produces fewer errors. A biased model is a serious problem, and one you can detect with precision, recall, and the confusion matrix.
Approach 1: collect more images of the under-represented classes.
Approach 2: grow the under-represented classes through augmentation, and/or delete images from the over-represented classes.
The end result is that each class contributes in a much more balanced way; a weighted-sampling sketch follows below.
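One way to rebalance at training time without physically duplicating files is PyTorch's WeightedRandomSampler, which over-samples rare classes on the fly. A minimal sketch (labels here is a hypothetical list of integer class ids, one per image):

```python
from collections import Counter
from torch.utils.data import WeightedRandomSampler

labels = [0, 0, 0, 0, 1, 2, 2]          # hypothetical: class 0 dominates

# Weight each sample by the inverse frequency of its class, so rare
# classes are drawn about as often as common ones on average.
counts = Counter(labels)
weights = [1.0 / counts[y] for y in labels]

sampler = WeightedRandomSampler(weights,
                                num_samples=len(labels),
                                replacement=True)
# Pass sampler=sampler to your DataLoader instead of shuffle=True.
```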
5. Unbalanced data
If your data doesn't have a specific format, or if the values don't lie in a certain range, your model might have trouble dealing with it. You will have better results with images that share the same aspect ratio and pixel value range.
Solution #1: Crop or stretch the data so that it has the same aspect ratio and format as the other samples.

Two possibilities to improve a badly formatted image.
Your data should share a single consistent format, otherwise it is hard to process; if your images are consistent, for example in aspect ratio, you will get better training results.
Tip: the YOLOv5 I use already performs this step itself (letterbox resizing); a simplified sketch of it follows below.
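For intuition, here is a simplified version of the letterbox padding YOLOv5 applies; the real implementation in the YOLOv5 repo also handles stride alignment and scale-up options, and 'gesture.jpg' is a hypothetical image:

```python
import cv2

def letterbox(img, new_size=640, pad_color=(114, 114, 114)):
    """Resize to fit inside new_size x new_size while keeping the aspect
    ratio, then pad the borders with gray so every image ends up square."""
    h, w = img.shape[:2]
    r = new_size / max(h, w)                       # uniform scale factor
    resized = cv2.resize(img, (int(round(w * r)), int(round(h * r))))
    pad_h = new_size - resized.shape[0]
    pad_w = new_size - resized.shape[1]
    top, bottom = pad_h // 2, pad_h - pad_h // 2
    left, right = pad_w // 2, pad_w - pad_w // 2
    return cv2.copyMakeBorder(resized, top, bottom, left, right,
                              cv2.BORDER_CONSTANT, value=pad_color)

square = letterbox(cv2.imread('gesture.jpg'))      # hypothetical image
```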
6. No validation or testing
Once your dataset has been cleaned, augmented and properly labelled, you need to split it. Many people split it the following way: 80% for training, and 20% for testing, which allows you to easily spot overfitting. However, if you are trying multiple models on the same testing set, something else happens. By picking the model giving the best test accuracy, you are in fact overfitting the testing set. This happens because you are manually selecting a model not for its intrinsic value, but for its performance on a specific set of data.
Solution: split the dataset in three: training, validation and testing. This shields your testing set from being overfitted by the choice of the model. The selection process becomes:
Train your models on the training set.
Test them on the validation set to make sure you aren’t overfitting.
Pick the most promising model. Test it on the testing set, this will give you the true accuracy of your model.

Note: Once you have chosen your model for production, don’t forget to train it on the whole dataset! The more data the better!
Do not use the test set to check for overfitting while selecting models, because selection is itself a form of fitting to data; you can end up overfitting the test set and corrupting its verdict. So the test set should be touched only once. To be honest, I did not fully understand the article on this point; this post (in Chinese) explains it more clearly: https://zhuanlan.zhihu.com/p/612928760:
"Why split the dataset into three parts?
In deep learning, it is very important to split the dataset into training, validation, and test sets.
The main reason is to evaluate the model's performance on unseen data. By splitting the data into different sets, the model can be trained on the training set, have its hyperparameters tuned on the validation set, and finally be evaluated on the test set to estimate its generalization performance.
The training set is used to train the model by adjusting its parameters with an optimization algorithm (such as backpropagation).
The validation set is used to tune the model's hyperparameters, such as the learning rate, the number of hidden layers, and the number of neurons per layer, to improve its performance on the validation set.
The test set is used to evaluate the model's final performance after the hyperparameters have been tuned on the validation set. It serves as an estimate of how the model will perform on unseen data.
Splitting the data into these three sets helps prevent overfitting, which occurs when a model performs well on the training set but poorly on unseen data. By evaluating the model on separate validation and test sets, we can make sure it is not overfitting and can generalize to unseen data.
…"
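To make the three-way split concrete, a minimal sketch with scikit-learn's train_test_split applied twice (the 70/15/15 ratio is just one common choice, and the paths and labels are hypothetical dummy data):

```python
from sklearn.model_selection import train_test_split

# Hypothetical dataset: image paths and their class labels.
paths  = [f'img_{i:03d}.jpg' for i in range(100)]
labels = [i % 3 for i in range(100)]   # three balanced dummy classes

# First carve off 15% as the untouched test set...
train_p, test_p, train_y, test_y = train_test_split(
    paths, labels, test_size=0.15, stratify=labels, random_state=0)

# ...then split the remainder into train (~70%) and validation (~15%).
train_p, val_p, train_y, val_y = train_test_split(
    train_p, train_y, test_size=0.15 / 0.85, stratify=train_y,
    random_state=0)

# Train on train_*, tune hyperparameters on val_*, touch test_* only once.
```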