Through comparative experiments, I investigated whether m-EDL performs as well as EDL and whether m-EDL offers an advantage when class u is included in the training data. The objective of this evaluation was to determine the following:
(Q1): whether the use of m-EDL reduces the prediction accuracy for a class k when the same training and test data are given to EDL and m-EDL models;
(Q2): whether a) an m-EDL model that has learned class u has the same prediction accuracy for a class k when compared with an EDL model that cannot learn class u, and b) m-EDL predicts class u with higher accuracy than EDL;
(Q3): whether the ratio of class u data included in the training data affects the accuracy of predicting classes k and u in the test data;
(Q4): what happens when the properties of class u data that are blended with the training data and test data in Q2 and Q3 are exactly the same.
To answer these questions, several datasets and models were prepared. The evaluation conditions varied according to whether class u data were included in the training and/or test data and which model was used to learn the data.
Here, I evaluate whether the performance of m-EDL is comparable to that of EDL in the situation assumed by EDL; that is, the situation where all training and test data belong to class k. In other words, both the training and test data were composed only of images from MNIST, and the following two conditions were compared: (1) the EDL model trained and tested on datasets with no class u data and (2) the m-EDL model trained and tested on datasets with no class u data.
Figure 3 compares the accuracies of EDL (thin solid red line) and m-EDL (thick solid blue line). Each line shows the mean value, and the shaded areas indicate the standard deviation. The accuracy of EDL changes with the uncertainty threshold; the accuracy is plotted on the vertical axis against the uncertainty threshold on the horizontal axis. The accuracy of EDL improves as the threshold decreases because only classification results in which the model is confident are accepted. Figure 3a shows the results when \(\widehat{\boldsymbol{p}_{\boldsymbol{k}^{+}}}\) is used for the classification results of m-EDL. Because no uncertainty threshold is applied to the classification result of m-EDL in this case, its accuracy is constant, yielding a line parallel to the horizontal axis. In contrast, Fig. 3b shows the results when \(\widehat{\boldsymbol{p}_{\boldsymbol{k}^{+}}}\) is converted to \(\overline{\boldsymbol{p}_{\boldsymbol{k}}}\) and the uncertainty threshold used for EDL is also applied to m-EDL.
Figure 3. Accuracy of EDL and m-EDL when both the training and test datasets contain no class u data. (a) Results when \(\widehat{\boldsymbol{p}_{\boldsymbol{k}^{+}}}\) is used in m-EDL classification. (b) Results when \(\widehat{\boldsymbol{p}_{\boldsymbol{k}^{+}}}\) is converted to \(\overline{\boldsymbol{p}_{\boldsymbol{k}}}\) and used in m-EDL classification with the same uncertainty threshold as that of EDL.
These graphs show that the accuracy of m-EDL is lower than that of EDL, except in the region where the uncertainty threshold is 0.9 or higher. However, the decrease is not substantial, and the performance of m-EDL should be sufficient for many applications.
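For readers who want to reproduce the shape of these curves, the following minimal sketch (not the authors' code) computes accuracy as a function of the uncertainty threshold. It assumes the standard EDL formulation, in which the per-sample uncertainty is u = K/S with S the sum of the Dirichlet parameters, and it assumes that accuracy is computed only over samples whose uncertainty does not exceed the threshold; the random evidence used in the demonstration is purely illustrative.

```python
# Minimal sketch (assumed, not the authors' code) of the threshold sweep in Fig. 3.
# Assumptions: the model outputs non-negative evidence per class; uncertainty is
# u = K / S with S = sum(alpha), as in standard EDL; a prediction is accepted only
# when u <= threshold; accuracy is computed over the accepted samples.
import numpy as np

def edl_uncertainty(evidence):
    """evidence: (N, K) non-negative evidence; returns per-sample uncertainty u = K / S."""
    alpha = evidence + 1.0                 # Dirichlet parameters
    strength = alpha.sum(axis=1)           # S = sum_k alpha_k
    return evidence.shape[1] / strength    # u = K / S

def thresholded_accuracy(evidence, labels, thresholds):
    """Accuracy over samples whose uncertainty does not exceed each threshold."""
    preds = evidence.argmax(axis=1)
    u = edl_uncertainty(evidence)
    accs = []
    for t in thresholds:
        accepted = u <= t
        if accepted.sum() == 0:
            accs.append(np.nan)            # no sample is confident enough
        else:
            accs.append((preds[accepted] == labels[accepted]).mean())
    return np.array(accs)

# Illustrative example with random evidence for 1,000 samples and K = 10 classes
rng = np.random.default_rng(0)
evidence = rng.gamma(2.0, 1.0, size=(1000, 10))
labels = rng.integers(0, 10, size=1000)
print(thresholded_accuracy(evidence, labels, thresholds=np.linspace(0.1, 1.0, 10)))
```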
In this experiment, the class u data included in the training data and in the test data have completely different properties; that is, they are drawn from different datasets. This makes it possible to confirm whether the features learned for the uncertain class are regarded as features that are "not class k" rather than as the specific class u features seen during training.
First, I consider whether an m-EDL model that has learned class u has the same prediction accuracy for class k when compared with an EDL model that cannot learn class u (Q2a). I then consider whether it can determine class u with higher prediction accuracy (Q2b).
The following two cases are considered: (1) EDL is tested on data that include Fashion MNIST data, and m-EDL is trained on data that include EMNIST data but tested on data that include Fashion MNIST data. Figure 4a–c shows the results for class u rates of 25%, 50%, and 75% in the training data, respectively. The lines of different colors indicate the results for class u rates of 25%, 50%, and 75% in the test data. These rates are percentages of the number of MNIST data. Additionally, Table 1 presents the mean accuracies of EDL and m-EDL for each condition. (2) EDL is tested on data that include EMNIST data, and m-EDL is trained on data that include Fashion MNIST data but tested on data that include EMNIST data. Figure 4d–f shows the results for class u rates of 25%, 50%, and 75% in the training data, respectively. The lines of different colors indicate the results for class u rates of 25%, 50%, and 75% in the test data. These rates are percentages of the number of MNIST data. Additionally, Table 2 presents the mean accuracies of EDL and m-EDL for each condition.
Figure 4. Accuracy comparison of EDL and m-EDL. Line colors indicate the proportion of class u in the test data, and the top and bottom plots show the accuracy for class k data and class u data, respectively. Results when m-EDL has learned class u (EMNIST data) but is tested on Fashion MNIST data for class u mix rates in the training data of (a) 25%, (b) 50%, and (c) 75%. These rates are percentages of the number of MNIST data. Results when m-EDL has learned class u (Fashion MNIST data) but is tested on EMNIST data for class u mix rates in the training data of (d) 25%, (e) 50%, and (f) 75%.
Under these two conditions, the one-hot vector \(y_j\) of each sample has K = 10 dimensions. Therefore, all elements of the one-hot vectors of class u (EMNIST or Fashion MNIST data) in the test data were set to 0. In each of the following cases, the same processing was applied when EDL was tested on data including class u data.
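The sketch below illustrates, under stated assumptions, how a dataset with a given class u mix rate and the all-zero one-hot labels described above might be assembled. The array names (x_mnist, y_mnist, x_unknown) and the helper build_mixed_set are hypothetical; the mix rate is interpreted as a percentage of the number of MNIST images, as in the text.

```python
# Illustrative sketch (not the authors' preprocessing code) of assembling a set with a
# given class u mix rate. Class u samples receive an all-zero one-hot vector of length
# K = 10, as described above. Dataset loading is schematic: x_mnist and x_unknown stand
# for image arrays of identical shape, y_mnist for integer labels in [0, K).
import numpy as np

K = 10  # number of known classes

def build_mixed_set(x_mnist, y_mnist, x_unknown, mix_rate, seed=0):
    """Return images and one-hot labels with class u mixed in at `mix_rate` (e.g. 0.25)."""
    rng = np.random.default_rng(seed)
    n_u = int(len(x_mnist) * mix_rate)                      # e.g. 25% of the MNIST count
    idx = rng.choice(len(x_unknown), size=n_u, replace=False)
    x = np.concatenate([x_mnist, x_unknown[idx]], axis=0)
    y_k = np.eye(K)[y_mnist]                                # one-hot labels for class k
    y_u = np.zeros((n_u, K))                                # all elements set to 0 for class u
    y = np.concatenate([y_k, y_u], axis=0)
    perm = rng.permutation(len(x))                          # shuffle the mixed set
    return x[perm], y[perm]
```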
The left plots of Fig. 4a–c and Table 1 (avg. accuracy for k) show the results for class k data under the first condition. The line color indicates the ratio of class u data included in the test data; one would expect the accuracy to decrease as the mix ratio of class u in the test data increases. The results show that the accuracy of m-EDL with respect to class k is high and robust to the mix rate of class u in the training and test data: the left plots in Fig. 4a–c show that the m-EDL model that has learned class u achieves equal or higher accuracy with respect to class k than the EDL model, which cannot learn class u. Moreover, the accuracy of m-EDL is not easily affected by the ratio of class u in either the test data or the training data.
The right plots of Fig. 4a–c and Table 1 (avg. accuracy for u) show the accuracy for class u data, that is, the accuracy with which data judged as "I do not know" are in fact different from the classes learned so far. The right plots of Fig. 4a–c show that the accuracy of m-EDL with respect to class u is high and robust to the mix rate of class u in the training and test data. It is natural for the class u accuracy of EDL to increase as the ratio of class u increases, because this would occur even if EDL classified class u at random.
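As one possible reading of this metric, the sketch below computes the class u accuracy as the fraction of true class u test samples that a model flags as "I do not know": for EDL, an uncertainty above the threshold; for m-EDL, the extra output of \(\widehat{\boldsymbol{p}_{\boldsymbol{k}^{+}}}\) being the largest (assumed here to be the last index). This is an interpretation of the text, not the authors' definition.

```python
# Hedged sketch of one way the "accuracy for class u" could be computed.
# `uncertainty`, `is_class_u`, and `probs_k_plus` are assumed NumPy arrays:
# per-sample uncertainty, a boolean mask of true class u samples, and the
# (N, K+1)-dimensional m-EDL output, respectively.
import numpy as np

def class_u_accuracy_edl(uncertainty, is_class_u, threshold):
    """Fraction of class u samples whose EDL uncertainty exceeds the threshold."""
    rejected = uncertainty > threshold
    return rejected[is_class_u].mean()

def class_u_accuracy_medl(probs_k_plus, is_class_u):
    """Fraction of class u samples for which the extra (assumed last) class has the highest probability."""
    predicted_u = probs_k_plus.argmax(axis=1) == probs_k_plus.shape[1] - 1
    return predicted_u[is_class_u].mean()
```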
Figure 4d–f and Table 2 (avg. accuracy for k) show the results for the second condition, which is exactly the same as the first condition except that the EMNIST and Fashion MNIST datasets switch roles. Again, the accuracy of m-EDL with respect to class k is high and robust, as in the left plots of Fig. 4a–c. The results in the left plots of Fig. 4d–f reveal that the m-EDL model that learned class u, when compared with EDL, achieved an equal or higher accuracy with respect to class k, and the accuracy of m-EDL was not easily affected by the ratio of class u in the test and training data.
However, the right plots of Fig. 4d–f and Table 2 (avg. accuracy for u) show that the accuracy of m-EDL with respect to class u cannot be said to be better than that of EDL.
In the comparison of the two patterns in "Performance comparison of EDL and m-EDL when class u is included in the training and test data (Q2)", if the ratio of class u in the training data affects the prediction accuracy for the class k and u data, then the ratio of class u included in the training data must be appropriately selected. To determine whether this is the case, I used the results from that section (Fig. 4a–c and d–f, which have training data mix ratios of 25%, 50%, and 75%, respectively) and added the following two cases: (1) Fashion MNIST is included in the test data, but neither EDL nor m-EDL is trained on class u data (a training data mix ratio of 0%; Fig. 5a), and (2) EMNIST is included in the test data, but neither EDL nor m-EDL is trained on class u data (a training data mix ratio of 0%; Fig. 5b). The lines of different colors indicate the results for class u rates of 25%, 50%, and 75% in the test data.
Figure 5. Accuracy comparison of EDL and m-EDL when neither EDL nor m-EDL has learned class u. Line colors indicate the mix rate of class u in the test data, and the left and right plots show the accuracy for class k data and class u data, respectively. (a) Results for Fashion MNIST data. (b) Results for EMNIST data.
In the left plot of Fig. 5a, the accuracy for class k improved, as in the left plots of Fig. 4a–c, whereas in the right plot of Fig. 5a there was no improvement in the accuracy for class u. In the right plots of Fig. 4a–c, the accuracy for class u improved even when the ratio of class u in the training data was small. These results suggest that the accuracy for class u may be improved by having m-EDL learn even a small amount of class u data. Moreover, these data do not need to be related to the class u data in the test data.
The right plot of Fig. 5b shows that m-EDL did not lead to improvements in accuracy for class u. Moreover, in the right plots of Fig. 4d–f, the accuracy of m-EDL for class u is not better than that of EDL; however, compared with the results in the right plot of Fig. 5b, it is clear that the accuracy of m-EDL for class u improves even if the ratio of class u in the training data is small.
It can be inferred from these comparisons that the amount of accuracy improvement for class u changes depending on the characteristics of class u in the training and test data.
As shown in "Performance comparison of EDL and m-EDL when class u is included in the training and test data (Q2)" and "Effect of the ratio of the class u included in the training data on the prediction accuracy of classes k and u in the test dataset (Q3)", the amount of improvement in accuracy for class u data changes depending on the characteristics of u in the training data and test data. Hence, I evaluated whether the accuracy for class u always improves when the characteristics of u in the training and test data are exactly the same (i.e., when the class u data are from the same dataset).
The following two conditions were considered: (1) when Fashion MNIST is included in both the test and training data [Fig. 6a–c and Table 3 (avg. accuracy for k and u)] and (2) when EMNIST is included in both the test and training data [Fig. 6d–f and Table 4 (avg. accuracy for k and u)].
Figure 6. Accuracy comparison of EDL and m-EDL. Line colors indicate the proportion of class u in the test data, and the top and bottom plots show the accuracy for class k data and class u data, respectively. Results when m-EDL has learned class u (Fashion MNIST) for class u mix rates in the training data of (a) 25%, (b) 50%, and (c) 75%. These rates are percentages of the number of MNIST data. Results when m-EDL has learned class u (EMNIST) for class u mix rates in the training data of (d) 25%, (e) 50%, and (f) 75%.
The differences between Fig. 6a–c and d–f are the mix rates of class u in the training data (25%, 50%, and 75%, respectively). The lines of different colors indicate the results for class u rates of 25%, 50%, and 75% in the test data. These rates are percentages of the number of MNIST data. In particular, the right-hand plots of Fig. 6a–f confirm that the accuracy of m-EDL is higher than in the cases considered for Q2 and Q3 and is almost 100%.
In the cases of Q2 and Q3, the class u data in the training and/or test data have different characteristics, and the accuracy of m-EDL on the class u data changed depending on the combination. Meanwhile, in the Q4 cases, the class u data had the same characteristics during both training and testing, and hence the accuracy is very high. From this, it is clear that learning the features of class u in the training data contributes to the improvement in accuracy that m-EDL exhibits. However, in the comparisons of Q2, particularly when m-EDL was trained using EMNIST and both EDL and m-EDL were tested on data including Fashion MNIST, examples can be found where the accuracy improved even when the unknown classes in the training and test data differ. Therefore, m-EDL has the potential to improve accuracy by excluding uncertain data as a result of learning unrelated data that do not belong to class k, although this depends on the combination of class u data in the training and test data.
Here, I hypothesize about which combination of class u datasets mixed in during training will increase the class u accuracy during testing. The hypothesis is as follows: if class u data whose characteristics are as close as possible to those of class k are learned during training, then class u data in the test can be discriminated as class u as long as the characteristics of class u given during the test differ from those used in training. In other words, if m-EDL learns a boundary that delimits the range of class k more strictly by using class u data whose characteristics are close to those of class k, class u can be easily distinguished. Conversely, if the class u data used during training are far from the characteristics of class k, the decision boundary between k and u is determined freely, and if the class u data in the test are close to class k, they may be incorrectly classified.
To test this hypothesis, I introduced another dataset (Cifar-10, ref. 40) and evaluated the similarity of the characteristics of the different datasets. For the similarity calculation, the Cifar-10 images were grayscaled using a previously proposed method (ref. 41) and used at 28 × 28 pixels, consistent with the other datasets. Table 5 presents the similarity of MNIST, EMNIST, Fashion-MNIST, and Cifar-10. Here, the structural similarity (SSIM) was determined by randomly selecting 500,000 images from the datasets being compared, and the mean and variance were calculated as the similarity between the datasets.
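A minimal sketch of this similarity measurement is given below, assuming that the 500,000 samples are drawn as random image pairs from the two datasets being compared and that scikit-image's structural_similarity is an acceptable stand-in for the SSIM implementation used in the paper; the grayscale conversion of ref. 41 is approximated here by a standard luminance conversion.

```python
# Sketch (assumptions noted above) of the dataset-similarity measurement: random image
# pairs are drawn from the two datasets, SSIM is computed for each pair, and the mean
# and variance are reported. Images are assumed to be 28x28 grayscale floats in [0, 1].
import numpy as np
from skimage.metrics import structural_similarity as ssim
from skimage.transform import resize
from skimage.color import rgb2gray

def preprocess_cifar(img_rgb):
    """Grayscale a Cifar-10 image and resize it to 28x28 to match the other datasets."""
    return resize(rgb2gray(img_rgb), (28, 28), anti_aliasing=True)

def dataset_similarity(images_a, images_b, n_pairs=500_000, seed=0):
    """Mean and variance of SSIM over randomly sampled image pairs."""
    rng = np.random.default_rng(seed)
    ia = rng.integers(0, len(images_a), size=n_pairs)
    ib = rng.integers(0, len(images_b), size=n_pairs)
    scores = np.array([
        ssim(images_a[i], images_b[j], data_range=1.0) for i, j in zip(ia, ib)
    ])
    return scores.mean(), scores.var()
```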
The distance between datasets was defined as the inverse of the SSIM, and the positional relationship of the datasets on a two-dimensional plane was estimated via multidimensional scaling (MDS; ref. 41), as shown in Fig. 7.
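The MDS step can be sketched as follows, using scikit-learn's MDS with a precomputed dissimilarity matrix and the inverse of the mean SSIM as the pairwise distance, as described above; the similarity values in the matrix below are placeholders, not the values reported in Table 5.

```python
# Sketch of placing the four datasets on a plane with MDS. The mean-SSIM matrix is a
# placeholder; in practice it would come from dataset_similarity() above.
import numpy as np
from sklearn.manifold import MDS

names = ["MNIST", "EMNIST", "Fashion-MNIST", "Cifar-10"]
mean_ssim = np.array([
    [0.30, 0.25, 0.15, 0.10],
    [0.25, 0.30, 0.14, 0.10],
    [0.15, 0.14, 0.30, 0.12],
    [0.10, 0.10, 0.12, 0.30],
])
distance = 1.0 / mean_ssim          # distance defined as the inverse of the SSIM
np.fill_diagonal(distance, 0.0)     # a dataset is at zero distance from itself

coords = MDS(n_components=2, dissimilarity="precomputed", random_state=0).fit_transform(distance)
for name, (x, y) in zip(names, coords):
    print(f"{name}: ({x:.2f}, {y:.2f})")
```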
Figure 7. Location of each dataset estimated via MDS, where the points M, F, E, and C represent the locations of the MNIST, Fashion-MNIST, EMNIST, and Cifar-10 datasets, respectively, and the distance between points is proportional to the inverse of the similarity. The numbers on the horizontal and vertical axes are dimensionless.
As shown in Fig. 7, MNIST was more similar to EMNIST than to Fashion-MNIST. The newly introduced Cifar-10 is an image dataset whose characteristics are more distant from those of MNIST than those of either EMNIST or Fashion-MNIST. The hypothesis explains the result presented in "Performance comparison of EDL and m-EDL when class u is included in the training and test data (Q2)" that the accuracy for class u was higher in Case 1, in which u was trained with EMNIST and classified with test data containing Fashion MNIST, than in Case 2, in which u was trained with Fashion-MNIST and classified with test data containing EMNIST. The accuracy for class u was higher in Case 1 because the characteristics of EMNIST are closer than those of Fashion-MNIST to those of MNIST; m-EDL trained with EMNIST was able to identify Fashion-MNIST, which was given during testing and has more distant characteristics than EMNIST, as class u. To further verify this hypothesis, I compared the accuracy for class u in Case 3, in which class u was trained with Cifar-10 and classified with test data containing EMNIST, with those of Cases 1 and 2. If the hypothesis is correct, the accuracy for class u should decrease in the order Case 1 > Case 2 > Case 3.
Table 6 presents the accuracies of m-EDL for class u in each case. Indeed, the accuracy in Case 3 was the lowest, suggesting that if class u has characteristics close to those of class k during training, class u in the test can be detected as class u as long as the characteristics of class u given during testing are farther away than those used in training.