November 19, 2023
Ensuring logical consistency in predictions is a crucial yet overlooked aspect of multi-attribute classification. We explore the potential reasons for this oversight and introduce two pressing challenges to the field: 1) How can we ensure that a model, when trained with data checked for logical consistency, yields predictions that are logically consistent? 2) How can we achieve the same with data that has not undergone logical consistency checks? Minimizing manual effort is also essential for enhancing automation. To address these challenges, we introduce two datasets, FH41K and CelebA-logic, and propose LogicNet, an adversarial training framework that learns the logical relationships between attributes. The accuracy of LogicNet surpasses that of the next-best approach by 23.05%, 9.96%, and 1.71% on FH37K, FH41K, and CelebA-logic, respectively. In real-world case analysis, our approach reduces the average number of failed cases by more than 50% compared to other methods.
Where does the logical consistency problem arise in computer vision? In the realm of multi-attribute classification, models are trained to predict the attributes represented in a given image. Examples include facial attributes [1]–[3], clothing styles [4], [5], animal attributes [6], human action recognition [7], and others. Whenever multiple attributes are predicted for an image, logical relationships may exist among these attributes. For instance, in a popular scheme for predicting attributes in face images [1], the attributes "goatee", "no beard", and "mustache" are predicted independently. Logically, however, if "no beard" is predicted as true, then both "goatee" and "mustache" should be predicted as false. (Note that, in CelebA, a mustache is a type of beard according to the ground truth.) Logical consistency issues may arise in more subtle interactions as well. For example, if "wearing hat" is predicted as true, then the information required to make a prediction for "bald" is occluded. Similarly, if "wearing sunglasses" is predicted as true, then the information needed to predict "narrow eyes" or "eyes closed" is occluded. Analogous logical relationships exist in the clothing style dataset: "long-sleeve" and "sleeveless" cannot both be predicted as true for a single garment. If "floral" is predicted as false, then "floral print" cannot logically be predicted as true, since it is a specific type of floral design. It is evident that issues of logical consistency emerge in computer vision, particularly in the area of multi-attribute classification.
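To make these rules concrete, the sketch below (illustrative only, not tied to any dataset's released tooling) encodes the example relationships above as mutual exclusions and dependencies and checks a prediction against them; the attribute names are taken from the examples.

```python
# Illustrative rule checker for the examples above; the rule sets are
# assumptions drawn from the text, not an official specification.

# Mutual exclusion: if the key attribute is true, the listed ones must be false.
EXCLUSIONS = {
    "no_beard": ["goatee", "mustache"],   # no beard rules out goatee and mustache
    "sleeveless": ["long_sleeve"],        # a garment cannot be both
}

# Dependency: the key attribute can only be true if its parent is true
# ("floral print" is a specific type of floral design).
DEPENDENCIES = {"floral_print": "floral"}

def is_logically_consistent(pred: dict) -> bool:
    """pred maps attribute name -> bool; returns True iff no rule is violated."""
    for a, excluded in EXCLUSIONS.items():
        if pred.get(a) and any(pred.get(b) for b in excluded):
            return False
    for child, parent in DEPENDENCIES.items():
        if pred.get(child) and not pred.get(parent):
            return False
    return True

# This prediction violates the "no beard => no mustache" rule:
print(is_logically_consistent({"no_beard": True, "mustache": True}))  # False
```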
Why is logical consistency of predictions important? Face and body attributes have been extensively utilized in various research domains, including face matching/recognition [2], [8]–[12], re-identification [13]–[15], training GANs [16]–[19], bias analysis [3], [20]–[23], and others. For a fair accuracy comparison across demographic groups, it is pivotal to balance the distribution of non-demographic attributes among the groups [23]. To train a face editing GAN on a given attribute, it is necessary to classify training images based on that attribute. However, if images exhibit logically inconsistent sets of attribute values, these applications become problematic and error-prone. For example, a group studying how facial hair affects face recognition accuracy across demographic groups must tightly control the variation in facial hair. If a model predicts {"clean-shaven" and "beard-length-short"} or {"beard-at-chin-area" and "full-beard"} for the same image, such predictions place the same image in two conflicting categories and significantly distort the statistical observations. Hence, logical consistency of attribute predictions is crucial for essentially all higher-level computer vision tasks.
Why has logical consistency not received more attention? 1) Higher complexity and cost of considering logical relationships during attribute labeling. Labeling training images with attribute values is already a labor-intensive task; requiring the manually-assigned metadata to also be logically consistent makes it harder still. 2) Predominant focus on algorithmic accuracy over ground truth accuracy. Researchers often prioritize accuracy improvements on established benchmarks, which is commendable. However, as accuracy approaches a plateau, there may be a misconception that the problem has been solved, whereas the plateau might merely reflect the level of (in)consistency of the attribute values in the training data. 3) Ambiguity of attribute names. CelebA is a notable face attribute dataset, but [24], [25] report that ambiguous attributes, such as "high cheekbones", "pointy nose", and "oval face", are a serious problem. This problem afflicts not only CelebA but all face attribute datasets that use similar attributes. The ambiguity hinders logical consistency research, since it is hard to establish strong logical relationships between two ambiguous attributes. Consequently, none of the recent survey papers [26]–[29] mentions this crucial topic.
This paper introduces two challenging tasks to the domain of multi-attribute classification: (1) training a model with labels that have been checked for logical consistency, aiming to improve the accuracy and logical consistency of predictions without post-processing steps; and (2) training a model with labels that have not been checked for logical consistency, with the same aims. The contributions of this work include:
Provide an explanation of why logical consistency of predictions is a crucial but overlooked topic, and introduce two challenging tasks.
Provide a larger benchmark, FH41K, with more samples and better balance across attributes, to better evaluate performance on facial hair attribute classification.
Provide a set of logical-consistency-checked annotations for the CelebA validation and test sets, to support a more challenging task: training a logically consistent model with data unchecked for logical consistency.
Propose an adversarial training method, LogicNet, to achieve higher accuracy and lower logical inconsistency across three datasets.
In the NLP domain, logical reasoning is a crucial topic, and a detailed discussion appears in a recent survey [30]. Various logical-reasoning-oriented benchmarks [31]–[35] allow researchers to extract the rationales needed to improve the logical consistency of results.
In the Computer Vision domain, a myriad of attribute relationships have been leveraged to enhance performance, including positional relationships [36], [37], correlational relationships [38]–[40], and logical relationships [3], [41]. Such relationships facilitate a deeper understanding and processing of visual data, thereby contributing to the advancement of the field. However, to the best of our knowledge, except for [3], none of the previous works considered the logical consistency of predictions. [3] proposed a Logical Consistency Prediction loss (LCPloss) to leverage the logical relationships between attributes and encourage logically consistent predictions. Tables 2 and 3 of this work indicate that, once logical consistency of predictions is taken into account, accuracy drops significantly. Although the post-processing step proposed in [3], the label compensation (LC) strategy, repairs a large number of logically inconsistent predictions, it is not a general solution and needs intensive manual work to design properly. Moreover, since existing multi-attribute classification datasets were assembled without considering logical relationships, manually cleaning them is costly; how to force a model trained on an unclean dataset to make accurate and logically consistent predictions is therefore a crucial problem.
This paper reports on the first general method, LogicNet, for causing the learned model to make logically consistent predictions. This work also provides a benchmark for understanding and designing approaches to further research on the problem of logical consistency of predictions.
The ambiguity of attribute names and the reasons listed in Section 1 have resulted in a lack of datasets appropriate for evaluating model performance along the dimension of logical consistency.
FH37K is the first dataset whose annotations are checked for both logical consistency and accuracy. It contains 37,565 images, drawn from a subset of CelebA [1] and a subset of WebFace260M [42]. Each image is annotated with 22 attributes of facial hair and baldness. However, due to the small number of positive samples for the attributes "Long" and "Bald Sides Only", FH37K offers insufficient train/val/test samples for these classes. To address this, we augment the dataset by adding more positive samples to the minority classes. Note that FH37K remains one of the benchmark datasets in this paper.
FH41K is our extended dataset based on FH37K. We added 3,712 images of 2,096 identities from WebFace260M [42], specifically to increase the number of positive examples of attributes that were underrepresented in FH37K. Specifically, we used the best facial hair classification model trained on FH37K to select images with confidence higher than 0.8 for both "Long" and "Bald Sides Only". A human annotator, informed by the documentation provided by [3], then manually checked the selected images to ensure the accuracy and logical consistency of the added images.
Both FH37K and FH41K have a set of rigorously defined rules based on three kinds of logical relationships: mutual exclusion, dependency, and collective exhaustiveness. The annotations are evaluated against these relationships. However, generating training sets with accurate and logically consistent attribute labels is an expensive and time-consuming process, and existing datasets were created without considering the logical consistency of annotations. This raises an important question: is it possible to train a model to produce logically consistent attribute predictions using a training dataset that does not have logically consistent annotations? We compiled an additional dataset specifically to study this question.
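The following sketch shows one way the three relationship types could be audited for a single image's labels. The Beard Area group comes from the examples in this paper, while the rule encoding and the dependency pair are our illustrative assumptions, not the released FH37K/FH41K rule set.

```python
# Hedged sketch of auditing the three relationship types; rule lists are
# illustrative assumptions, not the official FH37K/FH41K rules.
MUTUALLY_EXCLUSIVE = [("no_beard", "goatee")]
DEPENDENT = [("beard_length_short", "beard_present")]  # hypothetical pair
COLLECTIVELY_EXHAUSTIVE = [
    # Beard Area group: exactly one of these must be positive.
    ["clean_shaven", "chin_area", "side_to_side", "info_not_visible"],
]

def audit(labels: dict) -> list:
    """Return the list of rule violations for one image's attribute labels."""
    v = []
    for a, b in MUTUALLY_EXCLUSIVE:
        if labels.get(a) and labels.get(b):
            v.append(f"impossible: {a} and {b} are both true")
    for child, parent in DEPENDENT:
        if labels.get(child) and not labels.get(parent):
            v.append(f"impossible: {child} is true without {parent}")
    for group in COLLECTIVELY_EXHAUSTIVE:
        n_pos = sum(bool(labels.get(a)) for a in group)
        if n_pos == 0:
            v.append("incomplete: no positive label in " + ", ".join(group))
        elif n_pos > 1:
            v.append("impossible: multiple positives in " + ", ".join(group))
    return v
```

An "incomplete" violation here corresponds to the \(N_{incomp}\) counts reported later, and an "impossible" violation to \(N_{imp}\).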
CelebA-logic is a variant of CelebA in which the logical relationships between attributes are checked for both the validation and test sets. Given the absence of a definitive guide to how the 40 attributes were marked and how each attribute is defined, we categorized the attribute relationships into three groups based on our knowledge, as shown in Figure 2. To make a fair set of logical rules, only Strong relationships are used to check logical consistency. Moreover, [24], [25] reported that CelebA suffers from a substantial rate of inaccurate annotations, so we cleaned the annotations of the strong-relationship attributes on top of the MSO-cleaned annotations [24]. To obtain cleaned facial hair and baldness attributes, we converted the FH37K annotations back to the CelebA format and updated the labels of the corresponding images. Two human annotators then marked "Bangs", "Receding Hairline", and "Male" according to the designed definitions for all images in the validation and test sets. To estimate the consistency and accuracy of the new annotations, a third human annotator familiar with the definitions marked 1,000 randomly selected samples; the estimated consistency rate is 93.87%. As a result: 1) all images are cropped and aligned to 224x224 based on the given landmarks; 2) 975 images are omitted from the original dataset; 3) 63,557 (31.8%) images have at least one label different from the original; and 4) all validation and test annotations obey the Strong logical relationships.
To provide a solution for the challenges of logical consistency, we propose LogicNet, which exploits an adversarial training strategy and a label generation algorithm, Bag of Labels (BoL). LogicNet enables the classifier to learn the logical relationship between attributes, thereby enhancing the model’s capacity to generate logically consistent predictions.
We propose an adversarial training framework, shown in Figure 3, to compel the classifier \(\mathcal{C}\) to make logically consistent predictions while improving the accuracy of predictions. Formalizing the desired goal, we consider a set of training images \(X = \{x_1, x_2,..., x_N\}\), from which we want to train a model, \(\mathcal{F}(X)\), that projects \(X\) to the ground truth labels \(L_{gt} = \{l_1, l_2,...,l_N\}\), where each \(l_i\) is the set of attribute labels of \(x_i\). The classification loss is the binary cross entropy loss: \[\begin{align} \mathcal{L}_{bce}(\mathcal{F}(X;\Phi), L_{gt}) = &-\frac{1}{N}\sum^{N}_{i=1}[l_{i}\log(\mathcal{F}(x_i;\Phi))\\&+ (1-l_{i})\log(1-\mathcal{F}(x_i;\Phi))] \end{align}\] where \(\Phi\) is the parameter vector of the classifier \(\mathcal{C}\). For the adversarial learning, a discriminator that can judge the logical consistency of predictions is needed. Here, we use a simple and effective multi-headed self-attention network to output a probability, \(\mathcal{P}_{logic}\in [0, 1]\), of the logical consistency of labels \(L'\). The loss of the multi-attribute classifier, \(\mathcal{L}_C\), becomes: \[\underset{\Phi}{\min}\, \underset{\Theta}{\max}\;(1-\lambda)\mathcal{L}_{bce}(\mathcal{F}(X;\Phi), L_{gt})+\lambda \log(1-\mathcal{D}(\mathcal{F}(X;\Phi);\Theta))\] where \(\mathcal{D}\) is the discriminator, whose parameter vector \(\Theta\) is frozen during the classifier update, and \(\lambda\) controls the loss trade-off.
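A minimal PyTorch-style sketch of one classifier update under this objective follows; `clf`, `disc`, and the exact \(\log(1-\mathcal{D}(\cdot))\) form of the adversarial term are our reading of the equation above, simplified, and not the authors' released code.

```python
import torch
import torch.nn.functional as F

def classifier_step(clf, disc, opt_clf, images, labels, lam=0.15):
    """One update of the classifier C with the discriminator D frozen.

    clf:    maps images -> per-attribute logits (assumed)
    disc:   maps a label/probability vector -> P_logic in [0, 1] (assumed)
    labels: float tensor of 0/1 ground truth attribute labels
    lam:    loss trade-off (lambda in the objective above)
    """
    for p in disc.parameters():                   # freeze D for this step
        p.requires_grad_(False)
    preds = torch.sigmoid(clf(images))            # F(X; Phi)
    bce = F.binary_cross_entropy(preds, labels)   # L_bce
    p_logic = disc(preds)                         # probability of consistency
    adv = torch.log(1.0 - p_logic + 1e-8).mean()  # adversarial term
    loss = (1.0 - lam) * bce + lam * adv          # classifier wants p_logic -> 1
    opt_clf.zero_grad()
    loss.backward()
    opt_clf.step()
    for p in disc.parameters():                   # unfreeze D afterwards
        p.requires_grad_(True)
    return loss.item()
```

Minimizing \(\log(1-\mathcal{D}(\cdot))\) drives the classifier's predictions toward label vectors the discriminator judges logically consistent, while the BCE term keeps them anchored to the ground truth.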
To train a discriminator, the straightforward approach [40] is to directly feed the predictions (ground truth labels) to it as negative (positive) samples. However, since the training labels of CelebA have not been cleaned, using them this way could mislead the discriminator and cause it to learn incorrect patterns. Hence, we propose a Bag of Labels algorithm that automatically generates logically inconsistent labels from the given ground truth labels while also detecting the logical consistency of the original labels. This algorithm is used in two parts of the LogicNet approach: Condition Group Setup and Label Poisoning.
Condition Group Setup: To assign accurate logic labels \(L_{logic}\) to \(L_{gt}\) according to the rules, we separate the attributes of each rule into two groups, \(g_{c1}\) and \(g_{c2}\), where the attributes in \(g_{c1}[i]\) have strong logical relationships with the attributes in \(g_{c2}[i]\). For FH37K and FH41K, we follow the rules given by [3]; for CelebA, we follow the rules in Figure 2.
Label Poisoning: To generate logically inconsistent labels, we categorize the rules into three cases: inter-class impossible poisoning, intra-class impossible poisoning, and intra-class incomplete poisoning. Inter-class impossible poisoning generates labels where the logical inconsistency occurs between attributes in different classes (e.g., Beard Area (clean shaven)=true and Beard Length (short)=true; no beard=true and goatee=true). Intra-class impossible and intra-class incomplete poisoning generate labels with multiple positive predictions within one class (e.g., Beard Area (clean shaven)=true and Beard Area (chin area)=true) or with no positive prediction within one class, respectively. These two poisoning strategies apply to FH37K and FH41K; attributes in CelebA do not have this level of detail and so do not have these logical relationships. After each poisoning, the initialized logic labels, \(L_{logic}\), are updated on the fly. The objective function is: \[\underset{\Theta}{\min}\;\mathcal{L}_{\mathcal{D}} = \mathcal{L}_{bce}(L_{logic}, \mathcal{D}(L'))\] where \[L' = \begin{cases} L_{bol}, & N_{random} > 0.5\\ L_{pred}, & \text{otherwise} \end{cases}\] Here, \(L_{bol}\) comes from the BoL algorithm, \(L_{pred}\) comes from the classifier, and \(N_{random}\) is a random number drawn uniformly from \([0, 1]\).
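A simplified sketch of the BoL mixing step and one poisoning case is given below; the condition groups and the restriction to inter-class impossible poisoning are our illustrative assumptions, not the full algorithm.

```python
import random

# Hypothetical condition groups in the spirit of Condition Group Setup:
# attributes in g_c1[i] may not be positive together with those in g_c2[i].
g_c1 = [["no_beard"]]
g_c2 = [["goatee", "mustache"]]

def violates_rules(l: dict) -> bool:
    """True if any condition-group rule is violated."""
    return any(
        any(l.get(a) for a in c1) and any(l.get(b) for b in c2)
        for c1, c2 in zip(g_c1, g_c2)
    )

def bol_poison(l_gt: dict):
    """Inter-class impossible poisoning: force one rule violation and
    return the poisoned labels with logic label 0 (inconsistent)."""
    l = dict(l_gt)
    i = random.randrange(len(g_c1))
    l[random.choice(g_c1[i])] = True
    l[random.choice(g_c2[i])] = True
    return l, 0

def discriminator_pair(l_gt: dict, l_pred: dict):
    """Build one discriminator training pair (L', L_logic), mixing BoL
    output and classifier predictions as in the definition of L'."""
    if random.random() > 0.5:               # N_random > 0.5 -> use L_bol
        return bol_poison(l_gt)
    return l_pred, int(not violates_rules(l_pred))
```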
Methods | FH37K | FH37K | FH37K | FH41K | FH41K | FH41K
---|---|---|---|---|---|---
 | \(Acc_{avg}\) | \(Acc^{n}_{avg}\) | \(Acc^{p}_{avg}\) | \(Acc_{avg}\) | \(Acc^{n}_{avg}\) | \(Acc^{p}_{avg}\)
Logical consistency is not taken into account | | | | | |
BCE | 79.23 | 94.72 | 63.73 | 83.88 | 95.50 | 72.27
BCE-MOON | 86.21 | 90.67 | 81.75 | 88.02 | 91.29 | 84.75
BF | 76.92 | 95.43 | 58.41 | 75.81 | 97.78 | 52.85
BCE+LCP | 79.64 | 95.98 | 63.30 | 84.93 | 95.09 | 74.77
Ours | 83.65 | 93.46 | 73.83 | 85.66 | 94.23 | 77.10
W/ label compensation | | | | | |
BCE\(^\dagger\) | 80.14 | 91.49 | 68.78 | 79.12 | 87.43 | 70.81
BCE-MOON\(^\dagger\) | 42.59 | 50.55 | 34.62 | 42.79 | 47.96 | 37.61
BF\(^\dagger\) | 78.48 | 90.91 | 66.05 | 82.85 | 93.53 | 73.17
BCE+LCP\(^\dagger\) | 81.44 | 92.65 | 70.23 | 79.31 | 87.31 | 71.31
Ours\(^\dagger\) | 78.28 | 87.23 | 69.32 | 81.53 | 89.10 | 73.96
W/o label compensation (what we care about!) | | | | | |
BCE\(^\dagger\) | 48.50 | 54.59 | 42.40 | 56.71 | 62.14 | 51.27
BCE-MOON\(^\dagger\) | 40.25 | 47.54 | 32.95 | 40.68 | 45.39 | 35.98
BF\(^\dagger\) | 36.20 | 40.95 | 31.45 | 22.38 | 23.84 | 20.92
BCE+LCP\(^\dagger\) | 38.69 | 43.70 | 33.67 | 64.54 | 70.40 | 58.67
Ours\(^\dagger\) | 71.55 | 79.37 | 63.73 | 74.50 | 81.41 | 67.59
Methods | W/o considering logical consistency | | | Considering logical consistency (what we care about!) | |
---|---|---|---|---|---|---
 | \(Acc_{avg}\) | \(Acc^{n}_{avg}\) | \(Acc^{p}_{avg}\) | \(Acc_{avg}\) | \(Acc^{n}_{avg}\) | \(Acc^{p}_{avg}\)
AFFACT (original) | 81.25 | 95.72 | 66.78 | 79.11 | 93.55 | 64.67 |
ALM (original) | 81.97 | 94.25 | 69.69 | 79.04 | 91.04 | 67.03 |
AFFACT | 79.71 | 95.48 | 63.95 | 77.72 | 93.31 | 62.12 |
ALM | 80.53 | 94.10 | 66.95 | 77.63 | 90.88 | 64.39 |
BCE | 80.89 | 94.96 | 66.70 | 77.94 | 92.34 | 63.54 |
BCE-MOON | 87.13 | 87.95 | 86.32 | 74.76 | 76.24 | 73.28 |
BF | 76.44 | 96.77 | 56.11 | 75.28 | 95.82 | 54.75 |
BCE+LCP | 81.91 | 94.16 | 69.66 | 78.07 | 90.26 | 65.87 |
Ours | 82.18 | 93.74 | 70.63 | 79.08 | 90.89 | 67.28 |
In this section, we evaluate the proposed approach on two dimensions: accuracy and logical consistency. For accuracy, the traditional average accuracy measurement (Eq. 1, where \(N\) = total number of images, \(N_{tp}\) = number of true positive predictions, \(N_{tn}\) = number of true negative predictions) ignores the unbalanced numbers of positive and negative images for each attribute: \[AccT_{avg} = \frac{1}{N}(N_{tp} + N_{tn})\times100 \label{eq:traditional}\tag{1}\] This yields an unfair measure of model performance, since multi-attribute classification datasets suffer from sparse annotations. For example, with the original CelebA annotations, predicting all attributes as negative achieves an overall test accuracy of 76.87%. Hence, we follow the suggestion in [43] and average the positive accuracy, \(Acc^{p}_{avg}\), and the negative accuracy, \(Acc^{n}_{avg}\), to account for the imbalance: \[Acc_{avg} = \frac{1}{2}(Acc^{p}_{avg}+Acc^{n}_{avg})\] In addition, to show how logical consistency of predictions affects accuracy, we measure performance under two conditions: 1) without considering the logical consistency of predictions; 2) considering the logical consistency of predictions, in which case logically inconsistent predictions are deemed incorrect. For FH37K and FH41K, we also include experiments with the label compensation strategy [3] to complete the accuracy comparison. To measure model performance on logical consistency, we checked the logical consistency of predictions on 600K images from WebFace260M [42]. We also independently compare accuracy on the strong-relationship attributes in CelebA-logic.
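A small sketch of this balanced accuracy computation (our implementation of the formulas above, assuming binarized predictions) follows:

```python
import numpy as np

def balanced_accuracy(preds: np.ndarray, labels: np.ndarray):
    """Compute Acc_avg = (Acc^p_avg + Acc^n_avg) / 2 as defined above.

    preds, labels: (num_images, num_attributes) arrays of 0/1 values;
    per-attribute accuracies are averaged over attributes.
    """
    pos = labels == 1
    neg = labels == 0
    acc_p = ((preds == 1) & pos).sum(0) / np.maximum(pos.sum(0), 1)
    acc_n = ((preds == 0) & neg).sum(0) / np.maximum(neg.sum(0), 1)
    acc_p_avg = 100.0 * acc_p.mean()
    acc_n_avg = 100.0 * acc_n.mean()
    return 0.5 * (acc_p_avg + acc_n_avg), acc_n_avg, acc_p_avg
```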
To comprehensively study the lack of consideration of logical consistency in model predictions, we choose four training methods for comparison. Binary Cross Entropy loss (BCE) is a baseline that only considers the entropy between predictions and ground truth labels. Binary Focal loss (BF) [44] focuses more on hard samples in order to mitigate the effect of imbalanced data. BCE-MOON [45] weights the loss of each attribute by the ratio of positive and negative samples before back-propagation, attempting to balance the influence of positive and negative samples. Logically Consistent Prediction loss (BCE+LCP) [3] uses conditional probability to force the joint probability of mutually exclusive attributes toward 0 and the joint probability of dependent attributes toward 1.
We train all classifiers starting from the ResNet50 [46] pretrained weights in PyTorch. The FH37K results in Table 1 are adopted from [3], except the values of \(Acc_{avg}\). We resize images to 224x224 for all three datasets. The batch size and learning rate are {256, 0.0001} for FH37K and FH41K, and {64, 0.001} for CelebA-logic. We use random horizontal flips for FH37K and FH41K, and random horizontal flips, color jitter, and random rotation for CelebA-logic. AFFACT [47] and ALM [48] are the two SOTA models available online, which we use for performance comparison on CelebA-logic. The \(\lambda\) values for FH37K, FH41K, and CelebA are \(\{0.15, 0.2, 0.1\}\), respectively. The discriminator consists of 8 multi-headed self-attention blocks, with no position embedding. Note that the ALM algorithm resizes the original (178x218) CelebA images to 128x128 for testing; the other methods use the cropped images described in Section 3.
Methods | 5_o_Clock_Shadow | Bald | Bangs | Goatee | Male | Mustache | No_Beard | \(^*\)Hairline | \(^*\)Hat | \(Acc_{avg}\)
---|---|---|---|---|---|---|---|---|---|---
AFFACT | 72.24 | 90.27 | 85.36 | 70.41 | 95.51 | 61.86 | 85.47 | 65.69 | 93.93 | 80.08 |
ALM | 76.34 | 81.81 | 85.08 | 74.27 | 93.88 | 64.51 | 87.66 | 62.84 | 90.62 | 79.67 |
BCE | 65.88 | 75.68 | 86.69 | 76.73 | 94.78 | 87.80 | 80.68 | 67.28 | 89.96 | 80.61 |
BCE-MOON | 69.79 | 70.77 | 82.22 | 80.73 | 82.09 | 82.73 | 77.40 | 59.67 | 84.20 | 76.62 |
BF | 59.38 | 78.05 | 78.52 | 69.33 | 96.82 | 87.59 | 83.12 | 62.88 | 92.13 | 78.65 |
BCE+LCP | 69.83 | 82.91 | 84.00 | 75.33 | 92.72 | 88.54 | 81.30 | 63.57 | 89.70 | 80.88 |
Ours | 68.10 | 86.03 | 87.79 | 79.05 | 94.54 | 89.40 | 81.45 | 65.55 | 91.39 | 82.59 |
Methods | Trained on FH37K | Trained on FH37K | Trained on FH37K | Trained on FH41K | Trained on FH41K | Trained on FH41K
---|---|---|---|---|---|---
 | \(N_{incomp}\) | \(N_{imp}\) | \(R_{failed}\) (%) | \(N_{incomp}\) | \(N_{imp}\) | \(R_{failed}\) (%)
W/ label compensation | | | | | |
BCE | 0 | 11,134 | 1.84 | 0 | 7,464 | 1.24
BCE-MOON | 0 | 330,115 | 54.66 | 0 | 341,114 | 56.48
BF | 0 | 14,007 | 2.32 | 0 | 3,530 | 0.58
BCE+LCP | 0 | 5,595 | 0.93 | 0 | 5,788 | 0.96
Ours | 0 | 21,731 | 3.60 | 0 | 19,194 | 3.18
W/o label compensation (what we care about!) | | | | | |
BCE | 240,761 | 6,001 | 40.86 | 352,061 | 585 | 58.39
BCE-MOON | 31,512 | 313,044 | 57.05 | 34,415 | 321,872 | 59.00
BF | 339,136 | 1,295 | 56.37 | 587,056 | 0 | 97.21
BCE+LCP | 307,576 | 300 | 50.98 | 248,768 | 2,416 | 41.59
Ours | 139,184 | 14,660 | 25.47 | 133,245 | 13,838 | 24.36
Table 1 and Table 2 show the accuracy values on FH37K, FH41K, and CelebA-logic under the two measurement conditions. In the traditional case of not considering logical consistency of predictions, every method reaches \(>75\%\) average accuracy, and BCE-MOON is {2.56%, 2.36%, 4.95%} higher than the next-highest accuracy on {FH37K, FH41K, CelebA-logic}. The main reason is that BCE-MOON has outstanding performance on positive label prediction, {7.92%, 7.65%, 15.69%} higher than the second-highest accuracy on the three datasets. However, when logical consistency is considered, BCE-MOON suffers a significant accuracy decrease, {45.96%, 47.43%, 12.37%}, on the three datasets. Note that the accuracy decrease occurs across all training methods.
For FH37K and FH41K, excluding the proposed method, the average decreases in accuracy are 39.59% and 37.03%, respectively. Seven of the eight results fall below \(60\%\) accuracy, and the lowest accuracies are only 36.2% and 22.38%. These results show how severely the traditional methods suffer from predicting logically inconsistent labels; note that these methods were designed to solve different problems in multi-attribute classification. The proposed method decreases in accuracy by only {12.1%, 11.16%}, and its overall accuracy is {23.05%, 9.96%} higher than the second-highest accuracy and {35.35%, 52.12%} higher than the lowest.
[3] proposed a post-processing step, termed the label compensation strategy, to resolve incomplete predictions. Using this strategy, the methods except BCE-MOON show a significant increase in accuracy, 30.85% on average. This leads to two conclusions: 1) methods that aim to mitigate the effect of imbalanced data might give an illusion of high accuracy driven by positive predictions; 2) other methods can partially capture the logical patterns, but need post-processing steps to do so. However, the label compensation strategy only addresses the collectively exhaustive case (i.e., the model must give exactly one positive prediction in an attribute group). For example, in FH37K and FH41K, the attributes {clean-shaven, chin-area, side-to-side, beard-area-information-not-visible} in the Beard Area group cover every case related to beard area. Implementing this type of strategy requires extensive manual analysis to determine a judicious decision process, underscoring the need for continued research in this domain.
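One reading of the label compensation strategy, sketched for a single collectively exhaustive group (the group indices below are hypothetical), is:

```python
import numpy as np

# Hypothetical indices of the Beard Area group within the attribute vector:
# {clean-shaven, chin-area, side-to-side, beard-area-information-not-visible}.
GROUPS = [[0, 1, 2, 3]]

def label_compensation(probs: np.ndarray, thresh: float = 0.5) -> np.ndarray:
    """If a collectively exhaustive group has no positive prediction,
    promote its most confident attribute; a sketch of the strategy in [3].
    probs: one image's per-attribute confidence scores."""
    preds = (probs >= thresh).astype(int)
    for g in GROUPS:
        if preds[g].sum() == 0:                     # incomplete prediction
            preds[g[int(np.argmax(probs[g]))]] = 1  # most confident attribute
    return preds
```

This repairs incomplete predictions but cannot resolve impossible ones, which is why it is not a general solution.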
Methods | W/o considering logical consistency | | | Considering logical consistency (what we care about!) | |
---|---|---|---|---|---|---
 | \(Acc_{avg}\) | \(Acc^{n}_{avg}\) | \(Acc^{p}_{avg}\) | \(Acc_{avg}\) | \(Acc^{n}_{avg}\) | \(Acc^{p}_{avg}\)
FH37K | | | | | |
LogicNet (preds) | 82.63 | 93.42 | 71.83 | 65.94 | 74.18 | 57.70
LogicNet (BoL) | 81.90 | 93.04 | 70.77 | 65.04 | 73.23 | 56.86
LogicNet (preds + BoL) | 83.65 | 93.46 | 73.83 | 71.55 | 79.37 | 63.73
FH41K | | | | | |
LogicNet (preds) | 85.72 | 94.12 | 77.32 | 72.83 | 79.38 | 66.28
LogicNet (BoL) | 85.48 | 94.05 | 76.91 | 73.03 | 79.96 | 66.11
LogicNet (preds + BoL) | 85.66 | 94.23 | 77.10 | 74.50 | 81.41 | 67.59
CelebA-logic | | | | | |
LogicNet (preds) | 81.46 | 94.42 | 68.49 | 77.95 | 91.07 | 64.82
LogicNet (BoL) | 80.65 | 94.18 | 67.12 | 78.15 | 91.72 | 64.58
LogicNet (preds + BoL) | 82.18 | 93.74 | 70.63 | 79.08 | 90.89 | 67.28
For CelebA-logic, when considering logical consistency of predictions, the patterns echo the previous observations. For both AFFACT and ALM, we use the original model weights provided by the authors. The top half of Table 2 shows that, whether using the original annotations or the cleaned annotations, there is a 2.49% accuracy decrease after considering logical consistency. The average accuracy decrease of the models tested on the cleaned annotations is 4.04%, where BF has the smallest accuracy difference and BCE-MOON the largest. Our speculation is that BF over-focuses on negative attributes, while logical relationships mostly involve positive labels, so BF is less likely to violate the logical relationships; conversely, BCE-MOON over-focuses on the positive side, so it is more likely to violate them. Results in Table 2 and Table 3 show that the proposed method has the best average accuracy over all attributes and over the strong-relationship attributes, {1.01%, 1.71%} higher than the second-highest accuracy, respectively. Therefore, the proposed method has the best ability to learn the logical relationships.
To evaluate the logical consistency of predictions in a real-world case, we use a subset of WebFace260M containing 603,910 images as a test set. Since there are no ground truth labels, we only measure the ratio of failed (logically inconsistent) predictions for each method.
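The failed ratio reported below is simply the fraction of predictions flagged as incomplete or impossible; a minimal sketch, assuming dataset-specific rule-checker callables like the audit sketch earlier:

```python
def failed_ratio(pred_list, is_incomplete, is_impossible):
    """Count incomplete and impossible predictions and return
    R_failed = (N_incomp + N_imp) / N * 100, as reported in Table 4.
    is_incomplete / is_impossible: dataset-specific rule checkers."""
    n_incomp = sum(1 for p in pred_list if is_incomplete(p))
    n_imp = sum(1 for p in pred_list if is_impossible(p))
    return n_incomp, n_imp, 100.0 * (n_incomp + n_imp) / len(pred_list)
```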
Table 4 shows that, without the post-processing step, the average failed ratio for the four commonly used methods ranges from 51% to 64.05%. BF trained with FH41K predicts too many negative labels, causing an outlier ratio of 97.21%. The proposed method significantly reduces the number of failed cases: its failed ratio, {25.47%, 24.36%}, is less than half of the average failed ratio. When the post-processing strategy is applied, all incomplete cases are eliminated, resulting in a low failed ratio for every method except BCE-MOON. This supports the earlier speculation that BCE-MOON over-focuses on the positive side, and that the existing methods can partially learn the logical patterns but require post-processing steps. The logical consistency test on the classifiers trained with CelebA-logic is in the Supplementary Material.
To show the effectiveness of our adversarial training design, we compared three ways of training the discriminator: 1) directly feed the predictions to the discriminator as negative samples; 2) directly feed the poisoned labels to the discriminator; 3) randomly feed either the predictions or the poisoned labels to the discriminator. The proposed combination achieves accuracy increases of {5.61%, 1.47%, 0.93%} over the next-best variant on the three datasets when logical consistency is considered.
We point out that the problem of logical consistency of attribute predictions in computer vision has received little attention to date. To fill this void, we provide two new datasets for two logical consistency challenges: 1) train a classifier with logical-consistency-checked data so that it makes logically consistent predictions, and 2) train a classifier with training data containing logically inconsistent labels and still achieve logically consistent predictions. To the best of our knowledge, this is the first work that comprehensively discusses the problem of logical consistency of predictions in multi-attribute classification.
We propose LogicNet, which involves no post-processing step and significantly increases performance, {23.05% (FH37K), 9.96% (FH41K), 1.71% (CelebA-logic)} higher than the second best, under the logical-consistency-checked condition on all three datasets. In the real-world case analysis, the proposed method greatly reduces the failed ratio of predictions.
The proposed method provides a general way to make model predictions more logically consistent than previous methods, but the accuracy gap before and after considering logical consistency remains large, and the failed ratio is not negligible for either challenge. Further research is needed to improve logical consistency in attribute predictions.