That Explains It…
Inferior temporal (IT) cortex in human and nonhuman primates serves visual object recognition. Computational object-vision models, although continually improving, do not yet reach human performance. It is unclear to what extent the internal representations of computational models can explain the IT representation. Here we investigate a wide range of computational model representations (37 in total), testing their categorization performance and their ability to account for the IT representational geometry. The models include well-known neuroscientific object-recognition models (e.g. HMAX, VisNet) along with several models from computer vision (e.g. SIFT, GIST, self-similarity features, and a deep convolutional neural network). We compared the representational dissimilarity matrices (RDMs) of the model representations with the RDMs obtained from human IT (measured with fMRI) and monkey IT (measured with cell recording) for the same set of stimuli (not used in training the models). Better performing models were more similar to IT in that they showed greater clustering of representational patterns by category. In addition, better performing models also more strongly resembled IT in terms of their within-category representational dissimilarities. Representational geometries were significantly correlated between IT and many of the models. However, the categorical clustering observed in IT was largely unexplained by the unsupervised models. The deep convolutional network, which was trained by supervision with over a million category-labeled images, reached the highest categorization performance and also best explained IT, although it did not fully explain the IT data. Combining the features of this model with appropriate weights and adding linear combinations that maximize the margin between animate and inanimate objects and between faces and other objects yielded a representation that fully explained our IT data. 
Overall, our results suggest that explaining IT requires computational features trained through supervised learning to emphasize the behaviorally important categorical divisions prominently reflected in IT.
Computers cannot yet recognize objects as well as humans can. Computer vision might learn from biological vision. However, neuroscience has yet to explain how brains recognize objects and must draw from computer vision for initial computational models. To make progress with this chicken-and-egg problem, we compared 37 computational model representations to representations in biological brains. The more similar a model representation was to the high-level visual brain representation, the better the model performed at object categorization. Most models did not come close to explaining the brain representation, because they missed categorical distinctions between animates and inanimates and between faces and other objects, which are prominent in primate brains. A deep neural network model that was trained by supervision with over a million category-labeled images and represents the state of the art in computer vision came closest to explaining the brain representation. Our brains appear to impose upon the visual input certain categorical divisions that are important for successful behavior. Brains might learn these divisions through evolution and individual experience. Computer vision similarly requires learning with many labeled images so as to emphasize the right categorical divisions.
This raises the question of whether any existing computational vision models, whether motivated by engineering or neuroscientific objectives, can more fully explain the IT representation and account for the IT category clustering. IT clearly represents visual shape. However, the degree to which categorical divisions and semantic dimensions are also represented is a matter of debate. If visual features constructed without any knowledge of either category boundaries or semantic dimensions reproduced the categorical clusters, then we might think of IT as a purely visual representation. To the extent that knowledge of categorical boundaries or semantic dimensions is required to build an IT-like representation, IT is better conceptualized as a visuo-semantic representation.
We also tested models that were supervised with category labels. Two of the models (GMAX and supervised HMAX) were trained in a supervised fashion to distinguish animates from inanimates, using 884 training images. In addition, we tested a deep supervised convolutional neural network, trained by supervision with over a million category-labeled images from ImageNet.
Internal representations of the HMAX model (the C2 stage) and several computer-vision models performed well on EVC. Most of the models captured some component of the representational dissimilarity structure in IT and other visual regions. Several models clustered the human faces, which were mostly frontal and had a high amount of visual similarity. However, all the unsupervised models failed to cluster human and animal faces that were very different in visual appearance in a single face cluster, as seen for human and monkey IT. The unsupervised models also failed to replicate IT's clear animate/inanimate division. The deep supervised convolutional network better captured the categorical divisions, but did not fully replicate the categorical clustering observed in IT. We proceeded to remix the features of the deep supervised model to emphasize the major categorical divisions of IT using maximum-margin linear discriminants. In order to construct a representation resembling IT, we combined these discriminants with the different representational stages of the deep network, weighting each discriminant and layer of the deep network so as to best explain the IT representational geometry. The resulting IT-geometry model, when tested with crossvalidation to avoid overfitting to the image set, explains our IT data. Our results suggest that intensive supervised training with large sets of labeled images might be necessary to model the IT representational space.
The IT RDMs (black frames) for human (A) and monkey (B) and the seven most highly correlated model RDMs (excluding the representations in the strongly supervised deep convolutional network). The model RDMs are ordered from left to right and top to bottom by their correlation with the respective IT RDM. These are the seven most highly correlated RDMs among the 27 models that were not strongly supervised and their combination model (combi27). Biologically motivated models are in black, computer-vision models are in gray. The number below each RDM is the Kendall τA correlation coefficient between the model RDM and the respective IT RDM. All correlations are statistically significant. For statistical inference, see Figure 2. For model abbreviations and RDM-correlation p values, see Table 1. For other brain ROIs (i.e. LOC, PPA, FFA, EVC) see Figure S1 and Table 1. The RDMs here are 96×96, including the four stimuli we did not have monkey data for. The corresponding rows and columns are shown in blue in the mIT RDM and were ignored in the RDM comparisons.
Descriptive category-clustering analysis as in Figure 3, but for the deep supervised network. We used a linear combination of category-cluster RDMs (Figure S5) to model the categorical structure. The fitted linear combination of category-cluster RDMs is shown in the middle columns. This descriptive visualization shows to what extent different categorical divisions are prominent in each layer of the deep supervised model. The layers show some of the categorical divisions emerging. However, remixing of the features (linear SVM readout) is required to emphasize the categorical divisions to a degree that is similar to IT. The final IT-geometry-supervised layer (weighted combination of layers and SVM discriminants) has a categorical structure that is very similar to IT. Overfitting to the image set was avoided by crossvalidation. For statistical inference, see Figure 9.
Among the not-strongly-supervised models, the seven models with the highest RDM correlations with hIT and mIT are shown in Figure 1 (for other brain regions, see Figure S1 and Table 1). Visual inspection suggests that the models capture the human-face cluster, which is also prevalent in IT. However, the models do not appear to place human and animal faces in a single cluster. In addition, the inanimate objects appear less clustered in the models.
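The RDM correlations discussed here are Kendall τA coefficients between a model RDM and an IT RDM. As a rough illustration of how such a coefficient could be computed, the sketch below compares the off-diagonal upper-triangle entries of two square RDMs. The function names (`upper_triangle`, `kendall_tau_a`, `rdm_correlation`) are hypothetical helpers written for this example, not the authors' code, and the sketch does not handle excluded rows and columns such as the stimuli without monkey data.

```python
from itertools import combinations


def upper_triangle(rdm):
    """Return the off-diagonal upper-triangle entries of a square RDM as a flat list."""
    n = len(rdm)
    return [rdm[i][j] for i in range(n) for j in range(i + 1, n)]


def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant pairs - discordant pairs) / total number of pairs.

    Tied pairs count as neither concordant nor discordant,
    but they remain in the denominator.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    score = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            score += 1
        elif product < 0:
            score -= 1
    n = len(x)
    return score / (n * (n - 1) / 2)


def rdm_correlation(rdm_a, rdm_b):
    """Kendall tau-a between the upper triangles of two RDMs of equal size."""
    return kendall_tau_a(upper_triangle(rdm_a), upper_triangle(rdm_b))
```

This pairwise loop is quadratic in the number of dissimilarities, so for a 96×96 RDM (4560 off-diagonal pairs of stimuli) it is slow in pure Python; it is meant to show the definition rather than to be efficient.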
Combining features from the not-strongly-supervised models improved the RDM correlations to IT. Model features were combined by summarizing each model representation by its first 95 principal components and then concatenating these sets of principal components. This approach ensured that each model contributed equally to the combination (same number of features and same total variance contributed).
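The combination procedure can be sketched as follows, assuming each model representation is available as a stimuli-by-features array. The text does not say how equal total variance was achieved, so the rescaling of each block to unit total variance is an assumption of this sketch, and `pca_scores` and `combine_models` are hypothetical names.

```python
import numpy as np


def pca_scores(features, n_components):
    """Project stimuli (rows) onto the leading principal components of one model."""
    centered = features - features.mean(axis=0, keepdims=True)
    # SVD of the centered data: the columns of u * s are the principal-component scores.
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    k = min(n_components, s.size)
    return u[:, :k] * s[:k]


def combine_models(model_features, n_components=95):
    """Concatenate the leading principal components of several model representations.

    model_features: list of arrays, each of shape (n_stimuli, n_features_of_that_model).
    """
    blocks = []
    for features in model_features:
        scores = pca_scores(np.asarray(features, dtype=float), n_components)
        # Assumption: rescale each block to unit total variance so that
        # every model contributes the same total variance to the combination.
        total_var = scores.var(axis=0).sum()
        if total_var > 0:
            scores = scores / np.sqrt(total_var)
        blocks.append(scores)
    return np.hstack(blocks)
```

With 96 stimuli, a centered representation has at most 95 nonzero principal components, so under this reading the first 95 components retain all of the variance across stimuli in each model.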
The combination of the 27 not-strongly-supervised models (combi27) has a higher RDM correlation with both hIT and mIT than any of the 27 contributing models. Second to the combi27 model, internal representations of the HMAX model have the highest RDM correlation with hIT and mIT. This might reflect the fact that the architecture and parameters of the HMAX model closely follow the literature on the primate ventral stream.
The main categorical divisions observed in IT appear weak or absent in the best fitting models (Figure 1). To measure the strength of categorical clustering in each model and brain representation, we fitted a linear model of category-cluster RDMs to each model and brain RDM (Materials and Methods, Figure S5). The fitted models (Figure 3) descriptively visualize the categorical component of each RDM, summarizing sets of within- and between-category dissimilarities by their averages. The fits for several computational models show a strong human-face cluster, and a weak animate cluster. The human-face cluster is expected on the basis of the visual similarity of the human-face images (all frontal aligned human faces of the same approximate size). The animate cluster could reflect the similar colors and more rounded shapes shared by the animate objects. However, IT in both human and monkey exhibits additional categorical clusters that are not easily accounted for in terms of visual similarity. First, the IT representation has a strong face cluster that includes human and animal faces of different species, which differ widely in shape, color, and pose. Second, the IT representation has an inanimate cluster, which includes a wide variety of natural and artificial objects and scenes of totally different visual appearance. These clusters are largely absent from the not-strongly-supervised models (Figures 3, S6, S7, S8).
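To make the idea of fitting a linear model of category-cluster RDMs concrete, the sketch below regresses the upper triangle of an RDM onto a constant plus one binary predictor per category. The predictor definition (0 for pairs that both belong to the category, 1 otherwise), the use of ordinary least squares, and the names `cluster_rdm` and `fit_category_model` are assumptions made for illustration; the predictors actually used are those shown in Figure S5 and described in Materials and Methods.

```python
import numpy as np


def cluster_rdm(member):
    """Binary predictor RDM: 0 for pairs that both belong to the category, 1 otherwise."""
    member = np.asarray(member, dtype=bool)
    return 1.0 - np.outer(member, member).astype(float)


def fit_category_model(rdm, memberships):
    """Least-squares fit of a constant plus category-cluster predictors to an RDM.

    rdm: square dissimilarity matrix of shape (n_stimuli, n_stimuli).
    memberships: list of boolean masks, one per category (categories may be nested).
    Returns the weights (constant first) and the fitted symmetric RDM.
    """
    rdm = np.asarray(rdm, dtype=float)
    iu = np.triu_indices(rdm.shape[0], k=1)  # off-diagonal upper triangle
    columns = [np.ones(iu[0].size)]
    columns += [cluster_rdm(mask)[iu] for mask in memberships]
    design = np.column_stack(columns)
    weights, *_ = np.linalg.lstsq(design, rdm[iu], rcond=None)
    fitted = np.zeros_like(rdm)
    fitted[iu] = design @ weights
    fitted = fitted + fitted.T  # mirror to the lower triangle; diagonal stays 0
    return weights, fitted
```

Under these assumptions, a larger weight on a category predictor means that within-category dissimilarities are smaller than the remaining dissimilarities by that amount, that is, a more pronounced cluster for that category.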