The project rests on 40 years of human face perception research.

The visual processing stream for faces. Face stimuli activate multiple areas throughout the ventral visual stream. In mid- and high-level vision, this includes areas that respond equally to faces and other object classes and appear to be involved primarily in coding basic shapes (e.g. V4) and complex objects (e.g. Lateral Occipital Complex, LOC); plus areas that respond more strongly to faces than to other objects, including a series of areas apparently involved primarily in coding face identity (Occipital Face Area, OFA; Fusiform Face Area, FFA, in inferotemporal cortex; and Anterior Temporal Lobe, ATL / Anterior Face Patch, AFP), plus at least one area generally thought to be more involved in processing facial expression (superior temporal sulcus, STS) (for review, see [8]). Neural responses in these areas are typically much less sensitive to the specific format and spatial frequency distribution of the face (e.g. variations in colour versus greyscale versus line-drawn images; changes in stimulus size or contrast) than coding in early visual areas (V1, V2) [8].

Three types of low-resolution faces. Because mid- and high-level face processing is relatively insensitive to low-level face format, manipulations targeted at improving face recognition within these stages ought to produce advantages of similar strength across a wide variety of low-resolution image types. To test this prediction, the present project will use three very different low-resolution formats (Fig 1), each selected for its practical relevance to real-world situations. 1. Modern digital camera CCTV faces are pixelated (broken into square regions, each of uniform colour/brightness, with vertical and horizontal edge boundaries between adjacent pixels). Resolution varies with the distance of the photographed person from the camera, but in crime settings is commonly poor (e.g. 48 x 28 pixels for the face height by face width in Fig 1). 2. Phosphenised faces provide simulations, in normal-vision observers, of how faces might appear to people fitted with visual prosthetics, i.e., 'bionic eyes' implanted to activate neurons in the retina or V1; these produce images comprising spatially separated arrays of light points [5]. Resolution (phosphenes per face width/height) depends both on the number of array elements (current literature uses 16 x 16 to 40 x 40, with 30% assumed non-operational) and on face size (viewing distance from the face). 3. Smooth blurring of the face occurs when human vision filters out high (and eventually medium) spatial frequencies as the face becomes more distant and/or is viewed more peripherally. Detailed models of this blurring have been presented by [3,9], and these models are used to make the blurred stimuli in the present project. Blurring is relevant when an eyewitness sees a perpetrator at a distance or, alternatively, close up but with peripheral vision.
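The three degradations can be illustrated computationally. The sketch below, in Python with NumPy, shows one minimal way each format might be simulated on a greyscale image array; the block size, grid size, dropout rate, and blur width are illustrative values only, not the exact parameters used to make the project's stimuli.

```python
import numpy as np

def pixelate(img, block):
    """CCTV-style pixelation: each block x block region is replaced by
    its mean, giving uniform squares with hard edges between them."""
    h, w = img.shape
    out = img.copy()
    for y in range(0, h, block):
        for x in range(0, w, block):
            out[y:y + block, x:x + block] = img[y:y + block, x:x + block].mean()
    return out

def phosphenise(img, grid=16, dropout=0.3, seed=0):
    """Bionic-eye-style sampling: a grid x grid array of light points,
    with a fraction of 'electrodes' assumed non-operational (dark)."""
    h, w = img.shape
    ys = np.linspace(0, h - 1, grid).astype(int)
    xs = np.linspace(0, w - 1, grid).astype(int)
    pts = img[np.ix_(ys, xs)]          # fancy indexing returns a copy
    rng = np.random.default_rng(seed)
    pts[rng.random(pts.shape) < dropout] = 0.0  # dead phosphenes
    return pts

def blur(img, sigma):
    """Separable Gaussian low-pass filter: removes high spatial
    frequencies, approximating distant or peripheral viewing."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x ** 2 / (2 * sigma ** 2))
    k /= k.sum()
    tmp = np.apply_along_axis(lambda r: np.convolve(r, k, mode='same'), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode='same'), 0, tmp)
```

Note that pixelation and blurring preserve image dimensions, while phosphenisation reduces the image to the electrode-array resolution, mirroring the three formats in Fig 1.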

Current approaches relevant to improving low-quality face images. There are a number of methods in the computer science "image enhancement" literature that can increase the visibility of faces to human observers (e.g. if a photograph is too dark, or washed out) [7]. None of these uses any theoretical knowledge about human face or object coding per se. Instead, they are focussed on improving "front end" processing within early stages of the visual system. Most work by modifying image brightness or contrast (overall or in specific brightness bands), or by intensifying specific spatial frequency components [7]. For example, in the context of age-related macular degeneration, which produces blurring in the image due to reliance on peripheral vision (central vision is destroyed), increasing the contrast of medium and high spatial frequencies (>5 cycles per face width) has been shown to make face detail more visible and thus to improve recognition [6,10]. These current approaches use the logic that enhanced early-vision coding will then flow through to provide higher quality input to later visual areas, including the high-level areas in which individual faces, and their expressions, are eventually recognised. These approaches can certainly be helpful [6,7]. But they have two important limitations. First, different types of low-resolution format require individualised types of "low-level" image change [7]. For example, increasing the contrast of high spatial frequencies (HSFs) helps recognition of blurred faces, but in a CCTV face the same method would make things worse by enhancing the visibility of the HSF edges between pixels. Second, only one stage of the visual system is being targeted, meaning that potential benefits from additionally targeting other (higher-level) stages of the system are missed.
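To make the "front end" character of these methods concrete, the following sketch shows one simple frequency-domain version of such enhancement: amplifying spatial frequency components above a cutoff. The cutoff and gain values are illustrative, not taken from the cited studies, and radial frequency here is measured in cycles per image rather than strictly per face width.

```python
import numpy as np

def boost_high_sf(img, cutoff_cycles=5.0, gain=2.0):
    """Amplify spatial frequencies above `cutoff_cycles` (cycles per
    image) by `gain`, leaving lower frequencies (and the mean
    luminance, the DC component) untouched. Illustrative parameters."""
    h, w = img.shape
    F = np.fft.fft2(img)
    fy = np.fft.fftfreq(h) * h   # cycles per image height
    fx = np.fft.fftfreq(w) * w   # cycles per image width
    radius = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)
    filt = np.where(radius > cutoff_cycles, gain, 1.0)
    return np.real(np.fft.ifft2(F * filt))
```

Applied to a blurred face this sharpens the surviving detail; applied to a pixelated CCTV face the same operation would, as noted above, amplify the spurious high-SF edges between pixels.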

Higher-level coding of faces. Theoretically, we know a great deal about higher-level perceptual coding of faces. An extensive behavioural literature has revealed that facial identity is coded in at least three ways: via its deviation from an average in a perceptual face-space ([11], McKone lab paper [12]); holistically ([13,14], McKone [15]); and as local parts [16,17]. There is evidence that facial expression is also coded in all three of these ways [18,19] (although apparently in different brain regions [8]). Thus, potentially, stimulus alterations targeted at any of these higher-level coding styles might be able to improve face recognition in low-resolution images.

Face-space coding and the caricature advantage. One way faces are coded is via their location relative to an average in a perceptual face-space [11,20] (e.g. see evidence from cognitive psychology [21]; psychophysics [22], McKone [12]; single-cell recording [23,24]). As applied to face identity (see [19] for similar ideas in face expression), individual faces are coded as points in a multi-dimensional space, the dimensions of which represent facial attributes tuned to the set of previous faces to which the observer has been exposed (the specific attributes remain unknown), and the physically average face lies at the centre of the space [11]. Face-space is used to explain improvements in face recognition with caricaturing (Fig 1), as compared to the original 'veridical' face. Photographic caricatures are made by placing 120+ key location markers on the face and using morphing software to exaggerate all the ways in which an individual's face differs from an average face [25]. For example, if a face is slightly narrower than the average, has a slightly more pointed nose, and has eyes set slightly closer together, then the caricature will become narrower still, have an even more pointed nose, and have eyes set even closer together. Various levels of caricature strength can be created, shifting a face along its original vector further away from the average. This shifting improves face perception [26] and matching [27]: e.g. two people's faces are rated as more different from each other after caricaturing, theoretically because the distance between the two faces in face-space has been increased [25].
Caricaturing also improves face memory [25], theoretically because face-space contains a lower density of faces (people previously learned from everyday life) farther from the centre of the space than closer in (as demonstrated by multidimensional scaling data [21]); recognition improves because the caricatured target face then has fewer nearby neighbouring exemplars with which it can be confused [11]. When faces are high-resolution photographs, improvements in perception and memory with caricaturing are well established (identity [25,28], expression [29]). For poor-quality images, caricaturing also often improves recognition of line drawings [20,30]. But, importantly, line drawings retain the opposite spatial-frequency profile of face information (high-SF) to the low-resolution images examined here (low-SF). For low-resolution face images, my own lab has recently had accepted the first paper showing that caricaturing improves recognition (Fig 2 [26]). This work tested the blur format, using identity recognition of own-race (faces and observers Caucasian), upright, unfamiliar faces.
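The geometric step underlying caricaturing can be sketched as vector extrapolation in landmark space: shift a face away from the average along its own identity vector, which necessarily increases the face-space distance between any two caricatured faces. This is only the shape component (photographic caricaturing also morphs the image texture over the 120+ landmarks); the function name and the 60% strength in the usage note are illustrative.

```python
import numpy as np

def caricature(landmarks, average, strength):
    """Exaggerate a face's deviation from the average face.

    landmarks, average : (n_points, 2) arrays of key-location markers.
    strength = 0.0 returns the veridical face; e.g. strength = 0.6
    gives a 60% caricature, moving the face further along its original
    vector away from the average.
    """
    landmarks = np.asarray(landmarks, dtype=float)
    average = np.asarray(average, dtype=float)
    return average + (1.0 + strength) * (landmarks - average)
```

Because every face's deviation vector is scaled by the same factor (1 + strength), the distance between two different faces' landmark configurations grows by that factor too, which is the face-space account of why caricatured faces are rated as more different from each other.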

Wholes and parts, and a predicted advantage from whole-then-part alternation. Parallel to the literature on face-space approaches to coding faces, another long tradition has examined holistic and part-based processing for faces. Holistic perceptual integration, in both identity and expression processing, is demonstrated by the classic Thatcher illusion [33] and by the composite illusion, in which naming the identity or expression of one half of a face is impaired when the other half displays a different identity/expression (relative to a control condition with the halves spatially misaligned, [13], McKone [15]). The exact nature of holistic mental representations is not fully understood, but a key point in the present context is that they do not code only low spatial frequency information about a person's face (McKone [34]: e.g. small but significant holistic processing occurs for face stimuli containing only medium or only high spatial frequencies [35]). This means that, although a low-resolution face image will activate holistic processing [35], it might fail to map onto the stored holistic representation of the correct person's face: several different people might share the same low-SF face information, while the differentiating higher-spatial-frequency information in the stored holistic representation is missing from the image.

Faces also receive part-based processing [16,17]. For identity, memory is above chance for isolated face features (e.g. nose alone) or for features seen one at a time in succession through a moving window [36]. Single features also contribute to facial expression judgements (e.g. a smiling mouth [37]).

A key idea is that holistic and part-based processing are best thought of as independent parallel [38,39] stages, both of which contribute to total performance level [34]. They are dissociated in brain processing [16,40] and in behaviour; for example, holistic perceptual integration is strongly reduced by face inversion (turning the face upside down) (e.g. the composite illusion in the naming task [13,15]; Mooney face detection [38]; also [39]), while part-based processing is not [34] (e.g. discrimination of isolated face parts or spatially separated face parts). This independence predicts that a novel approach of maximising the quality of part-based processing (by providing high-resolution views of local regions of the face, alternated with the low-resolution image of the whole face) should improve recognition relative to viewing only the low-resolution whole face. This whole-then-part alternation presentation format is of practical relevance where high-resolution information is potentially available but cannot be provided to the observer for the whole face simultaneously. This situation can occur (a) in drones, where limits on computer image-compression techniques and wireless download speed [4] may make it more feasible to alternate the downloaded information between low-resolution views of the whole face and high-resolution views of local regions, and (b) in bionic eye simulations, where the fixed number of phosphenes can be used to display either the whole face at low resolution or an expanded (thus higher-resolution) smaller region of the face (e.g. just the eyes). There are no previous tests of this alternation format (original-sized whole, then expanded part).
However, there is related evidence arguing that it should work: adding a low-resolution (blurred) global face structure to high-resolution local features improves face recognition over having either of these sources available separately, at least when both are at their original size, when the low-resolution region comprises most (rather than all) of the face, and when the two are presented simultaneously (rather than alternated) [36]. (This result comes from a procedure in which observers scan a 15° face and are shown: only a small 3.5° window at fixation, which reveals high-resolution fixated features; only the region outside the window, necessarily blurred by peripheral vision; or both.)
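One way to picture the proposed presentation format: given a high-resolution source image and a fixed display grid (e.g. a fixed phosphene count), alternate frames showing the whole face downsampled to that grid with frames showing a local region expanded to fill the same grid, so the part is rendered at higher effective resolution. The sketch below is a hypothetical illustration; the function and parameter names are not from the cited work.

```python
import numpy as np

def whole_then_part_frames(face, part_box, display=(16, 16), n_cycles=3):
    """Build an alternating frame sequence for a fixed display grid.

    face     : high-resolution greyscale image, shape (H, W).
    part_box : (top, left, height, width) of the local region
               (e.g. the eyes); names are illustrative.
    display  : fixed output resolution, e.g. a 16 x 16 phosphene array.
    Returns [whole, part, whole, part, ...], n_cycles pairs.
    """
    def to_display(img):
        h, w = img.shape
        dh, dw = display
        ys = np.linspace(0, h - 1, dh).astype(int)
        xs = np.linspace(0, w - 1, dw).astype(int)
        return img[np.ix_(ys, xs)]

    top, left, height, width = part_box
    whole = to_display(face)                                   # low-res whole face
    part = to_display(face[top:top + height, left:left + width])  # expanded local region
    frames = []
    for _ in range(n_cycles):
        frames += [whole, part]
    return frames
```

The same display budget is spent twice: once spread thinly over the whole face, once concentrated on a small region, which is the sense in which the part view is "expanded up" to higher resolution.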

Refining the theoretical origin of caricature and whole-then-part improvements by testing generalisation of their benefits across face types. Logically, the sensitivity of caricature and whole-then-part advantages to variation in face type (viewpoint, race, inversion) can clarify the origins of the advantages in specific visual areas, via links to the known sensitivities of those areas to the same variables. While knowledge in this field is not complete (e.g. FFA properties are fairly well established, but more anterior face-selective areas have received much less investigation), the emerging picture from both fMRI-guided single-cell recording [41] and human studies (fMRI, e.g. [40,42-44]; TMS [16]) can reasonably be described as follows. 1. Areas coding shape elements or general objects (V4, LOC) do not discriminate upright faces better than inverted faces, do not show other-race effects or generalise across face viewpoint, and do not code whole-face structure. 2. Of the face-selective areas, the most posterior (OFA) is not sensitive to whole-face structure and does not show inversion effects, but does show other-race effects; this suggests it codes part-based information about faces that is tuned by experience with specific face types (i.e. own-race faces). 3. The 'middle' face area, FFA, is sensitive to whole-face structure (shows the composite illusion), shows inversion effects that correlate with behavioural inversion effects, and also shows other-race effects; this suggests it codes a holistic representation of the face, also tuned by experience with specific face types. 4. Concerning viewpoint, face coding becomes more view-general as one moves from the posterior to the anterior face-selective regions: in monkeys [41], complete view-specificity in the most posterior region, generalisation over mirror reflections in middle regions, and generalisation over all viewpoints and different images of the same person in anterior regions (including ATL in humans [45,46]).
Surprisingly, the neural origin of the caricature advantage appears to have received no explicit discussion. This is perhaps because it is difficult to assess via brain imaging: caricaturing would not necessarily change the total BOLD response of a given region, but would rather increase the responses of some neurons in that region while decreasing those of others tuned to the opposite end of a dimension (McKone [12]). Logically, mid-level and general object areas are just as much candidate origins as high-level face areas: faces contain many basic shape elements (e.g. vertically or horizontally elongated parts) for which single-cell recording shows opponent-type coding from V4 onwards [47,48], giving the capacity to code caricatures via these shape elements.