Results

Year One

The major research activity during the first year was progress on research objective 1, land use classification using georeferenced ground-level images. Specifically, we developed a classification framework in which Flickr images are used to map land use on university campuses. This involved: creating ground truth maps of land use for two university campuses, UC Berkeley and Stanford; acquiring images for these regions using the Flickr API; extracting visual features from these images; training a classifier on a labeled training set of images; and applying the classifier to label a test set of images whose locations are then used to create a land use map that is compared with the ground truth map. This initial investigation into research objective 1 demonstrated that ground-level images can be used to map land use, in that the predicted maps resembled the ground truth maps. The predictions were still very noisy, however, so more work is needed.
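
The following minimal sketch illustrates the overall framework, assuming the Flickr images have already been downloaded (e.g., via the Flickr API) together with their coordinates; the color-histogram feature, grid aggregation, and file layout are illustrative stand-ins rather than the exact implementation.

```python
# A minimal sketch of the year-one framework, assuming the Flickr images have
# already been downloaded together with their latitude/longitude. The
# color-histogram feature and grid aggregation are illustrative stand-ins.
from collections import Counter, defaultdict

import numpy as np
from PIL import Image
from sklearn.svm import SVC

def color_histogram(path, bins=8):
    """Low-level visual feature: a joint RGB histogram, L1-normalized."""
    rgb = np.asarray(Image.open(path).convert("RGB")).reshape(-1, 3)
    hist, _ = np.histogramdd(rgb, bins=(bins, bins, bins), range=((0, 256),) * 3)
    return hist.ravel() / hist.sum()

def run_pipeline(train_set, test_set, grid_res=0.001):
    """train_set / test_set: lists of (image_path, lat, lon, land_use_label)."""
    X_train = np.stack([color_histogram(p) for p, _, _, _ in train_set])
    y_train = [label for _, _, _, label in train_set]
    clf = SVC(kernel="rbf").fit(X_train, y_train)

    # Label each test image, then aggregate by majority vote into grid cells
    # to produce a predicted land use map for comparison with the ground truth.
    cells = defaultdict(list)
    for path, lat, lon, _ in test_set:
        pred = clf.predict(color_histogram(path)[None, :])[0]
        cells[(round(lat / grid_res), round(lon / grid_res))].append(pred)
    return {cell: Counter(preds).most_common(1)[0][0] for cell, preds in cells.items()}
```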

Year Two

The major research activity during the second year was progress on research objective 1, land use classification using georeferenced ground-level images.

We first investigated whether state-of-the-art object and concept detectors from computer vision provide an informative signal when applied to georeferenced ground-level images. The motivation is that such high-level image features might be more informative for geographic discovery than low-level features, such as color and texture, and mid-level features, such as gist. We used the Object Bank descriptors of Li et al. (NIPS 2010), which represent the state of the art and detect 177 different objects typically found in urban environments, such as televisions, clocks, vases, stoves, and forks. Reliably detecting these objects could guide effective land use classification, particularly in urban areas, where overhead imagery can be ambiguous. We applied the detectors to a large number of Flickr images from a 10 x 11 km region of Great Britain encompassing London. We computed measures of spatial heterogeneity and correlation in order to determine whether the detectors provide a spatially informative signal.
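
The report does not fix the exact statistics used, but one representative measure of spatial heterogeneity is Moran's I computed per object category over the georeferenced detection scores. The sketch below uses PySAL for this and is illustrative only.

```python
# Illustrative sketch (not necessarily the exact measure used): Moran's I per
# object category over the georeferenced Object Bank detection scores.
import numpy as np
from libpysal.weights import KNN
from esda.moran import Moran

def spatial_heterogeneity(coords, detections, category_names, k=8):
    """coords: (N, 2) lon/lat of images; detections: (N, 177) detector scores."""
    w = KNN.from_array(coords, k=k)          # k-nearest-neighbor spatial weights
    results = {}
    for j, name in enumerate(category_names):
        mi = Moran(detections[:, j], w, permutations=999)
        results[name] = (mi.I, mi.p_sim)     # statistic and pseudo p-value
    # Categories with a high, significant I are the most spatially clustered.
    return sorted(results.items(), key=lambda kv: kv[1][0], reverse=True)
```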

The results of applying the object detectors to Flickr images were mixed. On one hand, it was clear that, despite representing the state of the art in computer vision, the Object Bank descriptors were very noisy when applied to a real-life dataset (versus the hand-crafted datasets commonly used in evaluations), resulting in many false-positive detections. While the output of the detectors was noisy, it did vary spatially. In particular, the categories boot, desktop computer, clock, and basketball hoop were the most spatially heterogeneously distributed (clustered) objects. Spatial co-occurrence analysis of the object detections did discover some object pairings that made sense, such as desks and desktop computers, and plates and fruit. However, other less logical pairings were also discovered, such as clams and galleries. We conclude from these results that recognizing specific objects in generic collections of georeferenced ground-level images remains a challenge. However, while the detectors might not be detecting the specific objects they are intended for, they might still represent an informative latent signal when treated as bag-of-objects descriptors. This motivated their use for land use classification.

We also investigated land use classification using mid-level gist descriptors and high-level object detectors. Ground truth land use data is difficult to obtain, so we instead used a subset of the category assignments available at the photograph level in the Geograph Great Britain and Ireland project. We extracted gist features from a training set of images and trained support vector machine (SVM) classifiers to recognize eight categories: church interior, clock tower, fast-food outlet, ferry, fire station, flats, lighthouse, and memorial. In order to use the 177 object detectors to recognize these eight categories, we formed a 177-dimensional vector whose entries are the likelihoods that each object is detected in an image. SVMs were also trained using these bag-of-objects features.
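
A sketch of the bag-of-objects representation is given below; `object_bank_likelihoods` is a hypothetical stand-in for whatever detector code produces the per-object likelihoods.

```python
# Sketch of the bag-of-objects representation. `object_bank_likelihoods` is a
# placeholder for the Object Bank detector code, which is assumed to return a
# detection likelihood for a given object id and image.
import numpy as np
from sklearn.svm import SVC

N_OBJECTS = 177  # size of the Object Bank detector set

def bag_of_objects(image, object_bank_likelihoods):
    """177-dim vector: likelihood that each object appears in the image."""
    return np.asarray([object_bank_likelihoods(image, obj_id) for obj_id in range(N_OBJECTS)])

def train_category_classifier(images, labels, object_bank_likelihoods):
    """Train a multi-class SVM over the eight Geograph categories."""
    X = np.stack([bag_of_objects(img, object_bank_likelihoods) for img in images])
    return SVC(kernel="linear", probability=True).fit(X, labels)
```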

Both the gist and bag-of-objects features performed well for classifying images from the eight urban land use classes. Gist features resulted in classification rates from 21 to 87 percent on a test set of labeled images. The bag-of-objects features improved on this, with rates from 42 to 89 percent. We also applied the trained classifiers to a large collection of georeferenced but unlabeled images and mapped the results. Qualitative evaluation of these maps showed the land use distributions were reasonable: the ferry class was clustered along the River Thames; lighthouse was clustered along the coast; clock tower was clustered in London; and fire station was spatially dispersed.

Year Three

The major research activity during the third year was continued work on research objective 1, land use classification using georeferenced ground-level images. Specifically, we investigated recent developments in deep learning for image analysis.

In the first year, we made some initial progress on mapping land use on university campuses using bag-of-visual-words features, which can be considered low-level features. Our results were quite noisy. In the second year, we investigated high-level features, specifically state-of-the-art object and concept detectors. We showed that, in general, such detectors were not quite ready for use on large collections of real-world images. Therefore, in this third year, we investigated mid-level features based on convolutional neural networks (CNNs).

We created a much improved ground truth evaluation dataset for land use classification on university campuses. This dataset consists of eight classes (versus three in our earlier work): study, residence, hospital, park, gym, playground, water, and theater. We derived polygonal footprints for several hundred regions on the Stanford campus and manually labeled them with the eight classes. We downloaded almost 80K Flickr images from the Stanford campus, of which approximately 16K were located in one of the regions. This dataset was augmented with 24K additional Flickr images found using a keyword search, resulting in an evaluation dataset that is nearly an order of magnitude larger than that of our previous study.

Our three objectives are: 1) to investigate whether indoor/outdoor filtering helps with land use classification; 2) to perform land use classification at the image and region level; and 3) to address a new problem we term land use refinement, in which the superclass is known and a subclass must be assigned. This new problem is motivated by our observation that current land use maps, where they exist, are usually restricted to coarse-level classes. Refining these classes, say using ground-level images, is thus a simpler yet very practical problem.

For all three objectives, we investigate image features computed using CNNs. Unlike traditional hand-crafted features, such as color, texture, and local invariant features, these features are data-derived and have proven effective for a wide range of image analysis tasks. We use them as inputs to simple support vector machine classifiers to perform either indoor/outdoor or land use classification at the image level.
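
As a hedged illustration of this setup, the sketch below extracts penultimate-layer activations from a pretrained torchvision ResNet-50 as the image feature and feeds them to a linear SVM; the specific CNN, feature layer, and SVM settings used in the project may differ.

```python
# Hedged sketch of CNN features feeding an SVM. A torchvision ResNet-50 stands
# in for whichever CNN was actually used; its penultimate-layer activations
# serve as the image feature.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image
from sklearn.svm import LinearSVC

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()  # drop the classification head -> 2048-d features
backbone.eval()

@torch.no_grad()
def cnn_feature(path):
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    return backbone(x).squeeze(0).numpy()

def train_svm(image_paths, labels):
    """labels: indoor/outdoor or one of the eight land use classes."""
    X = [cnn_feature(p) for p in image_paths]
    return LinearSVC(C=1.0).fit(X, labels)
```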

Our indoor/outdoor classification results achieved state-of-the-art performance. Our accuracy is over 95% on an evaluation dataset for which the previous best performance was around 90%.

We achieve approximately 80% accuracy for land use classification at the image level, which is significant considering how noisy the Flickr images are. Our accuracy at the region level is around 70%. This increases to over 83% if we use a two-stage framework of indoor/outdoor classification followed by land use classification.

We achieve over 91% accuracy in the land use refinement problem. This increases to over 93% if we incorporate indoor/outdoor classification.

The overarching outcome is that the CNN features are more effective than the low-level features we utilized before for land use classification on university campuses. While a direct comparison is not possible since we have revised the evaluation dataset, we are achieving a higher accuracy even though we have increased the number of classes from three to eight. These features are also shown to perform well for indoor/outdoor classification.

We introduced the problem of land use refinement and showed our approach achieves excellent performance.

We will make our evaluation dataset available once we have published this work.

Year Four

During the fourth year, we continued to make progress on research objective 1, land use classification using georeferenced ground-level images, and started work on research objective 2, mapping public sentiment.

Our work during the third year on mapping land use on university campuses using deep learning, specifically convolutional neural networks (CNNs), received the best poster award at the 2015 ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. Encouraged by this, a major activity begun in the fourth year was extending this work to map land use at the city scale. The closest thing we could find to a standardized set of land use classes is the Land-Based Classification Standards of the American Planning Association. We refined the Function theme of this system to include just those land use classes that we believe are observable in ground-level images. We now have a fairly extensive hierarchical ontology with seven root-level classes and over 30 leaf-level classes. Our initial focus is on mapping land use in San Francisco. We have downloaded over 1.3 million georeferenced Flickr images of San Francisco dating from 2006 to 2016. Evaluating our approach will be difficult because land use maps do not exist at the class granularity we are considering. Therefore, we have started creating our own ground truth map of San Francisco at the parcel level. This is a significant undertaking but will be a useful dataset to share.

During the fourth year, we also began work on classifying activities in georeferenced videos. The motivation is that video is generally richer than still imagery and so should be more effective for mapping land use. In particular, the activities occurring in a video should be informative. Activity recognition in video is currently a vibrant field in computer vision, allowing us to leverage extensive publicly available training datasets such as the recent ActivityNet dataset, which contains videos from 100 activity classes. Instead of using existing classifiers, we developed a novel approach that incorporates depth for activity recognition. Researchers have previously shown that depth, as explicitly sensed using RGB-D cameras for example, improves activity recognition. Most video, however, is not captured with such cameras, so we proposed estimating depth from the video itself. This is a difficult problem and the estimated depth maps are quite noisy. We showed, however, that they are complementary to other approaches and achieve state-of-the-art results when combined with them.

We also investigated another novel approach to activity recognition in video: parallel multi-task learning. Currently, the three related problems of action proposal, action recognition, and action refinement are addressed independently of one another. However, we posit that these tasks are related and can benefit each other. We therefore developed a parallel CNN model that learns the three tasks jointly.

Finally, we started work on mapping public sentiment which is research objective 2. We are leveraging recent work in computer vision on emotion detection in images to detect six emotions (anger, disgust, fear, joy, sadness, surprise) in georeferenced images with the goal of detecting spatio-temporal patterns.

Year Five

During the fifth year, we made progress on research objective 1, land use classification using georeferenced ground-level images and videos, and on research objective 2, mapping public sentiment.

Our 2015 ACM SIGSPATIAL paper "Land use classification using convolutional neural networks applied to ground-level images" demonstrated that ground-level images could be used to map land use on university campuses. During the fifth year of the project, we worked on scaling the method to a larger area and a broader range of land use classes. Our study area is now the City of San Francisco, for which we have acquired all of the parcel footprints. We have derived a set of 30 land use classes based on the Land-Based Classification Standards of the American Planning Association. Our research goal is to use ground-level images to assign land use classes to the parcel footprints. A fundamental challenge, though, is that there is no land use ground truth (land use maps at our class granularity do not exist; this is one of the main motivations for the work). We are thus deriving a small ground truth dataset by hand to evaluate our approach. We are also deriving a surrogate land use ground truth by aggregating data from Google Maps, Bing Maps, and OpenStreetMap. Aggregating this information is in itself an interesting problem which we expect to publish on.

During the fifth year, we began to investigate using georeferenced videos for geographic discovery. The motivation is that event and activity detection should be easier using videos than images. The PhD student funded by the grant has developed several state-of-the-art activity recognition techniques. The paper "Depth2Action: Exploring embedded depth for large-scale action recognition," published at the European Conference on Computer Vision (ECCV) Workshop on Web-Scale Vision and Social Media (VSM) in 2016, demonstrated how depth estimated from video frames can improve activity recognition in video. We are now applying these activity recognition techniques to large collections of georeferenced videos of San Francisco downloaded from YouTube. Preliminary results show we are able to detect where sports are played, identify correlations between certain sports and the weather, determine the route of a parade, and perform other tasks using just the visual content of the videos.

During the fifth year, we also showed that georeferenced ground-level images can be used to map public sentiment. Our 2016 ACM SIGSPATIAL paper "Spatio-temporal sentiment hotspot detection using geotagged photos" demonstrates that recently developed methods for detecting sentiment in images can be used to map six different emotions (anger, disgust, fear, joy, sadness, and surprise) using a large collection of Flickr images of San Francisco. We performed spatial and spatio-temporal hotspot detection using the Getis-Ord Gi* statistic. While there was no ground truth with which to evaluate our approach, we were able to detect spatial correlations between disgust and the messiest neighborhoods, and between joy and tourist and recreational locations. We were also able to detect spatio-temporal correlations between joy and the yearly success of the San Francisco Giants baseball team. The 2016 ACM SIGSPATIAL paper was well received and was the runner-up for the best fast-forward presentation.
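
A minimal sketch of the hotspot analysis is shown below, assuming each image carries a predicted per-emotion score and a location; the distance threshold and the choice of per-image (rather than aggregated) analysis are illustrative assumptions, not the paper's exact settings.

```python
# Sketch of Getis-Ord Gi* hotspot detection over per-image emotion scores
# (e.g., the predicted probability of "joy"), using PySAL.
import numpy as np
from libpysal.weights import DistanceBand
from esda.getisord import G_Local

def emotion_hotspots(coords, joy_scores, threshold=0.005):
    """Returns Gi* z-scores and pseudo p-values for each image location."""
    w = DistanceBand(coords, threshold=threshold, binary=True)
    g = G_Local(np.asarray(joy_scores), w, transform="B", star=True, permutations=999)
    return g.Zs, g.p_sim  # large positive Zs with small p_sim -> a joy hotspot
```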

The two undergraduates supported by an REU supplement to the grant demonstrated that georeferenced ground-level images could be used to map pet ownership. They applied cat and dog detectors to a large collection of Flickr images of San Francisco. Again, there was no ground truth for evaluation, but the results indicated that dogs are more likely to be found where there are parks, cats are more likely to be found in residential areas, and tourist areas contain little pet activity. This research was presented as the paper "City-scale mapping of pets using georeferenced images" at the 2016 ACM SIGSPATIAL Student Research Competition. The work was well received and won first place in the undergraduate division.

Year Six

During the sixth year, we continued to make progress on research objective 1, land use classification using georeferenced ground-level images and videos.

We completed the journal extension of our 2015 ACM SIGSPATIAL paper "Land use classification using convolutional neural networks applied to ground-level images". In that conference paper, we demonstrated that ground-level images could be used to map a limited number of land use classes on university campuses, regions for which we had ground truth maps. Our journal extension considers 45 land use classes over the entire City of San Francisco. We derived our hierarchical land use taxonomy from the Land-Based Classification Standards of the American Planning Association. Since we are tackling a problem for which there is no ground truth, we derived a surrogate ground truth using a large number of points of interest (POIs) from Google Places. We developed a novel two-stream convolutional neural network (CNN) to assign land use labels to ground-level Flickr images. An object stream is pretrained on the ImageNet dataset and a scene stream is pretrained on the Places365 dataset. These streams are then fine-tuned using our own training dataset. Once trained, the two-stream CNN is applied to a large collection of ground-level Flickr images of the City of San Francisco. Image labels are aggregated at the parcel level (we obtained the parcel shapefiles online). Our experiments compare our CNN-based approach to standard image features and classifiers, demonstrate the complementarity of the object and scene streams of our network, and explore generalization to other image data sources such as Instagram photos.
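
A simplified PyTorch sketch of such a two-stream network is given below; the ResNet-50 backbone is an assumption, and loading the Places365 weights for the scene stream is elided since those checkpoints are distributed separately by the Places project.

```python
# Minimal sketch of a late-fusion two-stream land use classifier. The ImageNet
# stream uses torchvision weights; the scene stream is randomly initialized
# here, with Places365 weight loading left as a separate step.
import torch
import torch.nn as nn
import torchvision.models as models

class TwoStreamLandUse(nn.Module):
    def __init__(self, num_classes=45):
        super().__init__()
        self.object_stream = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.object_stream.fc = nn.Identity()
        self.scene_stream = models.resnet50(weights=None)   # load Places365 weights here
        self.scene_stream.fc = nn.Identity()
        self.classifier = nn.Linear(2048 * 2, num_classes)  # late fusion of both streams

    def forward(self, x):
        obj = self.object_stream(x)
        scn = self.scene_stream(x)
        return self.classifier(torch.cat([obj, scn], dim=1))

# After fine-tuning on labeled images, per-image predictions would be
# aggregated (e.g., by majority vote) over the images falling in each parcel.
model = TwoStreamLandUse(num_classes=45)
logits = model(torch.randn(4, 3, 224, 224))  # batch of 4 images -> (4, 45) scores
```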

Our work above on generating surrogate land use maps using sources such as Google Places was published and presented as the paper "Quantitative comparison of open-source data for fine-grain mapping of land use" at the 3rd ACM SIGSPATIAL Workshop on Smart Cities and Urban Analytics (UrbanGIS 2017).

We started an interesting new research thread on addressing the sparse and uneven distribution of ground-level images in our geographic discovery framework. Rather than classify images and then interpolate the labels, we investigated interpolating features extracted from the images and then classifying the interpolated features. We applied this interpolate-then-classify framework to mapping land use using ground-level images. We also investigated how prior knowledge about region boundaries, such as parcels, can be used to improve the interpolation through spatial morphing kernel regression.
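
As a sketch of the interpolate-then-classify idea, the code below performs plain Gaussian (Nadaraya-Watson) kernel regression of image features over space; the spatial morphing kernel, which additionally warps distances using region boundaries such as parcels, is not reproduced here.

```python
# Interpolate-then-classify sketch: kernel-weighted averaging of image features
# at arbitrary query locations, followed (elsewhere) by classification of the
# interpolated features to produce a dense land use map.
import numpy as np

def interpolate_features(query_xy, image_xy, image_feats, bandwidth=0.002):
    """query_xy: (Q, 2) locations to densify (e.g., a regular grid);
    image_xy: (N, 2) locations of the ground-level images;
    image_feats: (N, D) features extracted from those images."""
    d2 = ((query_xy[:, None, :] - image_xy[None, :, :]) ** 2).sum(-1)   # (Q, N)
    w = np.exp(-d2 / (2 * bandwidth ** 2))                              # Gaussian kernel
    w /= w.sum(axis=1, keepdims=True)
    return w @ image_feats                                              # (Q, D)
```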

We extended this interpolate-then-classify framework to incorporate overhead imagery. The ground-level images provide a better perspective for mapping land use than overhead images but they are sparse and unevenly distributed. The overhead imagery is available everywhere but is limited for mapping many land use classes. We therefore used conditional generative adversarial networks (CGANs) to generate dense ground-level views and image features using overhead imagery. The features can then be used for dense land use classification, for example.

Finally, we demonstrated that ground-level videos can be used to map human activity. We developed and applied state-of-the-art action recognition methods to a large collection of YouTube videos of San Francisco. We were able to recognize and map a range of activities including eight different sports and the route a parade took.

Final Year

During the final year, we continued to make progress on research objective 1, land use classification using georeferenced ground-level images and videos. In particular, we developed an innovative framework for addressing the sparse spatial distribution of ground-level images. This work attracted interest from the popular technical press.

A fundamental challenge to inferring geographic information from ground-level images is their sparse and uneven spatial distribution. Overhead imagery has a complete and uniform spatial distribution; however, the ground-level images that this project exploits for tasks such as land use classification are available only at certain locations. This motivated several solutions to the problem.

We first considered spatially interpolating the image features that are extracted from the ground-level images before performing classification. Indeed, we showed in our 2018 IEEE International Conference on Image Processing (ICIP) paper titled "Spatial morphing kernel regression for feature interpolation," that this interpolate-then-classify framework was effective and enabled dense land use mapping from sparse and unevenly distributed ground-level images.

In this last year, we investigated a more innovative approach: using conditional generative adversarial networks (cGANs) to synthesize what the view on the ground looks like at a location, given co-located overhead imagery.

GANs have proven successful at learning the distributions of images in high-dimensional image spaces. That is, one can train a GAN through adversarial learning to generate, for example, images of cats. Such GANs learn where in the high-dimensional image space certain types of images (e.g., cats) lie so that, given a random vector from this region, they can generate a completely synthetic but realistic-looking image of that type.

However, we did not want to generate just any ground-level views but ones that are representative of their geographic locations. For this, we used conditional GANs (cGANs). These models generate images conditioned on some additional information. For example, they have been used to generate stylized versions of images: a cGAN that has learned the distribution of drawings of objects conditioned on images of those objects can then be used to generate a drawing of a novel object instance from an image of it.

We trained a cGAN to generate ground-level views of a location given a co-located overhead image. During training, a generator network is given an overhead image patch and tries to generate a ground-level view of that location that a discriminator network cannot distinguish from the true ground-level view of the location. Through iteration, the generator gets better at generating ground-level views that are similar to the true ground-level views in the training dataset.
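
The following is a highly simplified sketch of one cGAN training step on paired overhead and ground-level images; the network architectures and losses are placeholders, not the networks used in the paper.

```python
# Simplified conditional GAN training step: the generator maps an overhead
# patch to a ground-level view; the discriminator scores (overhead, ground)
# pairs as real or synthesized.
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Overhead patch (3x64x64) -> synthesized ground-level view (3x64x64)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh(),
        )
    def forward(self, overhead):
        return self.net(overhead)

class Discriminator(nn.Module):
    """Scores an (overhead, ground-level) image pair as real or synthesized."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 64, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, 1),
        )
    def forward(self, overhead, ground):
        return self.net(torch.cat([overhead, ground], dim=1))

def train_step(G, D, opt_G, opt_D, overhead, ground, bce=nn.BCEWithLogitsLoss()):
    real = torch.ones(overhead.size(0), 1)
    fake = torch.zeros(overhead.size(0), 1)
    # Discriminator: true pairs vs. pairs containing generated ground-level views.
    opt_D.zero_grad()
    d_loss = bce(D(overhead, ground), real) + bce(D(overhead, G(overhead).detach()), fake)
    d_loss.backward(); opt_D.step()
    # Generator: try to fool the discriminator with its synthesized views.
    opt_G.zero_grad()
    g_loss = bce(D(overhead, G(overhead)), real)
    g_loss.backward(); opt_G.step()
    return d_loss.item(), g_loss.item()
```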

Once trained, the cGAN can generate a ground-level view anywhere overhead imagery is available. This results in a complete and uniform distribution of ground-level views, albeit synthesized ones.

It is important to note here that the goal was not necessarily to generate ground-level views that look realistic (as in many applications of GANs) but to generate images that help the research objective of mapping land use. In order to demonstrate this, we showed that the image features extracted from the synthesized ground-level images were more effective for performing dense land use mapping than our previous approach which interpolated the image features from the sparse true ground-level images. Further, we showed that the synthesized ground-level images were complementary to the overhead image patches. That is, the trained cGAN model was imbued with geographic knowledge, in particular the complex visual correspondence between overhead and ground-level views, and was not just performing a feature or image transformation.

This work was published at the 2018 ACM SIGSPATIAL conference as a full research paper titled "What is it like down there? Generating dense ground-level views and image features from overhead imagery using conditional generative adversarial neural networks."

We have extended this work in two ways. First, instead of conditioning the cGAN on the raw pixel values of the overhead image patch, we condition it on a feature embedding of the patch. This allows the patch to be larger and thus the cGAN to observe more of the overhead image. Second, we explore different scales of the features embedded from the overhead image. Both extensions are shown to improve the land use classification results obtained using the generated ground-level images. We are preparing this work for publication.