Encoding Video Locations with SatCLIP: A New Frontier in Geographic Machine Learning

Published July 29, 2024

In the realm of machine learning and computer vision, understanding the geographic context of visual data has become increasingly important. While significant strides have been made in image-based location encoding, video content has remained a challenge. Today, we're excited to introduce a novel wrapper for SatCLIP that bridges this gap, enabling efficient location encoding for video content.

What is SatCLIP?

Before diving into our new wrapper, let's briefly recap what SatCLIP is. SatCLIP, or Satellite Contrastive Location-Image Pretraining, is a powerful model from Microsoft that learns to associate geographic coordinates with satellite imagery. It creates dense, meaningful embeddings that capture the essence of a location, from its climate and terrain to its level of urbanization.
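To make this concrete, here is roughly how SatCLIP is used on its own to embed a coordinate pair, following the usage pattern sketched in the SatCLIP repository. The repository and checkpoint names below are the ones published on the Hugging Face Hub at the time of writing, but verify them against the current release:

```python
import torch
from huggingface_hub import hf_hub_download
from load import get_satclip  # helper module from the microsoft/satclip repository

device = "cuda" if torch.cuda.is_available() else "cpu"

# Download the pretrained ViT16 / L=40 checkpoint and build the location encoder.
model = get_satclip(
    hf_hub_download("microsoft/SatCLIP-ViT16-L40", "satclip-vit16-l40.ckpt"),
    device=device,
)
model.eval()

# A batch of (lon, lat) coordinates; SatCLIP expects double precision.
coords = torch.tensor([[-74.0060, 40.7128]])  # New York City
with torch.no_grad():
    emb = model(coords.double().to(device)).cpu()  # -> (1, 256) location embedding
```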

The VLE Challenge

Videos present a unique challenge for location encoding. Unlike static images, videos can span multiple locations and contain temporal information; this is the Video Location Encoding (VLE) challenge. Our goal was to create a solution that could distill this complex spatio-temporal data into a single, informative embedding.

Our Solution: The SatCLIP Video Wrapper

Our new wrapper extends SatCLIP's capabilities to video content. Here's how it works, with a code sketch after the list:

  1. Frame Extraction: The wrapper first extracts frames from the input video at regular intervals.
  2. Coordinate Extraction: For each frame, we extract the corresponding geographic coordinates. This assumes the video has some form of geotag information.
  3. SatCLIP Encoding: Each set of coordinates is then passed through the SatCLIP model. We use a ViT16 vision encoder and a spherical harmonics location encoder with L=40, allowing for high-resolution spatial embeddings.
  4. Embedding Aggregation: The embeddings for all frames are then averaged to create a single 256-dimensional vector representing the entire video.
  5. Output: This final embedding serves as a compact representation of the video's geographic context.
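The sketch below walks through these five steps. The `geotag_at` hook is hypothetical: how you read per-frame coordinates depends on your footage (a GPX sidecar track, GoPro telemetry, EXIF-style metadata, and so on), and `satclip_model` is the pretrained location encoder loaded as shown earlier.

```python
import cv2
import torch

def encode_video_location(video_path, geotag_at, satclip_model,
                          interval_s=1.0, device="cpu"):
    """Sketch of the wrapper pipeline (the interface here is illustrative).

    geotag_at(t) is a caller-supplied hook returning the (lon, lat) pair
    for timestamp t in seconds, e.g. interpolated from a GPX track.
    """
    # Step 1: sample one frame timestamp per interval across the video.
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.release()
    step = max(1, int(round(fps * interval_s)))
    timestamps = [i / fps for i in range(0, n_frames, step)]

    # Step 2: look up the geographic coordinates of each sampled frame.
    coords = torch.tensor([geotag_at(t) for t in timestamps]).double()

    # Step 3: encode every coordinate pair with the SatCLIP location encoder.
    with torch.no_grad():
        frame_emb = satclip_model(coords.to(device))  # (N, 256)

    # Steps 4-5: mean-pool into a single 256-d embedding for the whole video.
    return frame_emb.mean(dim=0)
```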

How It Works

Our wrapper leverages the power of SatCLIP's pretrained weights. The model uses a combination of a Vision Transformer (ViT) and a location encoder based on spherical harmonics.
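During pretraining, the two branches are tied together by a CLIP-style contrastive objective: coordinate embeddings are pulled toward the embeddings of their matching satellite images and pushed away from the other images in the batch. A simplified sketch of that objective, for intuition only (see the SatCLIP paper for the exact formulation):

```python
import torch
import torch.nn.functional as F

def clip_style_loss(loc_emb, img_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched (location, image) pairs."""
    loc = F.normalize(loc_emb, dim=1)   # (B, 256) location embeddings
    img = F.normalize(img_emb, dim=1)   # (B, 256) satellite-image embeddings
    logits = loc @ img.t() / temperature
    targets = torch.arange(loc.size(0), device=loc.device)
    # Match each location to its own image, and each image to its own location.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2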

The ViT16 architecture processes the satellite imagery associated with each coordinate during SatCLIP's training. Although our video wrapper never invokes this vision branch at inference time, it is what shaped the location encoder's weights during pretraining, and therefore what makes the location embeddings meaningful.

The location encoder is where the magic happens for our video wrapper. It uses spherical harmonics with L=40, allowing for high-resolution encoding of geographic coordinates. This means our model can capture fine-grained spatial patterns and differences.

The choice of L=40 is significant. In the original SatCLIP paper, the authors found that higher L values (like 40) performed better for interpolation tasks, while lower values (like 10) were better for geographic generalization. For our video use case, we opted for the higher resolution to capture as much geographic detail as possible.
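To get a feel for what L controls: a real spherical-harmonic basis truncated at degree L spans (L + 1)^2 basis functions, so the encoder's spatial resolution grows quadratically with L. The exact feature layout inside SatCLIP's location encoder may differ, but the scaling is the point:

```python
# Real spherical harmonics up to degree L span (L + 1)**2 basis functions.
for L in (10, 40):
    print(f"L={L}: {(L + 1) ** 2} basis functions")  # L=10: 121, L=40: 1681
```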

Why This Approach?

  1. By leveraging SatCLIP's pretrained weights, we can encode video locations without downloading or processing satellite imagery at inference time, keeping the pipeline fast and lightweight.
  2. The wrapper works with any geotagged video, regardless of content or duration, making it more flexible than image-only location encoders.
  3. The resulting embedding captures implicit geographic context, potentially including terrain, climate, and urbanization: rich information that a raw coordinate pair alone doesn't expose.
  4. Despite the complexity of the underlying model, the wrapper provides a simple interface for users.

Potential Applications

  • Predict the location of videos without explicit geotags.
  • Recommend videos to users based on geographic similarity.
  • Retrieve videos shot in similar geographic contexts (see the similarity-search sketch after this list).
  • Categorize videos based on their geographic features.
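Several of these reduce to nearest-neighbor search in embedding space. A minimal sketch (the function and variable names here are ours, not part of the wrapper):

```python
import torch
import torch.nn.functional as F

def top_k_similar(query_emb, library_embs, k=5):
    """Rank a library of video embeddings by cosine similarity to a query.

    query_emb: (256,) embedding from the wrapper; library_embs: (M, 256).
    """
    sims = F.cosine_similarity(query_emb.unsqueeze(0), library_embs, dim=1)
    return torch.topk(sims, k)  # (values, indices) of the k closest videos
```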

Conclusion

Our SatCLIP video wrapper represents a significant step forward in geographic machine learning for video content. By extending the capabilities of SatCLIP to the video domain, we're opening up new possibilities for researchers and developers working with location-based video data.

We're excited to see how the community will use and build upon this tool. Whether you're working on video analysis, geographic information systems, or machine learning applications, we believe this wrapper can add a valuable new dimension to your work.

Try it out for yourself, and let us know what you build!