Cloudless ‘Synthetic’ Sentinel-2 Data

ClearSky Vision
7 min read · Apr 1, 2022


This is a short introduction to cloud-free synthetic Sentinel-2 imagery produced by combining multiple satellites with deep learning. I will try to explain what synthetic earth observation data is and why we believe it is the future, in what will, hopefully, be less than a ten-minute read.

Before/After True-Color Image (Left: Sentinel-2, Right: Synthetic Sentinel-2 Data).

The Elevator Pitch:

For anyone unfamiliar with what we do or for the people who just stumbled upon this article on Medium, I’ll quickly give you an elevator pitch for our company, ClearSky Vision:

ClearSky Vision is a data fusion company working solely with earth observation data. Our first service recreates the 10 spectral bands of Sentinel-2 without clouds, shadows, image artifacts, and, to some degree, errors in atmospheric correction. This is done by combining data from multiple satellites (Sentinel-1, Sentinel-2, Sentinel-3, and Landsat 8/9) with state-of-the-art deep learning. The technique can utilize partly clouded imagery, radar data, microwave data, different spatial resolutions, and different temporal frequencies. It's simply about getting the most use out of every satellite already orbiting Earth, in a practical and cost-effective way.
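To make "combining data from multiple satellites" a bit more concrete, here is a minimal sketch of what multi-sensor fusion can look like at the input level. The array shapes and band counts are illustrative placeholders, not our actual pipeline:

```python
import numpy as np

# Hypothetical, pre-aligned inputs resampled to a common 10 m grid
# (band counts are illustrative, not ClearSky Vision's actual setup).
s2_optical = np.random.rand(10, 512, 512)       # Sentinel-2: 10 spectral bands
s1_sar     = np.random.rand(2, 512, 512)        # Sentinel-1: VV + VH backscatter
landsat    = np.random.rand(7, 512, 512)        # Landsat 8/9: optical bands
cloud_mask = np.random.rand(1, 512, 512) > 0.7  # True = cloudy pixel

# Data fusion in its simplest form: stack co-registered sensors along the
# channel axis so a convolutional model can see them jointly.
fused_input = np.concatenate(
    [s2_optical, s1_sar, landsat, cloud_mask.astype(np.float32)], axis=0
)
print(fused_input.shape)  # (20, 512, 512) -> channels, height, width
```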

Synthetic Data:

I think it's important to first clarify what we mean by 'synthetic data', because the term is already used in machine learning and many other industries. That makes it a bit ambiguous and maybe even misleading. Still, I think it's the most descriptive term for what we are doing, as long as it comes with a fair disclaimer (if you know a better word, please tell me).

Synthetic data is annotated information that computer simulations or algorithms generate as an alternative to real-world data. Source: Nvidia Blog

Based on this definition, the term is mostly used (in machine learning) to describe input data that does not originate from the real world. We use it the other way around: our input is exclusively real-world data, but no pixel in our output imagery can be traced directly back to a real-world pixel. This is in stark contrast to mosaicking tools and similar techniques, where every pixel value can be traced back to its origin. In earth observation there is a plethora of available data and, thereby, less reason to use synthetic input data. But I would argue there are plenty of reasons to use synthetic output data, and that it's always better than outdated imagery.
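To illustrate the difference, here is a toy sketch of classic mosaicking, where every output pixel can be traced to a real acquisition. A synthetic image offers no such mapping, since every pixel is a model estimate conditioned on all inputs at once:

```python
import numpy as np

# Toy stack: 4 acquisitions of the same 256x256 tile, single band,
# with a boolean cloud mask per acquisition (synthetic demo data).
scenes = np.random.rand(4, 256, 256)
clouds = np.random.rand(4, 256, 256) > 0.6  # True = cloudy

# Classic mosaicking: for every pixel, keep the most recent cloud-free
# observation and remember which scene it came from.
mosaic = np.full((256, 256), np.nan)
origin = np.full((256, 256), -1)  # scene index each pixel traces back to
for t in range(scenes.shape[0]):  # oldest to newest; newer overwrites older
    clear = ~clouds[t]
    mosaic[clear] = scenes[t][clear]
    origin[clear] = t

# Every valid mosaic pixel is traceable to a real acquisition...
print(np.unique(origin))  # e.g. [-1, 0, 1, 2, 3]
# ...whereas a model-generated ("synthetic") image has no such provenance map.
```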

The Easy and The Hard Cases:

It should come as no surprise that there are easy cases and very difficult cases when increasing the temporal frequency of optical satellites. Our favorite weather is either hazy, full of cumulus clouds (the small clouds you see in the summer), or full of cirrus clouds (thin, high-altitude clouds). In these situations, several optical wavelengths penetrate the clouds nicely and make it possible to recreate the entire image and all spectral bands (more or less) perfectly. That's the easy stuff! It is far from trivial (it requires state-of-the-art deep learning and very smart coders), but relatively speaking it's much easier than cases with dense clouds looming over Europe in autumn or winter.
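For intuition only, here is a rough sketch of how one might triage scenes into 'easy' and 'hard' cases using the Sentinel-2 cirrus band (B10). The thresholds are illustrative guesses, not our production logic:

```python
import numpy as np

def classify_scene_difficulty(b10_toa: np.ndarray, opaque_mask: np.ndarray) -> str:
    """Rough triage of a Sentinel-2 L1C scene into 'easy' vs 'hard' cases.

    b10_toa:     top-of-atmosphere reflectance of band B10 (the 1375 nm cirrus band)
    opaque_mask: boolean mask of opaque clouds from any cloud detector

    The thresholds below are illustrative guesses, not ClearSky Vision's.
    """
    cirrus_fraction = np.mean((b10_toa > 0.012) & (b10_toa < 0.035))
    opaque_fraction = np.mean(opaque_mask)

    if opaque_fraction < 0.05:
        return "easy: (nearly) cloud-free, at most thin cirrus or haze"
    if opaque_fraction < 0.30 and cirrus_fraction > opaque_fraction:
        return "easy: mostly semi-transparent cover, optical signal survives"
    return "hard: dense opaque clouds, reconstruction must lean on other data"
```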

Before/After True-Color Image.

Another case that also has visible cloud shadows intertwined with clouds:

Before/After True-Color Image.

What might be a bit surprising is that these "easy cases" make up a decent fraction of all images: roughly 15% of Sentinel-2 images (over Denmark) are naturally cloud-free, and roughly 30% are within the scope of "easy" data fusion. This means we can fix these cases in what I previously called more-or-less perfect quality (it's purposely a bit vague, but bear with me on this one). Another case with hazy weather and a few small cumulus clouds:

Before/After True-Color Image.

The harder cases are not as intuitive to showcase this way: an attentive reader will point out that we could just as well be faking the example, so it proves nothing without more in-depth information. To give the relevant context, I need to present the data behind these estimations, but that sounds like a perfect topic for the next article, as it means introducing a lot of new information. Here is an example of a more difficult cloud-free prediction nonetheless:

Sentinel-2 image from the 17th of May 2020. The lime-green fields in the lower part are rapeseed blooming.

The Future We Want to Build:

I strongly believe the market is going in this direction, even though synthetic data, mosaicking tools, cloud-free composites, and data fusion techniques are rarely used in agricultural monitoring and food security applications. Don't get me wrong, I understand the hesitation and the need for innovation before utilization. Still, the amount of discarded optical imagery is staggering. It feels like a waste not to make better use of all this data, and, sadly, merely spotting and discarding it efficiently is what gets recognized as state-of-the-art innovation... That's not entirely unfair, since it is very inefficient to send clouded data (white pixels) from server to server to customer while disrupting the workflow of everyone interacting with it. But when we say cloudless imagery, we do not mean "a good cloud mask" as with s2cloudless; we mean "making clouded imagery cloudless" or "making unusable data useful again". Ehh, the last one sounds too much like a cheap slogan.
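For reference, this is roughly what the "good cloud mask" approach looks like with the s2cloudless library. Note that it only tells you where the clouds are; it does not fill them in:

```python
import numpy as np
from s2cloudless import S2PixelCloudDetector  # pip install s2cloudless

# Input: (n_images, height, width, 10) L1C reflectances in [0, 1] for the ten
# bands the detector expects (B01, B02, B04, B05, B08, B8A, B09, B10, B11, B12).
# Random values stand in for a real L1C patch here.
bands = np.random.rand(1, 256, 256, 10)

detector = S2PixelCloudDetector(threshold=0.4, average_over=4, dilation_size=2)
cloud_prob = detector.get_cloud_probability_maps(bands)  # per-pixel probability
cloud_mask = detector.get_cloud_masks(bands)             # binary cloud mask

# The masked pixels are *removed* from your analysis; a synthetic image
# replaces them with a model estimate instead, keeping the time series whole.
print(cloud_mask.shape, cloud_mask.mean())
```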

But why do we care about getting more usable imagery than what’s already easily available right now? The simple answer is that it’s a necessity for continuous monitoring.

Continuous Monitoring:

It's understandable why precision agriculture has such slow and low uptake in Europe: all decision-making information lags behind, and sporadic imagery from whenever the sky happens to be blue is a mere snapshot of that particular moment. What are the odds of receiving a usable image when it really matters? Low. The information you act on will almost always be outdated, its timing dictated by sporadic (unpredictable) weather, or by God's will if you are religious. That limits the positive impact of precision agriculture and leaves you uncertain about when the next image might come around: in a day, three days, a week, or a month?
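A back-of-the-envelope calculation, using the ~15% cloud-free figure from earlier and (simplistically) assuming independent weather between overpasses, shows why:

```python
# Odds of getting at least one usable optical image within a waiting window,
# using the ~15% naturally cloud-free figure quoted above for Denmark.
p_clear = 0.15    # P(a given Sentinel-2 acquisition is cloud-free)
revisit_days = 3  # approximate Sentinel-2 revisit at mid-latitudes

for window_days in (3, 7, 14, 30):
    n_passes = window_days // revisit_days
    p_at_least_one = 1 - (1 - p_clear) ** n_passes
    print(f"{window_days:2d} days: {p_at_least_one:.0%} chance of >=1 clear image")
```

In reality, cloudy weather tends to persist over several days, so the true odds over short windows are often even worse than this independence assumption suggests.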

One column for each day. Rows from top to bottom: Sentinel-2 True-Color, Synthetic True-Color, Sentinel-2 NDRE, and Synthetic NDRE. NDRE is a Red-Edge Biomass Index.

It's important to note that no prediction utilizes future information (this would defeat the whole purpose in the first place), and the use of true-color visualization is simply an easy way to show changes our human eyes understand. We predict all spectral bands of Sentinel-2 and only afterward combine them into true-color images or any of the other vegetation indices you will see throughout our articles. It's much more difficult to recreate the underlying spectral bands than a single biomass index, but we do it because an arbitrary index is difficult to use for most farmers, agricultural companies, or researchers. My experience is that everyone I talk to uses a different index or analysis to measure biomass changes, since no single metric is perfect for all situations.
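For anyone who wants to reproduce the visualizations, here is how NDRE and a quick true-color composite are typically derived from Sentinel-2 bands. The gain in the stretch is an arbitrary display choice, nothing more:

```python
import numpy as np

def ndre(b08: np.ndarray, b05: np.ndarray) -> np.ndarray:
    """Normalized Difference Red Edge index from Sentinel-2 bands.

    b08: near-infrared reflectance (Sentinel-2 band B08)
    b05: red-edge reflectance (Sentinel-2 band B05, resampled to 10 m)
    """
    return (b08 - b05) / (b08 + b05 + 1e-9)  # epsilon avoids division by zero

def true_color(b04: np.ndarray, b03: np.ndarray, b02: np.ndarray,
               gain: float = 3.0) -> np.ndarray:
    """Quick true-color composite (R=B04, G=B03, B=B02) for visualization.

    The gain is a simple illustrative brightness stretch, not a calibrated one.
    """
    rgb = np.stack([b04, b03, b02], axis=-1)
    return np.clip(rgb * gain, 0.0, 1.0)
```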

Growth of Spring Barley across 20 days in May (Denmark, 2020)

It's not only the absolute values that can change quickly during the spring growing season; the variation within the field can change too. This means that outdated imagery, even by just a few days, can result in poor fertilizer and pesticide distribution. That's not only a waste for the farmer but also for the environment and our future food security. Historical information is difficult to use for this purpose, and accelerating climate change heavily limits the usability of models that require multi-year data (which would, by definition, be outdated as well). These synthetic examples do not require multi-year data; they depend only on recent data. Think days' worth of data, not years.
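As a toy illustration of why the spatial pattern matters, one can track the within-field coefficient of variation of an index over a few dates. The numbers below are fabricated placeholders:

```python
import numpy as np

# Within-field variation, not just the field mean, drives variable-rate
# fertilizer maps. A coefficient of variation (CV) per date shows how fast
# that spatial pattern can shift (toy data below).
dates = ["May 05", "May 10", "May 15", "May 20"]
field_ndre = [np.random.rand(80, 120) * 0.1 + m for m in (0.30, 0.38, 0.47, 0.55)]

for date, img in zip(dates, field_ndre):
    mean, std = img.mean(), img.std()
    print(f"{date}: mean NDRE {mean:.2f}, within-field CV {std / mean:.1%}")
```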

The last example in this round simply showcases how difficult it is to use Sentinel-2 imagery straight out of the box for continuous monitoring or rapid change detection.

11 Days of Sentinel-2 and ClearSky Vision (Denmark, 2020)

The five Sentinel-2 images in this 11-day time series all look very different and present different problems, even though getting three "almost cloud-free" images in this timespan is well above the yearly average.

Time-Lapse:

I have a small treat for anyone still reading, as I think time-lapses are the most fascinating use case that doesn't directly bring any value. It's more 'show and tell', unless you actually use the georeferenced data behind the video. Anyhow, below is a time-lapse covering the seven months from April to October.

True-Color and NDVI Time-Lapse (April to October)
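If you want to build a time-lapse like this yourself, it's little more than writing the image stack out in order. A minimal sketch, assuming the imageio library and a list of true-color arrays scaled to [0, 1] (placeholders here):

```python
import numpy as np
import imageio.v2 as imageio  # pip install imageio

# Stand-in frames; in practice these would be synthetic true-color composites
# of the same tile, ordered by date.
frames = [np.random.rand(256, 256, 3) for _ in range(30)]

gif_frames = [(f * 255).astype(np.uint8) for f in frames]
imageio.mimsave("timelapse.gif", gif_frames, duration=0.25)  # seconds per frame
```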

What's Next?

In an upcoming Medium article, I'll dive deeper into the data supporting these estimations and show how a Landsat 8 image can be turned into a fully fledged Sentinel-2 image with all the necessary spectral bands, plus what kind of data our algorithms like best. However, no two cases are the same, as ground cover and cloud cover keep changing, and that makes simple answers a bit ambiguous. Anyhow, check back in a week for more thoughts on this and on how we utilize SAR data for the tough estimations.

If you want to learn more about us straight away, I recommend visiting www.clearsky.vision or our LinkedIn profile. Or reach out to me at mfp@clearsky.vision for sample data or a good, honest discussion about earth observation, data fusion, or deep learning.
