How good is the new GFS?
The GFS was upgraded in June and the changes were substantial. The so-called "dynamical core" of the model, the guts of a numerical modeling system that are responsible for solving the equations that govern atmospheric motions, is all new and based on the NOAA Geophysical Fluid Dynamics Laboratory (GFDL) Finite-Volume Cubed-Sphere dynamical core known as the FV3.
The equations that govern atmospheric motions are well known and based on physical principles such as the conservation of mass, conservation of momentum, ideal gas law, etc. Solving these equations efficiently for a sphere (i.e., Earth) on a massively parallel computer, however, isn't straightforward. The old dynamical core of the GFS was dated, not optimal for modern computer infrastructure, and problematic as model grid spacings continued to decrease. Thus, a change was needed.
Numerical modeling systems also need "physics" to account for things like radiation, cloud, land-surface, and other processes that affect the atmosphere. These are more difficult since we either don't know or can't define the physical equations that govern these processes, or they occur on scales that are smaller than can be simulated directly on current computers. Cloud processes provide a good example. We simply cannot simulate directly the formation of every cloud droplet, rain drop, drizzle drop, and ice particle in a cloud. Shortcuts must be made. This is called parameterization.
The new GFS is based primarily on the old GFS parameterizations, except for replacement of the old cloud parameterization with a new one developed at GFDL and a few tweaks to the land surface and ozone/water vapor photochemistry parameterizations.
Perhaps the above is TMI, but it establishes that the new GFS is substantially different from the old, so knowing whether or not this has led to changes in bias and skill is important for predicting deep powder days over the western CONUS.
Fortunately we have the answer, thanks to the efforts of Marcel Caron, a graduate student in my research group who recently completed his M.S.
Prior to the GFS upgrade, the National Weather Service went back and ran a developmental version of the new GFS (known as GFSv15.0) for previous years. This is sometimes called reforecasting. It allows for a comparison with the old GFS (known as GFSv14, without a .0 just to make things confusing). Ultimately, the operational version of the new GFS that was implemented in June was based on GFSv15.1, but for the water equivalent of precipitation presented here, results are expected to be similar.
Further, just to make things interesting, we'll take a look at how the European Centre for Medium-Range Weather Forecasts (ECMWF) global model (HRES) does as well.
We will focus on the 2017-18 cool season, for which we have forecasts or reforecasts from all three modeling systems. Note that the HRES underwent an upgrade this summer as well, so it is at a slight disadvantage relative to the GFS since we are evaluating the previous operational version rather than the current one. The 2017-18 cool season was drier than average across much of the west, which is unfortunate, but you do what you can. We will compare model performance at SNOTEL stations, which are located primarily in the mountains, and at ASOS stations, which are located primarily in the valleys.
To begin, let's look at the ratio of total seasonal precipitation produced by each modeling system compared to that observed, which we call the bias ratio. A value greater than one indicates that the model produces too much total precipitation and a value less than one indicates that the model produces too little. All three models, for day 1 (12-36 hour) and day 3 (60-84 hour) forecasts, feature bias ratios that are well above 1 at most ASOS stations, with HRES being the wettest of the three.
|Bias ratios at ASOS stations. From Caron and Steenburgh (2019, submitted).|
|Bias ratios at SNOTEL stations. From Caron and Steenburgh (2019, submitted).|
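If you want to compute this yourself, the bias ratio is simply the seasonal total of forecast precipitation divided by the seasonal total observed. A minimal sketch in Python (this is illustrative, not the code behind the figures; how you pair forecasts with station observations is up to you):

```python
def bias_ratio(forecast_totals, observed_totals):
    """Seasonal bias ratio: total forecast precipitation divided by
    total observed precipitation, summed over stations.
    A value > 1 means the model is too wet; < 1 means too dry."""
    return sum(forecast_totals) / sum(observed_totals)

# Hypothetical seasonal totals (mm) at two stations:
wet_bias = bias_ratio([30.0, 25.0], [20.0, 24.0])  # 55/44 = 1.25, a wet bias
```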
Total accumulated precipitation, however, doesn't tell you anything about individual events, defined here based on 24-hour accumulations. A model could have a bias ratio near 1, but overproduce smaller precipitation events and underproduce larger ones. So, it is worth taking a look at the frequency bias, or the ratio of the number of events a model produces within a size range to the number observed. A value greater than one indicates the model produces too many events and a value less than one too few, although we treat values between 0.85 and 1.2 as "neutral" given precipitation measurement uncertainties. Results are, however, fairly consistent with what one might expect based on bias ratio. At ASOS stations, all three models produce too many events for event sizes up to 7.62 mm (0.25"; larger events are not presented due to small sample sizes), with the HRES producing the largest bias. At SNOTEL stations, all three models produce near-neutral or marginally dry frequency biases, with the HRES producing the largest underprediction of event frequencies for larger events (>25.4 mm or 1").
|Event frequency bias at (a) ASOS and (b) SNOTEL stations. From Caron and Steenburgh (2019, submitted).|
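The frequency bias for a given event-size bin is just a count ratio. A minimal sketch (again illustrative only; the bin edges here are placeholders, not the ones used in the study):

```python
def frequency_bias(forecast_amounts, observed_amounts, lo, hi):
    """Frequency bias for a 24-hour event-size bin [lo, hi):
    number of forecast events in the bin divided by the number observed.
    > 1 means too many forecast events; < 1 means too few."""
    n_fcst = sum(lo <= x < hi for x in forecast_amounts)
    n_obs = sum(lo <= x < hi for x in observed_amounts)
    return n_fcst / n_obs

# Hypothetical 24-h accumulations (mm); bin covers events up to 7.62 mm (0.25")
fb = frequency_bias([1.0, 2.0, 3.0, 8.0], [1.0, 2.0, 9.0, 10.0], 0.0, 7.62)
# 3 forecast events vs. 2 observed -> 1.5, an overprediction of small events
```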
Biases of the type above are valuable since they can help you adjust model forecasts, but ultimately, one also needs to know how well model forecasts correspond to reality. A model could produce the same frequency of large precipitation events as observed, but on the wrong days. Such a model would have an excellent bias, but really poor skill.
A common metric for examining model skill is the equitable threat score. I won't go into the details of how it is calculated, except to say that higher values are better and that a perfect model would produce an equitable threat score of 1. Below are equitable threat scores for all three models (colors and forecast days as in the figures above) at ASOS (left) and SNOTEL (right) stations. Whiskers indicate 95% confidence intervals, if you are into that sort of thing. At day 1 (dashed lines), GFSv14 and GFSv15.0 are nearly indistinguishable at most event thresholds. At day 3, GFSv15.0 produces improvements for most event sizes. The HRES produces more skillful forecasts in general, although GFSv15.0 closes the gap by day 3, especially for larger events at SNOTEL stations where the HRES has an underprediction issue.
|Equitable threat scores at ASOS (left) and SNOTEL (right) stations. From Caron and Steenburgh (2019, submitted).|
|Equitable threat scores based on event percentile ranking at ASOS (left) and SNOTEL (right) stations. From Caron and Steenburgh (2019, submitted).|
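For the curious, the equitable threat score comes from the standard 2x2 contingency table of forecast versus observed events, with a correction for hits expected by random chance. A minimal sketch (not the code used in the study):

```python
def equitable_threat_score(hits, misses, false_alarms, correct_negatives):
    """Equitable threat score (a.k.a. Gilbert skill score).
    hits: event forecast and observed
    misses: event observed but not forecast
    false_alarms: event forecast but not observed
    correct_negatives: event neither forecast nor observed"""
    total = hits + misses + false_alarms + correct_negatives
    # Hits expected by random chance, given the marginal totals
    hits_random = (hits + misses) * (hits + false_alarms) / total
    return (hits - hits_random) / (hits + misses + false_alarms - hits_random)

# A perfect forecast (no misses or false alarms) scores 1
perfect = equitable_threat_score(10, 0, 0, 90)  # -> 1.0
```

Note that the chance correction is what makes the score "equitable": a model can't inflate it simply by forecasting precipitation all the time.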
One quick caveat. The post above is based on work that is currently in review.