There has been a fair bit of debate in the meteorological community following this direct competitive criticism of the NWS by Accuweather. Several bloggers (such as Jason Samenow at Capital Weather Gang, Dennis Mersereau at The Vane and Mike Smith at Accuweather) have covered this issue in depth from a variety of viewpoints, so I'm not going to do so here. But this brings up the age-old question of who actually produces the best forecast, something that I've long been interested in.
Let's keep it simple. We're not going to look at forecasting tornadoes, nor are we going to look at multi-week forecasts of questionable origin. Instead, let's look at something that any good forecasting service should provide---the forecast for the next day's high and low temperature. Easy enough numbers to analyze and understand.
To get a variety of locations, I'm going to follow the WxChallenge weather forecasting competition from this year. Hundreds of meteorology undergraduates, graduates, and faculty participate in the WxChallenge. Participants spend two weeks forecasting the next day's high, low, maximum wind speed, and total precipitation at a random city somewhere in the US before switching to a new city. I took a sample of several of the cities used in this year's competition and, for the two-week period when the contest was forecasting for each city, recorded the next-day high and low temperature forecasts from a variety of forecast providers and some model output. All of these forecasts were taken at 2300 UTC the day before. You can see this year's schedule of WxChallenge cities here.
To examine the quality of the forecasts, I evaluate skill scores for each city's two-week forecasting period. A skill score measures your error relative to some baseline for comparison. We're going to use two baselines here---a climatology forecast (where you would simply forecast the climatological average high and low temperature for tomorrow) and a persistence forecast (where you just forecast that whatever happened today will happen again tomorrow). A skill score of 1 means you had a perfect forecast---no error. A skill score of 0 means that you had exactly the same skill as the baseline forecast used for comparison. A negative skill score means you actually did worse than the baseline. Any good forecast should be able to consistently beat the climatology and persistence baselines, so we're looking for scores between 0 and 1.
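For the curious, the calculation behind those scores can be sketched in a few lines of Python. This is a minimal illustration with made-up numbers, not my actual analysis code, and it assumes mean squared error as the accuracy measure (a common choice for this kind of skill score: 1 minus the ratio of your error to the baseline's error).

```python
def skill_score(forecasts, observations, baseline):
    """Skill relative to a baseline: 1 is a perfect forecast,
    0 matches the baseline exactly, negative is worse than the baseline."""
    n = len(observations)
    mse = sum((f - o) ** 2 for f, o in zip(forecasts, observations)) / n
    mse_base = sum((b - o) ** 2 for b, o in zip(baseline, observations)) / n
    return 1 - mse / mse_base

# Hypothetical observed highs over four days, plus three forecasts:
obs      = [71, 68, 74, 70]
provider = [70, 69, 73, 71]   # a provider's next-day forecasts
persist  = [73, 71, 68, 74]   # persistence: the previous day's observed high
climo    = [72] * 4           # climatology: the long-term average high

print(skill_score(provider, obs, persist))  # skill vs. persistence
print(skill_score(provider, obs, climo))    # skill vs. climatology
```

A provider that beats both baselines lands between 0 and 1 on both scores, which is exactly what the bars in the chart below measure for each forecast source.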
So, without further ado, here's a big graph of the skill scores:
The blue bars represent the skill scores for the low temperature (the lighter bar is skill against persistence, the darker bar is skill against climatology) and the red bars represent the skill scores for the high temperature. Each city has its own plot. The forecast sources are grouped together and sorted from left to right by average skill score, with the highest-scoring source at the left. By this definition, the first forecast source you see in each row was the "best" forecaster for that particular two-week period at that city.
Some background on these forecast sources---WUTWC is Weather Underground/The Weather Channel; ACUWX is Accuweather; NWS is the National Weather Service; HAMWX, FCSTIO, WWO and METGRP are other private forecasting companies; GFS and NAM are the GFS and NAM MOS forecasts (automated model-generated forecasts); and USL12Z and USL22Z are two more automated model-generated forecasts.
So what does the plot show us? Of the six cities looked at, Weather Underground/The Weather Channel (WUTWC) was the best at two of them, the automated USL22Z model was the best at two, and FCSTIO (forecast.io) and Accuweather (ACUWX) were the best at the remaining two. All of these are either private or automated forecasts.
Some cities were quite challenging---a lot of negative skill scores for low temperature at Butte, Montana (KBTM) show that in some cases forecasts swung too wildly and didn't capture local changes in low temperature well. In fact, the automated forecasts (the USLs, NAM and GFS) all beat out the other forecasts at Butte. It's interesting that for a city with such varied weather and such a complex forecast as Butte, forecasts that included human meteorologists were unable to achieve higher skill than any of the automated forecasts...
Where was the National Weather Service in all of this? Decidedly in the middle of the pack. The best they did was 4th at Long Beach, and at all cities either ACUWX or WUTWC (or both) had higher average skill.
So what can we learn from this? Forecast quality depends a lot on when and where you're looking. In some cases, automated or raw model forecasts can outperform human forecasters. It would also be a mistake to think of the NWS as a "gold standard" when it comes to forecasts for next-day high and low temperature---companies like The Weather Company (which owns Weather Underground and The Weather Channel) and Accuweather routinely provide better high and low temperature forecasts. That being said, this is a very small sample size, and there are a ton of other events and variables to forecast, many of which the Weather Service excels at. It's also striking how many other companies providing forecasting services (see the plethora of other sources on that chart that I didn't mention) really aren't very good at all. They routinely do worse than freely-available model guidance (like the NAM or the GFS) or the NWS forecast. So choose your forecast provider wisely!