The authors have expanded on their original submission by extrapolating the estimated weights using a neural network and adding a third product to the merger. While the manuscript is certainly improved from the first submission, I still have major reservations about the motivation for this work and the appropriateness/accuracy of the chosen methodology. These issues would need to be addressed prior to publication.
1.
The motivation for merging these particular products needs to be provided in the manuscript (not just in the reviewer response). Also, there are other products available that you could use, say from reanalyses.
Given that all three input ET products are forced with the same data, it is not clear to me why using a statistical method (based on random errors) would be beneficial to account for systematic differences (driven by model structure / parameters) between the three products. This is a fundamental flaw in this work.
While this is not actually stated, it seems to me that the main point of this work is to determine whether these three products can be improved upon by merging them, and if so, can they be merged more effectively than using an (unweighted) average. There are two steps to this:
i. Do the weights (or, more importantly, the final product) differ significantly between the standard average and the weights calculated using a more sophisticated approach?
Currently this has not been adequately answered, due to the flawed method used to estimate the weights.
ii. Can the weights calculated at the limited number of tower locations be usefully extrapolated globally?
Currently this cannot be answered, as not enough information is given regarding the NN used to extrapolate the weights.
2.
As in the first review: the inverse error variance weights are based on the assumption of unbiased and independent data sets. The data sets used here are strongly dependent and biased, and this cannot be ignored. At a minimum for publication, the weights (and consequently the averaging) need to be based on anomalies (to remove the bias, and the more systematic aspects of the dependence), and the lack of independence needs to be acknowledged prominently, including qualifying all conclusions by noting that these data sets were not independent. If you don't have enough data to use anomalies, then you cannot apply this method while simply ignoring the inconvenient biases.
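To make the anomaly-based calculation concrete, a minimal sketch follows. It is an illustration only, not the authors' implementation: the function name, the array layout, and the synthetic data are my own assumptions, and the anomaly is taken here simply as the departure from each series' own time mean (a seasonal climatology could be substituted).

```python
import numpy as np

def inverse_error_variance_weights(products, obs):
    """Weights from anomalies (hypothetical helper, for illustration).

    products: array of shape (n_products, n_times)
    obs:      array of shape (n_times,), e.g. tower observations
    """
    products = np.asarray(products, dtype=float)
    obs = np.asarray(obs, dtype=float)
    # Remove each series' own time mean so constant biases do not
    # contaminate the error variance estimate.
    prod_anom = products - products.mean(axis=1, keepdims=True)
    obs_anom = obs - obs.mean()
    # Error variance of each product's anomalies against the observed anomalies.
    err_var = np.mean((prod_anom - obs_anom) ** 2, axis=1)
    inv = 1.0 / err_var
    # Normalise so the weights sum to 1.
    return inv / inv.sum()

# Synthetic data (assumed values): product 0 has a large constant bias
# but the smallest random error.
rng = np.random.default_rng(0)
truth = rng.normal(size=500)
products = np.stack([
    truth + 5.0 + 0.1 * rng.normal(size=500),
    truth + 0.5 * rng.normal(size=500),
    truth + 1.0 * rng.normal(size=500),
])
weights = inverse_error_variance_weights(products, truth)
```

In this synthetic case the biased product still receives the largest weight, because the bias has been removed before the error variance is estimated. Note that this sketch still assumes independent errors, which is the separate problem raised above.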
3.
The NN is not adequately explained, and it reads as if it has been applied blindly rather than carefully investigated. As with the first review, I still have concerns about over-fitting. Since so little information is given, it is not clear whether the NN was not useful for extrapolation because of the limited locations of the tower observations, or because the NN itself was inadequate.
For publication, the NN needs to be more adequately explained. This includes details of how the training data sets were split into training and evaluation data within the NN algorithm (and how this then relates to the experiments where a single station was removed), how the input data were selected / what else was tested, and how you ensure that the output weights sum to 1.
Also, it isn't stated whether a separate NN is calculated for each day of the year, or if all data are thrown into a single NN. I suspect it is the former, but would strongly recommend the latter.
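One common way to guarantee that predicted weights sum to 1 is to pass the raw network outputs through a softmax. The sketch below shows only that normalisation step; whether the authors do something similar is not stated in the manuscript, and the function name and input values are hypothetical.

```python
import numpy as np

def softmax_weights(raw_outputs):
    """Map raw NN outputs (..., n_products) to weights that sum to 1."""
    raw = np.asarray(raw_outputs, dtype=float)
    # Subtract the row maximum for numerical stability; the result is unchanged.
    shifted = raw - raw.max(axis=-1, keepdims=True)
    expo = np.exp(shifted)
    return expo / expo.sum(axis=-1, keepdims=True)

# Two hypothetical grid cells, three products each.
raw = np.array([[0.2, -1.0, 0.5],
                [3.0, 3.0, 3.0]])
w = softmax_weights(raw)
```

A softmax also forces every weight to be positive, which is a design choice in its own right and should be stated if used.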
How do you deal with hemispheric differences? It doesn't really make sense to predict SH weights for a day of year, based on NH data only (there is only one SH tower, in Australia).
By training a single NN using data for all days (and not including day of year in the training data), you could withhold data for some days to test whether the NN can at least reproduce the sites that are sampled.
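The withholding test suggested here could be set up along the following lines. This is a sketch under my own assumptions (a per-sample day index, a 20% hold-out, hypothetical function and variable names), not a description of what the authors did.

```python
import numpy as np

def split_by_day(day_index, holdout_fraction=0.2, seed=0):
    """Return boolean (train, test) masks that withhold whole days."""
    day_index = np.asarray(day_index)
    rng = np.random.default_rng(seed)
    unique_days = np.unique(day_index)
    n_holdout = max(1, int(round(holdout_fraction * unique_days.size)))
    # Draw the withheld days without replacement.
    holdout_days = rng.choice(unique_days, size=n_holdout, replace=False)
    test_mask = np.isin(day_index, holdout_days)
    return ~test_mask, test_mask

# Hypothetical sample layout: 10 days with 3 tower samples per day.
days = np.repeat(np.arange(10), 3)
train_mask, test_mask = split_by_day(days)
```

Splitting by whole days, rather than by individual samples, keeps all samples from a withheld day out of training, so the evaluation is not flattered by same-day samples from other sites.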
MINOR:
A general grammar check would be useful, with attention paid to missing possessive apostrophes.
L49: please double check that the MTE product is a regression. I thought it was a machine learning approach, but could be wrong.
Section 2.1 (from first review also). Add the spatial resolutions in here.
Replace all instances of 'inverse-variance weighting' with 'inverse error variance weighting'
L210. "Only the systematic component of the error can potentially be captured by the NN, and there is not warranty that all the systematic errors are dependent on the model inputs".
This is incorrect/misleading. "Systematic errors" usually refers to biases, and these methods can predict the mean square of the random errors (which is what your inverse error variance weights should reflect). Note that statistical methods like this do not predict individual errors; rather, they predict the tendency towards larger/smaller errors. Regarding the second half of the sentence, this is why you need to do some work to show that you have selected appropriate input data sets for the NN.
L273: quantify 'too close'. How is 'clearly not representing the overall land cover' determined?
Explicitly note here that the towers do not provide comprehensive global coverage, and don't cover many biomes, climate regimes, or the Southern Hemisphere.
L280: representativity errors should be mentioned here (tower to 25 km).
Figure 2: 100 W m^-2 is a huge range for each colorbar. Please plot with a finer discretization.
Figure 4: It is stated repeatedly that the weights don't differ much from 1/3, and yet this plot shows a large deviation. Statistics are needed here to quantify the divergence from 1/3.
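As an indication of the kind of statistics I have in mind, the sketch below summarises how far a set of weights departs from equal weighting. The function name, the choice of statistics, and the example values are assumptions for illustration only.

```python
import numpy as np

def divergence_from_equal(weights):
    """Per-product departure from equal weighting.

    weights: array of shape (n_samples, n_products), rows summing to 1.
    """
    w = np.asarray(weights, dtype=float)
    equal = 1.0 / w.shape[1]          # 1/3 for three products
    dev = w - equal
    return {
        "mean_abs_dev": np.abs(dev).mean(axis=0),
        "rms_dev": np.sqrt((dev ** 2).mean(axis=0)),
        "max_abs_dev": np.abs(dev).max(axis=0),
    }

# Two hypothetical samples: one equally weighted, one not.
weights = np.array([[1 / 3, 1 / 3, 1 / 3],
                    [0.5, 0.3, 0.2]])
stats = divergence_from_equal(weights)
```

Reporting numbers of this kind alongside Figure 4 would let the reader judge whether "not much different from 1/3" is justified.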
Figure 7: It doesn't really make sense to plot global fields for three month blocks.
Figure 10: This plot is difficult to understand. Why not plot Ih v. RMSE?
Ih has not been defined anywhere in the manuscript.
L507: what about systematic errors in the obs? (inc. representativity)?
L509: "if the difference with the observations were mostly random in nature, we should not expect the observations to provide much guidance to combined the products".
This is not correct. See comment on L210.
L521: No. You could have defined the bias over the data time period that you do have.
L565: the occurrence of negative weights is because you have dependent products.
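This can be seen from the general minimum-variance weights for products with error covariance matrix C, which are proportional to C^-1 applied to a vector of ones. The sketch below uses that standard result with two hypothetical products; the covariance values are invented for illustration and are not taken from the manuscript.

```python
import numpy as np

def minimum_variance_weights(error_cov):
    """Weights minimising the merged error variance, constrained to sum to 1."""
    cov = np.asarray(error_cov, dtype=float)
    ones = np.ones(cov.shape[0])
    # Solve C x = 1 rather than forming the inverse explicitly.
    unnormalised = np.linalg.solve(cov, ones)
    return unnormalised / unnormalised.sum()

# Error variances 1 and 4 in both cases (assumed values).
independent = np.array([[1.0, 0.0],
                        [0.0, 4.0]])   # uncorrelated errors
dependent = np.array([[1.0, 1.8],
                      [1.8, 4.0]])     # error correlation of 0.9

w_independent = minimum_variance_weights(independent)
w_dependent = minimum_variance_weights(dependent)
```

With uncorrelated errors this reduces to the inverse error variance weights and both weights are positive. With strongly correlated errors the weight on the noisier product comes out negative, which is the behaviour noted at L565.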
L581: Qualify that this approach assumes that the surface water storage has not changed over this time period, and that this assumption will not necessarily hold over the limited time period you are using.
Note that this is the same precip used in the ET products, and that this does not represent an independent evaluation.
L605: either explain why the slope is an important statistic here, or delete it.