A mixed distribution to fix the threshold for Peak-Over-Threshold wave height estimation | Scientific Reports – Nature.com

This sections introduces the Extreme Value Theory and presents the proposed methodology of this work.

Extreme Value Theory (EVT) is associated to the maximum sample (M_n = text {max} (X_1, ldots , X_n)), where ((X_1, ldots , X_n)) is a set of independent random variables with common distribution function F. In this case, the distribution of the maximum observation is given by (Pr(M_n < x) = F^n(x)). The hypothesis of independence when the X variables represent the wave height over a determined threshold is quite acceptable, because, for oceanographic data, it is common to adopt a POT scheme which selects extreme wave height events that are approximately independent25. Also, in26, authors affirm that The maximum wave heights in successive sea states can be considered independent, in the sense that the maximum height is dependent only on the sea state parameters and not in the maximum height in adjacent sea states. This (M_n) variable is described with one of the three following distributions: Gumbel, Frechet, and Weibull.

One methodology in EVT is to consider wave height time series with the annual maximum approach (AM)27, where X represents the wave height collected on regular periods of time of one year, and (M_n) is formed by the maximum values of each year. The statistical behaviour of AM can be described by the distribution of the maximum wave height in terms of Generalized Extreme Value (GEV) distribution:

$$begin{aligned}{}&G(x)= left{ begin{matrix} exp left{ -left[ 1 + xi left( frac{x - mu }{sigma } right) right] ^{frac{1}{xi }} right} , &{} xi ne 0,\ exp left{ -exp left( - left( frac{x - mu }{sigma } right) right) right} , &{} xi = 0, end{matrix} right. end{aligned}$$

(1)

where:

$$begin{aligned}{}&0< x < 1 + xi left( frac{x-mu }{sigma } right) , end{aligned}$$

(2)

where (-infty< mu < infty ), (sigma > 0) and (-infty< xi < infty ). As can be seen, the model has three parameters: location ((mu )), scale ((sigma )), and shape ((xi )).

The estimation of the return values, corresponding to the return period ((T_p)), are obtained by inverting Eq. (1):

$$ z_{p} = left{ {begin{array}{*{20}l} {mu - frac{sigma }{xi }left[ {1 - left{ { - log (1 - p)} right}^{{ - xi }} } right],} & {xi ne 0,} \ {mu - sigma log left{ { - log(1 - p)} right},} & {xi = 0,} \ end{array} } right}{text{ }} $$

(3)

where (G(z_p) = 1 - p). Then, (z_p) will be exceeded once per 1/p years, which corresponds to (T_p).

The alternative method in the EVT context is the Peak-Over-Threshold (POT), where all values over a threshold predefined by the user are selected to be statistically described instead of only the maximum values28,29. POT method has become a standard approach for these predictions13,29,30. Furthermore, several improvements over the basic approach have been proposed by various authors since then19,31,32.

The POT method is based on the fact that if the AM approach uses a GEV distribution (Eq. 1), the peaks over a high threshold should result in the related approximated distribution: the Generalized Pareto Distribution (GPD). The GPD fitted to the tail of the distribution gives the conditional non-exceedance probability (P(Xle x | X > u)), where u is the threshold level. The conditional distribution function can be calculated as:

$$ P(X le x|X{text{ > }}u) = left{ {begin{array}{*{20}l} {1 - left( {1 + xi ^{*} left( {frac{{x - u}}{{sigma ^{*} }}} right)} right)^{{frac{1}{{xi ^{*} }}}} ,} & {xi ^{*} ne 0,} \ {1 - exp left( { - left( {frac{{x - u}}{{sigma ^{*} }}} right)} right),} & {xi ^{*} = 0.} \ end{array} } right. $$

(4)

There is consistency between the GEV and GPD models, meaning that the parameters can be related to (xi ^* = epsilon ) and (sigma ^* = sigma + xi (u - mu )). The parameters (sigma ) and (xi ) are the scale and shape parameters, respectively. When (xi ge 0), the distribution is referred to as long tailed. When (xi < 0), the distribution is referred to as short tailed. The methods used to estimate the parameters of the GPD and the selection of the threshold will be now discussed.

The use of the GPD for modelling the tail of the distribution is also justified by asymptotic arguments in14. In this paper, author confirms that it is usually more convenient to interpret extreme value models in terms of return levels, rather than individual parameters. In order to obtain these return levels, the exceedance rates of thresholds have to be determined as (P(X>u)). In this way, using Eq. (4) ((P(X>x|X>u)=P(X>x)/P(X>u))) and considering that (z_N) is exceeded on average every N observations, we have:

$$begin{aligned}{}&P(X>u)left[ 1+xi ^*left( frac{z_N - u}{sigma ^*}right) right] ^{-frac{1}{xi ^*}}=frac{1}{N}. end{aligned}$$

(5)

Then, the N-year return level (z_N) is obtained as:

$$begin{aligned}{}&z_N = u + frac{sigma ^*}{xi ^*}left[ (N * P(X>u))^{xi ^ *} - 1 right] . end{aligned}$$

(6)

There are many techniques proposed for the estimation of the parameters of GEV and GPD. In19, authors applied the maximum likelihood methodology (ML) described in14. However, the use of this methodology for two parameter distributions (i.e. Weibull or Gamma) has a very important drawback: these distributions are very sensitive to the distance between the high threshold ((u_2)) and the first peak16. For this reason, ML could be used with two-parameter distribution when (u_2) reaches a peak. As this peak is excluded, the first value of the exceedance is as far from (u_2) as possible. A solution would be to use the three-parameter Weibull and Gamma distributions. However, ML estimation of such distributions is very difficult, and the algorithms usually fit two-parameter distributions inside a discrete range of location parameters33.

As stated before, in this paper, we present a new methodology to model this kind of time series considering not only extreme values but also the rest of observations. In this way, instead of selecting the maximum values per a period (usually a year) or defining thresholds in the distribution of these extreme wave heights, which has an appreciable subjective component, we model the distribution of all wave heights, considering that it is a mixture formed by a normal distribution and a uniform distribution. The motivation is that the uniform distribution is associated to regular wave height values which contaminate the normal distribution of extreme values. This theoretical mixed distribution is used then to fix the threshold for the estimation of the POT distributions. Thus, the determination of the threshold will be done in a much more objective and probabilistic way.

Let us consider as a sequence of independent random variables, ((X_1, ldots , X_n)) of wave height data. These data follow an unknown continuous distribution. We assume that this distribution is a mixture of two independent distributions: (Y_1 sim N(mu , sigma )), and (Y_2 sim U(mu - delta , mu + delta )), where (N(mu , sigma )) is a Gaussian distribution, (U(mu - delta , mu + delta )) is a uniform distribution, (mu > 0) is the common mean of both distributions, (sigma ) is the standard deviation of (Y_1), and (delta ) is the radius of (Y_2), being (mu - delta > 0). Then (f(x) = gamma f_1(x) + (1-gamma ) f_2(x)), being (gamma ) the probability that an observation comes from the normal distribution, and f(x), (f_1(x)) and (f_2(x)) are the probability density functions (pdf) of X, (Y_1) and (Y_2), respectively.

For the estimation of the values of the four above-mentioned parameters ((mu , sigma , delta , gamma )), the standard statistical theory considers the least squares methods, the method of moments and the maximum likelihood (ML) method. In this context, Mathiesen et al.13 found that the least squares methods are sensitive to outliers, although Goda34 recommended this method with modified plotting position formulae.

Clauset et al.35 show that methods based on least-squares fitting for the estimation of probability-distribution parameters can have many problems, and, usually, the results are biased. These authors propose the method of ML for fitting parametrized models such as power-law distributions to observed data, given that ML provably gives accurate parameter estimates in the limit of large sample size36. The ML method is commonly used in multiple applications, e.g. in metocean applications25, due to its asymptotic properties of being unbiased and efficient. In this regard, White et al.37 conclude that ML estimation outperforms the other fitting methods, as it always yields the lowest variance and bias of the estimator. This is not unexpected, as the ML estimator is asymptotically efficient37,38. Also, in Clauset et al.35, it is shown, among other properties, that under mild regularity conditions, the ML estimation ({hat{alpha }}) converges almost surely to the true (alpha ), when considering estimating the scaling parameter ((alpha )) of a power law in the case of continuous data. It is asymptotically Gaussian, with variance ((alpha -1)^2/n). However, the ML estimators do not achieve these asymptotic properties until they are applied to large sample sizes. Hosking and Wallis39 showed that the ML estimators are non-optimal for sample sizes up to 500, with higher bias and variance than other estimators, such as moments and probability weighted-moments estimators.

Deluca and Corral40 also presented the estimation of a single parameter (alpha ) associated with a truncated continuous power-law distribution. In order to find the ML estimator of the exponent, they proceed by directly maximizing the log-likelihood (l(alpha )). The reason is practical since their procedure is part of a more general method, valid for arbitrary distributions f(x), for which the derivative of (l(alpha )) can be challenging to evaluate. They claim that one needs to be cautious when the value of (alpha ) is very close to one in the maximization algorithm and replace (l(alpha )) by its limit at (alpha =1).

Furthermore, the use of ML estimation for two-parameter distributions such as Weibull and Gamma distributions has the drawback16 previously discussed. Besides, the ML estimation is known to provide poor results when the maximum is at the limit of the interval of validity of one of the parameters. On the other hand, the estimation of the GPD parameters is subject of ongoing research. A quantitative comparison of recent methods for estimating the parameters was presented by Kang and Song41. In our case, having to estimate four parameters, we have decided to use the method of moments, for its analytical simplicity. It is always an estimation method associated with sample and population moments. Besides, adequate estimations are obtained in multi-parametric estimation and with limited samples, as shown in this work.

Considering (phi ) as the pdf of a standard normal distribution N(0,1), the pdf of (Y_1) is defined as:

$$begin{aligned}{}&f_1(x) = frac{1}{sigma }phi (z_x), text { } z_x = frac{x - mu }{sigma }, text { } x in {mathbb {R}}. end{aligned}$$

(7)

The pdf of (Y_2) is:

$$begin{aligned}{}&f_2(x) = frac{1}{2delta }, text { } x in (mu -delta ,mu +delta ). end{aligned}$$

(8)

Consequently, the pdf of X is:

$$begin{aligned}{}&f(x) = gamma f_1(x) + (1 - gamma ) f_2(x), text { } x in {mathbb {R}}. end{aligned}$$

(9)

To infer the values of the four parameters of the wave height time series ((mu ), (sigma ), (delta ), (gamma )), we define, for any symmetric random variable with respect to the mean (mu ) with pdf g and finite moments, a set of functions in the form:

$$begin{aligned}{}&U_k(x) = int _{|t - mu | ge x}|t - mu |^k g(t)dt, text { } x ge 0, text { } k = 1,2,3, ldots , end{aligned}$$

(10)

or because of its symmetry:

$$begin{aligned}{}&U_k(x) = 2int _{x+mu }^{infty }(t - mu )^k g(t)dt, text { } k = 1,2,3, ldots . end{aligned}$$

(11)

These functions are well defined for the same moments of the variable x, because:

$$begin{aligned}{}&U_k(x)< int _{-infty }^{infty }|t - mu |^k g(t)dt < infty , text { } k = 1,2,3, ldots . end{aligned}$$

(12)

Particularly, for the normal and uniform distributions, all the moments are finite, and the same happens for all the (U_k(x)) functions. This function measures, for each pair of values x and k, the bilateral tail from the value x of the moment with respect to the mean of order k of the variable. It is, therefore, a generalization of the concept of probability tail, which is obtained for (k = 0).

Now, if we denote the corresponding moments for the distributions (Y_1) and (Y_2) by (U_{k,1}(x)) and (U_{k,2}(x)), it is verified that:

$$begin{aligned}{}&U_k(x) = gamma U_{k,1}(x) + (1 - gamma ) U_{k,2}(x). end{aligned}$$

(13)

Then, to calculate the function (U_k(x)), we just need to calculate the functions (U_{k,1}(x)) and (U_{k,2}(x)).

From the definition of (f_2(x)) and (U_k(x)), if (mu > delta ):

$$begin{aligned}{}&U_{k,2}(x) = 2 int _{mu +x}^{mu +delta }(t-mu )^kfrac{1}{2delta }dt = left. frac{(t-mu )^{k+1}}{(k+1)delta } right| _{mu +x}^{mu +delta } = frac{delta ^{k+1} - x^{k+1}}{(k+1)delta }, end{aligned}$$

(14)

then,

$$begin{aligned} & U_{k,2}(x) =left{ begin{array}{ll} frac{delta ^{k+1} - x^{k+1}}{(k+1)delta } & 0 le x le delta ,\ quad quad 0 & quad x >delta . end{array}right. end{aligned}$$

(15)

From the definition of the (f_1(x)) and (U_k(x)), we have:

$$begin{aligned}{}&U_{k,1}(x)=frac{2}{sigma }int _{mu +x}^infty (t-mu )^kphi left( frac{t-mu }{sigma } right) dt. end{aligned}$$

(16)

Let the variable u be in the form (u = frac{t-mu }{sigma }), then:

$$begin{aligned}{}&U_{k,1}(x)=2int _{frac{x}{sigma }}^infty (usigma )^kphi (u)du = sigma ^kUpsilon _kleft( frac{x}{sigma } right) , end{aligned}$$

(17)

where (Upsilon _k = 2 int _{x}^infty (u)^kphi (u)du). (Upsilon _k(z)) is the (U_k) function calculated for a N(0,1) distribution, which will be then updated with values of (k=1,2,3).

The following equations are verified:

$$begin{aligned}{}&Upsilon _1(x) = 2 int _{x}^infty uphi (u)du = 2phi (x),end{aligned}$$

(18)

$$begin{aligned}{}&Upsilon _2(x) = 2 int _{x}^infty u^2phi (u)du = 2(1-Phi (x)+xphi (x)),end{aligned}$$

(19)

$$begin{aligned}{}&Upsilon _3(x) = 2 int _{x}^infty u^3phi (u)du = 2(2+x^2)phi (x), end{aligned}$$

(20)

where (Phi ) is the cumulative distribution function (CDF) of the N(0,1) distribution. The demonstration is included below.

The three equations can be obtained using integration by parts, but it is easier to derive the functions (Upsilon _k(x)) to check the result. For the definition of the functions, for each value of k, we have:

$$begin{aligned}{}&Upsilon _k^{'}(x) = frac{partial Upsilon _k(x)}{partial x} = -2x^kphi (x). end{aligned}$$

(21)

Taking into account that (frac{partial phi (x)}{partial x}=-xphi (x)), and (frac{partial Phi (x)}{partial x}=phi (x)):

$$begin{aligned} frac{partial 2phi (x)}{partial x}&= -2xphi (x)= Upsilon _1^{'}(x), end{aligned}$$

(22)

$$begin{aligned} frac{partial (2(1-Phi (x)+xphi (x)))}{partial x}&= 2(-phi (x)+phi (x)-x^2phi (x)) = nonumber \&=-2x^2phi (x)= Upsilon _2^{'}(x), end{aligned}$$

(23)

$$begin{aligned} frac{partial (2(2+x^2)phi (x))}{partial (x)}&=2(2xphi (x)-(2+x^2)xphi (x))=nonumber \&=-2x^3phi (x)=Upsilon _3^{'}(x). end{aligned}$$

(24)

Therefore, the left and right sides of the previous equations differ in, at most, a constant. To verify that they are the same, we check the value (x=0):

$$begin{aligned}{}&Upsilon _1(0) = 2 int _0^infty u phi (u)du = sqrt{frac{2}{pi }},end{aligned}$$

(25)

$$begin{aligned}{}&Upsilon _2(0) = 2 int _0^infty u^2 phi (u)du = 1,end{aligned}$$

(26)

$$begin{aligned}{}&Upsilon _3(0) = 2 int _0^infty u^3 phi (u)du = 2 sqrt{frac{2}{pi }}, end{aligned}$$

(27)

which match with the right sides of Eqs. (18)(20):

$$begin{aligned}{}&Upsilon _1(0) = 2 phi (0) = sqrt{frac{2}{pi }},end{aligned}$$

(28)

$$begin{aligned}{}&Upsilon _2(0) = 2 (1-Phi (0)) = 1,end{aligned}$$

(29)

$$begin{aligned}{}&Upsilon _3(0) = 2 (2)phi (0) = 2 sqrt{frac{2}{pi }}. end{aligned}$$

(30)

Substituting these results in Eq. (17) we have:

Read the original:

A mixed distribution to fix the threshold for Peak-Over-Threshold wave height estimation | Scientific Reports - Nature.com

Related Posts

Comments are closed.