Module pyl4c.stats
Various statistical functions. Note that many have an add_intercept argument that is True by default, which means the X matrix will have a column of ones added for the calculation of the intercept. If you are already using a design/model matrix with an intercept term, be sure to set add_intercept = False.
Functions
def detrend(y, x=None, fill=False)
Expand source code
def detrend(y, x = None, fill = False):
    '''
    Detrends a 1D series (linearly). Equivalently, returns the residuals
    of an OLS regression where the (transpose of the) design matrix is:

        1 1 1 1 ... 1
        0 1 2 3 ... N

    This removes the linear (straight line) component, e.g., of a time
    series with equal-size time steps.

    Parameters
    ----------
    y : numpy.ndarray
        The 1D series to detrend
    x : numpy.ndarray
        (Optional) The linear series describing the trend; usually a
        series of consecutive integers, e.g.: 1, 2, 3, ...
    fill : bool
        (Optional) True to fill NaNs with mean value (Default: False)

    Returns
    -------
    numpy.ndarray
        The detrended y values
    '''
    assert y.ndim == 1, 'Series to detrend must be a 1D vector'
    n, m = (y.size, 1)
    if x is not None:
        if x.ndim == 2:
            n, m = x.shape
    # Return all NaNs if the input array is all NaNs
    if np.all(np.isnan(y)):
        return np.repeat(np.nan, n)
    # Optionally, fill NaNs with the mean
    if fill:
        y = y.copy()
        nan_mask = np.isnan(y)
        y[nan_mask] = np.nanmean(y)
    x = np.arange(0, n) if x is None else x
    beta = ols(x, y, add_intercept = True)
    # Predicted values from the fitted line, using the [1 | x] design matrix
    yhat = np.hstack(
        (np.ones((n,)).reshape((n, 1)), x.reshape((n, m)))) @ beta
    if fill:
        # Restore NaNs at their original positions
        return np.where(nan_mask, np.nan, np.subtract(y, yhat))
    return np.subtract(y, yhat)
Detrends a 1D series (linearly). Equivalently, returns the residuals of an OLS regression where the (transpose of the) design matrix is:

    1 1 1 1 ... 1
    0 1 2 3 ... N

This removes the linear (straight line) component, e.g., of a time series with equal-size time steps.

Parameters
y : numpy.ndarray
    The 1D series to detrend
x : numpy.ndarray
    (Optional) The linear series describing the trend; usually a series of consecutive integers, e.g.: 1, 2, 3, ...
fill : bool
    (Optional) True to fill NaNs with mean value (Default: False)

Returns
numpy.ndarray
    The detrended y values
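For illustration, a minimal usage sketch with made-up data (it assumes only that NumPy and this module are importable):

import numpy as np
from pyl4c.stats import detrend

# A hypothetical series: straight-line trend plus noise
rng = np.random.default_rng(42)
y = 2.0 + 0.5 * np.arange(100) + rng.normal(0, 1, size = 100)
resid = detrend(y)  # Residuals after the linear trend is removed
print(round(resid.mean(), 6))  # Near zero: OLS residuals sum to zero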
def entropy(seq, base=2)
Expand source code
def entropy(seq, base = 2):
    '''
    Returns the Shannon entropy of an input sequence of elements. Default
    units are bits but other units can be returned by changing the base.
    All calculations are with base-2 logarithms; change in base is done
    through multiplying by the constant factor `log_b(a)` to change from
    base `a` to base `b`. Adapted from [1].

    NOTE: An estimate of the upper limit or "optimal" entropy for base
    `b` with `N` possible symbols can be obtained: math.log(N, b)

    1. http://rosettacode.org/wiki/Entropy#Python

    Parameters
    ----------
    seq : list or tuple or str
        Sequence of elements over which entropy is calculated
    base : int or float
        Base of the output units, e.g., 2 for "bits," e (2.718...) for
        "nats," and 10 for "bans" (Default: 2)

    Returns
    -------
    float
    '''
    # Trivial case, all symbols are the same
    if np.all(np.equal(seq[0], seq)):
        return 0.0 # Avoid a "-0.0" return
    # Count each unique symbol and compute its relative frequency
    p, lns = Counter(seq), float(len(seq))
    e = -sum(
        count / lns * np.log2(count / lns) for count in p.values())
    return e if base == 2 else e * math.log(2, base)
Returns the Shannon entropy of an input sequence of elements. Default units are bits but other units can be returned by changing the base. All calculations are with base-2 logarithms; change in base is done through multiplying by the constant factor log_b(a) to change from base a to base b. Adapted from [1].

NOTE: An estimate of the upper limit or "optimal" entropy for base b with N possible symbols can be obtained: math.log(N, b)

1. http://rosettacode.org/wiki/Entropy#Python

Parameters
seq : list or tuple or str
    Sequence of elements over which entropy is calculated
base : int or float
    Base of the output units, e.g., 2 for "bits," e (2.718…) for "nats," and 10 for "bans" (Default: 2)

Returns
float
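A quick sketch of the expected behavior; the sequences below are arbitrary examples:

from pyl4c.stats import entropy

print(entropy('1223334444'))            # About 1.846 bits
print(entropy('1223334444', base = 10)) # The same quantity in "bans"
print(entropy(['a', 'a', 'a', 'a']))    # 0.0 for a constant sequence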
def harmonic_ols(x, y, period=12)
Expand source code
def harmonic_ols(x, y, period = 12):
    r'''
    Returns OLS estimates for harmonic series of X, i.e., the matrix `A`
    of the equation `y = Ax + b` is a linear combination of sines and
    cosines.

    $$
    y_{i,t} = \alpha_i +
        \beta_{i,1}\,\mathrm{cos}\left(\frac{2\pi}{T}x_{i,t}\right) +
        \beta_{i,2}\,\mathrm{sin}\left(\frac{2\pi}{T}x_{i,t}\right) +
        \varepsilon_{i,t}
    $$

    Parameters
    ----------
    x : numpy.ndarray
        The independent variable, must be 1D
    y : numpy.ndarray
        The dependent variable, must be 1D
    period : int
        (Optional) The period T of the harmonic series (Default: 12)

    Returns
    -------
    numpy.ndarray
        The solution to the least-squares problem
    '''
    assert x.ndim == 1, 'Array x must be 1D'
    x0 = x.reshape((x.shape[0], 1))
    # Create initial design/ model matrix (without intercept), as this is
    #   augmented by ols()
    a = ((2 * np.pi) / period) # Harmonic coefficient
    xm = np.hstack((x0, np.cos(a * x0), np.sin(a * x0)))
    return ols(xm, y, add_intercept = True)
Returns OLS estimates for harmonic series of X, i.e., the matrix A of the equation y = Ax + b is a linear combination of sines and cosines:

$$
y_{i,t} = \alpha_i + \beta_{i,1}\,\mathrm{cos}\left(\frac{2\pi}{T}x_{i,t}\right) + \beta_{i,2}\,\mathrm{sin}\left(\frac{2\pi}{T}x_{i,t}\right) + \varepsilon_{i,t}
$$

Parameters
x : numpy.ndarray
    The independent variable, must be 1D
y : numpy.ndarray
    The dependent variable, must be 1D
period : int
    (Optional) The period T of the harmonic series (Default: 12)

Returns
numpy.ndarray
    The solution to the least-squares problem
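A sketch with synthetic monthly data; the four-year series and its amplitudes are invented for illustration:

import numpy as np
from pyl4c.stats import harmonic_ols

# Four years of monthly time steps with an annual (period-12) cycle
t = np.arange(48).astype(np.float64)
y = 5 + 3 * np.cos((2 * np.pi / 12) * t)
beta = harmonic_ols(t, y, period = 12)
# beta holds: [intercept, linear term, cosine amplitude, sine amplitude]
print(np.round(beta, 2))  # Approximately [5., 0., 3., 0.]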
def linear_constraint(xmin, xmax, form=None)
Expand source code
def linear_constraint(xmin, xmax, form = None):
    '''
    Returns a linear ramp function, for deriving a value on [0, 1] from
    an input value `x`:

        if x >= xmax:
            return 1
        if x <= xmin:
            return 0
        return (x - xmin) / (xmax - xmin)

    Parameters
    ----------
    xmin : int or float
        Lower bound of the linear ramp function
    xmax : int or float
        Upper bound of the linear ramp function
    form : str
        Type of ramp function: "reversed" decreases as x increases;
        "binary" returns xmax when x == 1; default (None) is increasing
        as x increases.

    Returns
    -------
    function
    '''
    assert form == 'binary' or np.any(xmax >= xmin),\
        'xmax must be greater than/ equal to xmin'
    if form == 'reversed':
        return lambda x: np.where(x >= xmax, 0,
            np.where(x < xmin, 1,
                1 - np.divide(x - xmin, xmax - xmin)))
    if form == 'binary':
        return lambda x: np.where(x == 1, xmax, xmin)
    return lambda x: np.where(x >= xmax, 1,
        np.where(x < xmin, 0, np.divide(x - xmin, xmax - xmin)))
Returns a linear ramp function, for deriving a value on [0, 1] from an input value x:

    if x >= xmax:
        return 1
    if x <= xmin:
        return 0
    return (x - xmin) / (xmax - xmin)

Parameters
xmin : int or float
    Lower bound of the linear ramp function
xmax : int or float
    Upper bound of the linear ramp function
form : str
    Type of ramp function: "reversed" decreases as x increases; "binary" returns xmax when x == 1; default (None) is increasing as x increases.

Returns
function
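For example, with arbitrarily chosen bounds:

import numpy as np
from pyl4c.stats import linear_constraint

# Ramp from 0 (at x <= 273) up to 1 (at x >= 293)
ramp = linear_constraint(273, 293)
print(ramp(np.array([263, 283, 303])))  # 0.0, 0.5, 1.0
reversed_ramp = linear_constraint(273, 293, form = 'reversed')
print(reversed_ramp(np.array([263, 283, 303])))  # 1.0, 0.5, 0.0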
def ols(x, y, add_intercept=True, use_qr_decomp=True)
Expand source code
def ols(x, y, add_intercept = True, use_qr_decomp = True):
    '''
    Returns ordinary least squares (OLS) estimates for X. If X is
    univariate (1D), returns the slope of the line between Y and X as
    well as the "y-intercept" or the intersection of this line with the
    vertical axis.

    Parameters
    ----------
    x : numpy.ndarray
        The independent variable(s); should be N x M where M is the
        number of variables
    y : numpy.ndarray
        The dependent variable, must be 1D
    add_intercept : bool
        True to add a y-intercept term (Default: True)
    use_qr_decomp : bool
        True to use QR decomposition to obtain solution (Default: True)

    Returns
    -------
    numpy.ndarray
        The solution to the least-squares problem
    '''
    assert y.ndim == 1, 'Array y must be 1D'
    n = x.shape[0] # Num. of samples
    m = 1 if x.ndim == 1 else x.shape[1] # Num. of covariates
    # Create design/ model matrix
    xm = x.reshape((n, m)) # Without y-intercept term, unless...
    if add_intercept:
        xm = np.hstack((np.ones((n,)).reshape((n, 1)), x.reshape((n, m))))
    if xm.shape[1] > n:
        raise ValueError('System of equations is rank deficient')
    # Generally better to use QR decomposition to obtain \hat{\beta}
    if use_qr_decomp:
        q, r = np.linalg.qr(xm)
        fit = np.linalg.inv(r) @ q.T @ y
    else:
        fit = np.linalg.inv(xm.T @ xm) @ xm.T @ y
    return fit
Returns ordinary least squares (OLS) estimates for X. If X is univariate (1D), returns the slope of the line between Y and X as well as the "y-intercept" or the intersection of this line with the vertical axis.
Parameters
x : numpy.ndarray
    The independent variable(s); should be N x M where M is the number of variables
y : numpy.ndarray
    The dependent variable, must be 1D
add_intercept : bool
    True to add a y-intercept term (Default: True)
use_qr_decomp : bool
    True to use QR decomposition to obtain solution (Default: True)

Returns
numpy.ndarray
    The solution to the least-squares problem
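A minimal sketch that recovers known coefficients from noisy, made-up data:

import numpy as np
from pyl4c.stats import ols

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 2 + 3 * x + rng.normal(0, 0.5, size = 50)
beta = ols(x, y)          # add_intercept = True by default
print(np.round(beta, 1))  # Approximately [2., 3.]: intercept, then slope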
def ols_variance(x, y, beta=None, add_intercept=True)
Expand source code
def ols_variance(x, y, beta = None, add_intercept = True):
    r'''
    Returns the unbiased estimate of OLS model variance:

    $$
    SSE / (n - p)
    $$

    Where:

    $$
    SSE = (y - X\beta)'(y - X\beta)
    $$

    SSE is known as the sum of squared errors of prediction and is
    equivalent to the residual sum of squares (RSS).

    Parameters
    ----------
    x : numpy.ndarray
        The independent variable(s); should be N x M where M is the
        number of variables
    y : numpy.ndarray
        The dependent variable, must be 1D
    beta : numpy.ndarray
        (Optional) Coefficient estimates, an M-dimensional vector
    add_intercept : bool
        True to add a y-intercept term (Default: True)

    Returns
    -------
    float
        The unbiased estimate of the model variance, SSE / (n - p)
    '''
    n = x.shape[0] # Num. of samples
    p = 1 if x.ndim == 1 else x.shape[1] # Num. of covariates
    p += 1 if add_intercept else 0
    # Alternatively, use xm and beta calculated in sum_of_squares()...
    #   return ((y - xm @ beta).T @ (y - xm @ beta)) / (n - p)
    return sum_of_squares(x, y, beta, add_intercept, which = 'sse') / (n - p)
Returns the unbiased estimate of OLS model variance:

$$
SSE / (n - p)
$$

Where:

$$
SSE = (y - X\beta)'(y - X\beta)
$$

SSE is known as the sum of squared errors of prediction and is equivalent to the residual sum of squares (RSS).

Parameters
x : numpy.ndarray
    The independent variable(s); should be N x M where M is the number of variables
y : numpy.ndarray
    The dependent variable, must be 1D
beta : numpy.ndarray
    (Optional) Coefficient estimates, an M-dimensional vector
add_intercept : bool
    True to add a y-intercept term (Default: True)

Returns
float
    The unbiased estimate of the model variance, SSE / (n - p)
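Continuing the hypothetical fit from the ols() example, a sketch of the variance estimate:

import numpy as np
from pyl4c.stats import ols, ols_variance

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 2 + 3 * x + rng.normal(0, 0.5, size = 50)
beta = ols(x, y)
sigma2 = ols_variance(x, y, beta)
print(sigma2)  # Close to the true noise variance, 0.5**2 = 0.25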
def rmsd(x1, x2, n=None, weights=None)
Expand source code
def rmsd(x1, x2, n = None, weights = None):
    r'''
    Returns the root mean-squared deviation (RMSD) between two
    continuously varying random quantities:

    $$
    RMSD(\hat{x}, x) = \sqrt{n^{-1}\sum_i^N (\hat{x}_i - x_i)^2}
    $$

    Where `N` (or `n`) is the number of samples (e.g., model cells or
    time steps).

    NOTE: `NoData` should be filled with `np.nan` prior to calling this
    function; it is assumed that both vectors have the same missingness.

    Parameters
    ----------
    x1 : numpy.ndarray
        A 1D or 2D numeric vector
    x2 : numpy.ndarray
        A 1D or 2D numeric vector
    n : int
        (Optional) The number of samples, for normalizing; if not
        provided, calculated as the number of non-NaN samples
    weights : numpy.ndarray
        Weights array of a shape that can be broadcast to match both
        x1 and x2

    Returns
    -------
    float
    '''
    assert isinstance(n, int) or n is None,\
        'Argument "n" must be of integer type'
    assert x1.ndim <= 2 and x2.ndim <= 2, 'No support for more than 2 axes'
    diffs = np.subtract(x1, x2)
    # Optionally weight the residuals first
    if weights is not None:
        if weights.ndim == 1:
            weights = weights.reshape((1, *weights.shape))
        diffs = diffs * weights
    diffs = diffs[~np.isnan(diffs)]
    n = n if n is not None else diffs.shape[0]
    return np.sqrt(np.divide(np.sum(np.power(diffs, 2)), n))
Returns the root mean-squared deviation (RMSD) between two continuously varying random quantities:

$$
RMSD(\hat{x}, x) = \sqrt{n^{-1}\sum_i^N (\hat{x}_i - x_i)^2}
$$

Where N (or n) is the number of samples (e.g., model cells or time steps).

NOTE: NoData should be filled with np.nan prior to calling this function; it is assumed that both vectors have the same missingness.

Parameters
x1 : numpy.ndarray
    A 1D or 2D numeric vector
x2 : numpy.ndarray
    A 1D or 2D numeric vector
n : int
    (Optional) The number of samples, for normalizing; if not provided, calculated as the number of non-NaN samples
weights : numpy.ndarray
    Weights array of a shape that can be broadcast to match both x1 and x2

Returns
float
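A small sketch with invented values, showing how NaNs affect the sample count:

import numpy as np
from pyl4c.stats import rmsd

predicted = np.array([1.1, 2.0, 2.9, np.nan])
observed = np.array([1.0, 2.0, 3.0, np.nan])
# NaN pairs are dropped before normalizing, so n = 3 here
print(rmsd(predicted, observed))  # Approximately 0.0816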
def sum_of_squares(x, y, beta=None, add_intercept=True, which='ssr')
Expand source code
def sum_of_squares(x, y, beta = None, add_intercept = True, which = 'ssr'):
    '''
    Calculates the sum-of-squared errors, sum-of-squared regressors, or
    total sum-of-squared errors based on the observed and predicted
    responses. Will calculate sum-of-squares for univariate linear
    regression if vector "x" is the independent variable and "y" is the
    dependent variable.

    Parameters
    ----------
    x : numpy.ndarray
        Observed response or independent variable
    y : numpy.ndarray
        Predicted response or dependent variable
    beta : numpy.ndarray
        (Optional) To avoid re-fitting (or fitting incorrectly) the OLS
        regression, can provide the coefficients of the OLS regression
    add_intercept : bool
        True to add an intercept term when fitting OLS regression
    which : str
        Which sum-of-squares quantity to calculate (Default: 'ssr');
        should be one of: ('sse', 'ssr', 'sst')

    Returns
    -------
    float
    '''
    assert y.ndim == 1, 'Array y must be 1D'
    if beta is None:
        beta = ols(x, y, add_intercept)
    n = x.shape[0] # Num. of samples
    m = p = 1 if x.ndim == 1 else x.shape[1] # Num. of covariates
    # Create design/ model matrix
    xm = x.reshape((n, m)) # Without y-intercept term, unless...
    if add_intercept:
        p += 1
        xm = np.hstack((np.ones((n,)).reshape((n, 1)), x.reshape((n, m))))
    yhat = xm @ beta # Calculate predicted values
    if which == 'sse':
        # Alternatively: ((y - xm @ beta).T @ (y - xm @ beta))
        return np.sum(np.power(np.subtract(y, yhat), 2))
    elif which == 'ssr':
        return np.sum(np.power(np.subtract(yhat, np.nanmean(y)), 2))
    elif which == 'sst':
        if add_intercept:
            return np.sum(np.power(np.subtract(y, np.nanmean(y)), 2))
        else:
            return np.sum(np.power(y, 2))
    else:
        raise NotImplementedError('Argument "which" not understood')
Calculates the sum-of-squared errors, sum-of-squared regressors, or total sum-of-squared errors based on the observed and predicted responses. Will calculate sum-of-squares for univariate linear regression if vector "x" is the independent variable and "y" is the dependent variable.
Parameters
x : numpy.ndarray
    Observed response or independent variable
y : numpy.ndarray
    Predicted response or dependent variable
beta : numpy.ndarray
    (Optional) To avoid re-fitting (or fitting incorrectly) the OLS regression, can provide the coefficients of the OLS regression
add_intercept : bool
    True to add an intercept term when fitting OLS regression
which : str
    Which sum-of-squares quantity to calculate (Default: 'ssr'); should be one of: ('sse', 'ssr', 'sst')

Returns
float
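A sketch of the usual decomposition, using the same hypothetical data as the ols() example:

import numpy as np
from pyl4c.stats import ols, sum_of_squares

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 2 + 3 * x + rng.normal(0, 0.5, size = 50)
beta = ols(x, y)
sse = sum_of_squares(x, y, beta, which = 'sse')
ssr = sum_of_squares(x, y, beta, which = 'ssr')
sst = sum_of_squares(x, y, beta, which = 'sst')
# With an intercept term, SSE + SSR = SST, so R-squared is:
print(ssr / sst)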
def t_statistic(x, y, beta_hat, beta=0, add_intercept=True)
Expand source code
def t_statistic(x, y, beta_hat, beta = 0, add_intercept = True):
    r'''
    Calculates t-statistics for each of the predictors of beta_hat; by
    default, the true value of beta is assumed to be zero (i.e., null
    hypothesis is true). Calculated as:

    $$
    T_{n-p} = \frac{\hat{\beta}_j - \beta_j}{\mathrm{s.e.}(\hat{\beta}_j)}
        \sim t_{n-p}
    $$

    Parameters
    ----------
    x : numpy.ndarray
        The independent variable(s); should be N x M where M is the
        number of variables
    y : numpy.ndarray
        The dependent variable, must be 1D
    beta_hat : numpy.ndarray
        The estimated value(s) for beta
    beta : numpy.ndarray
        (Optional) The true value(s) for beta
    add_intercept : bool
        True to add a y-intercept term (Default: True)

    Returns
    -------
    float
    '''
    assert x.ndim <= 2, 'Array x must be 1D or 2D only'
    model_var = ols_variance(x, y, beta_hat, add_intercept = add_intercept)
    n = x.shape[0] # Num. of samples
    m = 1 if x.ndim == 1 else x.shape[1] # Num. of covariates
    xm = x.reshape((n, m))
    if add_intercept:
        # Augment the design matrix so that it conforms with beta_hat
        xm = np.hstack((np.ones((n,)).reshape((n, 1)), xm))
    # s.e.(beta_hat): square root of the diagonal of Var(beta_hat)
    beta_hat_se = np.sqrt(np.diag(np.linalg.inv(xm.T @ xm) * model_var))
    return np.subtract(beta_hat, beta) / beta_hat_se
Calculates t-statistics for each of the predictors of beta_hat; by default, the true value of beta is assumed to be zero (i.e., the null hypothesis is true). Calculated as:

$$
T_{n-p} = \frac{\hat{\beta}_j - \beta_j}{\mathrm{s.e.}(\hat{\beta}_j)} \sim t_{n-p}
$$

Parameters
x : numpy.ndarray
    The independent variable(s); should be N x M where M is the number of variables
y : numpy.ndarray
    The dependent variable, must be 1D
beta_hat : numpy.ndarray
    The estimated value(s) for beta
beta : numpy.ndarray
    (Optional) The true value(s) for beta
add_intercept : bool
    True to add a y-intercept term (Default: True)

Returns
float
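A sketch, again with the hypothetical data from the ols() example; given the design-matrix handling shown in the source above, one t-statistic is returned per coefficient, including the intercept:

import numpy as np
from pyl4c.stats import ols, t_statistic

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 2 + 3 * x + rng.normal(0, 0.5, size = 50)
beta_hat = ols(x, y)
t = t_statistic(x, y, beta_hat)  # Tests H0: beta_j = 0 for each j
print(t)  # Both values should be large; the slope is clearly nonzero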