Module pyl4c.utils
Convenience functions for working with SMAP L4C data in HDF5 arrays.
NOTE: All of the functions beginning with get_ require access to ancillary data files.
Functions
def composite(*arrays, reducer='mean', target_band=0, nodata=-9999.0, dtype=numpy.float32, processes=1)
Source code:
def composite(
        *arrays, reducer = 'mean', target_band = 0, nodata = -9999.0,
        dtype = np.float32, processes = 1):
    '''
    Composites multiple raster arrays in a single band.
    '''
    if reducer not in ('sum', 'median', 'min', 'max', 'mean', 'std'):
        raise ValueError('Invalid reducer name')
    assert all([isinstance(a, np.ndarray) for a in arrays]),\
        'Every element of "arrays" must be an instance of np.ndarray'
    shp = arrays[0].shape
    assert all(map(lambda x: x == shp, [r.shape for r in arrays])),\
        'Raster arrays must have the same shape'
    # For single-band arrays...
    if arrays[0].ndim < 3:
        shp = (1, shp[0], shp[1])
        arrays = list(map(lambda r: r.reshape(shp), arrays))
    # Stack the arrays in a continuous, single-band "tapestry" using
    #   vstack(), then cut out the arrays concatenated in this way into
    #   separate bands using reshape()
    b = target_band
    stack = np.vstack(list(map(
        lambda r: r[b,...].reshape((1, shp[1], shp[2])), arrays)))
    # Insert nan into NoData locations
    stack = np.where(stack == nodata, np.nan, stack)
    # Find the reducer function based on its name (should be a "nan" version)
    reducer_func = partial(getattr(np, 'nan%s' % reducer), axis = 0)
    # Avoid traceback masking and memory errors with ProcessPoolExecutor
    #   when only a single process is requested
    if processes == 1:
        all_results = reducer_func(stack)
    else:
        # Get index ranges for each process to work on
        work = partition(stack, num_processes = processes, axis = 1)
        with ProcessPoolExecutor(max_workers = processes) as executor:
            all_results = executor.map(
                reducer_func, [stack[:,i:j] for i, j in work])
    # Stack each reduced band (and reshape to multi-band image)
    result = np.concatenate(list(all_results), axis = 0)\
        .reshape((1, shp[1], shp[2]))
    return np.where(np.isnan(result), nodata, result)
Composites multiple raster arrays into a single band; it is extremely fast when used with a process pool, but does not support normalization and will not composite more than one band at a time.
NOTE: This function does NOT calculate memory use and will NOT guard against memory overflow.
Parameters
arrays : *numpy.ndarray
- The input arrays to composite
reducer : str
- The name of the reducer function, one of: "sum", "median", "min", "max", "mean", or "std"
target_band : int
- The index of the band to composite
nodata : int or float
- The NoData value to ignore in compositing
dtype : type
- The data type to enforce in the output
processes : int
- (Optional) Number of processes to use
Returns
numpy.ndarray
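For example, a minimal sketch compositing three synthetic single-band arrays, in which the NoData cell of the second array is excluded from the mean:

    import numpy as np
    from pyl4c.utils import composite

    a = np.full((4, 4), 1.0)
    b = np.full((4, 4), 3.0)
    c = np.full((4, 4), 5.0)
    b[0, 0] = -9999.0  # NoData cell, ignored in compositing
    result = composite(a, b, c, reducer = 'mean', nodata = -9999.0)
    # result has shape (1, 4, 4); result[0, 0, 0] == 3.0 (mean of 1 and 5)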
def composite_hdf(hdf_file_glob, field, subset_id='CONUS', **kwargs)
Source code:
def composite_hdf(hdf_file_glob, field, subset_id = 'CONUS', **kwargs):
    '''
    A convenience wrapper for composite(), for use with HDF5 files
    specified by path names.
    '''
    assert subset_id in SUBSETS_BBOX.keys()
    assert len(glob.glob(hdf_file_glob)) > 0,\
        'No files found matching GLOB regular expression'
    arrays = []
    for path in glob.glob(hdf_file_glob):
        with h5py.File(path, 'r') as hdf:
            assert field in hdf.keys(),\
                'Variable "%s" not found in HDF5 file' % field
            arrays.append(subset(hdf, field, subset_id = subset_id)[0])
    return composite(*arrays, **kwargs)
A convenience wrapper for composite(), enabling its use with HDF5 files specified by path names. Accepts additional **kwargs as keyword arguments to composite().
NOTE: This function does NOT calculate memory use and will NOT guard against memory overflow.
Parameters
hdf_file_glob : str
- A file path expression for glob.glob()
field : str
- The hierarchical path to a variable within each HDF5 file
subset_id : str
- A well-known identifier for a bounding-box subset
Returns
numpy.ndarray
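For example, a sketch using a hypothetical file pattern and field name; any additional keyword arguments (e.g., reducer) are passed through to composite():

    from pyl4c.utils import composite_hdf

    # Hypothetical path and field; requires real SMAP L4C granules on disk
    gpp_mean = composite_hdf(
        '/data/SMAP_L4_C_mdl_*.h5', 'GPP/gpp_mean', subset_id = 'CONUS',
        reducer = 'mean')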
def get_ease2_coords(grid, in_1d=True)
Source code:
def get_ease2_coords(grid, in_1d = True):
    '''
    DEPRECATED: Use pyl4c.ease2.ease2_coords() instead.
    '''
    return ease2_coords(grid, in_1d)
DEPRECATED: Use pyl4c.ease2.ease2_coords() instead.
Returns a tuple of 1D arrays, the X- and Y-coordinates of the desired EASE-Grid 2.0, from reading the ancillary data file.
Parameters
grid : str
- The EASE-Grid 2.0 designation: M01, M09, etc.
in_1d : bool
- True to return only 1D arrays, avoiding redundant rows/columns (Default)
Returns
tuple
- X- and Y-coordinate arrays: (numpy.ndarray, numpy.ndarray)
def get_ease2_slice_idx(grid, subset_id)
Source code:
def get_ease2_slice_idx(grid, subset_id):
    '''
    Returns slice indices for the desired bounding box on an EASE-Grid 2.0.
    '''
    assert grid in EASE2_GRID_PARAMS.keys(),\
        'Could not understand grid argument; must be one of: %s' % ', '.join(EASE2_GRID_PARAMS.keys())
    x_coords, y_coords = ease2_coords(grid, in_1d = True)
    return get_slice_idx_by_bbox(x_coords, y_coords, subset_id)
Returns a tuple ((xmin, xmax), (ymin, ymax)) of the indices that can be used to slice a corresponding EASE-Grid 2.0 in order to extract the desired bounding box.
Parameters
grid : str
- The EASE-Grid 2.0 designation: M01, M09, etc.
subset_id : str
- Keyword designating the desired subset and its corresponding bounding box
Returns
tuple
- Tuple of tuples: ((xmin, xmax), (ymin, ymax))
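For example, a sketch of slicing a global M09 array down to the CONUS subset (requires the ancillary data files):

    from pyl4c.utils import get_ease2_slice_idx

    (xmin, xmax), (ymin, ymax) = get_ease2_slice_idx('M09', 'CONUS')
    # A global M09 array could then be subset as: array[ymin:ymax, xmin:xmax]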
def get_ease2_slice_offsets(grid, subset_id)
Source code:
def get_ease2_slice_offsets(grid, subset_id):
    '''
    Returns the X and Y offsets of an EASE-Grid 2.0 subset.
    '''
    assert grid in EASE2_GRID_PARAMS.keys(),\
        'Could not understand grid argument; must be one of: %s' % ', '.join(EASE2_GRID_PARAMS.keys())
    slice_idx = get_ease2_slice_idx(grid = grid, subset_id = subset_id)
    return (slice_idx[0][0], slice_idx[1][0])
Returns the X and Y offsets of an EASE-Grid 2.0 subset, so that an output raster array can be aligned properly; see pyl4c.spatial.array_to_raster().
Parameters
grid : str
- The EASE-Grid 2.0 designation: M01, M09, etc.
subset_id : str
- Keyword designating the desired subset and its corresponding bounding box
Returns
tuple
- The X- and Y-offset (xoff, yoff)
def get_pft_array(grid, subset_id=None, subset_bbox=None, slice_idx=None)
Source code:
def get_pft_array(grid, subset_id = None, subset_bbox = None, slice_idx = None):
    '''
    Returns an array of PFT codes on the given EASE-Grid 2.0 for the
    entire globe or a specified subset.
    '''
    assert grid in EASE2_GRID_PARAMS.keys(),\
        'Could not understand grid argument; must be one of: %s' % ', '.join(EASE2_GRID_PARAMS.keys())
    msg = 'Expected only one of the following arguments: subset_id, subset_bbox, slice_idx'
    assert not all((subset_id is not None, subset_bbox is not None)), msg
    assert not all((subset_bbox is not None, slice_idx is not None)), msg
    m = int(grid[-1]) # Pop off the grid size in km
    anc_hdf_path = ANCILLARY_DATA_PATHS['smap_l4c_ancillary_data_file_path']
    field = ANCILLARY_DATA_PATHS['smap_l4c_%dkm_ancillary_data_lc_path' % m]
    xf = ANCILLARY_DATA_PATHS['smap_l4c_%dkm_ancillary_data_x_coord_path' % m]
    yf = ANCILLARY_DATA_PATHS['smap_l4c_%dkm_ancillary_data_y_coord_path' % m]
    with h5py.File(anc_hdf_path, 'r') as hdf:
        if subset_id is not None:
            pft_array, _, _ = subset(
                hdf, field, hdf[xf][0,:], hdf[yf][:,0], subset_id = subset_id)
            return pft_array
        elif subset_bbox is not None:
            pft_array, _, _ = subset(
                hdf, field, hdf[xf][0,:], hdf[yf][:,0], subset_bbox = subset_bbox)
            return pft_array
        elif slice_idx is not None:
            y0, y1 = slice_idx[0]
            x0, x1 = slice_idx[1]
            return hdf[field][y0:y1,x0:x1]
        else:
            return hdf[field][:]
Returns an array of PFT codes on the given EASE-Grid 2.0 for the entire globe or a specified subset.
Arguments
grid : str
- The EASE-Grid 2.0 designation: M01, M09, etc.
subset_id : str
- (Optional) Instead of subset_bbox, can provide keyword designating the desired subset area
subset_bbox : tuple
- (Optional) Instead of subset_id, can provide any arbitrary bounding box, i.e., sequence of coordinates: <xmin>, <ymin>, <xmax>, <ymax>; these should be in WGS84 (longitude-latitude) coordinates
slice_idx : list
- (Optional) Sequence of row-column slice indices to use, for subsetting the array; provide as: [(row1, row2), (col1, col2)] where the array is to be subset as: array[row1:row2,col1:col2]
Returns
numpy.ndarray
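For example, a sketch reading the global 9-km PFT map and a bounding-box subset (the bounding box here is hypothetical; requires the ancillary data file):

    from pyl4c.utils import get_pft_array

    pft_global = get_pft_array('M09')
    pft_subset = get_pft_array(
        'M09', subset_bbox = (-124.0, 32.0, -114.0, 42.0))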
def get_slice_idx_by_bbox(x_coords, y_coords, subset_id=None, subset_bbox=None)
Source code:
def get_slice_idx_by_bbox(
        x_coords, y_coords, subset_id = None, subset_bbox = None):
    '''
    Returns slice indices for the desired bounding box, given 1D
    coordinate arrays.
    '''
    if subset_id is not None and subset_bbox is not None:
        raise ValueError('Should provide only one argument: Either subset_id or subset_bbox')
    assert x_coords.ndim == 1 and y_coords.ndim == 1,\
        'Must provide 1D coordinate arrays only'
    bb = subset_bbox
    if subset_id is not None:
        bb = SUBSETS_BBOX[subset_id]
    # Get min, max indices of X coords within bbox; note that we +1 because
    #   ending slice indices are non-inclusive in Python
    if x_coords[-1] > x_coords[0]:
        # If X-coordinates are sorted smallest to largest...
        x_slice_idx = [
            np.where(x_coords >= bb[0])[0].min(),
            np.where(x_coords <= bb[2])[0].max() + 1
        ]
    else:
        x_slice_idx = [
            np.where(x_coords <= bb[0])[0].min(),
            np.where(x_coords >= bb[2])[0].max() + 1
        ]
    # Get min, max indices of Y coords within bbox
    if y_coords[-1] > y_coords[0]:
        # If Y-coordinates are sorted smallest to largest...
        y_slice_idx = [
            np.where(y_coords <= bb[3])[0].max() + 1,
            np.where(y_coords >= bb[1])[0].min()
        ]
    else:
        y_slice_idx = [
            np.where(y_coords <= bb[3])[0].min(),
            np.where(y_coords >= bb[1])[0].max() + 1
        ]
    x_slice_idx.sort() # Necessary b/c slicing is always small:large number
    y_slice_idx.sort()
    return (tuple(x_slice_idx), tuple(y_slice_idx))
Returns a tuple ((xmin, xmax), (ymin, ymax)) of the indices that can be used to slice a corresponding EASE2 grid in order to extract the desired bounding box. The returned indices correspond to the approximate locations of the bounding box (defined in geographic space) in array index space; i.e., if array is the EASE2 grid of interest, then the following slice will extract an area within the bounding box (bbox): array[ymin:ymax, xmin:xmax]
NOTE: Coordinate arrays (x_coords, y_coords) must be given in the same units as the BBOX definition in SUBSETS_BBOX, i.e., decimal degrees.
Parameters
x_coords : numpy.ndarray
- A 1D array of X coordinate values
y_coords : numpy.ndarray
- A 1D array of Y coordinate values
subset_id : str
- (Optional) Instead of subset_bbox, can provide keyword designating the desired subset area
subset_bbox : tuple
- (Optional) Instead of subset_id, can provide any arbitrary bounding box, i.e., sequence of coordinates: <xmin>, <ymin>, <xmax>, <ymax>; these should be in WGS84 (longitude-latitude) coordinates
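For example, a minimal sketch with synthetic coordinate arrays in decimal degrees (the bounding box is hypothetical):

    import numpy as np
    from pyl4c.utils import get_slice_idx_by_bbox

    x = np.linspace(-180, 180, 3856)   # Ascending longitudes
    y = np.linspace(90, -90, 1624)     # Descending latitudes
    x_idx, y_idx = get_slice_idx_by_bbox(
        x, y, subset_bbox = (-125.0, 24.0, -66.0, 50.0))
    # Slice a corresponding grid as: grid[y_idx[0]:y_idx[1], x_idx[0]:x_idx[1]]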
def get_xy_coords(hdf_or_nc, in_1d=True)
Source code:
def get_xy_coords(hdf_or_nc, in_1d = True):
    '''
    Returns a tuple (longitude, latitude) of coordinate arrays, inferred
    from the file contents.
    '''
    try:
        # HDF5 files
        if hasattr(hdf_or_nc, 'keys') and hasattr(hdf_or_nc, 'filename'):
            if 'GPP' in hdf_or_nc.keys() and 'NEE' in hdf_or_nc.keys():
                d = HDF_PATHS['SPL4CMDL']['4'] # NOTE: We assume Version 4
            elif 'Geophysical_Data' in hdf_or_nc.keys():
                if not 'sm_surface' in hdf_or_nc['Geophysical_Data'].keys():
                    raise ValueError()
                d = HDF_PATHS['SPL4SMGP']['4']
            else:
                raise ValueError()
            x = hdf_or_nc[d['longitude']]
            y = hdf_or_nc[d['latitude']]
        # NetCDF files
        else:
            assert hasattr(hdf_or_nc, 'variables'),\
                'Assumed NetCDF file has no variables'
            if 'lon' in hdf_or_nc.variables and 'lat' in hdf_or_nc.variables:
                x = hdf_or_nc.variables['lon']
                y = hdf_or_nc.variables['lat']
    except ValueError:
        # Let's assume we know what the X-Y coordinate array keys are
        if 'cell_lon' in hdf_or_nc.keys() and 'cell_lat' in hdf_or_nc.keys():
            x = hdf_or_nc['cell_lon']
            y = hdf_or_nc['cell_lat']
        else:
            raise NotImplementedError('The filename was not recognized as a product with known longitude-latitude data')
    x = x[0,:] if in_1d and len(x.shape) > 1 else x[:]
    y = y[:,0] if in_1d and len(y.shape) > 1 else y[:]
    return (x, y)
Returns a tuple (longitude, latitude) where the elements are coordinate arrays of longitude and latitude. These are needed for, e.g., plotting the geophysical data on a global geographic grid. This is a convenience function for extracting the longitude-latitude coordinates based on the filename and our knowledge of where these data are stored.
NOTE: This may seem like a hack, but it is the easiest solution to the fundamental problem of inconsistent variable paths; inconsistent naming between dataset IDs and filenames; and incomplete documentation of each within the HDF5 file.
Parameters
hdf_or_nc : h5py.File or netCDF4.Dataset
- Either: an HDF5 file / h5py.File object OR a NetCDF file
in_1d : bool
- True (Default) to return 1D arrays; if False, returns 2D arrays where the coordinates are duplicated along one axis
def index(array, indices)
Source code:
def index(array, indices):
    '''
    Fast array indexing at multiple (row, column) locations.
    '''
    assert len(indices) == 2,\
        'Must provide 2-element sequence of (row coordinates, column coordinates)'
    assert len(indices[0]) == len(indices[1]),\
        'Array indices must be the same length!'
    assert array.ndim <= 2, 'Array must have 2 or fewer axes'
    midx = np.ravel_multi_index(indices, array.shape)
    return array.ravel()[midx]
Fast array indexing by subsetting the array first. If there are multiple indices, anywhere in the array, that need to be returned, this provides a speed-up by creating a smaller array to index.
Parameters
array : numpy.ndarray
- Array containing data at (row, column) indices
indices : tuple or list
- 2-element sequence of (row, column) indices, in order
Returns
list
- List of indexed values
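For example, a minimal sketch pulling three values from a small array:

    import numpy as np
    from pyl4c.utils import index

    arr = np.arange(12).reshape((3, 4))
    rows, cols = [0, 2, 1], [1, 3, 0]
    index(arr, (rows, cols))  # -> array([ 1, 11,  4])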
def partition(array, num_processes, axis=0)
Source code:
def partition(array, num_processes, axis = 0):
    '''
    Creates index ranges for partitioning an array over multiple processes.
    '''
    N = array.shape[axis]
    return list(partition_generator(N, num_processes))
Creates index ranges for partitioning an array to work on over multiple processes.
Parameters
array : numpy.ndarray
- The 2-dimensional array to partition
num_processes : int
- The number of processes desired
axis : int
- The axis to break apart into chunks
Returns
list
def partition_generator(n_elements, n_parts=1)
Source code:
def partition_generator(n_elements, n_parts = 1):
    '''
    Yields slice indices for equal-sized partitions of a 1D array or sequence.
    '''
    p = 0
    parts = np.linspace(0, n_elements, n_parts + 1, dtype = int)
    while p < n_parts:
        yield (
            int(parts[p]),
            # If it is the final index, add 1
            int(parts[p+1] if (p != n_parts - 1) else (parts[p+1] + 1))
        )
        p += 1
A Generator that yields slice indices for equal-sized partitions of a 1D array or sequence. See also: partition().
Parameters
n_elements : int
- The number of elements in the array/sequence
n_parts : int
- The number of partitions desired
Returns
generator
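For example, a minimal sketch partitioning 10 elements across 3 workers; note the final stop index is padded by one so the last slice always covers the end of the sequence:

    from pyl4c.utils import partition_generator

    list(partition_generator(10, 3))  # -> [(0, 3), (3, 6), (6, 11)]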
def sample(array, indices)
Source code:
def sample(array, indices):
    '''
    Samples the value in an array at each (row, column) position in a
    sequence of row-column index pairs.
    '''
    indices_by_axis = [int(i[0]) for i in indices], [int(i[1]) for i in indices]
    # If row, column indices are not sorted, we'll need another list comprehension
    try:
        return array[indices_by_axis]
    except TypeError:
        return [array[idx[0], idx[1]] for idx in indices]
Samples the value in an array at each (row, column) position in a sequence of row-column index pairs.
Parameters
array : numpy.ndarray
indices : tuple
- Tuple of (x, y) coordinate pairs; no z-level should be provided (2D coordinate pairs only)
Returns
numpy.ndarray
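For example, a minimal sketch sampling three (row, column) positions:

    import numpy as np
    from pyl4c.utils import sample

    arr = np.arange(12).reshape((3, 4))
    sample(arr, ((0, 1), (2, 3), (1, 0)))  # -> array([ 1, 11,  4])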
def subset(hdf_or_nc, field, x_coords=None, y_coords=None, subset_id=None, subset_bbox=None)
Source code:
def subset(
        hdf_or_nc, field, x_coords = None, y_coords = None, subset_id = None,
        subset_bbox = None):
    '''
    Returns a subset array from the HDF5 or NetCDF file for the desired
    variable, corresponding to an area defined by a known bounding box.
    '''
    assert (subset_id is None and subset_bbox is not None) or (subset_id is not None and subset_bbox is None),\
        'Should provide only one argument: Either subset_id or subset_bbox'
    assert isinstance(hdf_or_nc, h5py.File) or hasattr(hdf_or_nc, 'variables'),\
        'An HDF5 or NetCDF file is required; cannot subset a stand-alone array'
    assert (x_coords is None and y_coords is None) or (isinstance(x_coords, str) and isinstance(y_coords, str)) or (isinstance(x_coords, np.ndarray) and isinstance(y_coords, np.ndarray)),\
        'The x_coords and y_coords arguments must have matching type'
    assert (not isinstance(x_coords, np.ndarray)) or (x_coords.ndim == 1),\
        'The x_coords and y_coords arguments must be 1D arrays, otherwise pass a String or None'
    # Check that we have a 2D array to work with
    if hasattr(hdf_or_nc, 'variables'):
        shp = hdf_or_nc.variables[field].shape
    else:
        assert field in hdf_or_nc.keys(), 'Field name "%s" not found' % field
        shp = hdf_or_nc[field].shape
    assert len(shp) == 2,\
        'HDF5 or NetCDF data array indexed by "%s" must be a 2D array' % field
    # If a hierarchical path to the X and Y coordinate variables was not
    #   given, then infer the paths from the filename
    if x_coords is None:
        x_coords, y_coords = get_xy_coords(hdf_or_nc, in_1d = True)
    if isinstance(x_coords, str):
        # NOTE: Only need first row of X-coordinate array, first column of
        #   Y-coordinate array, because they are duplicated thereafter
        if hasattr(hdf_or_nc, 'variables'):
            x_coords = hdf_or_nc.variables[x_coords][0,:] # NetCDF
            y_coords = hdf_or_nc.variables[y_coords][0,:]
        else:
            x_coords = hdf_or_nc[x_coords][0,:] # HDF5
            y_coords = hdf_or_nc[y_coords][:,0]
    x_idx, y_idx = get_slice_idx_by_bbox(
        x_coords, y_coords, subset_id, subset_bbox)
    xmin, xmax = x_idx # Unpack the slice range indexes
    ymin, ymax = y_idx
    if hasattr(hdf_or_nc, 'variables'):
        return (hdf_or_nc.variables[field][ymin:ymax, xmin:xmax], xmin, ymin)
    return (hdf_or_nc[field][ymin:ymax, xmin:xmax], xmin, ymin)
Returns a subset array from the HDF, for the desired variable, where the array corresponds to an area defined by a known bounding box, e.g., the continental United States (CONUS).
NOTE: x_coords and y_coords (hierarchical paths) will be inferred from the filename if not provided at all.
Parameters
hdf_or_nc : h5py.File or netCDF4.Dataset
- Either: an HDF5 file / h5py.File object OR a NetCDF file
field : str
- Hierarchical path to the desired variable
x_coords : numpy.ndarray or str
- (Optional) 1D NumPy array of X coordinates OR hierarchical path to the variable representing X coordinates, e.g., longitude values
y_coords : numpy.ndarray or str
- (Optional) 1D NumPy array of Y coordinates OR hierarchical path to the variable representing Y coordinates, e.g., latitude values
subset_id : str
- (Optional) Instead of subset_bbox, can provide keyword designating the desired subset area
subset_bbox : tuple or list
- (Optional) Instead of subset_id, can provide any arbitrary bounding box, i.e., sequence of coordinates: <xmin>, <ymin>, <xmax>, <ymax>; these should be in WGS84 (longitude-latitude) coordinates
Returns
tuple
- Tuple of: subset array, xoff, yoff; (numpy.ndarray, Int, Int)
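For example, a sketch subsetting a hypothetical SMAP L4C granule to CONUS, letting the coordinate paths be inferred from the file contents:

    import h5py
    from pyl4c.utils import subset

    with h5py.File('SMAP_L4_C_mdl_granule.h5', 'r') as hdf:  # Hypothetical file
        gpp, xoff, yoff = subset(hdf, 'GPP/gpp_mean', subset_id = 'CONUS')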
def summarize(data_array, summaries, scale=1, data_mask=None, nodata=-9999)
Source code:
def summarize(
        data_array, summaries, scale = 1, data_mask = None, nodata = -9999):
    '''
    Calculates statistical summar[ies] of an input data array over all values.
    '''
    assert data_array.ndim <= 2 or (data_array.ndim == 3 and data_array.shape[0] == 1),\
        'Can only work with 1-band raster arrays'
    if data_array.ndim == 3:
        data_array = data_array[0,...] # Unwrap 1-band raster arrays
    # Fill in NaN where there is NoData
    data_array = np.where(np.isin(data_array, nodata), np.nan, data_array)
    stats = dict([(k, None) for k in summaries.keys()])
    for stat_name, func in summaries.items():
        # NOTE: Runs faster if dtype of accumulator is *not* set
        stats[stat_name] = func(np.multiply(data_array, scale))
    return stats
Calculates statistical summar[ies] of an input data array over all values.
Parameters
data_array : numpy.ndarray
- Array of data values that we wish to summarize, e.g., gridded land surface temperatures
summaries : dict
- A Dictionary of {name: function} where function is some vectorized function that acts over the values associated with a class label, including NaNs, and returns a single number
scale : int or float
- Optional scaling parameter to apply to the input array values, e.g., if the array values are (spatial) rates and should be scaled by the (equal) area of the grid cell
nodata : int or float
- NoData or Fill value(s) to ignore; can pass a sequence of multiple values (Default: -9999)
Returns
dict
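For example, a minimal sketch with a NoData cell and two summary functions:

    import numpy as np
    from pyl4c.utils import summarize

    data = np.array([[1.0, 2.0], [-9999.0, 4.0]])
    summarize(data, {'mean': np.nanmean, 'max': np.nanmax})
    # -> {'mean': 2.33..., 'max': 4.0}; the -9999 cell is treated as NaN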
def summarize_by_class(data_array, class_array, summaries, scale=1, ignore=(0,), nodata=-9999)
Source code:
def summarize_by_class(
        data_array, class_array, summaries, scale = 1, ignore = (0,),
        nodata = -9999):
    '''
    Calculates statistical summar[ies] of an input data array for each
    class label in an input class array.
    '''
    assert data_array.ndim <= 2 or (data_array.ndim == 3 and data_array.shape[0] == 1),\
        'Can only work with 1-band raster arrays'
    if data_array.ndim == 3:
        data_array = data_array[0,...] # Unwrap 1-band raster arrays
    assert data_array.shape == class_array.shape,\
        'Input data_array does not match shape of the class_array'
    # Fill in NaN where there is NoData
    data_array = np.where(np.isin(data_array, nodata), np.nan, data_array)
    # In case None was passed to "ignore," replace with empty list
    ignore = ignore if ignore is not None else []
    assert not np.any(np.isnan(class_array)),\
        'Class array should not contain NaNs; use "ignore" argument instead'
    classes = set(np.unique(class_array[~np.isnan(class_array)]))
    # Create, e.g., {1: {}, 2: {}, ...}
    stats = dict([(k, dict()) for k in classes.difference(ignore)])
    for code in classes.difference(ignore):
        query = np.where(np.isin(class_array, code), data_array, np.nan)
        stats[code] = summarize(query, summaries, scale, nodata = nodata)
    return stats
Calculates statistical summar[ies] of an input data array for each class label in an input class array.
Parameters
data_array : numpy.ndarray
- Array of data values that we wish to summarize, e.g., gridded land surface temperatures
class_array : numpy.ndarray
- Array of class labels that will be used to summarize input data, e.g., watersheds or ecoregions; should NOT contain any NaNs as this will cause the function to hang
summaries : dict
- A Dictionary of {name: function} where function is some vectorized function that acts over the values associated with a class label, including NaNs, and returns a single number
scale : int or float
- Optional scaling parameter to apply to the input array values, e.g., if the array values are (spatial) rates and should be scaled by the (equal) area of the grid cell
ignore : tuple
- Class labels (values in the class_array) to ignore
nodata : int or float
- NoData or Fill value to ignore (Default: -9999)
Returns
dict
- A nested Python dictionary with a key-value pair for each class, where the value is another dictionary with a key-value pair for each summary statistic.
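For example, a minimal sketch with two class labels (label 0 is ignored by default):

    import numpy as np
    from pyl4c.utils import summarize_by_class

    data = np.array([[1.0, 2.0], [3.0, -9999.0]])
    classes = np.array([[1, 1], [2, 0]])
    summarize_by_class(data, classes, {'mean': np.nanmean})
    # -> {1: {'mean': 1.5}, 2: {'mean': 3.0}}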
def summarize_hdf_by_pft(hdf, field, summaries, scale=1, pft_codes=range(1, 9), subset_id='CONUS', x_coords=None, y_coords=None, nodata=-9999)
Source code:
def summarize_hdf_by_pft(
        hdf, field, summaries, scale = 1, pft_codes = range(1, 9),
        subset_id = 'CONUS', x_coords = None, y_coords = None, nodata = -9999):
    '''
    Calculates statistical summar[ies] of an input data array for each
    Plant Functional Type (PFT) class.
    '''
    assert field in ('SOC', 'NEE', 'GPP', 'RH'),\
        'Only possible to summarize one of: "SOC", "NEE", "GPP", or "RH"'
    assert (subset_id in SUBSETS_BBOX.keys()) or (subset_id is None),\
        'Named subset_id not a recognized geographic subset; see SUBSETS_BBOX in this module'
    # Warn user against using nansum() or sum() functions
    if any([k.rfind('sum') > 0 for k in summaries.keys()]):
        print('WARNING: Totals or sums may be biased low because subgrid heterogeneity is not recognized; use total_hdf_by_pft() instead')
    field_names = [
        '%s/%s_pft%d_mean' % (field, field.lower(), p) for p in pft_codes
    ]
    stats = dict([(p, dict()) for p in field_names])
    for name in field_names:
        if subset_id is not None:
            array, x, y = subset(
                hdf, name, x_coords, y_coords, subset_id)
        else:
            array = hdf[name][:]
        stats[name] = summarize(array, summaries, scale, nodata = nodata)
    return stats
Calculates statistical summar[ies] of an input data array for each Plant Functional Type (PFT) class.
Parameters
hdf : h5py.File
- The HDF5 file / h5py.File object
field : str
- One of: "SOC", "NEE", "GPP", or "RH"
summaries : dict
- A Dictionary of {name: function} where function is some vectorized function that acts over the values associated with a class label, including NaNs, and returns a single number
scale : int or float
- Optional scaling parameter to apply to the input array values, e.g., if the array values are (spatial) rates and should be scaled by the (equal) area of the grid cell
pft_codes
- The PFT codes to create summaries for; defaults to all of them (i.e., Default: 1-8 inclusive)
subset_id
- Keyword designating the desired subset area (Default: CONUS)
x_coords
- (Optional) A 1D NumPy array of X coordinate values, used in subsetting and can be automatically discovered
y_coords
- (Optional) A 1D NumPy array of Y coordinate values, used in subsetting and can be automatically discovered
nodata
- NoData or Fill value to ignore (Default: -9999)
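For example, a sketch summarizing GPP by PFT over CONUS for a hypothetical granule:

    import h5py
    import numpy as np
    from pyl4c.utils import summarize_hdf_by_pft

    with h5py.File('SMAP_L4_C_mdl_granule.h5', 'r') as hdf:  # Hypothetical file
        stats = summarize_hdf_by_pft(hdf, 'GPP', {'mean': np.nanmean})
    # stats is keyed by field name, e.g., stats['GPP/gpp_pft1_mean']['mean']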
def total_hdf_by_pft(hdf, counts, field, scale=1, pft_codes=range(1, 9), max_count=81, subset_id='CONUS', x_coords=None, y_coords=None, nodata=-9999)
Source code:
def total_hdf_by_pft(
        hdf, counts, field, scale = 1, pft_codes = range(1, 9),
        max_count = 81, subset_id = 'CONUS', x_coords = None,
        y_coords = None, nodata = -9999):
    '''
    Calculates area-weighted totals of a PFT mean field, scaling each 9-km
    PFT mean by the proportion of matching 1-km subgrid cells.
    '''
    assert len(counts) == len(pft_codes),\
        'Length of counts and pft_codes must be equal'
    assert field in ('SOC', 'NEE', 'GPP', 'RH'),\
        'Only possible to summarize one of: "SOC", "NEE", "GPP", or "RH"'
    assert (subset_id in SUBSETS_BBOX.keys()) or (subset_id is None),\
        'Named subset_id not a recognized geographic subset; see SUBSETS_BBOX in this module'
    summaries = {'nansum': np.nansum}
    stats = dict()
    for i, pft in enumerate(pft_codes):
        # Get the name of the target field; set up summary stats dict()
        fieldname = '%s/%s_pft%d_mean' % (field, field.lower(), pft)
        stats[fieldname] = dict()
        if subset_id is not None:
            array, _, _ = subset(
                hdf, fieldname, x_coords, y_coords, subset_id)
        else:
            array = hdf[fieldname][:]
        assert array.shape == counts[i].shape,\
            'Counts array must be the same size as data array'
        # Scale the data array by the proportion of the subgrid cells that
        #   match the given PFT class
        stats[fieldname] = summarize(
            np.multiply( # NOTE: We scale data array ahead of time...
                np.divide(counts[i], max_count),
                np.multiply(array, scale)), # ...So scale must be fixed at 1
            summaries, scale = 1, nodata = nodata)
    return stats
Calculates statistical summar[ies] of an input data array for each Plant Functional Type (PFT) class. The problem with calculating totals for PFT means is that we need to know the area proportion of a given PFT within each 9-km cell in order to correctly sum over that area. This function correctly scales the 9-km PFT mean value by the provided scale argument and then scales this value by the proportion of 1-km subgrid cells that match the given PFT. This produces accurate sums of spatial rates; i.e., the scale parameter converts a PFT mean to a (biased) total, then the counts are used to scale that total by the area proportion of the given PFT. The summary function used is np.nansum.
Parameters
hdf : h5py.File
- The HDF5 file / h5py.File object
counts : tuple or list
- A sequence of NumPy arrays, each array corresponding to the count of 1-km subcells matching a certain PFT; these should be provided in order, i.e., first element is the count of PFT 1 subcells, second element for PFT 2, and so on.
field : str
- One of: "SOC", "NEE", "GPP", or "RH"
scale : int or float
- Optional scaling parameter to apply to the input array values, e.g., if the array values are (spatial) rates and should be scaled by the (equal) area of the grid cell
pft_codes : tuple or list
- The PFT codes to create summaries for; defaults to all of them (i.e., Default: 1-8 inclusive)
max_count : int
- The maximum number of 1-km subcells that any given PFT can total within a 9-km area; no reason why this shouldn't be 81 (9 x 9), probably.
subset_id : str
- (Optional) Instead of subset_bbox, can provide keyword designating the desired subset area
x_coords : numpy.ndarray or str
- (Optional) 1D NumPy array of X coordinates OR hierarchical path to the variable representing X coordinates, e.g., longitude values
y_coords : numpy.ndarray or str
- (Optional) 1D NumPy array of Y coordinates OR hierarchical path to the variable representing Y coordinates, e.g., latitude values
nodata : int or float
- NoData or Fill value to ignore (Default: -9999)
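For example, a sketch totaling NEE over CONUS; the counts arrays here are placeholders, whereas in practice they would be derived from the 1-km ancillary PFT map:

    import h5py
    import numpy as np
    from pyl4c.utils import subset, total_hdf_by_pft

    with h5py.File('SMAP_L4_C_mdl_granule.h5', 'r') as hdf:  # Hypothetical file
        shp = subset(hdf, 'NEE/nee_pft1_mean', subset_id = 'CONUS')[0].shape
        counts = [np.full(shp, 10) for _ in range(1, 9)]  # Placeholder counts
        totals = total_hdf_by_pft(hdf, counts, 'NEE', scale = 1)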
Classes
class MockL4CGranule (file=None, grid='M09', mode='w', coords=None, data=None, pft_mean_fields=[])
Source code:
class MockL4CGranule(h5py.File):
    '''
    A mock for a typical L4C Ops granule, with all the right data fields at
    the right sizes, read in as an h5py.File instance.
    '''
    def __init__(
            self, file = None, grid = 'M09', mode = 'w', coords = None,
            data = None, pft_mean_fields = list()):
        # Although file = io.BytesIO() in the function signature, above,
        #   should accomplish the same thing, it generates an OSError that
        #   cannot be reproduced in interactive mode unless the "file"
        #   is explicitly set here and io.BytesIO() is *not* invoked above
        if file is None:
            file = io.BytesIO()
        super(MockL4CGranule, self).__init__(file, mode)
        # Create a field for each L4C variable
        shp = EASE2_GRID_PARAMS[grid]['shape']
        for field in ('NEE', 'GPP', 'RH', 'SOC'):
            key = '%s/%s_mean' % (field, field.lower()) # e.g., "SOC/soc_mean"
            if data is None:
                init_data = np.ones(shp) * np.nan
            else:
                # Generate a NaN-valued array if no data provided
                init_data = data.get(key, np.ones(shp) * np.nan)
            # Create, e.g., "SOC/soc_mean" field
            self.create_dataset(key, shp, dtype = 'float32', data = init_data)
            # Optionally, create, e.g., "SOC/soc_pft1_mean" field
            if field in pft_mean_fields:
                for pft in range(1, 9):
                    key = '%s/%s_pft%d_mean' % (field, field.lower(), pft)
                    init_data = data.get(key, np.ones(shp) * np.nan)
                    self.create_dataset(
                        key, shp, dtype = 'float32', data = init_data)
        if coords is None:
            x_coords, y_coords = ease2_coords(grid = 'M09', in_1d = False)
        else:
            x_coords, y_coords = coords
        assert x_coords.ndim == 2 and y_coords.ndim == 2,\
            'Must provide 2D coordinate arrays'
        self.create_dataset(
            'GEO/longitude', shp, dtype = 'float32', data = x_coords)
        self.create_dataset(
            'GEO/latitude', shp, dtype = 'float32', data = y_coords)
A mock for a typical L4C Ops granule. Should have all the right data fields at the right sizes for a typical L4C Ops granule read in as an h5py.File instance.
Parameters
file : file
- (Optional) The File object to write; defaults to an in-memory BytesIO instance
grid : str
- (Optional) The EASE-Grid 2.0 size: "M01" or "M09"
mode : str
- (Optional) The file access mode (Default: "w")
coords : tuple
- (Optional) 2D coordinate arrays to set on the granule's "GEO/longitude" and "GEO/latitude" fields
data : dict
- (Optional) If provided, should be a Dictionary with keys for each desired dataset (e.g., "NEE" or "SOC") and any correctly shaped 2D array as values
pft_mean_fields : tuple or list
- Sequence of field names (e.g., ['RH', 'GPP']) for which the PFT mean fields (e.g., RH/rh_pft1_mean) should be created
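For example, a sketch of building an in-memory granule for a test; the default coordinates require the ancillary EASE-Grid 2.0 data, which can be avoided by passing coords explicitly:

    import numpy as np
    from pyl4c.utils import MockL4CGranule

    hdf = MockL4CGranule(grid = 'M09')  # In-memory granule with NaN-filled fields
    hdf['GPP/gpp_mean'][:] = 1.0
    print(hdf['GPP/gpp_mean'].shape)    # The global M09 grid shape
    hdf.close()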
Create a new file object.
See the h5py user guide for a detailed explanation of the options.
name
- Name of the file on disk, or file-like object. Note: for files created with the 'core' driver, HDF5 still requires this be non-empty.
mode
- r: Readonly, file must exist (default)
- r+: Read/write, file must exist
- w: Create file, truncate if exists
- w- or x: Create file, fail if exists
- a: Read/write if exists, create otherwise
driver
- Name of the driver to use. Legal values are None (default, recommended), 'core', 'sec2', 'direct', 'stdio', 'mpio', 'ros3'.
libver
- Library version bounds. Supported values: 'earliest', 'v108', 'v110', 'v112' and 'latest'.
userblock_size
- Desired size of user block. Only allowed when creating a new file (mode w, w- or x).
swmr
- Open the file in SWMR read mode. Only used when mode = 'r'.
rdcc_nbytes
- Total size of the dataset chunk cache in bytes. The default size is 1024**2 (1 MiB) per dataset. Applies to all datasets unless individually changed.
rdcc_w0
- The chunk preemption policy for all datasets. This must be between 0 and 1 inclusive and indicates the weighting according to which chunks which have been fully read or written are penalized when determining which chunks to flush from cache. A value of 0 means fully read or written chunks are treated no differently than other chunks (the preemption is strictly LRU) while a value of 1 means fully read or written chunks are always preempted before other chunks. If your application only reads or writes data once, this can be safely set to 1. Otherwise, this should be set lower depending on how often you re-read or re-write the same data. The default value is 0.75. Applies to all datasets unless individually changed.
rdcc_nslots
- The number of chunk slots in the raw data chunk cache for this file. Increasing this value reduces the number of cache collisions, but slightly increases the memory used. Due to the hashing strategy, this value should ideally be a prime number. As a rule of thumb, this value should be at least 10 times the number of chunks that can fit in rdcc_nbytes bytes. For maximum performance, this value should be set approximately 100 times that number of chunks. The default value is 521. Applies to all datasets unless individually changed.
track_order
- Track dataset/group/attribute creation order under root group if True. If None use global default h5.get_config().track_order.
fs_strategy
- The file space handling strategy to be used. Only allowed when creating a new file (mode w, w- or x). Defined as: "fsm" (FSM, Aggregators, VFD); "page" (Paged FSM, VFD); "aggregate" (Aggregators, VFD); "none" (VFD). If None, use HDF5 defaults.
fs_page_size
- File space page size in bytes. Only used when fs_strategy="page". If None use the HDF5 default (4096 bytes).
fs_persist
- A boolean value to indicate whether free space should be persistent or not. Only allowed when creating a new file. The default value is False.
fs_threshold
- The smallest free-space section size that the free space manager will track. Only allowed when creating a new file. The default value is 1.
page_buf_size
- Page buffer size in bytes. Only allowed for HDF5 files created with fs_strategy="page". Must be a power of two value and greater or equal than the file space page size when creating the file. It is not used by default.
min_meta_keep
- Minimum percentage of metadata to keep in the page buffer before allowing pages containing metadata to be evicted. Applicable only if page_buf_size is set. Default value is zero.
min_raw_keep
- Minimum percentage of raw data to keep in the page buffer before allowing pages containing raw data to be evicted. Applicable only if page_buf_size is set. Default value is zero.
locking
- The file locking behavior. Defined as:
- False (or "false") – Disable file locking
- True (or "true") – Enable file locking
- "best-effort" – Enable file locking but ignore some errors
- None – Use HDF5 defaults
Warning
The HDF5_USE_FILE_LOCKING environment variable can override this parameter.
Only available with HDF5 >= 1.12.1 or 1.10.x >= 1.10.7.
alignment_threshold
- Together with alignment_interval, this property ensures that any file object greater than or equal in size to the alignment threshold (in bytes) will be aligned on an address which is a multiple of alignment interval.
alignment_interval
- This property should be used in conjunction with alignment_threshold. See the description above. For more details, see https://portal.hdfgroup.org/display/HDF5/H5P_SET_ALIGNMENT
meta_block_size
- Set the current minimum size, in bytes, of new metadata block allocations. See https://portal.hdfgroup.org/display/HDF5/H5P_SET_META_BLOCK_SIZE
Additional keywords
- Passed on to the selected file driver.
Ancestors
- h5py._hl.files.File
- h5py._hl.group.Group
- h5py._hl.base.HLObject
- h5py._hl.base.CommonStateObject
- h5py._hl.base.MutableMappingHDF5
- h5py._hl.base.MappingHDF5
- collections.abc.MutableMapping
- collections.abc.Mapping
- collections.abc.Collection
- collections.abc.Sized
- collections.abc.Iterable
- collections.abc.Container