Note
This page was generated from a Jupyter notebook.
DataRecord Tutorial¶
This tutorial illustrates how to record variables of a Landlab model using DataRecord.
What is DataRecord?¶
DataRecord is a data structure that can hold data variables relating to a Landlab model or to items living on the Landlab grid.
DataRecord is built on xarray's Dataset structure: a multi-dimensional, in-memory array database. A Dataset implements the mapping interface, with keys given by variable names and values given by DataArray objects for each variable. The underlying Dataset, and with it all of xarray's methods and attributes, is accessible through the DataRecord's dataset attribute, which we use throughout this tutorial.
A DataRecord can have one or both (or none) of the following dimensions:
- time: the simulated time in the model.
- item_id: an identifier of a generic item in the model.
Coordinates are one dimensional arrays used for label-based indexing.
The examples below illustrate different use cases for DataRecord.
We start by importing the necessary libraries:
[ ]:
import matplotlib.pyplot as plt
import numpy as np
import xarray as xr
from matplotlib.pyplot import figure, legend, plot, subplot, title, xlabel, ylabel
from landlab import RasterModelGrid, imshow_grid
from landlab.data_record import DataRecord
%matplotlib inline
Case 1. DataRecord with 1 dimension: time¶
Let's start with an example where we set DataRecord to have only time as a dimension. An example variable that varies over time and relates to the Landlab grid could be the mean elevation of the topographic surface. We will store this example variable in DataRecord.
We create a raster grid, add a field (at nodes) called topographic__elevation, and populate it with random values.
[ ]:
grid_1 = RasterModelGrid((10, 10), (1.0, 1.0))
z = np.random.rand(100)
_ = grid_1.add_field("topographic__elevation", z, at="node")
Print the current mean elevation.
[ ]:
current_mean = np.mean(grid_1.at_node["topographic__elevation"])
print(current_mean)
Now we will create a DataRecord that will hold the data variable mean_elevation relating to grid_1. The first value, at time=0, is the current mean elevation on the grid.
[ ]:
dr_1 = DataRecord(
grid_1,
time=[0.0],
items=None,
data_vars={"mean_elevation": (["time"], ([current_mean]))},
attrs={"mean_elevation": "y"},
)
The input arguments passed in this case are: the grid, time (as a 1-element list), a data variable dictionary, and an attributes dictionary. Note that items is not used here; we will see its use in the other cases below.
Note the format of the data_vars dictionary:
{'variable_name_1' : (['dimensions'], variable_data_1),
'variable_name_2' : (['dimensions'], variable_data_2),
...}
The attributes dictionary attrs can be used to store metadata about the variables; in this example, we use it to store the variable units.
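For instance, a data_vars dictionary holding two time-varying variables, each with its units recorded in attrs, could look like the following sketch (max_elevation is a hypothetical second variable, not used in the rest of this tutorial):
[ ]:
# A minimal sketch (hypothetical second variable "max_elevation"):
# two time-varying variables in one DataRecord, with units stored in attrs.
dr_sketch = DataRecord(
    grid_1,
    time=[0.0],
    items=None,
    data_vars={
        "mean_elevation": (["time"], [current_mean]),
        "max_elevation": (
            ["time"],
            [np.max(grid_1.at_node["topographic__elevation"])],
        ),
    },
    attrs={"mean_elevation": "m", "max_elevation": "m"},
)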
So far, our DataRecord dr_1 holds one variable, mean_elevation, with one record at time=0.
[ ]:
dr_1
We can visualise this data structure as a pandas dataframe:
[ ]:
dr_1.dataset.to_dataframe()
Now we will run a simple model where the grid surface is uplifted several times and the mean elevation is recorded at every time step. We use the method add_record to put the new value in the DataRecord dr_1:
[ ]:
total_time = 100
dt = 20
uplift_rate = 0.01 # m/y
for t in range(20, total_time, dt):
grid_1.at_node["topographic__elevation"] += uplift_rate * dt
dr_1.add_record(
time=[t],
new_record={
"mean_elevation": (
["time"],
([np.mean(grid_1.at_node["topographic__elevation"])]),
)
},
)
Let’s see what was recorded:
[ ]:
dr_1.dataset["mean_elevation"].values
The corresponding time coordinates are:
[ ]:
dr_1.dataset.time.values
Notice the different syntax used here:
- time is a dimension and can be accessed as dr_1.dataset.time (or dr_1.dataset['time']),
- whereas mean_elevation is a data variable and is accessed by name: dr_1.dataset['mean_elevation'].
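The difference is easiest to see side by side, using only calls already shown above:
[ ]:
# time is a dimension (coordinate): attribute and key access both work
print(dr_1.dataset.time.values)
print(dr_1.dataset["time"].values)
# mean_elevation is a data variable: access it by name
print(dr_1.dataset["mean_elevation"].values)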
DataRecord also has the handy property time_coordinates that returns these values as a list:
[ ]:
dr_1.time_coordinates
You can use the methods get_data and set_data to access and change the data:
[ ]:
dr_1.get_data(time=[20.0], data_variable="mean_elevation")
[ ]:
dr_1.set_data(time=[80.0], data_variable="mean_elevation", new_value=1.5)
dr_1.dataset["mean_elevation"]
Case 2. DataRecord with 1 dimension: item_id¶
An important feature of DataRecord is that it allows you to create items that live on grid elements, and variables describing them. For instance, we can create boulders and store information about their size and lithology.
To create items, we need to instantiate a DataRecord and pass it a dictionary describing where each item lives on the Landlab grid. The format of this dictionary is:
{'grid_element' : [grid_element],
'element_id' : [element_id]}
where:
- grid_element is a string or a number-of-items-long array of strings naming the grid element(s) on which the items live (e.g., node, link). Valid locations depend on the grid type (my_grid.groups gives the valid locations for your grid). If grid_element is provided as a single string, all items are assumed to live on the same type of grid element; the per-item array form is sketched below.
- element_id is an array of integers identifying the grid element ID on which each item resides. For each item, element_id must be less than the number of elements of that grid_element type that exist on the grid. For example, if the grid has 10 links, no item can live at link 10 or at link -3, because only links 0 to 9 exist.
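For illustration only, a dictionary using the per-item array form might look like this (the items and their positions are hypothetical and not used in the rest of the tutorial):
[ ]:
# A sketch with hypothetical items: grid_element is given per item, so the
# items can live on different types of grid elements.
mixed_items = {
    "grid_element": np.array(["node", "link", "node"]),
    "element_id": np.array([4, 10, 7]),
}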
[ ]:
grid_2 = RasterModelGrid((5, 5), (2, 2))
boulders = {"grid_element": "node", "element_id": np.array([6, 11, 12, 17, 12])}
initial_boulder_sizes = np.array([1, 1.5, 3, 1, 2])
boulder_lithologies = np.array(
["sandstone", "granite", "sandstone", "sandstone", "limestone"]
)
dr_2 = DataRecord(
grid_2,
time=None,
items=boulders,
data_vars={
"boulder_size": (["item_id"], initial_boulder_sizes),
"boulder_litho": (["item_id"], boulder_lithologies),
},
attrs={"boulder_size": "m"},
)
dr_2.dataset.to_dataframe()
Each item (in this case, each boulder) is designated by an item_id, and its position on the grid is described by a grid_element and an element_id.
We can use the method add_item to add new boulders to the record:
[ ]:
dr_2.add_item(
new_item={
"grid_element": np.array(["link", "node"]),
"element_id": np.array([24, 8]),
},
new_item_spec={"boulder_size": (["item_id"], np.array([1.2, 2.0]))},
)
dr_2.dataset.to_dataframe()
Notice that we did not specify the lithologies of the new boulders, so their recorded values are set to NaN. We can use the set_data method to record the boulder lithologies:
[ ]:
dr_2.set_data(
data_variable="boulder_litho", item_id=[5, 6], new_value=["sandstone", "granite"]
)
dr_2.dataset.to_dataframe()
We can use the method calc_aggregate_value to apply a function to a variable aggregated at grid elements. For example, we can calculate the mean size of the boulders on each node:
[ ]:
mean_size = dr_2.calc_aggregate_value(
func=xr.Dataset.mean, data_variable="boulder_size"
)
mean_size
Notice that boulder #5 is on a link so it is not taken into account in this calculation.
[ ]:
# replace nans with 0:
mean_size[np.isnan(mean_size)] = 0
# show unfiltered mean sizes on the grid:
imshow_grid(grid_2, mean_size)
Alternatively, we can filter by lithology and use only the 'sandstone' boulders in the calculation:
[ ]:
# define a filter array:
filter_litho = dr_2.dataset["boulder_litho"] == "sandstone"
# aggregate by node and apply xr.Dataset.mean to boulder_size
filtered_mean = dr_2.calc_aggregate_value(
func=xr.Dataset.mean,
data_variable="boulder_size",
at="node",
filter_array=filter_litho,
)
filtered_mean
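If we wanted to map this filtered result like the unfiltered one above, we could again replace the NaN values first (a sketch, assuming filtered_mean is a node-length array like mean_size):
[ ]:
# replace nans with 0 and show the sandstone-only mean sizes on the grid:
filtered_mean_map = np.where(np.isnan(filtered_mean), 0.0, filtered_mean)
imshow_grid(grid_2, filtered_mean_map)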
Case 3. DataRecord with 2 dimensions: item_id and time¶
We may want to record variables that have both dimensions time and item_id.
In the previous example, some variables that characterize the items (boulders), such as boulder_litho, do not vary with time, but it can be interesting to keep track of the change in boulder size through time. We will redefine the DataRecord such that the variable boulder_size varies among the items/boulders (identified by item_id) and through time, whereas the variable boulder_litho varies only among the items/boulders and does not vary through time.
[ ]:
grid_3 = RasterModelGrid((5, 5), (2, 2))
initial_boulder_sizes_3 = np.array([[10], [4], [8], [3], [5]])
# boulder_lithologies = np.array(['sandstone', 'granite', 'sandstone', 'sandstone', 'limestone']) #same as above, already run
boulders_3 = {
"grid_element": "node",
"element_id": np.array([[6], [11], [12], [17], [12]]),
}
dr_3 = DataRecord(
grid_3,
time=[0.0],
items=boulders_3,
data_vars={
"boulder_size": (["item_id", "time"], initial_boulder_sizes_3),
"boulder_litho": (["item_id"], boulder_lithologies),
},
attrs={"boulder_size": "m"},
)
dr_3
Note that the syntax to define initial_boulder_sizes_3 (as well as element_id) has changed: they are now number-of-items-by-1 arrays because they vary along both time and item_id (compared to boulder_lithologies, which is just number-of-items long, as it only varies along item_id).
[ ]:
boulder_lithologies.shape, initial_boulder_sizes.shape, initial_boulder_sizes_3.shape
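As an aside, a flat number-of-items-long array can be given the extra trailing dimension with numpy indexing, for example:
[ ]:
# add a trailing axis so the array becomes number-of-items-by-1:
initial_boulder_sizes[:, np.newaxis].shape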
Let's define a very simple erosion law for the boulders:
\[\frac{dD}{dt} = -k_{b} D\]
where \(D\) is the boulder diameter \([L]\) (this value represents the boulder_size variable), \(t\) is time, and \(k_{b}\) is the block erodibility \([T^{-1}]\).
We will now model boulder erosion and use DataRecord to store the boulder sizes through time.
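The loop below integrates this law with a simple explicit (forward Euler) update at each time step,
\[D_{t+\Delta t} = D_{t} - k_{b} D_{t} \Delta t,\]
which is exactly the expression used in the code.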
[ ]:
dt = 100
total_time = 100000
time_index = 1
for t in range(dt, total_time, dt):
# create a new time coordinate:
dr_3.add_record(time=np.array([t]))
# this propagates grid_element and element_id values forward in time (instead of the 'nan' default filling):
dr_3.ffill_grid_element_and_id()
for i in range(0, dr_3.number_of_items):
# value of block erodibility:
if dr_3.dataset["boulder_litho"].values[i] == "limestone":
k_b = 10**-5
elif dr_3.dataset["boulder_litho"].values[i] == "sandstone":
k_b = 3 * 10**-6
elif dr_3.dataset["boulder_litho"].values[i] == "granite":
k_b = 3 * 10**-7
else:
print("Unknown boulder lithology")
dr_3.dataset["boulder_size"].values[i, time_index] = (
dr_3.dataset["boulder_size"].values[i, time_index - 1]
- k_b * dr_3.dataset["boulder_size"].values[i, time_index - 1] * dt
)
time_index += 1
print("Done")
[ ]:
figure(figsize=(15, 8))
time = range(0, total_time, dt)
boulder_size = dr_3.dataset["boulder_size"].values
subplot(121)
plot(time, boulder_size[1], label="granite")
plot(time, boulder_size[3], label="sandstone")
plot(time, boulder_size[-1], label="limestone")
xlabel("Time (yr)")
ylabel("Boulder size (m)")
legend(loc="lower left")
title("Boulder erosion by lithology")
# normalized plots
subplot(122)
plot(time, boulder_size[1] / boulder_size[1, 0], label="granite")
plot(time, boulder_size[2] / boulder_size[2, 0], label="sandstone")
plot(time, boulder_size[-1] / boulder_size[-1, 0], label="limestone")
xlabel("Time (yr)")
ylabel("Boulder size normalized to size at t=0 (m)")
legend(loc="lower left")
title("Normalized boulder erosion by lithology")
plt.show()
Other properties provided by DataRecord¶
[ ]:
dr_3.variable_names
[ ]:
dr_3.number_of_items
[ ]:
dr_3.item_coordinates
[ ]:
dr_3.number_of_timesteps
[ ]:
dr_1.time_coordinates
[ ]:
dr_1.earliest_time
[ ]:
dr_1.latest_time
[ ]:
dr_1.prior_time
More on DataRecord¶
DataRecord is the data structure on which the following Landlab components are based:
- ClastTracker (coming soon)
- SpeciesEvolver (coming soon)