Merging observations

This notebook shows how observations and observation collections can be merged. Merging observations can be useful if:

  • you have data from multiple sources measuring at the same location

  • you get new measurements that you want to add to the old measurements.

Notebook contents

  1. Simple merge

  2. Merge options

  3. Merging observation collections

[1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from IPython.display import display

import hydropandas as hpd

hpd.util.get_color_logger("INFO");

Simple merge

[2]:
# observation 1
df = pd.DataFrame(
    {"measurements": np.random.randint(0, 10, 5)},
    index=pd.date_range("2020-1-1", "2020-1-5"),
)
o1 = hpd.Obs(df, name="obs", x=0, y=0)
display(o1)

hydropandas.Obs

obs
x 0
y 0
location
filename
source
unit

measurements
2020-01-01 6
2020-01-02 5
2020-01-03 0
2020-01-04 2
2020-01-05 8
[3]:
# observation 2
df = pd.DataFrame(
    {"measurements": np.random.randint(0, 10, 5)},
    index=pd.date_range("2020-1-6", "2020-1-10"),
)
o2 = hpd.Obs(df, name="obs", x=0, y=0)
display(o2)

hydropandas.Obs

obs
x 0
y 0
location
filename
source
unit

measurements
2020-01-06 0
2020-01-07 3
2020-01-08 4
2020-01-09 3
2020-01-10 5
[4]:
o_merged = o1.merge_observation(o2)
o_merged
INFO:hydropandas.observation._merge_timeseries:right observation has a different time series
INFO:hydropandas.observation._merge_timeseries:merge time series
INFO:hydropandas.observation.merge_metadata:left and right observation have the same metadata
[4]:

hydropandas.Obs

obs
x 0
y 0
location
filename
source
unit

measurements
2020-01-01 6
2020-01-02 5
2020-01-03 0
2020-01-04 2
2020-01-05 8
2020-01-06 0
2020-01-07 3
2020-01-08 4
2020-01-09 3
2020-01-10 5
[5]:
f, axes = plt.subplots(figsize=(9, 7), nrows=3, sharex=True, sharey=True)
o1["measurements"].plot(ax=axes[0], marker="o", label="observation 1").legend(loc=1)
o2["measurements"].plot(ax=axes[1], marker="o", label="observation 2").legend(loc=1)
o_merged["measurements"].plot(ax=axes[2], marker="o", label="merged").legend(loc=1)
[Figure: three stacked panels showing observation 1, observation 2, and the merged series]
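
When the two time series share no timestamps, as in the simple merge above, the time series part of the merge amounts to concatenating the measurements and sorting them by time. The following is a minimal pandas-only sketch of that idea with illustrative values; it is not the hydropandas implementation and it ignores metadata.

```python
import pandas as pd

# Two series on consecutive, non-overlapping date ranges (illustrative values).
s1 = pd.Series([6.0, 5.0, 0.0, 2.0, 8.0], index=pd.date_range("2020-01-01", "2020-01-05"))
s2 = pd.Series([0.0, 3.0, 4.0, 3.0, 5.0], index=pd.date_range("2020-01-06", "2020-01-10"))

# Without shared timestamps, merging reduces to concatenating and sorting by time.
merged = pd.concat([s1, s2]).sort_index()

print(merged)
```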

Merge options

Overlapping time series

[6]:
# create a partially overlapping dataframe
df = pd.DataFrame(
    {
        "measurements": np.concatenate(
            [o1["measurements"].values[-2:], np.random.randint(0, 10, 3)]
        )
    },
    index=pd.date_range("2020-1-4", "2020-1-8"),
)
o3 = hpd.Obs(df, name="obs", x=0, y=0)
display(o3)

hydropandas.Obs

obs
x 0
y 0
location
filename
source
unit

measurements
2020-01-04 2
2020-01-05 8
2020-01-06 9
2020-01-07 1
2020-01-08 6
[7]:
o_merged = o1.merge_observation(o3)
INFO:hydropandas.observation._merge_timeseries:right observation has a different time series
INFO:hydropandas.observation._merge_timeseries:merge time series
INFO:hydropandas.observation.merge_metadata:left and right observation have the same metadata
[8]:
f, axes = plt.subplots(figsize=(9, 7), nrows=3, sharex=True, sharey=True)
o1["measurements"].plot(ax=axes[0], marker="o", label="observation 1").legend(loc=1)
o3["measurements"].plot(ax=axes[1], marker="o", label="observation 3").legend(loc=1)
o_merged["measurements"].plot(ax=axes[2], marker="o", label="merged").legend(loc=1)
[Figure: three stacked panels showing observation 1, observation 3, and the merged series]
[9]:
# create a partially overlapping dataframe with different values
df = pd.DataFrame(
    {"measurements": np.random.randint(0, 10, 5)},
    index=pd.date_range("2020-1-4", "2020-1-8"),
)
o4 = hpd.Obs(df, name="obs", x=0, y=0)
display(o4)

hydropandas.Obs

obs
x 0
y 0
location
filename
source
unit

measurements
2020-01-04 0
2020-01-05 4
2020-01-06 5
2020-01-07 9
2020-01-08 1

By default, a ValueError is raised if the overlapping time steps of the two time series have different values.

[10]:
o1.merge_observation(o4)
INFO:hydropandas.observation._merge_timeseries:right observation has a different time series
INFO:hydropandas.observation._merge_timeseries:merge time series
WARNING:hydropandas.observation._merge_timeseries:timeseries of observation obs overlap with different values
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[10], line 1
----> 1 o1.merge_observation(o4)

File ~/checkouts/readthedocs.org/user_builds/hydropandas/envs/latest/lib/python3.12/site-packages/hydropandas/observation.py:682, in Obs.merge_observation(self, right, overlap, merge_metadata)
    676     raise TypeError(
    677         f"observation left has a different type {type(self)} than"
    678         f"observation right {type(right)}"
    679     )
    681 # merge timeseries
--> 682 o = self._merge_timeseries(right, overlap=overlap)
    684 # merge metadata
    685 if merge_metadata:

File ~/checkouts/readthedocs.org/user_builds/hydropandas/envs/latest/lib/python3.12/site-packages/hydropandas/observation.py:595, in Obs._merge_timeseries(self, right, overlap)
    590 logger.warning(
    591     f"timeseries of observation {right.name} overlap with "
    592     "different values"
    593 )
    594 if overlap == "error":
--> 595     raise ValueError(
    596         "observations have different values for same time steps"
    597     )
    598 elif overlap == "use_left":
    599     dup_o = self.loc[dup_ind_o.index, overlap_cols]

ValueError: observations have different values for same time steps
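
To see which time steps trigger such an error, you can compare the two series on their shared timestamps. This is a rough pandas-only sketch with illustrative values modelled on o1 and o4; the variable names are assumptions and the check is not taken from the hydropandas source.

```python
import pandas as pd

left = pd.Series([6.0, 5.0, 0.0, 2.0, 8.0], index=pd.date_range("2020-01-01", "2020-01-05"))
right = pd.Series([0.0, 4.0, 5.0, 9.0, 1.0], index=pd.date_range("2020-01-04", "2020-01-08"))

# Timestamps that occur in both series.
shared = left.index.intersection(right.index)

# Keep only the shared timestamps where the two values disagree.
differs = left.loc[shared].ne(right.loc[shared]).to_numpy()
conflicts = shared[differs]

print(conflicts)
```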

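With the overlap argument you can choose which observation takes precedence where the time series overlap: overlap="use_left" keeps the values of the left observation (o1 below) and overlap="use_right" keeps the values of the right observation (o4 below).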

[11]:
print("use left")
merged_left = o1.merge_observation(o4, overlap="use_left")
display(merged_left)  # overlapping values are taken from o1 (left)
print("use right")
merged_right = o1.merge_observation(o4, overlap="use_right")
display(merged_right)  # overlapping values are taken from o4 (right)
use left
INFO:hydropandas.observation._merge_timeseries:right observation has a different time series
INFO:hydropandas.observation._merge_timeseries:merge time series
WARNING:hydropandas.observation._merge_timeseries:timeseries of observation obs overlap with different values
INFO:hydropandas.observation.merge_metadata:left and right observation have the same metadata

hydropandas.Obs

obs
x 0
y 0
location
filename
source
unit

measurements
2020-01-01 6
2020-01-02 5
2020-01-03 0
2020-01-04 2
2020-01-05 8
2020-01-06 5
2020-01-07 9
2020-01-08 1
use right
INFO:hydropandas.observation._merge_timeseries:right observation has a different time series
INFO:hydropandas.observation._merge_timeseries:merge time series
WARNING:hydropandas.observation._merge_timeseries:timeseries of observation obs overlap with different values
INFO:hydropandas.observation.merge_metadata:left and right observation have the same metadata

hydropandas.Obs

obs
x 0
y 0
location
filename
source
unit

measurements
2020-01-01 6
2020-01-02 5
2020-01-03 0
2020-01-04 0
2020-01-05 4
2020-01-06 5
2020-01-07 9
2020-01-08 1
[12]:
f, axes = plt.subplots(figsize=(9, 4), nrows=2, sharex=True, sharey=True)
o1["measurements"].plot(ax=axes[0], marker="o", label="observation 1").legend(loc=2)
o4["measurements"].plot(ax=axes[0], marker="o", label="observation 4").legend(loc=2)
merged_left["measurements"].plot(ax=axes[1], marker="o", label="merged left").legend(
    loc=2
)
merged_right["measurements"].plot(ax=axes[1], marker=".", label="merged right").legend(
    loc=2
)
[Figure: two panels; the top panel shows observation 1 and observation 4, the bottom panel shows merged left and merged right]
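
The three overlap options can be mimicked for plain pandas Series. The helper `merge_series` below is hypothetical and only reuses the option names from this notebook; it is a sketch of the behaviour shown above, not the code hydropandas runs.

```python
import pandas as pd


def merge_series(left, right, overlap="error"):
    """Merge two Series; `overlap` reuses the option names from this notebook."""
    if overlap not in ("error", "use_left", "use_right"):
        raise ValueError(f"unknown overlap option: {overlap}")
    shared = left.index.intersection(right.index)
    conflicting = left.loc[shared].ne(right.loc[shared]).any()
    if conflicting and overlap == "error":
        raise ValueError("different values for same time steps")
    if overlap == "use_right":
        # Values of `right` win; `left` only fills the remaining time steps.
        return right.combine_first(left)
    # Values of `left` win; `right` only fills the remaining time steps.
    return left.combine_first(right)


left = pd.Series([6.0, 5.0, 0.0, 2.0, 8.0], index=pd.date_range("2020-01-01", "2020-01-05"))
right = pd.Series([0.0, 4.0, 5.0, 9.0, 1.0], index=pd.date_range("2020-01-04", "2020-01-08"))

merged_left = merge_series(left, right, overlap="use_left")
merged_right = merge_series(left, right, overlap="use_right")
```

On the overlapping days (2020-01-04 and 2020-01-05), `merged_left` holds the values of `left` and `merged_right` holds the values of `right`; all other days are identical in both results.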

Metadata

The merge_observation method checks by default if the metadata of the two observations is the same.

[13]:
# observation 5 (same period as observation 2, but with a different name)
df = pd.DataFrame(
    {"measurements": np.random.randint(0, 10, 5)},
    index=pd.date_range("2020-1-6", "2020-1-10"),
)
o5 = hpd.Obs(df, name="obs5", x=0, y=0)
o5
[13]:

hydropandas.Obs

obs5
x 0
y 0
location
filename
source
unit

measurements
2020-01-06 2
2020-01-07 7
2020-01-08 3
2020-01-09 2
2020-01-10 6

When the metadata differs, a ValueError is raised. Here the names differ (obs versus obs5).

[14]:
o1.merge_observation(o5)
INFO:hydropandas.observation._merge_timeseries:right observation has a different time series
INFO:hydropandas.observation._merge_timeseries:merge time series
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[14], line 1
----> 1 o1.merge_observation(o5)

File ~/checkouts/readthedocs.org/user_builds/hydropandas/envs/latest/lib/python3.12/site-packages/hydropandas/observation.py:687, in Obs.merge_observation(self, right, overlap, merge_metadata)
    685 if merge_metadata:
    686     metadata = {key: getattr(right, key) for key in right._get_meta_attr()}
--> 687     new_metadata = self.merge_metadata(metadata, overlap=overlap)
    688 else:
    689     new_metadata = {key: getattr(self, key) for key in self._get_meta_attr()}

File ~/checkouts/readthedocs.org/user_builds/hydropandas/envs/latest/lib/python3.12/site-packages/hydropandas/observation.py:505, in Obs.merge_metadata(self, right, overlap)
    503 same_metadata = False
    504 if overlap == "error":
--> 505     raise ValueError(
    506         f"left observation {key} differs from right observation"
    507     )
    508 elif overlap == "use_left":
    509     logger.info(
    510         f"left observation {key} differs from right "
    511         "observation, use left"
    512     )

ValueError: left observation name differs from right observation

If you set the merge_metadata argument to False, the metadata is not merged and only the time series of the observations are merged. The result keeps the metadata of the left observation.

[15]:
o1.merge_observation(o5, merge_metadata=False)
INFO:hydropandas.observation._merge_timeseries:right observation has a different time series
INFO:hydropandas.observation._merge_timeseries:merge time series
[15]:

hydropandas.Obs

obs
x 0
y 0
location
filename
source
unit

measurements
2020-01-01 6
2020-01-02 5
2020-01-03 0
2020-01-04 2
2020-01-05 8
2020-01-06 2
2020-01-07 7
2020-01-08 3
2020-01-09 2
2020-01-10 6

Just as with overlapping time series, the overlap argument can also be used to resolve conflicting metadata values.

[16]:
o_merged = o1.merge_observation(o5, overlap="use_left", merge_metadata=True)
print('observation name when overlap="use_left":', o_merged.name)
o_merged = o1.merge_observation(o5, overlap="use_right", merge_metadata=True)
print('observation name when overlap="use_right":', o_merged.name)
INFO:hydropandas.observation._merge_timeseries:right observation has a different time series
INFO:hydropandas.observation._merge_timeseries:merge time series
INFO:hydropandas.observation.merge_metadata:left observation name differs from right observation, use left
observation name when overlap="use_left": obs
INFO:hydropandas.observation._merge_timeseries:right observation has a different time series
INFO:hydropandas.observation._merge_timeseries:merge time series
INFO:hydropandas.observation.merge_metadata:left observation name differs from right observation, use right
observation name when overlap="use_right": obs5
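
The same precedence rule can be written out for metadata stored in plain dictionaries. The function `merge_metadata_dicts` is a hypothetical illustration that borrows the option names and the error message seen above; it is an assumption about the behaviour, not the hydropandas method.

```python
def merge_metadata_dicts(left, right, overlap="error"):
    """Merge two metadata dicts; conflicting keys are resolved with `overlap`."""
    if overlap not in ("error", "use_left", "use_right"):
        raise ValueError(f"unknown overlap option: {overlap}")
    merged = dict(left)
    for key, right_value in right.items():
        if key not in left or left[key] == right_value:
            # No conflict: take the (identical or new) value.
            merged[key] = right_value
        elif overlap == "error":
            raise ValueError(f"left observation {key} differs from right observation")
        elif overlap == "use_right":
            merged[key] = right_value
        # overlap == "use_left": keep the value already copied from `left`.
    return merged


meta_left = {"name": "obs", "x": 0, "y": 0}
meta_right = {"name": "obs5", "x": 0, "y": 0}

print(merge_metadata_dicts(meta_left, meta_right, overlap="use_left")["name"])
print(merge_metadata_dicts(meta_left, meta_right, overlap="use_right")["name"])
```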

All combinations

Combine two observations with:

  • different columns

  • overlapping columns

  • overlapping time series

  • different metadata

[17]:
# observation 6
df = pd.DataFrame(
    {"measurements": np.random.randint(0, 10, 5), "filter": np.ones(5)},
    index=pd.date_range("2020-1-1", "2020-1-5"),
)
o6 = hpd.Obs(df, name="obs6", x=100, y=0)
o6
[17]:

hydropandas.Obs

obs6
x 100
y 0
location
filename
source
unit

measurements filter
2020-01-01 6 1.0
2020-01-02 9 1.0
2020-01-03 5 1.0
2020-01-04 8 1.0
2020-01-05 8 1.0
[18]:
# observation 7
df = pd.DataFrame(
    {
        "measurements": np.concatenate(
            [o5["measurements"].values[-1:], np.random.randint(0, 10, 4)]
        ),
        "remarks": ["", "", "", "unreliable", ""],
    },
    index=pd.date_range("2020-1-4", "2020-1-8"),
)
o7 = hpd.Obs(df, name="obs7", x=0, y=100)
o7
[18]:

hydropandas.Obs

obs7
x 0
y 100
location
filename
source
unit

measurements remarks
2020-01-04 6
2020-01-05 5
2020-01-06 5
2020-01-07 5 unreliable
2020-01-08 0
[19]:
merged_right = o6.merge_observation(o7, overlap="use_right")
merged_right
INFO:hydropandas.observation._merge_timeseries:right observation has a different time series
INFO:hydropandas.observation._merge_timeseries:merge time series
WARNING:hydropandas.observation._merge_timeseries:timeseries of observation obs7 overlap with different values
INFO:hydropandas.observation.merge_metadata:left observation name differs from right observation, use right
INFO:hydropandas.observation.merge_metadata:left observation x differs from right observation, use right
INFO:hydropandas.observation.merge_metadata:left observation y differs from right observation, use right
[19]:

hydropandas.Obs

obs7
x 0
y 100
location
filename
source
unit

measurements remarks filter
2020-01-01 6 NaN 1.0
2020-01-02 9 NaN 1.0
2020-01-03 5 NaN 1.0
2020-01-04 6 1.0
2020-01-05 5 1.0
2020-01-06 5 NaN
2020-01-07 5 unreliable NaN
2020-01-08 0 NaN
[20]:
f, axes = plt.subplots(figsize=(9, 7), nrows=2, sharex=True, sharey=True)
o6["measurements"].plot(ax=axes[0], marker="o", label="observation 6").legend(loc=2)
o7["measurements"].plot(ax=axes[0], marker="o", legend=True, label="observation 7")
merged_right["measurements"].plot(
    ax=axes[1], marker="o", legend=True, label="merged right"
)
[Figure: two panels; the top panel shows observation 6 and observation 7, the bottom panel shows merged right]
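
The column behaviour in the merged table above (shared columns combined, non-shared columns padded with NaN) can be reproduced with plain pandas DataFrames. This is a sketch with made-up values, assuming the right-hand frame takes precedence as with overlap="use_right"; it does not reproduce hydropandas internals.

```python
import pandas as pd

left = pd.DataFrame(
    {"measurements": [6.0, 9.0, 5.0], "filter": [1.0, 1.0, 1.0]},
    index=pd.date_range("2020-01-01", "2020-01-03"),
)
right = pd.DataFrame(
    {"measurements": [7.0, 4.0], "remarks": ["", "unreliable"]},
    index=pd.date_range("2020-01-03", "2020-01-04"),
)

# Values of `right` win on shared cells; `left` fills everything else.
# Columns that exist in only one frame are NaN on the other frame's rows.
merged = right.combine_first(left)

print(merged)
```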

Merging observation collections

[21]:
# create an observation collection from a single observation
oc1 = hpd.ObsCollection(o1)

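We can add a single observation to this collection using the add_observation method. The call returns the resulting collection; as the outputs of the following cells show, oc1 itself is not changed by it.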

[22]:
oc1.add_observation(o7)
INFO:hydropandas.obs_collection.add_observation:adding obs7 to collection
[22]:
x y location filename source unit obs
name
obs 0 0 Obs obs -----metadata------ name : obs x : 0 ...
obs7 0 100 Obs obs7 -----metadata------ name : obs7 x : ...

We can also combine two observation collections.

[23]:
# create another observation collection from a list of observations
oc2 = hpd.ObsCollection([o5, o6])

# add the collection to the previous one
oc1.add_obs_collection(oc2)
INFO:hydropandas.obs_collection.add_observation:adding obs5 to collection
INFO:hydropandas.obs_collection.add_observation:adding obs6 to collection
[23]:
x y location filename source unit obs
name
obs 0 0 Obs obs -----metadata------ name : obs x : 0 ...
obs5 0 0 Obs obs5 -----metadata------ name : obs5 x : ...
obs6 100 0 Obs obs6 -----metadata------ name : obs6 x : ...

When an observation is added, the collection automatically checks whether an observation with the same name is already present. If so, the existing and the new observation are merged, as the log output below shows.

[24]:
# add o2 to the observation collection 1
oc1.add_observation(o2)
INFO:hydropandas.obs_collection.add_observation:observation name obs already in collection, merging observations
INFO:hydropandas.observation._merge_timeseries:right observation has a different time series
INFO:hydropandas.observation._merge_timeseries:merge time series
INFO:hydropandas.observation.merge_metadata:left and right observation have the same metadata
[24]:
x y location filename source unit obs
name
obs 0 0 Obs obs -----metadata------ name : obs x : 0 ...

If the observation you want to add has the same name but different values for the same time steps, a ValueError is raised.

[25]:
o1_mod = o1.copy()
o1_mod.loc["2020-01-02", "measurements"] = 100
oc1.add_observation(o1_mod)
INFO:hydropandas.obs_collection.add_observation:observation name obs already in collection, merging observations
INFO:hydropandas.observation._merge_timeseries:right observation has a different time series
INFO:hydropandas.observation._merge_timeseries:merge time series
WARNING:hydropandas.observation._merge_timeseries:timeseries of observation obs overlap with different values
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[25], line 3
      1 o1_mod = o1.copy()
      2 o1_mod.loc["2020-01-02", "measurements"] = 100
----> 3 oc1.add_observation(o1_mod)

File ~/checkouts/readthedocs.org/user_builds/hydropandas/envs/latest/lib/python3.12/site-packages/hydropandas/obs_collection.py:1546, in ObsCollection.add_observation(self, o, check_consistency, inplace, **kwargs)
   1541 logger.info(
   1542     f"observation name {o.name} already in collection, merging observations"
   1543 )
   1545 o1 = oc.loc[o.name, "obs"]
-> 1546 omerged = o1.merge_observation(o, **kwargs)
   1548 # overwrite observation in collection
   1549 oc.loc[o.name] = omerged.to_collection_dict()

File ~/checkouts/readthedocs.org/user_builds/hydropandas/envs/latest/lib/python3.12/site-packages/hydropandas/observation.py:682, in Obs.merge_observation(self, right, overlap, merge_metadata)
    676     raise TypeError(
    677         f"observation left has a different type {type(self)} than"
    678         f"observation right {type(right)}"
    679     )
    681 # merge timeseries
--> 682 o = self._merge_timeseries(right, overlap=overlap)
    684 # merge metadata
    685 if merge_metadata:

File ~/checkouts/readthedocs.org/user_builds/hydropandas/envs/latest/lib/python3.12/site-packages/hydropandas/observation.py:595, in Obs._merge_timeseries(self, right, overlap)
    590 logger.warning(
    591     f"timeseries of observation {right.name} overlap with "
    592     "different values"
    593 )
    594 if overlap == "error":
--> 595     raise ValueError(
    596         "observations have different values for same time steps"
    597     )
    598 elif overlap == "use_left":
    599     dup_o = self.loc[dup_ind_o.index, overlap_cols]

ValueError: observations have different values for same time steps

To avoid this error, use the overlap argument to specify which observation takes precedence.

[26]:
oc1.add_observation(o1_mod, overlap="use_left")
INFO:hydropandas.obs_collection.add_observation:observation name obs already in collection, merging observations
INFO:hydropandas.observation._merge_timeseries:right observation has a different time series
INFO:hydropandas.observation._merge_timeseries:merge time series
WARNING:hydropandas.observation._merge_timeseries:timeseries of observation obs overlap with different values
INFO:hydropandas.observation.merge_metadata:left and right observation have the same metadata
[26]:
x y location filename source unit obs
name
obs 0 0 Obs obs -----metadata------ name : obs x : 0 ...
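
The name-based merging of a collection can be pictured as a dictionary keyed by observation name. The helper `add_series` below is hypothetical: it returns a new dictionary rather than changing its input, merges on a name collision, and raises when the same time step has different values. It is a simplified sketch under those assumptions and leaves out metadata and the overlap options.

```python
import pandas as pd


def add_series(collection, name, series):
    """Return a copy of `collection` with `series` added or merged under `name`."""
    updated = dict(collection)
    if name in updated:
        existing = updated[name]
        shared = existing.index.intersection(series.index)
        if existing.loc[shared].ne(series.loc[shared]).any():
            raise ValueError(f"{name}: different values for same time steps")
        # Same name and no conflicting values: merge the two series.
        series = existing.combine_first(series)
    updated[name] = series
    return updated


s1 = pd.Series([6.0, 5.0, 0.0, 2.0, 8.0], index=pd.date_range("2020-01-01", "2020-01-05"))
s2 = pd.Series([0.0, 3.0, 4.0, 3.0, 5.0], index=pd.date_range("2020-01-06", "2020-01-10"))

collection = {"obs": s1}
merged_collection = add_series(collection, "obs", s2)     # same name: merged
extended_collection = add_series(collection, "obs7", s2)  # new name: added
```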