Merging observations
This notebook shows how observations and observation collections can be merged. Merging observations can be useful if:
- you have data from multiple sources measuring at the same location
- you get new measurements that you want to add to the old measurements.
Notebook contents
Simple merge
Merge options
Merging observation collections
[1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from IPython.display import display
import hydropandas as hpd
hpd.util.get_color_logger("INFO");
Simple merge
[2]:
# observation 1
df = pd.DataFrame(
{"measurements": np.random.randint(0, 10, 5)},
index=pd.date_range("2020-1-1", "2020-1-5"),
)
o1 = hpd.Obs(df, name="obs", x=0, y=0)
display(o1)
hydropandas.Obs
| obs | |
|---|---|
| x | 0 |
| y | 0 |
| location | |
| filename | |
| source | |
| unit | |

| | measurements |
|---|---|
| 2020-01-01 | 2 |
| 2020-01-02 | 5 |
| 2020-01-03 | 0 |
| 2020-01-04 | 4 |
| 2020-01-05 | 1 |
[3]:
# observation 2
df = pd.DataFrame(
{"measurements": np.random.randint(0, 10, 5)},
index=pd.date_range("2020-1-6", "2020-1-10"),
)
o2 = hpd.Obs(df, name="obs", x=0, y=0)
display(o2)
hydropandas.Obs
| obs | |
|---|---|
| x | 0 |
| y | 0 |
| location | |
| filename | |
| source | |
| unit | |

| | measurements |
|---|---|
| 2020-01-06 | 3 |
| 2020-01-07 | 5 |
| 2020-01-08 | 7 |
| 2020-01-09 | 0 |
| 2020-01-10 | 1 |
[4]:
o_merged = o1.merge_observation(o2)
o_merged
INFO:hydropandas.observation._merge_timeseries:right observation has a different time series
INFO:hydropandas.observation._merge_timeseries:merge time series
INFO:hydropandas.observation.merge_metadata:left and right observation have the same metadata
[4]:
hydropandas.Obs
| obs | |
|---|---|
| x | 0 |
| y | 0 |
| location | |
| filename | |
| source | |
| unit | |

| | measurements |
|---|---|
| 2020-01-01 | 2 |
| 2020-01-02 | 5 |
| 2020-01-03 | 0 |
| 2020-01-04 | 4 |
| 2020-01-05 | 1 |
| 2020-01-06 | 3 |
| 2020-01-07 | 5 |
| 2020-01-08 | 7 |
| 2020-01-09 | 0 |
| 2020-01-10 | 1 |
[5]:
f, axes = plt.subplots(figsize=(9, 7), nrows=3, sharex=True, sharey=True)
o1["measurements"].plot(ax=axes[0], marker="o", label="observation 1").legend(loc=1)
o2["measurements"].plot(ax=axes[1], marker="o", label="observation 2").legend(loc=1)
o_merged["measurements"].plot(ax=axes[2], marker="o", label="merged").legend(loc=1)
Merge options
overlapping timeseries
[6]:
# create a partially overlapping dataframe
df = pd.DataFrame(
{
"measurements": np.concatenate(
[o1["measurements"].values[-2:], np.random.randint(0, 10, 3)]
)
},
index=pd.date_range("2020-1-4", "2020-1-8"),
)
o3 = hpd.Obs(df, name="obs", x=0, y=0)
display(o3)
hydropandas.Obs
| obs | |
|---|---|
| x | 0 |
| y | 0 |
| location | |
| filename | |
| source | |
| unit | |

| | measurements |
|---|---|
| 2020-01-04 | 4 |
| 2020-01-05 | 1 |
| 2020-01-06 | 5 |
| 2020-01-07 | 8 |
| 2020-01-08 | 5 |
[7]:
o_merged = o1.merge_observation(o3)
INFO:hydropandas.observation._merge_timeseries:right observation has a different time series
INFO:hydropandas.observation._merge_timeseries:merge time series
INFO:hydropandas.observation.merge_metadata:left and right observation have the same metadata
[8]:
f, axes = plt.subplots(figsize=(9, 7), nrows=3, sharex=True, sharey=True)
o1["measurements"].plot(ax=axes[0], marker="o", label="observation 1").legend(loc=1)
o3["measurements"].plot(ax=axes[1], marker="o", label="observation 3").legend(loc=1)
o_merged["measurements"].plot(ax=axes[2], marker="o", label="merged").legend(loc=1)
[9]:
# create a partially overlapping dataframe with different values
df = pd.DataFrame(
{"measurements": np.random.randint(0, 10, 5)},
index=pd.date_range("2020-1-4", "2020-1-8"),
)
o4 = hpd.Obs(df, name="obs", x=0, y=0)
display(o4)
hydropandas.Obs
| obs | |
|---|---|
| x | 0 |
| y | 0 |
| location | |
| filename | |
| source | |
| unit | |

| | measurements |
|---|---|
| 2020-01-04 | 9 |
| 2020-01-05 | 4 |
| 2020-01-06 | 9 |
| 2020-01-07 | 2 |
| 2020-01-08 | 8 |
By default, an error is raised if the overlapping time series have different values.
[10]:
o1.merge_observation(o4)
INFO:hydropandas.observation._merge_timeseries:right observation has a different time series
INFO:hydropandas.observation._merge_timeseries:merge time series
WARNING:hydropandas.observation._merge_timeseries:timeseries of observation obs overlap with different values
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[10], line 1
----> 1 o1.merge_observation(o4)
File ~/checkouts/readthedocs.org/user_builds/hydropandas/envs/stable/lib/python3.12/site-packages/hydropandas/observation.py:683, in Obs.merge_observation(self, right, overlap, merge_metadata)
677 raise TypeError(
678 f"observation left has a different type {type(self)} than"
679 f"observation right {type(right)}"
680 )
682 # merge timeseries
--> 683 o = self._merge_timeseries(right, overlap=overlap)
685 # merge metadata
686 if merge_metadata:
File ~/checkouts/readthedocs.org/user_builds/hydropandas/envs/stable/lib/python3.12/site-packages/hydropandas/observation.py:596, in Obs._merge_timeseries(self, right, overlap)
591 logger.warning(
592 f"timeseries of observation {right.name} overlap with "
593 "different values"
594 )
595 if overlap == "error":
--> 596 raise ValueError(
597 "observations have different values for same time steps"
598 )
599 elif overlap == "use_left":
600 dup_o = self.loc[dup_ind_o.index, overlap_cols]
ValueError: observations have different values for same time steps
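Before merging, it can help to see which time steps are in conflict. The sketch below uses plain pandas and a hypothetical helper, `conflicting_steps`, which is not part of hydropandas; it only illustrates the kind of comparison that leads to the error above, using the values printed for observations 1, 3 and 4.

```python
import pandas as pd


def conflicting_steps(left: pd.Series, right: pd.Series) -> pd.Index:
    """Return the time steps that occur in both series with different values."""
    shared = left.index.intersection(right.index)
    # compare position by position on the shared time steps
    differs = left.loc[shared].to_numpy() != right.loc[shared].to_numpy()
    return shared[differs]


# values copied from the outputs of observations 1, 3 and 4
s1 = pd.Series([2, 5, 0, 4, 1], index=pd.date_range("2020-01-01", "2020-01-05"))
s3 = pd.Series([4, 1, 5, 8, 5], index=pd.date_range("2020-01-04", "2020-01-08"))
s4 = pd.Series([9, 4, 9, 2, 8], index=pd.date_range("2020-01-04", "2020-01-08"))

same_overlap = conflicting_steps(s1, s3)  # empty: the overlapping values agree
conflicts = conflicting_steps(s1, s4)     # the two shared days with other values
print(list(conflicts))
```

An empty result corresponds to the case of observation 3, where the merge succeeds without choosing a side.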
With the ‘overlap’ argument you can specify whether the values of the left or the right observation are used where the time series overlap. See the example below.
[11]:
print("use left")
merged_left = o1.merge_observation(o4, overlap="use_left")
display(merged_left) # use the existing observation
print("use right")
merged_right = o1.merge_observation(o4, overlap="use_right")
display(merged_right) # use the new (right) observation
use left
INFO:hydropandas.observation._merge_timeseries:right observation has a different time series
INFO:hydropandas.observation._merge_timeseries:merge time series
WARNING:hydropandas.observation._merge_timeseries:timeseries of observation obs overlap with different values
INFO:hydropandas.observation.merge_metadata:left and right observation have the same metadata
hydropandas.Obs
| obs | |
|---|---|
| x | 0 |
| y | 0 |
| location | |
| filename | |
| source | |
| unit | |

| | measurements |
|---|---|
| 2020-01-01 | 2 |
| 2020-01-02 | 5 |
| 2020-01-03 | 0 |
| 2020-01-04 | 4 |
| 2020-01-05 | 1 |
| 2020-01-06 | 9 |
| 2020-01-07 | 2 |
| 2020-01-08 | 8 |
use right
INFO:hydropandas.observation._merge_timeseries:right observation has a different time series
INFO:hydropandas.observation._merge_timeseries:merge time series
WARNING:hydropandas.observation._merge_timeseries:timeseries of observation obs overlap with different values
INFO:hydropandas.observation.merge_metadata:left and right observation have the same metadata
hydropandas.Obs
| obs | |
|---|---|
| x | 0 |
| y | 0 |
| location | |
| filename | |
| source | |
| unit | |

| | measurements |
|---|---|
| 2020-01-01 | 2 |
| 2020-01-02 | 5 |
| 2020-01-03 | 0 |
| 2020-01-04 | 9 |
| 2020-01-05 | 4 |
| 2020-01-06 | 9 |
| 2020-01-07 | 2 |
| 2020-01-08 | 8 |
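The effect of the two options on the overlapping time steps can be mimicked with plain pandas. The sketch below is not how hydropandas implements the merge; it assumes that pandas' `combine_first` is an acceptable stand-in and reuses the values printed for observations 1 and 4.

```python
import pandas as pd

# values copied from the outputs of observation 1 and observation 4
left = pd.Series([2, 5, 0, 4, 1], index=pd.date_range("2020-01-01", "2020-01-05"))
right = pd.Series([9, 4, 9, 2, 8], index=pd.date_range("2020-01-04", "2020-01-08"))

# analogue of overlap="use_left": left values win on shared time steps
like_use_left = left.combine_first(right)

# analogue of overlap="use_right": right values win on shared time steps
like_use_right = right.combine_first(left)

# show only the time steps where the two inputs overlap
shared = left.index.intersection(right.index)
print(
    pd.DataFrame(
        {"use_left": like_use_left.loc[shared], "use_right": like_use_right.loc[shared]}
    )
)
```

Outside the overlap both results are identical; only 2020-01-04 and 2020-01-05 depend on the chosen option.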
[12]:
f, axes = plt.subplots(figsize=(9, 4), nrows=2, sharex=True, sharey=True)
o1["measurements"].plot(ax=axes[0], marker="o", label="observation 1").legend(loc=2)
o4["measurements"].plot(ax=axes[0], marker="o", label="observation 4").legend(loc=2)
merged_left["measurements"].plot(ax=axes[1], marker="o", label="merged left").legend(
loc=2
)
merged_right["measurements"].plot(ax=axes[1], marker=".", label="merged right").legend(
loc=2
)
metadata
The merge_observation method checks by default if the metadata of the two observations is the same.
[13]:
# observation 5
df = pd.DataFrame(
{"measurements": np.random.randint(0, 10, 5)},
index=pd.date_range("2020-1-6", "2020-1-10"),
)
o5 = hpd.Obs(df, name="obs5", x=0, y=0)
o5
[13]:
hydropandas.Obs
| obs5 | |
|---|---|
| x | 0 |
| y | 0 |
| location | |
| filename | |
| source | |
| unit | |

| | measurements |
|---|---|
| 2020-01-06 | 6 |
| 2020-01-07 | 0 |
| 2020-01-08 | 5 |
| 2020-01-09 | 0 |
| 2020-01-10 | 8 |
When the metadata differs, a ValueError is raised. Here the names of the two observations differ (obs versus obs5).
[14]:
o1.merge_observation(o5)
INFO:hydropandas.observation._merge_timeseries:right observation has a different time series
INFO:hydropandas.observation._merge_timeseries:merge time series
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[14], line 1
----> 1 o1.merge_observation(o5)
File ~/checkouts/readthedocs.org/user_builds/hydropandas/envs/stable/lib/python3.12/site-packages/hydropandas/observation.py:688, in Obs.merge_observation(self, right, overlap, merge_metadata)
686 if merge_metadata:
687 metadata = {key: getattr(right, key) for key in right._get_meta_attr()}
--> 688 new_metadata = self.merge_metadata(metadata, overlap=overlap)
689 else:
690 new_metadata = {key: getattr(self, key) for key in self._get_meta_attr()}
File ~/checkouts/readthedocs.org/user_builds/hydropandas/envs/stable/lib/python3.12/site-packages/hydropandas/observation.py:506, in Obs.merge_metadata(self, right, overlap)
504 same_metadata = False
505 if overlap == "error":
--> 506 raise ValueError(
507 f"left observation {key} differs from right observation"
508 )
509 elif overlap == "use_left":
510 logger.info(
511 f"left observation {key} differs from right "
512 "observation, use left"
513 )
ValueError: left observation name differs from right observation
If you set the merge_metadata argument to False, the metadata is not merged: only the timeseries of the observations are merged, and the result keeps the metadata of the left observation.
[15]:
o1.merge_observation(o5, merge_metadata=False)
INFO:hydropandas.observation._merge_timeseries:right observation has a different time series
INFO:hydropandas.observation._merge_timeseries:merge time series
[15]:
hydropandas.Obs
| obs | |
|---|---|
| x | 0 |
| y | 0 |
| location | |
| filename | |
| source | |
| unit | |

| | measurements |
|---|---|
| 2020-01-01 | 2 |
| 2020-01-02 | 5 |
| 2020-01-03 | 0 |
| 2020-01-04 | 4 |
| 2020-01-05 | 1 |
| 2020-01-06 | 6 |
| 2020-01-07 | 0 |
| 2020-01-08 | 5 |
| 2020-01-09 | 0 |
| 2020-01-10 | 8 |
Just as with overlapping timeseries, the ‘overlap’ argument also determines which observation is used when metadata values differ.
[16]:
o_merged = o1.merge_observation(o5, overlap="use_left", merge_metadata=True)
print('observation name when overlap="use_left":', o_merged.name)
o_merged = o1.merge_observation(o5, overlap="use_right", merge_metadata=True)
print('observation name when overlap="use_right":', o_merged.name)
INFO:hydropandas.observation._merge_timeseries:right observation has a different time series
INFO:hydropandas.observation._merge_timeseries:merge time series
INFO:hydropandas.observation.merge_metadata:left observation name differs from right observation, use left
observation name when overlap="use_left": obs
INFO:hydropandas.observation._merge_timeseries:right observation has a different time series
INFO:hydropandas.observation._merge_timeseries:merge time series
INFO:hydropandas.observation.merge_metadata:left observation name differs from right observation, use right
observation name when overlap="use_right": obs5
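The three overlap options for metadata can be summarised in a few lines of plain Python. This is a simplified sketch modelled on the traceback excerpt shown earlier, not the hydropandas source; the function name `merge_meta` and the use of plain dictionaries are assumptions made for illustration.

```python
def merge_meta(left: dict, right: dict, overlap: str = "error") -> dict:
    """Merge two metadata dicts; `overlap` decides what happens on a conflict."""
    if overlap not in ("error", "use_left", "use_right"):
        raise ValueError(f"unknown overlap option: {overlap!r}")
    merged = dict(left)
    for key, right_value in right.items():
        if key in left and left[key] != right_value:
            if overlap == "error":
                raise ValueError(f"left observation {key} differs from right observation")
            if overlap == "use_left":
                continue  # keep the value already copied from left
        # no conflict, or overlap == "use_right": take the right value
        merged[key] = right_value
    return merged


meta1 = {"name": "obs", "x": 0, "y": 0}
meta5 = {"name": "obs5", "x": 0, "y": 0}
print(merge_meta(meta1, meta5, overlap="use_left")["name"])   # obs
print(merge_meta(meta1, meta5, overlap="use_right")["name"])  # obs5
```

With the default `overlap="error"` the same call raises a ValueError, which mirrors the behaviour of cell 14.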
all combinations
[17]:
# observation 6
df = pd.DataFrame(
{"measurements": np.random.randint(0, 10, 5), "filter": np.ones(5)},
index=pd.date_range("2020-1-1", "2020-1-5"),
)
o6 = hpd.Obs(df, name="obs6", x=100, y=0)
o6
[17]:
hydropandas.Obs
| obs6 | |
|---|---|
| x | 100 |
| y | 0 |
| location | |
| filename | |
| source | |
| unit | |

| | measurements | filter |
|---|---|---|
| 2020-01-01 | 4 | 1.0 |
| 2020-01-02 | 9 | 1.0 |
| 2020-01-03 | 9 | 1.0 |
| 2020-01-04 | 8 | 1.0 |
| 2020-01-05 | 0 | 1.0 |
[18]:
# observation 7
df = pd.DataFrame(
{
"measurements": np.concatenate(
[o5["measurements"].values[-1:], np.random.randint(0, 10, 4)]
),
"remarks": ["", "", "", "unreliable", ""],
},
index=pd.date_range("2020-1-4", "2020-1-8"),
)
o7 = hpd.Obs(df, name="obs7", x=0, y=100)
o7
[18]:
hydropandas.Obs
| obs7 | |
|---|---|
| x | 0 |
| y | 100 |
| location | |
| filename | |
| source | |
| unit | |

| | measurements | remarks |
|---|---|---|
| 2020-01-04 | 8 | |
| 2020-01-05 | 5 | |
| 2020-01-06 | 4 | |
| 2020-01-07 | 7 | unreliable |
| 2020-01-08 | 0 | |
[19]:
merged_right = o6.merge_observation(o7, overlap="use_right")
merged_right
INFO:hydropandas.observation._merge_timeseries:right observation has a different time series
INFO:hydropandas.observation._merge_timeseries:merge time series
WARNING:hydropandas.observation._merge_timeseries:timeseries of observation obs7 overlap with different values
INFO:hydropandas.observation.merge_metadata:left observation name differs from right observation, use right
INFO:hydropandas.observation.merge_metadata:left observation x differs from right observation, use right
INFO:hydropandas.observation.merge_metadata:left observation y differs from right observation, use right
[19]:
hydropandas.Obs
| obs7 | |
|---|---|
| x | 0 |
| y | 100 |
| location | |
| filename | |
| source | |
| unit | |

| | measurements | remarks | filter |
|---|---|---|---|
| 2020-01-01 | 4 | NaN | 1.0 |
| 2020-01-02 | 9 | NaN | 1.0 |
| 2020-01-03 | 9 | NaN | 1.0 |
| 2020-01-04 | 8 | | 1.0 |
| 2020-01-05 | 5 | | 1.0 |
| 2020-01-06 | 4 | | NaN |
| 2020-01-07 | 7 | unreliable | NaN |
| 2020-01-08 | 0 | | NaN |
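The pattern of NaN values in the merged table follows from combining two frames that share only some columns and time steps. The pandas sketch below reproduces that pattern with `DataFrame.combine_first`; it is an analogue chosen for illustration, not the hydropandas implementation, and the frames are rebuilt from the values printed for observations 6 and 7.

```python
import pandas as pd

# values copied from the outputs of observation 6 and observation 7
left = pd.DataFrame(
    {"measurements": [4, 9, 9, 8, 0], "filter": [1.0] * 5},
    index=pd.date_range("2020-01-01", "2020-01-05"),
)
right = pd.DataFrame(
    {"measurements": [8, 5, 4, 7, 0], "remarks": ["", "", "", "unreliable", ""]},
    index=pd.date_range("2020-01-04", "2020-01-08"),
)

# right values win on shared cells; cells that exist in only one frame are kept
combined = right.combine_first(left)
print(combined)
```

Columns that exist in only one input are filled with NaN for the time steps that come from the other input, which is why `filter` is empty after 2020-01-05 and `remarks` is empty before 2020-01-04.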
[20]:
f, axes = plt.subplots(figsize=(9, 7), nrows=2, sharex=True, sharey=True)
o6["measurements"].plot(ax=axes[0], marker="o", label="observation 6").legend(loc=2)
o7["measurements"].plot(ax=axes[0], marker="o", legend=True, label="observation 7")
merged_right["measurements"].plot(
ax=axes[1], marker="o", legend=True, label="merged right"
)
Merging observation collections
[21]:
# create an observation collection from a single observation
oc1 = hpd.ObsCollection(o1)
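We can add a single observation to this collection using the add_observation method. The method returns the extended collection; as the later outputs show, oc1 itself is not modified by this call.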
[22]:
oc1.add_observation(o7)
INFO:hydropandas.obs_collection.add_observation:adding obs7 to collection
[22]:
| name | x | y | location | filename | source | unit | obs |
|---|---|---|---|---|---|---|---|
| obs | 0 | 0 | | | | | Obs obs -----metadata------ name : obs x : 0 ... |
| obs7 | 0 | 100 | | | | | Obs obs7 -----metadata------ name : obs7 x : ... |
We can also combine two observation collections using the add_obs_collection method.
[23]:
# create another observation collection from a list of observations
oc2 = hpd.ObsCollection([o5, o6])
# add the collection to the previous one
oc1.add_obs_collection(oc2)
INFO:hydropandas.obs_collection.add_observation:adding obs5 to collection
INFO:hydropandas.obs_collection.add_observation:adding obs6 to collection
[23]:
| name | x | y | location | filename | source | unit | obs |
|---|---|---|---|---|---|---|---|
| obs | 0 | 0 | | | | | Obs obs -----metadata------ name : obs x : 0 ... |
| obs5 | 0 | 0 | | | | | Obs obs5 -----metadata------ name : obs5 x : ... |
| obs6 | 100 | 0 | | | | | Obs obs6 -----metadata------ name : obs6 x : ... |
There is an automatic check for overlap based on the name of the observations. If an observation with the same name is already in the collection, the two observations are merged.
[24]:
# add o2 to the observation collection 1
oc1.add_observation(o2)
INFO:hydropandas.obs_collection.add_observation:observation name obs already in collection, merging observations
INFO:hydropandas.observation._merge_timeseries:right observation has a different time series
INFO:hydropandas.observation._merge_timeseries:merge time series
INFO:hydropandas.observation.merge_metadata:left and right observation have the same metadata
[24]:
| name | x | y | location | filename | source | unit | obs |
|---|---|---|---|---|---|---|---|
| obs | 0 | 0 | | | | | Obs obs -----metadata------ name : obs x : 0 ... |
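The name-based check can be pictured with a plain dictionary keyed by observation name. The sketch below is only an illustration: `add_series` is a hypothetical helper, not hydropandas API, and it stores pandas Series instead of Obs objects. Unlike the hydropandas default, it silently keeps the existing values on shared time steps instead of raising an error.

```python
import pandas as pd


def add_series(collection: dict, name: str, series: pd.Series) -> dict:
    """Return a new dict with `series` added; merge when `name` already exists."""
    updated = dict(collection)  # the input dict is left untouched
    if name in updated:
        # same name: merge, keeping existing values on shared time steps
        updated[name] = updated[name].combine_first(series)
    else:
        updated[name] = series
    return updated


# values copied from the outputs of observation 1 and observation 2
s1 = pd.Series([2, 5, 0, 4, 1], index=pd.date_range("2020-01-01", "2020-01-05"))
s2 = pd.Series([3, 5, 7, 0, 1], index=pd.date_range("2020-01-06", "2020-01-10"))

coll = {"obs": s1}
coll_merged = add_series(coll, "obs", s2)     # name clash: one merged entry
coll_extended = add_series(coll, "obs7", s2)  # new name: a second entry
print(list(coll_merged), list(coll_extended))
```

A name clash leaves the number of entries unchanged and lengthens the stored series, which matches the single `obs` row in the output above.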
If the observation you want to add has the same name but different values for the same time steps, an error is raised.
[25]:
o1_mod = o1.copy()
o1_mod.loc["2020-01-02", "measurements"] = 100
oc1.add_observation(o1_mod)
INFO:hydropandas.obs_collection.add_observation:observation name obs already in collection, merging observations
INFO:hydropandas.observation._merge_timeseries:right observation has a different time series
INFO:hydropandas.observation._merge_timeseries:merge time series
WARNING:hydropandas.observation._merge_timeseries:timeseries of observation obs overlap with different values
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[25], line 3
1 o1_mod = o1.copy()
2 o1_mod.loc["2020-01-02", "measurements"] = 100
----> 3 oc1.add_observation(o1_mod)
File ~/checkouts/readthedocs.org/user_builds/hydropandas/envs/stable/lib/python3.12/site-packages/hydropandas/obs_collection.py:1458, in ObsCollection.add_observation(self, o, check_consistency, inplace, **kwargs)
1453 logger.info(
1454 f"observation name {o.name} already in collection, merging observations"
1455 )
1457 o1 = oc.loc[o.name, "obs"]
-> 1458 omerged = o1.merge_observation(o, **kwargs)
1460 # overwrite observation in collection
1461 oc.loc[o.name] = omerged.to_collection_dict()
File ~/checkouts/readthedocs.org/user_builds/hydropandas/envs/stable/lib/python3.12/site-packages/hydropandas/observation.py:683, in Obs.merge_observation(self, right, overlap, merge_metadata)
677 raise TypeError(
678 f"observation left has a different type {type(self)} than"
679 f"observation right {type(right)}"
680 )
682 # merge timeseries
--> 683 o = self._merge_timeseries(right, overlap=overlap)
685 # merge metadata
686 if merge_metadata:
File ~/checkouts/readthedocs.org/user_builds/hydropandas/envs/stable/lib/python3.12/site-packages/hydropandas/observation.py:596, in Obs._merge_timeseries(self, right, overlap)
591 logger.warning(
592 f"timeseries of observation {right.name} overlap with "
593 "different values"
594 )
595 if overlap == "error":
--> 596 raise ValueError(
597 "observations have different values for same time steps"
598 )
599 elif overlap == "use_left":
600 dup_o = self.loc[dup_ind_o.index, overlap_cols]
ValueError: observations have different values for same time steps
To avoid errors, we can use the overlap argument to specify which observation we want to use.
[26]:
oc1.add_observation(o1_mod, overlap="use_left")
INFO:hydropandas.obs_collection.add_observation:observation name obs already in collection, merging observations
INFO:hydropandas.observation._merge_timeseries:right observation has a different time series
INFO:hydropandas.observation._merge_timeseries:merge time series
WARNING:hydropandas.observation._merge_timeseries:timeseries of observation obs overlap with different values
INFO:hydropandas.observation.merge_metadata:left and right observation have the same metadata
[26]:
| name | x | y | location | filename | source | unit | obs |
|---|---|---|---|---|---|---|---|
| obs | 0 | 0 | | | | | Obs obs -----metadata------ name : obs x : 0 ... |