Merging observations
This notebook shows how observations and observation collections can be merged. Merging observations can be useful if:
- you have data from multiple sources measuring at the same location
- you get new measurements that you want to add to the old measurements.
Notebook contents
Simple merge
Merge options
Merging observation collections
[1]:
import numpy as np
import pandas as pd
import hydropandas as hpd
from IPython.display import display
import logging
hpd.util.get_color_logger("INFO");
Simple merge
[2]:
# observation 1
df = pd.DataFrame(
{"measurements": np.random.randint(0, 10, 5)},
index=pd.date_range("2020-1-1", "2020-1-5"),
)
o1 = hpd.Obs(df, name="obs", x=0, y=0)
print(o1)
Obs obs
-----metadata------
name : obs
x : 0
y : 0
filename :
source :
unit :
-----time series------
measurements
2020-01-01 4
2020-01-02 4
2020-01-03 9
2020-01-04 3
2020-01-05 8
[3]:
# observation 2
df = pd.DataFrame(
{"measurements": np.random.randint(0, 10, 5)},
index=pd.date_range("2020-1-6", "2020-1-10"),
)
o2 = hpd.Obs(df, name="obs", x=0, y=0)
print(o2)
Obs obs
-----metadata------
name : obs
x : 0
y : 0
filename :
source :
unit :
-----time series------
measurements
2020-01-06 7
2020-01-07 3
2020-01-08 8
2020-01-09 6
2020-01-10 3
[4]:
o1.merge_observation(o2)
WARNING:hydropandas.observation:function 'merge_observation' not thoroughly tested, please be carefull!
INFO:hydropandas.observation:new observation has a different time series
INFO:hydropandas.observation:merge time series
INFO:hydropandas.observation:new and existing observation have the same metadata
[4]:
| | measurements |
---|---|
2020-01-01 | 4 |
2020-01-02 | 4 |
2020-01-03 | 9 |
2020-01-04 | 3 |
2020-01-05 | 8 |
2020-01-06 | 7 |
2020-01-07 | 3 |
2020-01-08 | 8 |
2020-01-09 | 6 |
2020-01-10 | 3 |
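For intuition, merging two series that share no timestamps amounts to stacking them and sorting by time. The sketch below uses plain pandas with values copied from the cells above; it is only an illustration under that assumption and does not show how `merge_observation` is implemented.

```python
import pandas as pd

# Two non-overlapping daily series (values copied from the outputs above).
first = pd.Series([4, 4, 9, 3, 8], index=pd.date_range("2020-01-01", "2020-01-05"))
second = pd.Series([7, 3, 8, 6, 3], index=pd.date_range("2020-01-06", "2020-01-10"))

# With no shared timestamps, merging reduces to concatenating and sorting by time.
merged = pd.concat([first, second]).sort_index()
print(merged)
```

Unlike this sketch, the observation merge above also compares the metadata of both observations, as the log messages show.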
Merge options
overlapping timeseries
[5]:
# create a partly overlapping dataframe
df = pd.DataFrame(
{
"measurements": np.concatenate(
[o1["measurements"].values[-2:], np.random.randint(0, 10, 3)]
)
},
index=pd.date_range("2020-1-4", "2020-1-8"),
)
o3 = hpd.Obs(df, name="obs", x=0, y=0)
print(o3)
Obs obs
-----metadata------
name : obs
x : 0
y : 0
filename :
source :
unit :
-----time series------
measurements
2020-01-04 3
2020-01-05 8
2020-01-06 4
2020-01-07 5
2020-01-08 1
[6]:
o1.merge_observation(o3)
WARNING:hydropandas.observation:function 'merge_observation' not thoroughly tested, please be carefull!
INFO:hydropandas.observation:new observation has a different time series
INFO:hydropandas.observation:merge time series
INFO:hydropandas.observation:new and existing observation have the same metadata
[6]:
| | measurements |
---|---|
2020-01-01 | 4 |
2020-01-02 | 4 |
2020-01-03 | 9 |
2020-01-04 | 3 |
2020-01-05 | 8 |
2020-01-06 | 4 |
2020-01-07 | 5 |
2020-01-08 | 1 |
[7]:
# create a partly overlapping dataframe with different values
df = pd.DataFrame(
{"measurements": np.random.randint(0, 10, 5)},
index=pd.date_range("2020-1-4", "2020-1-8"),
)
o4 = hpd.Obs(df, name="obs", x=0, y=0)
print(o4)
Obs obs
-----metadata------
name : obs
x : 0
y : 0
filename :
source :
unit :
-----time series------
measurements
2020-01-04 3
2020-01-05 4
2020-01-06 0
2020-01-07 4
2020-01-08 0
By default, an error is raised if the overlapping time series have different values.
[8]:
o1.merge_observation(o4)
WARNING:hydropandas.observation:function 'merge_observation' not thoroughly tested, please be carefull!
INFO:hydropandas.observation:new observation has a different time series
INFO:hydropandas.observation:merge time series
WARNING:hydropandas.observation:timeseries of observation obs overlap withdifferent values
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-8-fb7c0e48ad44> in <module>
----> 1 o1.merge_observation(o4)
c:\users\oebbe\02_python\hydropandas\hydropandas\observation.py in merge_observation(self, right, overlap, merge_metadata)
427
428 # merge timeseries
--> 429 o = self._merge_timeseries(right, overlap=overlap)
430
431 # merge metadata
c:\users\oebbe\02_python\hydropandas\hydropandas\observation.py in _merge_timeseries(self, right, overlap)
336 )
337 if overlap == "error":
--> 338 raise ValueError(
339 "observations have different values for same time steps"
340 )
ValueError: observations have different values for same time steps
With the ‘overlap’ argument you can specify whether the values of the left (existing) or the right (new) observation are used where the time series overlap. See the example below.
[9]:
print("use left")
display(o1.merge_observation(o4, overlap="use_left")) # use the existing observation
print("use right")
display(o1.merge_observation(o4, overlap="use_right"))  # use the new observation
use left
WARNING:hydropandas.observation:function 'merge_observation' not thoroughly tested, please be carefull!
INFO:hydropandas.observation:new observation has a different time series
INFO:hydropandas.observation:merge time series
WARNING:hydropandas.observation:timeseries of observation obs overlap withdifferent values
INFO:hydropandas.observation:new and existing observation have the same metadata
| | measurements |
---|---|
2020-01-01 | 4 |
2020-01-02 | 4 |
2020-01-03 | 9 |
2020-01-04 | 3 |
2020-01-05 | 8 |
2020-01-06 | 0 |
2020-01-07 | 4 |
2020-01-08 | 0 |
use right
WARNING:hydropandas.observation:function 'merge_observation' not thoroughly tested, please be carefull!
INFO:hydropandas.observation:new observation has a different time series
INFO:hydropandas.observation:merge time series
WARNING:hydropandas.observation:timeseries of observation obs overlap withdifferent values
INFO:hydropandas.observation:new and existing observation have the same metadata
| | measurements |
---|---|
2020-01-01 | 4 |
2020-01-02 | 4 |
2020-01-03 | 9 |
2020-01-04 | 3 |
2020-01-05 | 4 |
2020-01-06 | 0 |
2020-01-07 | 4 |
2020-01-08 | 0 |
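The three overlap options can be mimicked with plain pandas. The function `merge_series` below is a hypothetical helper written for this illustration, not the hydropandas implementation. It assumes both series have unique datetime indexes, and because it relies on `combine_first`, a NaN in the preferred series is filled from the other one.

```python
import pandas as pd


def merge_series(left, right, overlap="error"):
    """Merge two series; `overlap` decides which values win on shared timestamps.

    Hypothetical helper for illustration only.
    """
    if overlap not in ("error", "use_left", "use_right"):
        raise ValueError(f"unknown overlap option: {overlap!r}")

    # Compare the values on the timestamps that occur in both series.
    shared = left.index.intersection(right.index)
    conflict = not left.loc[shared].equals(right.loc[shared])
    if conflict and overlap == "error":
        raise ValueError("series have different values for same time steps")

    # combine_first keeps the caller's values and fills the gaps from the argument.
    if overlap == "use_right":
        return right.combine_first(left)
    return left.combine_first(right)


# Values copied from the outputs above.
left = pd.Series([4, 4, 9, 3, 8], index=pd.date_range("2020-01-01", "2020-01-05"))
right = pd.Series([3, 4, 0, 4, 0], index=pd.date_range("2020-01-04", "2020-01-08"))

print(merge_series(left, right, overlap="use_left"))
print(merge_series(left, right, overlap="use_right"))
```

On 2020-01-05 the two series disagree, so the two calls differ in that row only, as in the notebook output above.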
metadata
The merge_observation method checks by default if the metadata of the two observations is the same.
[10]:
# observation 5
df = pd.DataFrame(
{"measurements": np.random.randint(0, 10, 5)},
index=pd.date_range("2020-1-6", "2020-1-10"),
)
o5 = hpd.Obs(df, name="obs5", x=0, y=0)
o5
[10]:
| | measurements |
---|---|
2020-01-06 | 0 |
2020-01-07 | 6 |
2020-01-08 | 8 |
2020-01-09 | 6 |
2020-01-10 | 7 |
When the metadata differs, a ValueError is raised.
[11]:
o1.merge_observation(o5)
WARNING:hydropandas.observation:function 'merge_observation' not thoroughly tested, please be carefull!
INFO:hydropandas.observation:new observation has a different time series
INFO:hydropandas.observation:merge time series
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-11-204327616616> in <module>
----> 1 o1.merge_observation(o5)
c:\users\oebbe\02_python\hydropandas\hydropandas\observation.py in merge_observation(self, right, overlap, merge_metadata)
432 if merge_metadata:
433 metadata = {key: getattr(right, key) for key in right._metadata}
--> 434 new_metadata = self.merge_metadata(metadata, overlap=overlap)
435 else:
436 new_metadata = {key: getattr(self, key) for key in self._metadata}
c:\users\oebbe\02_python\hydropandas\hydropandas\observation.py in merge_metadata(self, right, overlap)
245 same_metadata = False
246 if overlap == "error":
--> 247 raise ValueError(
248 f"existing observation {key} differs from new observation"
249 )
ValueError: existing observation name differs from new observation
If you set the merge_metadata argument to False, the metadata is not merged and only the time series of the observations are merged.
[12]:
o1.merge_observation(o5, merge_metadata=False)
WARNING:hydropandas.observation:function 'merge_observation' not thoroughly tested, please be carefull!
INFO:hydropandas.observation:new observation has a different time series
INFO:hydropandas.observation:merge time series
[12]:
| | measurements |
---|---|
2020-01-01 | 4 |
2020-01-02 | 4 |
2020-01-03 | 9 |
2020-01-04 | 3 |
2020-01-05 | 8 |
2020-01-06 | 0 |
2020-01-07 | 6 |
2020-01-08 | 8 |
2020-01-09 | 6 |
2020-01-10 | 7 |
Just as with overlapping time series, the ‘overlap’ argument can also be used to resolve conflicting metadata values.
[13]:
o_merged = o1.merge_observation(o5, overlap="use_left", merge_metadata=True)
print('observation name when overlap="use_left":', o_merged.name)
o_merged = o1.merge_observation(o5, overlap="use_right", merge_metadata=True)
print('observation name when overlap="use_right":', o_merged.name)
WARNING:hydropandas.observation:function 'merge_observation' not thoroughly tested, please be carefull!
INFO:hydropandas.observation:new observation has a different time series
INFO:hydropandas.observation:merge time series
INFO:hydropandas.observation:existing observation name differs from newobservation, use existing
observation name when overlap="use_left": obs
WARNING:hydropandas.observation:function 'merge_observation' not thoroughly tested, please be carefull!
INFO:hydropandas.observation:new observation has a different time series
INFO:hydropandas.observation:merge time series
INFO:hydropandas.observation:existing observation name differs from newobservation, use new
observation name when overlap="use_right": obs5
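The same policy can be pictured on plain dictionaries. The function `merge_metadata_dicts` below is a hypothetical helper that treats metadata as a dict of key/value pairs; it is a minimal sketch of the behaviour seen in the log messages above, not the library's own `merge_metadata` method.

```python
def merge_metadata_dicts(left, right, overlap="error"):
    """Merge two metadata dicts; `overlap` decides which value wins on conflict.

    Hypothetical helper for illustration only.
    """
    if overlap not in ("error", "use_left", "use_right"):
        raise ValueError(f"unknown overlap option: {overlap!r}")

    merged = dict(left)
    for key, new_value in right.items():
        if key not in left or left[key] == new_value:
            # No conflict: the key is new or both sides agree.
            merged[key] = new_value
        elif overlap == "error":
            raise ValueError(f"existing observation {key} differs from new observation")
        elif overlap == "use_right":
            merged[key] = new_value
        # overlap == "use_left": keep the existing value already in `merged`.
    return merged


existing = {"name": "obs", "x": 0, "y": 0}
new = {"name": "obs5", "x": 0, "y": 0}

print(merge_metadata_dicts(existing, new, overlap="use_left")["name"])
print(merge_metadata_dicts(existing, new, overlap="use_right")["name"])
```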
all combinations
[14]:
# observation 6
df = pd.DataFrame(
{"measurements": np.random.randint(0, 10, 5), "filter": np.ones(5)},
index=pd.date_range("2020-1-1", "2020-1-5"),
)
o6 = hpd.Obs(df, name="obs6", x=100, y=0)
o6
[14]:
| | measurements | filter |
---|---|---|
2020-01-01 | 8 | 1.0 |
2020-01-02 | 7 | 1.0 |
2020-01-03 | 5 | 1.0 |
2020-01-04 | 3 | 1.0 |
2020-01-05 | 6 | 1.0 |
[15]:
# observation 7
df = pd.DataFrame(
{
"measurements": np.concatenate(
[o5["measurements"].values[-1:], np.random.randint(0, 10, 4)]
),
"remarks": ["", "", "", "unreliable", ""],
},
index=pd.date_range("2020-1-4", "2020-1-8"),
)
o7 = hpd.Obs(df, name="obs7", x=0, y=100)
o7
[15]:
| | measurements | remarks |
---|---|---|
2020-01-04 | 7 | |
2020-01-05 | 5 | |
2020-01-06 | 0 | |
2020-01-07 | 9 | unreliable |
2020-01-08 | 6 | |
[16]:
o6.merge_observation(o7, overlap="use_right")
WARNING:hydropandas.observation:function 'merge_observation' not thoroughly tested, please be carefull!
INFO:hydropandas.observation:new observation has a different time series
INFO:hydropandas.observation:merge time series
WARNING:hydropandas.observation:timeseries of observation obs7 overlap withdifferent values
INFO:hydropandas.observation:existing observation name differs from newobservation, use new
INFO:hydropandas.observation:existing observation x differs from newobservation, use new
INFO:hydropandas.observation:existing observation y differs from newobservation, use new
[16]:
| | measurements | remarks | filter |
---|---|---|---|
2020-01-01 | 8 | NaN | 1.0 |
2020-01-02 | 7 | NaN | 1.0 |
2020-01-03 | 5 | NaN | 1.0 |
2020-01-04 | 7 | | 1.0 |
2020-01-05 | 5 | | 1.0 |
2020-01-06 | 0 | | NaN |
2020-01-07 | 9 | unreliable | NaN |
2020-01-08 | 6 | | NaN |
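The shape of this result, with the union of all rows and columns and NaN where a column has no value, can be reproduced with plain pandas. The sketch below uses `DataFrame.combine_first` with values copied from the outputs above. It is an analogy for the `overlap="use_right"` case only and is not taken from the hydropandas source.

```python
import numpy as np
import pandas as pd

# Frames with partly different columns and partly overlapping dates.
left = pd.DataFrame(
    {"measurements": [8, 7, 5, 3, 6], "filter": np.ones(5)},
    index=pd.date_range("2020-01-01", "2020-01-05"),
)
right = pd.DataFrame(
    {"measurements": [7, 5, 0, 9, 6], "remarks": ["", "", "", "unreliable", ""]},
    index=pd.date_range("2020-01-04", "2020-01-08"),
)

# Prefer values from `right`; rows and columns it lacks are filled from `left`.
merged = right.combine_first(left)
print(merged)
```

Columns that exist in only one frame are kept, so `filter` is NaN after 2020-01-05 and `remarks` is NaN before 2020-01-04.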
Merging observation collections
[17]:
# create an observation collection from a single observation
oc1 = hpd.ObsCollection(o1)
We can add a single observation to this collection using the add_observation method.
[18]:
oc1.add_observation(o2)
oc1
INFO:hydropandas.obs_collection:observation name obs already in collection, merging observations
WARNING:hydropandas.observation:function 'merge_observation' not thoroughly tested, please be carefull!
INFO:hydropandas.observation:new observation has a different time series
INFO:hydropandas.observation:merge time series
INFO:hydropandas.observation:new and existing observation have the same metadata
[18]:
name | x | y | filename | source | unit | obs |
---|---|---|---|---|---|---|
obs | 0 | 0 | | | | Obs obs -----metadata------ name : obs x : 0 ... |
We can also combine two observation collections.
[19]:
# create another observation collection from a list of observations
oc2 = hpd.ObsCollection([o5, o6])
oc2
# add the collection to the previous one
oc1.add_obs_collection(oc2, inplace=True)
oc1
INFO:hydropandas.obs_collection:adding obs5 to collection
INFO:hydropandas.obs_collection:adding obs6 to collection
[19]:
name | x | y | filename | source | unit | obs |
---|---|---|---|---|---|---|
obs | 0 | 0 | | | | Obs obs -----metadata------ name : obs x : 0 ... |
obs5 | 0 | 0 | | | | Obs obs5 -----metadata------ name : obs5 x : ... |
obs6 | 100 | 0 | | | | Obs obs6 -----metadata------ name : obs6 x : ... |
Overlap is checked automatically based on the name of the observations. If an observation with the same name is already in the collection, the new observation is merged with the existing one.
[20]:
# add o2 to the observation collection 1
oc1.add_observation(o2)
INFO:hydropandas.obs_collection:observation name obs already in collection, merging observations
WARNING:hydropandas.observation:function 'merge_observation' not thoroughly tested, please be carefull!
INFO:hydropandas.observation:new observation has a different time series
INFO:hydropandas.observation:merge time series
INFO:hydropandas.observation:new and existing observation have the same metadata
If the observation you want to add has the same name but different values for the same time steps, an error is raised.
[21]:
o1_mod = o1.copy()
o1_mod.loc["2020-01-02", "measurements"] = 100
oc1.add_observation(o1_mod)
INFO:hydropandas.obs_collection:observation name obs already in collection, merging observations
WARNING:hydropandas.observation:function 'merge_observation' not thoroughly tested, please be carefull!
INFO:hydropandas.observation:new observation has a different time series
INFO:hydropandas.observation:merge time series
WARNING:hydropandas.observation:timeseries of observation obs overlap withdifferent values
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-21-ffb5717f4bf3> in <module>
1 o1_mod = o1.copy()
2 o1_mod.loc["2020-01-02", "measurements"] = 100
----> 3 oc1.add_observation(o1_mod)
c:\users\oebbe\02_python\hydropandas\hydropandas\obs_collection.py in add_observation(self, o, check_consistency, **kwargs)
876
877 o1 = self.loc[o.name, "obs"]
--> 878 omerged = o1.merge_observation(o, **kwargs)
879
880 # overwrite observation in collection
c:\users\oebbe\02_python\hydropandas\hydropandas\observation.py in merge_observation(self, right, overlap, merge_metadata)
427
428 # merge timeseries
--> 429 o = self._merge_timeseries(right, overlap=overlap)
430
431 # merge metadata
c:\users\oebbe\02_python\hydropandas\hydropandas\observation.py in _merge_timeseries(self, right, overlap)
336 )
337 if overlap == "error":
--> 338 raise ValueError(
339 "observations have different values for same time steps"
340 )
ValueError: observations have different values for same time steps
To avoid errors we can use the overlap argument to specify which observation we want to use.
[22]:
oc1.add_observation(o1_mod, overlap="use_left")
oc1
INFO:hydropandas.obs_collection:observation name obs already in collection, merging observations
WARNING:hydropandas.observation:function 'merge_observation' not thoroughly tested, please be carefull!
INFO:hydropandas.observation:new observation has a different time series
INFO:hydropandas.observation:merge time series
WARNING:hydropandas.observation:timeseries of observation obs overlap withdifferent values
INFO:hydropandas.observation:new and existing observation have the same metadata
[22]:
name | x | y | filename | source | unit | obs |
---|---|---|---|---|---|---|
obs | 0 | 0 | | | | Obs obs -----metadata------ name : obs x : 0 ... |
obs5 | 0 | 0 | | | | Obs obs5 -----metadata------ name : obs5 x : ... |
obs6 | 100 | 0 | | | | Obs obs6 -----metadata------ name : obs6 x : ... |
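The name-based merging used in this section can be pictured with a small pure-Python model. In the sketch below a collection is a dict that maps an observation name to a `{timestamp: value}` dict. The function `add_to_collection` is a hypothetical helper for illustration and does not reflect how `ObsCollection.add_observation` is implemented.

```python
def add_to_collection(collection, name, series, overlap="error"):
    """Add a {timestamp: value} series under `name`, merging on a name clash.

    Hypothetical helper for illustration only.
    """
    if overlap not in ("error", "use_left", "use_right"):
        raise ValueError(f"unknown overlap option: {overlap!r}")

    if name not in collection:
        # New name: simply store a copy of the series.
        collection[name] = dict(series)
        return collection

    existing = collection[name]
    merged = dict(existing)
    for timestamp, value in series.items():
        if timestamp in existing and existing[timestamp] != value:
            if overlap == "error":
                raise ValueError("observations have different values for same time steps")
            if overlap == "use_left":
                continue  # keep the value that is already in the collection
        merged[timestamp] = value

    # Only overwrite the stored series once the whole merge has succeeded.
    collection[name] = merged
    return collection


collection = {}
add_to_collection(collection, "obs", {"2020-01-01": 4, "2020-01-02": 4})
add_to_collection(collection, "obs", {"2020-01-03": 9})  # same name: merged
add_to_collection(collection, "obs", {"2020-01-02": 100}, overlap="use_left")
print(collection)
```

Because the stored series is replaced only after the loop finishes, a failed merge leaves the collection unchanged in this sketch.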