Merging observations

This notebook shows how observations and observation collections can be merged. Merging observations can be useful if:

- you have data from multiple sources measuring at the same location
- you get new measurements that you want to add to the old measurements

Notebook contents

  1. Simple merge

  2. Merge options

  3. Merging observation collections

[1]:
import numpy as np
import pandas as pd
import hydropandas as hpd
from IPython.display import display

import logging

hpd.util.get_color_logger("INFO");

Simple merge

[2]:
# observation 1
df = pd.DataFrame(
    {"measurements": np.random.randint(0, 10, 5)},
    index=pd.date_range("2020-1-1", "2020-1-5"),
)
o1 = hpd.Obs(df, name="obs", x=0, y=0)
print(o1)
Obs obs
-----metadata------
name : obs
x : 0
y : 0
filename :
source :
unit :

-----time series------
            measurements
2020-01-01             4
2020-01-02             4
2020-01-03             9
2020-01-04             3
2020-01-05             8
[3]:
# observation 2
df = pd.DataFrame(
    {"measurements": np.random.randint(0, 10, 5)},
    index=pd.date_range("2020-1-6", "2020-1-10"),
)
o2 = hpd.Obs(df, name="obs", x=0, y=0)
print(o2)
Obs obs
-----metadata------
name : obs
x : 0
y : 0
filename :
source :
unit :

-----time series------
            measurements
2020-01-06             7
2020-01-07             3
2020-01-08             8
2020-01-09             6
2020-01-10             3
[4]:
o1.merge_observation(o2)
WARNING:hydropandas.observation:function 'merge_observation' not thoroughly tested, please be carefull!
INFO:hydropandas.observation:new observation has a different time series
INFO:hydropandas.observation:merge time series
INFO:hydropandas.observation:new and existing observation have the same metadata
[4]:
measurements
2020-01-01 4
2020-01-02 4
2020-01-03 9
2020-01-04 3
2020-01-05 8
2020-01-06 7
2020-01-07 3
2020-01-08 8
2020-01-09 6
2020-01-10 3
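
For two series that share no time steps, the merged result is what you would get by concatenating the frames and sorting on the index. The block below is a minimal pandas-only sketch of that idea, not the hydropandas implementation; the names `left` and `right` are illustrative, and the values are copied from o1 and o2 above.

```python
import pandas as pd

# two non-overlapping daily series, mirroring o1 and o2 above
left = pd.DataFrame(
    {"measurements": [4, 4, 9, 3, 8]},
    index=pd.date_range("2020-1-1", "2020-1-5"),
)
right = pd.DataFrame(
    {"measurements": [7, 3, 8, 6, 3]},
    index=pd.date_range("2020-1-6", "2020-1-10"),
)

# without shared time steps, merging reduces to concatenation
merged = pd.concat([left, right]).sort_index()
print(merged)
```

Unlike merge_observation, this sketch ignores metadata entirely and would silently keep duplicate dates if the series did overlap.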

Merge options

overlapping timeseries

[5]:
# create a partly overlapping dataframe
df = pd.DataFrame(
    {
        "measurements": np.concatenate(
            [o1["measurements"].values[-2:], np.random.randint(0, 10, 3)]
        )
    },
    index=pd.date_range("2020-1-4", "2020-1-8"),
)
o3 = hpd.Obs(df, name="obs", x=0, y=0)
print(o3)
Obs obs
-----metadata------
name : obs
x : 0
y : 0
filename :
source :
unit :

-----time series------
            measurements
2020-01-04             3
2020-01-05             8
2020-01-06             4
2020-01-07             5
2020-01-08             1
[6]:
o1.merge_observation(o3)
WARNING:hydropandas.observation:function 'merge_observation' not thoroughly tested, please be carefull!
INFO:hydropandas.observation:new observation has a different time series
INFO:hydropandas.observation:merge time series
INFO:hydropandas.observation:new and existing observation have the same metadata
[6]:
measurements
2020-01-01 4
2020-01-02 4
2020-01-03 9
2020-01-04 3
2020-01-05 8
2020-01-06 4
2020-01-07 5
2020-01-08 1
[7]:
# create a partly overlapping dataframe with different values
df = pd.DataFrame(
    {"measurements": np.random.randint(0, 10, 5)},
    index=pd.date_range("2020-1-4", "2020-1-8"),
)
o4 = hpd.Obs(df, name="obs", x=0, y=0)
print(o4)
Obs obs
-----metadata------
name : obs
x : 0
y : 0
filename :
source :
unit :

-----time series------
            measurements
2020-01-04             3
2020-01-05             4
2020-01-06             0
2020-01-07             4
2020-01-08             0

By default, an error is raised if the overlapping time series have different values.

[8]:
o1.merge_observation(o4)
WARNING:hydropandas.observation:function 'merge_observation' not thoroughly tested, please be carefull!
INFO:hydropandas.observation:new observation has a different time series
INFO:hydropandas.observation:merge time series
WARNING:hydropandas.observation:timeseries of observation obs overlap withdifferent values
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-8-fb7c0e48ad44> in <module>
----> 1 o1.merge_observation(o4)

c:\users\oebbe\02_python\hydropandas\hydropandas\observation.py in merge_observation(self, right, overlap, merge_metadata)
    427
    428         # merge timeseries
--> 429         o = self._merge_timeseries(right, overlap=overlap)
    430
    431         # merge metadata

c:\users\oebbe\02_python\hydropandas\hydropandas\observation.py in _merge_timeseries(self, right, overlap)
    336                 )
    337                 if overlap == "error":
--> 338                     raise ValueError(
    339                         "observations have different values for same time steps"
    340                     )

ValueError: observations have different values for same time steps
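
To see which time steps cause such an error before merging, the two series can be compared on their shared dates. This is a hedged pandas-only sketch, assuming both series use the same column name; `left`, `right`, `shared` and `conflicts` are illustrative names, and the values are copied from o1 and o4 above.

```python
import pandas as pd

left = pd.DataFrame(
    {"measurements": [4, 4, 9, 3, 8]},
    index=pd.date_range("2020-1-1", "2020-1-5"),
)
right = pd.DataFrame(
    {"measurements": [3, 4, 0, 4, 0]},
    index=pd.date_range("2020-1-4", "2020-1-8"),
)

# time steps that occur in both series
shared = left.index.intersection(right.index)

# shared time steps where the two series disagree
differs = left.loc[shared, "measurements"] != right.loc[shared, "measurements"]
conflicts = shared[differs.to_numpy()]
print(conflicts)
```

Here the series share 2020-01-04 and 2020-01-05, and only the second date holds different values (8 versus 4).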

With the ‘overlap’ argument you can specify whether the left (existing) or the right (new) observation is used where the time series overlap. See the example below.

[9]:
print("use left")
display(o1.merge_observation(o4, overlap="use_left"))  # use the existing observation
print("use right")
display(o1.merge_observation(o4, overlap="use_right"))  # use the new observation
use left
WARNING:hydropandas.observation:function 'merge_observation' not thoroughly tested, please be carefull!
INFO:hydropandas.observation:new observation has a different time series
INFO:hydropandas.observation:merge time series
WARNING:hydropandas.observation:timeseries of observation obs overlap withdifferent values
INFO:hydropandas.observation:new and existing observation have the same metadata
measurements
2020-01-01 4
2020-01-02 4
2020-01-03 9
2020-01-04 3
2020-01-05 8
2020-01-06 0
2020-01-07 4
2020-01-08 0
use right
WARNING:hydropandas.observation:function 'merge_observation' not thoroughly tested, please be carefull!
INFO:hydropandas.observation:new observation has a different time series
INFO:hydropandas.observation:merge time series
WARNING:hydropandas.observation:timeseries of observation obs overlap withdifferent values
INFO:hydropandas.observation:new and existing observation have the same metadata
measurements
2020-01-01 4
2020-01-02 4
2020-01-03 9
2020-01-04 3
2020-01-05 4
2020-01-06 0
2020-01-07 4
2020-01-08 0
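
The two options behave much like pandas' `combine_first`, called on either the existing or the new series. The sketch below reproduces the two results above with plain DataFrames; it is an analogy under that assumption, not a description of how hydropandas implements the merge.

```python
import pandas as pd

left = pd.DataFrame(
    {"measurements": [4, 4, 9, 3, 8]},
    index=pd.date_range("2020-1-1", "2020-1-5"),
)
right = pd.DataFrame(
    {"measurements": [3, 4, 0, 4, 0]},
    index=pd.date_range("2020-1-4", "2020-1-8"),
)

# keep the existing values on shared dates (comparable to overlap="use_left")
use_left = left.combine_first(right)

# keep the new values on shared dates (comparable to overlap="use_right")
use_right = right.combine_first(left)

print(use_left["measurements"].tolist())
print(use_right["measurements"].tolist())
```

Note that `combine_first` treats NaN in the calling frame as missing and fills it from the other frame, and that it may change integer columns to floats, depending on the pandas version.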

metadata

The merge_observation method checks by default if the metadata of the two observations is the same.

[10]:
# observation 5
df = pd.DataFrame(
    {"measurements": np.random.randint(0, 10, 5)},
    index=pd.date_range("2020-1-6", "2020-1-10"),
)
o5 = hpd.Obs(df, name="obs5", x=0, y=0)
o5
[10]:
measurements
2020-01-06 0
2020-01-07 6
2020-01-08 8
2020-01-09 6
2020-01-10 7

When the metadata differs, a ValueError is raised.

[11]:
o1.merge_observation(o5)
WARNING:hydropandas.observation:function 'merge_observation' not thoroughly tested, please be carefull!
INFO:hydropandas.observation:new observation has a different time series
INFO:hydropandas.observation:merge time series
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-204327616616> in <module>
----> 1 o1.merge_observation(o5)

c:\users\oebbe\02_python\hydropandas\hydropandas\observation.py in merge_observation(self, right, overlap, merge_metadata)
    432         if merge_metadata:
    433             metadata = {key: getattr(right, key) for key in right._metadata}
--> 434             new_metadata = self.merge_metadata(metadata, overlap=overlap)
    435         else:
    436             new_metadata = {key: getattr(self, key) for key in self._metadata}

c:\users\oebbe\02_python\hydropandas\hydropandas\observation.py in merge_metadata(self, right, overlap)
    245                     same_metadata = False
    246                     if overlap == "error":
--> 247                         raise ValueError(
    248                             f"existing observation {key} differs from new observation"
    249                         )

ValueError: existing observation name differs from new observation

If you set the merge_metadata argument to False, the metadata is not merged and only the time series of the observations are merged. The result keeps the metadata of the existing observation.

[12]:
o1.merge_observation(o5, merge_metadata=False)
WARNING:hydropandas.observation:function 'merge_observation' not thoroughly tested, please be carefull!
INFO:hydropandas.observation:new observation has a different time series
INFO:hydropandas.observation:merge time series
[12]:
measurements
2020-01-01 4
2020-01-02 4
2020-01-03 9
2020-01-04 3
2020-01-05 8
2020-01-06 0
2020-01-07 6
2020-01-08 8
2020-01-09 6
2020-01-10 7

Just as with overlapping time series, the ‘overlap’ argument can also be used for conflicting metadata values.

[13]:
o_merged = o1.merge_observation(o5, overlap="use_left", merge_metadata=True)
print('observation name when overlap="use_left":', o_merged.name)
o_merged = o1.merge_observation(o5, overlap="use_right", merge_metadata=True)
print('observation name when overlap="use_right":', o_merged.name)
WARNING:hydropandas.observation:function 'merge_observation' not thoroughly tested, please be carefull!
INFO:hydropandas.observation:new observation has a different time series
INFO:hydropandas.observation:merge time series
INFO:hydropandas.observation:existing observation name differs from newobservation, use existing
observation name when overlap="use_left": obs
WARNING:hydropandas.observation:function 'merge_observation' not thoroughly tested, please be carefull!
INFO:hydropandas.observation:new observation has a different time series
INFO:hydropandas.observation:merge time series
INFO:hydropandas.observation:existing observation name differs from newobservation, use new
observation name when overlap="use_right": obs5
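
The metadata rules shown above can be sketched with plain dictionaries. The function `merge_metadata_sketch` is hypothetical and written only for illustration; it mimics the three overlap options ("error", "use_left", "use_right") but is not the hydropandas code.

```python
def merge_metadata_sketch(left, right, overlap="error"):
    """Merge two metadata dicts; `overlap` decides what happens on conflict."""
    if overlap not in ("error", "use_left", "use_right"):
        raise ValueError(f"unknown overlap option: {overlap}")
    merged = dict(left)
    for key, new_value in right.items():
        if key not in merged or merged[key] == new_value:
            # no conflict: the key is new or both values agree
            merged[key] = new_value
        elif overlap == "error":
            raise ValueError(
                f"existing observation {key} differs from new observation"
            )
        elif overlap == "use_right":
            merged[key] = new_value
        # overlap == "use_left": keep the existing value
    return merged


left_meta = {"name": "obs", "x": 0, "y": 0}
right_meta = {"name": "obs5", "x": 0, "y": 0}

print(merge_metadata_sketch(left_meta, right_meta, overlap="use_left")["name"])
print(merge_metadata_sketch(left_meta, right_meta, overlap="use_right")["name"])
```

With the default `overlap="error"` the same call raises a ValueError, because the two names differ.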

all combinations

[14]:
# observation 6
df = pd.DataFrame(
    {"measurements": np.random.randint(0, 10, 5), "filter": np.ones(5)},
    index=pd.date_range("2020-1-1", "2020-1-5"),
)
o6 = hpd.Obs(df, name="obs6", x=100, y=0)
o6
[14]:
measurements filter
2020-01-01 8 1.0
2020-01-02 7 1.0
2020-01-03 5 1.0
2020-01-04 3 1.0
2020-01-05 6 1.0
[15]:
# observation 7
df = pd.DataFrame(
    {
        "measurements": np.concatenate(
            [o5["measurements"].values[-1:], np.random.randint(0, 10, 4)]
        ),
        "remarks": ["", "", "", "unreliable", ""],
    },
    index=pd.date_range("2020-1-4", "2020-1-8"),
)
o7 = hpd.Obs(df, name="obs7", x=0, y=100)
o7
[15]:
            measurements     remarks
2020-01-04             7
2020-01-05             5
2020-01-06             0
2020-01-07             9  unreliable
2020-01-08             6
[16]:
o6.merge_observation(o7, overlap="use_right")
WARNING:hydropandas.observation:function 'merge_observation' not thoroughly tested, please be carefull!
INFO:hydropandas.observation:new observation has a different time series
INFO:hydropandas.observation:merge time series
WARNING:hydropandas.observation:timeseries of observation obs7 overlap withdifferent values
INFO:hydropandas.observation:existing observation name differs from newobservation, use new
INFO:hydropandas.observation:existing observation x differs from newobservation, use new
INFO:hydropandas.observation:existing observation y differs from newobservation, use new
[16]:
            measurements     remarks  filter
2020-01-01             8         NaN     1.0
2020-01-02             7         NaN     1.0
2020-01-03             5         NaN     1.0
2020-01-04             7                 1.0
2020-01-05             5                 1.0
2020-01-06             0                 NaN
2020-01-07             9  unreliable     NaN
2020-01-08             6                 NaN

Merging observation collections

[17]:
# create an observation collection from a single observation
oc1 = hpd.ObsCollection(o1)

We can add a single observation to this collection using the add_observation method.

[18]:
oc1.add_observation(o2)
oc1
INFO:hydropandas.obs_collection:observation name obs already in collection, merging observations
WARNING:hydropandas.observation:function 'merge_observation' not thoroughly tested, please be carefull!
INFO:hydropandas.observation:new observation has a different time series
INFO:hydropandas.observation:merge time series
INFO:hydropandas.observation:new and existing observation have the same metadata
[18]:
x y filename source unit obs
name
obs 0 0 Obs obs -----metadata------ name : obs x : 0 ...

We can also combine two observation collections.

[19]:
# create another observation collection from a list of observations
oc2 = hpd.ObsCollection([o5, o6])
oc2

# add the collection to the previous one
oc1.add_obs_collection(oc2, inplace=True)
oc1
INFO:hydropandas.obs_collection:adding obs5 to collection
INFO:hydropandas.obs_collection:adding obs6 to collection
[19]:
x y filename source unit obs
name
obs 0 0 Obs obs -----metadata------ name : obs x : 0 ...
obs5 0 0 Obs obs5 -----metadata------ name : obs5 x : ...
obs6 100 0 Obs obs6 -----metadata------ name : obs6 x : ...

There is an automatic check for overlap based on the name of the observations. If an observation with the same name already exists in the collection, the new observation is merged with the existing one.

[20]:
# add o2 to the observation collection 1
oc1.add_observation(o2)
INFO:hydropandas.obs_collection:observation name obs already in collection, merging observations
WARNING:hydropandas.observation:function 'merge_observation' not thoroughly tested, please be carefull!
INFO:hydropandas.observation:new observation has a different time series
INFO:hydropandas.observation:merge time series
INFO:hydropandas.observation:new and existing observation have the same metadata

If the observation you want to add has the same name but different values for the same time steps, an error is raised.

[21]:
o1_mod = o1.copy()
o1_mod.loc["2020-01-02", "measurements"] = 100
oc1.add_observation(o1_mod)
INFO:hydropandas.obs_collection:observation name obs already in collection, merging observations
WARNING:hydropandas.observation:function 'merge_observation' not thoroughly tested, please be carefull!
INFO:hydropandas.observation:new observation has a different time series
INFO:hydropandas.observation:merge time series
WARNING:hydropandas.observation:timeseries of observation obs overlap withdifferent values
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-21-ffb5717f4bf3> in <module>
      1 o1_mod = o1.copy()
      2 o1_mod.loc["2020-01-02", "measurements"] = 100
----> 3 oc1.add_observation(o1_mod)

c:\users\oebbe\02_python\hydropandas\hydropandas\obs_collection.py in add_observation(self, o, check_consistency, **kwargs)
    876
    877             o1 = self.loc[o.name, "obs"]
--> 878             omerged = o1.merge_observation(o, **kwargs)
    879
    880             # overwrite observation in collection

c:\users\oebbe\02_python\hydropandas\hydropandas\observation.py in merge_observation(self, right, overlap, merge_metadata)
    427
    428         # merge timeseries
--> 429         o = self._merge_timeseries(right, overlap=overlap)
    430
    431         # merge metadata

c:\users\oebbe\02_python\hydropandas\hydropandas\observation.py in _merge_timeseries(self, right, overlap)
    336                 )
    337                 if overlap == "error":
--> 338                     raise ValueError(
    339                         "observations have different values for same time steps"
    340                     )

ValueError: observations have different values for same time steps

To avoid this error, we can use the overlap argument to specify which observation we want to use.

[22]:
oc1.add_observation(o1_mod, overlap="use_left")
oc1
INFO:hydropandas.obs_collection:observation name obs already in collection, merging observations
WARNING:hydropandas.observation:function 'merge_observation' not thoroughly tested, please be carefull!
INFO:hydropandas.observation:new observation has a different time series
INFO:hydropandas.observation:merge time series
WARNING:hydropandas.observation:timeseries of observation obs overlap withdifferent values
INFO:hydropandas.observation:new and existing observation have the same metadata
[22]:
x y filename source unit obs
name
obs 0 0 Obs obs -----metadata------ name : obs x : 0 ...
obs5 0 0 Obs obs5 -----metadata------ name : obs5 x : ...
obs6 100 0 Obs obs6 -----metadata------ name : obs6 x : ...
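
The name-based behaviour of a collection can be summarised in a short sketch that uses a plain dict of pandas Series in place of an ObsCollection. The helper `add_series_sketch` is hypothetical and only illustrates the logic described in this section: add the series if the name is new, otherwise merge it, and raise an error on conflicting values unless an overlap option is given.

```python
import pandas as pd


def add_series_sketch(collection, name, series, overlap="error"):
    """Add `series` to `collection` (a dict); merge if `name` already exists."""
    if overlap not in ("error", "use_left", "use_right"):
        raise ValueError(f"unknown overlap option: {overlap}")
    if name not in collection:
        collection[name] = series
        return
    existing = collection[name]
    # compare the values on the time steps that both series share
    shared = existing.index.intersection(series.index)
    conflict = bool((existing.loc[shared] != series.loc[shared]).any())
    if conflict and overlap == "error":
        raise ValueError("observations have different values for same time steps")
    if overlap == "use_right":
        collection[name] = series.combine_first(existing)
    else:
        collection[name] = existing.combine_first(series)


idx = pd.date_range("2020-1-1", "2020-1-5")
collection = {}
add_series_sketch(collection, "obs", pd.Series([4, 4, 9, 3, 8], index=idx))

# same name, one modified value: the existing value wins with "use_left"
modified = pd.Series([4, 100, 9, 3, 8], index=idx)
add_series_sketch(collection, "obs", modified, overlap="use_left")
print(collection["obs"].loc[pd.Timestamp("2020-01-02")])
```

Calling the helper with `overlap="use_right"` instead would store the modified value of 100, and leaving out the overlap argument would raise a ValueError.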