pyasdf: tags and labels

Phil_Cummins · February 12, 2017, 2:03am

Hi,

I’m not sure where else to ask questions about pyasdf - is there a separate mailing list? Anyway, I thought I’d try here.

I have a question about waveform tags vs. labels. My understanding is that tags are a coarser way of organizing the data that affects how it is stored on disk, and labels allow finer grained organization. E.g., you might use a tag to distinguish data and synthetics, and then labels to distinguish synthetics calculated for different earth models.

However, I don’t understand why this is useful if you can’t store data having the same tag but different labels. E.g., if I do:
ds.add_waveforms(obspy.read(), tag="synthetic", labels=["model1"])
and then
ds.add_waveforms(obspy.read(), tag="synthetic", labels=["model2"])
I get a warning:
ASDFWarning: Data 'IU.SDV/IU.SDV.00.BHZ__2015-09-16T23:11:12__2015-09-16T23:21:42__synthetic' already exists in file. Will not be added!

So it would seem that every add_waveforms call must use a distinct tag. How then can I give different labels to groups of traces within the same tag?

E.g., could I set the label in the ASDF attributes of the trace before calling add_waveforms, and then do the add_waveforms call without specifying the labels? Would my traces then get saved to the same tag but with different labels?

Thanks,

Phil

Phil_Cummins · February 12, 2017, 3:21am

Hi again,

Quick answer to my own question:

In [1]: st=read()
In [2]: st
Out[2]:
3 Trace(s) in Stream:
BW.RJOB..EHZ | 2009-08-24T00:20:03.000000Z - 2009-08-24T00:20:32.990000Z | 100.0 Hz, 3000 samples
BW.RJOB..EHN | 2009-08-24T00:20:03.000000Z - 2009-08-24T00:20:32.990000Z | 100.0 Hz, 3000 samples
BW.RJOB..EHE | 2009-08-24T00:20:03.000000Z - 2009-08-24T00:20:32.990000Z | 100.0 Hz, 3000 samples

In [3]: for i,tr in enumerate(st): # Insert labels in tr.stats.asdf
   ...:     tr.stats.asdf = {}
   ...:     tr.stats.asdf['labels'] = ['model%d' % i]
In [4]: ds = pyasdf.ASDFDataSet("tmp2") # Open ASDF file
In [5]: ds.add_waveforms(st,tag='synthetic') # Add waveforms
In [6]: for sta in ds.ifilter(ds.q.labels == 'model1'): # Retrieve model1 traces
   ...:     print sta['synthetic'][0].stats

network: BW
station: RJOB
location:
channel: EHN
starttime: 2009-08-24T00:20:03.000000Z
endtime: 2009-08-24T00:20:32.990000Z
sampling_rate: 100.0
delta: 0.01
npts: 3000
calib: 1.0
_format: ASDF
asdf: AttribDict({'labels': [u'model1'], 'tag': u'synthetic', 'format_version': '1.0.0'})

So, putting the label in the asdf attribute dictionary before the call to ds.add_waveforms() seems to successfully store traces with the same tag but different labels. But is this the way it’s supposed to work? I.e., behavior unlikely to change in future versions of ASDF?

Thanks,

Phil

Phil_Cummins · February 12, 2017, 4:13am

Hi once again,

Ah, I think I see. When I store waveforms with identical header and tag info, but different labels, they appear to get stored in the same place, i.e. overwritten. Oddly, sometimes it complains that data with the same header info and tag are already there, so it won’t add them, and other times it doesn’t complain, but presumably overwrites.

So now I think I understand that the label is not used at all in storing the data (which I guess is what the doco said), but is only useful for retrieving groups of traces - e.g., I think you can retrieve a bunch of traces with different tags, but the same label.

Sorry for bothering everyone…

Phil

LionKrischer · March 9, 2017, 12:35pm

Hi Phil,

sorry for the late answer - I somehow did not notice this email.

I assume you saw this page here:

Tags are used as part of the array name inside the HDF5 file. Thus one can only store one piece of data with the same station, temporal extent, and tag. The array name is "{NET}.{STA}/{NET}.{STA}.{LOC}.{CHA}__{ST}__{ET}__{TAG}", with ST and ET being the start and end times.
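
A minimal sketch of why the second add_waveforms call in the original question collides, assuming the name is built purely from the pattern quoted above. The helper array_name is hypothetical and not part of pyasdf; the time strings are copied from the warning message earlier in the thread.

```python
def array_name(net, sta, loc, cha, start, end, tag):
    """Build an array name following the pattern quoted above.

    Hypothetical helper for illustration only; not a pyasdf function.
    Labels are deliberately not a parameter, because they are not part
    of the name.
    """
    return f"{net}.{sta}/{net}.{sta}.{loc}.{cha}__{start}__{end}__{tag}"


start = "2015-09-16T23:11:12"
end = "2015-09-16T23:21:42"

# Same trace, same tag: the labels (model1 vs. model2) never enter the name.
first = array_name("IU", "SDV", "00", "BHZ", start, end, "synthetic")
second = array_name("IU", "SDV", "00", "BHZ", start, end, "synthetic")

# A different tag is what makes the name unique.
other = array_name("IU", "SDV", "00", "BHZ", start, end, "syn_model_a")

print(first)
print(first == second)
print(first == other)
```

Both calls with tag "synthetic" map to the same name, so the second one is rejected no matter which labels it carries.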

Some kind of unique identifier is necessary to distinguish otherwise identical data - in this case it is the tag. The labels are just additional, optional information attached to waveforms. The tag is pretty free-form and the only “defined” tag is “raw_recording”, reserved for data straight from a digitizer/data center. A schema that might work for your case:

tag="syn_model_a", labels=["synthetic", "model_a"]
tag="syn_model_b", labels=["synthetic", "model_b"]
tag="processed", labels=["data", "instrument_corrected"]
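
One possible way to apply such a schema consistently is sketched below. The SCHEMA mapping and the add_with_schema helper are hypothetical names introduced here for illustration; the only pyasdf call used is add_waveforms with the tag and labels arguments already shown earlier in this thread.

```python
# Hypothetical mapping from a dataset kind to (tag, labels),
# following the schema suggested above.
SCHEMA = {
    "model_a": ("syn_model_a", ["synthetic", "model_a"]),
    "model_b": ("syn_model_b", ["synthetic", "model_b"]),
    "processed": ("processed", ["data", "instrument_corrected"]),
}


def add_with_schema(ds, stream, kind):
    """Add `stream` to `ds` with the tag and labels registered for `kind`.

    `ds` is assumed to be an open pyasdf.ASDFDataSet (or any object that
    offers a compatible add_waveforms method). Returns the tag that was used.
    """
    tag, labels = SCHEMA[kind]
    ds.add_waveforms(stream, tag=tag, labels=labels)
    return tag
```

With an open data set this would be called as add_with_schema(ds, obspy.read(), "model_a"). Because every kind has its own tag, the array names stay unique, while the shared "synthetic" label still allows all synthetics to be selected together.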

You can search over data via the labels with this interface:

It should not overwrite the data even if the labels differ. It should raise a warning stating that the data already exists and otherwise do nothing. I just tried to reproduce this and noticed that the warning is only raised once on Python 2 but every time on Python 3. So I presume you ran on Python 2? In any case - this is now fixed in the latest repository version and you should see the warning every time.

If you actually want to overwrite data you have to explicitly remove it first:

See comment above about tags and labels.

Hope it helps!

Lion