Trace header fields and additional file format specific headers -- here be dragons...

Hi all,

I just wrote a fairly lengthy comment on GitHub on the use of file
format specific header fields. Since things can go wrong if one doesn't
keep a few things in mind when using these, I'll share the comment here.

If you only use the main header fields in Trace.stats, it is pretty
safe to ignore this mail.

  * Trace.stats is the main part of the header, which every data segment
needs to make sense. Any output waveform file format plugin will
interpret and use these fields during output. You can think of
Trace.stats as the lowest common denominator when it comes to describing
a contiguous block of samples. (Be aware that misspelling a field name
in an assignment leads to unpleasant surprises: e.g. setting
Trace.stats.samlping_rate = 20 will not raise an exception, but it will
not change the sampling rate either. The value ends up in a new stats
field that is silently ignored.)
  * Most waveform formats have additional header fields. During reading,
these additional fields cannot reasonably be stored directly in the
common header (i.e. Trace.stats), because they differ from file format
to file format and thus are not compatible with every other format.
Instead they are stored in an additional AttribDict attached as e.g.
Trace.stats.segy or Trace.stats.mseed. After reading, there is also an
attribute Trace.stats._format (e.g. 'SEGY', 'MSEED'), which can be used
to find out which file format the trace came from.
  * During waveform output, waveform plugins will use additional header
information if they encounter a matching entry in Trace.stats, e.g.
Trace.stats.segy when writing to "SEGY" format. (Again, only entries
that are defined for that waveform format will be used. So take care
not to use SAC header field names in a write operation to SEGY.
Unrecognized fields are silently disregarded.)
  * Unfortunately, there can be conflicts between "normal" and
"additional" file format specific header fields. For example, SAC
stores the start time of the trace in a very peculiar form, spread over
a whole bunch of different fields (reference time, offset from reference
time, and so on). If you set one of these timing related fields manually
in Trace.stats.sac, it collides with the timing setting from the
"normal" header, and there might be unexpected results and unpleasant
surprises in the output files. So when using additional header fields,
always check whether there might be conflicts with the main header
fields (timing, sampling rate etc.) and double check that the results
are what you expect (e.g. write the output, read it back in and check
it).
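
To guard against the misspelling trap from the first point, something
like the following might help. It is only a rough sketch: the helper
name suspicious_keys and the KNOWN_KEYS set are made up for this mail
and are not part of ObsPy. The set reflects my understanding of the
default Trace.stats fields plus the "processing" entry, so please check
it against your ObsPy version. The helper only relies on dictionary
style access, so it is shown here on a plain dict; passing tr.stats
should work the same way.

```python
from collections.abc import Mapping

# Assumption: the standard main header fields of Trace.stats, plus the
# "processing" history list. Adjust for your ObsPy version.
KNOWN_KEYS = {
    "network", "station", "location", "channel",
    "starttime", "endtime", "sampling_rate", "delta", "npts", "calib",
    "processing",
}


def suspicious_keys(stats):
    """Return header keys that look like typos (hypothetical helper).

    Skipped are known main header fields, private entries such as
    ``_format`` and nested dictionaries, which is how file format
    specific headers like ``stats.sac`` are attached.
    """
    found = []
    for key in stats:
        if key in KNOWN_KEYS or key.startswith("_"):
            continue
        if isinstance(stats[key], Mapping):
            continue
        found.append(key)
    return sorted(found)


# Plain dict standing in for Trace.stats after a misspelled assignment.
header = {
    "sampling_rate": 1.0,
    "npts": 0,
    "_format": "SAC",
    "sac": {"kstnm": "ABC"},
    "samlping_rate": 20,  # typo: stored as a new, ignored field
}
print(suspicious_keys(header))
```

Custom fields you add on purpose will of course be reported as well, so
treat the result as a hint and not as an error.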

Summary:

* Rule 1: If possible at all, thou shalt only use the main header
           fields in Trace.stats. Their behavior is well tested.
* Rule 2: Look left and right before stepping over Rule 1.
* Rule 3: If you have to use file format specific headers, think about
possible conflicts with main header fields (e.g. sampling rate
multipliers in SEGY) and be sure to check that your output is what you
expect it to be, because not every combination of scenarios is
tested. Also be aware that all methods defined on Trace, e.g.
Trace.filter(), only consider what is in the main header. So changing
sampling rate information in e.g. Trace.stats.segy will not affect how
Trace.filter() interprets the trace's sampling rate (which is of course
needed to build the filter).

best,
Tobias

P.S.: If you provide us with a test for a specific scenario (e.g. by
sending a pull request for it) you can make sure the test ends up in the
main repository and thus is being monitored routinely.