Trouble reading this SEED file

filefolder · October 18, 2022, 10:25am

Hi all, possibly more of a puzzle than a question but I wonder if anyone is able to crack open this slightly deformed SEED file. It is event data recorded in 1996 that would be quite valuable to have.

Obspy (1.3.1) is not reading it, and even rdseed has trouble e.g.

	Trying again ignoring effective times.
WARNING (get_stn_chn_Lrecl()):  station/channel SE08/BHE not found in station table.
Unable to determine the logical record length for station/channel SE08/BHE for location:
Defaulting to 4096

Currently leafing through the SEED manual but thought I would ask if anyone has been here before or can see an obvious stupid thing wrong with the attached file. The metadata and response information is all there and easy to extract but the data itself seems to have an inconsistent header somewhere and I am not sure how to get in there and edit it.

(file is 8.2M, too big to upload, but can be downloaded here)

Thanks in advance,

megies · October 19, 2022, 10:10am

What I have done previously with corrupt files is to read the file binary and look for data block starts with regex matching and then you can just grab those blocks you want and mash em together and output

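import re

import numpy as np

# Read the whole file as raw bytes (no text decoding)
data = open("9316600.sed", "rb").read()
net = ''
sta = 'SA01'
loc = ''
cha = 'BHZ'
# 6-digit sequence number, quality flag, one reserved byte, then the padded SEED ID
pattern = f'[0-9]{{6}}[DRQM].{sta:<5}{loc:<2}{cha:<3}{net:<2}'
pattern = re.compile(pattern.encode('ASCII'))
matches = re.finditer(pattern, data)
startpositions = [match.span()[0] for match in matches]
# Spacing between consecutive matches hints at the record length
diffs = np.diff(startpositions)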

record length seems to be 4096 and then you just grab those blocks like..

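out = "first_record.ms"  # placeholder output filename
reclength = 4096  # record length inferred from diffs above
with open(out, "wb") as fh:
    good_data = data[startpositions[0]:startpositions[0]+reclength]
    fh.write(good_data)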

I only looked for data blocks of the first SEED ID I saw, obviously you might have a better idea what else to look for.. you can also iteratively regex hop through the file based on the record number, that 6-digit incrementing thing right at the start, but I’m guessing you got the idea

filefolder · October 19, 2022, 11:45am

Thanks! That’s a very handy piece of code. I will have to experiment with it somewhat, but looping through all the start positions as given I still get a lot of errors (below), which is actually the same as what rdseed gives IIRC. I’m confused why there is a different number of samples per record or how to get around it.


 InternalMSEEDError: Encountered 52 error(s) during a call to readMSEEDBuffer():
msr_unpack_data(_SA01__BHZ_D): only decoded 961 samples of 1008 expected
msr_unpack_data(_SA01__BHZ_D): only decoded 935 samples of 1008 expected
msr_unpack_data(_SA01__BHZ_D): only decoded 938 samples of 1008 expected
msr_unpack_data(_SA01__BHZ_D): only decoded 938 samples of 1008 expected
msr_unpack_data(_SA01__BHZ_D): only decoded 963 samples of 1008 expected
msr_unpack_data(_SA01__BHZ_D): only decoded 930 samples of 1008 expected
msr_unpack_data(_SA01__BHZ_D): only decoded 920 samples of 1008 expected
msr_unpack_data(_SA01__BHZ_D): only decoded 942 samples of 1008 expected
msr_unpack_data(_SA01__BHZ_D): only decoded 962 samples of 1008 expected
msr_unpack_data(_SA01__BHZ_D): only decoded 950 samples of 1008 expected
msr_unpack_data(_SA01__BHZ_D): only decoded 923 samples of 1008 expected
...

I assume the original author of the file was able to read it fine in the 90s, so I wonder if there is an encoding issue somewhere. ISO-8859-1 vs utf-8?

megies · October 19, 2022, 12:16pm

“Decoding” here likely refers to decompressing STEIM, nothing like UTF or similar involved.

IIRC the number of samples in one miniseed data record is variable, since the compression works better the closer neighboring samples are in amplitude and since the length of a record is fixed. That being said, it would be surprising if all records were to contain the same amount of samples (not sure though if the “52” errors means all of the records you read), so just a wild guess, but maybe the expected amount of samples was just set to a wrong and fixed value in the data record headers?

If that’s not what is happening, you might want to ask Chad Trabant what he thinks. I don’t think I can ping him here since, I believe, he hasn’t logged in here..

megies · October 19, 2022, 12:29pm

Btw, the regex matching anywhere in the file was for files with garbled records. In your case it looks like all is in order and throughout the whole file you have the start of each record every 4096 bytes, so you could just hop 4096 bytes all the time and check if it’s a data record or not
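A minimal sketch of that hop-and-check idea, assuming the 4096-byte record length seen in this thread and assuming that a block is a data record if it starts with six ASCII digits followed by one of the quality flags D, R, Q or M (SEED control headers carry other letters there). The function name `iter_data_records` is made up for illustration.

```python
import re

RECLEN = 4096  # record length observed for this file; an assumption for any other file

# Six ASCII digits (sequence number) followed by a data quality flag
DATA_HEADER = re.compile(rb"[0-9]{6}[DRQM]")


def iter_data_records(buf, reclen=RECLEN):
    """Yield (offset, record) for every fixed-size block that looks like a data record."""
    for offset in range(0, len(buf) - reclen + 1, reclen):
        record = buf[offset:offset + reclen]
        # match() only tests the start of the block, so header-like bytes
        # deeper inside a record are ignored
        if DATA_HEADER.match(record):
            yield offset, record


# Possible usage (file name taken from this thread):
# data = open("9316600.sed", "rb").read()
# records = [rec for _, rec in iter_data_records(data)]
```

This only filters blocks by their first seven bytes; it does not verify that the rest of the record is intact.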

filefolder · October 20, 2022, 1:45am

I think the reason I guessed there was a decoding issue was because of this error trying to read a 4096 chunk..

i = 9
good_data = data[startpositions[i]:startpositions[i]+reclen]
#(I can post the whole byte string but its a bit long)
good_data.decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc9 in position 21: invalid continuation byte

but this works if I set encoding to latin-1 or ISO-8859-1 (the same thing?) but I have no idea how relevant this is.

Doing something like the below is… possibly.. getting closer, but trying to read the file gives an InternalMSEEDParseTimeError: Problem decoding time (wrong endian?) error, so I guess the time got garbled now.

with open("test.ms", "wb") as fh: 
    for i in range(len(startpositions)): 
        good_data = data[startpositions[i]:startpositions[i]+reclen]
        good_data = good_data.decode(encoding='ISO-8859-1').encode('utf-8')
        fh.write(good_data)

Obviously I am completely flailing around with this and my understanding of binary/SEED structure isn’t great. I don’t know how to actually read the components of each good_data chunk so am unable to see or change the information in it. As you observed the files are likely to be structurally sound so I am hoping it is something trivial related to the filesystem they were saved on vs the one I am using. May ask CT for advice on it (or at least get the original PIs to!) as you recommend.

megies · October 20, 2022, 4:06pm

filefolder:

I think the reason I guessed there was a decoding issue was because of this error trying to read a 4096 chunk…
i = 9
good_data = data[startpositions[i]:startpositions[i]+reclen]
#(I can post the whole byte string but its a bit long)
good_data.decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc9 in position 21: invalid continuation byte
but this works if I set encoding to latin-1 or ISO-8859-1 (the same thing?) but I have no idea how relevant this is.

Decoding a full SEED record as UTF-8 makes no sense..
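To illustrate why, here is a small sketch with made-up bytes (not taken from the file). Latin-1 maps every byte value to a character, so decoding never fails, but re-encoding as UTF-8 turns each byte of 0x80 or above into two bytes, which shifts everything after it in a binary record.

```python
# Four made-up bytes; two of them are >= 0x80
raw = bytes([0x30, 0xC9, 0x7F, 0x80])

# Latin-1 (alias ISO-8859-1 in Python) accepts any byte value...
text = raw.decode("latin-1")

# ...but UTF-8 needs two bytes for each character above 0x7F
reencoded = text.encode("utf-8")

print(len(raw), len(reencoded))  # 4 6
print(reencoded.hex())           # 30c3897fc280

# The year 1993 stored as a big-endian 16-bit integer
year_bytes = (1993).to_bytes(2, "big")
print(year_bytes.hex())          # 07c9
```

The last lines are there because byte 21 of a data record falls inside the binary start-time field (bytes 20 and 21 hold the year). A 0xC9 at position 21 is consistent with the low byte of a big-endian 1993, so it is ordinary binary header content and not mis-encoded text.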

First thing you want to do is go through the file and only grab data records. Data records are real simple and each 4k record is fully self-contained. The metadata records you have in that file as well are much much more complex, but maybe you do not even need them (if you have the station metadata already).

Let me give it a look real quick

megies · October 20, 2022, 5:46pm

@filefolder Hmm OK, it’s not as easy as it first looked; what I thought were normal MiniSEED records are weird. At the end of the fixed record header, the last item is supposed to indicate where in the record the first blockette (variable chunk of different types) starts. But it says 0, which, if I understand the SEED manual correctly, should be used for “no data”.
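For anyone who wants to inspect those header fields directly, a rough sketch using `struct`, following the 48-byte fixed section of the data header as laid out in the SEED manual. Big-endian byte order is an assumption, and `parse_fixed_header` and the dictionary keys are names made up for this example.

```python
import struct

# 48-byte fixed section of a SEED data header; ">" = big-endian (assumed)
FIXED_HEADER = struct.Struct(">6scc5s2s3s2sHHBBBBHHhhBBBBiHH")


def parse_fixed_header(record):
    """Return selected fixed-header fields of one data record as a dict."""
    f = FIXED_HEADER.unpack_from(record)
    return {
        "sequence": f[0].decode("ascii", errors="replace"),
        "quality": f[1].decode("ascii", errors="replace"),
        "station": f[3].decode("ascii", errors="replace").strip(),
        "location": f[4].decode("ascii", errors="replace").strip(),
        "channel": f[5].decode("ascii", errors="replace").strip(),
        "network": f[6].decode("ascii", errors="replace").strip(),
        "year": f[7],
        "julday": f[8],
        "hour": f[9],
        "minute": f[10],
        "second": f[11],
        "fract": f[13],            # units of 0.0001 s
        "nsamples": f[14],
        "rate_factor": f[15],
        "rate_multiplier": f[16],
        "num_blockettes": f[20],
        "data_offset": f[22],      # bytes 44-45: beginning of data
        "first_blockette": f[23],  # bytes 46-47: offset of first blockette
    }


# Possible usage on one 4096-byte chunk:
# print(parse_fixed_header(good_data))
```

If the year or day-of-year comes out implausible, the opposite byte order (`"<"` instead of `">"`) would be the first thing to try.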

I was looking for the start of a blockette real quick in the data part but couldn’t make sense of it and didn’t find a blockette number that would make sense.

You might need somebody more familiar with the format or used to debugging it, maybe Chad has a minute to give it a look? Not sure who else to ask.

chad-iris · October 21, 2022, 2:20am

Hey,
The 9316600.sed file is a full SEED volume containing both headers and data records. The data records are what I call “bare” records, and technically not miniSEED because they do not include a Blockette 1000 that describes their length and data payload encoding. This was not uncommon with older data which were shipped as full SEED, where the SEED headers described the needed details.

To work with data records without B1000, libmseed searches for the signature of data records, thus skipping things it does not understand like SEED headers, and uses that search to determine record length. There is no way to determine the payload encoding, so libmseed simply has a default, which is Steim-1 because that is most prevalent with such older data. I believe the data payload decoding is failing because of either 1) corruption, or much more likely, 2) the data not being Steim-1 encoded.

My normal, manual method of figuring out the encoding is to try them until I find one that decodes to something looking like a signal. In this case the SEED headers should contain the right details. Luckily rdseed still runs and can give a test output of a file like this:

% rdseed -s -f 9316600.sed

B052F16     Format lookup:     1                   Format Information Follows
B030F03          Format Name: REF32
B030F05          Data family:    0          
B030F06          Number of Keys:    2        
B030F07             Key  1: M0 
B030F07             Key  2: W4 D0-23 C2 
B052F17     Log2 of Data record length:            12
B052F18     Sample rate:                           25

So the record lengths are 2^12 = 4096 bytes, check, and the data encoding is REF32. I’ve never heard of REF32, but the DDL keys, if you’re willing to squint at the SEED manual, describe non-multiplexed (M0), 24-bit integers (D0-23) in 32-bits of space (W4), using 2’s complement signing (C2).
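As a consistency check, the libmseed errors earlier in the thread expected 1008 samples per record, and 1008 × 4 bytes = 4032 = 4096 − 64, which fits a 64-byte header followed by plain 32-bit samples. A minimal NumPy sketch of that interpretation follows; the 64-byte data offset and big-endian byte order are assumptions (the offset would normally be read from bytes 44-45 of each record header), and `decode_int32_payload` is a name made up for this example.

```python
import numpy as np


def decode_int32_payload(record, nsamples, data_offset=64, byteorder=">"):
    """Interpret a record's payload as 32-bit two's-complement integers."""
    dtype = np.dtype(byteorder + "i4")
    # count and offset restrict the view to the samples after the header
    return np.frombuffer(record, dtype=dtype, count=nsamples, offset=data_offset)


# Possible usage on one 4096-byte data record:
# samples = decode_int32_payload(good_data, nsamples=1008)
```

Since the DDL keys describe 24-bit values stored in 32 bits, decoded samples should stay within the 24-bit range if this reading is right; values far outside it would suggest the wrong byte order or offset.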

Based on this I tried decoding the data as 32-bit integers and got seismic-looking signals. For example with this command:

UNPACK_DATA_FORMAT=3 mseed2sac 9316600.sed

The data do not have network codes (introduced in the format in 1992) and so some software doesn’t handle that well. mseed2sac adds ‘XX’ as a network code to avoid having none.

Naturally I cannot guarantee the accuracy of the data interpreted as 32-bit integers and have only looked at a few samples.

chad-iris · October 21, 2022, 2:40am

Worth mentioning that I had no trouble reading the .sed file with rdseed:

% rdseed -f 9316600.sed -d   
<< IRIS SEED Reader, Release 5.3 >>
	d = read data from tape
	Taking input from 9316600.sed
Writing .SA01..BHZ, 43750 samples (binary), starting 1993,166 05:16:07.0609 UT
Writing .SA01..BHN, 43750 samples (binary), starting 1993,166 05:16:07.0609 UT
Writing .SA01..BHE, 43750 samples (binary), starting 1993,166 05:16:07.0609 UT
Writing .SA02..BHZ, 45250 samples (binary), starting 1993,166 05:16:34.0209 UT
Writing .SA02..BHN, 45250 samples (binary), starting 1993,166 05:16:34.0209 UT
Writing .SA02..BHE, 45250 samples (binary), starting 1993,166 05:16:34.0209 UT
Writing .SA03..BHZ, 46000 samples (binary), starting 1993,166 05:16:59.0309 UT
Writing .SA03..BHN, 46000 samples (binary), starting 1993,166 05:16:59.0309 UT
Writing .SA03..BHE, 46000 samples (binary), starting 1993,166 05:16:59.0309 UT
Writing .SA04..BHZ, 47000 samples (binary), starting 1993,166 05:17:11.0710 UT
Writing .SA04..BHN, 47000 samples (binary), starting 1993,166 05:17:11.0710 UT
Writing .SA04..BHE, 47000 samples (binary), starting 1993,166 05:17:11.0710 UT
Writing .SA05..BHZ, 47750 samples (binary), starting 1993,166 05:17:17.0410 UT
...

filefolder · October 21, 2022, 3:34am

That’s amazing, thanks Chad. The mseed2sac trick with dataformat 3 works perfectly.

My guess is these were recorded on RefTek 72A07 tape loggers, which have been used on plenty of other networks, but this is the first time I have seen this problem.

Also FWIW here’s my rdseed (5.3.1) output, blank location issue? Haven’t tried 5.3

 $ rdseed -f 9316600.sed -d
<< IRIS SEED Reader, Release 5.3.1 >>
	d = read data from tape
	Taking input from 9316600.sed
WARNING (process_data):  station/channel SA01/BHZ not found in station/channel tables for location:  .
	Skipping this trace.
WARNING (process_data):  station/channel SA01/BHZ not found in station/channel tables for location:  .
	Skipping this trace.
WARNING (process_data):  station/channel SA01/BHZ not found in station/channel tables for location:  .
	Skipping this trace.
WARNING (process_data):  station/channel SA01/BHZ not found in station/channel tables for location:  .
	Skipping this trace.