ram consumption when loading part of MSEED file

Alexandre_Szenicer · July 22, 2019, 5:18pm

Hello,

I’m trying to load a 10 seconds chunk out of a large (approx. 1GB) miniseed file. I thought that loading just a chunk of the file would only consume a small amount of RAM, however after inspecting the source code it seems that the whole file is first loaded into RAM and only then trimmed to return the chunk.

This makes things very tricky if I want to do some parallel processing, as multiple cores will all need to load up one GB files and flood the RAM.

In numpy there is a very convenient thing called memmap https://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html#numpy.memmap

which essentially allows to store a big array on disk but still slice chunks of it and send them to RAM, which is exactly what I’d like from obspy.read().

Does anyone know of a way to do this? Maybe I’m just missing something easy?

Many thanks for the help,

Alex.

LionKrischer · July 22, 2019, 6:46pm

Hi Alex,

if you just pass starttime and endtime to obspy.read() it will always have to read the whole file to be able to parse the MiniSEED headers. For reasons of simplicity it is true that it loads the whole file into memory so we can treat it identical, independent of the original data source (file, URL, created-on-the-fly, …). Another reason for currently not support some form of iterative reading is that in 99% of the cases people would just always read the file at which point it would expand quite a bit in memory (due to miniseed’s compression) so the slight overhead does not matter that much. That being said: if there is some willingness to improve upon this be it via memory maps or just iteratively reading a file, please open a github issue and we can discuss it.

A simple way to decrease the memory usage of your use case would be to read it in chunks and pass it as BytesIO() objects to obspy.read(), e.g.:

record_size = 512

all_data = obspy.Stream()

with open(“filename.mseed”, “rb”) as fh:

�� while True:

�� ms = fh.read(record_size * 100)

�� if not ms:

�� break

�� with io.BytesIO(ms) as buf:

�� st = obspy.read(buf, format=“mseed”, starttime=…, endtime=…)� # bypass format detection by passing it

�� if not st:

�� continue

�� all_data += st

all_data.merge()

Untested code but I think the idea is visible. Please note that if you have a better idea where in the file the records of interest are located you can just directly seek there and only pass these headers - this will be much faster as large parts of the file will just be ignored.

Cheers!

Lion

Alexandre_Szenicer · July 23, 2019, 10:55am

Hi Lion,

Thanks a lot for your quick reply. I have a follow up question if that’s ok.

That sounds like a good fix, however since the files are quite big (many days), having to look through the file from the start sounds indeed a bit unnecessary for a 10s chunk. What’s more, this would introduce some overhead, where I’m also going for processing speed.

However, since I know the start time of the big file and the start time I’m interested in, as well as the sample rate and the numerical precision of the data, I could just compute the offset in bytes necessary to read the correct chunk of data directly.

My question is related to my poor knowledge of mseed files: since they contain headers and such, I imagine these occupy some memory space at the beginning of the file? Do you happen to know how to take this into account so I can directly seek the correct chunk of data ?

I hope my question makes sense.

Many thanks for the help,

Alex.

megies · July 23, 2019, 1:23pm

Hi Alex,

MiniSEED has blocks with fixed record size and usually that record size
is the same for all data in one file. It's a power of two, with 256 the
smallest allowed value, most common is 512 bytes I think. (Station code
is in bytes 9-14 of each record, so you can easily check record length
opening a file object and moving through the data)
So you could jump to the expected position within the file and only read
certain blocks. However, nobody can guarantee you that the data blocks
in the file are ordered by starttime of the block.. usually that might
at least roughly be the case, but it's not a mandatory property of the
file format, the could in principle appear in random order and it would
still be a valid mseed file.

You can also have a look at:
obspy.io.mseed.scripts.recordanalyzer.RecordAnalyser

I think you should be able to use it to navigate through your file and
try to find the appropriate position you're looking for.

best,
Tobias

Chad_Trabant · July 24, 2019, 12:12am

Hi Alex,

Adding a bit more to what Lion and Tobias already wrote, unfortunately not a ready solution for ObsPy but hopefully useful nonetheless.

To help address the general problem of sub-selecting data from a large volume of miniSEED, regardless of how it’s organized, the DMC released mseedindex:
https://github.com/iris-edu/mseedindex

and, for ObsPy users, proposed an additional module to leverage such indexed data sets:
https://github.com/obspy/obspy/pull/2206

It’s not in a release yet, but perhaps you can use part of this system or the pre-release code.

This is basically the same system the IRIS DMC uses (Postgres instead of SQLite) to efficiently extract 10 seconds from 1/2 petabyte of miniSEED for multiple decades of continuous data from 300,000+ channels.

regards,
Chad