Problems reading miniseed files in parallel with multiprocessing

I am trying to read many MiniSEED files with ObsPy, run in parallel via multiprocessing.Pool. Some worker processes deadlock whenever the pool has more than one process. The code is as follows:

import os
import time
import obspy
import multiprocessing


def ob_read(file, output_file):
    try:
        with open(output_file, 'a') as f:
            f.write('open file' + '\n')
        st = obspy.read(file)
        with open(output_file, 'a') as f:
            f.write(str(st) + '\n')
    except Exception as err_msg:
        print("sub_task():error message=%s" % str(err_msg))

    return None


def ob_read_parallel(data_path, cores, output_path):
    pool = multiprocessing.Pool(processes=cores)
    tasks = []
    for path, dir_list, file_list in os.walk(data_path):
        for file in file_list:
            full_file = os.path.join(path, file)
            output_file = os.path.join(output_path, file)
            if os.path.exists(output_file):
                os.remove(output_file)
            # ob_read(full_file, output_file)
            tasks.append((full_file, output_file))

    rs = pool.starmap_async(ob_read, tasks, chunksize=1)
    pool.close()
    while True:
        # _number_left is a private attribute of AsyncResult; it is used
        # here only to report progress.
        remaining = rs._number_left
        print("finished:{0}/{1}".format(len(tasks) - remaining, len(tasks)),
              end='\r')
        if rs.ready():
            break
        time.sleep(0.5)
    print('\n')
    pool.join()


if __name__ == '__main__':
    data_path = '../data/mseed'
    output_path = '../result'
    cores = 5
    ob_read_parallel(data_path, cores, output_path)

    pass

For example, when reading 245 files, the output stalls at:

finished: 190/245

The Python environment was created with ‘conda create -n test_mseed python=3.7’. ObsPy was installed via pip.

  • obspy version: 1.2.2
  • python: 3.6 or 3.7
  • platform: CentOS Linux release 7.5.1804 (Core)

I found that changing the start method for the child processes as follows

multiprocessing.set_start_method("forkserver") # or 'spawn'

solves the problem. Is there some inappropriate use of threads or locks in ObsPy’s MiniSEED reading that causes this error?
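For completeness, here is a minimal sketch of where I place that call - it has to run once, under the __main__ guard, before the pool is created:

import multiprocessing


if __name__ == '__main__':
    # Must be set once, before any Pool or Process is created;
    # 'spawn' also works but starts workers more slowly than 'forkserver'.
    multiprocessing.set_start_method("forkserver")

    data_path = '../data/mseed'
    output_path = '../result'
    cores = 5
    ob_read_parallel(data_path, cores, output_path)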

Hi @yunndlalala,

there should be no locks or mutexes in any of the core I/O routines, so I’m not sure why it hangs in your example. If you want to drill down and understand what is going on, a simple way would be to add a couple of print statements in various parts of the code and see which exact call causes the hang.
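For example, a minimal way to instrument the worker (the pid in the message and flush=True matter, so output is not lost in a buffer when a process hangs):

import os
import time

import obspy


def ob_read(file, output_file):
    pid = os.getpid()
    print("[%d] %.3f before obspy.read(%s)" % (pid, time.time(), file),
          flush=True)
    st = obspy.read(file)
    print("[%d] %.3f after obspy.read(%s)" % (pid, time.time(), file),
          flush=True)
    # ... rest of the function unchanged ...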

I know that it used to be a problem that many BLAS/LAPACK implementations (which NumPy uses) did not, and some still do not, like being forked. Using "forkserver" works around this, as the fork happens before BLAS/LAPACK is loaded. There might be a similar issue here, or maybe the same one - I don’t know which BLAS implementation your installation uses.
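If you want to check, NumPy can report which BLAS/LAPACK it was built against:

import numpy as np

# Prints the build configuration; look for openblas, mkl,
# atlas, etc. in the output.
np.show_config()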

If you don’t want to dig into what is happening and just want to get it done, I think using the forkserver is a perfectly usable solution.

I’m not convinced that reading in parallel is all that beneficial, as the filesystem will quickly become the bottleneck, and many interleaved calls from different processes might thrash it. It might be better to implement some form of queuing system where a single process reads from disk and the other processes unpack/parse the data. That is a bit more involved, but possible.
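A rough sketch of that pattern (the file list is hypothetical, and I assume MiniSEED input so the format can be fixed when parsing from memory):

import io
import multiprocessing

import obspy


def reader(files, queue, n_parsers):
    # The only process that touches the filesystem.
    for path in files:
        with open(path, 'rb') as f:
            queue.put((path, f.read()))
    # One sentinel per parser so every worker shuts down.
    for _ in range(n_parsers):
        queue.put(None)


def parser(queue):
    while True:
        item = queue.get()
        if item is None:
            break
        path, data = item
        # Parse from memory - no disk access in the workers.
        st = obspy.read(io.BytesIO(data), format="MSEED")
        print(path, st)


if __name__ == '__main__':
    files = ['../data/mseed/example.mseed']  # hypothetical file list
    n_parsers = 4
    # Bounded queue so the reader cannot run far ahead of the parsers.
    queue = multiprocessing.Queue(maxsize=2 * n_parsers)
    workers = [multiprocessing.Process(target=parser, args=(queue,))
               for _ in range(n_parsers)]
    for w in workers:
        w.start()
    reader(files, queue, n_parsers)
    for w in workers:
        w.join()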

All the best!

Lion

Hi @LionKrischer,
Thank you so much for your reply. By checking each step, I have confirmed that the hang happens during

st = obspy.read(file)

I haven’t looked into ObsPy’s code yet; maybe I will try that next.
In addition, I know little about BLAS/LAPACK implementations, so I will try the np.show_config() check to see which one my installation uses.
Also, interestingly, this bug does not appear when I use Python 3.8.

Hi,

Do you also have the issue when explicitly passing the format, i.e. obspy.read(file, format="MSEED")?

Thomas