How to speed up the mass downloader

Paride_Legovini · December 15, 2017, 7:06pm

Dear obspy-users (and developers!),

I'm downloading all the available data for a continental-scale noise
cross-correlation study. It turns out to be a lot of data, and while the
mass downloader has proven very useful, it is still very slow. I see
two reasons for this slowness.

1. I'm getting a lot of lines like:

obspy.clients.fdsn.mass_downloader - INFO: Client 'IRIS' - No data
available for request.

meaning that a request was made to the data center, but the requested
data was not available. As I'm requesting data spanning decades divided
in daily chunks, this means I'm making thousands of useless, slow
requests. AFAIK, the first thing obspy does when connecting to a data
center is requesting the data availability. Why then is it requesting
data that is not actually available?

2. It seems that the requests could be packed more efficiently.
Currently I get this kind of output when a download happens:

obspy.clients.fdsn.mass_downloader - INFO: Client 'IRIS' - Successfully
downloaded 5 channels (of 6)

but as I'm downloading LHZ data, way more than 6 channels would fit in a
request. Can this somehow be increased?

Thank you!

Paride

LionKrischer · December 16, 2017, 1:28am

Hi Paride,

> meaning that a request was made to the data center, but the requested
> data was not available. As I'm requesting data spanning decades divided
> in daily chunks, this means I'm making thousands of useless, slow
> requests. AFAIK, the first thing obspy does when connecting to a data
> center is requesting the data availability. Why then is it requesting
> data that is not actually available?

Getting reliable availability information from the data centers is
really hard. Most just tell you the epoch times of their channels but
that does not necessarily mean that there is actual waveform data
available as well - the mass downloader does take this into account -
that's why there are so many empty requests coming back. In the
particular case of IRIS I think we misinterpreted what the
`matchtimeseries` flag is actually doing - the result is that this might
actually miss some data if the `minimum_interstation_distance_in_m`
argument is used in some rare cases. But this should not affect your
downloads.

> 2. It seems that the requests could be packed more efficiently.
> Currently I get this kind of output when a download happens:
>
> obspy.clients.fdsn.mass_downloader - INFO: Client 'IRIS' - Successfully
> downloaded 5 channels (of 6)
>
> but as I'm downloading LHZ data, way more than 6 channels would fit in a
> request. Can this somehow be increased?

Sure - 6 LHZ channels - but how long is each? ObsPy tries to batch
downloads by sending out bulk requests that should result in returned
data of X MB. It defaults to 20 MB - you can increase this by setting the
`download_chunk_size_in_mb` argument of the `.download()` method. Maybe 20
is actually too low and a higher value might be much faster. Please
report your findings to us if you experiment a bit with this.
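
For reference, a minimal sketch of passing a larger chunk size to the mass downloader. The time window, the storage folder names, and the value of 50 MB are placeholder assumptions rather than recommendations, and the download only starts when `run_download()` is called, since it needs ObsPy and network access.

```python
# Keyword arguments for MassDownloader.download(). The folder names and the
# 50 MB chunk size are illustrative assumptions (the default is 20 MB).
DOWNLOAD_KWARGS = {
    "mseed_storage": "waveforms",
    "stationxml_storage": "stations",
    "download_chunk_size_in_mb": 50,
}


def run_download():
    """Download daily LHZ chunks from IRIS (needs ObsPy and network access)."""
    from obspy import UTCDateTime
    from obspy.clients.fdsn.mass_downloader import (
        GlobalDomain, MassDownloader, Restrictions)

    restrictions = Restrictions(
        # Hypothetical one-month window; replace with the real time span.
        starttime=UTCDateTime(2010, 1, 1),
        endtime=UTCDateTime(2010, 2, 1),
        chunklength_in_sec=86400,  # split the request into daily chunks
        channel="LHZ",
    )
    mdl = MassDownloader(providers=["IRIS"])
    mdl.download(GlobalDomain(), restrictions, **DOWNLOAD_KWARGS)
```

Timing the same window with a few different chunk sizes would be one way to check whether a larger value actually helps.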

Out of curiosity: What is the download speed you are approximately getting?

Cheers!

Lion

Paride_Legovini · December 17, 2017, 10:45pm

> Hi Paride,
>
>> meaning that a request was made to the data center, but the requested
>> data was not available. As I'm requesting data spanning decades divided
>> in daily chunks, this means I'm making thousands of useless, slow
>> requests. AFAIK, the first thing obspy does when connecting to a data
>> center is requesting the data availability. Why then is it requesting
>> data that is not actually available?
>
> Getting reliable availability information from the data centers is
> really hard. Most just tell you the epoch times of their channels but
> that does not necessarily mean that there is actual waveform data
> available as well - the mass downloader does take this into account -
> that's why there are so many empty requests coming back. In the
> particular case of IRIS I think we misinterpreted what the
> `matchtimeseries` flag is actually doing - the result is that this might
> actually miss some data if the `minimum_interstation_distance_in_m`
> argument is used in some rare cases. But this should not affect your
> downloads.

Thank you, Lion. I still get several "No data available for request"
lines when using IRIS; see for example:

https://clbin.com/Lf70J

Is this supposed to happen when *reliable* availability is requested?

>> 2. It seems that the requests could be packed more efficiently.
>> Currently I get this kind of output when a download happens:
>>
>> obspy.clients.fdsn.mass_downloader - INFO: Client 'IRIS' - Successfully
>> downloaded 5 channels (of 6)
>>
>> but as I'm downloading LHZ data, way more than 6 channels would fit in a
>> request. Can this somehow be increased?
>
> Sure - 6 LHZ channels - but how long is each? ObsPy tries to batch
> downloads by sending out bulk requests that should result in returned
> data of X MB. It defaults to 20 MB - you can increase this by setting the
> `download_chunk_size_in_mb` argument of the `.download()` method. Maybe 20
> is actually too low and a higher value might be much faster. Please
> report your findings to us if you experiment a bit with this.

I'm currently using download_chunk_size_in_mb=50 and I'm downloading
1-day long LHZ chunks, so the download size shouldn't be a limiting
factor...
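
A rough back-of-the-envelope check of that claim, as a sketch only. It assumes LHZ is sampled at about 1 Hz, that samples take 4 bytes each (an uncompressed upper bound; compressed miniSEED is normally smaller), and that the chunk size is counted in units of 1024 * 1024 bytes. None of these figures come from the ObsPy source.

```python
SAMPLES_PER_DAY = 86400  # assumption: 1 sample per second for 24 hours
BYTES_PER_SAMPLE = 4     # assumption: uncompressed 32-bit samples (upper bound)


def channel_days_per_chunk(chunk_size_mb):
    """Return how many one-day LHZ channels fit into a chunk of the given size."""
    chunk_bytes = chunk_size_mb * 1024 * 1024
    return chunk_bytes // (SAMPLES_PER_DAY * BYTES_PER_SAMPLE)


for size in (20, 50):
    print(size, "MB ->", channel_days_per_chunk(size), "channel-days")
```

Under these assumptions, 20 MB holds about 60 channel-days and 50 MB about 151, so 6 channels per request is well below either limit.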

> Out of curiosity: What is the download speed you are approximately getting?

Difficult to say, as I normally leave the scripts running unmonitored,
but I'll do a couple of tests and let you know. You're interested in the
*overall* speed, counting the time wasted in "no data available"
requests, right?

Thanks again,

Paride

Nikolaos_Triantafyll · December 21, 2017, 2:52pm

Hi ObsPy community,

Have you thought of using the EIDA WFCatalog service? It provides waveform metadata such as availability, and more. Right now it doesn't support real-time metadata distribution, but the metadata are never more than 24 hours old. I suppose it could be used as a pre-processing step to find out which data are available.
For example:
1. get request from user
2. ask for availability (via WFCatalog service)
3. request only the data that are available (via the FDSN service), or add a flag that skips the WFCatalog service so that it behaves as it does now
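
The filtering idea in these steps can be sketched without any web service at all. In the sketch below the availability spans are made-up values standing in for whatever an availability service would return, and the function names are hypothetical; it does not query WFCatalog or use ObsPy.

```python
from datetime import datetime, timedelta


def daily_chunks(start, end):
    """Yield (chunk_start, chunk_end) pairs covering [start, end) day by day."""
    t = start
    while t < end:
        nxt = min(t + timedelta(days=1), end)
        yield t, nxt
        t = nxt


def overlaps(chunk, span):
    """True if the two half-open intervals share any time."""
    return chunk[0] < span[1] and span[0] < chunk[1]


def chunks_with_data(start, end, available_spans):
    """Keep only the daily chunks that overlap at least one available span."""
    return [c for c in daily_chunks(start, end)
            if any(overlaps(c, s) for s in available_spans)]


# Hypothetical availability for a single channel.
available = [
    (datetime(2010, 1, 2), datetime(2010, 1, 4)),
    (datetime(2010, 1, 6, 12), datetime(2010, 1, 7)),
]
wanted = chunks_with_data(datetime(2010, 1, 1), datetime(2010, 1, 8), available)
print(len(wanted), "of 7 daily chunks need to be requested")
```

Here only the chunks for January 2, 3 and 6 would be requested, and the other four days would be skipped without contacting the data center.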

What do you think of this?

Thanks,
Nikos

LionKrischer · January 8, 2018, 2:48pm

Hi Nikos,

as of now there is unfortunately no standardized solution available for
these types of queries across all fdsnws data centers - and an explicit
goal of the mass downloader is to work for all data centers implementing
the FDSN web services. So we would need to implement queries for at
least two separate services (and some data centers don't have a service
like this at all). This is a fair amount of effort and is currently not done.

Cheers!

Lion