python - Multiprocessing doing Many Fast Calculations -


I have an interesting multiprocessing problem with some structure I might be able to exploit. I have a function func that operates on pairs of columns of a large, ~80-column dataframe (df). There are therefore ~80 * 79 / 2 pairs to process, and each individual call takes very little time.

The code looks like

  from itertools import combinations
  from multiprocessing import Manager, Pool

  mgr = Manager()
  ns = mgr.Namespace()
  ns.df = df
  pool = Pool(processes=16)
  args = [(ns, list(combo)) for combo in combinations(df.columns, 2)]
  results = pool.map(func, args)
  pool.close()

This is faster than not using a pool, but only by a factor of 7 or so. I'm worried that the overhead of so many small calls is an issue. Is there a good way to exploit the structure here for multiprocessing?

This is a fairly typical result. The overhead of setting up each process and of passing data between processes means nothing will scale perfectly when run in parallel. Keep in mind that (80 * 79) / 2 = 3,160 is actually a very small number of tasks, assuming the function is not very computationally intensive (i.e. doesn't actually take a lot of time per call). All else being equal, the more work each call does, the greater the benefit of multiprocessing, because the time to set up an additional process (and to ship each task to it) is relatively fixed.

Beyond startup cost, the overhead primarily comes in memory if you end up with many duplicates of a large dataset (one duplicate per process if the function is poorly designed), because the processes do not share memory. Assuming your function is set up so that it parallelizes easily, adding more processes helps only as long as you don't exceed the number of processors on your machine. Most home computers don't have 16 processors (8 at most is typical), and your result (7x faster in parallel) suggests you have fewer than 16. You can use multiprocessing.cpu_count() to check the number of processors on the machine.
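To avoid hard-coding processes=16, the pool can be sized from the machine itself. A minimal sketch:

```python
import multiprocessing

# size the pool to the machine rather than hard-coding 16
n_procs = multiprocessing.cpu_count()
print(f"using {n_procs} worker processes")

pool = multiprocessing.Pool(processes=n_procs)
pool.close()
pool.join()
```

If n_procs turns out to be less than 16, that alone would explain a speedup factor well below 16.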

Edit:

If you pass column names as strings to a function that indexes a global dataframe, each process ends up with its own copy of the whole dataframe. For example:

  def string_pass(string1, string2):
      return df[string1] * df[string2]

If you run string_pass in parallel, the dataframe will be copied at least once per process. In contrast:

  def column_pass(column1, column2):
      return column1 * column2

If you pass only the required columns, column_pass copies just those columns for each call when run in parallel. So while string_pass(string1, string2) and column_pass(df[string1], df[string2]) return the same result, under multiprocessing the former makes multiple copies of the global df, whereas the latter copies only the two columns needed for each call to the function.

