![]() ![]() ![]() It’s doing just one calculation at a time for a dataset that can have millions or even billions of rows. Yet most modern machines made for Data Science have at least 2 CPU cores. That means, for the example of 2 CPU cores, that 50% or more of your computer’s processing power won’t be doing anything by default when using Pandas. The situation gets even worse when you get to 4 cores (modern Intel i5) or 6 cores (modern Intel i7). Pandas simply wasn’t designed to use that computing power effectively. Modin is a new library designed to accelerate Pandas by automatically distributing the computation across all of the system’s available CPU cores. With that, Modin claims to be able to get nearly linear speedup to the number of CPU cores on your system for Pandas DataFrames of any size. #How to use gmd speed time on pc code#Let’s see how it all works and go through a few code examples. How Modin Does Parallel Processing With Pandas #Gmd speed time alternative code Given a DataFrame in Pandas, our goal is to perform some kind of calculation or process on it in the fastest way possible. That could be taking the mean of each column with. mean(), grouping data with groupby, dropping all duplicates with drop_duplicates(), or any of the other built-in Pandas functions. In the previous section, we mentioned how Pandas only uses one CPU core for processing. Naturally, this is a big bottleneck, especially for larger DataFrames, where the lack of resources really shows through. In theory, parallelizing a calculation is as easy as applying that calculation on different data points across every available CPU core. For a Pandas DataFrame, a basic idea would be to divide up the DataFrame into a few pieces, as many pieces as you have CPU cores, and let each CPU core run the calculation on its piece. In the end, we can aggregate the results, which is a computationally cheap operation. How a multi-core system can process data faster. For a single-core process (left), all 10 tasks go to a single node. For the dual-core process (right), each node takes on 5 tasks, thereby doubling the processing speed. ![]() It slices your DataFrame into different parts such that each part can be sent to a different CPU core. Modin partitions the DataFrames across both the rows and the columns. This makes Modin’s parallel processing scalable to DataFrames of any shape. Imagine if you are given a DataFrame with many columns but fewer rows. Some libraries only perform the partitioning across rows, which would be inefficient in this case since we have more columns than rows. But with Modin, since the partitioning is done across both dimensions, the parallel processing remains efficient all shapes of DataFrames, whether they are wider (lots of columns), longer (lots of rows), or both.Ī Pandas DataFrame (left) is stored as one block and is only sent to one CPU core. A Modin DataFrame (right) is partitioned across rows and columns, and each partition can be sent to a different CPU core up to the max cores in the system. Modin actually uses a Partition Manager that can change the size and shape of the partitions based on the type of operation. ![]()
0 Comments
Leave a Reply. |
AuthorWrite something about yourself. No need to be fancy, just an overview. ArchivesCategories |