@@ -24,7 +24,7 @@ limitations under the License.
{% endcomment %}
-->

-*[Bryan Cutler][11] is a software engineer at IBM's Spark Technology Center [STC][12]*
+*[Bryan Cutler][11] is a software engineer at IBM's Spark Technology Center [STC][12]*

Beginning with [Apache Spark][1] version 2.3, [Apache Arrow][2] will be a supported
dependency and begin to offer increased performance with columnar data transfer.
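
As a quick illustration of what this enables, here is a minimal sketch of opting in
to the Arrow-based transfer, assuming a local `SparkSession` and that `pyarrow` is
installed on the driver (in Spark 2.3 the Arrow path is off by default and gated
behind a conf):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("arrow-demo").getOrCreate()

# Opt in to Arrow-based columnar data transfer (off by default in Spark 2.3)
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

df = spark.range(1 << 20)  # 1M rows with a single long column "id"
pdf = df.toPandas()        # rows are now transferred as Arrow record batches
```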
@@ -42,8 +42,8 @@ format and sent to a Python worker process. This child process unpickles each ro
a huge list of tuples. Finally, a Pandas DataFrame is created from the list using
`pandas.DataFrame.from_records()`.

-This all might seem like standard procedure, but suffers from 2 glaring issues:
-1) even using CPickle, Python serialization is a slow process and 2) creating
+This all might seem like standard procedure, but suffers from 2 glaring issues: 1)
+even using CPickle, Python serialization is a slow process and 2) creating
a `pandas.DataFrame` using `from_records` must slowly iterate over the list of pure
Python data and convert each value to Pandas format. See [here][4] for a detailed
analysis.
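
To make the two issues concrete, here is a rough stand-in for the pre-Arrow steps.
This is an illustration of the pattern, not Spark's actual worker code: pickle a
large list of row tuples, then rebuild it into a DataFrame with `from_records`.

```python
import pickle
import pandas as pd

# Stand-in for the collected rows: a large list of pure-Python tuples
rows = [(i, i * 0.5) for i in range(1_000_000)]

# Issue 1: serializing/deserializing every row through pickle is slow
payload = pickle.dumps(rows, protocol=pickle.HIGHEST_PROTOCOL)
unpickled = pickle.loads(payload)

# Issue 2: from_records iterates the list and converts each scalar value
pdf = pd.DataFrame.from_records(unpickled, columns=["id", "x"])
```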
@@ -52,7 +52,7 @@ Here is where Arrow really shines to help optimize these steps: 1) Once the data
in Arrow memory format, there is no need to serialize/pickle anymore as Arrow data can
be sent directly to the Python process, 2) When the Arrow data is received in Python,
then pyarrow can utilize zero-copy methods to create a `pandas.DataFrame` from entire
-chunks of data at once instead of processing individual scalar values. Additionaly,
+chunks of data at once instead of processing individual scalar values. Additionally,
the conversion to Arrow data can be done on the JVM and pushed back for the Spark
executors to perform in parallel, drastically reducing the load on the driver.

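For a feel of the second point, here is a small sketch using `pyarrow` directly
(exact zero-copy behavior varies by data type and version): the `pandas.DataFrame`
is produced from whole Arrow columns at once rather than value by value.

```python
import pyarrow as pa

# Build an Arrow record batch from columnar arrays
batch = pa.RecordBatch.from_arrays(
    [pa.array(range(5)), pa.array([0.1, 0.2, 0.3, 0.4, 0.5])],
    names=["id", "x"],
)

# Convert whole chunks at once; no per-value Python iteration
table = pa.Table.from_batches([batch])
pdf = table.to_pandas()
```
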
@@ -105,7 +105,7 @@ max 4.194303e+06 9.999996e-01

This example was run locally on my laptop using Spark defaults so the times
shown should not be taken precisely. Even though, it is clear there is a huge
-performance boost and using Arrow took something that was excrutiatingly slow
+performance boost and using Arrow took something that was excruciatingly slow
and speeds it up to be barely noticeable.

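A rough reconstruction of the kind of comparison above, assuming the same roughly
4.2 million row DataFrame of an `id` column and a random column (timings will
differ by machine, Spark config, and pyarrow version):

```python
import time
from pyspark.sql import SparkSession
from pyspark.sql.functions import rand

spark = SparkSession.builder.getOrCreate()
df = spark.range(1 << 22).withColumn("x", rand())  # max id = 4194303, as above

for flag in ["false", "true"]:
    spark.conf.set("spark.sql.execution.arrow.enabled", flag)
    start = time.time()
    pdf = df.toPandas()
    print("arrow.enabled=%s: %.2fs" % (flag, time.time() - start))
```
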
# Notes on Usage