Commit 2fa3587

committed: fixed spelling and formatting

1 parent: 6a14066

File tree: 1 file changed, +5 -5 lines

site/_posts/2017-07-26-spark-arrow.md

Lines changed: 5 additions & 5 deletions
@@ -24,7 +24,7 @@ limitations under the License.
 {% endcomment %}
 -->

-* [Bryan Cutler][11] is a software engineer at IBM's Spark Technology Center [STC][12] *
+*[Bryan Cutler][11] is a software engineer at IBM's Spark Technology Center [STC][12]*

 Beginning with [Apache Spark][1] version 2.3, [Apache Arrow][2] will be a supported
 dependency and begin to offer increased performance with columnar data transfer.
@@ -42,8 +42,8 @@ format and sent to a Python worker process. This child process unpickles each ro
 a huge list of tuples. Finally, a Pandas DataFrame is created from the list using
 `pandas.DataFrame.from_records()`.

-This all might seem like standard procedure, but suffers from 2 glaring issues:
-1) even using CPickle, Python serialization is a slow process and 2) creating
+This all might seem like standard procedure, but suffers from 2 glaring issues: 1)
+even using CPickle, Python serialization is a slow process and 2) creating
 a `pandas.DataFrame` using `from_records` must slowly iterate over the list of pure
 Python data and convert each value to Pandas format. See [here][4] for a detailed
 analysis.
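
To make those two issues concrete, here is a minimal, self-contained sketch of the pre-Arrow path described in the hunk above; the locally pickled tuples are stand-ins for what the JVM actually streams to the worker, not Spark's internal code:

```python
import pickle
import pandas as pd

# Stand-in for the rows the JVM pickles and streams to the Python worker.
pickled_rows = [pickle.dumps((i, i / 100.0)) for i in range(100_000)]

# Issue 1: every row must be unpickled individually in Python.
rows = [pickle.loads(b) for b in pickled_rows]

# Issue 2: from_records() iterates the pure-Python tuples and converts
# each scalar value to Pandas format one at a time.
pdf = pd.DataFrame.from_records(rows, columns=["id", "x"])
```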
@@ -52,7 +52,7 @@ Here is where Arrow really shines to help optimize these steps: 1) Once the data
 in Arrow memory format, there is no need to serialize/pickle anymore as Arrow data can
 be sent directly to the Python process, 2) When the Arrow data is received in Python,
 then pyarrow can utilize zero-copy methods to create a `pandas.DataFrame` from entire
-chunks of data at once instead of processing individual scalar values. Additionaly,
+chunks of data at once instead of processing individual scalar values. Additionally,
 the conversion to Arrow data can be done on the JVM and pushed back for the Spark
 executors to perform in parallel, drastically reducing the load on the driver.

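The zero-copy conversion this hunk describes can be sketched with the public pyarrow API; the record batch built below is a stand-in for batches arriving from the JVM, not Spark's actual transfer code:

```python
import pyarrow as pa

# Stand-in for an Arrow record batch that would arrive from the JVM.
batch = pa.RecordBatch.from_arrays(
    [pa.array(list(range(100_000))),
     pa.array([i / 100.0 for i in range(100_000)])],
    names=["id", "x"],
)

# Assemble batches into a Table and convert whole columns at once;
# to_pandas() can use zero-copy for numeric columns instead of looping
# over individual scalar values.
table = pa.Table.from_batches([batch])
pdf = table.to_pandas()
```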
@@ -105,7 +105,7 @@ max 4.194303e+06 9.999996e-01

 This example was run locally on my laptop using Spark defaults so the times
 shown should not be taken precisely. Even though, it is clear there is a huge
-performance boost and using Arrow took something that was excrutiatingly slow
+performance boost and using Arrow took something that was excruciatingly slow
 and speeds it up to be barely noticeable.

 # Notes on Usage
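
For context on the benchmark this hunk refers to, the measured pattern boils down to toggling one flag before calling `toPandas()`. A minimal sketch, assuming Spark 2.3's `spark.sql.execution.arrow.enabled` configuration key (renamed in later releases) and a DataFrame shaped like the post's example, roughly 2^22 rows of `id` and `rand()`:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import rand

spark = SparkSession.builder.appName("arrow-topandas").getOrCreate()

# Opt in to Arrow-based columnar transfer for toPandas()
# (disabled by default in Spark 2.3).
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# Max id of 4.194303e+06 in the stats above matches (1 << 22) - 1.
df = spark.range(1 << 22).withColumn("x", rand())

pdf = df.toPandas()  # rows now travel as Arrow record batches, not pickled tuples
print(pdf.describe())
```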
