Commit fae4e2d

ksonj authored and rxin committed
[SPARK-7035] Encourage __getitem__ over __getattr__ on column access in the Python DataFrame API
Author: ksonj <[email protected]>

Closes #5971 from ksonj/doc and squashes the following commits:

dadfebb [ksonj] __getitem__ is cleaner than __getattr__
1 parent fa8fddf commit fae4e2d

1 file changed: +8 -3 lines changed

docs/sql-programming-guide.md

Lines changed: 8 additions & 3 deletions
@@ -139,7 +139,6 @@ DataFrames provide a domain-specific language for structured data manipulation i
 
 Here we include some basic examples of structured data processing using DataFrames:
 
-
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 {% highlight scala %}
@@ -242,6 +241,12 @@ df.groupBy("age").count().show();
 </div>
 
 <div data-lang="python" markdown="1">
+In Python it's possible to access a DataFrame's columns either by attribute
+(`df.age`) or by indexing (`df['age']`). While the former is convenient for
+interactive data exploration, users are highly encouraged to use the
+latter form, which is future proof and won't break with column names that
+are also attributes on the DataFrame class.
+
 {% highlight python %}
 from pyspark.sql import SQLContext
 sqlContext = SQLContext(sc)
@@ -270,14 +275,14 @@ df.select("name").show()
 ## Justin
 
 # Select everybody, but increment the age by 1
-df.select(df.name, df.age + 1).show()
+df.select(df['name'], df['age'] + 1).show()
 ## name (age + 1)
 ## Michael null
 ## Andy 31
 ## Justin 20
 
 # Select people older than 21
-df.filter(df.age > 21).show()
+df.filter(df['age'] > 21).show()
 ## age name
 ## 30 Andy
 
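
The name collision that the added paragraph warns about can be reproduced without Spark. What follows is a minimal sketch, assuming a toy class named `MiniFrame` (hypothetical, not part of PySpark and not how PySpark implements DataFrames) that exposes columns through both `__getattr__` and `__getitem__`. Python calls `__getattr__` only when normal attribute lookup fails, so a column named `count` is shadowed by the `count` method, while indexing still reaches the column.

```python
class MiniFrame:
    """Toy stand-in for a DataFrame; hypothetical, not PySpark's implementation."""

    def __init__(self, columns):
        # Map each column name to its list of values.
        self._columns = dict(columns)

    def count(self):
        # A regular method, analogous to a method defined on the DataFrame class.
        # Returns the number of rows, taken from the first column.
        return len(next(iter(self._columns.values()), []))

    def __getitem__(self, name):
        # Indexing always looks the name up among the columns.
        return self._columns[name]

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so real
        # attributes and methods (such as count) always take precedence.
        columns = self.__dict__.get("_columns", {})
        if name in columns:
            return columns[name]
        raise AttributeError(name)


# Ages mirror the sample data in the diff; the "count" column is made up
# to force a collision with the count() method.
df = MiniFrame({"age": [None, 30, 19], "count": [1, 2, 3]})

print(df.age)              # [None, 30, 19]
print(df["age"])           # [None, 30, 19]
print(callable(df.count))  # True: the method, not the column
print(df["count"])         # [1, 2, 3]
```

In this sketch `df.count` silently resolves to the method instead of the column, which is the kind of breakage the indexing form avoids.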
