The datascience package is an open source Python package that helps make programming more accessible to all students, regardless of background. As a pedagogical aid, the package is designed to let students carry out data science techniques more intuitively, without first spending considerable time learning more complex tools such as pandas or matplotlib. At Berkeley, these other packages are introduced in later upper-division coursework such as Data 100.
The datascience package was built with the main goal of teaching students to work with tables and visualizations in an introductory data science setting. It was inspired by techniques in SQL, pandas, and R data frames, and follows a design closer to natural language so that its syntax is more intuitive.
The package is built on built-in Python data structures, with several dependencies:
NumPy: a tool for numerical computing and linear algebra. The datascience package relies on numpy arrays as its primary data structure; for example, each column in a datascience Table object is a numpy array. Many numpy functions, such as np.mean or np.append, are also introduced separately in the course.
SciPy: a set of tools for scientific computing. The minimize function, used to minimize RMSE, uses the optimize module from scipy.
Matplotlib: a tool for visualization. Plotting is done directly in datascience by calling plotting functions on Table objects. Notably, plot tweaks such as renaming titles or adjusting axis shape are abstracted away from students.
pandas: the more industry-standard tool for data manipulation and analysis. Although pandas is not a significant dependency, datascience supports conversion between its Table objects and pandas dataframes.
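To make the NumPy and SciPy roles above concrete, here is a minimal sketch that uses numpy and scipy directly rather than the datascience package itself. The data values and the names x, y, and rmse are hypothetical, and the call to scipy.optimize.minimize only illustrates the general idea of minimizing RMSE; it is not a reproduction of how the package's own minimize function is implemented or called.

```python
import numpy as np
from scipy import optimize

# Hypothetical data. In a datascience Table, each column would be a
# numpy array much like these two.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.0])

# NumPy functions of the kind introduced alongside the package.
mean_y = np.mean(y)
x_extended = np.append(x, 6.0)  # returns a new array; x is left unchanged


def rmse(params):
    """Root mean squared error of the line y = slope * x + intercept."""
    slope, intercept = params
    predicted = slope * x + intercept
    return np.sqrt(np.mean((y - predicted) ** 2))


# Search for the slope and intercept that minimize RMSE,
# starting from an initial guess of (0, 0).
result = optimize.minimize(rmse, x0=[0.0, 0.0])
best_slope, best_intercept = result.x

print("mean of y:", mean_y)
print("best slope and intercept:", best_slope, best_intercept)
```

Because the objective is written as an ordinary Python function of the line's parameters, the same pattern applies to any other error measure a course might want to minimize.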
The datascience Python package was written by Berkeley professors John DeNero and David Culler, as well as students Sam Lau and Alvin Wan. The full documentation for the datascience package can be found here, but students typically only need the Python Reference Guide, which covers the functions that are widely used in Data 8.
One large barrier to entry in data science for many students is the coding knowledge required. Since Data 8 was designed to be highly accessible to students of all backgrounds, the datascience package was created to make the programming part of the course more accessible to students with no coding background by removing syntax complexities. However, this decision comes with a significant trade-off: compared to industry-standard tools such as pandas, the package gives up computational flexibility and power in exchange for ease of understanding and use. This trade-off is acceptable for teaching Data 8, as the datasets are typically not large (<100 MB) and the computational flexibility required is limited to the scope of the course.
Overall, Data 8 emphasizes developing computational thinking skills over the details of any specific syntax. This training allows students to transition more seamlessly to other, more complex packages after Data 8.
One limitation of using the datascience package is that it does not support a wide range of data cleaning procedures. Data 8 abstracts away data cleaning methods, which are instead taught in Data 100. As such, students in Data 8 typically receive well-formed data without missing values. However, if you plan on placing a larger focus on data cleaning or more advanced data manipulation procedures in your course, using pandas may be more appropriate.