CODAIT · frreiss · Aug 13, 2021 · Jul 20, 2021 · Jul 20, 2021 · Jul 21, 2021
diff --git a/.DS_Store b/.DS_Store
diff --git a/notebooks/DataFrame_Widget_Demo.ipynb b/notebooks/DataFrame_Widget_Demo.ipynb
@@ -0,0 +1,244 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "e69ba12f-305c-4e43-aa02-54a09093c321",
+   "metadata": {},
+   "source": [
+    "<h1>Text Extensions for Pandas</h1>\n",
+    "<h2>Interactive Dataframe Widget</h2>\n",
+    "The interactive dataframe widget is part of Text Extensions for Pandas, the IBM CODAIT team's open source Python library. The widget aims to give data scientists a meaningful, visual way to interpret NLP (Natural Language Processing) data."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d23a275a-09ac-4de0-9e07-1a71adb78365",
+   "metadata": {},
+   "source": [
+    "This demo walks you through an example session with the widget and related visualizers provided in the ```jupyter``` sub-module of Text Extensions for Pandas."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8a02abec-ae6b-4ad8-903b-f182418726e9",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "import regex\n",
+    "import sys\n",
+    "import numpy as np\n",
+    "import pandas as pd\n",
+    "\n",
+    "# And of course we need the text_extensions_for_pandas library itself.\n",
+    "try:\n",
+    "    import text_extensions_for_pandas as tp\n",
+    "except ModuleNotFoundError as e:\n",
+    "    # If we're running from within the project source tree and the parent Python\n",
+    "    # environment doesn't have the text_extensions_for_pandas package, use the\n",
+    "    # version in the local source tree.\n",
+    "    if not os.getcwd().endswith(\"notebooks\"):\n",
+    "        raise e\n",
+    "    if \"..\" not in sys.path:\n",
+    "        sys.path.insert(0, \"..\")\n",
+    "    import text_extensions_for_pandas as tp"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2f101deb-5c29-4ee2-be0c-da7061d0b5c9",
+   "metadata": {},
+   "source": [
+    "This demo uses the CoNLL-2003 dataset, a dataset for named entity recognition (NER). We will treat it as a token classification problem: the tokens are the building blocks of natural language in this dataset, and we can process them and feed them into a machine learning algorithm. The dataset contains categorical entity classifications of ```locations (LOC)```, ```persons (PER)```, ```organizations (ORG)``` and ```miscellaneous (MISC)```.\n",
+    "\n",
+    "Our goal is to load some data from this dataset, do some basic processing and analysis, and make corrections where necessary.\n",
+    "\n",
+    "We will use Text Extensions for Pandas to download and parse the CoNLL dataset into dataframes to work with."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ac6e58cc-57fb-4d71-ba64-364fd2255d95",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Download and cache the data set.\n",
+    "# NOTE: This data set is licensed for research use only. Be sure to adhere\n",
+    "#  to the terms of the license when using this data set!\n",
+    "data_set_info = tp.io.conll.maybe_download_conll_data(\"outputs\")\n",
+    "data_set_info"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2847afc9-6e3e-48a9-acc0-326cdf45877d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "gold_standard = tp.io.conll.conll_2003_to_dataframes(\n",
+    "    data_set_info[\"test\"], [\"pos\", \"phrase\", \"ent\"], [False, True, True])\n",
+    "gold_standard = [\n",
+    "    df.drop(columns=[\"pos\", \"phrase_iob\", \"phrase_type\"])\n",
+    "    for df in gold_standard\n",
+    "]\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f3dd3e7b-6535-4476-8e8a-997b2ba5e0d0",
+   "metadata": {},
+   "source": [
+    "Once we have our dataset downloaded and parsed, we can prepare our dataframe for visualization."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3651acbd-18e8-45e8-bd5c-9427659e2fd3",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "tokens = gold_standard[0]\n",
+    "tokens"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d1c602a3-186a-4059-847b-753e73df685e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "entity_mentions = tp.io.conll.iob_to_spans(tokens)\n",
+    "entity_mentions.head()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2e539ae1-b4f6-4cc1-bb74-adb206e32544",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sentences = tokens[\"sentence\"].unique()\n",
+    "entity_sentence_pairs = tp.spanner.contain_join(pd.Series(sentences), entity_mentions[\"span\"], \"sentence\", \"span\")\n",
+    "entity_mentions = entity_mentions.merge(entity_sentence_pairs)\n",
+    "entity_mentions[\"sentence_id\"] = entity_mentions[\"sentence\"].array.begin\n",
+    "entity_mentions.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2cd6d543-df21-4774-8140-5d45d6a1b7ca",
+   "metadata": {},
+   "source": [
+    "We can take a closer look at what the ```sentence``` column looks like in context by viewing its unique values on their own as a SpanArray."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e777948a-ffca-4a4c-9a04-31ac90a88698",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "entity_mentions[\"sentence\"].unique()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c7a34d33-2956-4e15-ba6b-93b1fee06fd7",
+   "metadata": {},
+   "source": [
+    "We don't want to visualize every column in our dataframe, as we're only interested in the entity classifications. We will drop the columns we don't need when we pass the dataframe to the widget."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "84886242-284e-4a66-ad87-0f65191880bf",
+   "metadata": {},
+   "source": [
+    "Now that our data is prepared for analysis, we can load it up in our widget."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "40f6c7f2-dd12-4038-9f14-e19d40e7301e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "widget = tp.jupyter.DataFrameWidget(entity_mentions.drop(columns=[\"sentence\"]))\n",
+    "widget.display()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "dbe7bbe4-44aa-4a73-ac13-faf8bd741f2c",
+   "metadata": {},
+   "source": [
+    "To make the widget interactive, we can pass the additional parameter ```interactive_columns``` with a list of the column names we want to turn into interactive widgets.\n",
+    "\n",
+    "One thing you may notice in the above widgets is that the column ```ent_type``` is editable via a text box. This is fine, but there is a more appropriate way to interact with categorical data."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "46b65ec8-865e-43eb-9990-2c91b4f298c8",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "categorical = pd.Categorical(entity_mentions[\"ent_type\"], categories=[\"PER\", \"LOC\", \"ORG\", \"MISC\"])\n",
+    "entity_mentions[\"ent_type\"] = categorical\n",
+    "tp.jupyter.DataFrameWidget(entity_mentions.drop(columns=[\"sentence\", \"sentence_id\"]), interactive_columns=[\"ent_type\"]).display()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f8280a38-7c4d-45c6-b90d-60ceccc6b603",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "corrected_entities = entity_mentions.copy(True)\n",
+    "new_types = corrected_entities[\"ent_type\"].copy()\n",
+    "new_types[widget.selected] = \"ORG\"\n",
+    "corrected_entities[\"new_type\"] = new_types\n",
+    "corrected_entities"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6d826972-07ef-40dd-9d60-534fe1198c1b",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "pd",
+   "language": "python",
+   "name": "pd"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.10"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
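
Note on the last two notebook cells: the sketch below reproduces their pattern with plain pandas only, so it runs without the widget. The rows are made up, and the boolean `selected` mask is a hypothetical stand-in for whatever `widget.selected` returns; the notebook does not show that attribute's type, so treat it as an assumption. It illustrates why converting `ent_type` to a `pd.Categorical` suits interactive editing: the set of allowed labels is fixed up front.

```python
import pandas as pd

# Made-up rows standing in for the notebook's entity_mentions dataframe.
entity_mentions = pd.DataFrame({
    "span": ["JAPAN", "Nadim Ladki", "AL-AIN"],
    "ent_type": ["LOC", "PER", "LOC"],
})

# Fix the label set up front, as the notebook does.
entity_mentions["ent_type"] = pd.Categorical(
    entity_mentions["ent_type"], categories=["PER", "LOC", "ORG", "MISC"]
)
allowed_labels = list(entity_mentions["ent_type"].cat.categories)

# ASSUMPTION: a boolean mask with one entry per row, standing in for
# widget.selected.
selected = pd.Series([False, False, True])

# Apply the correction to a copy so the original labels stay available.
corrected_entities = entity_mentions.copy(deep=True)
new_types = corrected_entities["ent_type"].copy()
new_types[selected] = "ORG"  # allowed: "ORG" is a declared category
corrected_entities["new_type"] = new_types

# A label outside the declared categories is rejected by the Categorical.
labels = pd.Categorical(["LOC"], categories=["PER", "LOC", "ORG", "MISC"])
try:
    labels[0] = "DATE"
    rejected = False
except (TypeError, ValueError):
    rejected = True

print(corrected_entities)
```

Keeping the correction in a separate `new_type` column, as the notebook does, leaves the gold-standard label next to the corrected one for later comparison.
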
diff --git a/text_extensions_for_pandas/__init__.py b/text_extensions_for_pandas/__init__.py
@@ -41,12 +41,13 @@
 from text_extensions_for_pandas import io
 from text_extensions_for_pandas import spanner
 from text_extensions_for_pandas import cleaning
+from text_extensions_for_pandas import jupyter
 
 # Sphinx autodoc needs this redundant listing of public symbols to list the contents
 # of this subpackage.
 __all__ = [
     "Span", "SpanDtype", "SpanArray",
     "TokenSpan", "TokenSpanDtype", "TokenSpanArray",
     "TensorElement", "TensorDtype", "TensorArray",
-    "io", 'cleaning'
+    "io", "cleaning", "jupyter"
 ]
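
Note on the `__all__` change: a minimal sketch of what listing a submodule there does for `from ... import *`. The package `demo_pkg` and its `hidden` submodule are hypothetical and are built in a temporary directory; nothing here touches Text Extensions for Pandas itself. A name imported in `__init__.py` but left out of `__all__` stays reachable as an attribute, yet a star import skips it.

```python
import sys
import tempfile
from pathlib import Path

# Build a throwaway package (hypothetical name: demo_pkg) that mirrors the
# pattern in this hunk: import submodules, then list the public ones.
with tempfile.TemporaryDirectory() as tmp:
    pkg = Path(tmp) / "demo_pkg"
    pkg.mkdir()
    (pkg / "jupyter.py").write_text("def show():\n    return 'widget'\n")
    (pkg / "hidden.py").write_text("VALUE = 1\n")
    (pkg / "__init__.py").write_text(
        "from demo_pkg import jupyter\n"
        "from demo_pkg import hidden\n"
        "__all__ = ['jupyter']\n"
    )

    sys.path.insert(0, tmp)
    try:
        # Run the star import in an isolated namespace so we can inspect it.
        namespace = {}
        exec("from demo_pkg import *", namespace)
    finally:
        sys.path.remove(tmp)

# Only the name listed in __all__ was exported by the star import.
exported = sorted(name for name in namespace if not name.startswith("__"))
print(exported)
```
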
diff --git a/text_extensions_for_pandas/jupyter/__init__.py b/text_extensions_for_pandas/jupyter/__init__.py
@@ -27,5 +27,6 @@
 # library.
 from text_extensions_for_pandas.jupyter.span import pretty_print_html
 from text_extensions_for_pandas.jupyter.misc import run_with_progress_bar
+from text_extensions_for_pandas.jupyter.widget import DataFrameWidget
 
-__all__ = ["span", "misc"]
+__all__ = ["span", "misc", "widget"]
diff --git a/text_extensions_for_pandas/jupyter/widget/__init__.py b/text_extensions_for_pandas/jupyter/widget/__init__.py
@@ -0,0 +1,29 @@
+#
+#  Copyright (c) 2021 IBM Corp.
+#  Licensed under the Apache License, Version 2.0 (the "License");
+#  you may not use this file except in compliance with the License.
+#  You may obtain a copy of the License at
+#
+#  http://www.apache.org/licenses/LICENSE-2.0
+#
+#  Unless required by applicable law or agreed to in writing, software
+#  distributed under the License is distributed on an "AS IS" BASIS,
+#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#  See the License for the specific language governing permissions and
+#  limitations under the License.
+#
+
+"""
+The ``jupyter.widget`` module contains the interactive DataFrame widget for use
+ in Jupyter notebooks.
+"""
+################################################################################
+# jupyter.widget module
+#
+#
+# Interactive DataFrame widget in text_extensions_for_pandas for Jupyter notebooks.
+
+# Expose the public APIs that users should get from importing the top-level
+# library.
+
+from text_extensions_for_pandas.jupyter.widget.widget import DataFrameWidget