Add DataFrame interactive widget, widget sub-module, resources, and demo notebook. #238

Merged 61 commits on Aug 13, 2021
Changes from 47 commits
Commits
235bfcf
Seed widget branch with initial structure
PokkeFe Jul 20, 2021
93943e3
Merge pull request #2 from PokkeFe/widget-init
PokkeFe Jul 20, 2021
52172fe
Added a directory (html_edit) for creating custom ipywidgets.
Jul 21, 2021
7f2d9b0
Merge pull request #3 from PokkeFe/widget-directory
PokkeFe Jul 21, 2021
b2a3672
Added start of table component
PokkeFe Jul 27, 2021
dfd68f3
Created DataFrameWidget class with internal dataframe dict.
PokkeFe Jul 28, 2021
729cd6c
Added metadata column and backend.
PokkeFe Jul 28, 2021
7582ea3
Added to_dataframe() to widget object
PokkeFe Jul 28, 2021
aedf5a3
Added backend sync and temporary widget output
PokkeFe Jul 28, 2021
68f6148
Added some basic layout styling to table
PokkeFe Jul 28, 2021
deadb18
Basic metadata import and Span rendering
PokkeFe Aug 2, 2021
2b56bbe
Improved table render styling and switched to rendering by column ins…
PokkeFe Aug 2, 2021
cd1cfc9
Merge branch 'CODAIT:master' into widget-ipyw
PokkeFe Aug 2, 2021
9a8e1d5
Separated widget into submodule
PokkeFe Aug 2, 2021
15a3dd9
Added interactive columns based on column datatype
Aug 4, 2021
ada8aed
Merge pull request #4 from PokkeFe/widget-table
PokkeFe Aug 5, 2021
c185702
Added handling of categorical data to interactive widget table
Aug 5, 2021
025bbc5
Merge pull request #5 from PokkeFe/widget-table
PokkeFe Aug 5, 2021
97f2a8f
Added span tag rendering and multiple coloring modes alongside displa…
PokkeFe Aug 5, 2021
3dc8eba
Merge branch 'widget-ipyw' of github.com:PokkeFe/text-extensions-for-…
PokkeFe Aug 5, 2021
9fddc77
Fixed outdated arguments in DataFrameWidgetComponent calls
PokkeFe Aug 6, 2021
807866f
Merge pull request #6 from PokkeFe/widget-ipyw-span
jeremy-alcanzare Aug 6, 2021
49f2d15
Added JS and CSS importing, and added some hover and click events to …
PokkeFe Aug 7, 2021
9dd314f
Added button that allows adding additional rows to displayed dataframe
Aug 8, 2021
1f0859a
Merge pull request #7 from PokkeFe/widget-add-rows
PokkeFe Aug 9, 2021
b5a0502
Merge remote-tracking branch 'origin/widget-ipyw' into widget-ipyw-js
PokkeFe Aug 9, 2021
612ab19
Added support for interactive editing of spans within a dataframe and…
Aug 9, 2021
4b16240
Added check for invalid span structure (end <= begin) to avoid rendering
PokkeFe Aug 9, 2021
8ddad02
Added debug output, basic metadata skeleton, and fixed multi-script init
PokkeFe Aug 9, 2021
4ad5bfa
Updated AddRow to preserve column type
PokkeFe Aug 9, 2021
04a4898
Merge pull request #9 from PokkeFe/widget-ipyw-span
PokkeFe Aug 9, 2021
9301423
Javascript document check
PokkeFe Aug 9, 2021
0250a96
Merge remote-tracking branch 'origin/widget-ipyw' into widget-ipyw-js
PokkeFe Aug 9, 2021
c9fe0db
Merge pull request #8 from PokkeFe/widget-ipyw-js
jeremy-alcanzare Aug 9, 2021
4c88567
Added interactive_columns parameter that takes an array of column names
PokkeFe Aug 10, 2021
b472a7a
Fix for NaN values in categorical dropdowns
PokkeFe Aug 10, 2021
392b25d
Added demo notebook
PokkeFe Aug 10, 2021
aa336f9
Added classes and styling for wrapped row rendering if the table is s…
PokkeFe Aug 10, 2021
6108999
Added metadata and index column with interactive checkboxes.
PokkeFe Aug 10, 2021
502f72b
Added editable interactive widgets for TokenSpans
Aug 10, 2021
5c78521
Merge pull request #10 from PokkeFe/widget-demo
jeremy-alcanzare Aug 10, 2021
358d9aa
Merge branch 'widget-ipyw' into widget-token-span
PokkeFe Aug 10, 2021
ee5b7ac
Merge pull request #11 from PokkeFe/widget-token-span
PokkeFe Aug 10, 2021
a3428d9
Added documentation, cleaned up some code blocks, renamed demo file, …
PokkeFe Aug 13, 2021
9267c88
Merge pull request #12 from PokkeFe/widget-demo-final
PokkeFe Aug 13, 2021
6b26b27
Removed html_edit folder
PokkeFe Aug 13, 2021
23bfd2e
Added jupyter as a sub-module to the main module's __init__.py
PokkeFe Aug 13, 2021
5a4f734
Updated doctext to sphinx and renamed widget.py to core.py
PokkeFe Aug 13, 2021
efd55f0
Changed datatype checks to use the datatype check functions found in t…
Aug 13, 2021
02b39b9
Added notebook output for DataFrame_Widget_Demo.ipynb
PokkeFe Aug 13, 2021
f2f7281
Merge pull request #14 from Crushellini/widget
PokkeFe Aug 13, 2021
d96fd51
Updated class property names to reflect privacy format. (underscore)
PokkeFe Aug 13, 2021
3084481
Merge pull request #13 from PokkeFe/widget-rename-and-docs
jeremy-alcanzare Aug 13, 2021
99fe077
Ran black formatter on widget module
PokkeFe Aug 13, 2021
e038d55
Changed imports in core.py to be compatible with Python 3.6
Aug 13, 2021
54cbcd3
Merge pull request #15 from PokkeFe/widget-formatted
jeremy-alcanzare Aug 13, 2021
b7215a6
Switched name of index in span.py to i_loc for type clarity. Also bro…
PokkeFe Aug 13, 2021
bfc2514
Added check for MultiIndex.index and raises exception in this case
Aug 13, 2021
d995935
Merge pull request #17 from PokkeFe/widget-multiindex-exception
PokkeFe Aug 13, 2021
7f9e1a7
Merge pull request #16 from PokkeFe/widget-iloc-and-spanhtml
jeremy-alcanzare Aug 13, 2021
4d300b7
Fixed init doc for widget module and ran black formatter one final time.
PokkeFe Aug 13, 2021
Binary file added .DS_Store
Binary file not shown.
244 changes: 244 additions & 0 deletions notebooks/DataFrame_Widget_Demo.ipynb
@@ -0,0 +1,244 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "e69ba12f-305c-4e43-aa02-54a09093c321",
"metadata": {},
"source": [
"<h1>Text Extensions for Pandas</h1>\n",
"<h2>Interactive Dataframe Widget</h2>\n",
"The interactive dataframe widget is an application within the IBM CODAIT team's open source Python library, Text Extensions for Pandas. The widget aims to provide data scientists with a meaningful, visual way to interpret NLP (Natural Language Processing) data."
]
},
{
"cell_type": "markdown",
"id": "d23a275a-09ac-4de0-9e07-1a71adb78365",
"metadata": {},
"source": [
"This demo will walk you through an example session of using the widget and related visualizers provided in the ```jupyter``` sub-module of Text Extensions for Pandas."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8a02abec-ae6b-4ad8-903b-f182418726e9",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import regex\n",
"import sys\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# And of course we need the text_extensions_for_pandas library itself.\n",
"try:\n",
" import text_extensions_for_pandas as tp\n",
"except ModuleNotFoundError as e:\n",
" # If we're running from within the project source tree and the parent Python\n",
" # environment doesn't have the text_extensions_for_pandas package, use the\n",
" # version in the local source tree.\n",
" if not os.getcwd().endswith(\"notebooks\"):\n",
" raise e\n",
" if \"..\" not in sys.path:\n",
" sys.path.insert(0, \"..\")\n",
" import text_extensions_for_pandas as tp"
]
},
{
"cell_type": "markdown",
"id": "2f101deb-5c29-4ee2-be0c-da7061d0b5c9",
"metadata": {},
"source": [
"This demo uses the CoNLL-2003 dataset, a benchmark dataset for named entity recognition (NER). We will be looking at a token classification problem: analyzing the tokens, the building blocks of natural language, in this dataset so that we can process them and feed them into a machine learning algorithm. The dataset contains categorical entity classifications of ```locations (LOC)```, ```persons (PER)```, ```organizations (ORG)``` and ```miscellaneous (MISC)```.\n",
"\n",
"Our goal is to load up some data from this dataset and do some basic processing and analysis, and make corrections if necessary.\n",
"\n",
"We will use Text Extensions for Pandas to download and parse the CoNLL dataset into dataframes to work with."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ac6e58cc-57fb-4d71-ba64-364fd2255d95",
"metadata": {},
"outputs": [],
"source": [
"# Download and cache the data set.\n",
"# NOTE: This data set is licensed for research use only. Be sure to adhere\n",
"# to the terms of the license when using this data set!\n",
"data_set_info = tp.io.conll.maybe_download_conll_data(\"outputs\")\n",
"data_set_info"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2847afc9-6e3e-48a9-acc0-326cdf45877d",
"metadata": {},
"outputs": [],
"source": [
"gold_standard = tp.io.conll.conll_2003_to_dataframes(\n",
" data_set_info[\"test\"], [\"pos\", \"phrase\", \"ent\"], [False, True, True])\n",
"gold_standard = [\n",
" df.drop(columns=[\"pos\", \"phrase_iob\", \"phrase_type\"])\n",
" for df in gold_standard\n",
"]\n"
]
},
{
"cell_type": "markdown",
"id": "f3dd3e7b-6535-4476-8e8a-997b2ba5e0d0",
"metadata": {},
"source": [
"Once we have our dataset downloaded and parsed, we can prepare our dataframe for visualization."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3651acbd-18e8-45e8-bd5c-9427659e2fd3",
"metadata": {},
"outputs": [],
"source": [
"tokens = gold_standard[0]\n",
"tokens"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d1c602a3-186a-4059-847b-753e73df685e",
"metadata": {},
"outputs": [],
"source": [
"entity_mentions = tp.io.conll.iob_to_spans(tokens)\n",
"entity_mentions.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2e539ae1-b4f6-4cc1-bb74-adb206e32544",
"metadata": {},
"outputs": [],
"source": [
"sentences = tokens[\"sentence\"].unique()\n",
"entity_sentence_pairs = tp.spanner.contain_join(pd.Series(sentences), entity_mentions[\"span\"], \"sentence\", \"span\")\n",
"entity_mentions = entity_mentions.merge(entity_sentence_pairs)\n",
"entity_mentions[\"sentence_id\"] = entity_mentions[\"sentence\"].array.begin\n",
"entity_mentions.head()"
]
},
{
"cell_type": "markdown",
"id": "2cd6d543-df21-4774-8140-5d45d6a1b7ca",
"metadata": {},
"source": [
"We can take a closer look at the context around each span by viewing the unique values of the ```sentence``` column on their own as a SpanArray."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e777948a-ffca-4a4c-9a04-31ac90a88698",
"metadata": {},
"outputs": [],
"source": [
"entity_mentions[\"sentence\"].unique()"
]
},
{
"cell_type": "markdown",
"id": "c7a34d33-2956-4e15-ba6b-93b1fee06fd7",
"metadata": {},
"source": [
"We don't really want to visualize every column in our dataframe as we're only interested in viewing the entity classifications. The next step is to drop any columns we don't care about."
]
},
{
"cell_type": "markdown",
"id": "84886242-284e-4a66-ad87-0f65191880bf",
"metadata": {},
"source": [
"Now that our data is prepared for analysis, we can load it up in our widget."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "40f6c7f2-dd12-4038-9f14-e19d40e7301e",
"metadata": {},
"outputs": [],
"source": [
"widget = tp.jupyter.DataFrameWidget(entity_mentions.drop(columns=[\"sentence\"]))\n",
"widget.display()"
]
},
{
"cell_type": "markdown",
"id": "dbe7bbe4-44aa-4a73-ac13-faf8bd741f2c",
"metadata": {},
"source": [
"If we want to view this widget interactively, we can pass in the additional parameter ```interactive_columns``` with an array of column names we want to become interactive widgets.\n",
"\n",
"One thing you may notice in the above widgets is that the column ```ent_type``` is editable via a text box. This is fine, but there is a more appropriate way to interact with categorical data."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "46b65ec8-865e-43eb-9990-2c91b4f298c8",
"metadata": {},
"outputs": [],
"source": [
"categorical = pd.Categorical(entity_mentions[\"ent_type\"], categories=[\"PER\", \"LOC\", \"ORG\", \"MISC\"])\n",
"entity_mentions[\"ent_type\"] = categorical\n",
"tp.jupyter.DataFrameWidget(entity_mentions.drop(columns=[\"sentence\", \"sentence_id\"]), interactive_columns=[\"ent_type\"]).display()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f8280a38-7c4d-45c6-b90d-60ceccc6b603",
"metadata": {},
"outputs": [],
"source": [
"corrected_entities = entity_mentions.copy(True)\n",
"new_types = corrected_entities[\"ent_type\"].copy()\n",
"new_types[widget.selected] = \"ORG\"\n",
"corrected_entities[\"new_type\"] = new_types\n",
"corrected_entities"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6d826972-07ef-40dd-9d60-534fe1198c1b",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "pd",
"language": "python",
"name": "pd"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
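The notebook's last two code cells declare `ent_type` as a categorical with a fixed category list and then overwrite the selected rows with "ORG". Below is a minimal sketch of why the explicit category list matters. It uses a hypothetical four-row stand-in for `entity_mentions` and a hand-written boolean mask in place of `widget.selected`, whose exact type is an assumption here.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's entity_mentions dataframe.
entity_mentions = pd.DataFrame(
    {
        "span_text": ["JAPAN", "Nadim Ladki", "AL-AIN", "China"],
        "ent_type": ["LOC", "PER", "LOC", "LOC"],
    }
)

# Declare every allowed label up front, including labels ("ORG", "MISC")
# that do not occur in the data yet.
entity_mentions["ent_type"] = pd.Categorical(
    entity_mentions["ent_type"], categories=["PER", "LOC", "ORG", "MISC"]
)

# Assumption: the widget's selection behaves like a boolean mask with one
# entry per row. Here the third row is "selected".
selected = pd.Series([False, False, True, False], index=entity_mentions.index)

# Copy the column, then relabel only the selected rows. Assigning "ORG"
# is accepted because it is one of the declared categories.
new_types = entity_mentions["ent_type"].copy()
new_types[selected] = "ORG"

corrected_entities = entity_mentions.copy()
corrected_entities["new_type"] = new_types
print(corrected_entities)
```

If the categories were inferred from the data alone, "ORG" would not be a known category in this stand-in and pandas would reject the assignment, so listing the labels explicitly keeps the correction step working for labels that have not appeared yet.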
3 changes: 2 additions & 1 deletion text_extensions_for_pandas/__init__.py
@@ -41,12 +41,13 @@
 from text_extensions_for_pandas import io
 from text_extensions_for_pandas import spanner
 from text_extensions_for_pandas import cleaning
+from text_extensions_for_pandas import jupyter

 # Sphinx autodoc needs this redundant listing of public symbols to list the contents
 # of this subpackage.
 __all__ = [
 "Span", "SpanDtype", "SpanArray",
 "TokenSpan", "TokenSpanDtype", "TokenSpanArray",
 "TensorElement", "TensorDtype", "TensorArray",
-"io", 'cleaning'
+"io", 'cleaning', "jupyter"
 ]
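The comment in this hunk says the `__all__` list exists so that Sphinx autodoc can enumerate the subpackage's public symbols. The same list also decides which names a star-import binds. Below is a minimal sketch of that second effect, using a throwaway module with the hypothetical name `demo_pkg` rather than the real package.

```python
import sys
import types

# Build a throwaway module object; "demo_pkg" is a hypothetical name used
# only for this illustration.
demo = types.ModuleType("demo_pkg")
exec(
    "io = 'io submodule'\n"
    "jupyter = 'jupyter submodule'\n"
    "spanner = 'spanner submodule'\n"
    "__all__ = ['io', 'jupyter']\n",
    demo.__dict__,
)
sys.modules["demo_pkg"] = demo

# A star-import copies only the names listed in __all__.
namespace = {}
exec("from demo_pkg import *", namespace)
exported = sorted(name for name in namespace if not name.startswith("__"))
print(exported)

# Names left out of __all__ are still reachable as attributes.
print(demo.spanner)
```

In this sketch `spanner` is importable by attribute access but is not bound by the star-import, because it is missing from `__all__`.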
3 changes: 2 additions & 1 deletion text_extensions_for_pandas/jupyter/__init__.py
@@ -27,5 +27,6 @@
 # library.
 from text_extensions_for_pandas.jupyter.span import pretty_print_html
 from text_extensions_for_pandas.jupyter.misc import run_with_progress_bar
+from text_extensions_for_pandas.jupyter.widget import DataFrameWidget

-__all__ = ["span", "misc"]
+__all__ = ["span", "misc", "widget"]
29 changes: 29 additions & 0 deletions text_extensions_for_pandas/jupyter/widget/__init__.py
@@ -0,0 +1,29 @@
#
# Copyright (c) 2021 IBM Corp.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

"""
The ``jupyter`` module contains functions to support the use of Text Extensions for Pandas
in Jupyter notebooks.
"""
################################################################################
# jupyter module
#
#
# Functions in text_extensions_for_pandas for Jupyter notebook support.

# Expose the public APIs that users should get from importing the top-level
# library.

from text_extensions_for_pandas.jupyter.widget.widget import DataFrameWidget