DataTreeGrab

Some examples
The latest stable release
Go to the WIKI
Go to tvgrabpyAPI

A spin-off Python module for extracting structured data from HTML and JSON pages.
It is at the heart of tv_grab_py_API and was initially named just DataTree,
but that name was already taken in the Python library, hence DataTreeGrab.

Requirements

Python 2.7.9 or higher (Python 3.x is currently not supported)
The pytz module

Installation

Make sure Python 2.7.9 or higher is installed (check this especially under Windows)
Make sure the pytz module mentioned above is installed on your system
Download the latest stable release and unpack it into a directory
Run from that directory:
- under Linux: sudo ./setup.py install
- under Windows, depending on how you installed Python:
  - setup.py install
  - or: python setup.py install
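
The two requirements above can be checked before running setup.py. The following is only a minimal sketch: check_requirements is a hypothetical helper name and is not part of DataTreeGrab.

```python
import sys


def check_requirements():
    """Return a list of problems; an empty list means both requirements are met."""
    problems = []
    version = sys.version_info[:3]
    # The README asks for Python 2.7.9 or higher, but not Python 3.x.
    if not ((2, 7, 9) <= version < (3, 0, 0)):
        problems.append(
            "Python 2.7.9 or higher (but not 3.x) is required, found %d.%d.%d" % version
        )
    try:
        import pytz  # noqa: F401  (only checking that the import works)
    except ImportError:
        problems.append("The pytz module is not installed")
    return problems


if __name__ == "__main__":
    for problem in check_requirements():
        print(problem)
```

Run it with the same interpreter you intend to use for the installation. No output means both checks passed.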

Main advantages

It gives you a highly dependable dataset from a potentially changeable source.
You can easily adapt to changes in the source without touching your code.
You can make the data_def available at a central location while distributing
the application, giving your users easy access to (automated) updates.
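
One way an application could pick up a centrally published data_def is sketched below. This is an assumption about how you might organise it, not something DataTreeGrab does for you: load_data_def, fetch_remote and cache_path are hypothetical names. fetch_remote is any callable that returns the data_def as JSON text, for example a small function that downloads it from your own server.

```python
import json


def load_data_def(fetch_remote, cache_path):
    """Return the data_def as a dict, preferring a fresh central copy.

    fetch_remote: callable returning the data_def as JSON text (hypothetical).
    cache_path:   local file that keeps the last good copy.
    """
    try:
        text = fetch_remote()
        data_def = json.loads(text)
    except (IOError, OSError, ValueError):
        # Central location unreachable or the file is not valid JSON:
        # fall back to the copy saved during an earlier run.
        with open(cache_path) as cached:
            return json.load(cached)
    # Remember the fresh copy for the next time the source is unavailable.
    with open(cache_path, "w") as cached:
        cached.write(text)
    return data_def
```

With this arrangement a changed source only requires publishing a new data_def. The distributed application code stays untouched.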

Known issues

Adding warning rules to DataTreeShell prior to DataTree initialization will place the general rule before those added rules. Fixed in version 1.3.1.

Release notes

With version 1.4.0 10-07-2017

Introduces a pre-conversion of the data_def to a more machine-friendly and thus faster format. During conversion the data_defs are validated and any defaults are filled in. Because of this, a lot of validation code could be removed from the parsing stage, giving a further speed increase
(some 50% relative to 1.3.3 and 65% compared to 1.3.2).
During a complete review of the code, adapting it to the converted data_def format, several inconsistencies in data_def keyword handling were found and corrected.
It should be compatible with older implementations.

With version 1.3.4 18-06-2017

Some minor fixes

With version 1.3.3 17-05-2017

With the introduction of the "node" keyword to store references to nodes, a significant speed increase of some 30% can be achieved. This does require adaptations to existing data_defs. The "values2" keyword is introduced as an alternative set of value_defs, leaving the original set in place for backward compatibility.

With version 1.3.2 27-11-2016

Fixed missing signals from the extract_datalist function on an empty result

With version 1.3.1 19-11-2016

Fixed warning rules getting reset when DataTreeShell initializes the DataTree
Added error return codes to some of the functions. See https://github.com/tvgrabbers/DataTree/wiki/The-Warning-Framework

With version 1.3.0 9-11-2016

First non-beta release
Added functionality to show progress while running the extract_datalist function in a multi-threading environment
Added a flag to abort the extract_datalist function in a multi-threading environment

With version 1.2.5 15-10-2016

added a print_datatree function to DataTreeShell
some cosmetic updates on the internal print functions

With version 1.2.4 30-9-2016

implemented "text_replace" keyword to search and replace in the html data before importing
implemented "unquote_html" keyword to correct literal ", < and > occurrences in HTML text by replacing them with the correct &quot;, &lt; and &gt; entities
made it possible to read a partial HTML page resulting from an "HTTP incomplete read" by checking for and adding a missing </html> and/or </body> tag. If more than the tail part is missing, your search will probably fail later. (Any tag with a missing closing tag is assumed to be auto-closing. This prevents HTML errors, except for the enclosing <html> tag. However, if the closing tag is missing unintentionally, the tree hierarchy can change, making the search fail even when the data is present. So this only works if all the other higher-level closing tags are in the download.)
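
The tail repair described for partial pages can be pictured roughly as follows. This is an illustrative sketch in plain Python and not the code DataTreeGrab uses internally. close_truncated_html is a hypothetical name.

```python
import re


def close_truncated_html(page):
    """Append a missing </body> and/or </html> to a page that was cut off.

    Only the tail is repaired. Tags that were lost earlier in the page
    are not restored.
    """
    lower = page.lower()
    has_html_close = "</html>" in lower
    # A <body> that was opened but never closed, with the page end missing too.
    if re.search(r"<body[\s>]", lower) and "</body>" not in lower and not has_html_close:
        page += "</body>"
    # An <html> that was opened but never closed.
    if re.search(r"<html[\s>]", lower) and not has_html_close:
        page += "</html>"
    return page


print(close_truncated_html("<html><body><p>cut off here"))
```

A complete page passes through unchanged, so the function can be applied to every download.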

With version 1.2.3 25-9-2016

implemented "url-relative-weekdays" keyword
some bug fixes
Updates on the test module

With version 1.2.2 31-8-2016

Updates on the test module
Some code clean-up

With version 1.2.1 22-8-2016

Updates on the test module

With version 1.2.0 20-8-2016

Implemented a data_def test module

With version 1.1.4 23-7-2016

Implemented a stripped and extended Warnings framework
Added optional sorting before extraction of part of a JSON tree
Some fixes

With version 1.1.3 9-7-2016

More unified HTML and JSON parsing with added keywords "notchildkeys" and "tags",
renamed keyword "childkeys" and extended functionality for some of the others.
A linked value can now also be used in most cases.
Added selection keyword "inclusive text" for HTML to include text in sub tags like
"i", "b" etc.
Added support for a tuple with multiple dtype values in the is_data_value function.

With version 1.1.2 5-7-2016

A new warnings category for invalid data imports into a tree
A new search keyword "notattrs"

With version 1.1 28-6-2016 we have, next to some patches, added several new features:

Added support for 12 hour time values
Added the str-list type
Added a warnings framework
Added a DataTreeShell class with pre and post processing functionality.

It reads the page into a Node-based tree from which you can, on the basis of a JSON
data file, extract your data into a list of items. For this a special data_def language
has been developed. It can first extract a list of keyNodes and then extract the same
data-list for each of them. During the extraction several data manipulation
functions are available.
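
The two-stage idea (first locate the keyNodes, then pull the same list of values from each of them) can be illustrated with a toy version over plain Python dicts and lists. This does not use DataTreeGrab or the data_def syntax. follow, extract_items and the sample data are made up for the illustration.

```python
def follow(node, path):
    """Walk a list of dict keys or list indexes; return None if a step is missing."""
    for step in path:
        try:
            node = node[step]
        except (KeyError, IndexError, TypeError):
            return None
    return node


def extract_items(tree, key_path, value_paths):
    """Find the key nodes, then extract the same values from each one."""
    key_nodes = follow(tree, key_path) or []
    return [[follow(key_node, path) for path in value_paths] for key_node in key_nodes]


# Hypothetical page data, already parsed from JSON.
page = {"library": {"programs": [
    {"title": "News", "start": "20:00", "channel": {"name": "One"}},
    {"title": "Film", "start": "21:30"},
]}}

items = extract_items(
    page,
    ["library", "programs"],                    # where the key nodes live
    [["title"], ["start"], ["channel", "name"]],  # what to take from each
)
print(items)  # [['News', '20:00', 'One'], ['Film', '21:30', None]]
```

In the real module the paths and the value manipulation are described in the data_def file instead of in code, which is what lets you follow changes in the source without touching the application.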

Check the WIKI for the syntax. Here is a short, incomplete list of possible keywords:

path-dict keywords:

"path": "all", "root", "parent"
"key":
"keys":{"":{"link":1},"":""} (selection on child presence)
"tag":""
"attrs":{"":{"link":1},"":{"not":[]},"":"","":null}
"index":{"link":1}

selection-keywords:

"select": "key", "text", "tag", "index", "value"
"attr":""
"link":1 (create a link)
"link-index":1 (create a link)

link examples

[{"key":"abstract_key", "link":1},
        "root",{"key":"library"},"all",{"key":"abstracts"},
        {"keys":{"abstract_key":{"link":1}}},
        {"key":"name","default":""}],

        [...,{ "attr":"value", "ascii-replace":["ss","s", "[-!?(), ]"], "link":1}],

        [...,{"tag":"img", "attrs":{"class": {"link":1}},"attr":"src"}],
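
As a rough reading of the first link example: a value found at one place is stored under link number 1 and then used further along the path to select the matching node elsewhere in the tree. The sketch below mimics that idea in plain Python under this interpretation. It is not how DataTreeGrab evaluates a data_def, and linked_name and the sample data are hypothetical. The WIKI remains the reference for the exact semantics.

```python
def linked_name(item, root, default=""):
    """Look up the name of the abstract that an item refers to."""
    links = {}
    # Comparable to {"key": "abstract_key", "link": 1}: remember the value.
    links[1] = item.get("abstract_key")
    # Comparable to "root", {"key": "library"}, ..., {"key": "abstracts"}.
    for candidate in root.get("library", {}).get("abstracts", []):
        # Comparable to {"keys": {"abstract_key": {"link": 1}}}:
        # keep only the node whose abstract_key equals the stored link.
        if candidate.get("abstract_key") == links[1]:
            # Comparable to {"key": "name", "default": ""}.
            return candidate.get("name", default)
    return default


root = {"library": {"abstracts": [
    {"abstract_key": "a1", "name": "First"},
    {"abstract_key": "a2", "name": "Second"},
]}}

print(linked_name({"abstract_key": "a2"}, root))  # Second
```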

selection-format keywords:

"lower","upper","capitalize"
"ascii-replace":["ss","s", "[-!?(), ]"]
"lstrip", "rstrip":"')"
"sub":["",""]
"split":[["/",-1],["\.",0,1]]
"multiplier", "divider":1000 (for timestamp)
"replace":{"tv":2, "radio":12}
"default":
"type":
- "datetimestring","timestamp","time","timedelta","date","datestamp", "relative-weekday","string", "lower-ascii","int", "float","boolean","list",
"member-off"

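To give a feel for what a few of the selection-format keywords do, here is a plain-Python approximation of "lower", "rstrip", "replace", "divider" and "default". The behaviour shown is assumed from the keyword names and the sample values above. format_value is a hypothetical helper and not part of DataTreeGrab, and the exact rules (for instance what happens to a value that "replace" does not list) are defined in the WIKI.

```python
def format_value(value, lower=False, rstrip=None, replace=None, divider=None, default=None):
    """Apply a few formatting steps to one extracted value (illustration only)."""
    if value is None:
        return default
    if lower:
        value = value.lower()
    if rstrip is not None:
        # Strip any of the given characters from the right-hand side.
        value = value.rstrip(rstrip)
    if replace is not None:
        # Map the value through a lookup table. Assumption of this sketch:
        # values that are not in the table become the default.
        value = replace.get(value, default)
    if divider is not None:
        # For example, turn a timestamp in milliseconds into seconds.
        value = float(value) / divider
    return value


print(format_value("Radio", lower=True, replace={"tv": 2, "radio": 12}))  # 12
print(format_value("title')", rstrip="')"))                               # title
print(format_value(1499644800000, divider=1000))                          # 1499644800.0
```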