Deprecate schematics #399

Merged
merged 2 commits into from
Jun 29, 2023
59 changes: 36 additions & 23 deletions docs/source/getting-started.rst
@@ -315,12 +315,8 @@ Item validation
---------------

Item validators allows you to match your returned items with predetermined structure
ensuring that all fields contains data in the expected format. Spidermon allows
you to choose between schematics_ or `JSON Schema`_ to define the structure
of your item.

In this tutorial, we will use a schematics_ model to make sure that all required
fields are populated and they are all of the correct format.
ensuring that all fields contain data in the expected format. Spidermon supports `JSON Schema`_
to define the structure of your item.

First step is to change our actual spider code to use `Scrapy items`_. Create a
new file called `items.py`:
@@ -367,25 +363,43 @@ And then modify the spider code to use the newly defined item:
)
)

Now we need to create our schematics model in `validators.py` file that will contain
Now we need to create our jsonschema model in the `schemas/quote_item.json` file that will contain
all the validation rules:

.. _quote-item-validation-schema:

.. code-block:: python

# tutorial/validators.py
from schematics.models import Model
from schematics.types import URLType, StringType, ListType

class QuoteItem(Model):
quote = StringType(required=True)
author = StringType(required=True)
author_url = URLType(required=True)
tags = ListType(StringType)
.. code-block:: json

{
"$schema": "http://json-schema.org/draft-07/schema",
"type": "object",
"properties": {
"quote": {
"type": "string"
},
"author": {
"type": "string"
},
"author_url": {
"type": "string",
"pattern": ""
},
"tags": {
"type": "array",
"items": {
"type": "string"
}
}
},
"required": [
"quote",
"author",
"author_url"
]
}

To allow Spidermon to validate your items, you need to include an item pipeline and
inform the name of the model class used for validation:
provide the path to the JSON Schema file used for validation:

.. code-block:: python

@@ -394,8 +408,8 @@ inform the name of the model class used for validation:
'spidermon.contrib.scrapy.pipelines.ItemValidationPipeline': 800,
}

SPIDERMON_VALIDATION_MODELS = (
'tutorial.validators.QuoteItem',
SPIDERMON_VALIDATION_SCHEMAS = (
'./schemas/quote_item.json',
)

After that, every time you run your spider you will have a new set of stats in
@@ -408,7 +422,7 @@ your spider log providing information about the results of the validations:
'spidermon/validation/fields': 400,
'spidermon/validation/items': 100,
'spidermon/validation/validators': 1,
'spidermon/validation/validators/item/schematics': True,
'spidermon/validation/validators/item/jsonschema': True,
[scrapy.core.engine] INFO: Spider closed (finished)

You can then create a new monitor that will check these new statistics and raise
@@ -473,7 +487,6 @@ The resulted item will look like this:
}

.. _`JSON Schema`: https://json-schema.org/
.. _`schematics`: https://schematics.readthedocs.io/en/latest/
.. _`Scrapy`: https://scrapy.org/
.. _`Scrapy items`: https://docs.scrapy.org/en/latest/topics/items.html
.. _`Scrapy Tutorial`: https://doc.scrapy.org/en/latest/intro/tutorial.html
7 changes: 2 additions & 5 deletions docs/source/index.rst
@@ -11,11 +11,8 @@ following features:

* It can check the output data produced by Scrapy (or other sources) and
verify it against a schema or model that defines the expected structure,
data types and value restrictions. It supports data validation based on two
external libraries:

* jsonschema: `<https://github.com/Julian/jsonschema>`_
* Schematics: `<https://github.com/schematics/schematics>`_
data types and value restrictions. It supports data validation based on
the jsonschema library (`<https://github.com/Julian/jsonschema>`_).
* It allows you to define conditions that should trigger an alert based on
Scrapy stats.
* It supports notifications via email, Slack, Telegram and Discord.
5 changes: 1 addition & 4 deletions docs/source/installation.rst
@@ -9,15 +9,12 @@ build your monitors on top of it. The library depends on jsonschema_ and

If you want to set up any notifications, additional `monitoring` dependencies will help with that.

If you want to use schematics_ validation, you probably want `validation`.

So the recommended way to install the library is by adding both:

.. code-block:: bash

pip install "spidermon[monitoring,validation]"
pip install "spidermon[monitoring]"


.. _`jsonschema`: https://pypi.org/project/jsonschema/
.. _`python-slugify`: https://pypi.org/project/python-slugify/
.. _`schematics`: https://pypi.org/project/schematics/
66 changes: 2 additions & 64 deletions docs/source/item-validation.rst
@@ -21,37 +21,8 @@ the first step is to enable the built-in item pipeline in your project settings:
subsequent pipeline changes the content of the item, ignoring the
validation already performed.

After that, you need to choose which validation library will be used. Spidermon
accepts schemas defined using schematics_ or `JSON Schema`_.

With schematics
---------------

Schematics_ is a validation library based on ORM-like models. These models include
some common data types and validators, but they can also be extended to define
custom validation rules.

.. warning::

You need to install `schematics`_ to use this feature.

.. code-block:: python

# Usually placed in validators.py file
from schematics.models import Model
from schematics.types import URLType, StringType, ListType

class QuoteItem(Model):
quote = StringType(required=True)
author = StringType(required=True)
author_url = URLType(required=True)
tags = ListType(StringType)

Check `schematics documentation`_ to learn how to define a model and how to extend the
built-in data types.

With JSON Schema
----------------
Using JSON Schema
-----------------

`JSON Schema`_ is a powerful tool for validating the structure of JSON data. You can
define which fields are required, the type assigned to each field, a regular expression
@@ -133,36 +104,6 @@ Default: ``_validation``
The name of the field added to the item when a validation error happens and
`SPIDERMON_VALIDATION_ADD_ERRORS_TO_ITEMS`_ is enabled.

SPIDERMON_VALIDATION_MODELS
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Default: ``None``

A `list` containing the `schematics models`_ that contain the definition of the items
that need to be validated.

.. code-block:: python

# settings.py

SPIDERMON_VALIDATION_MODELS = [
'tutorial.validators.DummyItemModel'
]

If you are working on a spider that produces multiple items types, you can define it
as a `dict`:

.. code-block:: python

# settings.py

from tutorial.items import DummyItem, OtherItem

SPIDERMON_VALIDATION_MODELS = {
DummyItem: 'tutorial.validators.DummyItemModel',
OtherItem: 'tutorial.validators.OtherItemModel',
}

SPIDERMON_VALIDATION_SCHEMAS
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

@@ -235,9 +176,6 @@ Some examples:
# checks that no errors is present in any fields
self.check_field_errors_percent()

.. _`schematics`: https://schematics.readthedocs.io/en/latest/
.. _`schematics documentation`: https://schematics.readthedocs.io/en/latest/
.. _`JSON Schema`: https://json-schema.org/
.. _`guide`: http://json-schema.org/learn/getting-started-step-by-step.html
.. _`schematics models`: https://schematics.readthedocs.io/en/latest/usage/models.html
.. _`jsonschema`: https://pypi.org/project/jsonschema/
Binary file not shown.
27 changes: 27 additions & 0 deletions examples/tutorial/tutorial/schemas/quote_item.json
@@ -0,0 +1,27 @@
{
"$schema": "http://json-schema.org/draft-07/schema",
"type": "object",
"properties": {
"quote": {
"type": "string"
},
"author": {
"type": "string"
},
"author_url": {
"type": "string",
"pattern": ""
},
"tags": {
"type": "array",
"items": {
"type": "string"
}
}
},
"required": [
"quote",
"author",
"author_url"
]
}
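
For readers who want to see what the rules in `quote_item.json` amount to, the following is a minimal, standard-library-only sketch. It is not how Spidermon or the `jsonschema` library implements validation; `check_item` and `_check_value` are hypothetical helpers written for illustration, and they handle only the keywords that appear in the schema above (`required`, `properties`, `type`, `pattern`, `items`).

```python
import json
import re

# The schema added in this change, embedded so the sketch is self-contained.
QUOTE_SCHEMA = json.loads("""
{
  "$schema": "http://json-schema.org/draft-07/schema",
  "type": "object",
  "properties": {
    "quote": {"type": "string"},
    "author": {"type": "string"},
    "author_url": {"type": "string", "pattern": ""},
    "tags": {"type": "array", "items": {"type": "string"}}
  },
  "required": ["quote", "author", "author_url"]
}
""")

# JSON Schema type names used by this schema, mapped to Python types.
_TYPES = {"string": str, "array": list, "object": dict}


def check_item(item, schema):
    """Return a list of error strings; an empty list means the item passed.

    Hypothetical helper: covers only the keywords used in QUOTE_SCHEMA.
    """
    errors = []
    for name in schema.get("required", []):
        if name not in item:
            errors.append(f"{name}: required field is missing")
    for name, rules in schema.get("properties", {}).items():
        if name in item:
            errors.extend(_check_value(name, item[name], rules))
    return errors


def _check_value(path, value, rules):
    """Check one value against the 'type', 'pattern' and 'items' rules."""
    expected = _TYPES.get(rules.get("type"))
    if expected is not None and not isinstance(value, expected):
        return [f"{path}: expected {rules['type']}"]
    errors = []
    if isinstance(value, str) and "pattern" in rules:
        # JSON Schema patterns are unanchored, so a search is used here.
        if re.search(rules["pattern"], value) is None:
            errors.append(f"{path}: does not match pattern")
    if isinstance(value, list) and "items" in rules:
        for index, element in enumerate(value):
            errors.extend(
                _check_value(f"{path}[{index}]", element, rules["items"])
            )
    return errors
```

Note that the empty `pattern` on `author_url` matches any string, so in this sketch the field is effectively checked only for presence and type.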
2 changes: 1 addition & 1 deletion examples/tutorial/tutorial/settings.py
@@ -15,7 +15,7 @@
SPIDERMON_SLACK_RECIPIENTS = ["@yourself", "#yourprojectchannel"]

ITEM_PIPELINES = {"spidermon.contrib.scrapy.pipelines.ItemValidationPipeline": 800}
SPIDERMON_VALIDATION_MODELS = ("tutorial.validators.QuoteItem",)
SPIDERMON_VALIDATION_SCHEMAS = ("../schemas/quote_item.json",)

SPIDERMON_VALIDATION_ADD_ERRORS_TO_ITEMS = True
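
The docs in this change use `./schemas/quote_item.json` while the example settings use `../schemas/quote_item.json`, which suggests the relative path is sensitive to where the spider is started from. One possible way to sidestep that, assuming the pipeline accepts an absolute path string (this diff does not show how the path is opened), is to build the path from the location of the settings module. `schema_path` below is a hypothetical helper, not part of Spidermon.

```python
from pathlib import Path


def schema_path(settings_file, relative):
    """Return an absolute path string for a schema file.

    Hypothetical helper: 'relative' is interpreted relative to the
    directory that contains 'settings_file', not the working directory.
    """
    base_dir = Path(settings_file).resolve().parent
    return str((base_dir / relative).resolve())
```

In `settings.py` this could be used as `SPIDERMON_VALIDATION_SCHEMAS = (schema_path(__file__, "schemas/quote_item.json"),)`, again under the assumption that absolute paths are accepted.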

9 changes: 0 additions & 9 deletions examples/tutorial/tutorial/validators.py

This file was deleted.

1 change: 0 additions & 1 deletion requirements.txt
@@ -3,7 +3,6 @@ slack-sdk
boto
premailer
jsonschema[format]
schematics==2.1.0
python-slugify
scrapy
pytest
2 changes: 0 additions & 2 deletions setup.py
@@ -43,8 +43,6 @@
"premailer",
"sentry-sdk",
],
# Data validation
"validation": ["schematics"],
# Tools to run the tests
"tests": test_requirements,
# Tools to build and publish the documentation
16 changes: 2 additions & 14 deletions spidermon/contrib/scrapy/pipelines.py
@@ -2,12 +2,10 @@
from itemadapter import ItemAdapter

from scrapy.exceptions import DropItem, NotConfigured
from scrapy.utils.misc import load_object
from scrapy import Field, Item
from scrapy import Item

from spidermon.contrib.validation import SchematicsValidator, JSONSchemaValidator
from spidermon.contrib.validation import JSONSchemaValidator
from spidermon.contrib.validation.jsonschema.tools import get_schema_from
from schematics.models import Model

from .stats import ValidationStatsManager

@@ -59,7 +57,6 @@ def set_validators(loader, schema):

for loader, name in [
(cls._load_jsonschema_validator, "SPIDERMON_VALIDATION_SCHEMAS"),
(cls._load_schematics_validator, "SPIDERMON_VALIDATION_MODELS"),
]:
res = crawler.settings.get(name)
if not res:
@@ -100,15 +97,6 @@ def _load_jsonschema_validator(cls, schema):
)
return JSONSchemaValidator(schema)

@classmethod
def _load_schematics_validator(cls, model_path):
model_class = load_object(model_path)
if not issubclass(model_class, Model):
raise NotConfigured(
"Invalid model, models must subclass schematics.models.Model"
)
return SchematicsValidator(model_class)

def process_item(self, item, _):
validators = self.find_validators(item)
if not validators:
1 change: 0 additions & 1 deletion spidermon/contrib/validation/__init__.py
@@ -1,2 +1 @@
from .schematics.validator import SchematicsValidator
from .jsonschema.validator import JSONSchemaValidator
Empty file.
39 changes: 0 additions & 39 deletions spidermon/contrib/validation/schematics/monkeypatches.py

This file was deleted.
