Skip to content

Conversation

@kou
Copy link
Member

@kou kou commented May 27, 2025

Rationale for this change

We want to migrate to pre-commit from archery lint.

What changes are included in this PR?

Use pre-commit for numpydoc but this doesn't support Cython files.

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

@kou kou requested review from AlenkaF, raulcd and rok as code owners May 27, 2025 03:06
@github-actions
Copy link

⚠️ GitHub issue #46546 has been automatically assigned in GitHub to PR creator.

@kou
Copy link
Member Author

kou commented May 27, 2025

archery lint has our own Cython integration code:

class NumpyDoc:
IGNORE_VALIDATION_ERRORS_FOR_TYPE = {
# Enum function signatures should never be documented
EnumMeta: ["PR01"]
}
def __init__(self, symbols=None):
if not have_numpydoc:
raise RuntimeError(
'Numpydoc is not available, install with command: '
'pip install numpydoc==1.1.0'
)
self.symbols = set(symbols or {'pyarrow'})
def traverse(self, fn, obj, from_package):
"""Apply a function on publicly exposed API components.
Recursively iterates over the members of the passed object. It omits
any '_' prefixed and thirdparty (non pyarrow) symbols.
Parameters
----------
fn : callable
A function to apply on all traversed objects.
obj : Any
The object to start from.
from_package : string
Predicate to only consider objects from this package.
"""
todo = [obj]
seen = set()
while todo:
obj = todo.pop()
if obj in seen:
continue
else:
seen.add(obj)
fn(obj)
for name in dir(obj):
if name.startswith('_'):
continue
member = getattr(obj, name)
module = _get_module(member)
if module is None or not module.startswith(from_package):
continue
# Is it a Cython-generated method? If so, try to detect
# whether it only has a implicitly-generated docstring,
# and no user-defined docstring following it.
# The generated docstring would lack description of method
# parameters and therefore fail Numpydoc validation.
if hasattr(member, '__objclass__'):
doc = getattr(member, '__doc__', None)
# The Cython-generated docstring would be a one-liner,
# such as "ReadOptions.equals(self, ReadOptions other)".
if (doc and '\n' not in doc and f'.{name}(' in doc):
continue
todo.append(member)
@contextmanager
def _apply_patches(self):
"""
Patch Docstring class to bypass loading already loaded python objects.
"""
orig_load_obj = Docstring._load_obj
orig_signature = inspect.signature
@staticmethod
def _load_obj(obj):
# By default it expects a qualname and import the object, but we
# have already loaded object after the API traversal.
if isinstance(obj, str):
return orig_load_obj(obj)
else:
return obj
def signature(obj):
# inspect.signature tries to parse __text_signature__ if other
# properties like __signature__ doesn't exists, but cython
# doesn't set that property despite that embedsignature cython
# directive is set. The only way to inspect a cython compiled
# callable's signature to parse it from __doc__ while
# embedsignature directive is set during the build phase.
# So path inspect.signature function to attempt to parse the first
# line of callable.__doc__ as a signature.
try:
return orig_signature(obj)
except Exception as orig_error:
try:
return inspect_signature(obj)
except Exception:
raise orig_error
try:
Docstring._load_obj = _load_obj
inspect.signature = signature
yield
finally:
Docstring._load_obj = orig_load_obj
inspect.signature = orig_signature
def validate(self, from_package='', allow_rules=None,
disallow_rules=None):
results = []
def callback(obj):
try:
result = validate(obj)
except OSError as e:
symbol = f"{_get_module(obj, default='')}.{obj.__name__}"
logger.warning(f"Unable to validate `{symbol}` due to `{e}`")
return
errors = []
for errcode, errmsg in result.get('errors', []):
if allow_rules and errcode not in allow_rules:
continue
if disallow_rules and errcode in disallow_rules:
continue
if any(isinstance(obj, obj_type) and errcode in errcode_list
for obj_type, errcode_list
in NumpyDoc.IGNORE_VALIDATION_ERRORS_FOR_TYPE.items()):
continue
errors.append((errcode, errmsg))
if len(errors):
result['errors'] = errors
results.append((obj, result))
with self._apply_patches():
for symbol in self.symbols:
try:
obj = Docstring._load_obj(symbol)
except (ImportError, AttributeError):
print(f'{symbol} is not available for import')
else:
self.traverse(callback, obj, from_package=from_package)
return results

But we can't use it with numpydoc lint.

Do we need the Cython integration?

^python/pyarrow/
exclude: >-
(
?^python/pyarrow/interchange/from_dataframe\.py|
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JFYI: other files contains $ character like ?^python/pyarrow/interchange/from_dataframe\.py$

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!
I've added $.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels May 27, 2025
@kou
Copy link
Member Author

kou commented May 28, 2025

@pitrou Is Cython docstring validation by numpydoc important for us?

It was implemented by #12698 only for us. If it's important, can we implement it in upstream instead of implementing it in Archery?

@pitrou
Copy link
Member

pitrou commented May 28, 2025

@pitrou Is Cython docstring validation by numpydoc important for us?

We want to validate docstrings regardless of how the functions are generated, yes.

It was implemented by #12698 only for us. If it's important, can we implement it in upstream instead of implementing it in Archery?

We can probably do that. However, we may have to wait for cython/cython#6904 to be released first.

cc @raulcd

@kou
Copy link
Member Author

kou commented May 30, 2025

OK. I've opened #46641 for it.

I'll merge this if nobody objects it. Because we can remove Archery based lint from our lint CI job by merging this. It will solve #46528.

@pitrou
Copy link
Member

pitrou commented May 30, 2025

The problem is we don't know if numpydoc will accept a contribution for it.

We can still use Archery-based lint for numpydoc. I don't think there's any urgency.

@kou
Copy link
Member Author

kou commented Jun 2, 2025

Does it mean that we should not merge this without Cython support?

@pitrou
Copy link
Member

pitrou commented Jun 2, 2025

Sorry. Yes, you're right that this PR can be merged. But we shouldn't disable the corresponding Archery test until numpydoc is enhanced.

@kou
Copy link
Member Author

kou commented Jun 2, 2025

Sure. We must remove

arrow/docker-compose.yml

Lines 1526 to 1550 in 19aadb2

conda-python-docs:
# Usage:
# archery docker run conda-python-docs
#
# Only a single rule is enabled for now to check undocumented arguments.
# We should extend the list of enabled rules after adding this build to
# the CI pipeline.
image: ${REPO}:${ARCH}-conda-python-${PYTHON}-pandas-${PANDAS}
cap_add:
- SYS_ADMIN
environment:
<<: [*common, *ccache]
ARROW_SUBSTRAIT: "ON"
LC_ALL: "C.UTF-8"
LANG: "C.UTF-8"
BUILD_DOCS_CPP: "ON"
BUILD_DOCS_PYTHON: "ON"
PYTEST_ARGS: "--doctest-modules --doctest-cython"
volumes: *conda-volumes
command:
["/arrow/ci/scripts/cpp_build.sh /arrow /build &&
/arrow/ci/scripts/python_build.sh /arrow /build &&
pip install -e /arrow/dev/archery[numpydoc] &&
archery numpydoc --allow-rule GL10,PR01,PR03,PR04,PR05,PR10,RT03,YD01 &&
/arrow/ci/scripts/python_test.sh /arrow"]
only after #46641 is completed.

I'll merge this.

@kou kou merged commit dfac0cc into apache:main Jun 2, 2025
12 of 13 checks passed
@kou kou deleted the pre-commit-numpydoc branch June 2, 2025 07:29
@kou kou removed the awaiting changes Awaiting changes label Jun 2, 2025
@conbench-apache-arrow
Copy link

After merging your PR, Conbench analyzed the 0 benchmarking runs that have been run so far on merge-commit dfac0cc.

None of the specified runs were found on the Conbench server.

The full Conbench report has more details.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants