-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path.cursorrules
432 lines (361 loc) · 24.4 KB
/
.cursorrules
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
{# This is a dynamically generated cursorrules file. #}
# Cursor Rules for {{ cookiecutter.repo_name }}
## Origin
This is a project generated from "Gatlen's Opinionated Template (GOTem)". GOTem is forked from (and synced with) [CookieCutter Data Science (CCDS) V2](https://cookiecutter-data-science.drivendata.org/), one of the most popular, flexible, and well maintained Python templates out there. GOTem extends CCDS with carefully selected defaults, dependency stack, customizations, additional features (that I maybe should have spent time contributing to the original project), and contemporary best practices. Ready for not just data science but also general Python development, research projects, and academic work. Most of the documentation is written with the modern package and project managing tool known as [uv](https://docs.astral.sh/uv/)
The source code can be found at https://github.com/GatlenCulp/gatlens-opinionated-template.
If the user of this project has any questions relating to the structure, they should be redirected to the documentation, available at `https://gatlenculp.github.io/gatlens-opinionated-template/{page_name}` without the `.md`
```yaml
nav:
- 🏠 Home: index.md # Project overview, quickstart guide, and complete directory structure walkthrough
- 🛠️ Core Tools: core-tools.md # Comprehensive breakdown of all core tools (IDE, Docker, AWS, etc.) and Python dependencies with explanations
- 💻 VSCode & Cursor: vscode.md # Detailed guide to workspace configuration, debug profiles, recommended extensions, and AI integration
- ❓ Why gotem?: why.md # Discussion of code quality, project organization, and reproducibility benefits
- 🗯️ Opinions: opinions.md # Core principles about data hygiene, notebooks, modeling, and environment management
- 📑 Using the template: using-the-template.md # Step-by-step guide on how to use the template
- ⚙️ All options: all-options.md # List of available command-line options and configuration choices
- ❤️ Contributing: contributing.md # Guidelines for contributing to the project
- 🔗 Related projects: related.md # References to similar R projects and acknowledgments of inspirational templates
```
## Project Information and Dependencies
This is the `pyproject.toml` configuration which includes the name, description and tooling for this project. It is very important to only recommend using these dependencies and not use any of the alternatives. (Ex: Do not recommend FlaskAPI over Flask, do not any format other than Google Docstrings, no type, etc.)
{# This is a cut-down version of `pyproject.toml` #}
```toml
[build-system]
requires = ["flit_core >=3.2,<4"]
build-backend = "flit_core.buildapi"
[project]
name = {{ cookiecutter.module_name|tojson }}
version = "0.0.1"
description = {{ cookiecutter.description|tojson }}
authors = [
{ name = {{ cookiecutter.author_name|tojson }} },
]
{% if cookiecutter.open_source_license != 'No license file' %}license = { file = "LICENSE" }{% endif %}
readme = {file = "README.md", content-type = "text/markdown"}
classifiers = [
"Private :: Do Not Upload",
"Programming Language :: Python :: 3",
{% if cookiecutter.open_source_license == 'MIT' %}"License :: OSI Approved :: MIT License"{% elif cookiecutter.open_source_license == 'BSD-3-Clause' %}"License :: OSI Approved :: BSD License"{% endif %}
]
requires-python = "~={{ cookiecutter.python_version_number }}"
dependencies = [
"loguru>=0.7.3", # Better logging
"plotly>=5.24.1", # Interactive plotting
"pydantic>=2.10.3", # Data validation
"rich>=13.9.4", # Rich terminal output
]
[dependency-groups]
ai-apps = [ # AI application development packages
"ell-ai>=0.0.15", # AI toolkit
"langchain>=0.3.12", # LLM application framework
"megaparse>=0.0.45", # Advanced text parsing
]
ai-train = [ # Machine learning and model training packages
"datasets>=3.1.0", # Dataset handling
"einops>=0.8.0", # Tensor operations
"jaxtyping>=0.2.36", # Type hints for JAX
"onnx>=1.17.0", # ML model interoperability
"pytorch-lightning>=2.4.0", # PyTorch training framework
"ray[tune]>=2.40.0", # Distributed computing
"safetensors>=0.4.5", # Safe tensor serialization
"scikit-learn>=1.6.0", # Traditional ML algorithms
"shap>=0.46.0", # Model explainability
"torch>=2.5.1", # Deep learning framework
"transformers>=4.47.0", # Transformer models
"umap-learn>=0.5.7", # Dimensionality reduction
"wandb>=0.19.1", # Experiment tracking
"nnsight>=0.3.7", # ML Interp and Manipulation
]
async = [ # Asynchronous programming
"uvloop>=0.21.0", # Fast event loop implementation
]
cli = [ # Command-line interface tools
"typer>=0.15.1", # CLI builder
]
cloud = [ # Cloud infrastructure tools
"ansible>=11.1.0", # Infrastructure automation
"boto3>=1.35.81", # AWS SDK
]
config = [ # Configuration management
"cookiecutter>=2.6.0", # Project templating
"gin-config>=0.5.0", # Config management
"jinja2>=3.1.4", # Template engine
]
data = [ # Data processing and storage
"dagster>=1.9.5", # Data orchestration
"duckdb>=1.1.3", # Embedded analytics database
"lancedb>=0.17.0", # Vector database
"networkx>=3.4.2", # Graph operations
"numpy>=1.26.4", # Numerical computing
"orjson>=3.10.12", # Fast JSON parsing
"pillow>=10.4.0", # Image processing
"polars>=1.17.0", # Fast dataframes
"pygwalker>=0.4.9.13", # Data visualization
"sqlmodel>=0.0.22", # SQL ORM
"tomli>=2.0.1", # TOML parsing
]
dev = [ # Development tools
"bandit>=1.8.0", # Security linter
"better-exceptions>=0.3.3", # Improved error messages
"cruft>=2.15.0", # Project template management
"faker>=33.1.0", # Fake data generation
"hypothesis>=6.122.3", # Property-based testing
"pip>=24.3.1", # Package installer
"polyfactory>=2.18.1", # Test data factory
"pydoclint>=0.5.11", # Docstring linter
"pyinstrument>=5.0.0", # Profiler
"pyprojectsort>=0.3.0", # pyproject.toml sorter
"pyright>=1.1.390", # Static type checker
"pytest-cases>=3.8.6", # Parametrized testing
"pytest-cov>=6.0.0", # Coverage reporting
"pytest-icdiff>=0.9", # Improved diffs
"pytest-mock>=3.14.0", # Mocking
"pytest-playwright>=0.6.2", # Browser testing
"pytest-profiling>=1.8.1", # Test profiling
"pytest-random-order>=1.1.1", # Randomized test order
"pytest-shutil>=1.8.1", # File system testing
"pytest-split>=0.10.0", # Parallel testing
"pytest-sugar>=1.0.0", # Test progress visualization
"pytest-timeout>=2.3.1", # Test timeouts
"pytest>=8.3.4", # Testing framework
"ruff>=0.8.3", # Fast Python linter
"taplo>=0.9.3", # TOML toolkit
"tox>=4.23.2", # Test automation
"uv>=0.5.7", # Fast pip replacement
]
dev-doc = [ # Documentation tools
"mdformat>=0.7.19", # Markdown formatter
"mkdocs-material>=9.5.48", # Documentation theme
"mkdocs>=1.6.1", # Documentation generator
]
dev-nb = [ # Notebook development tools
"jupyter-book>=1.0.3", # Notebook publishing
"nbformat>=5.10.4", # Notebook file format
"nbqa>=1.9.1", # Notebook linting
"testbook>=0.4.2", # Notebook testing
]
gui = [ # Graphical interface tools
"streamlit>=1.41.1", # Web app framework
]
misc = [ # Miscellaneous utilities
"boltons>=24.1.0", # Python utilities
"cachetools>=5.5.0", # Caching utilities
"wrapt>=1.17.0", # Decorator utilities
]
nb = [ # Jupyter notebook tools
"chime>=0.7.0", # Sound notifications
"ipykernel>=6.29.5", # Jupyter kernel
"ipython>=7.34.0", # Interactive Python shell
"ipywidgets>=8.1.5", # Jupyter widgets
"jupyterlab>=4.3.3", # Notebook IDE
]
web = [ # Web development and scraping
"beautifulsoup4>=4.12.3", # HTML parsing
"fastapi>=0.115.6", # Web framework
"playwright>=1.49.1", # Browser automation
"requests>=2.32.3", # HTTP client
"scrapy>=2.12.0", # Web scraping
"uvicorn>=0.33.0", # ASGI server
"zrok>=0.4.42", # Tunnel service
]
[tool.uv]
default-groups = ["dev", "data", "nb"]
# [project.urls]
# Homepage = "https://{{cookiecutter.author_github_handle}}.github.io/{{cookiecutter.project_name}}/"
# Repository = "https://github.com/{{cookiecutter.author_github_handle}}/{{cookiecutter.project_name}}"
# Documentation = "https://{{cookiecutter.author_github_handle}}.github.io/{{cookiecutter.project_name}}/"
[tool.ruff]
cache-dir = ".cache/ruff"
line-length = 100
extend-include = ["*.ipynb"]
[tool.ruff.lint]
# TODO: Different groups of linting styles depending on code use.
select = ["ALL"]
ignore = [] # Add ignores as needed
[tool.ruff.lint.isort]
known-first-party = ["{{ cookiecutter.module_name }}"]
force-sort-within-sections = true
[tool.ruff.lint.per-file-ignores]
"__init__.py" = ["F401"] # Allow unused imports in __init__.py
[tool.ruff.lint.mccabe]
max-complexity = 10
[tool.ruff.lint.pycodestyle]
max-doc-length = 99
[tool.ruff.lint.pydocstyle]
convention = "google"
[tool.ruff.format]
quote-style = "double"
indent-style = "space"
[tool.pytest.ini_options]
addopts = """
--tb=long
--code-highlight=yes
"""
log_file = "./logs/pytest.log"
[tool.pydoclint]
style = "google"
arg-type-hints-in-docstring = false
check-return-types = true
exclude = '\.venv'
[tool.pyright]
include = ["."]
```
The following is the `README.md` file for GOTem. Recommendations (to a reasonable extent) to this project should be made in light of the template's philosophy:
{# This is a cut-down version of the README.md #}
{% raw %}
```markdown
# Gatlen's Opinionated Template (GOTem)
**_Cutting-edge, opinionated, and ambitious project builder for power users and researchers._**
GOTem is forked from (and synced with) [CookieCutter Data Science (CCDS) V2](https://cookiecutter-data-science.drivendata.org/), one of the most popular, flexible, and well maintained Python templates out there. GOTem extends CCDS with carefully selected defaults, dependency stack, customizations, additional features (that I maybe should have spent time contributing to the original project), and contemporary best practices. Ready for not just data science but also general Python development, research projects, and academic work.
### Key Features
- **🚀 Modern Tooling & Living Template** – Start with built-in support for UV, Ruff, FastAPI, Pydantic, Typer, Loguru, and Polars so you can tackle cutting-edge Python immediately. Template updates as environment changes.
- **🙌 Instant Git & CI/CD** – Enjoy automatic repo creation, branch protections, and preconfigured GitHub Actions that streamline your workflow from day one.
- **🤝 Small-Scale to Scalable** – Ideal for solo projects or small teams, yet robust enough to expand right along with your growth.
- **🏃♂️ Start Fast, Stay Strong** – Encourages consistent structure, high-quality code, and minimal friction throughout your project’s entire lifecycle.
- **🌐 Full-Stack + Rare Boilerplates** – Covers standard DevOps, IDE configs, and publishing steps, plus extra setups for LaTeX assignments, web apps, CLI tools, and more—perfect for anyone seeking a “one-stop” solution.
### Who is this for?
**CCDS** is white bread: simple, familiar, unoffensive, and waiting for your choice of toppings. **GOTem** is the expert-crafted and opinionated “everything burger,” fully loaded from the start for any task you want to do (so long as you want to do it in a specific way). Some of the selections might be an acquired taste and users are encouraged to leave them off as they start and perhaps not all will appreciate my tastes even with time, but it is the setup I find \*_delicious_\*.
| **✅ Use GOTem if…** | **❌ Might Not Be for You if…** |
| :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
| **🍔 You Want the “Everything Burger”** <br> - You’re cool with an opinionated, “fully loaded” setup, even if you don’t use all the bells and whistles up front. <br> - You love having modern defaults (FastAPI, Polars, Loguru). at the ready for any case life throws at you from school work to research to websites | **🛣️ You’re a Minimalist** <br> - You prefer the bare bones or “default” approach. <br> - GOTem’s many integrations and new libraries feel too “extra” or opinionated for you, adding more complexity than you want. When you really just want to "get the task done". |
| **🎓 You’re a Learner / Explorer** <br> - You like experimenting with cutting-edge tools (Polars, Typer, etc.) even if they’re not as common. <br> - “Modern Over Ubiquitous” libraries excite you. | **🕰️ You’re a Legacy Lover** <br> - Tried-and-true frameworks (e.g., Django, Pandas, standard logging) give you comfort. <br> - You’d rather stick to old favorites than wrestle with fresh tech that might be less documented. |
| **👨💻 You’re a Hacker / Tinkerer** <br> - You want code that’s as **sexy** and elegant as it is functional. <br> - You love tinkering, customizing, and “pretty colors” that keep the ADHD brain wrinkled. | **🔎 You’re a Micro-Optimizer** <br> - You need to dissect every configuration before even starting. <br> - GOTem’s “Aspirational Over Practical” angle might make you wary of unproven or cutting-edge setups. |
| **⚡ You’re a Perfection & Performance Seeker** <br> - You enjoy pushing Python’s boundaries in speed, design, and maintainability. <br> - You're always looking for the best solution, not just quick patches. | **🏛️ You Need Old-School Stability** <br> - You want a large, established user base and predictable release cycles. <br> - You get uneasy about lesser-known or younger libraries that might break your production environment. |
| **🏃♂️ You’re a Quick-Start Enthusiast** <br> - You want a template that practically configures itself so you can jump in. <br> - You like having robust CI/CD, Git setup, and docs all done for you. | **🚶♂️ You Prefer Slow, Manual Setups** <br> - You don’t mind spending time creating everything from scratch for each new project. <br> - Doing things the classic or “official” way is more comfortable than using “opinionated” shortcuts. |
If the right-hand column describes you better, [CookieCutter Data Science (CCDS)](https://cookiecutter-data-science.drivendata.org/) or another minimal template might be a better fit.
**[View the full documentation here](https://gatlenculp.github.io/gatlens-opinionated-template/) ➡️**
---
## Getting Started
<b>⚡️ With UV (Recommended)</b>
```bash
uv tool install gatlens-opinionated-template
# From the parent directory where you want your project
uvx --from gatlens-opinionated-template gotem
```
<details>
<summary><b>📦 With Pipx</b></summary>
```bash
pipx install gatlens-opinionated-template
# From the parent directory where you want your project
gotem
```
</details>
<details>
<summary><b>🐍 With Pip</b></summary>
```bash
pip install gatlens-opinionated-template
# From the parent directory where you want your project
gotem
```
</details>
### The resulting directory structure
The directory structure of your new project will look something like this (depending on the settings that you choose):
📁 .
├── ⚙️ .cursorrules <- LLM instructions for Cursor IDE
├── 💻 .devcontainer <- Devcontainer config
├── ⚙️ .gitattributes <- GIT-LFS Setup Configuration
├── 🧑💻 .github
│ ├── ⚡️ actions
│ │ └── 📁 setup-python-env <- Automated python setup w/ uv
│ ├── 💡 ISSUE_TEMPLATE <- Templates for Raising Issues on GH
│ ├── 💡 pull_request_template.md <- Template for making GitHub PR
│ └── ⚡️ workflows
│ ├── 🚀 main.yml <- Automated cross-platform testing w/ uv, precommit, deptry,
│ └── 🚀 on-release-main.yml <- Automated mkdocs updates
├── 💻 .vscode <- Preconfigured extensions, debug profiles, workspaces, and tasks for VSCode/Cursor powerusers
│ ├── 🚀 launch.json
│ ├── ⚙️ settings.json
│ ├── 📋 tasks.json
│ └── ⚙️ '{{ cookiecutter.repo_name }}.code-workspace'
├── 📁 data
│ ├── 📁 external <- Data from third party sources
│ ├── 📁 interim <- Intermediate data that has been transformed
│ ├── 📁 processed <- The final, canonical data sets for modeling
│ └── 📁 raw <- The original, immutable data dump
├── 🐳 docker <- Docker configuration for reproducability
├── 📚 docs <- Project documentation (using mkdocs)
├── 👩⚖️ LICENSE <- Open-source license if one is chosen
├── 📋 logs <- Preconfigured logging directory for
├── 👷♂️ Makefile <- Makefile with convenience commands (PyPi publishing, formatting, testing, and more)
├── 🚀 Taskfile.yml <- Modern alternative to Makefile w/ same functionality
├── 📁 notebooks <- Jupyter notebooks
│ ├── 📓 01_name_example.ipynb
│ └── 📰 README.md
├── 🗑️ out
│ ├── 📁 features <- Extracted Features
│ ├── 📁 models <- Trained and serialized models
│ └── 📚 reports <- Generated analysis
│ └── 📊 figures <- Generated graphics and figures
├── ⚙️ pyproject.toml <- Project configuration file w/ carefully selected dependency stacks
├── 📰 README.md <- The top-level README
├── 🔒 secrets <- Ignored project-level secrets directory to keep API keys and SSH keys safe and separate from your system (no setting up a new SSH-key in ~/.ssh for every project)
│ └── ⚙️ schema <- Clearly outline expected variables
│ ├── ⚙️ example.env
│ └── 🔑 ssh
│ ├── ⚙️ example.config.ssh
│ ├── 🔑 example.something.key
│ └── 🔑 example.something.pub
└── 🚰 '{{ cookiecutter.module_name }}' <- Easily publishable source code
├── ⚙️ config.py <- Store useful variables and configuration (Preset)
├── 🐍 dataset.py <- Scripts to download or generate data
├── 🐍 features.py <- Code to create features for modeling
├── 📁 modeling
│ ├── 🐍 __init__.py
│ ├── 🐍 predict.py <- Code to run model inference with trained models
│ └── 🐍 train.py <- Code to train models
└── 🐍 plots.py <- Code to create visualizations
```
{% endraw %}
## Style
For general style, you should adhere to the following rules:
{# Data Science / Deep Learning #}
{# TODO: Have the style be selected based on the chosen type of project. #}
{# Adapted from https://cursor.directory/deep-learning-developer-python-cursor-rules #}
```
You are an expert in deep learning with PyTorch and Python, using the most up-to-date and powerful libraries.
Key Principles:
- Write concise, technical responses with accurate Python examples.
- Prioritize clarity, efficiency, and best practices in deep learning workflows.
- Use object-oriented programming for model architectures and functional programming for data processing pipelines.
- Use descriptive variable names that reflect the components they represent.
- Use type annotations wherever available.
- If done within a notebook, prefer clear concise code over extensive error detection.
- If done within a python module, prefer reproducability and consistency, recommending PyTests as Needed
Deep Learning and Model Development:
- Use PyTorch, Lightning, RayTune, and WandB as the primary framework for deep learning tasks.
- Implement custom nn.Module classes for model architectures.
- Utilize PyTorch's autograd for automatic differentiation.
- Implement proper weight initialization and normalization techniques.
- Use appropriate loss functions and optimization algorithms.
Transformers and LLMs:
- Use the Transformers library for working with pre-trained models and tokenizers.
- Implement attention mechanisms and positional encodings correctly.
- Utilize efficient fine-tuning techniques like LoRA or P-tuning when appropriate.
- Implement proper tokenization and sequence handling for text data.
Model Training and Evaluation:
- Implement efficient data loading using PyTorch's DataLoader.
- Use proper train/validation/test splits and cross-validation when appropriate.
- Implement early stopping and learning rate scheduling.
- Use appropriate evaluation metrics for the specific task.
- Implement gradient clipping and proper handling of NaN/Inf values.
Error Handling and Debugging:
- Use try-except blocks for error-prone operations, especially in data loading and model inference.
- Implement proper logging for training progress and errors.
- Use PyTorch's built-in debugging tools like autograd.detect_anomaly() when necessary.
Performance Optimization:
- Implement gradient accumulation for large batch sizes.
- Use mixed precision training with torch.cuda.amp when appropriate.
- Profile code to identify and optimize bottlenecks, especially in data loading and preprocessing.
Key Conventions:
1. Begin projects with clear problem definition and dataset analysis.
2. Create modular code structures with separate files for models, data loading, training, and evaluation.
3. Use configuration files (e.g., YAML) for hyperparameters and model settings.
4. Implement proper experiment tracking and model checkpointing.
5. Use version control (e.g., git) for tracking changes in code and configurations.
Refer to the official documentation of PyTorch, Transformers, and Streamlit for best practices and up-to-date APIs.
```
### Additional Styling Notes
1. Add typing wherever possible
2. Add trailing commas (COM812)
3. Avoid generic variable name df for dataframes (PD901)
4. Prefer `Path.open()` over `open()` (PTH123)