3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,9 @@ releases are available on [PyPI](https://pypi.org/project/pytask) and

## Unreleased

- {issue}`735` adds the `pytask.lock` lockfile as the primary state backend with a
portable format, documentation, and a one-run SQLite fallback when no lockfile
exists.
- {pull}`766` moves runtime profiling persistence from SQLite to a JSON snapshot plus
append-only journal in `.pytask/`, keeping runtime data resilient to crashes and
compacted on normal build exits.
1 change: 1 addition & 0 deletions docs/source/how_to_guides/index.md
@@ -13,6 +13,7 @@ maxdepth: 1
---
migrating_from_scripts_to_pytask
interfaces_for_dependencies_products
portability
remote_files
functional_interface
capture_warnings
88 changes: 88 additions & 0 deletions docs/source/how_to_guides/portability.md
@@ -0,0 +1,88 @@
# Portability

This guide explains what you need to do to move a pytask project between machines and
why the lockfile is central to that process.

```{seealso}
The lockfile format and behavior are documented in the
[reference guide](../reference_guides/lockfile.md).
```

## How to port a project

Use this checklist when you move a project to another machine or environment.

1. **Update state once on the source machine.**

Run a normal build so `pytask.lock` is up to date:

```console
$ pytask build
```

If you already have a recent lockfile and up-to-date outputs, you can skip this step.

1. **Ship the right files.**

Commit `pytask.lock` to your repository and move it with the project. In practice,
you should move:

- the project files tracked in version control (source, configuration, data inputs
and `pytask.lock`)
- the build artifacts you want to reuse (often in `bld/` if you follow the tutorial
layout)
- the `.pytask` folder, if you use the data catalog and it manages some of the
files

1. **Mirror files outside the project.**

If you have files outside the project root (the folder with the `pyproject.toml`
file), you need to make sure that the same relative layout exists on the target
machine.

1. **Run pytask on the target machine.**

When states match, tasks are skipped. When they differ, tasks run and the lockfile is
updated.

## What makes a project portable

There are two things that must stay stable across machines:

First, task and node IDs must be stable. An ID is the unique identifier that ties a task
or node to an entry in `pytask.lock`. pytask builds these IDs from project-relative
paths anchored at the project root, so most users do not need to do anything. If you
implement custom nodes, make sure their IDs remain project-relative and stable across
machines.
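
As a rough illustration of what "project-relative and stable" means, the sketch below derives an ID from a path and the project root. `portable_id` is a hypothetical helper written for this guide, not pytask's own ID builder, and it assumes POSIX-style separators are the canonical form.

```python
import os
from pathlib import Path


def portable_id(path: Path, root: Path) -> str:
    """Return an ID for ``path`` relative to the project ``root``.

    Hypothetical helper for illustration; not part of pytask's API.
    """
    # A relative path drops machine-specific prefixes such as the home directory.
    relative = os.path.relpath(path, root)
    # Forward slashes keep the ID identical on Windows and POSIX systems.
    return Path(relative).as_posix()


# The same file in two different checkouts yields the same ID.
on_laptop = portable_id(
    Path("/home/alice/project/data/raw.csv"), Path("/home/alice/project")
)
on_server = portable_id(
    Path("/srv/build/project/data/raw.csv"), Path("/srv/build/project")
)
print(on_laptop == on_server)
```

A file outside the project root resolves to an ID that starts with `..`, which is why the same relative layout has to exist on the target machine.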

Second, state values must be portable. The lockfile stores opaque state strings from
`PNode.state()` and `PTask.state()`, and pytask uses them to decide whether a task is up
to date. Content hashes are portable; timestamps or absolute paths are not. This mostly
matters when you define custom nodes or custom hash functions.

## Tips for stable state values

- Prefer file content hashes over timestamps for custom nodes.
- For `PythonNode` values that are not natively stable, provide a custom hash function.
- Avoid machine-specific paths or timestamps in custom `state()` implementations.
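
The tips above can be sketched with two small helpers. Both names are hypothetical and only illustrate the idea: `file_content_state` shows the kind of value a custom node's `state()` could return, and `stable_value_hash` shows one possible custom hash function for a JSON-serializable Python value. Neither is part of pytask.

```python
from __future__ import annotations

import hashlib
import json
from pathlib import Path
from typing import Any


def file_content_state(path: Path) -> str | None:
    """Return a SHA-256 digest of the file content, or None if the file is missing.

    Hypothetical helper: the digest depends only on the bytes in the file,
    not on its modification time or absolute location.
    """
    if not path.exists():
        return None
    return hashlib.sha256(path.read_bytes()).hexdigest()


def stable_value_hash(value: Any) -> str:
    """Hash a JSON-serializable value independently of dict insertion order.

    Hypothetical helper: sorting keys makes equal dictionaries hash equally.
    """
    payload = json.dumps(value, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Values that are not JSON-serializable, such as arrays or data frames, would need their own byte representation before hashing.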

```{seealso}
For custom nodes, see [Writing custom nodes](writing_custom_nodes.md).
For hashing guidance, see
[Hashing inputs of tasks](hashing_inputs_of_tasks.md).
```

## Cleaning up the lockfile

`pytask.lock` is updated incrementally. Entries are only replaced when the corresponding
tasks run. If tasks are removed or renamed, their old entries remain as stale data and
are ignored.

To clean up stale entries without deleting the file, run:

```console
$ pytask build --clean-lockfile
```

This rewrites the lockfile after a successful build with only the currently collected
tasks and their current state values.
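
Conceptually, the cleanup keeps only the entries whose IDs belong to currently collected tasks. The sketch below illustrates that filter on a parsed lockfile represented as a plain dictionary. `prune_stale_entries` is a hypothetical name, and the real command also recomputes state values, which this sketch does not do.

```python
from __future__ import annotations


def prune_stale_entries(lock: dict, collected_ids: set[str]) -> dict:
    """Return a copy of ``lock`` without entries for tasks that no longer exist.

    Hypothetical helper for illustration; ``lock`` mirrors the TOML layout with a
    top-level ``task`` list whose items carry an ``id`` field.
    """
    kept = [entry for entry in lock.get("task", []) if entry["id"] in collected_ids]
    return {**lock, "task": kept}


lock = {
    "lock-version": "1",
    "task": [
        {"id": "src/tasks/data.py::task_clean_data", "state": "f9e8d7c6..."},
        {"id": "src/tasks/old.py::task_removed", "state": "0a1b2c3d..."},
    ],
}
cleaned = prune_stale_entries(lock, {"src/tasks/data.py::task_clean_data"})
print([entry["id"] for entry in cleaned["task"]])
```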
12 changes: 7 additions & 5 deletions docs/source/reference_guides/configuration.md
@@ -44,11 +44,13 @@ are welcome to also support macOS.

````{confval} database_url

-pytask uses a database to keep track of tasks, products, and dependencies over runs. By
-default, it will create an SQLite database in the project's root directory called
-`.pytask/pytask.sqlite3`. If you want to use a different name or a different dialect
-[supported by sqlalchemy](https://docs.sqlalchemy.org/en/latest/core/engines.html#backend-specific-urls),
-use either {option}`pytask build --database-url` or `database_url` in the config.
+SQLite is the legacy state format. pytask now uses `pytask.lock` as the primary state
+backend and only consults the database when no lockfile exists. During that first run,
+the lockfile is written. Subsequent runs use only the lockfile and do not update the
+database state.
+
+The `database_url` option remains for backwards compatibility and controls the legacy
+database location and dialect ([supported by sqlalchemy](https://docs.sqlalchemy.org/en/latest/core/engines.html#backend-specific-urls)).

```toml
database_url = "sqlite:///.pytask/pytask.sqlite3"
1 change: 1 addition & 0 deletions docs/source/reference_guides/index.md
@@ -9,6 +9,7 @@ maxdepth: 1
---
command_line_interface
configuration
lockfile
hookspecs
api
```
86 changes: 86 additions & 0 deletions docs/source/reference_guides/lockfile.md
@@ -0,0 +1,86 @@
# The Lockfile

`pytask.lock` is the default state backend. It stores task state in a portable,
git-friendly format so runs can be resumed or shared across machines.

```{note}
SQLite is the legacy format. It is still read when no lockfile exists, and a lockfile
is written during that first run. Subsequent runs use only the lockfile and do not
update the database state.
```

## Example

```toml
# This file is automatically @generated by pytask.
# It is not intended for manual editing.

lock-version = "1"

[[task]]
id = "src/tasks/data.py::task_clean_data"
state = "f9e8d7c6..."

[task.depends_on]
"data/raw/input.csv" = "e5f6g7h8..."

[task.produces]
"data/processed/clean.parquet" = "m3n4o5p6..."
```

## Behavior

On each run, pytask:

1. Reads `pytask.lock` (if present).
1. Compares current dependency/product/task `state()` to stored `state`.
1. Skips tasks whose states match; runs the rest.
1. Updates `pytask.lock` after each completed task (atomic write).
1. Updates `pytask.lock` after skipping unchanged tasks (unless `--dry-run` or
`--explain` is active).
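
The comparison in steps 2 and 3 can be sketched as follows. This is a simplified illustration with a hypothetical function name. It assumes each entry is a dictionary shaped like a `[[task]]` table and does not reproduce pytask's actual implementation.

```python
from __future__ import annotations


def task_is_unchanged(stored: dict | None, current: dict) -> bool:
    """Return True if the stored lockfile entry matches the current states.

    Hypothetical helper: a task counts as unchanged only if its own state and the
    states of all dependencies and products equal the recorded values.
    """
    if stored is None:
        # No entry in the lockfile means the task has never been recorded.
        return False
    return (
        stored.get("state") == current.get("state")
        and stored.get("depends_on", {}) == current.get("depends_on", {})
        and stored.get("produces", {}) == current.get("produces", {})
    )


stored = {
    "id": "src/tasks/data.py::task_clean_data",
    "state": "f9e8d7c6...",
    "depends_on": {"data/raw/input.csv": "e5f6g7h8..."},
    "produces": {"data/processed/clean.parquet": "m3n4o5p6..."},
}
# A changed input file produces a different dependency state.
changed = {**stored, "depends_on": {"data/raw/input.csv": "changed..."}}
print(task_is_unchanged(stored, dict(stored)), task_is_unchanged(stored, changed))
```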

## Portability

There are two portability concerns:

1. **IDs**: Lockfile IDs must be project-relative and stable across machines.
1. **State values**: `state` is opaque; portability depends on each node's `state()`
implementation. Content hashes are portable; timestamps are not.

## Maintenance

Use `pytask build --clean-lockfile` to rewrite `pytask.lock` with only currently
collected tasks. The rewrite happens after a successful build and recomputes current
state values without executing tasks again.

## File Format Reference

### Top-Level

| Field | Required | Description |
| -------------- | -------- | -------------------------------- |
| `lock-version` | Yes | Schema version (currently `"1"`) |

### Task Entry

| Field | Required | Description |
| ------------ | -------- | ----------------------------- |
| `id` | Yes | Portable task identifier |
| `state` | Yes | Opaque state string |
| `depends_on` | No | Mapping from node id to state |
| `produces` | No | Mapping from node id to state |

### Dependency/Product Entry

Node entries are stored as key-value pairs inside `depends_on` and `produces`, where the
key is the node id and the value is the node state string.

## Version Compatibility

Only lock-version `"1"` is supported. A lockfile with any other version raises an error
with a clear upgrade message.

## Implementation Notes

- The lockfile is encoded/decoded with `msgspec`’s TOML support.
- Writes are atomic: pytask writes a temporary file and replaces `pytask.lock`.
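
The write-then-replace pattern mentioned above can be sketched with the standard library. `write_atomically` is a hypothetical helper that illustrates the general technique, not pytask's own code.

```python
import os
import tempfile
from pathlib import Path


def write_atomically(path: Path, content: str) -> None:
    """Write ``content`` to ``path`` so readers never see a partial file.

    Hypothetical helper for illustration.
    """
    # The temporary file lives in the target directory so that os.replace
    # does not have to move data across filesystems.
    fd, tmp_name = tempfile.mkstemp(
        dir=str(path.parent), prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(content)
        # os.replace swaps the new file into place in a single step.
        os.replace(tmp_name, path)
    except BaseException:
        # Clean up the temporary file if writing or replacing failed.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

If the process crashes before the replace, the previous `pytask.lock` is still intact and only a stray temporary file is left behind.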
2 changes: 1 addition & 1 deletion docs/source/tutorials/making_tasks_persist.md
@@ -9,7 +9,7 @@ In this case, you can apply the {func}`@pytask.mark.persist <pytask.mark.persist
decorator to the task, which will skip its execution as long as all products exist.

Internally, the state of the dependencies, the source file, and the products are updated
-in the database such that the subsequent execution will skip the task successfully.
+in the lockfile such that the subsequent execution will skip the task successfully.

## When is this useful?

39 changes: 39 additions & 0 deletions plan.md
@@ -0,0 +1,39 @@
# Plan: Lockfile Review Follow-ups

## Goals
- Align implementation with intended backend behavior (DB read-only when no lockfile; lockfile-only afterwards).
- Fix correctness and compatibility risks in the lockfile implementation.
- Close documentation and test gaps.

## Findings to Address
1. **DB writes continue after lockfile exists**
- `update_states()` always calls `_db_update_states(...)` even when `pytask.lock` is present.
- **Action:** Guard `_db_update_states` so DB writes stop once the lockfile exists.
2. **PythonNode ID collisions**
- `node_info.path` segments are joined with `"-"`; this can collide for certain tuples.
- **Action:** Encode `node_info.path` losslessly (e.g., JSON/msgspec with a stable prefix or length-prefix segments).
3. **Decode error handling**
- Initial decode can raise `msgspec.DecodeError` without a clean `LockfileError`.
- **Action:** Wrap the first decode in the same error handling as the typed decode.
4. **Docs mismatch**
- Docs say DB is only consulted when no lockfile exists, but DB is still created/updated.
- **Action:** Update docs to match behavior after gating; clarify skip behavior and lockfile updates.
5. **Tests missing**
- No test that lockfile-only skipping works when DB changes.
- No test for DB no-write after lockfile exists.
- No portability tests for ID generation (relative paths, `..`, `UPath`).
- No test for malformed lockfile error behavior.

## Tasks
1. **Backend gating**
- Implement guard in `update_states()` so `_db_update_states` is skipped once a lockfile exists.
- Optionally skip DB creation when `pytask.lock` already exists (confirm desired behavior first).
2. **ID encoding**
- Update `build_portable_node_id()` to encode `node_info.path` without collisions.
- Add unit tests covering ambiguous tuples that would collide under `"-"` join.
3. **Error handling**
- Wrap first decode in `read_lockfile()` with `msgspec.DecodeError` handling.
4. **Docs**
- Align lockfile + configuration docs with actual behavior after gating.
5. **Tests**
- Add regression tests for lockfile-only skip behavior, DB no-write after lockfile, portability IDs, and malformed lockfile formats.
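
## Illustration: ID collisions

The collision described in finding 2 can be reproduced with a short sketch. It assumes `node_info.path` is a tuple of strings and integers. `joined_id` and `lossless_id` are hypothetical stand-ins rather than the actual `build_portable_node_id()` logic, and JSON is only one of the lossless encodings proposed above.

```python
from __future__ import annotations

import json


def joined_id(path: tuple[str | int, ...]) -> str:
    """Join segments with "-", as in the approach flagged in finding 2."""
    return "-".join(str(segment) for segment in path)


def lossless_id(path: tuple[str | int, ...]) -> str:
    """Encode segments as a JSON array so boundaries and types are preserved."""
    return json.dumps(list(path), separators=(",", ":"))


first = ("a-b", "c")
second = ("a", "b-c")

# Both tuples collapse to the same string under the "-" join ...
print(joined_id(first) == joined_id(second))
# ... but stay distinct under the JSON encoding.
print(lossless_id(first) == lossless_id(second))
```

The JSON form also distinguishes the integer segment `1` from the string segment `"1"`, which the plain join cannot do.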
3 changes: 2 additions & 1 deletion pyproject.toml
@@ -31,6 +31,7 @@ dependencies = [
"pluggy>=1.3.0",
"rich>=13.8.0",
"sqlalchemy>=2.0.31",
"msgspec[toml]>=0.18.6",
'tomli>=1; python_version < "3.11"',
'typing-extensions>=4.8.0; python_version < "3.11"',
"universal-pathlib>=0.2.2",
@@ -55,7 +56,7 @@ docs = [
"matplotlib>=3.5.0",
"myst-parser>=3.0.0",
"myst-nb>=1.2.0",
-"sphinx>=7.0.0",
+"sphinx>=7.0.0,<9.0.0",
"sphinx-click>=6.0.0",
"sphinx-copybutton>=0.5.2",
"sphinx-design>=0.3",
10 changes: 10 additions & 0 deletions src/_pytask/build.py
@@ -72,6 +72,7 @@ def build( # noqa: PLR0913
debug_pytask: bool = False,
disable_warnings: bool = False,
dry_run: bool = False,
clean_lockfile: bool = False,
editor_url_scheme: Literal["no_link", "file", "vscode", "pycharm"] # noqa: PYI051
| str = "file",
explain: bool = False,
@@ -121,6 +122,8 @@ def build( # noqa: PLR0913
Whether warnings should be disabled and not displayed.
dry_run
Whether a dry-run should be performed that shows which tasks need to be rerun.
clean_lockfile
Whether the lockfile should be rewritten to only include collected tasks.
editor_url_scheme
A URL scheme that allows you to click on task names, node names, and filenames and
jump right into your preferred editor at the right line.
@@ -189,6 +192,7 @@ def build( # noqa: PLR0913
"debug_pytask": debug_pytask,
"disable_warnings": disable_warnings,
"dry_run": dry_run,
"clean_lockfile": clean_lockfile,
"editor_url_scheme": editor_url_scheme,
"explain": explain,
"expression": expression,
@@ -305,6 +309,12 @@ def build( # noqa: PLR0913
default=False,
help="Execute a task even if it succeeded before.",
)
@click.option(
"--clean-lockfile",
is_flag=True,
default=False,
help="Rewrite the lockfile with only currently collected tasks.",
)
@click.option(
"--explain",
is_flag=True,
20 changes: 18 additions & 2 deletions src/_pytask/console.py
@@ -111,10 +111,26 @@ def render_to_string(
example, render warnings with colors or text in exceptions.

"""
-    buffer = console.render(renderable)
+    render_console = console
+    if not strip_styles and console.no_color and console.color_system is not None:
+        theme: Theme | None
+        try:
+            theme = Theme(console._theme_stack._entries[-1])
+        except (AttributeError, IndexError, TypeError):
+            theme = None
+        render_console = Console(
+            color_system=console.color_system,  # type: ignore[invalid-argument-type]
+            force_terminal=True,
+            width=console.width,
+            no_color=False,
+            markup=getattr(console, "_markup", True),
+            theme=theme,
+        )
+
+    buffer = render_console.render(renderable)
     if strip_styles:
         buffer = Segment.strip_styles(buffer)
-    return console._render_buffer(buffer)
+    return render_console._render_buffer(buffer)


def format_task_name(task: PTask, editor_url_scheme: str) -> Text:
8 changes: 6 additions & 2 deletions src/_pytask/database.py
@@ -7,7 +7,7 @@

from sqlalchemy.engine import make_url

-from _pytask.database_utils import create_database
+from _pytask.database_utils import configure_database_if_present
from _pytask.pluginmanager import hookimpl


@@ -45,4 +45,8 @@ def pytask_parse_config(config: dict[str, Any]) -> None:
@hookimpl
def pytask_post_parse(config: dict[str, Any]) -> None:
"""Post-parse the configuration."""
-    create_database(config["database_url"])
+    lockfile_path = config["root"] / "pytask.lock"
+    command = config.get("command")
+    if lockfile_path.exists() and command in (None, "build"):
+        return
+    configure_database_if_present(config["database_url"])