Fixing errors in the repo setup by TheVidz · Pull Request #3 · genxnetwork/flan

TheVidz · 2026-03-31T10:10:02Z

Summary by Sourcery

Fix QC preprocessing and sample splitting to consistently operate on QC-processed genotype data and add the deep_ancestry package marker module.

Bug Fixes:

Ensure QC preprocessing writes to a separate *_qc pfile and updates the cache to use this QC output for downstream steps.
Update PLINK missing variant ID handling in QC to avoid errors when variant ID '.' appears multiple times.
Default the number of folds to a minimum of 5 in the sample splitter when not provided to prevent invalid split configurations.
Force genotype splitting to consume QC-processed pfiles rather than raw inputs, ensuring consistency between QC and downstream splits.
Ensure the sample splitter uses the QC output pfile in its fit_transform workflow to avoid path-related errors.
Add an init module for the deep_ancestry package to fix import and naming issues.

sourcery-ai · 2026-03-31T10:10:09Z

Reviewer's Guide

Refactors QC and sample splitting to correctly produce and consume QC’d PLINK pfiles, adds safer defaults for fold counts, and introduces a basic deep_ancestry package initializer to fix repo import/setup issues.

Sequence diagram for QC and sample splitting pipeline after fixes

sequenceDiagram
    actor User
    participant QCProcessor
    participant SampleSplitter
    participant FileCache
    participant Plink

    User->>QCProcessor: fit_transform(cache)
    QCProcessor->>FileCache: pfile_path()
    QCProcessor->>Plink: run_plink(--pfile original_pfile --make-pgen --out original_pfile_qc --set-missing-var-ids @:#:$r:$a qc_config)
    Plink-->>QCProcessor: QC pfile generated
    QCProcessor->>FileCache: set _pfile_path = original_pfile_qc

    User->>SampleSplitter: fit_transform(cache)
    SampleSplitter->>FileCache: pfile_path()
    SampleSplitter->>FileCache: ids_path()
    SampleSplitter->>SampleSplitter: _split_ids(cache)
    SampleSplitter->>SampleSplitter: _split_genotypes(cache)
    SampleSplitter->>FileCache: pfile_path()
    SampleSplitter->>Plink: run_plink(--pfile qc_pfile --keep fold_ids --make-pgen --out fold_pfile)
    SampleSplitter->>SampleSplitter: _split_phenotypes(cache)
    SampleSplitter->>Plink: run_plink for phenotypes
    Plink-->>SampleSplitter: split pfiles and phenotype files ready
    SampleSplitter-->>User: cache with QC based splits

Updated class diagram for QC, sample splitting, and cache handling

classDiagram
    class QCProcessor {
        - qc_config Dict
        + __init__(qc_config Dict) None
        + fit_transform(cache FileCache) None
        + transform(source_path str, dest_path str) None
    }

    class SampleSplitter {
        - args Any
        + __init__(args Any) None
        + _split_ids(cache FileCache, y Any, random_state int) None
        + _split_genotypes(cache FileCache) None
        + _split_phenotypes(cache FileCache) None
        + fit_transform(cache FileCache) None
    }

    class FileCache {
        - _pfile_path str
        + pfile_path() str
        + ids_path(fold_index int, part str) str
        + ids_path() str
        + num_folds int
    }

    class PlinkRunner {
        + run_plink(args_list List, args_dict Dict) None
    }

    QCProcessor --> FileCache : uses
    QCProcessor --> PlinkRunner : calls
    SampleSplitter --> FileCache : uses
    SampleSplitter --> PlinkRunner : calls
    SampleSplitter --> QCProcessor : consumes QC output via cache
    FileCache o-- QCProcessor : stores QC pfile_path
    FileCache o-- SampleSplitter : provides QC pfile_path and ids_path

Flow diagram for QC to fold-split genotype processing

flowchart TD
    A[Raw PLINK pfile in FileCache] --> B[QCProcessor fit_transform]
    B --> C[run_plink with --set-missing-var-ids @:#:$r:$a]
    C --> D[QC pfile with _qc suffix]
    D --> E[Update FileCache _pfile_path to QC pfile]
    E --> F[SampleSplitter fit_transform]
    F --> G[_split_ids uses ids_path and num_folds default 5]
    G --> H[_split_genotypes uses QC pfile base_path ending with _qc]
    H --> I[run_plink per fold and split train val test]
    F --> J[_split_phenotypes uses QC based folds]
    I --> K[Per fold genotype pfiles]
    J --> L[Per fold phenotype files]
    K --> M[Downstream training uses QC genotypes]
    L --> M

File-Level Changes

Change	Details	Files
QC pipeline now writes to a dedicated *_qc pfile and updates downstream cache usage accordingly.	Change QC fit_transform to output PLINK pfiles to a new _qc path instead of overwriting the original pfile. Update PLINK arguments to remove the stray 'vzs' token and use an expanded --set-missing-var-ids pattern '@:#:$r:$a' to avoid duplicate '.' variant IDs. Mutate the FileCache instance after QC to point its internal pfile path to the new _qc output so later stages operate on QC’d data.	`flan/preprocess/qc.py`
Sample splitting now has a safe default for num_folds and consistently uses QC-processed genotypes.	Ensure args.num_folds has a minimum default of 5 when not provided, preventing errors from missing configuration. Force genotype splitting to read from the QC pfile by deriving a _qc base path for --pfile instead of using the raw cache.pfile_path. In fit_transform, normalize the cache pfile path to _qc if necessary so all splitting steps are aligned on QC output.	`flan/preprocess/sample_splitter.py`
Adds a minimal deep_ancestry package initializer to resolve naming/import issues and properly redirect to flan.	Create an empty or lightweight init.py for the deep_ancestry package so it can be imported and linked into the main flan repo structure.	`deep_ancestry/__init__.py`

Tips and commands

Interacting with Sourcery

Trigger a new review: Comment @sourcery-ai review on the pull request.
Continue discussions: Reply directly to Sourcery's review comments.
Generate a GitHub issue from a review comment: Ask Sourcery to create an
issue from a review comment by replying to it. You can also reply to a
review comment with @sourcery-ai issue to create an issue from it.
Generate a pull request title: Write @sourcery-ai anywhere in the pull
request title to generate a title at any time. You can also comment
@sourcery-ai title on the pull request to (re-)generate the title at any time.
Generate a pull request summary: Write @sourcery-ai summary anywhere in
the pull request body to generate a PR summary at any time exactly where you
want it. You can also comment @sourcery-ai summary on the pull request to
(re-)generate the summary at any time.
Generate reviewer's guide: Comment @sourcery-ai guide on the pull
request to (re-)generate the reviewer's guide at any time.
Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
pull request to resolve all Sourcery comments. Useful if you've already
addressed all the comments and don't want to see them anymore.
Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
request to dismiss all existing Sourcery reviews. Especially useful if you
want to start fresh with a new review - don't forget to comment
@sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

Enable or disable review features such as the Sourcery-generated pull request
summary, the reviewer's guide, and others.
Change the review language.
Add, remove or edit custom review instructions.
Adjust other review settings.

Getting Help

Contact our support team for questions or feedback.
Visit our documentation for detailed guides and information.
Keep in touch with the Sourcery team by following us on X/Twitter, LinkedIn or GitHub.

sourcery-ai

Hey - I've found 3 issues, and left some high level feedback:

Directly mutating the private attribute cache._pfile_path in multiple places couples this logic tightly to the current FileCache implementation; consider exposing a public setter or method on FileCache to update the pfile path instead.
The _qc suffix handling is currently duplicated and inconsistent (qc.fit_transform mutates the cache path, _split_genotypes recomputes a _qc path, and fit_transform in sample_splitter may append _qc again); it would be safer to centralize this logic in one place or add a helper that guarantees a single, well-formed QC path.
The comment in _split_ids says 'adding min 5 folds' but the code only provides a default of 5 when num_folds is missing; if a true minimum is desired, add an explicit check to clamp or validate self.args.num_folds.

Prompt for AI Agents

Please address the comments from this code review:

## Overall Comments
- Directly mutating the private attribute `cache._pfile_path` in multiple places couples this logic tightly to the current `FileCache` implementation; consider exposing a public setter or method on `FileCache` to update the pfile path instead.
- The `_qc` suffix handling is currently duplicated and inconsistent (`qc.fit_transform` mutates the cache path, `_split_genotypes` recomputes a `_qc` path, and `fit_transform` in `sample_splitter` may append `_qc` again); it would be safer to centralize this logic in one place or add a helper that guarantees a single, well-formed QC path.
- The comment in `_split_ids` says 'adding min 5 folds' but the code only provides a default of 5 when `num_folds` is missing; if a true minimum is desired, add an explicit check to clamp or validate `self.args.num_folds`.

## Individual Comments

### Comment 1
<location path="flan/preprocess/qc.py" line_range="35" />
<code_context>
+        )
+
+        # ✅ VERY IMPORTANT: update cache to point to QC output
+        cache._pfile_path = qc_path

     def transform(self, source_path: str, dest_path: str) -> None:
</code_context>
<issue_to_address>
**suggestion:** Avoid mutating a cache private attribute directly; prefer a dedicated API on FileCache

Directly setting `cache._pfile_path` tightly couples this code to `FileCache`’s internals and may break if `FileCache` later adds validation or derives related paths from this value. Prefer exposing a public setter or a dedicated method (e.g. `set_qc_pfile_path`/`update_pfile_path`) on `FileCache` and calling that instead.

Suggested implementation:

```python
        )

        # ✅ VERY IMPORTANT: update cache to point to QC output
        cache.update_pfile_path(qc_path)

```

To fully implement this change you also need to:
1. Add a public method to the `FileCache` class (where it is defined), for example:
   ```python
   class FileCache:
       ...
       def update_pfile_path(self, new_path: Path | str) -> None:
           # Perform any validation/normalization here if needed
           self._pfile_path = Path(new_path)
   ```
2. Replace any other direct writes to `cache._pfile_path` in the codebase with calls to `update_pfile_path(...)` to keep the abstraction consistent.
</issue_to_address>

### Comment 2
<location path="flan/preprocess/sample_splitter.py" line_range="32-36" />
<code_context>
             y: y can be passed to trigger StratifiedKFold instead of KFold
             random_state (int): Fixed random_state for train_test_split sklearn function
         """
+        # adding min 5 folds
+        num_folds = getattr(self.args, "num_folds", 5)
+        self.args.num_folds = num_folds
+
         ids = pandas.read_table(cache.ids_path()).rename(columns={'#IID': 'IID'}).filter(['FID', 'IID'])
</code_context>
<issue_to_address>
**suggestion:** Comment and implementation for num_folds are slightly misleading and may surprise callers

The comment describes a minimum of 5 folds, but the code only sets a default of 5 when `num_folds` is missing; it doesn’t enforce a minimum when a smaller value is provided. It also always reassigns `self.args.num_folds` even when already set. Consider updating the comment to match the behavior and/or only setting `self.args.num_folds` when the attribute is absent, e.g. `if not hasattr(self.args, "num_folds"): self.args.num_folds = 5`.

```suggestion
        # default to 5 folds if not specified
        if not hasattr(self.args, "num_folds"):
            self.args.num_folds = 5

        ids = pandas.read_table(cache.ids_path()).rename(columns={'#IID': 'IID'}).filter(['FID', 'IID'])
```
</issue_to_address>

### Comment 3
<location path="flan/preprocess/sample_splitter.py" line_range="101-103" />
<code_context>

     def fit_transform(self, cache: FileCache) -> None:
-        
+        # Force splitter to use QC output
+        if not str(cache.pfile_path()).endswith("_qc"):
+            cache._pfile_path = str(cache.pfile_path()) + "_qc"
         self._split_ids(cache)
         self._split_genotypes(cache)
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Duplicated logic for enforcing QC `_qc` suffix and direct private attribute mutation may lead to inconsistencies

This re-derives the QC path via string operations and directly assigns `cache._pfile_path`, duplicating the logic already present in `QC.fit_transform` / `_split_genotypes`. To avoid divergent suffix rules or paths like `_qc_qc` on repeated calls, consider exposing a `FileCache` API (e.g., a method to switch to QC data) and using that here instead of mutating the private attribute and re-encoding the `_qc` convention.

Suggested implementation:

```python
    def fit_transform(self, cache: FileCache) -> None:
        # Ensure splitter uses QC output via FileCache API (avoids direct private attribute mutation)
        cache.use_qc_pfiles()
        self._split_ids(cache)
        self._split_genotypes(cache)
        self._split_phenotypes(cache)

```

To fully implement this suggestion, you should also:
1. Add a method like `use_qc_pfiles()` (or similar) to the `FileCache` class. That method should:
   - Encapsulate the logic for switching to QC paths (e.g., appending `_qc` only once, or following whatever convention is used in `QC.fit_transform` / `_split_genotypes`).
   - Avoid re-adding the `_qc` suffix if it's already present.
2. Update any other call sites (e.g., in `QC.fit_transform` / `_split_genotypes`) that currently:
   - Manually append `"_qc"` to `pfile_path`, or
   - Mutate `cache._pfile_path` directly,
   so that they call `cache.use_qc_pfiles()` instead. This centralizes the QC path rule and prevents inconsistencies like `_qc_qc`.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨

_{Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.}

TheVidz added 4 commits March 31, 2026 01:31

fix naming issue deep_ancestry, redirecting to flan

9e38946

error in qc.py file causing issue on "prepare"

242990b

fix num folds error in sample_splitter

e9e9d8a

Fixing Error: --read-freq variant ID '.' appears multiple times

a339c66

sourcery-ai Bot reviewed Mar 31, 2026

View reviewed changes

Comment thread flan/preprocess/qc.py Outdated

Comment thread flan/preprocess/sample_splitter.py

Comment thread flan/preprocess/sample_splitter.py Outdated

TheVidz added 3 commits May 24, 2026 20:36

revert commit a339c66

e0d0bd1

coment-out commit e9e9d8a

ffa8170

revert commit 242990b

2de0dc6

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixing errors in the repo setup#3

Fixing errors in the repo setup#3
TheVidz wants to merge 7 commits into
genxnetwork:mainfrom
TheVidz:main

TheVidz commented Mar 31, 2026 •

edited by sourcery-ai Bot

Loading

Uh oh!

sourcery-ai Bot commented Mar 31, 2026 •

edited

Loading

Interacting with Sourcery

Customizing Your Experience

Getting Help

Uh oh!

sourcery-ai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

TheVidz commented Mar 31, 2026 • edited by sourcery-ai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by Sourcery

Uh oh!

sourcery-ai Bot commented Mar 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviewer's Guide

Sequence diagram for QC and sample splitting pipeline after fixes

Updated class diagram for QC, sample splitting, and cache handling

Flow diagram for QC to fold-split genotype processing

File-Level Changes

Interacting with Sourcery

Customizing Your Experience

Getting Help

Uh oh!

sourcery-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

TheVidz commented Mar 31, 2026 •

edited by sourcery-ai Bot

Loading

sourcery-ai Bot commented Mar 31, 2026 •

edited

Loading