Fixing errors in the repo setup#3
Open
TheVidz wants to merge 7 commits into
Open
Conversation
Reviewer's GuideRefactors QC and sample splitting to correctly produce and consume QC’d PLINK pfiles, adds safer defaults for fold counts, and introduces a basic deep_ancestry package initializer to fix repo import/setup issues. Sequence diagram for QC and sample splitting pipeline after fixessequenceDiagram
actor User
participant QCProcessor
participant SampleSplitter
participant FileCache
participant Plink
User->>QCProcessor: fit_transform(cache)
QCProcessor->>FileCache: pfile_path()
QCProcessor->>Plink: run_plink(--pfile original_pfile --make-pgen --out original_pfile_qc --set-missing-var-ids @:#:$r:$a qc_config)
Plink-->>QCProcessor: QC pfile generated
QCProcessor->>FileCache: set _pfile_path = original_pfile_qc
User->>SampleSplitter: fit_transform(cache)
SampleSplitter->>FileCache: pfile_path()
SampleSplitter->>FileCache: ids_path()
SampleSplitter->>SampleSplitter: _split_ids(cache)
SampleSplitter->>SampleSplitter: _split_genotypes(cache)
SampleSplitter->>FileCache: pfile_path()
SampleSplitter->>Plink: run_plink(--pfile qc_pfile --keep fold_ids --make-pgen --out fold_pfile)
SampleSplitter->>SampleSplitter: _split_phenotypes(cache)
SampleSplitter->>Plink: run_plink for phenotypes
Plink-->>SampleSplitter: split pfiles and phenotype files ready
SampleSplitter-->>User: cache with QC based splits
Updated class diagram for QC, sample splitting, and cache handlingclassDiagram
class QCProcessor {
- qc_config Dict
+ __init__(qc_config Dict) None
+ fit_transform(cache FileCache) None
+ transform(source_path str, dest_path str) None
}
class SampleSplitter {
- args Any
+ __init__(args Any) None
+ _split_ids(cache FileCache, y Any, random_state int) None
+ _split_genotypes(cache FileCache) None
+ _split_phenotypes(cache FileCache) None
+ fit_transform(cache FileCache) None
}
class FileCache {
- _pfile_path str
+ pfile_path() str
+ ids_path(fold_index int, part str) str
+ ids_path() str
+ num_folds int
}
class PlinkRunner {
+ run_plink(args_list List, args_dict Dict) None
}
QCProcessor --> FileCache : uses
QCProcessor --> PlinkRunner : calls
SampleSplitter --> FileCache : uses
SampleSplitter --> PlinkRunner : calls
SampleSplitter --> QCProcessor : consumes QC output via cache
FileCache o-- QCProcessor : stores QC pfile_path
FileCache o-- SampleSplitter : provides QC pfile_path and ids_path
Flow diagram for QC to fold-split genotype processingflowchart TD
A[Raw PLINK pfile in FileCache] --> B[QCProcessor fit_transform]
B --> C[run_plink with --set-missing-var-ids @:#:$r:$a]
C --> D[QC pfile with _qc suffix]
D --> E[Update FileCache _pfile_path to QC pfile]
E --> F[SampleSplitter fit_transform]
F --> G[_split_ids uses ids_path and num_folds default 5]
G --> H[_split_genotypes uses QC pfile base_path ending with _qc]
H --> I[run_plink per fold and split train val test]
F --> J[_split_phenotypes uses QC based folds]
I --> K[Per fold genotype pfiles]
J --> L[Per fold phenotype files]
K --> M[Downstream training uses QC genotypes]
L --> M
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey - I've found 3 issues, and left some high level feedback:
- Directly mutating the private attribute
cache._pfile_pathin multiple places couples this logic tightly to the currentFileCacheimplementation; consider exposing a public setter or method onFileCacheto update the pfile path instead. - The
_qcsuffix handling is currently duplicated and inconsistent (qc.fit_transformmutates the cache path,_split_genotypesrecomputes a_qcpath, andfit_transforminsample_splittermay append_qcagain); it would be safer to centralize this logic in one place or add a helper that guarantees a single, well-formed QC path. - The comment in
_split_idssays 'adding min 5 folds' but the code only provides a default of 5 whennum_foldsis missing; if a true minimum is desired, add an explicit check to clamp or validateself.args.num_folds.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Directly mutating the private attribute `cache._pfile_path` in multiple places couples this logic tightly to the current `FileCache` implementation; consider exposing a public setter or method on `FileCache` to update the pfile path instead.
- The `_qc` suffix handling is currently duplicated and inconsistent (`qc.fit_transform` mutates the cache path, `_split_genotypes` recomputes a `_qc` path, and `fit_transform` in `sample_splitter` may append `_qc` again); it would be safer to centralize this logic in one place or add a helper that guarantees a single, well-formed QC path.
- The comment in `_split_ids` says 'adding min 5 folds' but the code only provides a default of 5 when `num_folds` is missing; if a true minimum is desired, add an explicit check to clamp or validate `self.args.num_folds`.
## Individual Comments
### Comment 1
<location path="flan/preprocess/qc.py" line_range="35" />
<code_context>
+ )
+
+ # ✅ VERY IMPORTANT: update cache to point to QC output
+ cache._pfile_path = qc_path
def transform(self, source_path: str, dest_path: str) -> None:
</code_context>
<issue_to_address>
**suggestion:** Avoid mutating a cache private attribute directly; prefer a dedicated API on FileCache
Directly setting `cache._pfile_path` tightly couples this code to `FileCache`’s internals and may break if `FileCache` later adds validation or derives related paths from this value. Prefer exposing a public setter or a dedicated method (e.g. `set_qc_pfile_path`/`update_pfile_path`) on `FileCache` and calling that instead.
Suggested implementation:
```python
)
# ✅ VERY IMPORTANT: update cache to point to QC output
cache.update_pfile_path(qc_path)
```
To fully implement this change you also need to:
1. Add a public method to the `FileCache` class (where it is defined), for example:
```python
class FileCache:
...
def update_pfile_path(self, new_path: Path | str) -> None:
# Perform any validation/normalization here if needed
self._pfile_path = Path(new_path)
```
2. Replace any other direct writes to `cache._pfile_path` in the codebase with calls to `update_pfile_path(...)` to keep the abstraction consistent.
</issue_to_address>
### Comment 2
<location path="flan/preprocess/sample_splitter.py" line_range="32-36" />
<code_context>
y: y can be passed to trigger StratifiedKFold instead of KFold
random_state (int): Fixed random_state for train_test_split sklearn function
"""
+ # adding min 5 folds
+ num_folds = getattr(self.args, "num_folds", 5)
+ self.args.num_folds = num_folds
+
ids = pandas.read_table(cache.ids_path()).rename(columns={'#IID': 'IID'}).filter(['FID', 'IID'])
</code_context>
<issue_to_address>
**suggestion:** Comment and implementation for num_folds are slightly misleading and may surprise callers
The comment describes a minimum of 5 folds, but the code only sets a default of 5 when `num_folds` is missing; it doesn’t enforce a minimum when a smaller value is provided. It also always reassigns `self.args.num_folds` even when already set. Consider updating the comment to match the behavior and/or only setting `self.args.num_folds` when the attribute is absent, e.g. `if not hasattr(self.args, "num_folds"): self.args.num_folds = 5`.
```suggestion
# default to 5 folds if not specified
if not hasattr(self.args, "num_folds"):
self.args.num_folds = 5
ids = pandas.read_table(cache.ids_path()).rename(columns={'#IID': 'IID'}).filter(['FID', 'IID'])
```
</issue_to_address>
### Comment 3
<location path="flan/preprocess/sample_splitter.py" line_range="101-103" />
<code_context>
def fit_transform(self, cache: FileCache) -> None:
-
+ # Force splitter to use QC output
+ if not str(cache.pfile_path()).endswith("_qc"):
+ cache._pfile_path = str(cache.pfile_path()) + "_qc"
self._split_ids(cache)
self._split_genotypes(cache)
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Duplicated logic for enforcing QC `_qc` suffix and direct private attribute mutation may lead to inconsistencies
This re-derives the QC path via string operations and directly assigns `cache._pfile_path`, duplicating the logic already present in `QC.fit_transform` / `_split_genotypes`. To avoid divergent suffix rules or paths like `_qc_qc` on repeated calls, consider exposing a `FileCache` API (e.g., a method to switch to QC data) and using that here instead of mutating the private attribute and re-encoding the `_qc` convention.
Suggested implementation:
```python
def fit_transform(self, cache: FileCache) -> None:
# Ensure splitter uses QC output via FileCache API (avoids direct private attribute mutation)
cache.use_qc_pfiles()
self._split_ids(cache)
self._split_genotypes(cache)
self._split_phenotypes(cache)
```
To fully implement this suggestion, you should also:
1. Add a method like `use_qc_pfiles()` (or similar) to the `FileCache` class. That method should:
- Encapsulate the logic for switching to QC paths (e.g., appending `_qc` only once, or following whatever convention is used in `QC.fit_transform` / `_split_genotypes`).
- Avoid re-adding the `_qc` suffix if it's already present.
2. Update any other call sites (e.g., in `QC.fit_transform` / `_split_genotypes`) that currently:
- Manually append `"_qc"` to `pfile_path`, or
- Mutate `cache._pfile_path` directly,
so that they call `cache.use_qc_pfiles()` instead. This centralizes the QC path rule and prevents inconsistencies like `_qc_qc`.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
fix naming issue deep_ancestry, redirecting to flan
error in qc.py file causing issue on "prepare"
fix num folds error in sample_splitter
Fixing Error: --read-freq variant ID '.' appears multiple times
Summary by Sourcery
Fix QC preprocessing and sample splitting to consistently operate on QC-processed genotype data and add the deep_ancestry package marker module.
Bug Fixes: