feat(db): extract get_trait_data_pft() as standalone file-I/O-free function#3978
feat(db): extract get_trait_data_pft() as standalone file-I/O-free function#3978omkarrr2533 wants to merge 4 commits into
Conversation
fd354ca to
8525b71
Compare
|
At a high level, I'm pretty confused about the relationship between get.trait.data.pft and get_trait_data_pft as neither calls the other, so which is the wrapper around the other? Instead of a thin backwards-compatible wrapper, we seems to have two implementations of the the same thing, which isn't what we're looking for. Also, nothing seems to be calling get_trait_data_pft yet. If this is the intended replacement, it seems like you should be swapping out the call in get.trait.data |
|
A straightforward fix that I could think of is |
|
Understood the wrapper needs to actually call get_trait_data_pft() for the query part rather than duplicating that logic. Will refactor get.trait.data.pft() to delegate to the standalone in the non-cache path, matching the pattern from #3860 where get.parameter.samples() delegates to get_parameter_samples(). Will push the fix shortly. |
8525b71 to
6a140a7
Compare
12219a8 to
e5f7353
Compare
…h roxygen2 7.3.3
e5f7353 to
3fd20cf
Compare
| computed <- get_trait_data_pft( | ||
| pft_name = pft[["name"]], | ||
| modeltype = modeltype, | ||
| dbcon = dbcon, | ||
| trait_names = trait.names, | ||
| constants = if (!is.null(pft$constants)) pft$constants else list() | ||
| ) |
There was a problem hiding this comment.
this get_trait_data_pft is called before the forceupdate check, meaning every call hits the db, even when the cache is valid
There was a problem hiding this comment.
A fix could be to move this inside the cache-miss section.
There was a problem hiding this comment.
moved the get_trait_data_pft() call into the cache-miss path. The DB is no longer queried on a cache-hit. Added a cheap query_pfts() call before the cache check to resolve pftid, pfttype, and pft_member_filename without fetching full trait data.
c590f33 to
43919c2
Compare
| ##' instead. | ||
| ##' Internally this wrapper delegates all database queries to | ||
| ##' \code{\link{get_trait_data_pft}} exactly once. The returned objects are | ||
| ##' used for both the cache-staleness check and the save step, so the database |
There was a problem hiding this comment.
Is this bit about cacheing something the user needs to know to be able to use the function correctly, or something that should just be internally documented within the function? I suspect it's going to confuse more than enlighten
There was a problem hiding this comment.
Fair point moved it out. The cache mechanics were noise in user-facing roxygen. The new wrapper delegates to get_trait_data_pft() exactly once at the top, so the cache details now live as a short inline comment block right above the staleness check, where someone reading the function body will see them and someone reading ?get.trait.data.pft won't.
| ##' detect changes between runs.} | ||
| ##' } | ||
| ##' | ||
| ##' **Downstream contract:** The files `trait.data.Rdata` and |
There was a problem hiding this comment.
I wouldn't remove this until run.meta.analysis.pft no longer expects these files
There was a problem hiding this comment.
Agreed restored the Downstream contract section. Added a one-line note that it'll be removed in the Week 2 PR once run.meta.analysis.pft() accepts in-memory inputs.
| ##' `pft` should be a list containing at least `name` and `outdir`, and | ||
| ##' optionally `posteriorid` and `constants`. | ||
| ##' | ||
| ##' **File-based side effects (saved to `pft$outdir`):** |
There was a problem hiding this comment.
All these file-based side effects are still true and still really important information to retain. I'd restore these comments
There was a problem hiding this comment.
Restored. Kept the full File-based side effects section — every file listed is still being written exactly as before, so the docs should reflect that. 👍
| ##' ID of the (possibly new) posterior record in BETYdb. The posterior ID can | ||
| ##' be used to locate the output files (`trait.data.Rdata`, `prior.distns.Rdata`, | ||
| ##' ID of the (possibly new) posterior record in BETYdb. Also contains | ||
| ##' `pft$trait_data` and `pft$prior_distns` for in-memory chaining on both |
There was a problem hiding this comment.
The "also contains" part is critical but insufficiently described/documented and insufficiently connected to the output files part. Indeed, I'm not 100% sure this function should return these objects at all, at least by default, as that's a nontrivial, breaking change to the previous behavior. The problem is that if these objects are returned as part of a pft input list that is then used to updated that part of the settings object, that doing so will break the ability to save the settings object. I wonder if we should add a flag to control this behavior?
The "in-memory chaining on both the cache-hit and cache-miss paths" is also unneeded.
There was a problem hiding this comment.
Good call on the flag. Added a return_data parameter, default FALSE, so legacy behavior is unchanged. When return_data = TRUE, the returned pft list also includes trait_data and prior_distns for in-memory chaining; otherwise nothing is attached.
Updated the @return docs to make the tradeoff explicit:
"Default is FALSE to preserve legacy behavior — when pft is embedded inside a settings object, attaching these data frames by default would inflate the settings and break serialization."
Also dropped the "in-memory chaining on both cache-hit and cache-miss paths" line per your comment.
| pft_member_filename = "cultivars.csv" | ||
| pftid <- pftres[["id"]] | ||
| pfttype <- pftres[["pft_type"]] | ||
| pft_member_filename <- if (pfttype == "cultivar") "cultivars.csv" else "species.csv" |
There was a problem hiding this comment.
This bit is not equivalent to what was here previously and as such we seem to be loosing the logger.severe associated with an unmatched pfttype. This would be a good test to add if we don't have it already -- mock having an invalid pft_type and check that the function stops rather than continuing on thinking it's working with plant / species.csv data
There was a problem hiding this comment.
that was a regression on my part. Restored the explicit three-branch check with logger.severe for unknown pft_type. The actual guard lives inside get_trait_data_pft() now (since the wrapper delegates there), but I added a wrapper-level test (test_that("wrapper errors for unknown pft_type, not silent fallback")) that mocks query_pfts to return a bad record and asserts the function stops. So the contract is covered from both ends.
| if (done) { | ||
| trait_env <- new.env(parent = emptyenv()) | ||
| prior_env <- new.env(parent = emptyenv()) | ||
| load(file.path(pft$outdir, "trait.data.Rdata"), envir = trait_env) |
There was a problem hiding this comment.
- I thought the whole point of the refactor was that these objects would be returned by get_trait_data_pft without having to write them to disk and then load them back in?
- see previous comment about not being sure returning these data structures should be the default behavior for the legacy function
There was a problem hiding this comment.
Right, that was a leftover from an earlier version of the diff. With the single-delegation structure, trait.data and prior.distns are already in scope from the top-of-function call, so the cache-hit path just attaches those directly (gated on return_data = TRUE). No load() from disk.
|
|
||
| ## Attach computed objects so downstream callers can chain in-memory | ||
| ## without loading files from pft$outdir. | ||
| pft$trait_data <- trait.data |
There was a problem hiding this comment.
same comment as before about adding a flag to make this optional. Should default to not returning (legacy behavior)
There was a problem hiding this comment.
Done handled by the same return_data flag from the comment above. Both the cache-hit and cache-miss return paths attach trait_data / prior_distns only when return_data = TRUE. Default FALSE = legacy behavior, no breaking change.
|
|
||
| # ---- Resolve PFT to a single database record ---- | ||
| pft_record <- query_pfts(dbcon, pft_name, modeltype, strict = TRUE) | ||
| if (nrow(pft_record) > 1L) { |
There was a problem hiding this comment.
get.trait.data.pft also has a check for length 0. Is that needed here? Also, if the checks are here, are they still needed in get.trait.data.pft too?
There was a problem hiding this comment.
that was a real duplication. With the new single-delegation structure, all DB queries and their associated validation (length-0, multi-PFT, pft_type guard, dbcon class, etc.) live only in get_trait_data_pft(). The wrapper inherits the errors through the delegation. So both checks are present once, in one place — no abstraction needed since there's no longer duplication.
| # ---- Fetch PFT member species or cultivars ---- | ||
| # The join table depends on pft_type. An unknown type is a hard error — | ||
| # silently falling through to the species path would produce wrong results. | ||
| if (identical(pft_type, "cultivar")) { |
There was a problem hiding this comment.
likewise, if this check is here, is it also needed in get.trait.data.pft? If all these checks are actually needed in both places, can they be abstracted to a function that is used in both places?
There was a problem hiding this comment.
the pft_type guard now lives only in get_trait_data_pft(). The wrapper's inline copy is gone in the new diff. If we ever do find a check that genuinely needs to be in both places, I agree a shared helper would be the right move, but right now there's nothing to abstract.
…comments, fix roxygen version
43919c2 to
a6d06f2
Compare
get.trait.data.pft() currently queries the DB, computes trait_data and
prior_distns, saves them to disk, then throws them away — forcing
run.meta.analysis.pft() to immediately reload the same files.
This PR cuts that first link in the chain:
returns list(trait_data, prior_distns, pft_info) directly
to the returned pft — so downstream functions can use them in-memory
without changing anything for callers that don't care
dbfile.insert(), caching all stay exactly as they were
Tests added: 3 validation tests (fire without DB), 8 DB-backed tests
including an end-to-end equivalence check that verifies the standalone
returns identical objects to what the wrapper writes to disk.
Part of GSoC 2026 Week 1 — follows the same pattern as
meta_analysis_standalone() which was already extracted this way.