feat: Adding support for gemma4 e2b models#1162
Conversation
bf62c0b to
938bc11
Compare
ec22b0d to
66b3d24
Compare
12f147e to
a606c79
Compare
047315d to
25fb705
Compare
c41ec78 to
4b27994
Compare
4b27994 to
8038ffe
Compare
There was a problem hiding this comment.
not going to block the PR over this, however this is getting pretty similar to vision encoder, maybe we can lift something up
There was a problem hiding this comment.
i believe we should sanity check the inputs provided by user in setters here
This comment was marked as resolved.
This comment was marked as resolved.
There was a problem hiding this comment.
We needs to add some kind of symbol to messages with audio attached, now it is not clear for the user. Also, how big is the context in this model? It seems to be really small which makes it hard to keep conversation with this model using multimodal inputs. Another thing which is concerning is that the TTFT rapidly grows when we keep sending messages.
|
Clicking on the send button does not block UI which is not intuitive for the user that something is already happening. |
|
context is 2048 tokens |
5adf248 to
edd1537
Compare
edd1537 to
db00aa0
Compare
chmjkb
left a comment
There was a problem hiding this comment.
i'll test the app on iOS
ccb5526 to
539a125
Compare
msluszniak
left a comment
There was a problem hiding this comment.
Review of the latest state (ccb5526c1) across the native runner, TS layer, and the example app. Inline comments group into: confirmed correctness bugs — the prefiller return-handling (processMultimodalInput has no Error::Ok path + prefill() discards every helper's return; these are coupled), the PLE element-size shadowing that zeroes the per-chunk offset on long prompts, and the LLMModule imperative API dropping audioConfig — plus robustness/API gaps and nits. A few items are phrased as questions where I couldn't confirm intent (e.g. the audio-encoder scalar). Verified against the code; happy to discuss any of them.
| setError(`Recording problems: ${result.message}`); | ||
| return; | ||
| } | ||
| setIsRecording(true); |
There was a problem hiding this comment.
Two recording-robustness issues: (1) this result.status === 'error' branch is effectively dead — with no file output, AudioRecorder.start() always returns { status: 'success' }, so a real native start failure still flips the UI to 'recording' and yields empty audio; register recorder.current.onError(...) instead. (2) onAudioReady pushes a Float32Array per ~0.1s with no cap, so a long recording grows memory unbounded and hands a huge buffer to sendMessage despite the model's ~30s window — enforce a max-duration stop (and reject oversized decoded buffers in loadAudioFromUrl).
| }; | ||
|
|
||
| const startRecording = async () => { | ||
| if (!hasMicPermission) { |
There was a problem hiding this comment.
Mic permission is requested once on mount and this is a dead-end on denial: the button isn't disabled when !hasMicPermission, so tapping it only sets 'enable it in Settings' with no way to act, and there's no re-check or Linking.openSettings(). Re-request inside startRecording (await requestRecordingPermissions() and update state), offer Linking.openSettings() when Denied, and/or disable the button when permission is known-denied.
chmjkb
left a comment
There was a problem hiding this comment.
besides what Mateusz requested, i think everything is fine on my end
8f1d204 to
c5194ec
Compare
msluszniak
left a comment
There was a problem hiding this comment.
I think all issues that I raised are now addressed. Thank you for working on this one Mateusz. 🚀
<!-- Provide a concise and descriptive summary of the changes implemented in this PR. --> - [ ] Yes - [x] No - [ ] Bug fix (change which fixes an issue) - [x] New feature (change which adds functionality) - [x] Documentation update (improves or adds clarity to existing documentation) - [ ] Other (chores, tests, code style improvements etc.) - [x] iOS - [x] Android Test by running apps/llm app on llm screen (for text only model) and multimodal screen (for audio-vision-text model). Text model should work as any other llm model. Multimodal can process up-to-30sec audio chunks as well as image inputs, should be able to transcribe audio, describe pictures or similar. <!-- Add screenshots here, if applicable --> - [x] I have performed a self-review of my code - [x] I have commented my code, particularly in hard-to-understand areas - [x] I have updated the documentation accordingly - [ ] My changes generate no new warnings
Description
Introduces a breaking change?
Type of change
Tested on
Testing instructions
Test by running apps/llm app on llm screen (for text only model) and multimodal screen (for audio-vision-text model). Text model should work as any other llm model. Multimodal can process up-to-30sec audio chunks as well as image inputs, should be able to transcribe audio, describe pictures or similar.
Screenshots
Related issues
#1062
Checklist
Additional notes