Why
repi/ingestion/log_parser.py parses line-by-line. Two real-world ingestion gaps fall out of that:
- Multi-line stack traces: a Java/Python exception logs as one event but takes many lines. Each continuation line (e.g. at com.foo.Bar.baz(Bar.java:42), File "foo.py", line 17, in bar) gets treated as a separate log entry: no timestamp, no level, plain text. As a result, lines_with_timestamp drops dramatically on a service with frequent exceptions, time-based filters miss the continuation lines, and the chunk count balloons with low-signal noise.
- Gzipped files: POST /ingest reads the body as UTF-8. Rotated log files (app.log.1.gz), CI artefact downloads, and S3 exports are routinely gzip-compressed. Today they arrive as binary, fail to decode, and the user has to gunzip first.
Scope (in)
Multi-line stack trace joining
- Detect continuation lines: leading whitespace (^\s+), Python ^\s+File "...", Java ^\s+at , ^Caused by:, ^Traceback, and a few common cousins.
- Glue them onto the previous parsed entry (extend its message, keep its timestamp + level).
- Implementation lives in the parser, not the chunker — the chunker should keep treating one entry as one entry.
- Cap the number of joined continuation lines per entry (e.g. 200) as a guard against runaway non-log text.
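The joining rules above could be sketched roughly as follows. This is a minimal illustration, not the existing parser: ENTRY_RE, CONTINUATION_RE, Entry, and join_continuations are hypothetical names, the timestamp/level format is an assumed one, and the pattern for the final "SomeError: ..." line of a Python traceback is one guess at the "common cousins".

```python
import re
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional

# Assumed header format: "YYYY-MM-DD HH:MM:SS LEVEL message" (hypothetical).
ENTRY_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2})\s+(?P<level>[A-Z]+)\s+(?P<msg>.*)$"
)
# Leading whitespace covers Java "\tat ..." frames and Python '  File "..."' frames.
# The last alternative (e.g. "ZeroDivisionError: ...") is an assumed "cousin".
CONTINUATION_RE = re.compile(
    r"^(?:\s+|Caused by:|Traceback\b|[\w.]+(?:Error|Exception)(?::|$))"
)
MAX_CONTINUATION_LINES = 200  # guard against runaway non-log text


@dataclass
class Entry:
    timestamp: Optional[str]
    level: Optional[str]
    message: str


def join_continuations(lines: Iterable[str]) -> Iterator[Entry]:
    """Yield one Entry per log event, gluing continuation lines onto it."""
    current: Optional[Entry] = None
    joined = 0
    for raw in lines:
        line = raw.rstrip("\n")
        header = ENTRY_RE.match(line)
        if header:
            # A new timestamped line always starts a new entry.
            if current is not None:
                yield current
            current = Entry(header["ts"], header["level"], header["msg"])
            joined = 0
        elif (
            current is not None
            and joined < MAX_CONTINUATION_LINES
            and CONTINUATION_RE.match(line)
        ):
            # Extend the message; timestamp and level stay untouched.
            current.message += "\n" + line
            joined += 1
        else:
            # Unrecognised line, or cap reached: emit it as a plain entry.
            if current is not None:
                yield current
            current = Entry(None, None, line)
            joined = 0
    if current is not None:
        yield current
```

Because the generator emits whole entries, a chunker consuming it would not need to know about continuation lines at all, which matches the "lives in the parser" constraint.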
Gzip support
- In repi/api/ingest.py, detect the gzip magic bytes (\x1f\x8b) at the start of the upload and decompress on the fly before decoding UTF-8.
- Filename-based detection is a fallback only — magic bytes are authoritative.
- Stream-decompress for large files (use gzip.GzipFile over a BytesIO).
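A minimal sketch of the magic-byte sniff, assuming the handler already holds the upload as bytes. The function name iter_upload_lines is hypothetical, and the filename fallback is left out because its exact behaviour is not specified here.

```python
import gzip
import io
from typing import BinaryIO, Iterator

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def iter_upload_lines(body: bytes) -> Iterator[str]:
    """Yield decoded text lines, decompressing first if the body is gzip."""
    raw: BinaryIO = io.BytesIO(body)
    if body[:2] == GZIP_MAGIC:
        # GzipFile decompresses incrementally as the wrapper reads from it,
        # so the full decompressed payload is never materialised at once.
        raw = gzip.GzipFile(fileobj=raw, mode="rb")
    with io.TextIOWrapper(raw, encoding="utf-8") as text:
        for line in text:
            yield line
```

Plain and gzipped uploads go through the same decode path after the sniff, which is what makes the "chunks match the un-gzipped file" acceptance check straightforward to assert.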
Scope (out)
- Other compression formats (.zst, .bz2) — file separately if there's demand.
- Binary log formats (journalctl raw, Windows .evtx) — that's a much bigger ingest path.
- Multi-line log formats other than stack traces (e.g. SQL EXPLAIN output) — handle if a user reports the gap.
Acceptance
- Ingesting the log of a Python script that throws and logs a 12-line traceback produces one chunk with the full traceback in its message, with the original log line's timestamp and level.
- Ingesting a Java app log with at ... frames behaves the same way.
- POST /ingest with a .log.gz body succeeds end-to-end and the resulting chunks match what ingesting the un-gzipped file would have produced.
- Existing single-line parsing is unchanged for the formats already supported.
Files
- repi/ingestion/log_parser.py: continuation-line detection and join logic.
- repi/ingestion/log_ingestor.py: call site if any state needs to carry between lines.
- repi/api/ingest.py: gzip sniff + decompress.
- tests/ingestion/test_log_parser.py: fixtures for Python + Java tracebacks.
- tests/api/test_ingest.py: gzipped upload round-trip.