⚡️ Speed up function is_tex_string by 84% #153

Open
codeflash-ai[bot] wants to merge 1 commit into branch-3.9 from codeflash/optimize-is_tex_string-mhwtgnza

@codeflash-ai codeflash-ai Bot commented Nov 13, 2025

📄 84% (0.84x) speedup for is_tex_string in src/bokeh/embed/util.py

⏱️ Runtime : 276 microseconds → 150 microseconds (best of 250 runs)

📝 Explanation and details

The optimization achieves an 84% speedup by precompiling the regex pattern at module load time instead of recompiling it on every function call.

Key optimization: The original code compiled a new regex pattern (re.compile) each time is_tex_string() was called, which is expensive. The optimized version moves this compilation outside the function, storing the precompiled pattern in the module-level variable _pat. Now each function call only performs the fast pattern matching operation.
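As a rough illustration of the change described above, the sketch below puts the two shapes side by side. The pattern strings are assumptions written to match the MathJax default delimiters (`$$…$$`, `\[…\]`, `\(…\)`), and the names `is_tex_string_before` and `is_tex_string_after` are hypothetical; the actual code in src/bokeh/embed/util.py may differ in detail.

```python
import re

# Assumed patterns for the MathJax default delimiters; the real ones in
# src/bokeh/embed/util.py may differ in detail.
_DOLLARS = r"^\$\$.*?\$\$$"
_BRACES = r"^\\\[.*?\\\]$"
_PARENS = r"^\\\(.*?\\\)$"


def is_tex_string_before(text: str) -> bool:
    # Hypothetical "before" shape: re.compile is called on every invocation.
    pat = re.compile(f"{_DOLLARS}|{_BRACES}|{_PARENS}", flags=re.S)
    return pat.match(text) is not None


# "After" shape: the pattern is compiled once, when the module is imported.
_pat = re.compile(f"{_DOLLARS}|{_BRACES}|{_PARENS}", flags=re.S)


def is_tex_string_after(text: str) -> bool:
    # Only the match itself runs on each call.
    return _pat.match(text) is not None
```

Both functions should return the same result for any string input; only the per-call work differs.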

Performance impact: Line profiler results show where the time went. The original version spent 73.3% of its execution time (806μs out of 1100μs total) on compiling the regex pattern on each call. The optimized version removes that step from the call path, reducing total profiled execution time from 1100μs to 220μs.
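One way to reproduce a comparison of this kind locally is a small `timeit` harness like the one below. It is a minimal sketch, not the profiler setup used for the numbers above: `PATTERN`, `compile_per_call`, `precompiled`, and `best_of` are hypothetical names, and the pattern is an assumed stand-in for the one in bokeh. Note that `re` keeps an internal cache of compiled patterns, so the measured gap depends on the Python version and on what else has been compiled in the process.

```python
import re
import timeit

# Assumed stand-in for the MathJax delimiter pattern.
PATTERN = r"^\$\$.*?\$\$$|^\\\[.*?\\\]$|^\\\(.*?\\\)$"


def compile_per_call(text: str) -> bool:
    # Goes through re.compile on every call.
    return re.compile(PATTERN, flags=re.S).match(text) is not None


_PAT = re.compile(PATTERN, flags=re.S)


def precompiled(text: str) -> bool:
    # Reuses the module-level pattern object.
    return _PAT.match(text) is not None


def best_of(func, text: str, repeat: int = 5, number: int = 10_000) -> float:
    # Minimum total time in seconds for `number` calls, over `repeat` rounds.
    return min(timeit.repeat(lambda: func(text), repeat=repeat, number=number))


if __name__ == "__main__":
    sample = "$$x^2$$"
    slow = best_of(compile_per_call, sample)
    fast = best_of(precompiled, sample)
    print(f"compile per call: {slow:.4f}s, precompiled: {fast:.4f}s")
```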

Why this works: Regex compilation involves parsing the pattern string and translating it into the internal program that the matching engine executes. None of this needs to be repeated, because the MathJax delimiter patterns are constants. Python's re.compile returns a pattern object that can be reused indefinitely.

Test case benefits: The optimization provides consistent speedups across all test scenarios:

  • Small strings: 90-230% faster (most common case)
  • Large strings (1000+ chars): 10-25% faster (regex matching dominates over compilation)
  • Edge cases: 100-200% faster for non-matching inputs that fail quickly
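The ranges above follow from a roughly fixed per-call saving being divided by very different baseline runtimes. The sketch below works through that arithmetic with timings copied from this report. It assumes the speedup percentage is defined as original / optimized - 1, which is consistent with the reported figures; `speedup_percent` is a hypothetical helper, not part of Codeflash.

```python
def speedup_percent(original: float, optimized: float) -> float:
    """Percent by which `optimized` is faster, assuming original / optimized - 1."""
    return (original / optimized - 1.0) * 100.0


# (original, optimized) runtimes in microseconds, copied from the report
# header and from the timing comments in the generated tests.
timings = {
    "overall (best of 250 runs)": (276.0, 150.0),
    'small input, is_tex_string("$x^2$")': (3.09, 1.38),
    "large input, 1000-character content": (8.19, 6.60),
}

for label, (original, optimized) in timings.items():
    saved = original - optimized
    percent = speedup_percent(original, optimized)
    print(f"{label}: saved {saved:.2f}μs, {percent:.1f}% faster")
```

In both per-call rows the absolute saving is about 1.6 to 1.7μs. That is a large share of a short call and a small share of a call dominated by matching 1000 characters.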

This optimization matters most when is_tex_string() is called frequently, for example in text processing pipelines: the compilation cost is saved on every call, so the total saving grows linearly with the number of calls.

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 70 Passed |
| 🌀 Generated Regression Tests | 111 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 1 Passed |
| 📊 Tests Coverage | 100.0% |

⚙️ Existing Unit Tests and Runtime

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| unit/bokeh/embed/test_util__embed.py::Test__tex_helpers.test_is_tex_string | 21.5μs | 8.58μs | 151%✅ |

🌀 Generated Regression Tests and Runtime
from __future__ import annotations

import re

# imports
import pytest  # used for our unit tests
from bokeh.embed.util import is_tex_string

#-----------------------------------------------------------------------------
# Code
#-----------------------------------------------------------------------------

# ----------------------
# Basic Test Cases
# ----------------------

def test_basic_dollar_delimiters():
    # Test with valid TeX string using $
    codeflash_output = is_tex_string("$x^2$") # 3.09μs -> 1.38μs (124% faster)
    # Test with valid TeX string using $ with whitespace inside
    codeflash_output = is_tex_string("$ x^2 + y^2 = z^2 $") # 1.43μs -> 740ns (93.0% faster)

def test_basic_brace_delimiters():
    # Test with valid TeX string using \[ \]
    codeflash_output = is_tex_string(r"\[x^2\]") # 2.89μs -> 1.14μs (155% faster)
    # Test with valid TeX string using \[ \] and whitespace
    codeflash_output = is_tex_string(r"\[ x^2 + y^2 = z^2 \]") # 1.39μs -> 633ns (119% faster)

def test_basic_paren_delimiters():
    # Test with valid TeX string using \( \)
    codeflash_output = is_tex_string(r"\(x^2\)") # 2.60μs -> 1.10μs (135% faster)
    # Test with valid TeX string using \( \) and whitespace
    codeflash_output = is_tex_string(r"\( x^2 + y^2 = z^2 \)") # 1.38μs -> 634ns (117% faster)

def test_basic_non_tex_strings():
    # Test with a normal string, not TeX
    codeflash_output = is_tex_string("x^2") # 2.24μs -> 673ns (233% faster)
    # Test with only delimiters but no content
    codeflash_output = is_tex_string("$ $") # 1.50μs -> 783ns (91.1% faster)
    codeflash_output = is_tex_string(r"\[ \]") # 923ns -> 346ns (167% faster)
    codeflash_output = is_tex_string(r"\( \)") # 832ns -> 351ns (137% faster)

# ----------------------
# Edge Test Cases
# ----------------------

def test_empty_string():
    # Test with empty string
    codeflash_output = is_tex_string("") # 2.19μs -> 678ns (222% faster)

def test_only_delimiters():
    # Only the delimiters, no content
    codeflash_output = is_tex_string("$ $") # 2.77μs -> 1.16μs (140% faster)
    codeflash_output = is_tex_string(r"\[ \]") # 1.23μs -> 413ns (199% faster)
    codeflash_output = is_tex_string(r"\( \)") # 911ns -> 385ns (137% faster)

def test_unmatched_delimiters():
    # Only start or end delimiter
    codeflash_output = is_tex_string("$x^2") # 2.52μs -> 921ns (174% faster)
    codeflash_output = is_tex_string("x^2$") # 1.14μs -> 342ns (233% faster)
    codeflash_output = is_tex_string(r"\[x^2") # 999ns -> 411ns (143% faster)
    codeflash_output = is_tex_string(r"x^2\]") # 718ns -> 243ns (195% faster)
    codeflash_output = is_tex_string(r"\(x^2") # 864ns -> 350ns (147% faster)
    codeflash_output = is_tex_string(r"x^2\)") # 713ns -> 223ns (220% faster)

def test_nested_delimiters():
    # Delimiters inside the content
    codeflash_output = is_tex_string("$a + $b$ + c$") # 2.81μs -> 1.29μs (118% faster)
    codeflash_output = is_tex_string(r"\[a + \[b\] + c\]") # 1.41μs -> 688ns (105% faster)

def test_escaped_delimiters():
    # Escaped delimiters inside content
    codeflash_output = is_tex_string(r"$x^2 \$\$ y^2$") # 2.70μs -> 1.13μs (138% faster)
    codeflash_output = is_tex_string(r"\[ x^2 \\] \]") # 1.39μs -> 596ns (132% faster)
    codeflash_output = is_tex_string(r"\( x^2 \\) \)") # 1.03μs -> 435ns (137% faster)

def test_leading_trailing_whitespace():
    # Whitespace outside delimiters should fail
    codeflash_output = is_tex_string(" $x^2$") # 2.28μs -> 686ns (232% faster)
    codeflash_output = is_tex_string("$x^2$ ") # 1.45μs -> 754ns (92.3% faster)
    codeflash_output = is_tex_string(r" \[x^2\]") # 830ns -> 252ns (229% faster)
    codeflash_output = is_tex_string(r"\[x^2\] ") # 927ns -> 442ns (110% faster)
    codeflash_output = is_tex_string(r" \(x^2\)") # 724ns -> 235ns (208% faster)
    codeflash_output = is_tex_string(r"\(x^2\) ") # 886ns -> 399ns (122% faster)

def test_multiline_content():
    # TeX string with newlines inside
    codeflash_output = is_tex_string("$x^2\ny^2$") # 2.61μs -> 1.10μs (138% faster)
    codeflash_output = is_tex_string(r"\[x^2\ny^2\]") # 1.37μs -> 589ns (133% faster)
    codeflash_output = is_tex_string(r"\(x^2\ny^2\)") # 1.08μs -> 464ns (132% faster)

def test_multiline_outside_delimiters():
    # Newlines outside delimiters should fail
    codeflash_output = is_tex_string("\n$x^2$") # 2.22μs -> 778ns (185% faster)
    codeflash_output = is_tex_string("$x^2$\n") # 1.49μs -> 756ns (97.6% faster)
    codeflash_output = is_tex_string("\n\\[x^2\\]") # 838ns -> 280ns (199% faster)
    codeflash_output = is_tex_string(r"\[x^2\]\n") # 1.02μs -> 517ns (98.3% faster)
    codeflash_output = is_tex_string("\n\\(x^2\\)") # 737ns -> 238ns (210% faster)
    codeflash_output = is_tex_string(r"\(x^2\)\n") # 941ns -> 420ns (124% faster)

def test_partial_delimiters():
    # Partial delimiters or typos
    codeflash_output = is_tex_string("$x^2$") # 2.27μs -> 765ns (197% faster)
    codeflash_output = is_tex_string(r"\[x^2\)") # 1.33μs -> 583ns (127% faster)
    codeflash_output = is_tex_string(r"\(x^2\]") # 993ns -> 424ns (134% faster)
    codeflash_output = is_tex_string(r"\x^2\]") # 772ns -> 258ns (199% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
from __future__ import annotations

import re

# imports
import pytest  # used for our unit tests
from bokeh.embed.util import is_tex_string

#-----------------------------------------------------------------------------
# Unit Tests
#-----------------------------------------------------------------------------

# Basic Test Cases

def test_basic_dollar_delimiters():
    # Simple valid TeX string with $
    codeflash_output = is_tex_string("$x^2$") # 2.67μs -> 1.07μs (149% faster)

def test_basic_braces_delimiters():
    # Simple valid TeX string with \[ \]
    codeflash_output = is_tex_string(r"\[x^2\]") # 2.71μs -> 1.15μs (136% faster)

def test_basic_parens_delimiters():
    # Simple valid TeX string with \( \)
    codeflash_output = is_tex_string(r"\(x^2\)") # 2.75μs -> 1.15μs (139% faster)

def test_basic_non_tex_string():
    # String without any TeX delimiters
    codeflash_output = is_tex_string("x^2") # 2.25μs -> 745ns (203% faster)

def test_basic_partial_delimiters():
    # String with only starting delimiter
    codeflash_output = is_tex_string("$x^2") # 2.77μs -> 1.15μs (141% faster)
    codeflash_output = is_tex_string(r"\[x^2") # 1.39μs -> 549ns (153% faster)
    codeflash_output = is_tex_string(r"\(x^2") # 994ns -> 351ns (183% faster)

def test_basic_mismatched_delimiters():
    # Delimiters don't match
    codeflash_output = is_tex_string("$x^2\\]") # 2.61μs -> 1.01μs (158% faster)
    codeflash_output = is_tex_string(r"\[x^2$") # 1.28μs -> 497ns (158% faster)
    codeflash_output = is_tex_string(r"\(x^2\]") # 1.03μs -> 385ns (167% faster)

def test_basic_empty_tex_string():
    # Delimiters with no content
    codeflash_output = is_tex_string("$$") # 2.60μs -> 1.01μs (157% faster)
    codeflash_output = is_tex_string(r"\[\]") # 1.33μs -> 455ns (191% faster)
    codeflash_output = is_tex_string(r"\(\)") # 985ns -> 373ns (164% faster)

# Edge Test Cases

def test_edge_only_delimiters():
    # String is only delimiters
    codeflash_output = is_tex_string("$$") # 2.58μs -> 1.02μs (152% faster)
    codeflash_output = is_tex_string(r"\[\]") # 1.27μs -> 437ns (192% faster)
    codeflash_output = is_tex_string(r"\(\)") # 1.01μs -> 379ns (166% faster)

def test_edge_whitespace_inside_delimiters():
    # Whitespace inside delimiters should be valid
    codeflash_output = is_tex_string("$   $") # 2.53μs -> 1.11μs (128% faster)
    codeflash_output = is_tex_string(r"\[   \]") # 1.22μs -> 473ns (158% faster)
    codeflash_output = is_tex_string(r"\(   \)") # 967ns -> 435ns (122% faster)

def test_edge_whitespace_outside_delimiters():
    # Whitespace outside delimiters should not be valid
    codeflash_output = is_tex_string("  $x^2$") # 2.35μs -> 808ns (190% faster)
    codeflash_output = is_tex_string("$x^2$  ") # 1.47μs -> 747ns (96.4% faster)
    codeflash_output = is_tex_string(" \\[x^2\\]") # 873ns -> 275ns (217% faster)
    codeflash_output = is_tex_string(r"\[x^2\] ") # 1.04μs -> 514ns (103% faster)

def test_edge_newlines_inside_delimiters():
    # Newlines inside delimiters should be valid
    codeflash_output = is_tex_string("$\nx^2\n$") # 2.60μs -> 1.04μs (149% faster)
    codeflash_output = is_tex_string(r"\[\nx^2\n\]") # 1.27μs -> 525ns (142% faster)
    codeflash_output = is_tex_string(r"\(\nx^2\n\)") # 1.09μs -> 451ns (143% faster)

def test_edge_newlines_outside_delimiters():
    # Newlines outside delimiters should not be valid
    codeflash_output = is_tex_string("\n$x^2$") # 2.24μs -> 742ns (202% faster)
    codeflash_output = is_tex_string("$x^2$\n") # 1.51μs -> 701ns (116% faster)

def test_edge_escaped_delimiters():
    # Escaped delimiters inside content should be valid
    codeflash_output = is_tex_string("$x^2 \\$ y$") # 2.74μs -> 1.17μs (133% faster)
    codeflash_output = is_tex_string(r"\[x^2 \\] y\]") # 1.38μs -> 535ns (158% faster)
    codeflash_output = is_tex_string(r"\(x^2 \\) y\)") # 1.04μs -> 457ns (129% faster)

def test_edge_nested_delimiters():
    # Nested delimiters inside content should be valid
    codeflash_output = is_tex_string("$\\[x^2\\]$") # 2.52μs -> 1.01μs (150% faster)
    codeflash_output = is_tex_string(r"\[\(x^2\)\]") # 1.37μs -> 567ns (142% faster)
    codeflash_output = is_tex_string(r"\(\[x^2\]\)") # 1.06μs -> 407ns (160% faster)

def test_edge_incorrect_order():
    # Delimiters in incorrect order
    codeflash_output = is_tex_string("$x^2\\]$") # 2.44μs -> 985ns (147% faster)
    codeflash_output = is_tex_string(r"\[x^2$\]") # 1.24μs -> 481ns (158% faster)
    codeflash_output = is_tex_string(r"\(x^2\]") # 1.10μs -> 495ns (123% faster)

def test_edge_single_dollar():
    # Single dollar signs are not valid TeX delimiters
    codeflash_output = is_tex_string("$x^2$") # 2.30μs -> 794ns (189% faster)
    codeflash_output = is_tex_string("$x^2$") # 1.11μs -> 358ns (210% faster)
    codeflash_output = is_tex_string("$x^2$") # 1.08μs -> 442ns (144% faster)

def test_edge_unescaped_backslash():
    # Unescaped backslashes in content should not affect matching
    codeflash_output = is_tex_string(r"\[x^2\\\]") # 2.69μs -> 1.14μs (135% faster)

def test_edge_non_string_input():
    # Non-string input should raise TypeError
    with pytest.raises(TypeError):
        is_tex_string(None) # 2.92μs -> 1.24μs (136% faster)
    with pytest.raises(TypeError):
        is_tex_string(123) # 1.68μs -> 770ns (118% faster)
    with pytest.raises(TypeError):
        is_tex_string(["$x^2$"]) # 1.37μs -> 590ns (132% faster)

# Large Scale Test Cases

def test_large_tex_string_dollars():
    # Large valid TeX string with $
    large_content = "x" * 1000
    codeflash_output = is_tex_string(f"${large_content}$") # 8.19μs -> 6.60μs (24.0% faster)

def test_large_tex_string_braces():
    # Large valid TeX string with \[ \]
    large_content = "y" * 999
    codeflash_output = is_tex_string(r"\[" + large_content + r"\]") # 8.04μs -> 6.49μs (23.9% faster)

def test_large_tex_string_parens():
    # Large valid TeX string with \( \)
    large_content = "z" * 1000
    codeflash_output = is_tex_string(r"\(" + large_content + r"\)") # 8.10μs -> 6.48μs (24.9% faster)

def test_large_invalid_tex_string():
    # Large invalid string (no delimiters)
    large_content = "a" * 1000
    codeflash_output = is_tex_string(large_content) # 2.25μs -> 821ns (174% faster)

def test_large_multiple_delimiters():
    # Multiple valid delimiters in one string should not match unless at start/end
    s = "$x^2$ $y^2$"
    codeflash_output = is_tex_string(s) # 3.05μs -> 1.26μs (141% faster)
    s2 = r"\[x^2\] \[y^2\]"
    codeflash_output = is_tex_string(s2) # 1.41μs -> 607ns (132% faster)

def test_large_delimiters_at_edges_only():
    # Delimiters at edges, content in between, should match
    content = "b" * 998
    codeflash_output = is_tex_string(f"${content}$") # 7.93μs -> 6.32μs (25.6% faster)
    codeflash_output = is_tex_string(r"\[" + content + r"\]") # 6.60μs -> 5.78μs (14.1% faster)
    codeflash_output = is_tex_string(r"\(" + content + r"\)") # 6.34μs -> 5.68μs (11.5% faster)

def test_large_whitespace_outside_delimiters():
    # Large string with whitespace outside delimiters should not match
    content = "c" * 995
    codeflash_output = is_tex_string(f"  ${content}$") # 2.33μs -> 799ns (191% faster)
    codeflash_output = is_tex_string(f"${content}$  ") # 6.73μs -> 6.02μs (11.8% faster)

def test_large_newlines_inside_delimiters():
    # Large string with newlines inside delimiters should match
    content = "\n".join(["d" * 100 for _ in range(10)])
    codeflash_output = is_tex_string(f"${content}$") # 7.97μs -> 6.48μs (23.0% faster)
    codeflash_output = is_tex_string(r"\[" + content + r"\]") # 6.65μs -> 5.87μs (13.2% faster)
    codeflash_output = is_tex_string(r"\(" + content + r"\)") # 6.34μs -> 5.77μs (9.88% faster)

def test_large_malformed_delimiters():
    # Large string with malformed delimiters should not match
    content = "e" * 1000
    codeflash_output = is_tex_string(f"${content}$") # 2.27μs -> 803ns (183% faster)
    codeflash_output = is_tex_string(r"\[" + content + r"\)") # 6.70μs -> 5.94μs (12.8% faster)
    codeflash_output = is_tex_string(r"\(" + content + r"\]") # 6.27μs -> 5.72μs (9.51% faster)

# Additional edge: Unicode and special characters inside delimiters

def test_unicode_inside_delimiters():
    # Unicode characters inside delimiters should match
    codeflash_output = is_tex_string("$𝑥^2$") # 3.26μs -> 1.85μs (76.6% faster)
    codeflash_output = is_tex_string(r"\[𝑥^2\]") # 1.34μs -> 623ns (116% faster)
    codeflash_output = is_tex_string(r"\(𝑥^2\)") # 950ns -> 359ns (165% faster)

def test_special_characters_inside_delimiters():
    # Special characters inside delimiters should match
    codeflash_output = is_tex_string("$!@#$%^&*()_+{}|:<>?$") # 2.79μs -> 1.30μs (115% faster)
    codeflash_output = is_tex_string(r"\[!@#$%^&*()\]") # 1.31μs -> 506ns (159% faster)
    codeflash_output = is_tex_string(r"\(!@#$%^&*()\)") # 1.01μs -> 440ns (129% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
from bokeh.embed.util import is_tex_string

def test_is_tex_string():
    is_tex_string('')
🔎 Concolic Coverage Tests and Runtime

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| codeflash_concolic_sstvtaha/tmpuf1h6l1m/test_concolic_coverage.py::test_is_tex_string | 2.67μs | 855ns | 212%✅ |

To edit these changes git checkout codeflash/optimize-is_tex_string-mhwtgnza and push.

@codeflash-ai codeflash-ai Bot requested a review from mashraf-222 November 13, 2025 02:35
@codeflash-ai codeflash-ai Bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Nov 13, 2025