Ideally, the tool should work with a wide variety of (or even all) languages. Possible approaches include:
- Pluggable lexers. For example, let the user provide the path to another command-line tool that accepts text via stdin and writes string tokens (separated by some predetermined token separator) to stdout. That way, users can extend the tool with whatever lexers they want, whether handwritten, backed by a machine learning model, or something else.
- Train our own machine learning models for different languages.
- Develop a general-purpose machine learning model that lexes arbitrary projects without prior training on the specific language each project uses.
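
As a rough illustration of the pluggable-lexer approach, the sketch below pipes text to an external command via stdin and splits its stdout into tokens. Everything named here is an assumption made for the example, not a settled design: the function name `lex_with_external_tool`, the choice of a newline as the token separator, and the stand-in whitespace "lexer" (a Python one-liner used only so the example runs without any extra tool installed).

```python
import subprocess
import sys

# Assumption: the predetermined token separator is a newline.
TOKEN_SEPARATOR = "\n"


def lex_with_external_tool(command, text, separator=TOKEN_SEPARATOR):
    """Send `text` to `command` on stdin and split its stdout into tokens.

    `command` is an argument list, e.g. ["/path/to/lexer", "--flag"].
    Raises subprocess.CalledProcessError if the lexer exits non-zero.
    """
    result = subprocess.run(
        command,
        input=text,
        capture_output=True,
        text=True,
        check=True,
    )
    # Drop empty strings, such as the one produced by a trailing separator.
    return [token for token in result.stdout.split(separator) if token]


# Hypothetical stand-in for a user-supplied lexer: splits on whitespace
# and prints one token per line.
WHITESPACE_LEXER = [
    sys.executable,
    "-c",
    "import sys; print('\\n'.join(sys.stdin.read().split()))",
]

if __name__ == "__main__":
    print(lex_with_external_tool(WHITESPACE_LEXER, "let x = 1"))
    # → ['let', 'x', '=', '1']
```

Because the contract is only "text in on stdin, separated tokens out on stdout", the external tool could be written in any language. A real design would also need to decide how tokens that contain the separator are escaped.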