Unneeded patterns/rules influence the result of the parsing #29

@robin-xyzt-ai

This might be an issue with retree rather than with dateparser itself.

The following test (which you cannot execute via the public API) fails:

    @Test
    public void parserWithLimitedPatterns(){
        List<String> rules = Arrays.asList(
          "(?<year>\\d{4})\\W{1}(?<month>\\d{1,2})\\W{1}(?<day>\\d{1,2})[^\\d]?",
          "\\W*(?:at )?(?<hour>\\d{1,2}):(?<minute>\\d{1,2})(?::(?<second>\\d{1,2}))?(?:[.,](?<ns>\\d{1,9}))?(?<zero>z)?",
          " ?(?<zoneOffset>[-+]\\d{1,2}:?(?:\\d{2})?)"
        );

        DateParser dateParser = new DateParser(rules, new HashSet<>(rules), Collections.emptyMap(), true, false);
        String input = "2022-08-09 19:04:31.600000+00:00";
        Date date = dateParser.parseDate(input);
        assertEquals(parser.parseDate(input), date);
    }

Note that these three rules should be sufficient to parse the input:

  • There is a rule for the year-month-day part
  • There is a rule for the hours:minutes:seconds.ns part
  • There is a rule for the zone offset part
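To illustrate why the three rules look sufficient, here is a minimal sketch that applies them once each, in order, using plain `java.util.regex` rather than retree or `DateParser`. The class `SequentialRules` and the method `consume` are hypothetical names for this illustration only; the sketch says nothing about how retree actually selects rules.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of dateparser or retree.
final class SequentialRules {
    // The three rules from the failing test, unchanged.
    static final List<String> RULES = Arrays.asList(
        "(?<year>\\d{4})\\W{1}(?<month>\\d{1,2})\\W{1}(?<day>\\d{1,2})[^\\d]?",
        "\\W*(?:at )?(?<hour>\\d{1,2}):(?<minute>\\d{1,2})(?::(?<second>\\d{1,2}))?(?:[.,](?<ns>\\d{1,9}))?(?<zero>z)?",
        " ?(?<zoneOffset>[-+]\\d{1,2}:?(?:\\d{2})?)"
    );

    // Applies each rule once, in list order, starting where the previous
    // rule stopped. Returns the offset reached, or -1 if a rule does not
    // match at the current offset.
    static int consume(String input) {
        int offset = 0;
        for (String rule : RULES) {
            Matcher m = Pattern.compile(rule).matcher(input);
            m.region(offset, input.length());
            if (!m.lookingAt()) {
                return -1;
            }
            offset = m.end();
        }
        return offset;
    }

    public static void main(String[] args) {
        String input = "2022-08-09 19:04:31.600000+00:00";
        // true means the three rules together cover the whole input.
        System.out.println(consume(input) == input.length());
    }
}
```

Applied this way, the date rule stops at offset 11, the time rule at offset 26, and the zone offset rule at offset 32, which is the end of the input.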

However, during parsing the zoneOffset rule is never used. Instead, the hour rule is applied twice.
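One observation that may be relevant: with plain `java.util.regex`, both the time rule and the zone offset rule match the remaining `+00:00` in full, because the leading `\W*` of the time rule accepts the `+` sign. The sketch below only demonstrates that overlap. `TailAmbiguity` and `prefixLength` are hypothetical names, and whether this overlap is what makes retree pick the time rule is an assumption, not something confirmed here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of dateparser or retree.
final class TailAmbiguity {
    static final String TIME_RULE =
        "\\W*(?:at )?(?<hour>\\d{1,2}):(?<minute>\\d{1,2})(?::(?<second>\\d{1,2}))?(?:[.,](?<ns>\\d{1,9}))?(?<zero>z)?";
    static final String ZONE_RULE =
        " ?(?<zoneOffset>[-+]\\d{1,2}:?(?:\\d{2})?)";

    // Length of the prefix of text matched by rule, or -1 if the rule
    // does not match at the start of text.
    static int prefixLength(String rule, String text) {
        Matcher m = Pattern.compile(rule).matcher(text);
        return m.lookingAt() ? m.end() : -1;
    }

    public static void main(String[] args) {
        String tail = "+00:00"; // what is left after the date and time parts
        System.out.println("time rule: " + prefixLength(TIME_RULE, tail)); // prints "time rule: 6"
        System.out.println("zone rule: " + prefixLength(ZONE_RULE, tail)); // prints "zone rule: 6"
    }
}
```

Both rules consume all six characters, so match length alone cannot distinguish them on this tail.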

The weird thing is that when I add a rule that should not be used (`" ?(?<year>\d{4})$"`), the test suddenly succeeds:

    @Test
    public void parserWithLimitedPatterns(){
        List<String> rules = Arrays.asList(
          "(?<year>\\d{4})\\W{1}(?<month>\\d{1,2})\\W{1}(?<day>\\d{1,2})[^\\d]?",
          " ?(?<year>\\\\d{4})$",
          "\\W*(?:at )?(?<hour>\\d{1,2}):(?<minute>\\d{1,2})(?::(?<second>\\d{1,2}))?(?:[.,](?<ns>\\d{1,9}))?(?<zero>z)?",
          " ?(?<zoneOffset>[-+]\\d{1,2}:?(?:\\d{2})?)"
        );

        DateParser dateParser = new DateParser(rules, new HashSet<>(rules), Collections.emptyMap(), true, false);
        String input = "2022-08-09 19:04:31.600000+00:00";
        Date date = dateParser.parseDate(input);
        assertEquals(parser.parseDate(input), date);
    }

The position where I add that additional rule matters. For example, adding it at the end of the list instead of at index 1 makes the test fail again.

I ran into this issue while working on PR #28, where I try to reduce the number of rules used for parsing in order to improve performance.
