jfiveparse: a Java HTML5 parser

jfiveparse passes all the non-scripted tokenizer and tree-construction tests from the html5lib-tests suite.

It provides both fragment and full-document parsing. It can parse directly from a String or by streaming through a Reader. Note: the encoding must be known in advance, as the parser does not currently implement encoding autodetection.
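
Because the encoding is not detected, the caller has to choose the charset when building the Reader. The following JDK-only sketch shows why that choice matters. The class and method names (ExplicitCharsetReader, decode) are illustrative and not part of jfiveparse; the decoded String simply stands in for the characters the parser would receive from the Reader.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetReader {

    // Decodes raw bytes through a Reader built with the charset chosen by the caller.
    // The returned String is what a consumer of the Reader (such as a parser) would see.
    public static String decode(byte[] bytes, Charset charset) {
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            StringWriter out = new StringWriter();
            reader.transferTo(out); // available since Java 10
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The same markup, stored as ISO-8859-1 bytes ("\u00e9" is e with an acute accent).
        byte[] latin1 = "<p>caf\u00e9</p>".getBytes(StandardCharsets.ISO_8859_1);

        // Correct charset: the text round-trips unchanged.
        System.out.println(decode(latin1, StandardCharsets.ISO_8859_1));

        // Wrong charset: the accented byte is not valid UTF-8 and is replaced by U+FFFD.
        System.out.println(decode(latin1, StandardCharsets.UTF_8));
    }
}
```

In real code the byte source would be a file or network stream, and the charset would typically come from an HTTP Content-Type header or from prior knowledge of the document.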

Versions 1.1.4 and older require Java 11. Future 2.x.x versions will require Java 17.

Javadoc is available on javadoc.io.

Features

an HTML5 parser
a CSS selector parser and matcher (works on both the "jfiveparse" DOM types and org.w3c.dom.Document & co.)
a type-safe selector builder and matcher
a converter from the internal "jfiveparse" DOM types to org.w3c.dom.Document

License

jfiveparse is licensed under the Apache License Version 2.0.

Download

Stable

Maven:

<dependency>
    <groupId>ch.digitalfondue.jfiveparse</groupId>
    <artifactId>jfiveparse</artifactId>
    <version>1.1.4</version>
</dependency>

Gradle:

implementation 'ch.digitalfondue.jfiveparse:jfiveparse:1.1.4'

Milestone releases

Maven:

<dependency>
    <groupId>ch.digitalfondue.jfiveparse</groupId>
    <artifactId>jfiveparse</artifactId>
    <version>2.0.0-M1</version>
</dependency>

Gradle:

implementation 'ch.digitalfondue.jfiveparse:jfiveparse:2.0.0-M1'

Use:

If you use it as a module, remember to add "requires ch.digitalfondue.jfiveparse;" to your module-info.java. If you are using the W3CDom class (and its various inner classes), you may also need to require the java.xml module, as it is an optional dependency.
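
Put together, a module descriptor might look like the following sketch. The module name com.example.app is a placeholder (an assumption, not something defined by the project), and the java.xml line is only needed when W3CDom is used.

```java
// module-info.java
// "com.example.app" is a hypothetical module name; replace it with your own.
module com.example.app {
    requires ch.digitalfondue.jfiveparse;
    // Only needed when using W3CDom and its inner classes.
    requires java.xml;
}
```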

Examples:

Parse, select and print all titles from HN

package ch.digitalfondue.jfiveparse.example;

import ch.digitalfondue.jfiveparse.*;

import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URI;
import java.nio.charset.StandardCharsets;

public class LoadHNTitle {

    public static void main(String[] args) throws IOException {
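        // The charset is given explicitly because the parser does not autodetect the encoding.
        try (Reader reader = new InputStreamReader(URI.create("https://news.ycombinator.com/").toURL().openStream(), StandardCharsets.UTF_8)) {
            // Build the matcher from a CSS selector, then apply it to the parsed document.
            NodeMatcher<Node> matcher = Selector.parseSelector("td.title > span.titleline > a");
            JFiveParse.parse(reader).getAllNodesMatchingAsStream(matcher)
                    .map(Element.class::cast)
                    // Print the link text followed by its target.
                    .forEach(a -> System.out.printf("%s [%s]\n", a.getTextContent(), a.getAttribute("href")));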
        }
    }
}

Type-safe selectors:

See https://github.com/digitalfondue/jfiveparse/blob/master/src/test/java/ch/digitalfondue/jfiveparse/NodeMatchersTest.java

Other examples

See directory: https://github.com/digitalfondue/jfiveparse/tree/master/src/test/java/ch/digitalfondue/jfiveparse/example

Convert to the Java DOM representation

If you need to generate an org.w3c.dom.Document from the ch.digitalfondue.jfiveparse.Document representation, use the static helper method W3CDom.toW3CDocument.

Notes:

Specs/Doc:

HTML by WHATWG: https://html.spec.whatwg.org/multipage/syntax.html (of interest: tokenization, tree construction)
Entities: https://html.spec.whatwg.org/entities.json

Template element handling

The template element is currently handled as a "normal" element, so its child nodes are not placed inside a DocumentFragment. This will be fixed.

Special parsing options

The parser can be customized to allow some non-standard behaviour. For usage, see the following tests: https://github.com/digitalfondue/jfiveparse/blob/master/src/test/java/ch/digitalfondue/jfiveparse/OptionParseTest.java

DISABLE_IGNORE_TOKEN_IN_BODY_START_TAG: allows, for example, a "tr" tag without the containing table/tbody.
INTERPRET_SELF_CLOSING_ANYTHING_ELSE: when an unknown self-closing tag is encountered, it is interpreted as written (self-closing) and not as an open tag only, thus creating a non-intuitive DOM.

Entities

The &ntities; are by default (and by specification) parsed and interpreted. This behavior can be disabled by:

passing the enum "Option.DONT_TRANSFORM_ENTITIES" to the Parser
passing the enum "Option.DONT_TRANSFORM_ENTITIES" when calling Node.get{Inner,Outer}HTML(), which disables the escaping. Be aware that the generated document may then be incorrect.

Preserving as much of the original document as possible when serializing

By default, when parsing/serializing, the following transformations will be applied:

entities will be interpreted and converted
all the attribute values will be double-quoted
tag and attribute names will be lower-case
the "/" character used in self-closing tags will be ignored
some whitespace will be ignored

Currently, jfiveparse can preserve the entities, the attribute quoting style, the attribute name case and the tag name case.

To preserve as much of the original document as possible when serializing back to a string, pass the following parameters:

pass the enum "Option.DONT_TRANSFORM_ENTITIES" to the Parser
when calling Node.get{Inner,Outer}HTML(), pass the enums:
- Option.DONT_TRANSFORM_ENTITIES
- Option.PRINT_ORIGINAL_ATTRIBUTE_QUOTE
- Option.PRINT_ORIGINAL_ATTRIBUTES_CASE
- Option.PRINT_ORIGINAL_TAG_CASE

Uppercase handling in the tokenizer

Note: this is a deviation from the specification in terms of how the tokenizer is implemented, but the end result is correct, as the attribute and tag names are still converted to lower case.

In the tokenizer, instead of applying the toLowerCase function to each character, the transformation is done in a single call in the TreeConstructor (see setTagName). This makes it possible to keep the original case of the attribute and tag names.
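
The idea can be illustrated with a small, self-contained sketch. It is not the library's code: TagNameCaseSketch, TagName, readTagName and asciiLowerCase are hypothetical names, and the "tokenizer" here only reads a run of letters and digits. It shows the characters being collected untouched, with the lower-case form derived once afterwards, so both spellings stay available.

```java
public class TagNameCaseSketch {

    // Hypothetical holder: keeps the name as written plus its lower-case form.
    public static final class TagName {
        private final String original;
        private final String lowerCase;

        TagName(String original) {
            this.original = original;
            // One conversion for the whole name, instead of one per character while tokenizing.
            this.lowerCase = asciiLowerCase(original);
        }

        public String original() { return original; }
        public String lowerCase() { return lowerCase; }
    }

    // Lower-cases only ASCII upper-case letters and leaves every other character unchanged.
    public static String asciiLowerCase(String s) {
        char[] chars = s.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            if (chars[i] >= 'A' && chars[i] <= 'Z') {
                chars[i] = (char) (chars[i] + 0x20);
            }
        }
        return new String(chars);
    }

    // "Tokenizer" side: characters are appended exactly as they appear in the input.
    public static TagName readTagName(String input, int start) {
        StringBuilder sb = new StringBuilder();
        int i = start;
        while (i < input.length() && Character.isLetterOrDigit(input.charAt(i))) {
            sb.append(input.charAt(i));
            i++;
        }
        return new TagName(sb.toString());
    }

    public static void main(String[] args) {
        TagName name = readTagName("<DiV class=x>", 1);
        // prints "div / DiV"
        System.out.println(name.lowerCase() + " / " + name.original());
    }
}
```

Keeping the original spelling next to the normalized one is what allows serialization options such as Option.PRINT_ORIGINAL_TAG_CASE to reproduce the input.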

TODO:

additional doc
expand the typesafe matcher api
keep track of line numbers, and possibly character positions too
profile
- various optimizations...
- TokenizerRCDataStates.handleRCDataState could be optimized
  - (textarea related)

mvn clean test jacoco:report
