trafilatura
Discovery, extraction, and processing of web text: extracts the main text content from web pages.
$ trafilatura -u "https://example.com"
$ trafilatura --json < input.html
$ trafilatura --with-metadata -u "https://example.com"
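
The JSON produced with --json can be post-processed by any JSON-aware tool. The following is a minimal Python sketch of that step, not part of trafilatura itself. It assumes the output is a single JSON object carrying "title" and "text" keys; the helper name summarize_record and the SAMPLE record are hypothetical and used only for illustration.

```python
import json


def summarize_record(raw_json):
    """Return the title and word count of one JSON record.

    Assumption: the record is a JSON object with optional
    "title" and "text" keys; missing keys are tolerated.
    """
    record = json.loads(raw_json)
    text = record.get("text") or ""
    return {"title": record.get("title"), "words": len(text.split())}


# Hypothetical record standing in for real --json output.
SAMPLE = (
    '{"title": "Example Domain", '
    '"text": "This domain is for use in illustrative examples."}'
)

summary = summarize_record(SAMPLE)
print(summary)
```

Using .get() rather than direct indexing keeps the sketch from failing when a page yields no title or no text.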