Selfoss HTML Feed Plugin

Plugin for the self-hosted RSS reader Selfoss that extracts a feed from website content via XPath queries.

Selfoss is a self-hosted multi-purpose RSS reader and aggregation Web application. It already comes with several plugins for various non-RSS sources. However, when no proper feed or API is available and everything else fails, it lacks the option to simply process arbitrary HTML content. Doing so is often easy, especially for pages with a news or blog layout.

The selfoss codebase makes it easy to add this functionality as a custom spout plugin, though. Given corresponding XPath expressions, feed items are built from the HTML parsed at the configured URL. For installation, the HTML Feed Extractor Plugin only needs to be copied to selfoss/src/spouts/html/extractor.php. In contrast to CSS-style selectors, XML-based HTML processing is usually available in default PHP installations, so no additional dependencies should be needed.

Figure: selfoss HTML extractor sample configurations

As the example configurations show, one XPath query selects all DOM nodes that are to be treated as individual articles. Further XPath expressions, evaluated relative to each of these nodes, define the per-article properties such as title or link.
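The two-level lookup described above can be sketched in a few lines. This is only an illustration in Python, not the plugin's PHP code: the configuration keys (item, title, link, content), the sample markup, and the function name extract_items are all hypothetical. It also assumes well-formed markup, because Python's xml.etree.ElementTree is an XML parser that supports only a subset of XPath, whereas the plugin works on real-world HTML.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page; must be well-formed for ElementTree to parse it.
PAGE = """
<html><body>
  <div class="post">
    <h2><a href="/news/1">First post</a></h2>
    <p class="summary">Hello world.</p>
  </div>
  <div class="post">
    <h2><a href="/news/2">Second post</a></h2>
    <p class="summary">More news.</p>
  </div>
  <div class="sidebar"><h2><a href="/about">About</a></h2></div>
</body></html>
"""

# Hypothetical configuration: one query selecting the article nodes,
# plus queries evaluated relative to each selected node.
CONFIG = {
    "item": ".//div[@class='post']",
    "title": "./h2/a",
    "link": "./h2/a",
    "content": "./p[@class='summary']",
}


def extract_items(markup, config):
    """Build a list of feed-item dicts from markup using the given queries."""
    root = ET.fromstring(markup.strip())
    items = []
    for node in root.findall(config["item"]):
        # Each property query is resolved relative to the article node.
        title = node.find(config["title"])
        link = node.find(config["link"])
        content = node.find(config["content"])
        items.append({
            "title": (title.text or "") if title is not None else "",
            "link": link.get("href", "") if link is not None else "",
            "content": (content.text or "") if content is not None else "",
        })
    return items


if __name__ == "__main__":
    for item in extract_items(PAGE, CONFIG):
        print(item["title"], item["link"], item["content"])
```

Because the property queries start at the article node, the sidebar heading is ignored even though it also matches h2/a elsewhere in the page.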

Code & Download