Features:
- Pluggable sets of custom HTML tag/attribute/style filters
- Optional limit for total length of textual content (i.e. not tags)
- Optional whitespace normalization
- Encodes non-ASCII characters
TODO:
- create function to auto convert URLs to url
- add additional writers (specifically JsonML)
- example spider needs to respect Robots.txt!
- treat everything between certain tags as single CDATA block? (e.g. script, style)
- remove everything between certain tags? (e.g. script, style)
- replace tag pairs?
x - separate parser from writer
X - enable inline <% %> script parsing
X - Enable usage in JsonMLTextWriter:
X - Allow incremental parsing mode? i.e. keep tag stack, state and subsequent calls to parse continue where left off?
X - Strip Tags/Attributes not in Whitelists
X - "Configurable" whitelists (e.g. none, simple, secure)
X - "Strip" mode which converts to plaintext (i.e. empty whitelist)
X - Optional limit on size of images (height/width)
X - Limit total number of chars
X - Normalize line endings (CR/CRLF -> LF)
X - Limit line endings? no more than 2 LF in a row?
X - Entity-encode hi order chars (to avoid encoding issues)
X - Parse style tags and allow filtering of properties
X - Normalize whitespace (otherwise messes up 2 LF limit)
X - Image size limit should keep sizes proportional
X - support !Doctype
X - support XML declaration
X - support and conditional comments
X - Flag if a tag is (primarily?) plain text
X - Determine the complexity for parsed markup (plain/inline/block)
X - LF-to-BR: applied only to non-block elements
X - make a quick and dirty spider which uses this
X - literal callbacks? string callback(source, start, end/count);?