Package org.codelibs.nekohtml.filters
Class Purifier
- java.lang.Object
  - org.codelibs.nekohtml.filters.DefaultFilter
    - org.codelibs.nekohtml.filters.Purifier
- All Implemented Interfaces:
- org.apache.xerces.xni.parser.XMLComponent, org.apache.xerces.xni.parser.XMLDocumentFilter, org.apache.xerces.xni.parser.XMLDocumentSource, org.apache.xerces.xni.XMLDocumentHandler, HTMLComponent
public class Purifier extends DefaultFilter
This filter purifies the HTML input to ensure XML well-formedness. The purification process includes:
- fixing illegal characters in the document, including
  - element and attribute names,
  - processing instruction target and data,
  - document text;
- ensuring the string "--" does not appear in the content of a comment;
- ensuring the string "]]>" does not appear in the content of a CDATA section;
- ensuring that the XML declaration has required pseudo-attributes and that the values are correct; and
- synthesizing missing namespace bindings.
Illegal characters in XML names are converted to the character sequence "_u####_", where "####" is the hexadecimal value of the Unicode character. Illegal characters appearing in document content are converted to the character sequence "\\u####".
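The two escaping rules can be illustrated with a minimal, self-contained sketch. It is not the Purifier source: the class `EscapeSketch`, its method names, the simplified ASCII-only name check, the upper-case hex digits, and the reading of the content escape as a single backslash followed by "u" are all assumptions made for illustration.

```java
// Hypothetical illustration only; not the actual Purifier implementation.
final class EscapeSketch {

    // Assumption: a simplified, ASCII-only stand-in for the XML Name rules.
    static boolean isNameChar(char c, boolean first) {
        boolean start = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '_';
        if (first) {
            return start;
        }
        return start || (c >= '0' && c <= '9') || c == '-' || c == '.';
    }

    // XML 1.0 Char production, restricted to single UTF-16 code units.
    // Surrogates are treated as illegal here, which is a simplification.
    static boolean isXmlChar(char c) {
        return c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || (c >= 0xE000 && c <= 0xFFFD);
    }

    // Four upper-case hex digits (the digit case is an assumption).
    static String hex4(char c) {
        return String.format("%04X", (int) c);
    }

    // Replaces each illegal name character with "_u####_".
    static String escapeName(String name) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (isNameChar(c, i == 0)) {
                out.append(c);
            } else {
                out.append("_u").append(hex4(c)).append('_');
            }
        }
        return out.toString();
    }

    // Replaces each illegal content character with a backslash, 'u', and four hex digits.
    static String escapeText(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isXmlChar(c)) {
                out.append(c);
            } else {
                out.append("\\u").append(hex4(c));
            }
        }
        return out.toString();
    }
}
```

Under these assumptions, `EscapeSketch.escapeName("a b")` yields `a_u0020_b`, because the space is not a legal name character.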
In comments, the character '-' is replaced by the character sequence "- " to prevent "--" from ever appearing in the comment content. For CDATA sections, the character ']' is replaced by the character sequence "] " to prevent "]]" from appearing.
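The comment and CDATA replacements described above amount to appending a space after every occurrence of one character. The following sketch shows that idea in isolation; the class `DelimiterSketch` and its method names are hypothetical and are not part of the Purifier API.

```java
// Hypothetical illustration only; not the actual Purifier implementation.
final class DelimiterSketch {

    // Appends a space after every occurrence of the target character.
    static String padAfter(String s, char target) {
        StringBuilder out = new StringBuilder(s.length() + 8);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            out.append(c);
            if (c == target) {
                out.append(' ');
            }
        }
        return out.toString();
    }

    // Comment content: '-' becomes "- ", so "--" can no longer occur.
    static String purifyComment(String content) {
        return padAfter(content, '-');
    }

    // CDATA content: ']' becomes "] ", so "]]" (and therefore "]]>") can no longer occur.
    static String purifyCData(String content) {
        return padAfter(content, ']');
    }
}
```

For example, `DelimiterSketch.purifyComment("a--b")` returns `a- - b`. Because every matching character is padded, the output is longer than the input even when no illegal sequence was present.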
The URI used for synthesized namespace bindings is "http://cyberneko.org/html/ns/synthesized/number" where number is generated to ensure uniqueness.
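One simple way to obtain such unique URIs is to append a running counter to a fixed prefix, which is what the fields SYNTHESIZED_NAMESPACE_PREFX and fSynthesizedNamespaceCount suggest. The sketch below assumes that scheme; the class name, the method `nextUri`, and the counter starting at zero are illustrative assumptions.

```java
// Hypothetical illustration only; not the actual Purifier implementation.
final class SynthesizedNamespaceSketch {

    // Prefix taken from the URI pattern quoted in the class description.
    static final String PREFIX = "http://cyberneko.org/html/ns/synthesized/";

    // Assumption: the counter starts at zero and grows by one per binding.
    private int count;

    // Returns a URI that this instance has not handed out before.
    String nextUri() {
        return PREFIX + count++;
    }
}
```

Each call to `nextUri()` on the same instance returns a different URI, so two undeclared prefixes in one document never share a synthesized namespace.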
- Version:
- $Id: Purifier.java,v 1.5 2005/02/14 03:56:54 andyc Exp $
- Author:
- Andy Clark
-
Field Summary
Fields (modifier and type, field: description)
- protected static java.lang.String AUGMENTATIONS: Include infoset augmentations.
- protected boolean fAugmentations: Augmentations.
- protected boolean fInCDATASection: True if inside a CDATA section.
- protected org.apache.xerces.xni.NamespaceContext fNamespaceContext: Namespace information.
- protected boolean fNamespaces: Namespaces.
- protected java.lang.String fPublicId: Public identifier of doctype declaration.
- protected boolean fSeenDoctype: True if the doctype declaration was seen.
- protected boolean fSeenRootElement: True if root element was seen.
- protected int fSynthesizedNamespaceCount: Synthesized namespace binding count.
- protected java.lang.String fSystemId: System identifier of doctype declaration.
- protected static java.lang.String NAMESPACES: Namespaces.
- protected static HTMLEventInfo SYNTHESIZED_ITEM: Synthesized event info item.
- static java.lang.String SYNTHESIZED_NAMESPACE_PREFX: Synthesized namespace binding prefix.
-
Fields inherited from class org.codelibs.nekohtml.filters.DefaultFilter
fDocumentHandler, fDocumentSource
-
-
Constructor Summary
Constructors (constructor: description)
- Purifier()
-
Method Summary
Methods (modifier and type, method: description)
- void characters(org.apache.xerces.xni.XMLString text, org.apache.xerces.xni.Augmentations augs): Characters.
- void comment(org.apache.xerces.xni.XMLString text, org.apache.xerces.xni.Augmentations augs): Comment.
- void doctypeDecl(java.lang.String root, java.lang.String pubid, java.lang.String sysid, org.apache.xerces.xni.Augmentations augs): Doctype declaration.
- void emptyElement(org.apache.xerces.xni.QName element, org.apache.xerces.xni.XMLAttributes attrs, org.apache.xerces.xni.Augmentations augs): Empty element.
- void endCDATA(org.apache.xerces.xni.Augmentations augs): End CDATA section.
- void endElement(org.apache.xerces.xni.QName element, org.apache.xerces.xni.Augmentations augs): End element.
- protected void handleStartDocument(): Handle start document.
- protected void handleStartElement(org.apache.xerces.xni.QName element, org.apache.xerces.xni.XMLAttributes attrs): Handle start element.
- void processingInstruction(java.lang.String target, org.apache.xerces.xni.XMLString data, org.apache.xerces.xni.Augmentations augs): Processing instruction.
- protected java.lang.String purifyName(java.lang.String name, boolean localpart): Purify name.
- protected org.apache.xerces.xni.QName purifyQName(org.apache.xerces.xni.QName qname): Purify qualified name.
- protected org.apache.xerces.xni.XMLString purifyText(org.apache.xerces.xni.XMLString text): Purify content.
- void reset(org.apache.xerces.xni.parser.XMLComponentManager manager): Resets the component.
- void startCDATA(org.apache.xerces.xni.Augmentations augs): Start CDATA section.
- void startDocument(org.apache.xerces.xni.XMLLocator locator, java.lang.String encoding, org.apache.xerces.xni.Augmentations augs): Start document.
- void startDocument(org.apache.xerces.xni.XMLLocator locator, java.lang.String encoding, org.apache.xerces.xni.NamespaceContext nscontext, org.apache.xerces.xni.Augmentations augs): Start document.
- void startElement(org.apache.xerces.xni.QName element, org.apache.xerces.xni.XMLAttributes attrs, org.apache.xerces.xni.Augmentations augs): Start element.
- protected void synthesizeBinding(org.apache.xerces.xni.XMLAttributes attrs, java.lang.String ns): Synthesize namespace binding.
- protected org.apache.xerces.xni.Augmentations synthesizedAugs(): Returns an augmentations object with a synthesized item added.
- protected static java.lang.String toHexString(int c, int padlen): Returns a padded hexadecimal string for the given value.
- void xmlDecl(java.lang.String version, java.lang.String encoding, java.lang.String standalone, org.apache.xerces.xni.Augmentations augs): XML declaration.
-
Methods inherited from class org.codelibs.nekohtml.filters.DefaultFilter
endDocument, endGeneralEntity, endPrefixMapping, getDocumentHandler, getDocumentSource, getFeatureDefault, getPropertyDefault, getRecognizedFeatures, getRecognizedProperties, ignorableWhitespace, merge, setDocumentHandler, setDocumentSource, setFeature, setProperty, startGeneralEntity, startPrefixMapping, textDecl
-
Field Detail
-
SYNTHESIZED_NAMESPACE_PREFX
public static final java.lang.String SYNTHESIZED_NAMESPACE_PREFX
Synthesized namespace binding prefix.
- See Also:
- Constant Field Values
-
NAMESPACES
protected static final java.lang.String NAMESPACES
Namespaces.
- See Also:
- Constant Field Values
-
AUGMENTATIONS
protected static final java.lang.String AUGMENTATIONS
Include infoset augmentations.
- See Also:
- Constant Field Values
-
SYNTHESIZED_ITEM
protected static final HTMLEventInfo SYNTHESIZED_ITEM
Synthesized event info item.
-
fNamespaces
protected boolean fNamespaces
Namespaces.
-
fAugmentations
protected boolean fAugmentations
Augmentations.
-
fSeenDoctype
protected boolean fSeenDoctype
True if the doctype declaration was seen.
-
fSeenRootElement
protected boolean fSeenRootElement
True if root element was seen.
-
fInCDATASection
protected boolean fInCDATASection
True if inside a CDATA section.
-
fPublicId
protected java.lang.String fPublicId
Public identifier of doctype declaration.
-
fSystemId
protected java.lang.String fSystemId
System identifier of doctype declaration.
-
fNamespaceContext
protected org.apache.xerces.xni.NamespaceContext fNamespaceContext
Namespace information.
-
fSynthesizedNamespaceCount
protected int fSynthesizedNamespaceCount
Synthesized namespace binding count.
-
-
Method Detail
-
reset
public void reset(org.apache.xerces.xni.parser.XMLComponentManager manager)
Description copied from class: DefaultFilter
Resets the component. The component can query the component manager about any features and properties that affect the operation of the component.
- Specified by:
- reset in interface org.apache.xerces.xni.parser.XMLComponent
- Overrides:
- reset in class DefaultFilter
- Parameters:
- manager - The component manager.
-
startDocument
public void startDocument(org.apache.xerces.xni.XMLLocator locator, java.lang.String encoding, org.apache.xerces.xni.Augmentations augs)
Start document.
- Overrides:
- startDocument in class DefaultFilter
-
startDocument
public void startDocument(org.apache.xerces.xni.XMLLocator locator, java.lang.String encoding, org.apache.xerces.xni.NamespaceContext nscontext, org.apache.xerces.xni.Augmentations augs)
Start document.
- Specified by:
- startDocument in interface org.apache.xerces.xni.XMLDocumentHandler
- Overrides:
- startDocument in class DefaultFilter
-
xmlDecl
public void xmlDecl(java.lang.String version, java.lang.String encoding, java.lang.String standalone, org.apache.xerces.xni.Augmentations augs)
XML declaration.
- Specified by:
- xmlDecl in interface org.apache.xerces.xni.XMLDocumentHandler
- Overrides:
- xmlDecl in class DefaultFilter
-
comment
public void comment(org.apache.xerces.xni.XMLString text, org.apache.xerces.xni.Augmentations augs)
Comment.
- Specified by:
- comment in interface org.apache.xerces.xni.XMLDocumentHandler
- Overrides:
- comment in class DefaultFilter
-
processingInstruction
public void processingInstruction(java.lang.String target, org.apache.xerces.xni.XMLString data, org.apache.xerces.xni.Augmentations augs)
Processing instruction.
- Specified by:
- processingInstruction in interface org.apache.xerces.xni.XMLDocumentHandler
- Overrides:
- processingInstruction in class DefaultFilter
-
doctypeDecl
public void doctypeDecl(java.lang.String root, java.lang.String pubid, java.lang.String sysid, org.apache.xerces.xni.Augmentations augs)
Doctype declaration.
- Specified by:
- doctypeDecl in interface org.apache.xerces.xni.XMLDocumentHandler
- Overrides:
- doctypeDecl in class DefaultFilter
-
startElement
public void startElement(org.apache.xerces.xni.QName element, org.apache.xerces.xni.XMLAttributes attrs, org.apache.xerces.xni.Augmentations augs)
Start element.
- Specified by:
- startElement in interface org.apache.xerces.xni.XMLDocumentHandler
- Overrides:
- startElement in class DefaultFilter
-
emptyElement
public void emptyElement(org.apache.xerces.xni.QName element, org.apache.xerces.xni.XMLAttributes attrs, org.apache.xerces.xni.Augmentations augs)
Empty element.
- Specified by:
- emptyElement in interface org.apache.xerces.xni.XMLDocumentHandler
- Overrides:
- emptyElement in class DefaultFilter
-
startCDATA
public void startCDATA(org.apache.xerces.xni.Augmentations augs)
Start CDATA section.
- Specified by:
- startCDATA in interface org.apache.xerces.xni.XMLDocumentHandler
- Overrides:
- startCDATA in class DefaultFilter
-
endCDATA
public void endCDATA(org.apache.xerces.xni.Augmentations augs)
End CDATA section.
- Specified by:
- endCDATA in interface org.apache.xerces.xni.XMLDocumentHandler
- Overrides:
- endCDATA in class DefaultFilter
-
characters
public void characters(org.apache.xerces.xni.XMLString text, org.apache.xerces.xni.Augmentations augs)
Characters.
- Specified by:
- characters in interface org.apache.xerces.xni.XMLDocumentHandler
- Overrides:
- characters in class DefaultFilter
-
endElement
public void endElement(org.apache.xerces.xni.QName element, org.apache.xerces.xni.Augmentations augs)
End element.
- Specified by:
- endElement in interface org.apache.xerces.xni.XMLDocumentHandler
- Overrides:
- endElement in class DefaultFilter
-
handleStartDocument
protected void handleStartDocument()
Handle start document.
-
handleStartElement
protected void handleStartElement(org.apache.xerces.xni.QName element, org.apache.xerces.xni.XMLAttributes attrs)
Handle start element.
-
synthesizeBinding
protected void synthesizeBinding(org.apache.xerces.xni.XMLAttributes attrs, java.lang.String ns)
Synthesize namespace binding.
-
synthesizedAugs
protected final org.apache.xerces.xni.Augmentations synthesizedAugs()
Returns an augmentations object with a synthesized item added.
-
purifyQName
protected org.apache.xerces.xni.QName purifyQName(org.apache.xerces.xni.QName qname)
Purify qualified name.
-
purifyName
protected java.lang.String purifyName(java.lang.String name, boolean localpart)
Purify name.
-
purifyText
protected org.apache.xerces.xni.XMLString purifyText(org.apache.xerces.xni.XMLString text)
Purify content.
-
toHexString
protected static java.lang.String toHexString(int c, int padlen)
Returns a padded hexadecimal string for the given value.
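A padded hexadecimal conversion of this kind can be sketched as follows. This is a stand-alone illustration with the same parameter names, not the library's code: the class `HexSketch`, the left-padding with zeros, and the upper-case output are assumptions.

```java
import java.util.Locale;

// Hypothetical illustration only; not the actual Purifier implementation.
final class HexSketch {

    // Converts c to hexadecimal and left-pads with '0' up to padlen digits.
    // Values that need more than padlen digits are returned unpadded.
    static String toHexString(int c, int padlen) {
        StringBuilder str = new StringBuilder(Integer.toHexString(c));
        while (str.length() < padlen) {
            str.insert(0, '0');
        }
        // Assumption: upper-case digits.
        return str.toString().toUpperCase(Locale.ROOT);
    }
}
```

With these assumptions, `HexSketch.toHexString(0x20, 4)` returns `0020`, which is the "####" part of the escape sequences described in the class overview.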
-
-