Package org.codelibs.nekohtml
Class HTMLScanner
- java.lang.Object
-
- org.codelibs.nekohtml.HTMLScanner
-
- All Implemented Interfaces:
org.apache.xerces.xni.parser.XMLComponent, org.apache.xerces.xni.parser.XMLDocumentScanner, org.apache.xerces.xni.parser.XMLDocumentSource, org.apache.xerces.xni.XMLLocator, HTMLComponent
public class HTMLScanner extends java.lang.Object implements org.apache.xerces.xni.parser.XMLDocumentScanner, org.apache.xerces.xni.XMLLocator, HTMLComponent
A simple HTML scanner. This scanner makes no attempt to balance tags or fix other problems in the source document; it just scans what it can and generates XNI document "events", ignoring errors of all kinds.
This component recognizes the following features:
- http://cyberneko.org/html/features/augmentations
- http://cyberneko.org/html/features/report-errors
- http://apache.org/xml/features/scanner/notify-char-refs
- http://apache.org/xml/features/scanner/notify-builtin-refs
- http://cyberneko.org/html/features/scanner/notify-builtin-refs
- http://cyberneko.org/html/features/scanner/fix-mswindows-refs
- http://cyberneko.org/html/features/scanner/script/strip-cdata-delims
- http://cyberneko.org/html/features/scanner/script/strip-comment-delims
- http://cyberneko.org/html/features/scanner/style/strip-cdata-delims
- http://cyberneko.org/html/features/scanner/style/strip-comment-delims
- http://cyberneko.org/html/features/scanner/ignore-specified-charset
- http://cyberneko.org/html/features/scanner/cdata-sections
- http://cyberneko.org/html/features/override-doctype
- http://cyberneko.org/html/features/insert-doctype
- http://cyberneko.org/html/features/parse-noscript-content
- http://cyberneko.org/html/features/scanner/allow-selfclosing-iframe
- http://cyberneko.org/html/features/scanner/allow-selfclosing-tags
This component recognizes the following properties:
- http://cyberneko.org/html/properties/names/elems
- http://cyberneko.org/html/properties/names/attrs
- http://cyberneko.org/html/properties/default-encoding
- http://cyberneko.org/html/properties/error-reporter
- http://cyberneko.org/html/properties/doctype/pubid
- http://cyberneko.org/html/properties/doctype/sysid
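As a rough illustration of how the feature and property identifiers above are meant to be used, the sketch below passes a few of them to setFeature(String, boolean) and setProperty(String, Object). To keep it runnable without NekoHTML on the classpath, it uses a hypothetical `Configurable` interface and a `Recorder` stand-in that only mirror those two signatures; the encoding value "UTF-8" is likewise an assumption, not a documented default.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ScannerConfigSketch {
    /** Hypothetical stand-in mirroring the two setter signatures listed for HTMLScanner. */
    interface Configurable {
        void setFeature(String featureId, boolean state);
        void setProperty(String propertyId, Object value);
    }

    /** Records every setting so the sketch runs without the real scanner. */
    static final class Recorder implements Configurable {
        final Map<String, Object> settings = new LinkedHashMap<>();

        @Override
        public void setFeature(String featureId, boolean state) {
            settings.put(featureId, state);
        }

        @Override
        public void setProperty(String propertyId, Object value) {
            settings.put(propertyId, value);
        }
    }

    /** Applies a few of the identifiers listed above. */
    static void configure(Configurable scanner) {
        scanner.setFeature("http://cyberneko.org/html/features/augmentations", true);
        scanner.setFeature("http://cyberneko.org/html/features/scanner/cdata-sections", true);
        // "lower" is one of the documented values { "upper", "lower", "default" }.
        scanner.setProperty("http://cyberneko.org/html/properties/names/elems", "lower");
        // Assumed value, for illustration only.
        scanner.setProperty("http://cyberneko.org/html/properties/default-encoding", "UTF-8");
    }

    public static void main(String[] args) {
        Recorder recorder = new Recorder();
        configure(recorder);
        recorder.settings.forEach((id, value) -> System.out.println(id + " = " + value));
    }
}
```

With the real class, the same calls would presumably be made on an HTMLScanner instance (or on the parser configuration that owns it) before scanning starts.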
- Version:
- $Id: HTMLScanner.java,v 1.19 2005/06/14 05:52:37 andyc Exp $
- Author:
- Andy Clark, Marc Guillemot, Ahmed Ashour
- See Also:
HTMLElements, HTMLEntities
-
-
Nested Class Summary
Nested Classes (modifier and type, class: description)
- class HTMLScanner.ContentScanner: The primary HTML document scanner.
- static class HTMLScanner.CurrentEntity: Current entity.
- protected static class HTMLScanner.LocationItem: Location infoset item.
- static class HTMLScanner.PlaybackInputStream: A playback input stream.
- static interface HTMLScanner.Scanner: Basic scanner interface.
- class HTMLScanner.SpecialScanner: Special scanner used for elements whose content needs to be scanned as plain text, ignoring markup such as elements and entity references.
-
Field Summary
Fields (modifier and type, field: description)
- static java.lang.String ALLOW_SELFCLOSING_IFRAME: Allows a self-closing <iframe/> tag.
- static java.lang.String ALLOW_SELFCLOSING_TAGS: Allows self-closing tags, e.g. <div/> (XHTML).
- protected static java.lang.String AUGMENTATIONS: Include infoset augmentations.
- static java.lang.String CDATA_SECTIONS: Scan CDATA sections.
- protected static int DEFAULT_BUFFER_SIZE: Default buffer size.
- protected static java.lang.String DEFAULT_ENCODING: Default encoding.
- protected static java.lang.String DOCTYPE_PUBID: Doctype declaration public identifier.
- protected static java.lang.String DOCTYPE_SYSID: Doctype declaration system identifier.
- protected static java.lang.String ERROR_REPORTER: Error reporter.
- protected boolean fAllowSelfclosingIframe: Allows self closing iframe tags.
- protected boolean fAllowSelfclosingTags: Allows self closing tags.
- protected boolean fAugmentations: Augmentations.
- protected int fBeginCharacterOffset: Beginning character offset in the file.
- protected int fBeginColumnNumber: Beginning column number.
- protected int fBeginLineNumber: Beginning line number.
- protected HTMLScanner.PlaybackInputStream fByteStream: The playback byte stream.
- protected boolean fCDATASections: CDATA sections.
- protected HTMLScanner.Scanner fContentScanner: Content scanner.
- protected HTMLScanner.CurrentEntity fCurrentEntity: Current entity.
- protected java.util.Stack<HTMLScanner.CurrentEntity> fCurrentEntityStack: The current entity stack.
- protected java.lang.String fDefaultIANAEncoding: Default encoding.
- protected java.lang.String fDoctypePubid: Doctype declaration public identifier.
- protected java.lang.String fDoctypeSysid: Doctype declaration system identifier.
- protected org.apache.xerces.xni.XMLDocumentHandler fDocumentHandler: The document handler.
- protected int fElementCount: Element count.
- protected int fElementDepth: Element depth.
- protected int fEndCharacterOffset: Ending character offset in the file.
- protected int fEndColumnNumber: Ending column number.
- protected int fEndLineNumber: Ending line number.
- protected HTMLErrorReporter fErrorReporter: Error reporter.
- protected boolean fFixWindowsCharRefs: Fix Microsoft Windows® character entity references.
- protected java.lang.String fIANAEncoding: Auto-detected IANA encoding.
- protected boolean fIgnoreSpecifiedCharset: Ignore specified character set.
- protected boolean fInsertDoctype: Insert document type declaration.
- protected boolean fIso8859Encoding: True if the encoding matches "ISO-8859-*".
- static java.lang.String FIX_MSWINDOWS_REFS: Fix Microsoft Windows® character entity references.
- protected java.lang.String fJavaEncoding: Auto-detected Java encoding.
- protected short fNamesAttrs: Modify HTML attribute names.
- protected short fNamesElems: Modify HTML element names.
- protected boolean fNormalizeAttributes: Normalize attribute values.
- protected boolean fNotifyCharRefs: Notify character entity references.
- protected boolean fNotifyHtmlBuiltinRefs: Notify HTML built-in general entity references.
- protected boolean fNotifyXmlBuiltinRefs: Notify XML built-in general entity references.
- protected boolean fOverrideDoctype: Override doctype declaration public and system identifiers.
- protected boolean fParseNoFramesContent: Parse noframes content.
- protected boolean fParseNoScriptContent: Parse noscript content.
- protected boolean fReportErrors: Report errors.
- protected HTMLScanner.Scanner fScanner: The current scanner.
- protected short fScannerState: The current scanner state.
- protected boolean fScriptStripCDATADelims: Strip CDATA delimiters from SCRIPT tags.
- protected boolean fScriptStripCommentDelims: Strip comment delimiters from SCRIPT tags.
- protected HTMLScanner.SpecialScanner fSpecialScanner: Special scanner used for elements whose content needs to be scanned as plain text, ignoring markup such as elements and entity references.
- protected org.apache.xerces.util.XMLStringBuffer fStringBuffer: String buffer.
- protected boolean fStyleStripCDATADelims: Strip CDATA delimiters from STYLE tags.
- protected boolean fStyleStripCommentDelims: Strip comment delimiters from STYLE tags.
- static java.lang.String HTML_4_01_FRAMESET_PUBID: HTML 4.01 frameset public identifier ("-//W3C//DTD HTML 4.01 Frameset//EN").
- static java.lang.String HTML_4_01_FRAMESET_SYSID: HTML 4.01 frameset system identifier ("http://www.w3.org/TR/html4/frameset.dtd").
- static java.lang.String HTML_4_01_STRICT_PUBID: HTML 4.01 strict public identifier ("-//W3C//DTD HTML 4.01//EN").
- static java.lang.String HTML_4_01_STRICT_SYSID: HTML 4.01 strict system identifier ("http://www.w3.org/TR/html4/strict.dtd").
- static java.lang.String HTML_4_01_TRANSITIONAL_PUBID: HTML 4.01 transitional public identifier ("-//W3C//DTD HTML 4.01 Transitional//EN").
- static java.lang.String HTML_4_01_TRANSITIONAL_SYSID: HTML 4.01 transitional system identifier ("http://www.w3.org/TR/html4/loose.dtd").
- static java.lang.String IGNORE_SPECIFIED_CHARSET: Ignore specified charset found in the <meta http-equiv='Content-Type' content='text/html;charset=…'> tag or in the <?xml … encoding='…'?> processing instruction.
- static java.lang.String INSERT_DOCTYPE: Insert document type declaration.
- protected static java.lang.String NAMES_ATTRS: Modify HTML attribute names: { "upper", "lower", "default" }.
- protected static java.lang.String NAMES_ELEMS: Modify HTML element names: { "upper", "lower", "default" }.
- protected static short NAMES_LOWERCASE: Lowercase HTML names.
- protected static short NAMES_NO_CHANGE: Don't modify HTML names.
- protected static short NAMES_UPPERCASE: Uppercase HTML names.
- protected static java.lang.String NORMALIZE_ATTRIBUTES: Normalize attribute values.
- static java.lang.String NOTIFY_CHAR_REFS: Notify character entity references.
- static java.lang.String NOTIFY_HTML_BUILTIN_REFS: Notify handler of built-in entity references.
- static java.lang.String NOTIFY_XML_BUILTIN_REFS: Notify handler of built-in entity references.
- static java.lang.String OVERRIDE_DOCTYPE: Override doctype declaration public and system identifiers.
- static java.lang.String PARSE_NOSCRIPT_CONTENT: Parse <noscript>...</noscript> content.
- protected static java.lang.String REPORT_ERRORS: Report errors.
- static java.lang.String SCRIPT_STRIP_CDATA_DELIMS: Strip XHTML CDATA delimiters ("<![CDATA[" and "]]>") from SCRIPT tag contents.
- static java.lang.String SCRIPT_STRIP_COMMENT_DELIMS: Strip HTML comment delimiters ("<!--" and "-->") from SCRIPT tag contents.
- protected static short STATE_CONTENT: State: content.
- protected static short STATE_END_DOCUMENT: State: end document.
- protected static short STATE_MARKUP_BRACKET: State: markup bracket.
- protected static short STATE_START_DOCUMENT: State: start document.
- static java.lang.String STYLE_STRIP_CDATA_DELIMS: Strip XHTML CDATA delimiters ("<![CDATA[" and "]]>") from STYLE tag contents.
- static java.lang.String STYLE_STRIP_COMMENT_DELIMS: Strip HTML comment delimiters ("<!--" and "-->") from STYLE tag contents.
- protected static HTMLEventInfo SYNTHESIZED_ITEM: Synthesized event info item.
-
Constructor Summary
Constructors
- HTMLScanner()
-
Method Summary
Methods (modifier and type, method: description)
- protected static boolean builtinXmlRef(java.lang.String name): Returns true if the name is a built-in XML general entity reference.
- void cleanup(boolean closeall): Cleans up used resources.
- void evaluateInputSource(org.apache.xerces.xni.parser.XMLInputSource inputSource): Immediately evaluates an input source and adds the new content (e.g. the output written by an embedded script).
- static java.lang.String expandSystemId(java.lang.String systemId, java.lang.String baseSystemId): Expands a system id and returns the system id as a URI, if it can be expanded.
- protected static java.lang.String fixURI(java.lang.String str): Fixes a platform dependent filename to standard URI form.
- protected int fixWindowsCharacter(int origChar): Fixes Microsoft Windows® specific characters.
- java.lang.String getBaseSystemId(): Returns the base system identifier.
- int getCharacterOffset(): Returns the character offset.
- int getColumnNumber(): Returns the current column number.
- org.apache.xerces.xni.XMLDocumentHandler getDocumentHandler(): Returns the document handler.
- java.lang.String getEncoding(): Returns the encoding.
- java.lang.String getExpandedSystemId(): Returns the expanded system identifier.
- java.lang.Boolean getFeatureDefault(java.lang.String featureId): Returns the default state for a feature.
- int getLineNumber(): Returns the current line number.
- java.lang.String getLiteralSystemId(): Returns the literal system identifier.
- protected static short getNamesValue(java.lang.String value): Converts HTML names string value to constant value.
- java.lang.Object getPropertyDefault(java.lang.String propertyId): Returns the default state for a property.
- java.lang.String getPublicId(): Returns the public identifier.
- java.lang.String[] getRecognizedFeatures(): Returns recognized features.
- java.lang.String[] getRecognizedProperties(): Returns recognized properties.
- protected static java.lang.String getValue(org.apache.xerces.xni.XMLAttributes attrs, java.lang.String aname): Returns the value of the specified attribute, ignoring case.
- java.lang.String getXMLVersion(): Returns the XML version.
- protected org.apache.xerces.xni.Augmentations locationAugs(): Returns an augmentations object with a location item added.
- protected static java.lang.String modifyName(java.lang.String name, short mode): Modifies the given name based on the specified mode.
- void pushInputSource(org.apache.xerces.xni.parser.XMLInputSource inputSource): Pushes an input source onto the current entity stack.
- protected int read(): Reads a single character.
- protected int readPreservingBufferContent(): Reads a single character, preserving the old buffer content.
- void reset(org.apache.xerces.xni.parser.XMLComponentManager manager): Resets the component.
- protected org.apache.xerces.xni.XMLResourceIdentifier resourceId(): Returns an empty resource identifier.
- protected void scanDoctype(): Scans a DOCTYPE line.
- boolean scanDocument(boolean complete): Scans the document.
- protected int scanEntityRef(org.apache.xerces.util.XMLStringBuffer str, boolean content): Scans an entity reference.
- protected java.lang.String scanLiteral(): Scans a quoted literal.
- protected java.lang.String scanName(boolean strict): Scans a name.
- void setDocumentHandler(org.apache.xerces.xni.XMLDocumentHandler handler): Sets the document handler.
- void setFeature(java.lang.String featureId, boolean state): Sets a feature.
- void setInputSource(org.apache.xerces.xni.parser.XMLInputSource source): Sets the input source.
- void setProperty(java.lang.String propertyId, java.lang.Object value): Sets a property.
- protected void setScanner(HTMLScanner.Scanner scanner): Sets the scanner.
- protected void setScannerState(short state): Sets the scanner state.
- protected boolean skip(java.lang.String s, boolean caseSensitive): Returns true if the specified text is present and is skipped.
- protected boolean skipMarkup(boolean balance): Skips markup.
- protected int skipNewlines(): Skips newlines and returns the number of newlines skipped.
- protected boolean skipSpaces(): Skips whitespace.
- protected org.apache.xerces.xni.Augmentations synthesizedAugs(): Returns an augmentations object with a synthesized item added.
-
-
-
Field Detail
-
HTML_4_01_STRICT_PUBID
public static final java.lang.String HTML_4_01_STRICT_PUBID
HTML 4.01 strict public identifier ("-//W3C//DTD HTML 4.01//EN").
- See Also:
- Constant Field Values
-
HTML_4_01_STRICT_SYSID
public static final java.lang.String HTML_4_01_STRICT_SYSID
HTML 4.01 strict system identifier ("http://www.w3.org/TR/html4/strict.dtd").
- See Also:
- Constant Field Values
-
HTML_4_01_TRANSITIONAL_PUBID
public static final java.lang.String HTML_4_01_TRANSITIONAL_PUBID
HTML 4.01 transitional public identifier ("-//W3C//DTD HTML 4.01 Transitional//EN").
- See Also:
- Constant Field Values
-
HTML_4_01_TRANSITIONAL_SYSID
public static final java.lang.String HTML_4_01_TRANSITIONAL_SYSID
HTML 4.01 transitional system identifier ("http://www.w3.org/TR/html4/loose.dtd").
- See Also:
- Constant Field Values
-
HTML_4_01_FRAMESET_PUBID
public static final java.lang.String HTML_4_01_FRAMESET_PUBID
HTML 4.01 frameset public identifier ("-//W3C//DTD HTML 4.01 Frameset//EN").
- See Also:
- Constant Field Values
-
HTML_4_01_FRAMESET_SYSID
public static final java.lang.String HTML_4_01_FRAMESET_SYSID
HTML 4.01 frameset system identifier ("http://www.w3.org/TR/html4/frameset.dtd").
- See Also:
- Constant Field Values
-
AUGMENTATIONS
protected static final java.lang.String AUGMENTATIONS
Include infoset augmentations.
- See Also:
- Constant Field Values
-
REPORT_ERRORS
protected static final java.lang.String REPORT_ERRORS
Report errors.
- See Also:
- Constant Field Values
-
NOTIFY_CHAR_REFS
public static final java.lang.String NOTIFY_CHAR_REFS
Notify character entity references (e.g. &#32;, &#x20;, etc).
- See Also:
- Constant Field Values
-
NOTIFY_XML_BUILTIN_REFS
public static final java.lang.String NOTIFY_XML_BUILTIN_REFS
Notify handler of built-in entity references (e.g. &amp;, &lt;, etc).
Note: This only applies to the five pre-defined XML general entities: "amp", "lt", "gt", "quot", and "apos". This is done for compatibility with the Xerces feature.
To be notified of the built-in entity references in HTML, set the http://cyberneko.org/html/features/scanner/notify-builtin-refs feature to true.
- See Also:
- Constant Field Values
-
NOTIFY_HTML_BUILTIN_REFS
public static final java.lang.String NOTIFY_HTML_BUILTIN_REFS
Notify handler of built-in entity references (e.g. &nbsp;, &copy;, etc).
Note: This includes the five pre-defined XML general entities.
- See Also:
- Constant Field Values
-
FIX_MSWINDOWS_REFS
public static final java.lang.String FIX_MSWINDOWS_REFS
Fix Microsoft Windows® character entity references.
- See Also:
- Constant Field Values
-
SCRIPT_STRIP_COMMENT_DELIMS
public static final java.lang.String SCRIPT_STRIP_COMMENT_DELIMS
Strip HTML comment delimiters ("<!--" and "-->") from SCRIPT tag contents.
- See Also:
- Constant Field Values
-
SCRIPT_STRIP_CDATA_DELIMS
public static final java.lang.String SCRIPT_STRIP_CDATA_DELIMS
Strip XHTML CDATA delimiters ("<![CDATA[" and "]]>") from SCRIPT tag contents.
- See Also:
- Constant Field Values
-
STYLE_STRIP_COMMENT_DELIMS
public static final java.lang.String STYLE_STRIP_COMMENT_DELIMS
Strip HTML comment delimiters ("<!--" and "-->") from STYLE tag contents.
- See Also:
- Constant Field Values
-
STYLE_STRIP_CDATA_DELIMS
public static final java.lang.String STYLE_STRIP_CDATA_DELIMS
Strip XHTML CDATA delimiters ("<![CDATA[" and "]]>") from STYLE tag contents.
- See Also:
- Constant Field Values
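The four strip-delimiter features above can be pictured with the following minimal sketch. It assumes the delimiters are only removed when they wrap the whole (trimmed) element content; the class and method names are hypothetical, and the scanner's actual handling may differ in detail.

```java
final class DelimiterStripSketch {
    /** Removes a leading 'open' and trailing 'close' if both wrap the trimmed content. */
    static String strip(String content, String open, String close) {
        String trimmed = content.trim();
        boolean wrapped = trimmed.length() >= open.length() + close.length()
                && trimmed.startsWith(open)
                && trimmed.endsWith(close);
        if (!wrapped) {
            return content; // nothing to strip, leave the content untouched
        }
        return trimmed.substring(open.length(), trimmed.length() - close.length());
    }

    /** Corresponds to the strip-comment-delims features for SCRIPT and STYLE. */
    static String stripCommentDelims(String content) {
        return strip(content, "<!--", "-->");
    }

    /** Corresponds to the strip-cdata-delims features for SCRIPT and STYLE. */
    static String stripCdataDelims(String content) {
        return strip(content, "<![CDATA[", "]]>");
    }

    public static void main(String[] args) {
        System.out.println(stripCommentDelims("<!--\nalert(1);\n-->"));
        System.out.println(stripCdataDelims("<![CDATA[body { color: red; }]]>"));
    }
}
```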
-
IGNORE_SPECIFIED_CHARSET
public static final java.lang.String IGNORE_SPECIFIED_CHARSET
Ignore specified charset found in the <meta http-equiv='Content-Type' content='text/html;charset=…'> tag or in the <?xml … encoding='…'?> processing instruction.
- See Also:
- Constant Field Values
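To make the effect of this feature concrete, here is a small sketch of choosing between a charset declared in a Content-Type value and an auto-detected fallback. The helper names and the decision rule are assumptions for illustration; the scanner's own encoding detection is not described on this page.

```java
import java.util.Locale;

final class CharsetParamSketch {
    /** Returns the charset parameter of a Content-Type value, or null if there is none. */
    static String charsetOf(String contentType) {
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    /** Picks the declared charset unless it is absent or the ignore flag is set. */
    static String effectiveEncoding(String detected, String contentType, boolean ignoreSpecifiedCharset) {
        String specified = ignoreSpecifiedCharset ? null : charsetOf(contentType);
        return specified != null ? specified : detected;
    }

    public static void main(String[] args) {
        String meta = "text/html;charset=UTF-8";
        System.out.println(effectiveEncoding("windows-1252", meta, false));
        System.out.println(effectiveEncoding("windows-1252", meta, true));
    }
}
```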
-
CDATA_SECTIONS
public static final java.lang.String CDATA_SECTIONS
Scan CDATA sections.
- See Also:
- Constant Field Values
-
OVERRIDE_DOCTYPE
public static final java.lang.String OVERRIDE_DOCTYPE
Override doctype declaration public and system identifiers.
- See Also:
- Constant Field Values
-
INSERT_DOCTYPE
public static final java.lang.String INSERT_DOCTYPE
Insert document type declaration.
- See Also:
- Constant Field Values
-
PARSE_NOSCRIPT_CONTENT
public static final java.lang.String PARSE_NOSCRIPT_CONTENT
Parse <noscript>...</noscript> content.
- See Also:
- Constant Field Values
-
ALLOW_SELFCLOSING_IFRAME
public static final java.lang.String ALLOW_SELFCLOSING_IFRAME
Allows a self-closing <iframe/> tag.
- See Also:
- Constant Field Values
-
ALLOW_SELFCLOSING_TAGS
public static final java.lang.String ALLOW_SELFCLOSING_TAGS
Allows self-closing tags, e.g. <div/> (XHTML).
- See Also:
- Constant Field Values
-
NORMALIZE_ATTRIBUTES
protected static final java.lang.String NORMALIZE_ATTRIBUTES
Normalize attribute values.
- See Also:
- Constant Field Values
-
NAMES_ELEMS
protected static final java.lang.String NAMES_ELEMS
Modify HTML element names: { "upper", "lower", "default" }.
- See Also:
- Constant Field Values
-
NAMES_ATTRS
protected static final java.lang.String NAMES_ATTRS
Modify HTML attribute names: { "upper", "lower", "default" }.
- See Also:
- Constant Field Values
-
DEFAULT_ENCODING
protected static final java.lang.String DEFAULT_ENCODING
Default encoding.
- See Also:
- Constant Field Values
-
ERROR_REPORTER
protected static final java.lang.String ERROR_REPORTER
Error reporter.
- See Also:
- Constant Field Values
-
DOCTYPE_PUBID
protected static final java.lang.String DOCTYPE_PUBID
Doctype declaration public identifier.
- See Also:
- Constant Field Values
-
DOCTYPE_SYSID
protected static final java.lang.String DOCTYPE_SYSID
Doctype declaration system identifier.
- See Also:
- Constant Field Values
-
STATE_CONTENT
protected static final short STATE_CONTENT
State: content.
- See Also:
- Constant Field Values
-
STATE_MARKUP_BRACKET
protected static final short STATE_MARKUP_BRACKET
State: markup bracket.
- See Also:
- Constant Field Values
-
STATE_START_DOCUMENT
protected static final short STATE_START_DOCUMENT
State: start document.
- See Also:
- Constant Field Values
-
STATE_END_DOCUMENT
protected static final short STATE_END_DOCUMENT
State: end document.
- See Also:
- Constant Field Values
-
NAMES_NO_CHANGE
protected static final short NAMES_NO_CHANGE
Don't modify HTML names.
- See Also:
- Constant Field Values
-
NAMES_UPPERCASE
protected static final short NAMES_UPPERCASE
Uppercase HTML names.
- See Also:
- Constant Field Values
-
NAMES_LOWERCASE
protected static final short NAMES_LOWERCASE
Lowercase HTML names.
- See Also:
- Constant Field Values
-
DEFAULT_BUFFER_SIZE
protected static final int DEFAULT_BUFFER_SIZE
Default buffer size.
- See Also:
- Constant Field Values
-
SYNTHESIZED_ITEM
protected static final HTMLEventInfo SYNTHESIZED_ITEM
Synthesized event info item.
-
fAugmentations
protected boolean fAugmentations
Augmentations.
-
fReportErrors
protected boolean fReportErrors
Report errors.
-
fNotifyCharRefs
protected boolean fNotifyCharRefs
Notify character entity references.
-
fNotifyXmlBuiltinRefs
protected boolean fNotifyXmlBuiltinRefs
Notify XML built-in general entity references.
-
fNotifyHtmlBuiltinRefs
protected boolean fNotifyHtmlBuiltinRefs
Notify HTML built-in general entity references.
-
fFixWindowsCharRefs
protected boolean fFixWindowsCharRefs
Fix Microsoft Windows® character entity references.
-
fScriptStripCDATADelims
protected boolean fScriptStripCDATADelims
Strip CDATA delimiters from SCRIPT tags.
-
fScriptStripCommentDelims
protected boolean fScriptStripCommentDelims
Strip comment delimiters from SCRIPT tags.
-
fStyleStripCDATADelims
protected boolean fStyleStripCDATADelims
Strip CDATA delimiters from STYLE tags.
-
fStyleStripCommentDelims
protected boolean fStyleStripCommentDelims
Strip comment delimiters from STYLE tags.
-
fIgnoreSpecifiedCharset
protected boolean fIgnoreSpecifiedCharset
Ignore specified character set.
-
fCDATASections
protected boolean fCDATASections
CDATA sections.
-
fOverrideDoctype
protected boolean fOverrideDoctype
Override doctype declaration public and system identifiers.
-
fInsertDoctype
protected boolean fInsertDoctype
Insert document type declaration.
-
fNormalizeAttributes
protected boolean fNormalizeAttributes
Normalize attribute values.
-
fParseNoScriptContent
protected boolean fParseNoScriptContent
Parse noscript content.
-
fParseNoFramesContent
protected boolean fParseNoFramesContent
Parse noframes content.
-
fAllowSelfclosingIframe
protected boolean fAllowSelfclosingIframe
Allows self closing iframe tags.
-
fAllowSelfclosingTags
protected boolean fAllowSelfclosingTags
Allows self closing tags.
-
fNamesElems
protected short fNamesElems
Modify HTML element names.
-
fNamesAttrs
protected short fNamesAttrs
Modify HTML attribute names.
-
fDefaultIANAEncoding
protected java.lang.String fDefaultIANAEncoding
Default encoding.
-
fErrorReporter
protected HTMLErrorReporter fErrorReporter
Error reporter.
-
fDoctypePubid
protected java.lang.String fDoctypePubid
Doctype declaration public identifier.
-
fDoctypeSysid
protected java.lang.String fDoctypeSysid
Doctype declaration system identifier.
-
fBeginLineNumber
protected int fBeginLineNumber
Beginning line number.
-
fBeginColumnNumber
protected int fBeginColumnNumber
Beginning column number.
-
fBeginCharacterOffset
protected int fBeginCharacterOffset
Beginning character offset in the file.
-
fEndLineNumber
protected int fEndLineNumber
Ending line number.
-
fEndColumnNumber
protected int fEndColumnNumber
Ending column number.
-
fEndCharacterOffset
protected int fEndCharacterOffset
Ending character offset in the file.
-
fByteStream
protected HTMLScanner.PlaybackInputStream fByteStream
The playback byte stream.
-
fCurrentEntity
protected HTMLScanner.CurrentEntity fCurrentEntity
Current entity.
-
fCurrentEntityStack
protected final java.util.Stack<HTMLScanner.CurrentEntity> fCurrentEntityStack
The current entity stack.
-
fScanner
protected HTMLScanner.Scanner fScanner
The current scanner.
-
fScannerState
protected short fScannerState
The current scanner state.
-
fDocumentHandler
protected org.apache.xerces.xni.XMLDocumentHandler fDocumentHandler
The document handler.
-
fIANAEncoding
protected java.lang.String fIANAEncoding
Auto-detected IANA encoding.
-
fJavaEncoding
protected java.lang.String fJavaEncoding
Auto-detected Java encoding.
-
fIso8859Encoding
protected boolean fIso8859Encoding
True if the encoding matches "ISO-8859-*".
-
fElementCount
protected int fElementCount
Element count.
-
fElementDepth
protected int fElementDepth
Element depth.
-
fContentScanner
protected HTMLScanner.Scanner fContentScanner
Content scanner.
-
fSpecialScanner
protected HTMLScanner.SpecialScanner fSpecialScanner
Special scanner used for elements whose content needs to be scanned as plain text, ignoring markup such as elements and entity references. For example: <SCRIPT> and <COMMENT>.
-
fStringBuffer
protected final org.apache.xerces.util.XMLStringBuffer fStringBuffer
String buffer.
-
-
Method Detail
-
pushInputSource
public void pushInputSource(org.apache.xerces.xni.parser.XMLInputSource inputSource)
Pushes an input source onto the current entity stack. This enables the scanner to transparently scan new content (e.g. the output written by an embedded script). At the end of the current entity, the scanner returns to where it left off at the time this entity source was pushed.
Note: This functionality is experimental at this time and is subject to change in future releases of NekoHTML.
- Parameters:
inputSource - The new input source to start scanning.
- See Also:
evaluateInputSource(XMLInputSource)
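The push-and-resume behavior described for pushInputSource can be illustrated with a toy character reader. This is only a sketch under simplifying assumptions: entities are plain strings rather than XMLInputSource objects, and the class and member names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class EntityStackSketch {
    /** One input source together with its own read position. */
    private static final class Entity {
        final String text;
        int pos;

        Entity(String text) {
            this.text = text;
        }
    }

    private final Deque<Entity> suspended = new ArrayDeque<>();
    private Entity current;

    EntityStackSketch(String document) {
        current = new Entity(document);
    }

    /** Suspends the current entity and starts reading the new one. */
    void push(String text) {
        suspended.push(current);
        current = new Entity(text);
    }

    /** Reads one character, or returns -1 once every entity is exhausted. */
    int read() {
        while (current.pos >= current.text.length()) {
            if (suspended.isEmpty()) {
                return -1;
            }
            current = suspended.pop(); // resume where the outer entity left off
        }
        return current.text.charAt(current.pos++);
    }

    public static void main(String[] args) {
        EntityStackSketch scanner = new EntityStackSketch("ab");
        StringBuilder out = new StringBuilder();
        out.append((char) scanner.read()); // reads 'a' from the document
        scanner.push("XY");                // e.g. content written by a script
        int c;
        while ((c = scanner.read()) != -1) {
            out.append((char) c);
        }
        System.out.println(out);
    }
}
```

Pushed content is consumed first, after which reading continues in the original document at the position where the push happened.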
-
evaluateInputSource
public void evaluateInputSource(org.apache.xerces.xni.parser.XMLInputSource inputSource)
Immediately evaluates an input source and adds the new content (e.g. the output written by an embedded script).
- Parameters:
inputSource - The new input source to start evaluating.
- See Also:
pushInputSource(XMLInputSource)
-
cleanup
public void cleanup(boolean closeall)
Cleans up used resources. For example, if scanning is terminated early, then this method ensures all remaining open streams are closed.
- Parameters:
closeall - If true, close all streams, including the original. Pass false when the application opened the original document stream and should remain responsible for closing it.
-
getEncoding
public java.lang.String getEncoding()
Returns the encoding.
- Specified by:
getEncoding in interface org.apache.xerces.xni.XMLLocator
-
getPublicId
public java.lang.String getPublicId()
Returns the public identifier.
- Specified by:
getPublicId in interface org.apache.xerces.xni.XMLLocator
-
getBaseSystemId
public java.lang.String getBaseSystemId()
Returns the base system identifier.
- Specified by:
getBaseSystemId in interface org.apache.xerces.xni.XMLLocator
-
getLiteralSystemId
public java.lang.String getLiteralSystemId()
Returns the literal system identifier.
- Specified by:
getLiteralSystemId in interface org.apache.xerces.xni.XMLLocator
-
getExpandedSystemId
public java.lang.String getExpandedSystemId()
Returns the expanded system identifier.
- Specified by:
getExpandedSystemId in interface org.apache.xerces.xni.XMLLocator
-
getLineNumber
public int getLineNumber()
Returns the current line number.
- Specified by:
getLineNumber in interface org.apache.xerces.xni.XMLLocator
-
getColumnNumber
public int getColumnNumber()
Returns the current column number.
- Specified by:
getColumnNumber in interface org.apache.xerces.xni.XMLLocator
-
getXMLVersion
public java.lang.String getXMLVersion()
Returns the XML version.
- Specified by:
getXMLVersion in interface org.apache.xerces.xni.XMLLocator
-
getCharacterOffset
public int getCharacterOffset()
Returns the character offset.
- Specified by:
getCharacterOffset in interface org.apache.xerces.xni.XMLLocator
-
getFeatureDefault
public java.lang.Boolean getFeatureDefault(java.lang.String featureId)
Returns the default state for a feature.
- Specified by:
getFeatureDefault in interface HTMLComponent
- Specified by:
getFeatureDefault in interface org.apache.xerces.xni.parser.XMLComponent
-
getPropertyDefault
public java.lang.Object getPropertyDefault(java.lang.String propertyId)
Returns the default state for a property.
- Specified by:
getPropertyDefault in interface HTMLComponent
- Specified by:
getPropertyDefault in interface org.apache.xerces.xni.parser.XMLComponent
-
getRecognizedFeatures
public java.lang.String[] getRecognizedFeatures()
Returns recognized features.
- Specified by:
getRecognizedFeatures in interface org.apache.xerces.xni.parser.XMLComponent
-
getRecognizedProperties
public java.lang.String[] getRecognizedProperties()
Returns recognized properties.
- Specified by:
getRecognizedProperties in interface org.apache.xerces.xni.parser.XMLComponent
-
reset
public void reset(org.apache.xerces.xni.parser.XMLComponentManager manager)
Resets the component.
- Specified by:
reset in interface org.apache.xerces.xni.parser.XMLComponent
-
setFeature
public void setFeature(java.lang.String featureId, boolean state)
Sets a feature.
- Specified by:
setFeature in interface org.apache.xerces.xni.parser.XMLComponent
-
setProperty
public void setProperty(java.lang.String propertyId, java.lang.Object value)
Sets a property.
- Specified by:
setProperty in interface org.apache.xerces.xni.parser.XMLComponent
-
setInputSource
public void setInputSource(org.apache.xerces.xni.parser.XMLInputSource source) throws java.io.IOException
Sets the input source.
- Specified by:
setInputSource in interface org.apache.xerces.xni.parser.XMLDocumentScanner
- Throws:
java.io.IOException
-
scanDocument
public boolean scanDocument(boolean complete) throws java.io.IOException
Scans the document.
- Specified by:
scanDocument in interface org.apache.xerces.xni.parser.XMLDocumentScanner
- Throws:
java.io.IOException
-
setDocumentHandler
public void setDocumentHandler(org.apache.xerces.xni.XMLDocumentHandler handler)
Sets the document handler.
- Specified by:
setDocumentHandler in interface org.apache.xerces.xni.parser.XMLDocumentSource
-
getDocumentHandler
public org.apache.xerces.xni.XMLDocumentHandler getDocumentHandler()
Returns the document handler.
- Specified by:
getDocumentHandler in interface org.apache.xerces.xni.parser.XMLDocumentSource
-
getValue
protected static java.lang.String getValue(org.apache.xerces.xni.XMLAttributes attrs, java.lang.String aname)
Returns the value of the specified attribute, ignoring case.
-
expandSystemId
public static java.lang.String expandSystemId(java.lang.String systemId, java.lang.String baseSystemId)
Expands a system id and returns the system id as a URI, if it can be expanded. A return value of null means that the identifier is already expanded. An exception thrown indicates a failure to expand the id.
- Parameters:
systemId - The systemId to be expanded.
- Returns:
- Returns the URI string representing the expanded system identifier. A null value indicates that the given system identifier is already expanded.
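The idea of expanding a system identifier against a base can be sketched with java.net.URI. This is not the library's algorithm: the hypothetical `expand` helper below returns the input unchanged when it is already absolute, whereas the documented method signals that case with a null return value.

```java
import java.net.URI;

final class SystemIdSketch {
    /** Resolves a relative system identifier against a base identifier. */
    static String expand(String systemId, String baseSystemId) {
        URI id = URI.create(systemId);
        if (id.isAbsolute() || baseSystemId == null) {
            return systemId; // already expanded, or nothing to resolve against
        }
        return URI.create(baseSystemId).resolve(id).toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("img/logo.png", "http://example.com/docs/index.html"));
        System.out.println(expand("http://example.com/a.html", null));
    }
}
```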
-
fixURI
protected static java.lang.String fixURI(java.lang.String str)
Fixes a platform dependent filename to standard URI form.
- Parameters:
str - The string to fix.
- Returns:
- Returns the fixed URI string.
-
modifyName
protected static final java.lang.String modifyName(java.lang.String name, short mode)
Modifies the given name based on the specified mode.
-
getNamesValue
protected static final short getNamesValue(java.lang.String value)
Converts HTML names string value to constant value.
- See Also:
NAMES_NO_CHANGE, NAMES_LOWERCASE, NAMES_UPPERCASE
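How getNamesValue and modifyName fit together can be sketched as follows. The numeric constant values are hypothetical (this page does not show the real NAMES_* values), and treating every string other than "upper" and "lower" as "no change" is an assumption.

```java
import java.util.Locale;

final class NamesModeSketch {
    // Hypothetical values standing in for NAMES_NO_CHANGE, NAMES_UPPERCASE, NAMES_LOWERCASE.
    static final short NO_CHANGE = 0;
    static final short UPPERCASE = 1;
    static final short LOWERCASE = 2;

    /** Maps a names property value ("upper", "lower", "default") to a mode constant. */
    static short namesValue(String value) {
        if ("upper".equals(value)) {
            return UPPERCASE;
        }
        if ("lower".equals(value)) {
            return LOWERCASE;
        }
        return NO_CHANGE; // assumed fallback for "default" and anything else
    }

    /** Applies the mode to an element or attribute name. */
    static String modifyName(String name, short mode) {
        switch (mode) {
            case UPPERCASE:
                return name.toUpperCase(Locale.ROOT);
            case LOWERCASE:
                return name.toLowerCase(Locale.ROOT);
            default:
                return name;
        }
    }

    public static void main(String[] args) {
        System.out.println(modifyName("Body", namesValue("upper")));
        System.out.println(modifyName("Body", namesValue("lower")));
        System.out.println(modifyName("Body", namesValue("default")));
    }
}
```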
-
fixWindowsCharacter
protected int fixWindowsCharacter(int origChar)
Fixes Microsoft Windows® specific characters.
Details about this common problem can be found at http://www.cs.tut.fi/~jkorpela/www/windows-chars.html
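The underlying problem is that documents often contain character references in the range 0x80 to 0x9F that were meant as Windows-1252 characters. A minimal sketch of such a fix-up follows; it covers only a handful of code points and is not the table used by fixWindowsCharacter itself.

```java
final class WindowsCharSketch {
    /** Maps a few Windows-1252 code points to their Unicode equivalents. */
    static int fix(int origChar) {
        switch (origChar) {
            case 0x80: return 0x20AC; // euro sign
            case 0x85: return 0x2026; // horizontal ellipsis
            case 0x91: return 0x2018; // left single quotation mark
            case 0x92: return 0x2019; // right single quotation mark
            case 0x93: return 0x201C; // left double quotation mark
            case 0x94: return 0x201D; // right double quotation mark
            case 0x96: return 0x2013; // en dash
            case 0x97: return 0x2014; // em dash
            case 0x99: return 0x2122; // trade mark sign
            default:   return origChar; // everything else passes through unchanged
        }
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(fix(0x93)));
        System.out.println((char) fix('A'));
    }
}
```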
-
read
protected int read() throws java.io.IOException
Reads a single character.
- Throws:
java.io.IOException
-
setScanner
protected void setScanner(HTMLScanner.Scanner scanner)
Sets the scanner.
-
setScannerState
protected void setScannerState(short state)
Sets the scanner state.
-
scanDoctype
protected void scanDoctype() throws java.io.IOException
Scans a DOCTYPE line.
- Throws:
java.io.IOException
-
scanLiteral
protected java.lang.String scanLiteral() throws java.io.IOException
Scans a quoted literal.
- Throws:
java.io.IOException
-
scanName
protected java.lang.String scanName(boolean strict) throws java.io.IOException
Scans a name.
- Throws:
java.io.IOException
-
scanEntityRef
protected int scanEntityRef(org.apache.xerces.util.XMLStringBuffer str, boolean content) throws java.io.IOException
Scans an entity reference.
- Throws:
java.io.IOException
-
skip
protected boolean skip(java.lang.String s, boolean caseSensitive) throws java.io.IOException
Returns true if the specified text is present and is skipped.
- Throws:
java.io.IOException
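The contract of skip (advance only when the expected text is present) can be sketched over an in-memory string. The `SkipSketch` class is hypothetical; the real method works on the scanner's character buffer and may throw IOException.

```java
final class SkipSketch {
    private final String input;
    private int pos;

    SkipSketch(String input) {
        this.input = input;
    }

    int position() {
        return pos;
    }

    /** Consumes s at the current position if present; otherwise leaves the position alone. */
    boolean skip(String s, boolean caseSensitive) {
        // regionMatches takes an ignoreCase flag, hence the negation.
        if (input.regionMatches(!caseSensitive, pos, s, 0, s.length())) {
            pos += s.length();
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        SkipSketch scanner = new SkipSketch("<!doctype html>");
        System.out.println(scanner.skip("<!DOCTYPE", true));
        System.out.println(scanner.skip("<!DOCTYPE", false));
        System.out.println(scanner.position());
    }
}
```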
-
skipMarkup
protected boolean skipMarkup(boolean balance) throws java.io.IOException
Skips markup.
- Throws:
java.io.IOException
-
skipSpaces
protected boolean skipSpaces() throws java.io.IOException
Skips whitespace.
- Throws:
java.io.IOException
-
skipNewlines
protected int skipNewlines() throws java.io.IOException
Skips newlines and returns the number of newlines skipped.
- Throws:
java.io.IOException
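A newline-counting skip might look like the sketch below. It assumes that "\r\n" counts as a single newline and that a lone "\r" or "\n" counts as one each; this page does not spell out the scanner's exact rules, and the class name is hypothetical.

```java
final class NewlineSkipSketch {
    private final String input;
    private int pos;

    NewlineSkipSketch(String input) {
        this.input = input;
    }

    int position() {
        return pos;
    }

    /** Skips consecutive newlines at the current position and returns how many were skipped. */
    int skipNewlines() {
        int count = 0;
        while (pos < input.length()) {
            char c = input.charAt(pos);
            if (c == '\r') {
                pos++;
                if (pos < input.length() && input.charAt(pos) == '\n') {
                    pos++; // treat CR LF as one newline
                }
                count++;
            } else if (c == '\n') {
                pos++;
                count++;
            } else {
                break; // first non-newline character ends the run
            }
        }
        return count;
    }

    public static void main(String[] args) {
        NewlineSkipSketch scanner = new NewlineSkipSketch("\r\n\n\rabc");
        System.out.println(scanner.skipNewlines());
        System.out.println(scanner.position());
    }
}
```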
-
locationAugs
protected final org.apache.xerces.xni.Augmentations locationAugs()
Returns an augmentations object with a location item added.
-
synthesizedAugs
protected final org.apache.xerces.xni.Augmentations synthesizedAugs()
Returns an augmentations object with a synthesized item added.
-
resourceId
protected final org.apache.xerces.xni.XMLResourceIdentifier resourceId()
Returns an empty resource identifier.
-
builtinXmlRef
protected static boolean builtinXmlRef(java.lang.String name)
Returns true if the name is a built-in XML general entity reference.
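Based on the five pre-defined XML general entities named in the NOTIFY_XML_BUILTIN_REFS description ("amp", "lt", "gt", "quot", "apos"), such a check can be sketched as a simple set lookup. The case-sensitive comparison is an assumption, and the class name is hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class XmlBuiltinRefSketch {
    /** The five pre-defined XML general entity names. */
    private static final Set<String> XML_BUILTINS =
            new HashSet<>(Arrays.asList("amp", "lt", "gt", "quot", "apos"));

    /** Returns true if the name is one of the XML built-in entity names. */
    static boolean builtinXmlRef(String name) {
        return XML_BUILTINS.contains(name);
    }

    public static void main(String[] args) {
        System.out.println(builtinXmlRef("amp"));
        System.out.println(builtinXmlRef("nbsp"));
    }
}
```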
-
readPreservingBufferContent
protected int readPreservingBufferContent() throws java.io.IOException
Reads a single character, preserving the old buffer content.
- Throws:
java.io.IOException
-
-