Class TextExtractor
java.lang.Object
org.xml.sax.helpers.DefaultHandler
TextExtractor
- All Implemented Interfaces:
- org.xml.sax.ContentHandler, org.xml.sax.DTDHandler, org.xml.sax.EntityResolver, org.xml.sax.ErrorHandler
public class TextExtractor
- extends org.xml.sax.helpers.DefaultHandler
This SAX handler makes a single pass over an XML document, reads all the specified "orth" elements,
and prints their content to System.out, separated by whitespace (or by line breaks).
It is mainly used to extract samples of masked or unmasked content from corpus files.
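The single-pass extraction described above can be sketched as follows. This is a minimal illustration, not the TextExtractor source: the class OrthSketch, its extract helper, the 1-based inclusive token range, and the treatment of the fourth constructor argument as a plain element name (rather than a path, as theOrthPath suggests) are all assumptions. The sketch also returns the text instead of printing it to System.out so that the result is easy to inspect.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for illustration only; NOT the real TextExtractor.
public class OrthSketch extends DefaultHandler {
    private final int start;            // first token to emit (assumed 1-based, inclusive)
    private final int end;              // last token to emit (assumed inclusive)
    private final String separator;     // "\n" or " " depending on the line-break flag
    private final String elementName;   // assumed: a plain element name such as "orth"
    private final StringBuilder out = new StringBuilder();
    private final StringBuilder current = new StringBuilder();
    private boolean inside = false;     // true while between <orth> and </orth>
    private int seen = 0;               // number of matching elements closed so far

    public OrthSketch(int start, int end, boolean lineBreak, String elementName) {
        this.start = start;
        this.end = end;
        this.separator = lineBreak ? "\n" : " ";
        this.elementName = elementName;
    }

    @Override
    public void startElement(String uri, String sName, String qName, Attributes attrs) {
        if (qName.equals(elementName)) {
            inside = true;
            current.setLength(0);       // start collecting a fresh token
        }
    }

    @Override
    public void characters(char[] ch, int offset, int length) {
        // SAX may deliver one text node in several chunks, so append rather than assign.
        if (inside) {
            current.append(ch, offset, length);
        }
    }

    @Override
    public void endElement(String uri, String sName, String qName) {
        if (!qName.equals(elementName)) {
            return;
        }
        inside = false;
        seen++;
        if (seen >= start && seen <= end) {
            if (out.length() > 0) {
                out.append(separator);
            }
            out.append(current.toString().trim());
        }
    }

    public String result() {
        return out.toString();
    }

    // Parses the given XML string once and returns the selected tokens.
    public static String extract(String xml, int start, int end, boolean lineBreak) {
        try {
            OrthSketch handler = new OrthSketch(start, end, lineBreak, "orth");
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), handler);
            return handler.result();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<s><tok><orth>Ala</orth></tok><tok><orth>ma</orth></tok>"
                   + "<tok><orth>kota</orth></tok></s>";
        System.out.println(extract(xml, 2, 3, false));
    }
}
```

Buffering in characters and emitting in endElement keeps a token intact even when the parser splits its text across several callbacks.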
- Version:
- 0.1, November 2007
- Author:
- Johannes Dellert
Constructor Summary
TextExtractor(int startNumber, int endNumber, boolean ifLineBreak, java.lang.String theOrthPath)
Method Summary
void characters(char[] ch, int start, int length)
void endElement(java.lang.String namespaceURI, java.lang.String sName, java.lang.String qName)
static void main(java.lang.String[] args)
This program outputs the content at certain positions in an XML document.
Usage: TextExtractor (-l) (--userDefPaths) [xmlFile] ((startToken)) (endToken)
Option -l for a line break after each token
Option --userDefPaths to manually specify the extracted elements
void startDocument()
void startElement(java.lang.String namespaceURI, java.lang.String sName, java.lang.String qName, org.xml.sax.Attributes attrs)
Methods inherited from class org.xml.sax.helpers.DefaultHandler
endDocument, endPrefixMapping, error, fatalError, ignorableWhitespace, notationDecl, processingInstruction, resolveEntity, setDocumentLocator, skippedEntity, startPrefixMapping, unparsedEntityDecl, warning
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
TextExtractor
public TextExtractor(int startNumber,
int endNumber,
boolean ifLineBreak,
java.lang.String theOrthPath)
startDocument
public void startDocument()
- Specified by:
startDocument
in interface org.xml.sax.ContentHandler
- Overrides:
startDocument
in class org.xml.sax.helpers.DefaultHandler
startElement
public void startElement(java.lang.String namespaceURI,
java.lang.String sName,
java.lang.String qName,
org.xml.sax.Attributes attrs)
- Specified by:
startElement
in interface org.xml.sax.ContentHandler
- Overrides:
startElement
in class org.xml.sax.helpers.DefaultHandler
endElement
public void endElement(java.lang.String namespaceURI,
java.lang.String sName,
java.lang.String qName)
- Specified by:
endElement
in interface org.xml.sax.ContentHandler
- Overrides:
endElement
in class org.xml.sax.helpers.DefaultHandler
characters
public void characters(char[] ch,
int start,
int length)
- Specified by:
characters
in interface org.xml.sax.ContentHandler
- Overrides:
characters
in class org.xml.sax.helpers.DefaultHandler
main
public static void main(java.lang.String[] args)
- This program outputs the content at certain positions in an XML document.
Usage: TextExtractor (-l) (--userDefPaths) [xmlFile] ((startToken)) (endToken)
Option -l for a line break after each token
Option --userDefPaths to manually specify the extracted elements