Class TextExtractor

java.lang.Object
  extended by org.xml.sax.helpers.DefaultHandler
      extended by TextExtractor
All Implemented Interfaces:
org.xml.sax.ContentHandler, org.xml.sax.DTDHandler, org.xml.sax.EntityResolver, org.xml.sax.ErrorHandler

public class TextExtractor
extends org.xml.sax.helpers.DefaultHandler

This SAX handler makes a single pass over an XML document, reads all the specified "orth" elements, and writes their content to System.out, separated by whitespace (or by line breaks). It is mainly used to extract samples of masked or unmasked content from corpus files.

Version:
0.1, November 2007
Author:
Johannes Dellert
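
The single-pass extraction described above can be sketched as follows. This is a minimal illustration, not the actual TextExtractor source: the class name OrthSketch, the helper extract, and the choice to match on the plain element name "orth" are assumptions made for the example, and the result is returned as a String rather than written to System.out so that it is easy to check.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: collects the text of every <orth> element in one pass.
public class OrthSketch extends DefaultHandler {
    private final List<String> tokens = new ArrayList<>();
    private final StringBuilder current = new StringBuilder();
    private boolean insideOrth = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if ("orth".equals(qName)) {
            insideOrth = true;
            current.setLength(0); // start a fresh token
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver one text node in several chunks, so append rather than assign.
        if (insideOrth) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("orth".equals(qName)) {
            insideOrth = false;
            tokens.add(current.toString().trim());
        }
    }

    // Parses the given XML string and joins the collected tokens
    // with a space, or with a line break if lineBreak is true.
    public static String extract(String xml, boolean lineBreak) throws Exception {
        OrthSketch handler = new OrthSketch();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return String.join(lineBreak ? "\n" : " ", handler.tokens);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<s><tok><orth>Hello</orth></tok><tok><orth>world</orth></tok></s>";
        System.out.println(extract(xml, false)); // prints "Hello world"
    }
}
```

Buffering in characters and emitting in endElement keeps each token intact even when the parser splits its text across several callbacks.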

Constructor Summary
TextExtractor(int startNumber, int endNumber, boolean ifLineBreak, java.lang.String theOrthPath)
           
 
Method Summary
 void characters(char[] ch, int start, int length)
           
 void endElement(java.lang.String namespaceURI, java.lang.String sName, java.lang.String qName)
           
static void main(java.lang.String[] args)
          Outputs the content at certain positions in an XML document. Usage: TextExtractor (-l) (--userDefPaths) [xmlFile] ((startToken)) (endToken). Option -l prints a line break after each token; option --userDefPaths lets the user manually specify the extracted elements.
 void startDocument()
           
 void startElement(java.lang.String namespaceURI, java.lang.String sName, java.lang.String qName, org.xml.sax.Attributes attrs)
           
 
Methods inherited from class org.xml.sax.helpers.DefaultHandler
endDocument, endPrefixMapping, error, fatalError, ignorableWhitespace, notationDecl, processingInstruction, resolveEntity, setDocumentLocator, skippedEntity, startPrefixMapping, unparsedEntityDecl, warning
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

TextExtractor

public TextExtractor(int startNumber,
                     int endNumber,
                     boolean ifLineBreak,
                     java.lang.String theOrthPath)
Method Detail

startDocument

public void startDocument()
Specified by:
startDocument in interface org.xml.sax.ContentHandler
Overrides:
startDocument in class org.xml.sax.helpers.DefaultHandler

startElement

public void startElement(java.lang.String namespaceURI,
                         java.lang.String sName,
                         java.lang.String qName,
                         org.xml.sax.Attributes attrs)
Specified by:
startElement in interface org.xml.sax.ContentHandler
Overrides:
startElement in class org.xml.sax.helpers.DefaultHandler

endElement

public void endElement(java.lang.String namespaceURI,
                       java.lang.String sName,
                       java.lang.String qName)
Specified by:
endElement in interface org.xml.sax.ContentHandler
Overrides:
endElement in class org.xml.sax.helpers.DefaultHandler

characters

public void characters(char[] ch,
                       int start,
                       int length)
Specified by:
characters in interface org.xml.sax.ContentHandler
Overrides:
characters in class org.xml.sax.helpers.DefaultHandler

main

public static void main(java.lang.String[] args)
Outputs the content at certain positions in an XML document.
Usage: TextExtractor (-l) (--userDefPaths) [xmlFile] ((startToken)) (endToken)
Option -l prints a line break after each token.
Option --userDefPaths lets the user manually specify the extracted elements.
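
One plausible reading of the usage line, sketched below, is that parenthesized arguments are optional, that a single number after the file name is the end token, and that two numbers are the start and end tokens. This interpretation, the class name ExtractorArgsSketch, and the default values are assumptions for illustration only; the actual parsing in main may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of parsing:
//   TextExtractor (-l) (--userDefPaths) [xmlFile] ((startToken)) (endToken)
public class ExtractorArgsSketch {
    public boolean lineBreak = false;        // set by -l
    public boolean userDefPaths = false;     // set by --userDefPaths
    public String xmlFile;
    public int startToken = 1;               // assumed default: first token
    public int endToken = Integer.MAX_VALUE; // assumed default: no upper limit

    public static ExtractorArgsSketch parse(String[] args) {
        ExtractorArgsSketch parsed = new ExtractorArgsSketch();
        List<String> positional = new ArrayList<>();
        for (String arg : args) {
            if (arg.equals("-l")) {
                parsed.lineBreak = true;
            } else if (arg.equals("--userDefPaths")) {
                parsed.userDefPaths = true;
            } else {
                positional.add(arg);
            }
        }
        if (positional.isEmpty() || positional.size() > 3) {
            throw new IllegalArgumentException(
                "Usage: TextExtractor (-l) (--userDefPaths) [xmlFile] ((startToken)) (endToken)");
        }
        parsed.xmlFile = positional.get(0);
        if (positional.size() == 2) {
            // Assumption: a single number is the end token.
            parsed.endToken = Integer.parseInt(positional.get(1));
        } else if (positional.size() == 3) {
            parsed.startToken = Integer.parseInt(positional.get(1));
            parsed.endToken = Integer.parseInt(positional.get(2));
        }
        return parsed;
    }

    public static void main(String[] args) {
        // "corpus.xml" is a placeholder file name.
        ExtractorArgsSketch a = parse(new String[] {"-l", "corpus.xml", "10", "20"});
        System.out.println(a.xmlFile + " " + a.startToken + ".." + a.endToken
                + " lineBreak=" + a.lineBreak);
        // prints "corpus.xml 10..20 lineBreak=true"
    }
}
```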