Class DicSampleExtractor

java.lang.Object
  extended by org.xml.sax.helpers.DefaultHandler
      extended by DicSampleExtractor
All Implemented Interfaces:
org.xml.sax.ContentHandler, org.xml.sax.DTDHandler, org.xml.sax.EntityResolver, org.xml.sax.ErrorHandler

public class DicSampleExtractor
extends org.xml.sax.helpers.DefaultHandler

This SAX parser reads through the first part of an XML corpus document, collecting all the orth elements together with their POS tags and storing the replacements it generates in a ConversionDictionary until it reaches a sample size of 1000 masking dictionary entries.
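
The sampling behaviour described above can be illustrated with a minimal sketch. It is not the DicSampleExtractor source: the class OrthSampleSketch, its extract helper, and the assumption that the POS tag is stored in a "pos" attribute of the orth element are all hypothetical (the real class locates content through theOrthPath and thePosPath). The sketch only shows how a DefaultHandler can collect token/POS pairs and abort parsing once the sample is full.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for illustration; not the real DicSampleExtractor.
class OrthSampleSketch extends DefaultHandler {
    /** Thrown from endElement to stop parsing once the sample is full. */
    static class SampleComplete extends SAXException {
        SampleComplete() { super("sample complete"); }
    }

    private final int sampleSize;
    private final List<String[]> sample = new ArrayList<>(); // {token, pos}
    private final StringBuilder text = new StringBuilder();
    private String currentPos; // null while outside an orth element

    OrthSampleSketch(int sampleSize) { this.sampleSize = sampleSize; }

    @Override
    public void startElement(String uri, String sName, String qName, Attributes attrs) {
        if ("orth".equals(qName)) {
            // Assumption: the POS tag is an attribute named "pos".
            String pos = attrs.getValue("pos");
            currentPos = (pos == null) ? "" : pos;
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver one text node in several chunks, so accumulate.
        if (currentPos != null) text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String sName, String qName) throws SAXException {
        if ("orth".equals(qName) && currentPos != null) {
            sample.add(new String[] { text.toString(), currentPos });
            currentPos = null;
            if (sample.size() >= sampleSize) throw new SampleComplete();
        }
    }

    /** Parses the XML string and returns at most sampleSize {token, pos} pairs. */
    static List<String[]> extract(String xml, int sampleSize) {
        OrthSampleSketch handler = new OrthSampleSketch(sampleSize);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException e) {
            // A full sample means the abort was intentional; otherwise it is a real error.
            if (handler.sample.size() < sampleSize) throw new IllegalStateException(e);
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
        return handler.sample;
    }
}
```

Aborting via an exception is a common way to stop a SAX parse early, since the SAX API offers no dedicated "stop" call; the rest of the document is never read.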

Version:
0.1, November 2007
Author:
Johannes Dellert

Constructor Summary
DicSampleExtractor(java.lang.String theOrthPath, java.lang.String thePosPath, java.util.Set<java.lang.String> preservedPOS, boolean verboseOutput, java.lang.String ccfFileName)
          A DicSampleExtractor needs a specification of the content to be masked, a definition of untouchable content, and a file defining the char classes for replacement.
 
Method Summary
 void characters(char[] ch, int start, int length)
           
 java.lang.String convert(java.lang.String toConvert)
          find a replacement pattern for a token
 char convertChar(char toConvert)
          convert a character using the char class definition
 void endDocument()
           
 void endElement(java.lang.String namespaceURI, java.lang.String sName, java.lang.String qName)
           
 ConversionDictionary getConvDict()
          get the masking dictionary generated from the document while traversing it
static java.util.HashMap<java.lang.String,java.util.ArrayList<java.lang.String>> loadCharClasses(java.lang.String fileName)
          specify the char classes for replace pattern generation by giving a file in the format exemplified by lang/german.ccf
 void retainMorphology(double occPerWord, int minOccurs, java.util.Set<java.lang.String> noMorphPOS)
          revise the whole dictionary to detect affixes for each POS class and reinsert them into the replacement forms, so that morphological information is largely preserved; the parameters determine how permissive the algorithm evaluating the possible affixes is
 void startDocument()
           
 void startElement(java.lang.String namespaceURI, java.lang.String sName, java.lang.String qName, org.xml.sax.Attributes attrs)
           
 
Methods inherited from class org.xml.sax.helpers.DefaultHandler
endPrefixMapping, error, fatalError, ignorableWhitespace, notationDecl, processingInstruction, resolveEntity, setDocumentLocator, skippedEntity, startPrefixMapping, unparsedEntityDecl, warning
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

DicSampleExtractor

public DicSampleExtractor(java.lang.String theOrthPath,
                          java.lang.String thePosPath,
                          java.util.Set<java.lang.String> preservedPOS,
                          boolean verboseOutput,
                          java.lang.String ccfFileName)
A DicSampleExtractor needs a specification of the content to be masked, a definition of untouchable content, and a file defining the char classes for replacement.

Parameters:
theOrthPath - the XPath expression that defines the content to be masked
thePosPath - the XPath expression that defines the classification of content for dictionary retrieval
preservedPOS - a set of POS classes whose members are exempt from being masked
verboseOutput - true for very verbose output of the building process
ccfFileName - the name of a file in the format exemplified by lang/german.ccf
Method Detail

getConvDict

public ConversionDictionary getConvDict()
get the masking dictionary generated from the document while traversing it

Returns:
the ConversionDictionary generated while traversing the document and generating replacement patterns

loadCharClasses

public static java.util.HashMap<java.lang.String,java.util.ArrayList<java.lang.String>> loadCharClasses(java.lang.String fileName)
specify the char classes for replace pattern generation by giving a file in the format exemplified by lang/german.ccf

Parameters:
fileName - the name of a file in the format exemplified by lang/german.ccf
Returns:
a map with character class information

startDocument

public void startDocument()
Specified by:
startDocument in interface org.xml.sax.ContentHandler
Overrides:
startDocument in class org.xml.sax.helpers.DefaultHandler

endDocument

public void endDocument()
Specified by:
endDocument in interface org.xml.sax.ContentHandler
Overrides:
endDocument in class org.xml.sax.helpers.DefaultHandler

startElement

public void startElement(java.lang.String namespaceURI,
                         java.lang.String sName,
                         java.lang.String qName,
                         org.xml.sax.Attributes attrs)
                  throws org.xml.sax.SAXException
Specified by:
startElement in interface org.xml.sax.ContentHandler
Overrides:
startElement in class org.xml.sax.helpers.DefaultHandler
Throws:
org.xml.sax.SAXException

endElement

public void endElement(java.lang.String namespaceURI,
                       java.lang.String sName,
                       java.lang.String qName)
                throws org.xml.sax.SAXException
Specified by:
endElement in interface org.xml.sax.ContentHandler
Overrides:
endElement in class org.xml.sax.helpers.DefaultHandler
Throws:
org.xml.sax.SAXException

characters

public void characters(char[] ch,
                       int start,
                       int length)
Specified by:
characters in interface org.xml.sax.ContentHandler
Overrides:
characters in class org.xml.sax.helpers.DefaultHandler

convert

public java.lang.String convert(java.lang.String toConvert)
find a replacement pattern for a token

Parameters:
toConvert - the string to convert
Returns:
the converted String

convertChar

public char convertChar(char toConvert)
convert a character using the char class definition

Parameters:
toConvert - the char to convert
Returns:
the converted char
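
As an illustration of how convert and convertChar might relate, the following sketch masks a token character by character using a class map of the same type that loadCharClasses returns. The replacement strategy is an assumption: this page does not say how a replacement character is chosen, nor what the .ccf format looks like. Here each class name maps to a list of single-character strings, a character is replaced by a random member of its own class, and characters outside every class are left unchanged. CharClassMaskSketch and demoClasses are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Random;

// Hypothetical sketch; the real replacement rules may differ.
class CharClassMaskSketch {
    private final HashMap<String, ArrayList<String>> classes;
    private final Random random;

    CharClassMaskSketch(HashMap<String, ArrayList<String>> classes, long seed) {
        this.classes = classes;
        this.random = new Random(seed);
    }

    /** Replaces c by a random member of its class; unclassified chars pass through. */
    char convertChar(char c) {
        String s = String.valueOf(c);
        for (ArrayList<String> members : classes.values()) {
            if (members.contains(s)) {
                return members.get(random.nextInt(members.size())).charAt(0);
            }
        }
        return c;
    }

    /** Builds a replacement pattern of the same length as the token. */
    String convert(String token) {
        StringBuilder out = new StringBuilder(token.length());
        for (int i = 0; i < token.length(); i++) {
            out.append(convertChar(token.charAt(i)));
        }
        return out.toString();
    }

    /** Tiny made-up class map, standing in for the content of a .ccf file. */
    static HashMap<String, ArrayList<String>> demoClasses() {
        HashMap<String, ArrayList<String>> m = new HashMap<>();
        m.put("vowel", new ArrayList<>(Arrays.asList("a", "e", "i", "o", "u")));
        m.put("consonant", new ArrayList<>(Arrays.asList("b", "d", "g", "k", "m", "n", "s", "t")));
        return m;
    }
}
```

With these demo classes, a call such as new CharClassMaskSketch(CharClassMaskSketch.demoClasses(), 42L).convert("banana") yields a six-letter string with the same consonant/vowel pattern as the input.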

retainMorphology

public void retainMorphology(double occPerWord,
                             int minOccurs,
                             java.util.Set<java.lang.String> noMorphPOS)
revise the whole dictionary to detect affixes for each POS class and reinsert them into the replacement forms, so that morphological information is largely preserved; the parameters determine how permissive the algorithm evaluating the possible affixes is

Parameters:
occPerWord - define how often an affix has to occur in its word class to be retained
minOccurs - define the minimum number of times an affix has to occur to be retained
noMorphPOS - define closed POS classes where no morphology should be inferred
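
One possible reading of these three parameters is sketched below. It is an assumption, not the documented algorithm: occPerWord is interpreted as the minimum ratio of affix occurrences to the number of words in the POS class, only suffixes of up to three characters are considered, and the class AffixFilterSketch with its method frequentSuffixes is hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch of threshold-based affix retention (suffixes only).
class AffixFilterSketch {
    static final int MAX_SUFFIX_LENGTH = 3; // assumed limit, for illustration

    /** Returns, per POS class, the suffixes that pass both thresholds. */
    static Map<String, Set<String>> frequentSuffixes(
            Map<String, List<String>> wordsByPos,
            double occPerWord, int minOccurs, Set<String> noMorphPOS) {
        Map<String, Set<String>> retained = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : wordsByPos.entrySet()) {
            if (noMorphPOS.contains(entry.getKey())) continue; // closed class: skip
            List<String> words = entry.getValue();

            // Count every proper suffix of length 1..MAX_SUFFIX_LENGTH.
            Map<String, Integer> counts = new HashMap<>();
            for (String w : words) {
                for (int len = 1; len <= MAX_SUFFIX_LENGTH && len < w.length(); len++) {
                    counts.merge(w.substring(w.length() - len), 1, Integer::sum);
                }
            }

            // Keep suffixes meeting the absolute and the relative threshold.
            Set<String> keep = new TreeSet<>();
            for (Map.Entry<String, Integer> c : counts.entrySet()) {
                int n = c.getValue();
                if (n >= minOccurs && n >= occPerWord * words.size()) {
                    keep.add(c.getKey());
                }
            }
            retained.put(entry.getKey(), keep);
        }
        return retained;
    }
}
```

Under this reading, lowering occPerWord or minOccurs makes the filter more permissive, so more candidate affixes survive into the replacement forms.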