Class DicExtractor

java.lang.Object
  extended by org.xml.sax.helpers.DefaultHandler
      extended by DicExtractor
All Implemented Interfaces:
org.xml.sax.ContentHandler, org.xml.sax.DTDHandler, org.xml.sax.EntityResolver, org.xml.sax.ErrorHandler

public class DicExtractor
extends org.xml.sax.helpers.DefaultHandler

This SAX handler makes a single pass over the whole document, reading every orth element together with its POS tag and storing the replacements it generates in a ConversionDictionary.

Version:
0.1, November 2007
Author:
Johannes Dellert
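
As a rough illustration of the single-pass approach described above, the following sketch collects (orth, POS) pairs with a plain SAX handler. It is not the DicExtractor source: the class name OrthPosCollector, the element names tok, orth and pos, and the sample input are all assumptions made for the example. The real class locates content through the XPath expressions passed to its constructor and stores replacements rather than raw pairs.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in, not the real DicExtractor: collects (orth, POS) pairs in one pass.
class OrthPosCollector extends DefaultHandler {
    // Each entry is {orth, pos}.
    final List<String[]> pairs = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private String orth;
    private String pos;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        text.setLength(0); // start buffering the text of the new element
        if (qName.equals("tok")) { // assumed wrapper element around one token
            orth = null;
            pos = null;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver one run of text in several chunks, so append instead of assigning.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("orth")) {
            orth = text.toString().trim();
        } else if (qName.equals("pos")) {
            pos = text.toString().trim();
        } else if (qName.equals("tok") && orth != null && pos != null) {
            pairs.add(new String[] {orth, pos});
        }
        text.setLength(0);
    }

    // Parses an XML string and returns the collected pairs.
    static List<String[]> collect(String xml) {
        OrthPosCollector handler = new OrthPosCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
        return handler.pairs;
    }

    public static void main(String[] args) {
        String xml = "<text><tok><orth>Haus</orth><pos>NN</pos></tok>"
                   + "<tok><orth>und</orth><pos>KON</pos></tok></text>";
        for (String[] p : collect(xml)) {
            System.out.println(p[0] + "\t" + p[1]);
        }
    }
}
```

Buffering text in characters and reading it in endElement is the usual SAX pattern, because the parser is free to split one text node across several characters calls.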

Constructor Summary
DicExtractor(java.lang.String theOrthPath, java.lang.String thePosPath, java.util.Set<java.lang.String> preservedPOS, boolean verboseOutput, Drop drop, java.lang.String ccfFileName)
          A DicExtractor needs a specification of the content to be masked, a definition of untouchable content, and a file defining the char classes for replacement
DicExtractor(java.lang.String theOrthPath, java.lang.String thePosPath, java.util.Set<java.lang.String> preservedPOS, boolean verboseOutput, java.lang.String ccfFileName)
          Constructor for console output; no Drop object needs to be specified
 
Method Summary
 void characters(char[] ch, int start, int length)
           
 java.lang.String convert(java.lang.String toConvert)
          find a replacement pattern for a token
 char convertChar(char toConvert)
          convert a character using the char class definition
 void endDocument()
           
 void endElement(java.lang.String namespaceURI, java.lang.String sName, java.lang.String qName)
           
 ConversionDictionary getConvDict()
          get the masking dictionary generated from the document while traversing it
static java.util.HashMap<java.lang.String,java.util.ArrayList<java.lang.String>> loadCharClasses(java.lang.String fileName)
          specify the char classes for replace pattern generation by giving a file in the format exemplified by lang/german.ccf
 void log(java.lang.String toLog)
          used to display a message (debugging or progress info) in System.out or in the log window of the GUI
 void logln(java.lang.String toLog)
          used to display a message (debugging or progress info) in System.out or in the log window of the GUI, followed by a newline
 void retainMorphology(double occPerWord, int minOccurs, java.util.Set<java.lang.String> noMorphPOS)
          revise the whole dictionary to detect affixes for each POS class and reinsert them into the replacement forms, so that morphological information is largely preserved; the parameters determine how permissive the affix evaluation is
 void setConvDict(ConversionDictionary aDict)
          hand over a masking dictionary to be extended while traversing the document
 void startDocument()
           
 void startElement(java.lang.String namespaceURI, java.lang.String sName, java.lang.String qName, org.xml.sax.Attributes attrs)
           
 
Methods inherited from class org.xml.sax.helpers.DefaultHandler
endPrefixMapping, error, fatalError, ignorableWhitespace, notationDecl, processingInstruction, resolveEntity, setDocumentLocator, skippedEntity, startPrefixMapping, unparsedEntityDecl, warning
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

DicExtractor

public DicExtractor(java.lang.String theOrthPath,
                    java.lang.String thePosPath,
                    java.util.Set<java.lang.String> preservedPOS,
                    boolean verboseOutput,
                    Drop drop,
                    java.lang.String ccfFileName)
A DicExtractor needs a specification of the content to be masked, a definition of untouchable content, and a file defining the char classes for replacement

Parameters:
theOrthPath - the XPath expression to define masked content
thePosPath - the XPath expression to define classification of content for dictionary retrieval
preservedPOS - a set of POS classes whose members are exempt from being masked
verboseOutput - true for very verbose output of the building process
drop - a Drop object to handle logging messages in the GUI; null for console output
ccfFileName - the name of a file in the format exemplified by lang/german.ccf

DicExtractor

public DicExtractor(java.lang.String theOrthPath,
                    java.lang.String thePosPath,
                    java.util.Set<java.lang.String> preservedPOS,
                    boolean verboseOutput,
                    java.lang.String ccfFileName)
Constructor for console output; no Drop object needs to be specified

Parameters:
theOrthPath - the XPath expression to define masked content
thePosPath - the XPath expression to define classification of content for dictionary retrieval
preservedPOS - a set of POS classes whose members are exempt from being masked
verboseOutput - true for very verbose output of the building process
ccfFileName - the name of a file in the format exemplified by lang/german.ccf

Method Detail

loadCharClasses

public static java.util.HashMap<java.lang.String,java.util.ArrayList<java.lang.String>> loadCharClasses(java.lang.String fileName)
specify the char classes for replace pattern generation by giving a file in the format exemplified by lang/german.ccf

Parameters:
fileName - the name of a file in the format exemplified by lang/german.ccf
Returns:
a map with character class information
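
The format of lang/german.ccf is not described on this page, so the sketch below only illustrates the general idea of reading named character classes into the documented return type. The line format used here ("name: member member ...", with # starting a comment line) is purely an assumption for the example, as are the class and method names; it should not be read as the actual .ccf syntax.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;

// Hypothetical sketch; the real loadCharClasses reads the (undocumented here) .ccf format.
class CharClassLoaderSketch {

    // ASSUMED line format: "name: member member ..."; blank lines and '#' lines are ignored.
    static HashMap<String, ArrayList<String>> parse(Reader source) throws IOException {
        HashMap<String, ArrayList<String>> classes = new HashMap<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty() || line.startsWith("#")) continue;
                int colon = line.indexOf(':');
                if (colon < 0) continue; // skip lines without a class name
                String name = line.substring(0, colon).trim();
                String members = line.substring(colon + 1).trim();
                ArrayList<String> list = new ArrayList<>();
                if (!members.isEmpty()) {
                    list.addAll(Arrays.asList(members.split("\\s+")));
                }
                classes.put(name, list);
            }
        }
        return classes;
    }

    // Convenience wrapper for in-memory input.
    static HashMap<String, ArrayList<String>> parseString(String content) {
        try {
            return parse(new StringReader(content));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseString("# sample\nvowel: a e i o u\nstop: p t k\n"));
    }
}
```

Reading from a Reader rather than a file name keeps the sketch testable; a file-based variant would wrap a FileReader or Files.newBufferedReader around the given path.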

getConvDict

public ConversionDictionary getConvDict()
get the masking dictionary generated from the document while traversing it

Returns:
the ConversionDictionary generated while traversing the document and generating replacement patterns

setConvDict

public void setConvDict(ConversionDictionary aDict)
hand over a masking dictionary to be extended while traversing the document

Parameters:
aDict - the ConversionDictionary to be used as a basis for extension during document traversal

startDocument

public void startDocument()
Specified by:
startDocument in interface org.xml.sax.ContentHandler
Overrides:
startDocument in class org.xml.sax.helpers.DefaultHandler

endDocument

public void endDocument()
Specified by:
endDocument in interface org.xml.sax.ContentHandler
Overrides:
endDocument in class org.xml.sax.helpers.DefaultHandler

startElement

public void startElement(java.lang.String namespaceURI,
                         java.lang.String sName,
                         java.lang.String qName,
                         org.xml.sax.Attributes attrs)
Specified by:
startElement in interface org.xml.sax.ContentHandler
Overrides:
startElement in class org.xml.sax.helpers.DefaultHandler

endElement

public void endElement(java.lang.String namespaceURI,
                       java.lang.String sName,
                       java.lang.String qName)
Specified by:
endElement in interface org.xml.sax.ContentHandler
Overrides:
endElement in class org.xml.sax.helpers.DefaultHandler

characters

public void characters(char[] ch,
                       int start,
                       int length)
Specified by:
characters in interface org.xml.sax.ContentHandler
Overrides:
characters in class org.xml.sax.helpers.DefaultHandler

convert

public java.lang.String convert(java.lang.String toConvert)
find a replacement pattern for a token

Parameters:
toConvert - the string to convert
Returns:
the converted String

convertChar

public char convertChar(char toConvert)
convert a character using the char class definition

Parameters:
toConvert - the char to convert
Returns:
the converted char
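
To make the relationship between convert and convertChar more concrete, here is a minimal sketch of class-based masking. It assumes that a character is replaced by another member of its own class and that a token is converted character by character; the class name CharClassMasker, the cyclic "next member" rule, and the sample classes are illustrative assumptions, not the documented behaviour of DicExtractor.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of char-class masking; not the actual DicExtractor logic.
class CharClassMasker {
    private final Map<Character, Character> replacement = new HashMap<>();

    // Each argument is one character class. ASSUMPTION: every member maps to the
    // next member of the same class, wrapping around at the end.
    CharClassMasker(String... classes) {
        for (String cls : classes) {
            for (int i = 0; i < cls.length(); i++) {
                replacement.put(cls.charAt(i), cls.charAt((i + 1) % cls.length()));
            }
        }
    }

    // Characters that belong to no class are returned unchanged.
    char convertChar(char toConvert) {
        Character r = replacement.get(toConvert);
        return r == null ? toConvert : r;
    }

    // Converts a token by converting each of its characters in turn.
    String convert(String toConvert) {
        StringBuilder sb = new StringBuilder(toConvert.length());
        for (int i = 0; i < toConvert.length(); i++) {
            sb.append(convertChar(toConvert.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        CharClassMasker masker = new CharClassMasker("aeiou", "ptk");
        System.out.println(masker.convert("tape-1")); // prints "keti-1"
    }
}
```

Because replacements stay within a class, the masked token keeps its rough shape (vowel and consonant positions, punctuation, digits) while the original form is hidden.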

retainMorphology

public void retainMorphology(double occPerWord,
                             int minOccurs,
                             java.util.Set<java.lang.String> noMorphPOS)
revise the whole dictionary to detect affixes for each POS class and reinsert them into the replacement forms, so that morphological information is largely preserved; the parameters determine how permissive the affix evaluation is

Parameters:
occPerWord - define how often an affix has to occur in its word class to be retained
minOccurs - define the minimum number of times an affix has to occur to be retained
noMorphPOS - define closed POS classes where no morphology should be inferred
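
One plausible reading of the three parameters can be sketched as a frequency filter over candidate suffixes. The code below is only an approximation under stated assumptions: it treats occPerWord as a minimum ratio of affix occurrences to words in the POS class, considers suffixes only, and introduces a hypothetical maxLen limit. The class and method names are invented for the example, and the actual algorithm may differ in all of these points.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of threshold-based affix detection; not the real retainMorphology.
class AffixSketch {

    // Returns, per POS class, the suffixes of length 1..maxLen that pass both thresholds.
    static Map<String, Set<String>> frequentSuffixes(Map<String, List<String>> wordsByPos,
                                                     double occPerWord,
                                                     int minOccurs,
                                                     Set<String> noMorphPOS,
                                                     int maxLen) {
        Map<String, Set<String>> result = new HashMap<>();
        for (Map.Entry<String, List<String>> entry : wordsByPos.entrySet()) {
            if (noMorphPOS.contains(entry.getKey())) continue; // closed class: infer nothing
            List<String> words = entry.getValue();
            Map<String, Integer> counts = new HashMap<>();
            for (String w : words) {
                // Count every proper suffix up to maxLen characters.
                for (int len = 1; len <= maxLen && len < w.length(); len++) {
                    counts.merge(w.substring(w.length() - len), 1, Integer::sum);
                }
            }
            Set<String> kept = new TreeSet<>();
            for (Map.Entry<String, Integer> c : counts.entrySet()) {
                // ASSUMPTION: occPerWord is a ratio relative to the class size.
                boolean frequent = c.getValue() >= minOccurs
                        && c.getValue() >= occPerWord * words.size();
                if (frequent) kept.add(c.getKey());
            }
            result.put(entry.getKey(), kept);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> words = new HashMap<>();
        words.put("VVINF", List.of("gehen", "laufen", "sehen", "tun"));
        words.put("ART", List.of("der", "die", "das"));
        System.out.println(frequentSuffixes(words, 0.5, 2, Set.of("ART"), 2));
    }
}
```

Lower values of occPerWord and minOccurs make the filter more permissive, which matches the description of the parameters as controls on how readily a candidate affix is accepted.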

log

public void log(java.lang.String toLog)
used to display a message (debugging or progress info) in System.out or in the log window of the GUI

Parameters:
toLog - the logging message to be displayed in System.out or in the log window of the GUI

logln

public void logln(java.lang.String toLog)
used to display a message (debugging or progress info) in System.out or in the log window of the GUI, followed by a newline

Parameters:
toLog - the logging message to be displayed in System.out or in the log window of the GUI