java.lang.Object
  org.xml.sax.helpers.DefaultHandler
    DicExtractor
public class DicExtractor extends org.xml.sax.helpers.DefaultHandler
This SAX handler reads through the whole document once, collecting all the orth elements together with their POS tags and storing the replacements it generates in a ConversionDictionary.
Constructor Summary | |
---|---|
DicExtractor(java.lang.String theOrthPath,
java.lang.String thePosPath,
java.util.Set<java.lang.String> preservedPOS,
boolean verboseOutput,
Drop drop,
java.lang.String ccfFileName)
A DicExtractor needs a specification of the content to be masked, a definition of untouchable content, and a file defining the char classes for replacement |
|
DicExtractor(java.lang.String theOrthPath,
java.lang.String thePosPath,
java.util.Set<java.lang.String> preservedPOS,
boolean verboseOutput,
java.lang.String ccfFileName)
This constructor is used for console output (no Drop object needs to be specified) |
Method Summary | |
---|---|
void |
characters(char[] ch,
int start,
int length)
|
java.lang.String |
convert(java.lang.String toConvert)
find a replacement pattern for a token |
char |
convertChar(char toConvert)
convert a character using the char class definition |
void |
endDocument()
|
void |
endElement(java.lang.String namespaceURI,
java.lang.String sName,
java.lang.String qName)
|
ConversionDictionary |
getConvDict()
get the masking dictionary generated from the document while traversing it |
static java.util.HashMap<java.lang.String,java.util.ArrayList<java.lang.String>> |
loadCharClasses(java.lang.String fileName)
specify the char classes for replace pattern generation by giving a file in the format exemplified by lang/german.ccf |
void |
log(java.lang.String toLog)
used to display a message (debugging or progress info) in System.out or in the log window of the GUI |
void |
logln(java.lang.String toLog)
used to display a message (debugging or progress info) in System.out or in the log window of the GUI, followed by a newline |
void |
retainMorphology(double occPerWord,
int minOccurs,
java.util.Set<java.lang.String> noMorphPOS)
revise the whole dictionary to detect affixes for each POS class and reinsert them into the replacement forms, so that morphological information is largely preserved; the parameters determine how permissive the algorithm evaluating the possible affixes will be |
void |
setConvDict(ConversionDictionary aDict)
hand over a masking dictionary to be extended while traversing the document |
void |
startDocument()
|
void |
startElement(java.lang.String namespaceURI,
java.lang.String sName,
java.lang.String qName,
org.xml.sax.Attributes attrs)
|
Methods inherited from class org.xml.sax.helpers.DefaultHandler |
---|
endPrefixMapping, error, fatalError, ignorableWhitespace, notationDecl, processingInstruction, resolveEntity, setDocumentLocator, skippedEntity, startPrefixMapping, unparsedEntityDecl, warning |
Methods inherited from class java.lang.Object |
---|
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
public DicExtractor(java.lang.String theOrthPath, java.lang.String thePosPath, java.util.Set<java.lang.String> preservedPOS, boolean verboseOutput, Drop drop, java.lang.String ccfFileName)
theOrthPath - the XPath expression to define masked content
thePosPath - the XPath expression to define classification of content for dictionary retrieval
preservedPOS - a set of POS classes whose members are exempt from being masked
verboseOutput - true for very verbose output of the building process
drop - a Drop object to handle logging messages in the GUI (null for console output)
ccfFileName - the name of a file in the format exemplified by lang/german.ccf
public DicExtractor(java.lang.String theOrthPath, java.lang.String thePosPath, java.util.Set<java.lang.String> preservedPOS, boolean verboseOutput, java.lang.String ccfFileName)
theOrthPath - the XPath expression to define masked content
thePosPath - the XPath expression to define classification of content for dictionary retrieval
preservedPOS - a set of POS classes whose members are exempt from being masked
verboseOutput - true for very verbose output of the building process
ccfFileName - the name of a file in the format exemplified by lang/german.ccf
Method Detail |
---|
public static java.util.HashMap<java.lang.String,java.util.ArrayList<java.lang.String>> loadCharClasses(java.lang.String fileName)
fileName - the path to a file in the format exemplified by lang/german.ccf
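The format of lang/german.ccf is not reproduced on this page, so the following is only a minimal sketch of how char classes could be read into the documented return type. The line format ("className: member member ..."), the class name CharClassSketch, and the method name parseCharClasses are assumptions made for illustration, not the actual .ccf syntax or the library's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;

public class CharClassSketch {
    /**
     * Parses char class definitions from lines of text.
     * ASSUMED line format (not the real .ccf syntax): "className: member member ...".
     * Blank lines and lines starting with '#' are skipped.
     */
    public static HashMap<String, ArrayList<String>> parseCharClasses(List<String> lines) {
        HashMap<String, ArrayList<String>> classes = new HashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // not a class definition in the assumed format
            }
            String name = line.substring(0, colon).trim();
            String rest = line.substring(colon + 1).trim();
            ArrayList<String> members = new ArrayList<>();
            if (!rest.isEmpty()) {
                members.addAll(Arrays.asList(rest.split("\\s+")));
            }
            classes.put(name, members);
        }
        return classes;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("# sample", "vowel: a e i o u", "", "digit: 0 1 2");
        System.out.println(parseCharClasses(lines).get("vowel")); // prints [a, e, i, o, u]
    }
}
```

To read from a file instead, the lines could be obtained with java.nio.file.Files.readAllLines and passed to the same method.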
public ConversionDictionary getConvDict()
public void setConvDict(ConversionDictionary aDict)
aDict - the ConversionDictionary to be used as a basis for extension during document traversal
public void startDocument()
Specified by: startDocument in interface org.xml.sax.ContentHandler
Overrides: startDocument in class org.xml.sax.helpers.DefaultHandler
public void endDocument()
Specified by: endDocument in interface org.xml.sax.ContentHandler
Overrides: endDocument in class org.xml.sax.helpers.DefaultHandler
public void startElement(java.lang.String namespaceURI, java.lang.String sName, java.lang.String qName, org.xml.sax.Attributes attrs)
Specified by: startElement in interface org.xml.sax.ContentHandler
Overrides: startElement in class org.xml.sax.helpers.DefaultHandler
public void endElement(java.lang.String namespaceURI, java.lang.String sName, java.lang.String qName)
Specified by: endElement in interface org.xml.sax.ContentHandler
Overrides: endElement in class org.xml.sax.helpers.DefaultHandler
public void characters(char[] ch, int start, int length)
Specified by: characters in interface org.xml.sax.ContentHandler
Overrides: characters in class org.xml.sax.helpers.DefaultHandler
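The SAX callbacks above carry no description on this page. As a rough illustration of how a DefaultHandler subclass can gather orth elements together with their POS tags in a single pass, here is a small stand-in handler. It is not DicExtractor itself: the class OrthPosCollector, the element names orth and pos, and the sample document layout are assumptions, and matching is done on plain element names rather than on the XPath expressions the real constructor takes.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class OrthPosCollector extends DefaultHandler {
    public final List<String> pairs = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private String orth;      // last orth value seen, waiting for its POS tag
    private boolean capture;  // true while inside an orth or pos element

    @Override
    public void startElement(String namespaceURI, String sName, String qName, Attributes attrs) {
        // Assumed element names; the real class selects content via XPath expressions.
        capture = qName.equals("orth") || qName.equals("pos");
        text.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver text in several chunks, so accumulate instead of overwriting.
        if (capture) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String namespaceURI, String sName, String qName) {
        if (qName.equals("orth")) {
            orth = text.toString().trim();
        } else if (qName.equals("pos") && orth != null) {
            pairs.add(orth + "/" + text.toString().trim());
            orth = null;
        }
        capture = false;
    }

    /** Parses the given XML string once and returns the collected orth/POS pairs. */
    public static List<String> collect(String xml) {
        OrthPosCollector handler = new OrthPosCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
        return handler.pairs;
    }

    public static void main(String[] args) {
        String xml = "<s><w><orth>Haus</orth><pos>NN</pos></w>"
                + "<w><orth>geht</orth><pos>VVFIN</pos></w></s>";
        System.out.println(collect(xml)); // prints [Haus/NN, geht/VVFIN]
    }
}
```

A real DicExtractor would be handed to the parser in the same way as the handler in collect, after which getConvDict() returns the dictionary built during traversal.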
public java.lang.String convert(java.lang.String toConvert)
toConvert - the string to convert
public char convertChar(char toConvert)
toConvert - the char to convert
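The page does not say how a replacement is chosen, so the following sketch shows just one plausible reading of char-class-based masking: each character is replaced by the first member of the class that contains it, and characters outside every class are kept. The class MaskSketch, the four sample classes, and the "first member" rule are all assumptions for illustration; the real convert may also consult the ConversionDictionary.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

public class MaskSketch {
    private final Map<String, ArrayList<String>> charClasses;

    public MaskSketch(Map<String, ArrayList<String>> charClasses) {
        this.charClasses = charClasses;
    }

    /** Replaces a char by the first member of its class (assumed rule); unknown chars pass through. */
    public char convertChar(char toConvert) {
        String s = String.valueOf(toConvert);
        for (ArrayList<String> members : charClasses.values()) {
            if (members.contains(s)) {
                return members.get(0).charAt(0);
            }
        }
        return toConvert;
    }

    /** Builds a replacement pattern for a token by converting it char by char. */
    public String convert(String toConvert) {
        StringBuilder out = new StringBuilder(toConvert.length());
        for (int i = 0; i < toConvert.length(); i++) {
            out.append(convertChar(toConvert.charAt(i)));
        }
        return out.toString();
    }

    private static ArrayList<String> members(String chars) {
        ArrayList<String> list = new ArrayList<>();
        for (char c : chars.toCharArray()) {
            list.add(String.valueOf(c));
        }
        return list;
    }

    /** Sample masker with four disjoint, made-up classes. */
    public static MaskSketch demo() {
        Map<String, ArrayList<String>> classes = new HashMap<>();
        classes.put("lowerVowel", members("aeiou"));
        classes.put("lowerConsonant", members("bcdfghjklmnpqrstvwxyz"));
        classes.put("upperVowel", members("AEIOU"));
        classes.put("upperConsonant", members("BCDFGHJKLMNPQRSTVWXYZ"));
        return new MaskSketch(classes);
    }

    public static void main(String[] args) {
        System.out.println(demo().convert("Haus")); // prints Baab
    }
}
```

Because the sample classes are disjoint, the result does not depend on the iteration order of the map.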
public void retainMorphology(double occPerWord, int minOccurs, java.util.Set<java.lang.String> noMorphPOS)
occPerWord - define how often an affix has to occur in its word class to be retained
minOccurs - define the minimum number of times an affix has to occur to be retained
noMorphPOS - define closed POS classes where no morphology should be inferred
public void log(java.lang.String toLog)
toLog - the logging message to be displayed in System.out or in the log window of the GUI
public void logln(java.lang.String toLog)
toLog - the logging message to be displayed in System.out or in the log window of the GUI
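The two logging methods route a message either to the GUI or to the console, depending on whether a Drop object was supplied. The Drop class is not documented on this page, so the sketch below substitutes a java.util.function.Consumer<String> as a hypothetical stand-in for it; the class name LogSketch and the use of "\n" as the line terminator are likewise assumptions.

```java
import java.util.function.Consumer;

public class LogSketch {
    // Hypothetical stand-in for the Drop object; null means console output.
    private final Consumer<String> guiSink;

    public LogSketch(Consumer<String> guiSink) {
        this.guiSink = guiSink;
    }

    /** Sends the message to the GUI sink if one was given, otherwise to System.out. */
    public void log(String toLog) {
        if (guiSink != null) {
            guiSink.accept(toLog);
        } else {
            System.out.print(toLog);
        }
    }

    /** Same as log, followed by a newline. */
    public void logln(String toLog) {
        log(toLog + "\n");
    }

    public static void main(String[] args) {
        LogSketch console = new LogSketch(null);
        console.log("building dictionary");
        console.logln(" ... done");
    }
}
```

Keeping the null check in one place means logln needs no routing logic of its own.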