java.lang.Object
  org.xml.sax.helpers.DefaultHandler
    DicSampleExtractor
public class DicSampleExtractor extends org.xml.sax.helpers.DefaultHandler
This SAX parser reads through the first part of an XML corpus document, reading all the orth elements together with their POS tags and storing the replacements it generates in a ConversionDictionary until it reaches a sample size of 1000 masking dictionary entries.
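The sampling behaviour described above can be illustrated with a plain SAX handler. The following is a minimal sketch and not the DicSampleExtractor source: the class name SampleCollector, the element names orth and pos, and the use of a simple map in place of a ConversionDictionary are all assumptions made for illustration. It buffers character data, since SAX may deliver the text of one element in several chunks, and stops recording once the sample limit is reached.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for illustration; not the real DicSampleExtractor.
class SampleCollector extends DefaultHandler {
    private final int sampleSize;                                     // stop recording after this many entries
    private final Map<String, String> sample = new LinkedHashMap<>(); // orth -> POS tag
    private final StringBuilder text = new StringBuilder();
    private String currentOrth;

    SampleCollector(int sampleSize) {
        this.sampleSize = sampleSize;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        text.setLength(0);                       // start a fresh buffer for each element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);          // text may arrive in several chunks
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("orth".equals(qName)) {              // assumed element name
            currentOrth = text.toString().trim();
        } else if ("pos".equals(qName) && currentOrth != null) { // assumed element name
            if (sample.size() < sampleSize) {
                sample.putIfAbsent(currentOrth, text.toString().trim());
            }
            currentOrth = null;
        }
    }

    Map<String, String> getSample() {
        return sample;
    }

    // Parses an XML string and returns the collected sample.
    static Map<String, String> collect(String xml, int sampleSize) {
        SampleCollector handler = new SampleCollector(sampleSize);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
        return handler.getSample();
    }
}
```

The sketch keeps reading the document after the limit is reached and simply ignores further tokens; whether the real class stops parsing early is not stated on this page.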
Constructor Summary | |
---|---|
DicSampleExtractor(java.lang.String theOrthPath,
java.lang.String thePosPath,
java.util.Set<java.lang.String> preservedPOS,
boolean verboseOutput,
java.lang.String ccfFileName)
A DicSampleExtractor needs a specification of the content to be masked, a definition of untouchable content, and a file defining the char classes for replacement. |
Method Summary | |
---|---|
void |
characters(char[] ch,
int start,
int length)
|
java.lang.String |
convert(java.lang.String toConvert)
find a replacement pattern for a token |
char |
convertChar(char toConvert)
convert a character using the char class definition |
void |
endDocument()
|
void |
endElement(java.lang.String namespaceURI,
java.lang.String sName,
java.lang.String qName)
|
ConversionDictionary |
getConvDict()
get the masking dictionary generated from the document while traversing it |
static java.util.HashMap<java.lang.String,java.util.ArrayList<java.lang.String>> |
loadCharClasses(java.lang.String fileName)
load the char classes for replacement pattern generation from a file in the format exemplified by lang/german.ccf |
void |
retainMorphology(double occPerWord,
int minOccurs,
java.util.Set<java.lang.String> noMorphPOS)
revise the whole dictionary to detect affixes for each POS class and reinsert them into the replacement forms, so that morphological information is largely preserved; the parameters determine how permissive the algorithm evaluating the possible affixes is |
void |
startDocument()
|
void |
startElement(java.lang.String namespaceURI,
java.lang.String sName,
java.lang.String qName,
org.xml.sax.Attributes attrs)
|
Methods inherited from class org.xml.sax.helpers.DefaultHandler |
---|
endPrefixMapping, error, fatalError, ignorableWhitespace, notationDecl, processingInstruction, resolveEntity, setDocumentLocator, skippedEntity, startPrefixMapping, unparsedEntityDecl, warning |
Methods inherited from class java.lang.Object |
---|
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
public DicSampleExtractor(java.lang.String theOrthPath, java.lang.String thePosPath, java.util.Set<java.lang.String> preservedPOS, boolean verboseOutput, java.lang.String ccfFileName)
Parameters:
theOrthPath - the XPath expression to define masked content
thePosPath - the XPath expression to define classification of content for dictionary retrieval
preservedPOS - a set of POS classes whose members are exempt from being masked
verboseOutput - true for very verbose output of the building process
ccfFileName - the name of a file in the format exemplified by lang/german.ccf
Method Detail |
---|
public ConversionDictionary getConvDict()
public static java.util.HashMap<java.lang.String,java.util.ArrayList<java.lang.String>> loadCharClasses(java.lang.String fileName)
Parameters:
fileName - a filename leading to a file in the format exemplified by lang/german.ccf
public void startDocument()
Specified by: startDocument in interface org.xml.sax.ContentHandler
Overrides: startDocument in class org.xml.sax.helpers.DefaultHandler
public void endDocument()
Specified by: endDocument in interface org.xml.sax.ContentHandler
Overrides: endDocument in class org.xml.sax.helpers.DefaultHandler
public void startElement(java.lang.String namespaceURI, java.lang.String sName, java.lang.String qName, org.xml.sax.Attributes attrs) throws org.xml.sax.SAXException
Specified by: startElement in interface org.xml.sax.ContentHandler
Overrides: startElement in class org.xml.sax.helpers.DefaultHandler
Throws: org.xml.sax.SAXException
public void endElement(java.lang.String namespaceURI, java.lang.String sName, java.lang.String qName) throws org.xml.sax.SAXException
Specified by: endElement in interface org.xml.sax.ContentHandler
Overrides: endElement in class org.xml.sax.helpers.DefaultHandler
Throws: org.xml.sax.SAXException
public void characters(char[] ch, int start, int length)
Specified by: characters in interface org.xml.sax.ContentHandler
Overrides: characters in class org.xml.sax.helpers.DefaultHandler
public java.lang.String convert(java.lang.String toConvert)
Parameters:
toConvert - the string to convert
public char convertChar(char toConvert)
Parameters:
toConvert - the char to convert
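The relationship between convert and convertChar can be pictured with a small class-based masking sketch. Everything in it is hypothetical: the class name CharClassSketch, the made-up classes in demo(), and in particular the replacement rule (each character becomes the first member of its class, and characters in no class stay unchanged) are assumptions chosen to keep the example deterministic. The actual rule and the .ccf file format are not documented on this page. The only detail taken from the page is the shape of the class table, a map from a class name to a list of strings, as returned by loadCharClasses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of class-based character masking.
class CharClassSketch {
    private final Map<String, ArrayList<String>> charClasses;

    CharClassSketch(Map<String, ArrayList<String>> charClasses) {
        this.charClasses = charClasses;
    }

    // Assumption: a character is replaced by the first member of its class;
    // characters that belong to no class are returned unchanged.
    char convertChar(char toConvert) {
        String s = String.valueOf(toConvert);
        for (List<String> members : charClasses.values()) {
            if (members.contains(s)) {
                return members.get(0).charAt(0);
            }
        }
        return toConvert;
    }

    // Converts a token character by character.
    String convert(String toConvert) {
        StringBuilder out = new StringBuilder(toConvert.length());
        for (char c : toConvert.toCharArray()) {
            out.append(convertChar(c));
        }
        return out.toString();
    }

    // Builds a small in-memory class table (made-up classes, not the contents of any .ccf file).
    static CharClassSketch demo() {
        Map<String, ArrayList<String>> classes = new HashMap<>();
        classes.put("vowel", new ArrayList<>(Arrays.asList("a", "e", "i", "o", "u")));
        classes.put("stop", new ArrayList<>(Arrays.asList("t", "p", "k")));
        return new CharClassSketch(classes);
    }
}
```

A masking tool would more likely pick a varying member of the class than always the first one; the fixed choice here only serves to make the result reproducible.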
public void retainMorphology(double occPerWord, int minOccurs, java.util.Set<java.lang.String> noMorphPOS)
Parameters:
occPerWord - define how often an affix has to occur in its word class to be retained
minOccurs - define the minimum number of times an affix has to occur to be retained
noMorphPOS - define closed POS classes where no morphology should be inferred
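How two such thresholds might interact can be shown with a small suffix-counting sketch for a single POS class. This is one plausible reading of the parameters, not the documented algorithm: it assumes that occPerWord is a fraction of the words in the class, that affixes are suffixes of a bounded length, and that both thresholds must be met. The class name AffixSketch, the method frequentSuffixes and the maxLen parameter are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of threshold-based suffix retention for one POS class.
class AffixSketch {
    // Returns the suffixes (up to maxLen characters) that occur at least minOccurs times
    // and in at least the fraction occPerWord of the given words.
    static Set<String> frequentSuffixes(List<String> words, double occPerWord,
                                        int minOccurs, int maxLen) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) {
            // Leave at least one character as the stem.
            for (int len = 1; len <= maxLen && len < w.length(); len++) {
                counts.merge(w.substring(w.length() - len), 1, Integer::sum);
            }
        }
        Set<String> retained = new TreeSet<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int n = e.getValue();
            if (n >= minOccurs && n >= occPerWord * words.size()) {
                retained.add(e.getKey());
            }
        }
        return retained;
    }
}
```

Under this reading, lower values of either threshold make the selection more permissive, and a class listed in noMorphPOS would simply be skipped before any counting takes place.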