lucene - How does (Carrot) clustering work in Solr?
I have Lucene/Solr 4 running and am testing different features, such as clustering. Currently, 1 million documents are indexed. Every document has the following fields:
id (unique key)
    example1: 10245
    example2: 24974
topic (keywords of the document)
    example1: "disaster/japan/nuclear power station"
    example2: "world/japan/nuclear power"
headline (1 line of text)
    example1: "explosion @ nuclear power plant in japan"
    example2: "news japans nuclear power plant"
text (the full text)
    "in japanese nuclear power plant in fukushima..."
All fields are indexed and stored, except text, which is indexed but not stored. I use the following specific configuration:
<str name="carrot.title">topic</str>
<str name="carrot.snippet">headline</str>
If you look at the examples, you can see that the topics are different, but "japan" is the same in both. Is it possible to configure Solr/Carrot in such a way that example1 and example2 end up in one cluster because of the matching "japan"?
Further, there is a third document with the topic "news/nuclear power", which has no "japan" in it, but whose headline and text use the words "japans power plant". Which Solr/Carrot configuration is relevant in order to get all 3 news items in one cluster?
Thank you!
Carrot2 was designed to cluster natural / unstructured text, and such algorithms do not always produce results that a human would find perfect. Unfortunately, such algorithms are also hard to "debug": the clusters they produce depend on many factors, such as the frequencies with which words occur in the documents. In your specific example, the word "japan" may not have been chosen to form a cluster because it is too frequent: it appears in all of the documents you quoted.
Here are a few tips you may want to try to tweak the clusters:
Try separating the keywords with a period followed by a space rather than a slash, e.g. "disaster. japan. nuclear power station". If you do that, Carrot2 will treat word sequences, such as "nuclear power station", as phrases rather than as individual words.
Try a different Carrot2 clustering algorithm, e.g. STC.
If there is a chance that the full story text field could be stored (or maybe part of it, such as the first paragraph), use headline as carrot.title and the full text / excerpt as carrot.snippet.
Play with the specific settings of the Carrot2 algorithms. The best tool for that is the Carrot2 Clustering Workbench. Here's how to connect it to Solr: http://wiki.apache.org/solr/clusteringcomponent#tuning_carrot2_clustering
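As a rough sketch, the second and third tips might look something like this in solrconfig.xml. It assumes that the text field has been changed to stored, and the engine name "stc" and handler name "/clustering" are only illustrative choices, not anything taken from the question's setup:

```xml
<!-- Sketch only: the engine name "stc" and handler name "/clustering" are illustrative. -->
<searchComponent name="clustering" class="solr.clustering.ClusteringComponent">
  <lst name="engine">
    <str name="name">stc</str>
    <!-- Use the STC algorithm instead of the default Lingo algorithm -->
    <str name="carrot.algorithm">org.carrot2.clustering.stc.STCClusteringAlgorithm</str>
  </lst>
</searchComponent>

<requestHandler name="/clustering" class="solr.SearchHandler">
  <lst name="defaults">
    <bool name="clustering">true</bool>
    <str name="clustering.engine">stc</str>
    <bool name="clustering.results">true</bool>
    <!-- Assumes "text" is now a stored field; an unstored field gives Carrot2 nothing to read -->
    <str name="carrot.title">headline</str>
    <str name="carrot.snippet">text</str>
  </lst>
  <arr name="last-components">
    <str>clustering</str>
  </arr>
</requestHandler>
```

Whether STC or Lingo groups the "japan" documents better on this data is something to check experimentally, for instance in the Clustering Workbench, since the outcome depends on word frequencies across the whole result set.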