lucene - How does (Carrot) clustering work in Solr?


I am running Lucene/Solr 4 and testing different features, currently clustering. At the moment, 1 million documents are indexed. Every document has the following fields:

  • id (unique key). Example 1: 10245. Example 2: 24974.

  • topic (keywords of the document). Example 1: "disaster/japan/nuclear power station". Example 2: "world/japan/nuclear power".

  • headline (one line of text). Example 1: "explosion @ nuclear power plant in japan". Example 2: "news japans nuclear power plant".

  • text (the full text): "in japanese nuclear power plant in fukushima..."

All fields are indexed and stored, except text, which is indexed but not stored. I use the following specific configuration:

  <str name="carrot.title">topic</str>
  <str name="carrot.snippet">headline</str>

If you look at the examples, you can see that the topics are different, but "japan" appears in both. Is it possible to configure Solr/Carrot in such a way that example 1 and example 2 end up in one cluster because of the matching "japan"?

Further, there is a third topic, "news/nuclear power", with no "japan" in it, but the headline and text use the words "japans power plant". Which Solr/Carrot configuration is relevant in order to get all three news items into one cluster?

Thank you!

Carrot2 was designed to cluster natural / unstructured text, and such algorithms cannot be expected to produce results that a human would find perfect. Unfortunately, such algorithms are also hard to "debug": the clusters they produce depend on many factors, such as the frequencies of the words that occur in the documents. In your specific example, the word "japan" may not have been chosen to form a cluster because it is too frequent; it appears in all of the documents you quoted.

Here are a few tips you may want to try in order to tweak the clusters:

  • Try separating the keywords with a period followed by a space rather than a slash, e.g. "disaster. japan. nuclear power station". If you do that, Carrot2 will treat word sequences, such as "nuclear power station", as phrases rather than as individual words.

  • Try a different Carrot2 clustering algorithm, e.g. STC.

  • If there is a chance that the full story text field could be stored (or maybe part of it, such as the first paragraph), use headline as carrot.title and the full text / excerpt as carrot.snippet.

  • Play with the specific settings of the Carrot2 algorithms. The best tool for that is the Carrot2 Clustering Workbench. Here's how to connect it to Solr: http://wiki.apache.org/solr/clusteringcomponent#tuning_carrot2_clustering
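
The first tip can be applied to the topic values before they are sent to Solr for indexing. Below is a minimal sketch in Python, assuming each topic arrives as a single slash-separated string like the ones in the question. The helper name `topic_to_sentences` is made up for illustration and is not part of Solr or Carrot2.

```python
def topic_to_sentences(topic: str) -> str:
    """Rewrite a slash-separated topic string so that each keyword
    group is separated by a period followed by a space.

    Hypothetical helper for illustration only; it runs on the
    client side before the document is indexed.
    """
    # Split on the slash and drop empty pieces (e.g. from "a//b" or "a/").
    parts = [part.strip() for part in topic.split("/") if part.strip()]
    # Join the keyword groups with ". " as suggested in the tips above.
    return ". ".join(parts)


if __name__ == "__main__":
    for raw in ("disaster/japan/nuclear power station",
                "world/japan/nuclear power",
                "news/nuclear power"):
        print(topic_to_sentences(raw))
```

Whether this change alone is enough to pull the three example documents into one cluster still depends on the algorithm and on the rest of the collection, so it is worth checking the result in the Carrot2 Clustering Workbench.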

