Getting Started with Solr

Hanrea, posted 2017-4-13 13:28:49

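Search has a large impact on the user experience of a portal community, and many of the community's features depend on a search engine. Several options are currently available for implementing one:

1. Build site search directly on Lucene with our own wrapper. The implementation and extension effort are both large, so this option is not adopted.

2. Call the Google or Baidu API for site search. This ties us too tightly to a third-party search engine and cannot meet later business expansion needs, so it is not adopted for now.

3. Build site search on Compass+Lucene. This suits indexing the data of database-driven applications, especially as a replacement for the traditional like '%expression%' on varchar or clob fields, and it is a sound option for site search. However, distributed processing and the interface layer would still need a fair amount of our own wrapping, so it is not adopted for now.

4. Build site search on Solr. It is well packaged and extensible and offers a fairly complete solution, so the portal community adopts this option and will add the Compass option later.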

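1. Introduction to Solr

Solr is a Java search server built on Lucene. It provides faceted search and hit highlighting and supports multiple output formats (including XML/XSLT and JSON). It is easy to install and configure and ships with an HTTP-based administration interface. Solr is already used on many large websites and is fairly mature and stable. Solr wraps and extends Lucene, so it largely keeps Lucene's terminology. More importantly, indexes created by Solr are fully compatible with the Lucene search library. With suitable configuration, and in some cases some coding, Solr can read and use indexes built by other Lucene applications. In addition, many Lucene tools (such as Nutch and Luke) can use indexes created by Solr.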


2. Installing and Configuring Solr under Tomcat

Solr is written in Java, so it deploys well on both Windows and Linux. However, Solr ships some shell scripts that make testing, administration, and maintenance easier, so Linux is recommended for production deployments; Windows is fine for testing.

The steps below install and configure Solr on Linux; Windows is similar.

    wget http://apache.mirror.phpchina.co ... e-tomcat-6.0.16.zip
    unzip apache-tomcat-6.0.16.zip
    mv apache-tomcat-6.0.16 /opt/tomcat
    chmod 755 /opt/tomcat/bin/*
    wget http://apache.mirror.phpchina.com/lucene/solr/1.2/apache-solr-1.2.0.tgz
    tar zxvf apache-solr-1.2.0.tgz

The trickiest part of installing and configuring Solr is understanding and setting solr.solr.home. There are three main approaches.

Approach based on the current working directory

    cp apache-solr-1.2.0/dist/apache-solr-1.2.0.war /opt/tomcat/webapps/solr.war
    mkdir /opt/solr-tomcat
    cp -r apache-solr-1.2.0/example/solr/ /opt/solr-tomcat/
    cd /opt/solr-tomcat
    /opt/tomcat/bin/startup.sh

In this case (neither the solr.solr.home property nor a JNDI entry is set), Solr looks for ./solr, so you must switch to /opt/solr-tomcat before starting Tomcat.

Approach based on the solr.solr.home property

Add the following to the current user's environment (.bash_profile) or to /opt/tomcat/bin/catalina.sh. It passes solr.solr.home to the JVM as a system property through JAVA_OPTS:

    export JAVA_OPTS="$JAVA_OPTS -Dsolr.solr.home=/opt/solr-tomcat/solr"

Approach based on JNDI configuration

    mkdir -p /opt/tomcat/conf/Catalina/localhost
    touch /opt/tomcat/conf/Catalina/localhost/solr.xml

Give solr.xml the following content:

    <Context docBase="/opt/tomcat/webapps/solr.war" debug="0" crossContext="true" >
      <Environment name="solr/home" type="java.lang.String" value="/opt/solr-tomcat/solr" override="true" />
    </Context>

Open the Solr admin interface at http://ip:port/solr
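To make the three approaches concrete, the lookup can be modeled in a few lines of Python. This is only an illustrative sketch, not Solr's own code: the function `resolve_solr_home` and its parameter names are hypothetical, and the precedence it encodes (JNDI entry first, then the solr.solr.home property, then ./solr under the directory Tomcat was started from) is an assumption drawn from the description above.

```python
import posixpath


def resolve_solr_home(jndi_home=None, system_property=None, start_dir="."):
    """Model of how the Solr home directory is chosen (assumed precedence).

    jndi_home       -- value of the JNDI entry solr/home, if configured
    system_property -- value passed as -Dsolr.solr.home=..., if any
    start_dir       -- directory the servlet container was started from
    """
    if jndi_home:
        return jndi_home
    if system_property:
        return system_property
    # Fallback: the relative path ./solr, which is why the first approach
    # needs a cd into /opt/solr-tomcat before running startup.sh.
    return posixpath.join(start_dir, "solr")


if __name__ == "__main__":
    print(resolve_solr_home(start_dir="/opt/solr-tomcat"))
    print(resolve_solr_home(system_property="/opt/solr-tomcat/solr"))
```

With all three paths pointing at /opt/solr-tomcat/solr, as in this guide, the approaches differ only in where the setting lives, not in the directory Solr ends up using.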


3. How Solr Works

Solr exposes a standard HTTP interface for adding, deleting, updating, and querying indexed data. Users start indexing and searching by sending HTTP requests to the Solr web application deployed in a servlet container. Solr accepts the request, determines the appropriate SolrRequestHandler, and processes the request. The response comes back over HTTP in the same way. The default configuration returns Solr's standard XML response, and alternative response formats can be configured.

Four different index requests can be sent to the Solr index servlet:
add/update: adds documents to Solr or updates existing ones. Added and updated documents are not searchable until a commit.
commit: tells Solr to make all changes since the last commit searchable.
optimize: restructures Lucene's files to improve search performance. Optimizing once indexing has finished is usually a good idea. If updates are frequent, schedule optimization for periods of low usage. An index runs fine without optimization, and optimization is time-consuming.
delete: can be specified by id or by query. Delete by id removes the document with the given id; delete by query removes all documents the query matches.

A typical add request:

    <add>
      <doc>
        <field name="id">TWINX2048-3200PRO</field>
        <field name="name">CORSAIR  XMS 2GB (2 x 1GB) 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) Dual Channel Kit System Memory - Retail</field>
        <field name="manu">Corsair Microsystems Inc.</field>
        <field name="cat">electronics</field>
        <field name="cat">memory</field>
        <field name="features">CAS latency 2, 2-3-3-6 timing, 2.75v, unbuffered, heat-spreader</field>
        <field name="price">185</field>
        <field name="popularity">5</field>
        <field name="inStock">true</field>
      </doc>
      <doc>
        <field name="id">VS1GB400C3</field>
        <field name="name">CORSAIR ValueSelect 1GB 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) System Memory - Retail</field>
        <field name="manu">Corsair Microsystems Inc.</field>
        <field name="cat">electronics</field>
        <field name="cat">memory</field>
        <field name="price">74.99</field>
        <field name="popularity">7</field>
        <field name="inStock">true</field>
      </doc>
    </add>

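The four request types described above each map to a small XML message. A minimal Python sketch that builds them follows. The helper names (`build_add`, `build_commit`, `build_optimize`, `build_delete`) are hypothetical, and the `<commit/>`, `<optimize/>`, and `<delete>` shapes are taken from Solr's XML update message format rather than from this post, so treat them as assumptions to check against your Solr version.

```python
import xml.etree.ElementTree as ET


def build_add(docs):
    """Build an <add> message from a list of dicts.

    A list or tuple value becomes repeated <field> elements, which is how
    the multi-valued "cat" field appears in the add request above.
    """
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            values = value if isinstance(value, (list, tuple)) else [value]
            for v in values:
                field = ET.SubElement(doc_el, "field", name=name)
                field.text = str(v)
    return ET.tostring(add, encoding="unicode")


def build_commit():
    """Message that makes pending adds and deletes searchable."""
    return "<commit/>"


def build_optimize():
    """Message that asks Solr to restructure the index files."""
    return "<optimize/>"


def build_delete(doc_id=None, query=None):
    """Build a <delete> message by id or by query (exactly one is required)."""
    if (doc_id is None) == (query is None):
        raise ValueError("pass exactly one of doc_id or query")
    delete = ET.Element("delete")
    child = ET.SubElement(delete, "id" if doc_id is not None else "query")
    child.text = doc_id if doc_id is not None else query
    return ET.tostring(delete, encoding="unicode")


if __name__ == "__main__":
    print(build_add([{"id": "VS1GB400C3", "cat": ["electronics", "memory"]}]))
    print(build_delete(doc_id="VS1GB400C3"))
    print(build_commit())
```

Each returned string would be sent as the body of a POST to the update URL, followed by a commit so the change becomes searchable.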
A typical search response:

    <response>
        <lst name="responseHeader">
            <int name="status">0</int>
            <int name="QTime">6</int>
            <lst name="params">
                <str name="rows">10</str>
                <str name="start">0</str>
                <str name="fl">*,score</str>
                <str name="hl">true</str>
                <str name="q">content:"faceted browsing"</str>
            </lst>
        </lst>

        <result name="response" numFound="1" start="0" maxScore="1.058217">
            <doc>
                <float name="score">1.058217</float>
                <arr name="all">
                    <str>http://localhost/myBlog/solr-rocks-again.html</str>
                    <str>Solr is Great</str>
                    <str>solr,lucene,enterprise,search,greatness</str>
                    <str>Solr has some really great features, like faceted browsing and replication</str>
                </arr>
                <arr name="content">
                    <str>Solr has some really great features, like faceted browsing and replication</str>
                </arr>
                <date name="creationDate">2007-01-07T05:04:00.000Z</date>
                <arr name="keywords">
                    <str>solr,lucene,enterprise,search,greatness</str>
                </arr>
                <int name="rating">8</int>
                <str name="title">Solr is Great</str>
                <str name="url">http://localhost/myBlog/solr-rocks-again.html</str>
            </doc>
        </result>

        <lst name="highlighting">
            <lst name="http://localhost/myBlog/solr-rocks-again.html">
                <arr name="content">
                    <str>Solr has some really great features, like <em>faceted</em>
                    <em>browsing</em> and replication</str>
                </arr>
            </lst>
        </lst>
    </response>

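A client usually needs only a few pieces of such a response: the status, the hit count, and the fields of each document. The sketch below pulls those out with Python's standard xml.etree.ElementTree. The function name `parse_response` and the shortened `SAMPLE` document are made up for illustration, and all values are left as strings rather than converted by type.

```python
import xml.etree.ElementTree as ET

# Shortened version of the response shown above (illustrative only).
SAMPLE = """<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">6</int>
  </lst>
  <result name="response" numFound="1" start="0" maxScore="1.058217">
    <doc>
      <float name="score">1.058217</float>
      <arr name="keywords">
        <str>solr,lucene,enterprise,search,greatness</str>
      </arr>
      <int name="rating">8</int>
      <str name="title">Solr is Great</str>
      <str name="url">http://localhost/myBlog/solr-rocks-again.html</str>
    </doc>
  </result>
</response>"""


def parse_response(xml_text):
    """Return status, hit count, and documents from a Solr XML response."""
    root = ET.fromstring(xml_text)
    status = root.find("lst[@name='responseHeader']/int[@name='status']")
    result = root.find("result")
    docs = []
    for doc_el in result.findall("doc"):
        doc = {}
        for child in doc_el:
            name = child.get("name")
            if child.tag == "arr":
                # Multi-valued field: collect every child value.
                doc[name] = [item.text for item in child]
            else:
                doc[name] = child.text
        docs.append(doc)
    return {
        "status": int(status.text),
        "num_found": int(result.get("numFound")),
        "docs": docs,
    }


if __name__ == "__main__":
    parsed = parse_response(SAMPLE)
    print(parsed["num_found"], parsed["docs"][0]["title"])
```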
For detailed Solr usage, see
http://wiki.apache.org/solr/FrontPage

4. Testing Solr

The Solr distribution includes test samples under apache-solr-1.2.0/example/exampledocs

1. Test Solr operations with the shell script (curl):

    cd apache-solr-1.2.0/example/exampledocs
    vi post.sh    # set the URL variable to match Tomcat's ip and port, e.g. URL=http://localhost:8080/solr/update
    ./post.sh *.xml

2. Test Solr operations with Solr's Java tool:

Show help:

    java -jar post.jar -help

Submit the test data:

    java -Durl=http://localhost:8080/solr/update -Ddata=files -jar post.jar *.xml

The following adds the index fields liangchuan and url as an example of using Solr's indexing commands.

1) Edit Solr's schema to declare the fields to index:

    vi /opt/solr-tomcat/solr/conf/schema.xml

Add the following inside <fields>:

    <field name="liangchuan"  type="string" indexed="true" stored="true"/>
    <field name="url"  type="string" indexed="true" stored="true"/>

2) Create an XML test file for the add request

    touch /root/apache-solr-1.2.0/example/exampledocs/liangchuan.xml

with the following content:

    <add>
      <doc>
        <field name="id">liangchuan000</field>
        <field name="name">Solr, the Enterprise Search Server</field>
        <field name="manu">Apache Software Foundation</field>
        <field name="liangchuan">liangchuan's solr "hello,world" test</field>
        <field name="url">http://www.google.com</field>
      </doc>
    </add>

3) Submit the index request

    cd apache-solr-1.2.0/example/exampledocs
    ./post.sh liangchuan.xml

4) Query

Query through the Solr admin interface at http://localhost:8080/solr/admin
or test with curl:

    export URL="http://localhost:8080/solr/select/"
    curl "$URL?indent=on&q=liangchuan&fl=*,score"

5. Solr Query Parameters

Each parameter is listed with its description and an example.

q
    Description: The query Solr searches with. Sort information can be included by appending a semicolon and the name of an indexed, untokenized field. The default sort is score desc, i.e. descending by score.
    Example: q=myField:Java AND otherField:developerWorks; date asc
    This query searches the two specified fields and sorts the results by a date field.

start
    Description: Specifies the initial offset into the result set. Useful for paging through results. Defaults to 0.
    Example: start=15
    Returns results starting at offset 15, that is, skipping the first 15 results.

rows
    Description: Maximum number of documents to return. Defaults to 10.
    Example: rows=25

fq
    Description: Provides an optional filter query. The results are restricted to documents that match the filter query. Filter queries are cached by Solr and are very useful for speeding up complex queries.
    Example: any valid query that could be passed in the q parameter, excluding sort information.

hl
    Description: When hl=true, highlighted snippets are included in the query response. Defaults to false. See the Solr Wiki section on highlighting parameters for more options.
    Example: hl=true

fl
    Description: Specifies, as a comma-separated list, the set of fields to return in the document results. Defaults to "*", meaning all fields. "score" means the score should be returned as well.
    Example: *,score

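The parameters above combine into a single select URL. The following Python sketch assembles one with urllib.parse.urlencode, which also takes care of escaping characters such as spaces, colons, and the "*" in fl. The function names `build_select_url` and `page_to_start` are hypothetical helpers, and the base URL is the one used in the curl example in this guide.

```python
from urllib.parse import urlencode

SELECT_URL = "http://localhost:8080/solr/select/"


def page_to_start(page, rows):
    """Convert a 1-based page number into the zero-based start offset."""
    return (page - 1) * rows


def build_select_url(q, start=0, rows=10, fq=None, hl=False, fl="*,score",
                     base_url=SELECT_URL):
    """Assemble a select URL from the query parameters described above."""
    params = [("q", q), ("start", start), ("rows", rows), ("fl", fl)]
    if fq:
        params.append(("fq", fq))
    if hl:
        params.append(("hl", "true"))
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Third page of 25 results per page, with highlighting switched on.
    print(build_select_url("liangchuan", start=page_to_start(3, 25),
                           rows=25, hl=True))
```

Keeping the restriction in fq rather than folding it into q lets Solr cache the filter separately, which is the benefit the fq description mentions.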
For details on Solr query parameters, see:
http://wiki.apache.org/solr/CommonQueryParameters

The format of the q query parameter is the same as Lucene's query syntax; see:
http://lucene.apache.org/java/docs/queryparsersyntax.html


6. Solr Usage Patterns in the Portal Community

Solr can be used in the portal community in the following ways:

For data that already exists in the current system, or when the volume of data to index is large

Submitting such data through Solr's regular HTTP update interface is inefficient. Use Solr's built-in CSV support (http://wiki.apache.org/solr/UpdateCSV) instead: export the data as CSV, then call Solr's CSV interface at http://localhost:8080/solr/update/csv

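A bulk export along these lines could be sketched in Python as follows. The helper names `rows_to_csv` and `make_csv_request` are hypothetical, the sample row is made up, and the text/plain content type is an assumption to verify against the UpdateCSV page. The request is only built here, not sent, so the sketch runs without a Solr server.

```python
import csv
import io
import urllib.request

CSV_URL = "http://localhost:8080/solr/update/csv"  # URL from the text above


def rows_to_csv(fieldnames, rows):
    """Serialize dict rows to CSV text, with a header line of field names."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def make_csv_request(csv_text, url=CSV_URL):
    """Build (but do not send) a POST request carrying the CSV body."""
    return urllib.request.Request(
        url,
        data=csv_text.encode("utf-8"),
        headers={"Content-Type": "text/plain; charset=utf-8"},
    )


if __name__ == "__main__":
    rows = [{"id": "liangchuan000", "name": "Solr, the Enterprise Search Server"}]
    text = rows_to_csv(["id", "name"], rows)
    print(text)
    # Sending needs a running Solr instance:
    # urllib.request.urlopen(make_csv_request(text))
```

The csv module quotes any value that contains a comma, so field values such as the name above survive the round trip intact.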
For data newly added to the system

First assemble the data to be indexed into XML, then use httpclient to submit it to Solr's HTTP interface, for example
http://localhost:8080/solr/update

The SimplePostTool implementation in post.jar can also serve as a reference.
http://svn.apache.org/viewvc/luc ... stTool.java?view=co
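The text above names httpclient, a Java library. As a rough illustration of the same flow, the sketch below swaps in Python's standard urllib.request instead. It is not the author's implementation: the function names `make_update_request` and `post_update` are hypothetical, and the text/xml content type is an assumption.

```python
import urllib.request

UPDATE_URL = "http://localhost:8080/solr/update"  # URL from the text above


def make_update_request(xml_text, url=UPDATE_URL):
    """Build a POST request whose body is an XML update message."""
    return urllib.request.Request(
        url,
        data=xml_text.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )


def post_update(xml_text, url=UPDATE_URL):
    """Send the message and return Solr's response body (needs a live server)."""
    with urllib.request.urlopen(make_update_request(xml_text, url)) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    add_message = (
        "<add><doc>"
        '<field name="id">liangchuan000</field>'
        '<field name="url">http://www.google.com</field>'
        "</doc></add>"
    )
    req = make_update_request(add_message)
    print(req.get_method(), req.full_url)
    # With Solr running, send the add and then a commit:
    # post_update(add_message)
    # post_update("<commit/>")
```

The commit is sent as a separate request because, as described in section 3, added documents are not searchable until a commit.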

Chinese word segmentation

Use Paoding (庖丁解牛) as the default Chinese word segmentation scheme for Solr (Lucene)
Project: http://code.google.com/p/paoding/
Google Groups: http://groups.google.com/group/paoding
JavaEye group: http://analysis.group.javaeye.com/


Integration with Nutch
http://blog.foofactory.fi/2007/0 ... ing-nutch-with.html

Embedded Solr
http://wiki.apache.org/solr/Solrj#EmbeddedSolrServer

Distributed indexing
http://wiki.apache.org/solr/CollectionDistribution


7. References

http://wiki.apache.org/solr/
http://www.ibm.com/developerworks/cn/java/j-solr1/
http://www.ibm.com/developerworks/cn/java/j-solr2/
http://www.xml.com/pub/a/2006/08 ... andrest.html?page=1
http://lucene.apache.org/java/docs/queryparsersyntax.html
http://www.blogjava.NET/RongHao/archive/2007/11/06/158621.html