SimpleDevelopment


My Little Development Blog

Web Crawl and Search Engine with Nutch (1.5) and Solr (3.6)

Nutch, Search Engine, Solr, Web Crawl, Web Development

I have been searching Google for quite some time on how to combine Nutch and Solr. There are many posts online describing different ways to do it, but most of them are so complex that I got lost while reading. Some introduce far too many steps or bring in extra software such as Tomcat. Here I want to show a simple approach that gets the job done on Linux, even though these are older versions of Nutch and Solr.

First things first: install Java on Linux and set up the path so that the current user can access it. In my simple case I used the root user, so I set JAVA_HOME in /etc/profile. Note that JAVA_HOME should point to the Java installation folder (the one that contains bin), and that bin folder should be on your PATH. As a best practice, set it in .bash_profile in your home directory instead.
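As a rough sketch of that Java setup, the lines below could go into .bash_profile. The path /usr/lib/jvm/java-7-openjdk is only an assumed example and will differ between distributions and Java versions, so replace it with wherever Java lives on your machine.

```shell
# Assumed install location; replace it with the folder that contains bin/java on your system.
export JAVA_HOME="/usr/lib/jvm/java-7-openjdk"

# Put Java's bin folder first on PATH so that "java" resolves for the current user.
export PATH="$JAVA_HOME/bin:$PATH"
```

After logging in again (or sourcing the file), running java -version should confirm that the shell can find Java.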

Second, install Nutch 1.5 (it is easy to find through Google; the same goes for Solr 3.6). In my case, I unzipped it under /usr/local (the same place as the Solr installation).

Once Nutch is in place, let us configure it for your first web crawl.

1. Go to the installation folder, open the “conf” folder and edit “nutch-site.xml“. Put the following between the <configuration> and </configuration> tags:

<property>
 <name>http.agent.name</name>
 <value>My Nutch Spider</value>
</property>

2. In the same “conf” folder, edit “regex-urlfilter.txt“. At the end of the file, change “+.” to “+^http://([a-z0-9]*\.)*the-site-to-crawl.com/” (Note: the-site-to-crawl.com stands for the site you are about to crawl).

3. Back in the installation root folder, create a new folder called “urls“, and inside it create a new file “seed.txt” containing the URL of the website you want to crawl. In our case, use “http://the-site-to-crawl.com“.
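To get a feel for what the filter rule from step 2 lets through, here is a small sketch that tries the same pattern against a few sample URLs. Nutch evaluates the rule as a Java regular expression, so grep -E is only an approximation (it happens to behave the same for this simple pattern), and check_url is just a helper name I made up for the illustration.

```shell
# The leading "+" in regex-urlfilter.txt means "include"; only the part after it is the regex.
pattern='^http://([a-z0-9]*\.)*the-site-to-crawl.com/'

# Print whether a URL would pass the include rule above.
check_url() {
  if printf '%s\n' "$1" | grep -Eq "$pattern"; then
    echo "accept $1"
  else
    echo "reject $1"
  fi
}

check_url "http://the-site-to-crawl.com/"
check_url "http://blog.the-site-to-crawl.com/about"
check_url "http://another-site.com/"
```

The first two URLs are accepted (the site itself and one of its subdomains) and the third is rejected, which is the point of the rule: it keeps the crawl from wandering off to other sites.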

Once everything is done, it is time to use the crawl command, e.g. “bin/nutch crawl urls -dir crawl -depth 3 -topN 5”
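Putting the seed file and the crawl command together, a minimal sketch could look like the following. It assumes you run it from the Nutch installation root; the check for bin/nutch is my own addition so the script fails gently when it is started from the wrong folder.

```shell
# Run from the Nutch installation root (the folder that contains bin/ and conf/).
mkdir -p urls

# One seed URL per line; this is the placeholder domain used in this post.
echo "http://the-site-to-crawl.com" > urls/seed.txt

# Only start the crawl if the Nutch launcher is actually present here.
if [ -x bin/nutch ]; then
  bin/nutch crawl urls -dir crawl -depth 3 -topN 5
else
  echo "bin/nutch not found; run this from the Nutch installation root" >&2
fi
```

Here -dir names the folder that receives the crawl data, -depth is how many link levels to follow from the seed page, and -topN caps the number of pages fetched at each level. Small values like these keep a first test run short.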

Here is also my favorite reference, the Nutch Tutorial: http://wiki.apache.org/nutch/NutchTutorial

After the Nutch experiment above, we can start to work on Solr. In my case, I also put the unzipped Solr files under /usr/local. For the simple approach, we can just use the example setup that ships in the default Solr installation folder. To get Nutch working with Solr, we only need to copy the “schema.xml” file from {Nutch_Installation_folder}/conf/schema.xml to {Solr_Installation_folder}/example/solr/conf/schema.xml (Note: different Solr versions may have a different folder structure, and you may want to back up the default Solr schema.xml first).
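The copy step, including the backup, can be sketched as below. The folder names apache-nutch-1.5 and apache-solr-3.6.0 are assumptions about how the archives unpack under /usr/local, and install_nutch_schema is a helper name invented for this example; adjust both to your own layout.

```shell
# Hypothetical install locations following the /usr/local layout used in this post.
NUTCH_HOME="${NUTCH_HOME:-/usr/local/apache-nutch-1.5}"
SOLR_HOME="${SOLR_HOME:-/usr/local/apache-solr-3.6.0}"

# Copy the Nutch schema into the Solr example setup, keeping a backup of the original.
install_nutch_schema() {
  src="$1/conf/schema.xml"
  dest="$2/example/solr/conf/schema.xml"
  # Keep the stock Solr schema so it can be restored later; never overwrite an existing backup.
  if [ -f "$dest" ] && [ ! -f "$dest.bak" ]; then
    cp "$dest" "$dest.bak"
  fi
  cp "$src" "$dest"
}

# Only act when both locations look the way this sketch assumes.
if [ -f "$NUTCH_HOME/conf/schema.xml" ] && [ -d "$SOLR_HOME/example/solr/conf" ]; then
  install_nutch_schema "$NUTCH_HOME" "$SOLR_HOME"
fi
```

To go back to the default Solr schema, copy schema.xml.bak over schema.xml and restart Solr.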

After everything is ready, we can launch a Solr instance with the default example settings by going to the {Solr_Installation_folder}/example/ folder and running java -jar start.jar (Note: once Solr has started, you can open http://localhost:8983/solr/ to see the Solr Admin page). With Solr running, run the following command from the Nutch installation folder (in a second terminal, since Solr keeps running in the first one):

“bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5”

With the above command, you can see in the terminal that Nutch starts to crawl again; after crawling, the indexed data is stored in Solr. You can then view and query the index in the Solr Admin page.
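Besides the Admin page, the index can also be checked from the command line. The sketch below builds a URL for Solr's select handler and fetches it with curl, assuming Solr is running at the default example address used above. solr_select_url is a helper name I made up, and it expects a query string that is already URL-safe.

```shell
SOLR_URL="http://localhost:8983/solr"

# Build a select URL for a query string (assumed to be URL-safe already in this sketch).
solr_select_url() {
  printf '%s/select?q=%s&wt=json&indent=true&rows=5\n' "$SOLR_URL" "$1"
}

# "*:*" matches every document, which is enough to confirm the crawl indexed something.
url="$(solr_select_url '*:*')"
echo "$url"

# Fetch the first few documents if curl is available; complain gently if Solr is not up.
if command -v curl >/dev/null 2>&1; then
  curl -s --max-time 5 "$url" || echo "Solr is not reachable at $SOLR_URL" >&2
fi
```

If the crawl and indexing worked, the JSON response should report a numFound value greater than zero.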

The steps above cover only the basic usage of Nutch and Solr. I will try to come up with a better example that uses Nutch and Solr to build a more advanced search engine for a WordPress site.

Cheers and Happy Web Development.

 




© Simple2kx 2018
