I have been Googling this combination of Nutch and Solr for quite some time. There are plenty of online posts describing various ways to do it, but most of them are complex and I found myself getting lost while reading. Some introduce far too many steps or pull in extra software like Tomcat. So here I am to show a simple approach to get the job done, even though these are older versions of Nutch and Solr on Linux.
First things first: install Java on Linux, and set up the path properly so the current user can access it. For my simple case I used the root user, so I put “JAVA_HOME” (pointing at the Java installation folder) in the profile under /etc. As a best practice, though, set it in .bash_profile in your home directory.
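For example, a minimal setup in .bash_profile could look like the following sketch (the /usr/local/jdk1.6 path is just an assumption; replace it with wherever your Java actually lives):

# Hypothetical Java install location -- adjust to your own
export JAVA_HOME=/usr/local/jdk1.6
export PATH=$JAVA_HOME/bin:$PATH

After editing, run “source ~/.bash_profile” (or log in again) and verify with “java -version”.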
Secondly, install Nutch 1.5 (it should be easy to find via Google; the same goes for Solr 3.6). In my case, I unzipped it into the /usr/local folder (same for the Solr installation).
Once Nutch is in place, let us configure it for your first web crawl.
1. Go to the installation folder, then into the “conf” folder, and edit “nutch-site.xml”. Put the following between the <configuration> tags:
<property>
  <name>http.agent.name</name>
  <value>My Nutch Spider</value>
</property>
2. At the same “conf” level, edit “regex-urlfilter.txt”. At the end of the file, change “+.” to “+^http://([a-z0-9]*\.)*the-site-to-crawl.com/” (Note: the-site-to-crawl.com stands for the site you are about to crawl.)
3. Back in the installation root folder, create a new folder called “urls”; inside that “urls” folder, create a new file “seed.txt” containing the website domain you want to crawl. In our case, use “http://the-site-to-crawl.com”. (See the command sketch right after this list.)
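If you prefer doing this from the terminal, the steps above boil down to something like the following sketch (assuming Nutch was unzipped to /usr/local/apache-nutch-1.5; adjust the path and domain to your own setup):

cd /usr/local/apache-nutch-1.5        # assumed Nutch install folder
mkdir urls
echo "http://the-site-to-crawl.com" > urls/seed.txt
# Then edit conf/nutch-site.xml and conf/regex-urlfilter.txt as described above.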
Once everything is done, it is time to use the crawl command, e.g. “bin/nutch crawl urls -dir crawl -depth 3 -topN 5”. Here “-depth 3” tells Nutch to crawl three link levels deep from the seed URL, and “-topN 5” caps the number of pages fetched at each level at five.
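When the crawl finishes, Nutch stores its data under the folder given by “-dir”. With the command above you should see something like:

ls crawl/
crawldb  linkdb  segments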
Here is also my favorite reference site for the Nutch tutorial: http://wiki.apache.org/nutch/NutchTutorial
After the Nutch experiment above, we can start working on Solr. In my case, I also put the unzipped Solr files under /usr/local. For the simple approach, we can just use the example setup from the default Solr installation. To get Nutch working with Solr, we only need to copy the “schema.xml” file from {Nutch_Installation_folder}/conf/schema.xml to {Solr_Installation_folder}/example/solr/conf/schema.xml (Note: different Solr versions may have different file structures, and you may also want to back up the default Solr schema.xml first.)
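In shell terms, that copy step could look like this (a sketch assuming the install folders are /usr/local/apache-nutch-1.5 and /usr/local/apache-solr-3.6.0; adjust to your layout):

# Back up the default Solr schema first
cp /usr/local/apache-solr-3.6.0/example/solr/conf/schema.xml \
   /usr/local/apache-solr-3.6.0/example/solr/conf/schema.xml.bak
# Replace it with the Nutch schema
cp /usr/local/apache-nutch-1.5/conf/schema.xml \
   /usr/local/apache-solr-3.6.0/example/solr/conf/schema.xml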
Once everything is ready, we can launch Solr: go to the {Solr_Installation_folder}/example/ folder and run “java -jar start.jar” to start a Solr instance with the default example settings. (Note: after Solr starts, you can open http://localhost:8983/solr/ to see the Solr Admin page.) Back in the Nutch folder, we can then run the following command:
“bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5”
With the above command, you can see in the terminal that Nutch starts to crawl again; after crawling, the indexed data is stored in the Solr collection. You can then view and query the collection from the Solr Admin page.
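You can also query the index directly over HTTP. For example (a sketch; the “select” handler, port, and “content” field are the defaults of the Solr example setup and the Nutch schema):

# Return all indexed documents (XML response by default in Solr 3.6)
curl "http://localhost:8983/solr/select?q=*:*"
# Search the crawled page content for a keyword
curl "http://localhost:8983/solr/select?q=content:nutch"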
The above steps and instructions cover the basic usage of Nutch and Solr. I will try to come up with a better example of using Nutch and Solr to build a more advanced search engine structure for a WordPress site.
Cheers and Happy Web Development.