Web scraping using jsoup

               Smart Techie

In this article, we will see how to scrape the web using jsoup. Before getting into the details, we will look at what web scraping is and what it is used for.

What is web scraping?

Web scraping (also called web harvesting or web data extraction) is a data mining technique for extracting data from websites. Put another way, it converts unstructured data on the web into structured data for analysis.

What are the use cases for web scraping?

Web scraping has many use cases. The major ones are:

  • Research

To gather unstructured data from multiple sources and visualize it for analysis.

  • Market Analysis

To monitor the services or products offered by competitors.

  • Lead Generation

To gather contact details such as email addresses, phone numbers and website URLs of businesses or individuals from sites like justdial.com, yellowpages.com or linkedin.com.

  • To Avoid XSS

To inspect user-submitted data for cross-site scripting (XSS) attacks.

Note: Web scraping may be against the terms of use of some websites.

Now, we will see how to set up the open source Java HTML parser called “jsoup”. First, download the latest jsoup jar from http://jsoup.org/download. In this article, I am using jsoup version 1.7.2.

To demonstrate jsoup, I have created a Java application and added the jsoup jar file to the classpath.

Once the project setup is done, connect to the URL using jsoup and get the HTML content as a Document.


Document doc = Jsoup.connect("http://www.amazon.com/Samsung-XE303C12-A01US-Chromebook-Wi-Fi-11-6-Inch/dp/B009LL9VDG/ref=sr_1_1?ie=UTF8&qid=1366683807&sr=8-1&keywords=laptop").get();
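
The call above relies on jsoup's default connection settings. As a minimal sketch, assuming you want to set the user agent and timeout explicitly, the fetch could be wrapped as shown below. The class name PageFetcher, the user agent string and the 10 second timeout are illustrative choices, not part of the original project. The parseSaved helper is also an assumption: it parses HTML you already have as a string, which is handy for trying out selectors without hitting the site.

```java
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

final class PageFetcher {

    /** Fetches a page over HTTP with an explicit user agent and timeout. */
    static Document fetch(String url) throws IOException {
        return Jsoup.connect(url)
                .userAgent("sample-scraper/1.0") // hypothetical value, identify your client as appropriate
                .timeout(10000)                  // milliseconds; illustrative value
                .get();
    }

    /** Parses HTML that is already in memory; baseUri is used to resolve relative links. */
    static Document parseSaved(String html, String baseUri) {
        return Jsoup.parse(html, baseUri);
    }
}
```

Both methods return the same Document type, so the select calls used in the rest of this article work on either result.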

Now, view the page source of the URL above to find the HTML tags to extract. In this case, we want to extract the product name and the price. In the HTML source, the product name is inside the span tag shown below.

<h1>
<span id="btAsinTitle">Samsung Chromebook (Wi-Fi, 11.6-Inch)</span>
</h1>

So, we need to extract the span tag from the document by calling the select method.

Elements titleElements = doc.select("span[id=btAsinTitle]");

The above code returns all the matching elements. With this CSS selector, only one element matches, because an id is meant to be unique within a page. So, we take the first element and extract the text between the span tags by calling the text method.

String title = titleElements.get(0).text();
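
Note that get(0) throws an IndexOutOfBoundsException when the selector matches nothing, for example after the site changes its markup. A minimal sketch of a more defensive lookup is given below. The class SafeSelect and the method firstText are hypothetical helpers written for this illustration, and the HTML is the snippet shown earlier, parsed from a string so the example runs without a network connection.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class SafeSelect {

    /** Returns the text of the first match, or null when nothing matches. */
    static String firstText(Document doc, String cssQuery) {
        Element first = doc.select(cssQuery).first(); // first() returns null for an empty result
        return first == null ? null : first.text();
    }

    public static void main(String[] args) {
        String html = "<h1><span id=\"btAsinTitle\">Samsung Chromebook (Wi-Fi, 11.6-Inch)</span></h1>";
        Document doc = Jsoup.parse(html);

        System.out.println(firstText(doc, "span[id=btAsinTitle]")); // the product title
        System.out.println(firstText(doc, "span[id=noSuchId]"));    // null, no exception
    }
}
```

Returning null keeps the sketch short; in a real scraper you may prefer to log the missing selector or fail with a descriptive message.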

In the same way, we can extract the price of the product. In the HTML source, the price appears as shown below.

<span id="actualPriceValue">
<b class="priceLarge">$249.00</b>
</span>

So, we need to extract the price from the <b> tag. The extraction code snippet is given below.

Elements priceElements = doc.select("b[class=priceLarge]");

With the above CSS selector, we again get only one element. So, from the first element, we can extract the price.


String price = priceElements.get(0).text();
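
The extracted price is still plain text such as "$249.00". Since the goal of scraping is structured data for analysis, you will usually want it as a number. The sketch below is one way to do that, assuming US-style formatting (a dot as the decimal separator and optional commas as thousands separators). The class PriceParser and the method parseUsd are hypothetical names, not part of jsoup or the original project.

```java
import java.math.BigDecimal;

final class PriceParser {

    /** Converts text like "$249.00" or "$1,299.99" into a BigDecimal. */
    static BigDecimal parseUsd(String priceText) {
        // Keep only digits and the decimal point; drops "$", commas and whitespace.
        String digits = priceText.replaceAll("[^0-9.]", "");
        if (digits.isEmpty()) {
            throw new IllegalArgumentException("No numeric price in: " + priceText);
        }
        return new BigDecimal(digits);
    }

    public static void main(String[] args) {
        System.out.println(parseUsd("$249.00")); // prints 249.00
    }
}
```

BigDecimal is used instead of double so that currency values are kept exactly as scraped.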

The source code used in this article is available at https://github.com/2013techsmarts/Web-Harvester/tree/master/Sample_Jsoup_Proj

In the next article, we will look at one more web harvesting tool.

Keep reading.


I am Siva Prasad Rao Janapati, working as a software developer. I have hands-on experience with ATG Commerce (DAS/DPS/DCS), Mozu commerce, Broadleaf Commerce, Java, JEE, Spring, Play, JPA, Hibernate, Velocity, JMS, Jboss, Weblogic, Tomcat, Jetty, Apache, Apache Solr, Spring Batch, JQuery, NodeJS, SOAP, REST, MySQL, Oracle, Mongo DB, Memcached, HazelCast, Git, SVN, CVS, Ant, Maven, Gradle, Amazon Web services, Rackspace, Quartz, JMeter, Junit, Open NLP, Facebook Graph, Twitter4J, YouTube Gdata, Bazaarvoice, Yotpo, 4-Tell, Alatest, Shopzilla and Linkshare, covering both open source and commercial technologies.

