Parsing Web Pages


Parsing a web page is also known as screen scraping. The task is to identify certain portions of a web page and extract the data to fit your needs. For example, if you went to a website that showed today’s stock totals, you might want just the totals so that you can export them into an Excel spreadsheet.

If you are a programmer or familiar with programming languages, you could of course hack through and create your own algorithm to parse the HTML tags. Unfortunately, there is no standard way to display information on any given web page. Some webmasters, for example, place tabular (row/column) information into an HTML tag called table. Others, including most of the newer web designers, now display this type of information in HTML tags known as div. But within the tables or divs there could be other tags. So if you really want to parse web pages yourself, you should first have a grasp of the HTML language.

Just go to the local book store and pick up “HTML for Dummies”.
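
To make the table-versus-div point above concrete, here is a minimal sketch that pulls the same value out of both kinds of markup by looking for a class name instead of a tag name. It assumes PHP's bundled DOM extension is available; the two markup samples and the helper name firstTextByClass are made up for illustration.

```php
<?php
// Hypothetical markup: the same amount published once in a table cell
// and once in nested divs with an extra <b> tag inside.
$tableHtml = '<table><tr><td class="tableColumnNumeric">$2139.01</td></tr></table>';
$divHtml   = '<div class="row"><div class="tableColumnNumeric"><b>$2139.01</b></div></div>';

// Return the text of the first element carrying the given class,
// whatever its tag name is. Returns null when nothing matches.
function firstTextByClass(string $html, string $class): ?string
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Match the class as a whole word inside the class attribute.
    $query = "//*[contains(concat(' ', normalize-space(@class), ' '), ' $class ')]";
    $nodes = $xpath->query($query);

    return $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;
}

echo firstTextByClass($tableHtml, 'tableColumnNumeric'), "\n";
echo firstTextByClass($divHtml, 'tableColumnNumeric'), "\n";
```

Both calls print $2139.01, even though one page uses a table and the other uses divs with a nested tag.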

Typical programming languages used for this today are PHP and Perl, although there are a few others. Here is a small sample of a PHP program that reads a web page and then parses it to get some totals.

This little php web page parsing code will return the value for the current week.

html source code:

<table>
	<tr>
		<td class="tableColumnDate" >
		2009-10-21 (current week) </td>
		<td class="tableColumnNumeric">$2139.01</td>
	</tr>
	<tr>
		<td class="tableColumnDate" >
		2009-10-14 </td>
		<td class="tableColumnNumeric">$1345.58</td>
	</tr>
	<tr>
		<td class="tableColumnDate"  >
		2009-10-07 </td>
		<td class="tableColumnNumeric">$3566.32</td>
	</tr>
	<tr>
		<td class="tableColumnDate"  >
		2009-09-30 </td>
		<td class="tableColumnNumeric">$4530.00</td>
	</tr>
	<tr>
		<td class="tableColumnDate"  >
		2009-09-23 </td>
		<td class="tableColumnNumeric">$2296.85</td>
	</tr>
</table>

rendered output:

2009-10-21 (current week) $2139.01
2009-10-14 $1345.58
2009-10-07 $3566.32
2009-09-30 $4530.00
2009-09-23 $2296.85

php code:

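// Load the saved HTML source.
$results = file_get_contents("/tmp/htmlsource.txt");

// Match from each class="tableColumnNumeric" attribute to the end of its line.
preg_match_all("/class=\"tableColumnNumeric\".*/", $results, $value);

// $value[0] holds the full matches; drop any repeated lines.
$cells = array_unique($value[0]);

// The first match is the current week. Remove the closing tag.
$t = strip_tags($cells[0]);

// split() was removed in PHP 7; explode() splits on the literal dollar sign.
$nval = explode('$', $t);

echo "Status: $nval[1]\n";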

This little snippet parses the web page output for the information I was looking for. On a very complex web page, though, and depending on how much information you want to scrape off of it, writing a web page parsing program could take you quite some time.
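
A regular expression is fine for a single value, but a DOM parser is one alternative when you want every row. The following is a minimal sketch, not the only way to do it, using PHP's bundled DOM extension (DOMDocument and DOMXPath). The function name parseTotals and the inline two-row sample are assumptions made for illustration; the class names come from the HTML source shown earlier.

```php
<?php
// Parse every row of the sample table into [date, amount] pairs.
function parseTotals(string $html): array
{
    $doc = new DOMDocument();
    // Collect parser warnings quietly instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $rows = [];
    foreach ($xpath->query('//tr') as $tr) {
        $date = $xpath->query('./td[@class="tableColumnDate"]', $tr)->item(0);
        $amount = $xpath->query('./td[@class="tableColumnNumeric"]', $tr)->item(0);
        if ($date === null || $amount === null) {
            continue; // skip rows that do not have both cells
        }
        // Keep only the YYYY-MM-DD part, dropping notes such as "(current week)".
        if (!preg_match('/\d{4}-\d{2}-\d{2}/', $date->textContent, $m)) {
            continue;
        }
        // Drop the dollar sign and any thousands separators.
        $rows[] = [$m[0], str_replace(['$', ','], '', trim($amount->textContent))];
    }
    return $rows;
}

// Two rows copied from the sample HTML source.
$html = <<<'HTML'
<table>
	<tr>
		<td class="tableColumnDate" >
		2009-10-21 (current week) </td>
		<td class="tableColumnNumeric">$2139.01</td>
	</tr>
	<tr>
		<td class="tableColumnDate" >
		2009-10-14 </td>
		<td class="tableColumnNumeric">$1345.58</td>
	</tr>
</table>
HTML;

// Print comma-separated lines that a spreadsheet can import.
echo "week,total\n";
foreach (parseTotals($html) as $row) {
    echo implode(',', $row), "\n";
}
```

For the two sample rows this prints a header line followed by 2009-10-21,2139.01 and 2009-10-14,1345.58. A plain comma join is enough for values this simple; for messier data, PHP's fputcsv() handles quoting.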

I recently found a program online that will do small and large web page parsing functions for me with just a few clicks of the mouse.

Click the link to watch the video of how to parse web page data.
Parsing Web Pages Demo

  1. #1 by frankg - October 18th, 2009 at 11:32

    I looked at that web parsing program you mentioned and it is a huge time saver. In the past I have used the Java-based web page parser library from SourceForge. It worked very well for me, but you must know or understand Java. Here is a quote from their webpage: “HTML Parser is a Java library used to parse HTML in either a linear or nested fashion. Primarily used for transformation or extraction, it features filters, visitors, custom tags and easy to use JavaBeans. It is a fast, robust and well tested package.”

    For your readers to review the HTML Parser program, here’s the link.
    http://htmlparser.sourceforge.net/

  2. #2 by rsmith - October 18th, 2009 at 12:15

    You can also do the parsing in Python. Parsing web pages with Python, another popular scripting language alongside PHP and Perl, can give you great functional control of the output you are looking for.

    There are a number of Python tools out there that can help parse data structures out of a web page, log file or general text file. Just to name a few of these tools:

    PyGgy
    PyGgy is a Python package for generating parsers and lexers in Python. The PyGgy distribution contains two tools:

    • PyLly – (Pronounced "pile-ey") A lexer generator that generates DFA tables for lexing tokens.
    • PyGgy – (Pronounced "piggy") A parser generator that generates SLR tables for a GLR parsing engine.

    Pyparsing
    pyparsing is a general parsing module for Python. Grammars are implemented directly in the client code using parsing objects, instead of externally, as with lex/yacc-type tools. Includes simple examples for parsing SQL, CORBA IDL, and 4-function math.

    picoparse
    Picoparse is a very small parser / scanner library for Python. It is built to make constructing parsers straightforward, without the complications regular expressions bring to the table.

    Parsing web pages with Python does require a pretty good understanding of the Python programming language but if you already know it then getting the results should not be a big hassle.

  3. #3 by Jon - October 25th, 2009 at 23:00

    I work for a company that is trying to scrape several sites to keep its pricing competitive with everyone else online. I spent a good month learning the programs out there and using them until I found one that actually worked. I am very new when it comes to web scraping, but not only did I find something that worked, it blew me away. These guys have a great vision, making the system open enough to have the data harvested and then sent back to the user via REST, FTP, or email. Try mozenda.com for your web scraper; I think you will be very happy. I wish I had read this before spending so much time trying the other ones.

    In full disclosure, I have to say that I invested in this company (Mozenda Inc.) after trying their tool. The downside is that they charge a monthly fee.

    Hope this helps,
    Jon
