Exploring Web Mining Tools: A Review of Weka and Analysis of Web Structure Mining


Introduction

Web mining, a critical aspect of data analysis, has been segmented into three distinct categories: Web Content Mining, Web Structure Mining, and Web Usage Mining. Among these, Web Structure Mining, the focus of this study, investigates the hyperlinks of websites to extract meaningful patterns and relationships. This paper reviews Weka, a prominent tool in web mining, and further explores the tools and techniques prevalent in Web Structure Mining.

Weka: An Open-Source Tool for Web Mining

Weka (Waikato Environment for Knowledge Analysis) is open-source software issued under the GNU General Public License. It contains tools for data preparation, classification, regression, clustering, association rule mining, and visualization.

Pros:

  • Weka is known for offering many classification techniques, including artificial neural networks and decision tree algorithms such as ID3 and C4.5.
  • Multi-purpose tool. It can be used for web-based mining as well as for research and educational purposes.
  • Based on Java, which is relatively easy to learn even for someone from a non-technical background.
  • Continuously researched and developed.

    The development team at the University of Waikato continues to maintain and extend the tool.

  • Provides visualization features, such as graphs.

Cons:

  • Less powerful when it comes to other techniques, such as clustering.
  • Slow when loading a large amount of data, because the tool tries to load all of it into memory. To work around this, Weka offers a simple command-line interface (CLI) that makes it easier to handle large amounts of data.

Web mining is often divided into three categories:

  • Web Content Mining
  • Web Structure Mining
  • Web Usage Mining

Pick one of the areas above (or find a new one) and answer the following:

Chosen Area: Web Structure Mining. Structure mining analyzes the hyperlinks of a website to collect informative data and sort it into categories based on similarities and relationships.


Link analysis is a useful method for assessing the importance of a web page. Structure analysis is also called link mining.

Tools that use link-analysis algorithms (most commonly the PageRank and HITS algorithms) include:

  • Majestic: Provides reliable current and historical data for analyzing the performance of your websites, and gives a site's ranking in terms of backlinks. https://majestic.com/
  • Google PR Checker: Google PageRank (PR) is a score from 0 to 10 assigned to a web page, based on its backlinks. https://checkpagerank.net/index.php
  • Bixo: An open-source web mining tool that runs a series of Cascading pipes on top of Hadoop. https://github.com/bixo/bixo
  • Link Viewer: There was only one download link and it did not work.

Which tools are easiest to learn? Do they come with sample data / tutorial?

Answer:

  • Majestic: This is a paid tool that shows no more than a summary unless you upgrade to a subscription. Good documentation and some video tutorials are available on the official website, but the tool uses its own trademarked metrics, "trust flow" and "citation flow".
  • Google PR Checker: This is an online tool that determines the ranking of your page. It uses the PageRank algorithm, which is fairly easy to learn, and many video tutorials are available. The terms used by this tool are standard compared to those of Majestic.
  • Bixo: There is no documentation to learn this tool.

Some of the algorithms used in Web Structure Mining are as follows:

  • PageRank Algorithm
  • HITS Algorithm
  • Weighted PageRank Algorithm

I will explain the PageRank algorithm, which is used by the Google search engine. Google PR Checker uses the same algorithm and gives a score based on the significance of a page.

PageRank (PR) Algorithm

The PageRank approach uses the number of pages linking to a specific web page to calculate the importance of that page. These incoming links are known as backlinks. Example: A link from page A to page D is considered a vote for D. If the backlink to D comes from a key or important page (say, A), then this link carries a higher vote than links coming from unimportant pages.

Example: If your personal blog is linked from your Facebook profile, the value of your blog tends to rise under this model, because the link comes from one of the most popular websites.

The simplest calculation is the case of pages with no links. Suppose A, B, and C are three pages that are not linked at all. Then,

PR(A) = PR(B) = PR(C) = (1 - d)

where 1 - d is the minimal PageRank value.

All pages then have the same PageRank, and the result is independent of the number of web pages.

When the pages are connected, the full calculation is needed.

How is the calculation done?

The first thing to compute is the number of links pointing to every web page. The algorithm is based on the idea of a "random surfer" who follows links at random. The web is modeled as a Markov chain, a random process whose next state depends only on the current state of the system and not on its history, in which the user can click on any link on the current page.

The calculation uses the formula:

PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))    (1)

Where,

T1 ... Tn are the pages linking to page A,

C(Ti) is the number of outbound links that page Ti has,

d is the damping factor, which is usually set to 0.85.
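As a rough illustration, one application of formula (1) can be sketched in Python. The three-page graph, the starting value of 1.0 for every page, and the function name pagerank_step are assumptions made for this example only; they do not come from any particular tool.

```python
# Hypothetical link graph: each page maps to the pages it links to.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}


def pagerank_step(links, ranks, d=0.85):
    """Apply formula (1) once to every page, using the current ranks."""
    new_ranks = {}
    for page in links:
        # Sum PR(T)/C(T) over every page T that links to this page.
        incoming = sum(
            ranks[src] / len(targets)
            for src, targets in links.items()
            if page in targets
        )
        new_ranks[page] = (1 - d) + d * incoming
    return new_ranks


# Start every page at 1.0 (an arbitrary initial guess).
start = {page: 1.0 for page in links}
ranks = pagerank_step(links, start)
print(ranks)
```

After this single pass, C holds the highest value because it receives links from both A and B, while B receives only half of A's vote.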

The PR of each page depends on the PR of the pages pointing to it. But we won't know the PR of those pages until the pages pointing to them have a PR. According to Google, we can calculate a page's PR without knowing the final value of the PR of the other pages. Because the calculation is iterative, we can start from a guess and repeat the calculation, getting a closer estimate on each pass, until the values stop changing noticeably.
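The iterative idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used by any search engine: the graph, the starting value of 1.0, the tolerance tol, and the cap max_iter are all hypothetical choices.

```python
def pagerank(links, d=0.85, tol=1e-8, max_iter=1000):
    """Repeat formula (1) until no page's value changes by more than tol."""
    ranks = {page: 1.0 for page in links}  # arbitrary initial guess
    for iteration in range(1, max_iter + 1):
        new_ranks = {}
        for page in links:
            # Sum PR(T)/C(T) over every page T that links to this page.
            incoming = sum(
                ranks[src] / len(targets)
                for src, targets in links.items()
                if page in targets
            )
            new_ranks[page] = (1 - d) + d * incoming
        # Largest change of any single page during this pass.
        change = max(abs(new_ranks[p] - ranks[p]) for p in links)
        ranks = new_ranks
        if change < tol:
            break
    return ranks, iteration


# Hypothetical link graph: each page maps to the pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
final_ranks, iterations = pagerank(links)
print(final_ranks, iterations)
```

Each pass uses only the values from the previous pass, which is why no page's final PR needs to be known in advance.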

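It is interesting to note that once the PageRank calculations are done, the average PageRank over all pages is 1, provided every page has at least one outbound link. Pages without outbound links pass no rank on, so the average drops below 1; in the unlinked example above it is 1 - d.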
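The claim about the average can be checked with a short sketch. Both graphs below are hypothetical, and the fixed count of 200 passes is an assumption chosen to be comfortably large for graphs this small. The first graph gives every page at least one outbound link; the second is the unlinked case, where formula (1) leaves each page at 1 - d.

```python
def pagerank(links, d=0.85, iterations=200):
    """Repeat formula (1) a fixed number of times, starting from 1.0."""
    ranks = {page: 1.0 for page in links}
    for _ in range(iterations):
        ranks = {
            page: (1 - d) + d * sum(
                ranks[src] / len(targets)
                for src, targets in links.items()
                if page in targets
            )
            for page in links
        }
    return ranks


def average(ranks):
    """Mean PageRank over all pages."""
    return sum(ranks.values()) / len(ranks)


# Every page has at least one outbound link.
linked = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
# No page links anywhere, as in the unlinked example.
unlinked = {"A": [], "B": [], "C": []}

avg_linked = average(pagerank(linked))
avg_unlinked = average(pagerank(unlinked))
print(avg_linked, avg_unlinked)
```

The first average comes out at 1 up to rounding error, and the second at 1 - d.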

Conclusion

Weka offers a solid foundation for various web mining tasks, despite its limitations in handling large volumes of data and clustering. Web Structure Mining, powered by tools like Majestic and algorithms like the Page Rank Algorithm, plays a crucial role in understanding the web's intricate link structure. This exploration not only sheds light on the capabilities and limitations of Weka but also underscores the significance of Web Structure Mining in today's data-driven landscape.

Updated: Feb 23, 2024
Cite this page

Exploring Web Mining Tools: A Review of Weka and Analysis of Web Structure Mining. (2024, Feb 23). Retrieved from https://studymoose.com/document/exploring-web-mining-tools-a-review-of-weka-and-analysis-of-web-structure-mining
