Web mining, a critical aspect of data analysis, has been segmented into three distinct categories: Web Content Mining, Web Structure Mining, and Web Usage Mining. Among these, Web Structure Mining, the focus of this study, investigates the hyperlinks of websites to extract meaningful patterns and relationships. This paper reviews Weka, a prominent tool in web mining, and further explores the tools and techniques prevalent in Web Structure Mining.
Weka (Waikato Environment for Knowledge Analysis) is open-source software released under the GNU General Public License. It contains tools for data preparation, classification, regression, clustering, association rule mining, and visualization.
Pros:
The development team is based at the University of Waikato and continues to maintain and extend the tool.
Cons:
Web mining is often divided into three categories: Web Content Mining, Web Structure Mining, and Web Usage Mining.
Pick one of the areas above (or find a new one) and answer the following:
Chosen Area: Web Structure Mining. Structure mining analyzes the hyperlinks of a website to collect informative data and sort it into categories such as similarities and relationships.
Link analysis is a useful method for assessing the importance of a web page. Structure analysis is also called link mining.
Most of the tools that use these algorithms (the most common being the PageRank algorithm and the HITS algorithm) are:
Which tools are easiest to learn? Do they come with sample data / tutorial?
Answer:
Some of the algorithms used in Web Structure Mining are as follows:
I will explain the PageRank algorithm, which is used by the Google search engine. Google PR Checker uses the same algorithm and gives a score based on the significance of a page.
The PageRank approach estimates the importance of a web page from the pages that link to it. These incoming links are known as backlinks. Example: a link from page A to page D is counted as a vote for D. If that backlink comes from a key or important page (suppose A is one), it carries more weight than links coming from unimportant pages.
Example: If your Facebook profile links to your personal blog, the value of your blog can be expected to rise under this model, because the link to your website comes from one of the most popular websites.
The simplest calculation is the case of non-linked pages. Suppose A, B, and C are three pages that are not linked at all. Then,
PR(A) = PR(B) = PR(C) = (1 – d)
where 1 - d is the minimal PageRank value.
All pages then have the same PageRank, and the result is independent of the number of web pages.
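The non-linked case can be checked with a few lines of Python. This is only a minimal sketch: the function name `unlinked_pageranks` is made up for illustration, and d = 0.85 is the damping value assumed later in this text.

```python
D = 0.85  # damping factor; 0.85 is the value assumed in this text


def unlinked_pageranks(n_pages, d=D):
    """PageRank values for n_pages pages that have no links between them.

    With no backlinks, the sum over linking pages is empty (0.0),
    so every page is left with only the (1 - d) term.
    """
    return [(1 - d) + d * 0.0 for _ in range(n_pages)]


# The value per page does not depend on how many pages there are.
for n in (3, 10, 1000):
    ranks = unlinked_pageranks(n)
    print(n, "pages -> min", min(ranks), "max", max(ranks))
```

For every choice of n, the minimum and maximum are the same value, 1 - d.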
When the pages are connected, the calculation has to take the links into account.
How is the calculation done?
The first thing to compute is the number of links pointing to each web page. The method is based on the idea of a 'random surfer': the web is seen as a Markov chain, that is, a random process whose next step depends only on the current state of the system and not on its history, and in which the user can click on any link on the current page.
It is based on the formula:
PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))    (1)
Where,
T1 ... Tn are the pages linking to page A,
C(Ti) is the number of outbound links that page Ti has,
d is the damping factor, which is assumed to be 0.85.
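As a rough illustration, formula (1) can be applied once to a single page. The function name `pagerank_of` and the input values below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def pagerank_of(backlinks, d=0.85):
    """Apply PR(A) = (1 - d) + d * sum(PR(Ti) / C(Ti)) for one page A.

    backlinks: list of (PR(Ti), C(Ti)) pairs, one per page Ti linking to A.
    """
    votes = sum(pr / out_links for pr, out_links in backlinks)
    return (1 - d) + d * votes


# Hypothetical inputs: T1 has PR 1.0 and 2 outbound links,
# T2 has PR 0.5 and 1 outbound link.
pr_a = pagerank_of([(1.0, 2), (0.5, 1)])
print(round(pr_a, 4))
```

Here each backlink contributes 0.5, so PR(A) = 0.15 + 0.85 × (0.5 + 0.5) = 1.0. Note that T1 has the higher PageRank but passes on only half of it, because its vote is split across two outbound links.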
The PR of each page depends on the PR of the pages pointing to it, but we cannot know the PR of those pages until the pages pointing to them have a PR of their own. According to Google, a page's PR can be calculated without knowing the final PR values of the other pages. Because the calculation is iterative, we can start from a guess and repeat the calculation; each pass gives a closer estimate, until the numbers stop changing noticeably.
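The iterative calculation can be sketched as follows. This is an illustrative implementation, not Google's own: the three-page link graph, the starting guess of 1.0, and the `tol` and `max_iter` parameters are assumptions made for the example, and the sketch assumes every page has at least one outbound link.

```python
def pagerank(links, d=0.85, tol=1e-6, max_iter=100):
    """Repeat formula (1) for every page until the values stop changing.

    links maps each page to the list of pages it links to.
    Assumes every page has at least one outbound link.
    """
    pages = list(links)
    ranks = {p: 1.0 for p in pages}  # arbitrary starting guess
    for iteration in range(1, max_iter + 1):
        new_ranks = {}
        for page in pages:
            # Sum PR(T) / C(T) over every page T that links to `page`.
            votes = sum(ranks[t] / len(links[t])
                        for t in pages if page in links[t])
            new_ranks[page] = (1 - d) + d * votes
        change = max(abs(new_ranks[p] - ranks[p]) for p in pages)
        ranks = new_ranks
        if change < tol:  # estimates have settled
            break
    return ranks, iteration


# Hypothetical web: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks, iterations = pagerank(links)
for page, value in ranks.items():
    print(page, round(value, 3))
print("iterations:", iterations)
```

In this small graph, C ends up with the highest value because it receives votes from both A and B, while B receives only half of A's vote.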
It is interesting to note that once the PageRank calculations are done, the average PageRank over all pages is 1, provided every page has at least one outbound link.
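A short derivation shows where the average of 1 comes from. It assumes that every page has at least one outbound link; N (the number of pages) and S (the sum of all PageRank values) are symbols introduced here for the derivation only.

```latex
\begin{aligned}
S &= \sum_{A} PR(A)
   = \sum_{A} \Big[ (1-d) + d \sum_{T \to A} \frac{PR(T)}{C(T)} \Big] \\
  &= N(1-d) + d \sum_{T} C(T) \cdot \frac{PR(T)}{C(T)}
   = N(1-d) + d\,S \\
\Rightarrow\quad (1-d)\,S &= N(1-d)
\quad\Rightarrow\quad S = N,
\qquad \frac{S}{N} = 1 .
\end{aligned}
```

The second line uses the fact that each page T appears in the sum exactly C(T) times, once for every page it links to, so its PageRank is passed on in full.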
Weka offers a solid foundation for various web mining tasks, despite its limitations in handling large volumes of data and clustering. Web Structure Mining, powered by tools like Majestic and algorithms like the Page Rank Algorithm, plays a crucial role in understanding the web's intricate link structure. This exploration not only sheds light on the capabilities and limitations of Weka but also underscores the significance of Web Structure Mining in today's data-driven landscape.
Exploring Web Mining Tools: A Review of Weka and Analysis of Web Structure Mining. (2024, Feb 23). Retrieved from https://studymoose.com/document/exploring-web-mining-tools-a-review-of-weka-and-analysis-of-web-structure-mining