The Web Is Structured Like a Graph: Information Technology Essay

The web is structured like a graph with billions of pages linked to each other. We, the users, navigate the web using this hyperlink structure to access the specific information we need. But how many new pages do we visit each day? The answer is only a few, depending on our needs. When searching for a specific piece of information, we take help from search engines and rarely look beyond the first twenty results they return. The idea behind the search engine process is the web crawler, also known as a web robot or spider, which traverses web pages and searches for specific information. This information can later be downloaded, analyzed, or ranked according to the needs of people and machines. There are mainly three types of crawlers: universal, focused, and topical. Universal crawlers are mainly used by search engines to retrieve all information from the web irrespective of its content. Focused and topical crawlers are called preferential crawlers, although they differ in their search process. Topical crawlers are very specific to a given topic, but they do not require training the crawler with example URLs the way focused crawlers do. Instead, we feed a list of URLs to the topical crawler. The benefit of such a crawler is that we can easily retrieve pages relevant to a given set of texts or sentences. Topical crawlers have become the basis of many specialized services such as financial portals, intelligence tools, and scientific paper search within digital libraries.


Purpose & Methodology

The purpose of this project is to analyse how a topical crawler works on the real web structure. In past years, much research has been done on preferential crawler algorithms. We also wanted to explore how such crawling and similarity-measure algorithms behave when implemented, i.e. the time, latency, and resource demands of crawlers as well as their results. There are many crawling algorithms that work well for topical crawling, e.g. Best-first, Breadth-first, and InfoSpiders, which we analysed for implementing a topical crawler. Once we know how to traverse, we need to represent the documents in a standard way and then find the similarity of the query document to the fetched documents. Here we need an algorithm such as the Vector Space Model or Naive Bayes to represent the documents in a standard way, and an algorithm to find the similarity between documents, e.g. the cosine similarity measure. The Naive Bayes algorithm is also very popular for measuring similarity, but we did not implement it because of its probabilistic nature. We have studied several scientific papers in this area, e.g. [1], and a popular book on web mining [2]. The final task is to find a topic area to feed the crawler, and it can be anything that interests users. For our prototype, we considered a research topic in parallel computing and then tried to find such topics across all Swedish universities.

Structure of the report

The report consists of a number of sections that cover the whole web crawling process. The first section provides a review of the topic background. The second section gives a detailed description of the web crawling algorithm that we use in our project. The third section discusses two algorithms, the vector space model and cosine similarity, which are needed to represent a document and to find the similarity of that document to the query document. The fourth section discusses the implementation issues that have to be dealt with when building a topical crawler. Finally, we present some discussion of our experience as well as future work.

Implementation Platform

When implementing such a crawler, we decided to be independent of operating systems. Java is the best choice of implementation language for this, and there are numerous open-source APIs available for Java in case we need them. Regarding development environments, we used the Eclipse Java IDE and Adobe Flex Builder for presenting the information in a browser.

TOPICAL WEB CRAWLER

Algorithm

Crawler algorithms exploit the web's hyperlinked structure in clever ways, retrieving outgoing links from the most recently traversed pages. The concept of a topical crawler was first introduced by Menczer [1]. Topical crawlers complement search engines and are among the main consumers of internet bandwidth. A crawler algorithm is needed to direct the crawler in how to traverse the links. As mentioned in the hypothesis, candidate crawler algorithms include Breadth-first (Pinkerton, 1998), Shark-Search (Hersovici et al., 1998), InfoSpiders (Menczer, 1998), and Best-first. We have selected the Best-first crawler algorithm, which was first proposed by Cho et al. (1998).

How it works

The basic idea behind the Best-first algorithm is that the most relevant link is selected from the frontier. A frontier is a priority queue that stores all the unvisited links (URLs) retrieved from visited pages. At any time, this priority queue is sorted by relevance in descending order. In our case, all child links are given the same cosine value as their parent link and stored in the frontier. The crawler retrieves the best link from the frontier (based on cosine value), then we calculate the actual cosine value of that page and find its similarity with respect to the query document. Thus, with the best-first algorithm, we go deeper along a link if we find that its cosine similarity is higher. Once a link is visited, it is deleted from the frontier. In this way, at the end, all the links stored in the frontier have been traversed, evaluated, and compared against the query document. We write all this information about each visited link to a file for later presentation to the users. One thing to mention: this priority queue never allows duplicate URLs and its capacity is unlimited.
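
To make the traversal concrete, here is a minimal sketch in Java of such a best-first loop. The class and helper names (fetch, extractLinks, cosineSimilarity, report) are placeholders for the steps described in the following sections, not our exact implementation; the real frontier also rejects duplicate URLs before insertion.

    import java.util.*;

    // Minimal sketch of the best-first loop described above. The abstract helpers
    // stand in for the fetching, parsing and similarity steps covered later.
    abstract class BestFirstCrawler {

        static class ScoredUrl implements Comparable<ScoredUrl> {
            final String url;
            final double score;                       // cosine value inherited from the parent page
            ScoredUrl(String url, double score) { this.url = url; this.score = score; }
            public int compareTo(ScoredUrl o) { return Double.compare(o.score, this.score); } // highest first
        }

        abstract String fetch(String url);                               // download a page (see Fetching)
        abstract List<String> extractLinks(String page, String baseUrl); // parse outgoing links
        abstract double cosineSimilarity(String page);                   // compare against the query document
        abstract void report(String url, double sim);                    // write to positive.txt / negative.txt

        void crawl(List<String> seeds) {
            PriorityQueue<ScoredUrl> frontier = new PriorityQueue<>();
            Set<String> visited = new HashSet<>();
            for (String s : seeds) frontier.add(new ScoredUrl(s, 1.0));

            while (!frontier.isEmpty()) {
                ScoredUrl best = frontier.poll();        // most promising unvisited link
                if (!visited.add(best.url)) continue;    // never visit a URL twice
                String page = fetch(best.url);
                if (page == null) continue;
                double sim = cosineSimilarity(page);
                report(best.url, sim);
                for (String child : extractLinks(page, best.url))
                    frontier.add(new ScoredUrl(child, sim)); // children inherit the parent's score
            }
        }
    }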

As mentioned earlier, one big advantage of topical crawlers is that every search returns fresh hits. Unlike traditional search engines, no stale results are returned by a topical crawler, because all pages are visited at query time. This makes topical crawlers suitable for finding recently posted blogs or updated information that traditional search engines may miss. An example of such real-time search is Collecta [1], which searches in near real time. Finally, it is worth mentioning that search engines such as Google [2] and Bing [3] run many topical crawlers behind their search process on a schedule, covering different topic areas, and update their own ranking pages to serve faster results to users.

OTHER ALGORITHMS

Once we have directed the crawler to crawl the web using an algorithm, we need to represent the returned documents in a proper way in order to retrieve information. What we mean by representation is that the documents contain a lot of irrelevant information that we do not need when computing similarity with the query document. The most important thing is to express all the text in a standard format for comparison, for instance a word-vector format or a matrix format.

Vector Space Model

The vector space model is a well-known and widely used model in the area of information retrieval (IR). The idea is to represent a document as a vector of words where each word has a weight. The weight can be the term frequency, in which case it is called the term frequency (TF) scheme, but there are other ways to calculate the weight, such as TF-IDF. Here we will only discuss the term-frequency weighting scheme, since that is the method we used in our project.

The basic idea is to find all unique terms in a document, count the number of occurrences of each term, and build a vector. This vector can be represented like this:

W_i = (term_i, freq_i)

Document = { W_1, W_2, W_3, ..., W_n }, where n is the number of unique terms

In our project we applied this method to the query document as well as to the documents that we wanted to compare it with.
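
A minimal sketch of this term-frequency representation in Java; it assumes the tokens have already been produced by the tokenizing, stopword and stemming steps described later:

    import java.util.*;

    class TermFrequencyVector {
        // Build the term-frequency vector: map each unique term to the number of times it occurs.
        static Map<String, Integer> termFrequencies(List<String> terms) {
            Map<String, Integer> vector = new HashMap<>();
            for (String term : terms) vector.merge(term, 1, Integer::sum);   // count occurrences
            return vector;
        }
    }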

Similarity Test

One way to calculate the degree of similarity of a document to a query is to use the cosine similarity test. Cosine similarity is a well-known method for measuring similarity: it is the cosine of the angle between the query vector and the document vector.

The formula used is shown below:

cos(q, d) = (q . d) / (|q| * |d|) = sum_i(q_i * d_i) / ( sqrt(sum_i q_i^2) * sqrt(sum_i d_i^2) )

The result of the cosine similarity lies in the range 0 to 1, where 0 means no similarity and 1 means very high similarity. If the angle between the two vectors is 0, the cosine value will be 1, which means the query and the document are identical.
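
A minimal sketch of this measure over two such term-frequency vectors (the names are illustrative, not our exact code):

    import java.util.Map;

    class CosineSimilarity {
        // cos(q, d) = dot(q, d) / (|q| * |d|) over two term-frequency vectors.
        static double cosine(Map<String, Integer> q, Map<String, Integer> d) {
            double dot = 0, qNorm = 0, dNorm = 0;
            for (Map.Entry<String, Integer> e : q.entrySet()) {
                qNorm += (double) e.getValue() * e.getValue();
                Integer df = d.get(e.getKey());
                if (df != null) dot += (double) e.getValue() * df;   // only shared terms contribute
            }
            for (int f : d.values()) dNorm += (double) f * f;
            if (qNorm == 0 || dNorm == 0) return 0.0;                // an empty vector has no similarity
            return dot / (Math.sqrt(qNorm) * Math.sqrt(dNorm));
        }
    }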

IMPLEMENTATION ISSUES

Fetching

This is the first step of the implementation. The crawler fetches pages like a web client, i.e. it sends an HTTP request to the server, which in turn responds with a 200 OK message meaning that the requested page is available, and the crawler then reads the response. If the server does not respond or the connection fails, the crawler moves on to the next page, but only after it has tried to fetch the page three times.

As each page is fetched, an important data structure is used, in our case a hash table (which in turn uses a hash map). The hash map maps each link to a distinct key (value) so that already-visited pages are not fetched again.
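
A minimal sketch of such a fetch step using java.net.HttpURLConnection; the three attempts match the behaviour described above, while the timeouts and other details are illustrative assumptions:

    import java.io.*;
    import java.net.*;

    class PageFetcher {
        // Fetch a page, retrying up to three times; returns null if the page could not be fetched.
        static String fetch(String url) {
            for (int attempt = 0; attempt < 3; attempt++) {
                try {
                    HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
                    conn.setConnectTimeout(5000);
                    conn.setReadTimeout(5000);
                    if (conn.getResponseCode() != 200) continue;   // only accept 200 OK
                    StringBuilder page = new StringBuilder();
                    try (BufferedReader in = new BufferedReader(
                            new InputStreamReader(conn.getInputStream()))) {
                        String line;
                        while ((line = in.readLine()) != null) page.append(line).append('\n');
                    }
                    return page.toString();
                } catch (IOException e) {
                    // connection failed; try again, up to three attempts in total
                }
            }
            return null;   // give up and let the crawler move on to the next page
        }
    }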

Parsing and Tokenizing

This is the next step after the page has been downloaded. The crawler continues by parsing the content of the page, which means it extracts all the relevant information such as outgoing URLs, text, etc., and then removes all HTML markup (<tags>).

In the next step, the crawler identifies/tokenizes the significant terms as strings by filtering out the various delimiters such as % & / * " #. At the same time it removes all numeric values, which normalizes the document for further processing.
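
A minimal sketch of this step, assuming a simple regular-expression approach to stripping tags and splitting on non-letter characters (which also drops numeric tokens):

    import java.util.*;

    class Tokenizer {
        // Strip HTML tags, then split on anything that is not a letter and lowercase the tokens.
        static List<String> tokenize(String html) {
            String text = html.replaceAll("<[^>]*>", " ");        // remove <tags>
            List<String> tokens = new ArrayList<>();
            for (String t : text.split("[^\\p{L}]+")) {           // delimiters, punctuation and digits are dropped
                if (!t.isEmpty()) tokens.add(t.toLowerCase());
            }
            return tokens;
        }
    }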

Stopword Removal and Stemming

In this step, we filter out all the stopwords (in our case, an English stopword list [4]) from the text, i.e. the words that do not carry significant meaning and might skew the content similarity, such as "the", "at", "of", "on", etc.

For the stemming part, we use the Porter Stemming Algorithm in its Java implementation [3].

Two ideas lie behind stemming:

The first idea is to reduce all the different forms of a word to its base form; for instance, "writing, wrote, written" are converted to "write".

The second idea is to merge words that have similar meanings, i.e. the synonyms of each word.
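
A minimal sketch of this filtering step. The Stemmer class stands for the Porter stemmer implementation from [3]; the method names (add, stem, toString) follow that reference implementation but should be checked against the downloaded source, and the stopword set shown is only a small excerpt of the list in [4]:

    import java.util.*;

    class StopwordsAndStemming {
        // A small excerpt of the English stopword list [4]; the real list is much longer.
        static final Set<String> STOPWORDS =
                new HashSet<>(Arrays.asList("the", "at", "of", "on", "a", "an", "and", "is", "to"));

        // Drop stopwords and stem the remaining tokens with the Porter stemmer from [3].
        static List<String> filterAndStem(List<String> tokens) {
            List<String> result = new ArrayList<>();
            for (String token : tokens) {
                if (STOPWORDS.contains(token)) continue;          // skip words carrying little meaning
                Stemmer stemmer = new Stemmer();                   // Porter stemmer class from [3]
                for (char c : token.toCharArray()) stemmer.add(c);
                stemmer.stem();
                result.add(stemmer.toString());                    // e.g. "writing" -> "write"
            }
            return result;
        }
    }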

URL Extraction and Canonicalization

This part is used first to classify whether a URL path is relative or absolute, and then to convert relative paths to absolute ones so that the whole link is available when fetching the pages. This step has to be done before links are added to the frontier.

Here are three examples of how relative and absolute URLs are classified:

Classified as relative:

URLs that start with /, ../ or ? etc.: "/minasidor/student.html"

Classified as absolute:

URLs that start with http or www and contain a domain name such as .com, .org, or .se: "www.kth.se/"

Relative converted to absolute:

"www.kth.se/minasidor/student.html"
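
A minimal sketch of this resolution using java.net.URL, which handles the relative-versus-absolute distinction for us; prefixing bare www-style links with http:// is an assumption for illustration, not necessarily what our crawler does:

    import java.net.*;

    class UrlCanonicalizer {
        // Resolve a possibly relative link against the page it was found on.
        static String toAbsolute(String baseUrl, String link) throws MalformedURLException {
            if (link.startsWith("www.")) link = "http://" + link;   // assumption: treat bare www links as http
            URL base = new URL(baseUrl);
            // e.g. ("http://www.kth.se/", "/minasidor/student.html")
            //  ->  "http://www.kth.se/minasidor/student.html"
            return new URL(base, link).toString();
        }
    }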

Page Repository

This is the last step of the implementation. Its task is to store each page that was fetched and ranked in a text file.

Here is the information that is specifically stored:

The URL (the page we were looking for)

The seed URL it came from

Who referred to it (the parent URL)

How deep in the link tree we found it

Which relevance value the page got

The crawler classifies a document as positive or negative depending on the similarity threshold: values below the threshold are negative and vice versa.
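
A minimal sketch of this storage step, appending one semicolon-separated record per page to positive.txt or negative.txt depending on the threshold; the field order follows the record format shown in the Conclusion, and the method names are illustrative:

    import java.io.*;

    class PageRepository {
        // Append one record: seed URL; parent URL; URL; depth; similarity value.
        static void store(String seed, String parent, String url, int depth,
                          double similarity, double threshold) throws IOException {
            String file = similarity >= threshold ? "positive.txt" : "negative.txt";
            try (PrintWriter out = new PrintWriter(new FileWriter(file, true))) {   // append mode
                out.printf("(1) %s; (2) %s; (3) %s; %d; %s;%n", seed, parent, url, depth, similarity);
            }
        }
    }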

Discussion

During the implementation of the topical crawler we encountered several areas that we could not address due to the time constraints of the project. Below we discuss some of these areas:

Spider traps

One important thing to be aware of is spider traps. These are web sites where URLs are dynamically generated depending on the actions a user takes while browsing the site. Such sites may generate an infinite number of URLs, which in turn would cause the crawler to visit and fetch these pages forever. One solution proposed in the literature is to limit the number of URLs that can be fetched from one domain. Another solution may be to limit the number of characters in a URL to, say, 256.
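
A minimal sketch of these two guards, assuming a per-domain counter; the cap of 1000 pages per domain is an arbitrary illustration, while the 256-character limit is the one mentioned above:

    import java.net.*;
    import java.util.*;

    class SpiderTrapGuard {
        static final int MAX_URL_LENGTH = 256;      // cap on URL length, as suggested above
        static final int MAX_PER_DOMAIN = 1000;     // assumption: cap on pages fetched per domain
        final Map<String, Integer> perDomain = new HashMap<>();

        // Return true if the URL may be added to the frontier.
        boolean allow(String url) throws MalformedURLException {
            if (url.length() > MAX_URL_LENGTH) return false;        // suspiciously long, likely generated
            String domain = new URL(url).getHost();
            int count = perDomain.merge(domain, 1, Integer::sum);   // count URLs seen per domain
            return count <= MAX_PER_DOMAIN;
        }
    }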

Concurrency

One way to increase the performance of a web crawler is to handle concurrency, which is basically a way to make several threads/processes work independently. The only thing these threads or processes share is the frontier, from which they each pick URLs to visit and to which they add new ones. One important thing that must be handled is making sure that only one thread/process can save new URLs at a time. Also, if the frontier is empty it does not mean that the crawl is over, since some of the threads or processes may still be busy fetching new URLs. The program should end by having a main thread that polls all the worker threads to see which are sleeping, and if all threads are sleeping the program ends.
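
A minimal sketch of a shared, synchronized frontier together with the termination check described above; the idle-worker bookkeeping is illustrative only:

    import java.util.*;

    class SharedFrontier {
        // In the real crawler this would be the priority queue of scored links;
        // a plain queue keeps the sketch focused on the synchronization.
        private final Queue<String> queue = new ArrayDeque<>();
        private int idleWorkers = 0;
        private final int totalWorkers;

        SharedFrontier(int totalWorkers) { this.totalWorkers = totalWorkers; }

        // Only one thread at a time may add new URLs (the method is synchronized).
        synchronized void add(String url) { queue.add(url); notifyAll(); }

        // Take the next URL, or return null once every worker is idle and the queue is empty,
        // which is the termination condition described above.
        synchronized String take() throws InterruptedException {
            idleWorkers++;
            while (queue.isEmpty()) {
                if (idleWorkers == totalWorkers) { notifyAll(); return null; }   // crawl is over
                wait();
            }
            idleWorkers--;
            return queue.poll();
        }
    }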

Another, ethical, issue is to avoid having several threads/processes hitting one and the same web server at once, by introducing some kind of queue for threads/processes accessing a server.

Scalability

Other important areas for web crawlers are robustness, efficient memory management and efficient bandwidth usage.

Conclusion

In our project we developed a simple sequential topical web crawler, in which we tried to work out many of the implementation issues discussed earlier. The final result was two applications, described below:

Web Crawler (Java)

The Java application is a sequential, console-based topical crawler. It takes the following fields as input:

Similarity value (the threshold that decides when a document is similar; we tested 0.2)

Link depth value (how deep the crawler will go; we tested with 5 and 10)

Number of outgoing links fetched for each document (by default we fetch all)

A seed file with starting URLs to visit

A query document, given as a txt file or a URL that can be parsed

The project can be imported into Eclipse and run with the Eclipse tools. The main Java file is WebCrawler.java. The results are saved in real time in two text files: positive.txt for similar documents and negative.txt for documents which are not similar. We have set a threshold limit of 0.2 (the choice was made by us), and all documents with similarity >= 0.2 go into the positive.txt file, and so on. The result looks like this, one record per line:

(1) http://www.ida.liu.se/edu/index.en.shtml; (2) http://www.ida.liu.se/research/index.en.shtml; (3) http://www.ida.liu.se/~chrke/; 2; 0.39905495199970265;

The Seed URL

The parent URL

The URL

The level at which the document was found, starting from the seed

The similarity value it got

Result Presentation Tool (Adobe Flex)

The Flex (Adobe Flash) application acts as a presentation and analysis tool for the results obtained from the previous step (the web crawl). It takes the following files as input:

Positive.txt

Negative.txt

It then shows the results in a graphical format in a browser. We made three kinds of dynamic charts to summarize the statistics, so that users hovering the mouse over an area of a graph can see the individual values. Since this is a Flex web application, you need an application server such as Apache or Tomcat to open it. To develop it further you need a Flex development environment, which can be downloaded from www.adobe.com. In this project we packaged the presentation tool inside a Tomcat application server. The Tomcat application server can be started by going to the bin directory and clicking startup.bat. The URL to access the results is:

http://localhost:8080/TWC/

REFERENCES

[1] Menczer, F. (1997). ARACHNID: Adaptive Retrieval Agents Choosing Heuristic Neighborhoods for Information Discovery. In D. Fisher, ed., Proceedings of the 14th International Conference on Machine Learning (ICML97). Morgan Kaufmann.

[2] Liu, Bing. Web Data Mining. Springer, 2006.

[3] Porter, Martin (Jan 2006). "The Porter Stemming Algorithm." Java language code. http://tartarus.org/~martin/PorterStemmer/. Retrieved 20 Feb 2010.

[4] Ranks.nl. "English Stopwords." www.ranks.nl/resources/stopwords.html. Retrieved 20 Feb 2010.

Glossary

HTML – Hyper Text Markup Language

HTTP – Hyper Text Transfer Protocol

PSA – Porter Stemming Algorithm

URL – Uniform Resource Locator

VSM – Vector Space Model

WWW – World Wide Web

Appendices

Personal Reflection (Ziaul Kabir)

The whole project can be divided into three parts: idea generation and requirements analysis, implementation, and documentation. As a group of three, we worked as a team from the beginning and worked together in a common place. In the brainstorming and requirements-analysis step, we read the course book, additional resources, and lecture notes to finally come up with what we wanted to implement. All three members contributed equally. My focus was on the algorithm selection for the crawler, and on a topic area to investigate. I came up with the best-first crawler and a topic in parallel computing to search for among Swedish universities. In the implementation phase, we coded the project together, but some issues where we ran into complexities, such as memory optimization, were handled later by Ahmed at home after our joint work.

Here I was trying to find out how a best-first algorithm, a parser and a tag stripper work in real life, as well as how to resolve a relative URL to an absolute URL. I read the book, took help from one of my friends (doing a PhD in NLP in the USA) and then summarized how we could implement these parts. For the presentation and the report, we explicitly divided the work into three parts. My part of the presentation was to write about and present the crawler algorithm, the topic and URL query feed, and the results & discussion areas. In the final report I contributed the same topic areas as in the presentation.

Personal Reflection (Ahmed Bile Hazi)

Personal Reflection (Sinan Kanadan)

Our project was divided into three parts: the choice of topic and algorithm, implementation, and analysis & results.

The resources from which I gathered information for the project were the course book, the Internet, the lectures, and discussions with the project group.

Everyone in the group worked on all parts, but each of us had a specific area to focus on. The part I focused on most in our project is the implementation section, where I took care of downloading web pages, parsing and tokenizing, stopwords and stemming, the URL extraction (converting relative URLs to absolute URLs) and the last step of saving the downloaded and ranked pages. In short, these are the parts that I wrote in the report and presented at the seminar.

This course was completely new to me, and even though I joined two weeks after the course started, I still learned a lot about how search engines work, including the methods and algorithms to use, how the web crawling process works, etc.

It has been a pleasure to work with my project group. I am proud to have done a project with this group. Our group can be described as friendly, helpful, skilled, able to handle whatever came up in all situations, and able to get the job done in a short and smooth manner. I look forward to working with the group on a larger project in the future.

I would like to thank our teachers for the interesting lectures and fun labs that helped us get an idea of the important parts of both the project and the course. A big thanks to our teacher Martin Hassel, who helped us turn our idea into a successful project.