Although this may seem straightforward, there are actually many different ways that search engines generate the list of links to websites matching a user's query. Yahoo! is one of the pioneers of online search, having started in 1994, when search engines did not yet exist. It was originally intended for the personal use of its creators, David Filo and Jerry Yang, but later developed into a public service when they found that people were enthusiastic about using it.
Basically, Yahoo employs a list of websites frequented by users, classified into categories and subcategories in a database, similar to the listings in the Yellow Pages. In later years, several other search engines were born. Notable among them was Google, which developed an entirely new way of searching by analyzing the links pointing to a given website. It also uses an automated way of indexing its website listings. Programs termed spiders or web crawlers search for websites and create snapshots of them, stored in a database as the basis for search results.
Following the example of pioneers like Google and Yahoo, hundreds, if not thousands, of new search engines appeared, and almost all websites now have a search bar and button that query the entire website. Recently, Microsoft has also joined the competition in the arena of search engines, joining the big two, Google and Yahoo. As the amount of information and data on the internet grows, it becomes more and more difficult to manage and search. Nowadays, many search engines exist on the World Wide Web, each employing its own techniques.
It has become common practice for websites to include search functionality for their content, so as to improve navigation. Some engines allow users to search content across the internet, while others are limited to their own domains. These searching techniques are developed constantly, and many different ways of generating search results have emerged. The primary purpose is to improve the relevancy of the results, since growing databases are prone to include websites that were not actually asked for. Improving relevancy means increasing the usability of the tool.
Thus online searching was extended from simple lists of websites to complex databases coupled with search algorithms that query data from them. Search engines today, such as Google, no longer depend on a directory listing; Google instead has a robot program, called Googlebot, that crawls the web and files a snapshot of the websites it passes in a database for later retrieval.

II. Recent Developments

Optimizing search engines is important in providing an efficient search tool for users across the internet, so search engine companies have employed many ways to improve them.
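The crawl-and-snapshot process described above can be sketched in miniature. This is a hedged illustration, not Googlebot's actual design: the "web" here is a made-up in-memory link graph, and the snapshot "database" is just a dictionary.

```python
from collections import deque

# A tiny made-up "web": each URL maps to (page content, outgoing links).
WEB = {
    "a.example": ("welcome page", ["b.example", "c.example"]),
    "b.example": ("news page", ["c.example"]),
    "c.example": ("archive page", []),
}

def crawl(seed):
    """Breadth-first crawl from a seed URL, snapshotting each page once."""
    snapshots = {}                 # the "database" of stored page snapshots
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        if url in snapshots or url not in WEB:
            continue               # already indexed, or unreachable
        content, links = WEB[url]
        snapshots[url] = content   # file a snapshot for later retrieval
        queue.extend(links)        # follow every outgoing link
    return snapshots

snapshots = crawl("a.example")
```

A real crawler would fetch pages over HTTP and persist snapshots to durable storage, but the core loop, follow links and store what you find, is the same.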
Since the data being searched is stored in databases, accessing this data efficiently is a key factor. This goes down to the disk level: accessing data on disk takes time, so reducing disk accesses cuts down the time it takes for queries to be answered. This can be done by employing a number of redundant disks instead of just one, so that multiple queries can be answered at the same time without waiting for the first one to finish.
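The redundant-disk idea can be sketched as simple round-robin routing: identical copies of the index take turns serving queries instead of queuing on one disk. The replica names and queries below are illustrative placeholders.

```python
from itertools import cycle

# Hypothetical replica set: three identical copies of the same index.
replicas = cycle(["disk0", "disk1", "disk2"])

def route(query):
    """Assign each incoming query to the next replica, round-robin."""
    return (query, next(replicas))

assignments = [route(q) for q in ["cats", "dogs", "birds", "fish"]]
# the fourth query wraps back around to the first replica
```

Production systems use smarter load balancing (least-loaded, locality-aware), but round-robin already shows why redundancy lets queries proceed in parallel.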
The design of the database that contains the listings is also important, as differences in the data types chosen for database columns can have large effects on response time, especially when processing a large number of queries. In addition, instead of having a monolithic server that accepts the queries, search engines can employ distributed servers that are lightweight and adaptive, since search engines deal with data that is constantly changing and dynamic.
There are also a number of algorithms used to improve the relevancy of search engine results, that is, the correctness of the list returned. One technique is called controlled vocabulary, also known as subject searching. This uses a controlled vocabulary: a set of standard terms pertaining to the contents of articles. These words form a list of terms with equivalents, similar to a thesaurus. Whenever a term is entered, the search engine looks up the terms with the same meaning, if the entered term exists in the controlled vocabulary.
Matches against the synonyms found are also included, as these are deemed relevant too, since they have the same meaning as the term the user queried. This technique is not commonly used, since it is more complex and users rarely require it. There is also the more common method called keyword searching. This does not employ a controlled vocabulary but instead uses substring matches of the words a user enters. This method allows users to look for the appearance of words or phrases in an article or other content.
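The controlled-vocabulary lookup described above can be sketched as a small synonym-expansion step before matching. The vocabulary and documents here are invented for illustration.

```python
# A tiny controlled vocabulary: each standard term maps to its
# equivalents, much like a thesaurus (illustrative terms only).
VOCABULARY = {
    "car": ["automobile", "vehicle"],
    "film": ["movie", "motion picture"],
}

ARTICLES = {
    "doc1": "the automobile industry grew",
    "doc2": "a review of the movie",
    "doc3": "trains and planes",
}

def subject_search(term):
    """Expand the query with its controlled-vocabulary equivalents,
    then return every article containing any expanded term."""
    expanded = [term] + VOCABULARY.get(term, [])
    return sorted(doc for doc, text in ARTICLES.items()
                  if any(word in text for word in expanded))

# searching "car" finds doc1 via its equivalent "automobile"
```

The extra lookup is what makes subject searching more complex than plain keyword matching: the vocabulary must be curated by hand, which is why few engines bother.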
This is one of the most common ways of searching databases, as it is fast and efficient, and most users simply want to check for the existence of the terms they entered. When users want to use not just a single term but multiple terms in a search, a technique called Boolean searching is used. This technique combines the search terms entered by the user and returns a result set generated by the combined filters of the entered words. It is generally effective in reducing the number of database accesses and makes searching much more efficient and convenient.
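Boolean searching amounts to set operations over the per-term keyword matches. A minimal sketch, with made-up documents, combining terms with AND via set intersection:

```python
DOCS = {
    "doc1": "search engines index the web",
    "doc2": "the web is large",
    "doc3": "engines and turbines",
}

def matches(term):
    """Set of documents whose text contains the term (keyword search)."""
    return {doc for doc, text in DOCS.items() if term in text}

def boolean_and(*terms):
    """AND-combine keyword filters: keep only documents matching every term."""
    result = matches(terms[0])
    for term in terms[1:]:
        result &= matches(term)    # set intersection narrows the result
    return sorted(result)

# "engines AND web" keeps only documents containing both terms
```

OR and NOT fall out the same way with set union (`|`) and difference (`-`), which is why Boolean queries are cheap to evaluate once the per-term matches are known.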
Relevance searching is a much more complex technique for searching databases. It employs symbols that users can attach to their search terms to refine the result set. Aside from those mentioned, some engines employ a technique usually termed proximity searching. This method takes into account the distance between the terms entered; it is especially useful when the terms appear on the same page but so far from each other that the page should not be relevant, yet it ranks high. All databases are composed of different columns, each containing a specific type of data.
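The proximity idea can be sketched by comparing word positions: two terms count as a match only if they occur within a maximum gap of each other. The sample texts are invented.

```python
def within_proximity(text, term_a, term_b, max_gap):
    """True if term_a and term_b occur within max_gap words of each other."""
    words = text.split()
    positions_a = [i for i, w in enumerate(words) if w == term_a]
    positions_b = [i for i, w in enumerate(words) if w == term_b]
    return any(abs(i - j) <= max_gap
               for i in positions_a for j in positions_b)

near = "apple pie recipe"
far = "apple " + "x " * 50 + "pie"
# both pages contain "apple" and "pie", but only the first is close enough
```

This is exactly the distinction the text describes: a page where the terms are fifty words apart passes a plain keyword test but fails the proximity test.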
Some websites allow these fields to be searched directly to provide more specific results. This way, users know what the results actually contain, rather than getting results that came out of the probabilities of page ranking. Some search engines also allow the use of wildcards, placeholder characters that stand for some value, mixed with the query terms to produce more specific and relevant results. Coupled with wildcards is the truncation character, which takes the characters entered as a word stem and matches every term beginning with that stem.
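Wildcard and truncation matching can be sketched with Python's `fnmatch` pattern syntax, where `*` matches any run of characters (truncation) and `?` matches exactly one (a single-character wildcard). The index terms are illustrative.

```python
from fnmatch import fnmatch

# Index terms to match against (illustrative).
terms = ["compute", "computer", "computation", "commuter"]

# Truncation: "comput*" matches every term beginning with the stem.
truncated = [t for t in terms if fnmatch(t, "comput*")]

# Wildcard: "?" stands for exactly one character, so "comp?ter"
# matches "computer" but not "commuter" or "computation".
wildcard = [t for t in terms if fnmatch(t, "comp?ter")]
```

Real search engines vary in which symbols they use (`*`, `?`, `$`, `#` are all seen in library catalogs), but the matching semantics are the same.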
When truncation searching and wildcards are used at the same time, they can prove to be a very powerful searching technique. Although these techniques are numerous, not all online search engines support them. There are also search engines that accept submissions of sponsored links. This is, generally speaking, paid advertisement, since it increases the ranking of a certain website upon payment by the advertiser. In a way, it could be said that such search engines are selling their page rank, but this is entirely internal to the search engine company.
Search engine companies do not release their search algorithms to the public, as they could be exploited.

III. Problems

The use of search engines has become so popular that engines like Google handle millions of search queries every day, so many companies have become interested in their methods, especially companies that are profit-driven. The rank of a website for a given query can be manipulated through search engine optimization. This is possible primarily because of the design of search engine algorithms.
Google, for instance, ranks a website according to the number of links pointing to it for a given word. When that word is entered, the websites ranked highest are those with many inbound links whose anchor text is the queried word. This flaw makes search engines prone to manipulation, and thus their results become less relevant for some keywords. There are companies that employ people just to generate links pointing to a website they are promoting, so that when a certain keyword is entered, it appears with a higher rank than the others.
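The anchor-text ranking just described can be sketched by counting, for each site, how many inbound links carry the query word as their anchor text. This is a deliberately simplified illustration, not Google's actual algorithm (real PageRank also weights links by the rank of their source), and the link graph is made up.

```python
from collections import Counter

# Hypothetical link graph: (anchor text, target site) for each link.
links = [
    ("news", "siteA"), ("news", "siteA"), ("news", "siteB"),
    ("shop", "siteC"), ("news", "siteA"),
]

def rank_for(keyword):
    """Rank sites by how many inbound links carry the keyword
    as anchor text, most-linked first."""
    counts = Counter(site for text, site in links if text == keyword)
    return [site for site, _ in counts.most_common()]

# for "news", siteA has three matching inbound links and ranks first
```

The sketch also makes the manipulation obvious: anyone who can mass-produce `("keyword", "their-site")` links moves their site up the list, which is exactly the exploit the next paragraph describes.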
This process, called search engine optimization, is known to generate a great quantity of spam on the internet, be it blogs, links, or articles. Companies big and small invest in optimizing their page rank, and in this way the internet gets clogged with useless sites existing only as unneeded baggage. Aside from these websites spamming the internet with links that aim to improve result rankings, there is also the risk of reaching websites that infect a user with malicious software and viruses, since search engines cannot differentiate between good websites and bad ones based only on the available sources.
Since search engine crawlers follow every link they encounter, there are cases where a website that should not have been viewable by the public can now be accessed as a search result.

IV. Possible Solutions

A possible solution for the spamming of search engines, or the deception employed by websites, is to alter search algorithms to specifically filter out websites of this type. The bots or crawlers could also be improved with a kind of artificial intelligence that would not allow them to index dubious websites.
As for viruses that could be included in otherwise legitimate search results, it would be possible to filter them out if the search engine companies installed reliable anti-virus or anti-spyware software before sending the website to the user's browser. They could also create a blacklist of domains that would never be included in the search results, even if they would otherwise rank on the first page. Also, when search engines detect these websites, they may report their owners and have them shut down when the owners are proven to have violated the law by spreading spam for search engine optimization purposes.
Although it is very difficult to have these spam websites shut down, as the internet is an unregulated body, the problem can be reduced when a search engine such as Google keeps its search algorithm secret, or keeps changing and modifying it, so that people who want to exploit the system have a hard time doing it. Search engine optimization is fast becoming a threat to the health of the internet, as it clogs search engine results with useless data that is primarily profit-driven, aimed only at drawing clicks to a certain website.

V. Future Directions
Innovations will be made in searching, and most are already starting today, with technology changing rapidly. Correctly identifying the trustworthiness of links may be done efficiently through intelligent software and through a blacklist of websites. Content filtering should also be administered, as many sites deliver unrestricted content to users who are actually of varying ages, sexualities, and religions. Some content found by crawlers may be offensive to some, and so it must be possible to turn it off.
A technique has also been proposed in some blogs: to improve the search results we get, we must first relinquish some of our privacy. This is because some of the search terms we enter are ambiguous, and the only way to resolve this is to let the search engine know more about us. This could take the form of allowing the website to keep track of all our past queries, or our browsing history. This would improve the relevancy of the results, although it comes at a price. IBM has also started research on how to bring search engines to the next level.
It is working on an engine that does not simply index pages; it would have a natural language analyzer that would mine the entire internet and create searchable XML tags out of its contents. Data mining would be a brute-force technique of indexing the web, but there seems to be no other option until a company can build a more efficient engine. Improvements in disk technologies and servers would also improve search engines: the faster the disk storage access, the faster the query result is retrieved.
Of course, as computers become faster, the refresh rate of these search engine snapshots will also gradually become faster, and at some point the content of a search engine's cache could be almost the same as the real-time content of the internet. Improvements in the standard query language would also improve the way searching is done, as searching is entirely based on these queries and is limited by what they can do. For example, the new version of MySQL now supports subqueries, making it easier to query tables and databases without making the query string ugly and difficult to read.
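The subquery point can be illustrated with a small runnable example. This sketch uses SQLite (via Python's `sqlite3` module) rather than MySQL, since it needs no server, but the subquery syntax is the same; the table and data are invented.

```python
import sqlite3

# A small in-memory database: find pages whose word count is above
# average, with the average computed inline by a subquery instead of
# a separate round-trip query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT, words INTEGER)")
conn.executemany("INSERT INTO pages VALUES (?, ?)",
                 [("a", 100), ("b", 300), ("c", 200)])

rows = conn.execute(
    "SELECT url FROM pages "
    "WHERE words > (SELECT AVG(words) FROM pages) "  # the subquery
    "ORDER BY url"
).fetchall()
# the average is 200, so only page "b" qualifies
```

Without subquery support, the average would have to be fetched first and spliced into a second query string, which is exactly the awkwardness the text complains about.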