1. Finding information on the Web
- Browse strategies: where is information stored?
- Search strategies: what does the information contain?
* Specific queries
e.g. encyclopaedia, library
* Broad queries
e.g. web directories
* Vague queries
e.g. search engines
2. Web Search
- All queries are answered from the indices alone, without accessing the page texts
- Links: link topology, link popularity, who links
- Page structure: words in headings count for more than words in body text
- Spamming
* most search engines have rules against invisible text / meta tag abuse / heavy repetition / "domain spam"
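The point that queries are answered from the indices alone can be sketched with a small inverted index. This is a minimal illustration, not how any particular engine works: the sample pages, the whitespace tokeniser and the names build_index and search are assumptions made for the example.

```python
from collections import defaultdict


def build_index(pages):
    """Map each lowercased word to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index


def search(index, query):
    """AND query: page ids containing every query word, using only the index."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample pages (illustration only).
pages = {
    "a": "web search engines index pages",
    "b": "crawlers traverse the web",
    "c": "search engines rank pages by links",
}
index = build_index(pages)
print(search(index, "search engines"))
```

Once the index is built, search never reads the page texts again; it only intersects the sets stored for each query word.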
3. Centralised architecture
e.g. Crawler-indexer
- Crawler: a program that traverses the Web and sends new or updated pages to the main server (where they are indexed)
- Centralised use of index to answer queries
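A rough sketch of the crawler-indexer split follows. To stay runnable without network access it crawls a hypothetical in-memory "web" (the WEB dictionary and its example.org URLs are made up), and the names MainServer and crawl are assumptions for illustration. A real crawler would also need politeness rules, error handling and revisit scheduling.

```python
from collections import deque

# Hypothetical in-memory web: url -> (page text, outgoing links).
WEB = {
    "http://example.org/": (
        "home page about search",
        ["http://example.org/a", "http://example.org/b"],
    ),
    "http://example.org/a": ("crawler notes", ["http://example.org/"]),
    "http://example.org/b": (
        "index notes",
        ["http://example.org/a", "http://example.org/missing"],
    ),
}


class MainServer:
    """Central server: indexes the pages it receives and answers queries."""

    def __init__(self):
        self.index = {}  # word -> set of urls

    def receive(self, url, text):
        for word in text.lower().split():
            self.index.setdefault(word, set()).add(url)

    def query(self, word):
        return self.index.get(word.lower(), set())


def crawl(seed, server, fetch=WEB.get):
    """Breadth-first traversal from seed; sends every fetched page to server."""
    seen = {seed}
    frontier = deque([seed])
    indexed = []
    while frontier:
        url = frontier.popleft()
        page = fetch(url)
        if page is None:  # unreachable or missing page: skip it
            continue
        text, links = page
        server.receive(url, text)
        indexed.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return indexed


server = MainServer()
crawl("http://example.org/", server)
print(server.query("notes"))
```

The crawler only moves pages; all indexing and query answering happen on the one central server, which is the defining trait of this architecture.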
4. Distributed architecture
e.g. Harvest architecture
- Gatherers: periodically collect and extract indexing information from one or more web servers
- Brokers: provide the indexing mechanism and the query interface to the gathered data
they retrieve information from gatherers or from other brokers, updating their indices incrementally
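The gatherer/broker division can be sketched as below. This is a toy model under stated assumptions, not the Harvest implementation: the class and method names (Source, Gatherer, Broker, changes_since, update) are invented for the example, and "incremental" is modelled as pulling only the log records a broker has not seen yet. Removal of stale entries is left out.

```python
class Source:
    """Anything a broker can pull from: keeps an append-only log of records."""

    def __init__(self):
        self.log = []  # (doc_id, words); position in the list acts as sequence number

    def changes_since(self, count):
        """Return the records added after the first `count` records."""
        return self.log[count:]


class Gatherer(Source):
    """Collects documents and extracts indexing information (a word set)."""

    def gather(self, doc_id, text):
        self.log.append((doc_id, frozenset(text.lower().split())))


class Broker(Source):
    """Indexes records pulled from gatherers or other brokers; answers queries."""

    def __init__(self, sources):
        super().__init__()
        self.sources = list(sources)
        self.seen = [0] * len(self.sources)  # records already pulled per source
        self.index = {}  # word -> set of doc ids

    def update(self):
        """Pull only new records from each source; return how many were pulled."""
        pulled = 0
        for i, source in enumerate(self.sources):
            new = source.changes_since(self.seen[i])
            self.seen[i] += len(new)
            for doc_id, words in new:
                self.log.append((doc_id, words))  # so other brokers can pull from us
                for word in words:
                    self.index.setdefault(word, set()).add(doc_id)
            pulled += len(new)
        return pulled

    def query(self, word):
        return self.index.get(word.lower(), set())


g1, g2 = Gatherer(), Gatherer()
g1.gather("d1", "harvest gatherer")
g2.gather("d2", "harvest broker")
local = Broker([g1])        # broker fed by one gatherer
top = Broker([local, g2])   # broker fed by another broker and a gatherer
local.update()
top.update()
print(top.query("harvest"))
```

Because a broker exposes the same changes_since interface as a gatherer, brokers can be stacked, which is what lets the load be spread over several machines.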
5. Google Search
- Crawling and Index Depth: aims to refresh its index on a monthly basis
- Ranking algorithms
* Variations of Boolean and vector space model: term frequency * inverse document frequency
* Hyperlinks between pages: Popularity / Relatedness (e.g. PageRank, HITS)
!! PageRank: Google finds a single type of universally important page -- intuitively, locations that are heavily visited in a random traversal of the Web's link structure
- Google Relevancy: Google ranks web pages based on the number, quality and content of links pointing at them (citation)
* Number of Links
* Link Quality
* Link Content
* Ranking boosts on text style
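The "random traversal" intuition behind PageRank can be sketched as a power iteration over a small link graph. This is a simplified textbook version, not Google's production ranking: the damping factor 0.85, the fixed iteration count, the handling of pages without outgoing links and the sample graph are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: page -> list of pages it links to. Returns page -> score (sums to 1)."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # With probability (1 - damping) the surfer jumps to a random page.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Otherwise the surfer follows one outgoing link, chosen uniformly.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Page with no outgoing links: treat it as linking to every page.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page graph: a, b and d all link to c; c links back to a.
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this graph c collects the most incoming links and ends up with the highest score, and a ranks above b and d because its single incoming link comes from the highly ranked c. That mirrors the outline's point that both the number and the quality of links matter.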