Google and the linguistic diversity of the internet

Yesterday Google's official blog published a post of great linguistic relevance: an explanation of the computational-linguistic methods that Google uses to improve users' searches, particularly when they search in languages other than English.

  • Spell corrections: We recently launched spell corrections in Estonian. If your Estonian is rusty, and you don’t remember how to spell “smoke detector,” we can suggest a spell correction for [suitsuantur], leading to better search results.
  • Diacritical marks: Many languages have diacritical marks, which alter pronunciation. Our algorithms are built to support them, and even help users who mis-type or completely ignore them. For example, if you’re a resident of Quebec, Canada and would like to know the weather forecast in Quebec City, we’ll serve good results whether you type with diacritical signs [Météo à Québec] or without [meteo quebec]. Czech users can read the same excellent results for a popular kids’ cartoon by searching for [krtecek] and [krteček]. On the other hand, sometimes diacriticals change the meaning of the word and we have to use them correctly. For example, in Thai, [ข้าว] is “rice,” with completely different results than [ข่าว], which is “news”; or in Slovakia, results for “child” [dieťa] are different than results for “diet” [diéta].
  • Synonyms: A general case of diacritical support is the handling of synonyms in different languages. Korean searches showed that “samsung” can be viewed as a synonym of “삼성”, so that when users search for [samsung], they find results which have the company’s name in Korean.
  • Compounding: Some languages allow compounding, which is the formation of new words by combining together existing words. You can see a nice example in Swedish, where we return documents about a Swedish credit card for both compounded [Visakort] and non-compounded [visa kort] queries.
  • Stemming: Google has developed morphological models that can receive compound words as queries, and return pages which contain their stem, possibly as part of a different compound. For example, when searching for cars in Saudi Arabia, you can search for [?????] and [??????] because both are variants of the same stem, and both return many common results. A Polish user can search for “movie” [film], and get back results that contain other variants of the stem, such as “filmów,” “filmu,” “filmie,” “filmy.” A user from Belarus will find results for all word forms of the capital, Minsk [?????]: “??????,” “??????,” “????????.”
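
To make the diacritics and compounding points concrete, here is a minimal Python sketch of how a search index could normalize queries so that accented and unaccented spellings, or compounded and spaced spellings, end up under the same key. This is not Google's method, which the post does not describe in detail; the function names `fold_diacritics` and `query_key` are hypothetical, and the approach (Unicode decomposition plus removal of combining marks and whitespace) is only one simple way to get the effect.

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Lowercase the text and strip combining marks (accents, carons, ...)."""
    # NFD splits a precomposed letter such as "é" into "e" + a combining accent.
    decomposed = unicodedata.normalize("NFD", text.casefold())
    # Keep only the characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def query_key(query: str) -> str:
    """Hypothetical index key: fold diacritics and drop all whitespace,
    so that compounded and non-compounded spellings collide."""
    return "".join(fold_diacritics(query).split())


if __name__ == "__main__":
    for query in ["Météo à Québec", "meteo a quebec", "Visakort", "visa kort"]:
        print(repr(query), "->", repr(query_key(query)))
```

The sketch also shows why naive folding is not enough. With it, the Slovak words "dieťa" (child) and "diéta" (diet) both reduce to "dieta", which is exactly the situation the post says has to be handled correctly, so a real system would need language-specific rules on top of a step like this.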

They also describe their efforts to make Google usable for searches in languages that use writing systems other than our Latin alphabet, among other things through a kind of phonemic transcription in Latin letters.