CQP search in KorpusDK

In connection with the course Funktionelle Kategorier i Dansk (Functional Categories in Danish), I was asked to write a few quick examples of how to search KorpusDK using the CQP language (Corpus Query Processor; KorpusDK calls it "formel søgning", i.e. formal search). Since I am a nerd, one of the people behind the old Webcafe.dk, and a fan of regular expressions (with which CQP has a good deal in common), I naturally could not limit myself to just a few examples. I ended up with a proper cheat sheet (or rather, an introduction) to CQP searches in KorpusDK (… in English).

Download the KorpusDK Cheat Sheet from my Dropbox.

Google and the linguistic diversity of the internet

Yesterday, Google's official blog published a post of great linguistic relevance: an explanation of the computational-linguistic methods Google uses to improve users' searches, particularly searches in languages other than English.

  • Spell corrections: We recently launched spell corrections in Estonian. If your Estonian is rusty, and you don’t remember how to spell “smoke detector,” we can suggest a spell correction for [suitsuantur], leading to better search results.
     
  • Diacritical marks: Many languages have diacritical marks, which alter pronunciation. Our algorithms are built to support them, and even help users who mis-type or completely ignore them. For example, if you’re a resident of Quebec, Canada and would like to know the weather forecast in Quebec City, we’ll serve good results whether you type with diacritical signs [Météo à Québec] or without [meteo quebec]. Czech users can read the same excellent results for a popular kids’ cartoon by searching for [krtecek] and [krteček]. On the other hand, sometimes diacriticals change the meaning of the word and we have to use them correctly. For example, in Thai, [ข้าว] is “rice,” with completely different results than [ข่าว], which is “news”; or in Slovakia, results for “child” [dieťa] are different than results for “diet” [diéta].
     
  • Synonyms: A general case of diacritical support is the handling of synonyms in different languages. Korean searches showed that “samsung” can be viewed as a synonym of “삼성”, so that when users search for [samsung], they find results which have the company’s name in Korean.
     
  • Compounding: Some languages allow compounding, which is the formation of new words by combining together existing words. You can see a nice example in Swedish, where we return documents about a Swedish credit card for both compounded [Visakort] and non-compounded [visa kort] queries.
     
  • Stemming: Google has developed morphological models that can receive compound words as queries, and return pages which contain their stem, possibly as part of a different compound. For example, when searching for cars in Saudi Arabia, you can search for [?????] and [??????] because both are variants of the same stem, and both return many common results. A Polish user can search for “movie” [film], and get back results that contain other variants of the stem, such as “filmów,” “filmu,” “filmie,” “filmy.” A user from Belarus will find results for all word forms of the capital, Minsk [?????]: “??????,” “??????,” “????????.”
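
To make the diacritics point concrete, here is a minimal Python sketch of diacritic-insensitive matching. It is not Google's method, which the post does not describe in detail; it only assumes the common approach of Unicode decomposition followed by stripping combining marks. The function names `fold_diacritics` and `matches_ignoring_diacritics` are made up for this illustration.

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Return text lowercased with combining marks (accents, carons) removed."""
    # NFD splits a letter like "é" into "e" plus a combining acute accent.
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" (nonspacing mark) covers the combining accents we drop.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return stripped.casefold()


def matches_ignoring_diacritics(query: str, candidate: str) -> bool:
    """Compare two strings after folding away diacritics and case."""
    return fold_diacritics(query) == fold_diacritics(candidate)


if __name__ == "__main__":
    print(fold_diacritics("Météo à Québec"))
    print(matches_ignoring_diacritics("krtecek", "krteček"))
    # Naive folding also merges words that differ only by diacritics,
    # such as Slovak "dieťa" (child) and "diéta" (diet).
    print(matches_ignoring_diacritics("dieťa", "diéta"))
```

The last call shows why plain folding is not enough on its own: it treats "dieťa" and "diéta" as the same word, which is exactly the kind of case where, according to the quote, diacritics have to be respected.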

Google also describes its efforts to make search usable for languages written in scripts other than our Latin alphabet, among other things by using a kind of phonemic transcription in Latin letters.