
Cosine similarity

In data analysis, cosine similarity is a measure of similarity between two non-zero vectors defined in an inner product space. Cosine similarity is the cosine of the angle between the vectors; that is, it is the dot product of the vectors divided by the product of their lengths. It follows that the cosine similarity does not depend on the magnitudes of the vectors, but only on their angle. The cosine similarity always belongs to the interval $[-1, 1]$. For example, two proportional vectors have a cosine similarity of 1, two orthogonal vectors have a similarity of 0, and two opposite vectors have a similarity of -1. In some contexts, the component values of the vectors cannot be negative, in which case the cosine similarity is bounded in $[0, 1]$.

For example, in information retrieval and text mining, each word is assigned a different coordinate and a document is represented by the vector of the numbers of occurrences of each word in the document. Cosine similarity then gives a useful measure of how similar two documents are likely to be, in terms of their subject matter, and independently of the length of the documents.[1]

The technique is also used to measure cohesion within clusters in the field of data mining.[2]

One advantage of cosine similarity is its low complexity, especially for sparse vectors: only the non-zero coordinates need to be considered.
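
The sparse case can be sketched as follows. This is a minimal illustration, assuming each vector is stored as a Python dictionary that maps a coordinate to its non-zero value; the name `sparse_cosine` and this storage layout are hypothetical choices made for the example.

```python
import math

def sparse_cosine(a: dict, b: dict) -> float:
    """Cosine similarity of two sparse vectors stored as {coordinate: value} dicts."""
    # Iterate over the smaller dict: only coordinates present in both
    # vectors contribute to the dot product.
    if len(a) > len(b):
        a, b = b, a
    dot = sum(value * b[key] for key, value in a.items() if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)

# Hypothetical word counts for two short documents.
doc1 = {"cosine": 2, "vector": 1}
doc2 = {"vector": 3, "angle": 4}
print(sparse_cosine(doc1, doc2))
```

The work is proportional to the number of stored entries rather than to the dimension of the full vector space.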

Other names for cosine similarity include Orchini similarity and Tucker coefficient of congruence; the Otsuka–Ochiai similarity (see below) is cosine similarity applied to binary data.

Definition

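The cosine of two non-zero vectors can be derived by using the Euclidean dot product formula:

$$\mathbf{A} \cdot \mathbf{B} = \|\mathbf{A}\| \, \|\mathbf{B}\| \cos\theta$$

Given two n-dimensional vectors of attributes, A and B, the cosine similarity, cos(θ), is represented using a dot product and magnitude as

$$S_C(A, B) := \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \, \|\mathbf{B}\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}},$$

where $A_i$ and $B_i$ are the $i$th components of vectors $\mathbf{A}$ and $\mathbf{B}$, respectively.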

The resulting similarity ranges from -1 meaning exactly opposite, to 1 meaning exactly the same, with 0 indicating orthogonality or decorrelation, while in-between values indicate intermediate similarity or dissimilarity.
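
As a minimal sketch of the definition (dot product divided by the product of the magnitudes), the following pure-Python function can be used; the name `cosine_similarity` and the choice to raise an error for zero vectors are assumptions made for this example.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 2], [2, 4]))    # proportional vectors: close to 1
print(cosine_similarity([1, 0], [0, 1]))    # orthogonal vectors: 0
print(cosine_similarity([1, 2], [-1, -2]))  # opposite vectors: close to -1
```

Because of floating-point rounding, results for proportional or opposite vectors may differ from 1 or -1 in the last digits.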

For text matching, the attribute vectors A and B are usually the term frequency vectors of the documents. Cosine similarity can be seen as a method of normalizing document length during comparison. In the case of information retrieval, the cosine similarity of two documents will range from 0 to 1, since the term frequencies cannot be negative. This remains true when using TF-IDF weights. The angle between two term frequency vectors cannot be greater than 90°.
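
A small sketch of the text-matching case, assuming whitespace tokenization and raw term counts (real systems typically use a proper tokenizer and TF-IDF weights); the function name `term_frequency_cosine` is hypothetical.

```python
import math
from collections import Counter

def term_frequency_cosine(text_a: str, text_b: str) -> float:
    """Cosine similarity between the term-frequency vectors of two texts."""
    tf_a = Counter(text_a.lower().split())
    tf_b = Counter(text_b.lower().split())
    if not tf_a or not tf_b:
        raise ValueError("both texts must contain at least one term")
    # Counter returns 0 for missing words, so absent terms add nothing.
    dot = sum(count * tf_b[word] for word, count in tf_a.items())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    return dot / (norm_a * norm_b)

short_doc = "the cat sat on the mat"
long_doc = " ".join([short_doc] * 3)   # same content, three times as long
other_doc = "the dog sat"

print(term_frequency_cosine(short_doc, long_doc))   # close to 1: length is normalized away
print(term_frequency_cosine(short_doc, other_doc))  # between 0 and 1
```

Repeating a document scales its term-frequency vector without changing its direction, which is why the first comparison is unaffected by length.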

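If the attribute vectors are normalized by subtracting the vector means (e.g., $A - \bar{A}$), the measure is called the centered cosine similarity and is equivalent to the Pearson correlation coefficient. For an example of centering, if $A = [A_1, A_2]^T$, then $\bar{A} = \left[\tfrac{A_1 + A_2}{2}, \tfrac{A_1 + A_2}{2}\right]^T$, so $A - \bar{A} = \left[\tfrac{A_1 - A_2}{2}, \tfrac{-A_1 + A_2}{2}\right]^T$.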
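
The equivalence with the Pearson correlation coefficient can be checked numerically. The sketch below compares the centered cosine similarity with the textbook sum-based formula for Pearson's r; the function names and sample data are hypothetical.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def centered_cosine(a, b):
    """Cosine similarity after subtracting each vector's mean from its components."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine_similarity([x - mean_a for x in a], [y - mean_b for y in b])

def pearson_r(a, b):
    """Pearson correlation via the raw-sums formula, computed independently."""
    n = len(a)
    sum_a, sum_b = sum(a), sum(b)
    sum_ab = sum(x * y for x, y in zip(a, b))
    sum_aa = sum(x * x for x in a)
    sum_bb = sum(y * y for y in b)
    numerator = n * sum_ab - sum_a * sum_b
    denominator = math.sqrt((n * sum_aa - sum_a ** 2) * (n * sum_bb - sum_b ** 2))
    return numerator / denominator

a = [1, 2, 3, 4, 5]
b = [2, 1, 4, 3, 6]
print(centered_cosine(a, b), pearson_r(a, b))  # the two values agree up to rounding
```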

Cosine distance

The term cosine distance[3] is commonly used for the complement of cosine similarity in positive space, that is

$$D_C(A, B) := 1 - S_C(A, B),$$

where $S_C(A, B)$ denotes the cosine similarity of A and B.

The cosine distance is not a true distance metric: it does not satisfy the triangle inequality (or, more formally, the Schwarz inequality), and it violates the coincidence axiom. One way to see this is to note that the cosine distance is half the squared Euclidean distance between the L2-normalized vectors, and squared Euclidean distance does not satisfy the triangle inequality either. To repair the triangle inequality property while maintaining the same ordering, it is necessary to convert to angular distance or Euclidean distance. Alternatively, the triangle inequality that does hold for angular distances can be expressed directly in terms of the cosines; see below.
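
A concrete counterexample makes both failures visible. The sketch below, with a hypothetical helper `cosine_distance`, uses three planar vectors at 0°, 45° and 90°, plus a pair of distinct but proportional vectors.

```python
import math

def cosine_distance(a, b):
    """One minus the cosine similarity of a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

a = (1.0, 0.0)   # angle 0 degrees
b = (1.0, 1.0)   # angle 45 degrees
c = (0.0, 1.0)   # angle 90 degrees

d_ac = cosine_distance(a, c)
d_ab = cosine_distance(a, b)
d_bc = cosine_distance(b, c)

# Triangle inequality would require d_ac <= d_ab + d_bc, which fails here.
print(d_ac, d_ab + d_bc)

# Coincidence axiom: distinct vectors can be at distance zero.
print(cosine_distance((1.0, 0.0), (2.0, 0.0)))
```

Here the direct distance between a and c is 1, while the detour through b costs only about 0.59.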

Angular distance and similarity

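The normalized angle, referred to as angular distance, between any two vectors A and B is a formal distance metric and can be calculated from the cosine similarity.[4] The complement of the angular distance metric can then be used to define an angular similarity function bounded between 0 and 1, inclusive.

When the vector elements may be positive or negative:

$$\text{angular distance} = D_\theta := \frac{\arccos(S_C(A, B))}{\pi}$$

$$\text{angular similarity} = S_\theta := 1 - D_\theta$$

Or, if the vector elements are always positive:

$$\text{angular distance} = D_\theta := \frac{2 \cdot \arccos(S_C(A, B))}{\pi}$$

$$\text{angular similarity} = S_\theta := 1 - D_\theta$$

Here $S_C(A, B)$ denotes the cosine similarity of A and B.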

Unfortunately, computing the inverse cosine (arccos) function is slow, making the use of the angular distance more computationally expensive than using the more common (but not metric) cosine distance above.
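
A minimal sketch of the angular distance, assuming the angle is obtained as the arccosine of the cosine similarity and divided by π (or by π/2 when all components are non-negative). The function names and the `nonnegative` flag are hypothetical; the clamp guards against rounding errors that push the cosine slightly outside [-1, 1].

```python
import math

def angular_distance(a, b, nonnegative=False):
    """Normalized angle between a and b, in [0, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cosine = dot / (norm_a * norm_b)
    cosine = max(-1.0, min(1.0, cosine))  # keep acos within its domain
    angle = math.acos(cosine)
    # Non-negative vectors span at most 90 degrees, so rescale by 2.
    scale = 2.0 if nonnegative else 1.0
    return scale * angle / math.pi

def angular_similarity(a, b, nonnegative=False):
    return 1.0 - angular_distance(a, b, nonnegative)

print(angular_distance((1.0, 0.0), (-1.0, 0.0)))                  # opposite vectors
print(angular_distance((1.0, 0.0), (0.0, 1.0)))                   # orthogonal vectors
print(angular_distance((1.0, 0.0), (0.0, 1.0), nonnegative=True)) # rescaled for non-negative data
```

Unlike the cosine distance, this quantity satisfies the triangle inequality; for the vectors at 0°, 45° and 90° the two short legs sum to exactly the long one.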

L2-normalized Euclidean distance

Another effective proxy for cosine distance can be obtained by L2 normalisation of the vectors, followed by the application of ordinary Euclidean distance. With this technique, each term in each vector is first divided by the magnitude of the vector, yielding a vector of unit length. The Euclidean distance between the end-points of any two such vectors is a proper metric that gives the same ordering as the cosine distance (a monotonic transformation of Euclidean distance; see below) for any comparison of vectors, and it avoids the potentially expensive trigonometric operations required to yield a proper metric. Once the normalisation has occurred, the vector space can be used with the full range of techniques available to any Euclidean space, notably standard dimensionality reduction techniques. This normalised distance is often used in deep learning algorithms.
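
The claim that both measures give the same ordering can be illustrated with a small ranking experiment. The query, the candidate vectors and the helper names below are all hypothetical.

```python
import math

def l2_normalize(v):
    """Divide each component by the vector's Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return tuple(x / norm for x in v)

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

query = (1.0, 2.0, 0.5)
candidates = [(2.0, 4.1, 1.0), (0.1, 3.0, -1.0), (-1.0, 0.5, 2.0), (3.0, 0.0, 0.2)]

# Rank candidates by cosine distance to the query.
by_cosine = sorted(candidates, key=lambda c: cosine_distance(query, c))

# Rank candidates by Euclidean distance after L2 normalisation.
unit_query = l2_normalize(query)
by_euclidean = sorted(candidates, key=lambda c: euclidean(unit_query, l2_normalize(c)))

print(by_cosine == by_euclidean)
```

In practice the normalisation is done once per vector, after which any Euclidean index or nearest-neighbour routine can be reused unchanged.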

Otsuka–Ochiai coefficient

In biology, there is a similar concept known as the Otsuka–Ochiai coefficient named after Yanosuke Otsuka (also spelled as Ōtsuka, Ootsuka or Otuka,[5] Japanese: 大塚 弥之助)[6] and Akira Ochiai (Japanese: 落合 明),[7] also known as the Ochiai–Barkman[8] or Ochiai coefficient,[9] which can be represented as:

$$K = \frac{|A \cap B|}{\sqrt{|A| \times |B|}}$$

Here, $A$ and $B$ are sets, and $|A|$ is the number of elements in $A$. If sets are represented as bit vectors, the Otsuka–Ochiai coefficient can be seen to be the same as the cosine similarity. It is identical to the score introduced by Godfrey Thomson.[10]
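
The correspondence between the set form and the bit-vector form can be sketched as follows. The coefficient is computed as the size of the intersection divided by the square root of the product of the set sizes; the function names and the example sets are hypothetical.

```python
import math

def ochiai(a: set, b: set) -> float:
    """Otsuka-Ochiai coefficient of two non-empty sets."""
    if not a or not b:
        raise ValueError("the coefficient is undefined for empty sets")
    return len(a & b) / math.sqrt(len(a) * len(b))

def cosine_similarity(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

set_a = {"cat", "dog", "fox"}
set_b = {"dog", "fox", "owl", "elk"}

# Encode each set as a bit vector over the union of elements.
universe = sorted(set_a | set_b)
bits_a = [1 if item in set_a else 0 for item in universe]
bits_b = [1 if item in set_b else 0 for item in universe]

print(ochiai(set_a, set_b), cosine_similarity(bits_a, bits_b))  # equal up to rounding
```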

In a recent book,[11] the coefficient is tentatively misattributed to another Japanese researcher with the family name Otsuka. The confusion arises because in 1957 Akira Ochiai attributes the coefficient only to Otsuka (no first name mentioned)[7] by citing an article by Ikuso Hamai (Japanese: 浜井 生三),[12] who in turn cites the original 1936 article by Yanosuke Otsuka.[6]

Properties

The most noteworthy property of cosine similarity is that it reflects a relative, rather than absolute, comparison of the individual vector dimensions. For any positive constant $a$ and vector $V$, the vectors $V$ and $aV$ are maximally similar. The measure is thus most appropriate for data where frequency is more important than absolute values; notably, term frequency in documents. However, more recent metrics with a grounding in information theory, such as Jensen–Shannon, SED, and triangular divergence, have been shown to have improved semantics in at least some contexts.[13]

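Cosine similarity is related to Euclidean distance as follows. Denote Euclidean distance by the usual $\|A - B\|$, and observe that

$$\|A - B\|^2 = (A - B) \cdot (A - B) = \|A\|^2 + \|B\|^2 - 2 (A \cdot B)$$

(polarization identity)

by expansion. When A and B are normalized to unit length, $\|A\|^2 = \|B\|^2 = 1$, so this expression is equal to

$$2 (1 - \cos(A, B)).$$

In short, the cosine distance can be expressed in terms of Euclidean distance as

$$D_C(A, B) = \frac{\|A - B\|^2}{2} \quad \text{when} \quad \|A\|^2 = \|B\|^2 = 1.$$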

This Euclidean distance between the normalized vectors is called the chord distance, because it is the length of the chord on the unit circle; the vectors are normalized so that the sum of their squared components equals one.
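
The relationship between cosine distance and the chord distance can be verified numerically. The sketch below normalizes two hypothetical vectors to unit length and compares one minus their cosine with half of the squared Euclidean distance between them.

```python
import math

def normalize(v):
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return tuple(x / norm for x in v)

def cosine_distance_unit(a, b):
    """Cosine distance of two unit vectors: one minus their dot product."""
    return 1.0 - sum(x * y for x, y in zip(a, b))

def half_squared_chord(a, b):
    """Half of the squared Euclidean (chord) distance between a and b."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / 2.0

a = normalize((3.0, 4.0))    # (0.6, 0.8)
b = normalize((5.0, 12.0))   # (5/13, 12/13)

print(cosine_distance_unit(a, b), half_squared_chord(a, b))  # both close to 2/65
```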

Null distribution: For data which can be negative as well as positive, the null distribution for cosine similarity is the distribution of the dot product of two independent random unit vectors. This distribution has a mean of zero and a variance of $1/n$ (where $n$ is the number of dimensions), and although the distribution is bounded between -1 and +1, as $n$ grows large the distribution is increasingly well-approximated by the normal distribution.[14][15] For other types of data, such as bitstreams, which only take the values 0 or 1, the null distribution takes a different form and may have a nonzero mean.[16]
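
The mean and variance of the null distribution can be checked with a small simulation. This is a sketch under stated assumptions: random unit vectors are drawn by normalizing vectors of independent standard normal components (which yields a uniform direction), and the dimension, sample count and seed are arbitrary illustrative choices.

```python
import math
import random

def random_unit_vector(n, rng):
    """Uniformly random direction in n dimensions via normalized Gaussians."""
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def sample_null_cosines(n, samples, seed=0):
    """Cosine similarities of independent pairs of random unit vectors."""
    rng = random.Random(seed)
    cosines = []
    for _ in range(samples):
        a = random_unit_vector(n, rng)
        b = random_unit_vector(n, rng)
        cosines.append(sum(x * y for x, y in zip(a, b)))
    return cosines

n = 50
cosines = sample_null_cosines(n, samples=2000)
mean = sum(cosines) / len(cosines)
variance = sum((c - mean) ** 2 for c in cosines) / len(cosines)

print(mean)             # expected to be near 0
print(variance, 1 / n)  # empirical variance next to the theoretical 1/n
```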

Triangle inequality for cosine similarity

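The ordinary triangle inequality for angles (i.e., arc lengths on a unit hypersphere) gives us that

$$|\angle AC - \angle CB| \le \angle AB \le \angle AC + \angle CB.$$

Because the cosine function decreases as an angle in [0, π] radians increases, the sense of these inequalities is reversed when we take the cosine of each value:

$$\cos(\angle AC - \angle CB) \ge \cos(\angle AB) \ge \cos(\angle AC + \angle CB).$$

Using the cosine addition and subtraction formulas, these two inequalities can be written in terms of the original cosines,

$$\cos(A, C) \cdot \cos(C, B) + \sqrt{\left(1 - \cos(A, C)^2\right) \cdot \left(1 - \cos(C, B)^2\right)} \ge \cos(A, B),$$

$$\cos(A, B) \ge \cos(A, C) \cdot \cos(C, B) - \sqrt{\left(1 - \cos(A, C)^2\right) \cdot \left(1 - \cos(C, B)^2\right)}.$$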

This form of the triangle inequality can be used to bound the minimum and maximum similarity of two objects A and B if their similarities to a reference object C are already known. This is used, for example, in metric data indexing, but has also been used to accelerate spherical k-means clustering[17] in the same way the Euclidean triangle inequality has been used to accelerate regular k-means.
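
The bounds can be sketched as a small helper that takes the two known similarities to the reference object C and returns the interval that must contain the similarity of A and B. The bound used is the product of the two cosines, plus or minus the square root of the product of one minus each squared cosine; the function name `cosine_bounds` and the test vectors are hypothetical.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def cosine_bounds(sim_ac, sim_cb):
    """Lower and upper bound on cos(A, B) given cos(A, C) and cos(C, B)."""
    base = sim_ac * sim_cb
    # max(0.0, ...) protects against tiny negative values caused by rounding.
    slack = math.sqrt(max(0.0, (1.0 - sim_ac ** 2) * (1.0 - sim_cb ** 2)))
    return base - slack, base + slack

a = (1.0, 0.0, 0.0)
b = (0.0, 1.0, 0.0)
c = (1.0, 1.0, 1.0)   # reference object

lower, upper = cosine_bounds(cosine_similarity(a, c), cosine_similarity(c, b))
print(lower, cosine_similarity(a, b), upper)  # the true similarity lies within the bounds
```

If the upper bound is already below a search threshold, the exact similarity between A and B never needs to be computed, which is the source of the speed-up in indexing and clustering.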

Soft cosine measure

A soft cosine (or "soft" similarity) between two vectors considers similarities between pairs of features.[18] The traditional cosine similarity considers the vector space model (VSM) features as independent or completely different, while the soft cosine measure proposes considering the similarity of features in VSM, which helps generalize the concept of cosine (and soft cosine) as well as the idea of (soft) similarity.

For example, in the field of natural language processing (NLP) the similarity among features is quite intuitive. Features such as words, n-grams, or syntactic n-grams[19] can be quite similar, though formally they are considered as different features in the VSM. For example, words “play” and “game” are different words and thus mapped to different points in VSM; yet they are semantically related. In case of n-grams or syntactic n-grams, Levenshtein distance can be applied (in fact, Levenshtein distance can be applied to words as well).

For calculating soft cosine, the matrix s is used to indicate similarity between features. It can be calculated through Levenshtein distance, WordNet similarity, or other similarity measures. Then we just multiply by this matrix.

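Given two N-dimension vectors $a$ and $b$, the soft cosine similarity is calculated as follows:

$$\operatorname{soft\_cosine}_1(a, b) = \frac{\sum_{i,j}^{N} s_{ij} a_i b_j}{\sqrt{\sum_{i,j}^{N} s_{ij} a_i a_j} \, \sqrt{\sum_{i,j}^{N} s_{ij} b_i b_j}},$$

where $s_{ij}$ = similarity(feature$_i$, feature$_j$).

If there is no similarity between features ($s_{ii} = 1$, $s_{ij} = 0$ for $i \ne j$), the given equation is equivalent to the conventional cosine similarity formula.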
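
A direct, quadratic-time sketch of the soft cosine is shown below: each of the three sums runs over all feature pairs weighted by the similarity matrix. The feature names, the similarity value 0.5 between "play" and "game", and the function name `soft_cosine` are assumptions made for illustration; the matrix is assumed symmetric and such that the sums under the square roots are positive.

```python
import math

def soft_cosine(a, b, s):
    """Soft cosine of a and b under the feature-similarity matrix s."""
    def weighted_sum(x, y):
        # Sum of s[i][j] * x[i] * y[j] over all feature pairs (i, j).
        return sum(s[i][j] * x[i] * y[j]
                   for i in range(len(x)) for j in range(len(y)))
    return weighted_sum(a, b) / (
        math.sqrt(weighted_sum(a, a)) * math.sqrt(weighted_sum(b, b)))

# Features: index 0 = "play", 1 = "game", 2 = "tax" (hypothetical vocabulary).
related = [[1.0, 0.5, 0.0],
           [0.5, 1.0, 0.0],
           [0.0, 0.0, 1.0]]
identity = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0]]

doc_play = [1, 0, 0]
doc_game = [0, 1, 0]

print(soft_cosine(doc_play, doc_game, identity))  # 0.0: plain cosine sees no overlap
print(soft_cosine(doc_play, doc_game, related))   # 0.5: related features contribute
```

With the identity matrix the function reduces to the ordinary cosine similarity, matching the special case described above.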

The time complexity of this measure is quadratic, which makes it applicable to real-world tasks. Note that the complexity can be reduced to subquadratic.[20] An efficient implementation of such soft cosine similarity is included in the Gensim open source library.

References

  1. Singhal, Amit (2001). "Modern Information Retrieval: A Brief Overview". Bulletin of the IEEE Computer Society Technical Committee on Data Engineering. 24 (4): 35–43.
  2. P.-N. Tan, M. Steinbach & V. Kumar, Introduction to Data Mining, Addison-Wesley (2005), ISBN 0-321-32136-7, chapter 8; page 500.
  3. Wolfram Research (2007). "CosineDistance – Wolfram Language & System Documentation Center". wolfram.com.
  4. "COSINE DISTANCE, COSINE SIMILARITY, ANGULAR COSINE DISTANCE, ANGULAR COSINE SIMILARITY". www.itl.nist.gov. Retrieved 2020-07-11.
  5. Omori, Masae (2004). "Geological idea of Yanosuke Otuka, who built the foundation of neotectonics (geoscientist)". Earth Science. 58 (4): 256–259. doi:10.15080/agcjchikyukagaku.58.4_256.
  6. Otsuka, Yanosuke (1936). "The faunal character of the Japanese Pleistocene marine Mollusca, as evidence of the climate having become colder during the Pleistocene in Japan". Bulletin of the Biogeographical Society of Japan. 6 (16): 165–170.
  7. Ochiai, Akira (1957). "Zoogeographical studies on the soleoid fishes found in Japan and its neighhouring regions-II". Bulletin of the Japanese Society of Scientific Fisheries. 22 (9): 526–530. doi:10.2331/suisan.22.526.
  8. Barkman, Jan J. (1958). Phytosociology and Ecology of Cryptogamic Epiphytes: Including a Taxonomic Survey and Description of Their Vegetation Units in Europe. Assen: Van Gorcum.
  9. Romesburg, H. Charles (1984). Cluster Analysis for Researchers. Belmont, California: Lifetime Learning Publications. p. 149.
  10. Thomson, Godfrey (1916). "A hierarchy without a general factor" (PDF). British Journal of Psychology. 8: 271–281.
  11. Howarth, Richard J. (2017). Dictionary of Mathematical Geosciences: With Historical Notes. Cham: Springer. p. 421. doi:10.1007/978-3-319-57315-1. ISBN 978-3-319-57314-4. S2CID 67081034. […] attributed by him to "Otsuka" [?A. Otsuka of the Dept. of Fisheries, Tohoku University].
  12. Hamai, Ikuso (1955). "Stratification of community by means of "community coefficient" (continued)". Japanese Journal of Ecology. 5 (1): 41–45. doi:10.18960/seitai.5.1_41.
  13. Connor, Richard (2016). A Tale of Four Metrics. Similarity Search and Applications. Tokyo: Springer. doi:10.1007/978-3-319-46759-7_16.
  14. Spruill, Marcus C. (2007). "Asymptotic distribution of coordinates on high dimensional spheres". Electronic Communications in Probability. 12: 234–247. doi:10.1214/ECP.v12-1294.
  15. "Distribution of dot products between two random unit vectors in R^D". CrossValidated.
  16. Giller, Graham L. (2012). "The Statistical Properties of Random Bitstreams and the Sampling Distribution of Cosine Similarity". Giller Investments Research Notes (20121024/1). doi:10.2139/ssrn.2167044. S2CID 123332455.
  17. Schubert, Erich; Lang, Andreas; Feher, Gloria (2021). Reyes, Nora; Connor, Richard; Kriege, Nils; Kazempour, Daniyal; Bartolini, Ilaria; Schubert, Erich; Chen, Jian-Jia (eds.). "Accelerating Spherical k-Means". Similarity Search and Applications. Lecture Notes in Computer Science. 13058. Cham: Springer International Publishing: 217–231. arXiv:2107.04074. doi:10.1007/978-3-030-89657-7_17. ISBN 978-3-030-89657-7. S2CID 235790358.
  18. Sidorov, Grigori; Gelbukh, Alexander; Gómez-Adorno, Helena; Pinto, David (29 September 2014). "Soft Similarity and Soft Cosine Measure: Similarity of Features in Vector Space Model". Computación y Sistemas. 18 (3): 491–504. doi:10.13053/CyS-18-3-2043. Retrieved 7 October 2014.
  19. Sidorov, Grigori; Velasquez, Francisco; Stamatatos, Efstathios; Gelbukh, Alexander; Chanona-Hernández, Liliana (2013). Advances in Computational Intelligence. Lecture Notes in Computer Science. Vol. 7630. LNAI 7630. pp. 1–11. doi:10.1007/978-3-642-37798-3_1. ISBN 978-3-642-37798-3.
  20. Novotný, Vít (2018). Implementation Notes for the Soft Cosine Measure. The 27th ACM International Conference on Information and Knowledge Management. Turin, Italy: Association for Computing Machinery. pp. 1639–1642. arXiv:1808.09407. doi:10.1145/3269206.3269317. ISBN 978-1-4503-6014-2.
