Orange: Similarity Hashing


Latest revision as of 09:54, 12 April 2020

Source: https://orange3-text.readthedocs.io/en/latest/widgets/similarityhashing.html


The Similarity Hashing widget computes hash values for documents.

Input

Corpus: A collection of documents.

Output

Corpus: The corpus with SimHash values added as attributes.

The Similarity Hashing widget transforms documents into similarity vectors. It uses the SimHash method by Moses Charikar.

Similarity-Hashing-stamped.png
  • Set the Simhash size (the number of attributes in the output, corresponding to bits of information) and the shingle length (the number of tokens in a shingle).
  • Tick Commit Automatically to output the data automatically. Alternatively, press Commit.
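
To make the two parameters concrete, here is a minimal pure-Python sketch of a SimHash-style computation. It is not the widget's actual implementation: the function names `shingles` and `simhash`, the whitespace tokenization, the use of SHA-256 as the underlying hash, and the shingle length of 2 are all assumptions chosen for illustration.

```python
import hashlib


def shingles(tokens, shingle_len):
    """Return overlapping runs of `shingle_len` consecutive tokens."""
    if len(tokens) < shingle_len:
        # Assumption: a document shorter than one shingle is one shingle.
        return [" ".join(tokens)] if tokens else []
    return [" ".join(tokens[i:i + shingle_len])
            for i in range(len(tokens) - shingle_len + 1)]


def simhash(text, sim_bits=64, shingle_len=2):
    """Return a list of `sim_bits` values (0 or 1) describing `text`."""
    if not 1 <= sim_bits <= 256:
        raise ValueError("this sketch supports 1 to 256 bits (SHA-256)")
    tokens = text.lower().split()  # assumption: naive whitespace tokens
    counts = [0] * sim_bits
    for shingle in shingles(tokens, shingle_len):
        digest = hashlib.sha256(shingle.encode("utf-8")).digest()
        value = int.from_bytes(digest, "big")
        # Each shingle votes +1 or -1 on every output bit.
        for bit in range(sim_bits):
            counts[bit] += 1 if (value >> bit) & 1 else -1
    # The sign of each accumulated vote becomes one output attribute.
    return [1 if c > 0 else 0 for c in counts]


bits = simhash("human machine interface for lab abc computer applications")
print(len(bits))
```

In this sketch, the size parameter fixes how many 0/1 attributes are produced per document, and the shingle length controls how many consecutive tokens are hashed together before voting.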

Example

We will use the deerwester.tab file to find similar documents in this small corpus. Load the data with the Corpus widget and pass it to the Similarity Hashing widget. We keep the default hash size and shingle length. The output of Similarity Hashing can be inspected in a Data Table widget. There are 64 new attributes available, corresponding to the Simhash size parameter.

Similarity-Hashing-Example.png
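
One common way to use such hash attributes for finding similar documents is to count the bits in which two documents differ (the Hamming distance); fewer differing bits suggests more similar documents. The sketch below assumes each document's hash is available as a list of 0/1 values. The helper names and the 8-bit toy rows are hypothetical and are not real output for deerwester.tab.

```python
from itertools import combinations


def hamming_distance(bits_a, bits_b):
    """Count the positions where two equal-length bit lists differ."""
    if len(bits_a) != len(bits_b):
        raise ValueError("hashes must have the same number of bits")
    return sum(a != b for a, b in zip(bits_a, bits_b))


def most_similar_pairs(hashes, top=3):
    """Return up to `top` (distance, name_a, name_b) tuples, closest first."""
    pairs = [(hamming_distance(hashes[a], hashes[b]), a, b)
             for a, b in combinations(sorted(hashes), 2)]
    return sorted(pairs)[:top]


# Hypothetical 8-bit hashes, shortened from 64 bits for readability.
rows = {
    "doc1": [1, 0, 1, 1, 0, 0, 1, 0],
    "doc2": [1, 0, 1, 0, 0, 0, 1, 0],
    "doc3": [0, 1, 0, 0, 1, 1, 0, 1],
}
print(most_similar_pairs(rows, top=1))  # [(1, 'doc1', 'doc2')]
```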

References

Charikar, M. (2002). Similarity estimation techniques from rounding algorithms. STOC '02: Proceedings of the thirty-fourth annual ACM symposium on Theory of computing, pp. 380-388.


