Difference between revisions of "Orange: Similarity Hashing"

From OnnoWiki
Revision as of 15:56, 24 January 2020

Source: https://orange3-text.readthedocs.io/en/latest/widgets/similarityhashing.html


Computes document hashes.

Inputs

   Corpus: A collection of documents.

Outputs

   Corpus: Corpus with SimHash values as attributes.

Similarity Hashing is a widget that transforms documents into similarity vectors. The widget uses the SimHash method introduced by Moses Charikar (2002).

Similarity-Hashing-stamped.png
   Set the Simhash size (the number of attributes on the output, which corresponds to the number of bits of information) and the shingle length (the number of tokens used in each shingle).
   If Commit Automatically is ticked, the widget outputs the data automatically. Alternatively, press Commit.
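
To make the two parameters more concrete, here is a minimal Python sketch of a SimHash-style computation. It is not the widget's actual implementation: the function names `shingles` and `simhash`, the use of SHA-256 as the underlying hash, and the default shingle length of 2 are all assumptions made for illustration. Only the hash size of 64 mirrors the default mentioned in the example on this page.

```python
import hashlib


def shingles(tokens, shingle_len):
    """Return overlapping shingles, each a tuple of `shingle_len` tokens.

    Hypothetical helper: a document shorter than `shingle_len` yields a
    single shingle holding all of its tokens.
    """
    if not tokens:
        return []
    if len(tokens) < shingle_len:
        return [tuple(tokens)]
    return [tuple(tokens[i:i + shingle_len])
            for i in range(len(tokens) - shingle_len + 1)]


def simhash(tokens, hash_size=64, shingle_len=2):
    """Return a list of `hash_size` bits (0 or 1) for one tokenized document.

    Assumption: SHA-256 is used as the per-shingle hash, so `hash_size`
    must not exceed 256 in this sketch.
    """
    if not 0 < hash_size <= 256:
        raise ValueError("hash_size must be between 1 and 256 in this sketch")
    counts = [0] * hash_size
    for shingle in shingles(tokens, shingle_len):
        digest = hashlib.sha256(" ".join(shingle).encode("utf-8")).digest()
        value = int.from_bytes(digest, "big")
        # Each shingle votes +1 or -1 on every output bit.
        for bit in range(hash_size):
            counts[bit] += 1 if (value >> bit) & 1 else -1
    # A bit is set when the positive votes outnumber the negative ones.
    return [1 if count > 0 else 0 for count in counts]


if __name__ == "__main__":
    doc = "human machine interface for lab abc computer applications".split()
    print(simhash(doc))
```

Each element of the returned list plays the role of one output attribute, so a larger hash size gives more attributes, while a longer shingle makes the hash more sensitive to word order.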

Example

We will use deerwester.tab to find similar documents in this small corpus. Load the data with Corpus and pass it to Similarity Hashing. We will keep the default hash size and shingle length. We can observe what the widget outputs in a Data Table. There are 64 new attributes available, corresponding to the Simhash size parameter.
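
Once each document is described by a row of hash bits, one common way to compare documents is the Hamming distance between rows (the number of positions where the bits differ). The sketch below is an illustration only: the helper names and the tiny 8-bit rows are made up for readability and are not values produced by the widget for deerwester.tab.

```python
from itertools import combinations


def hamming_distance(a, b):
    """Count the positions at which two equal-length bit rows differ."""
    if len(a) != len(b):
        raise ValueError("rows must have the same length")
    return sum(x != y for x, y in zip(a, b))


def closest_pair(rows):
    """Return (name_1, name_2, distance) for the two most similar rows.

    `rows` maps a document name to its list of hash bits.
    """
    return min(
        ((n1, n2, hamming_distance(rows[n1], rows[n2]))
         for n1, n2 in combinations(rows, 2)),
        key=lambda item: item[2],
    )


if __name__ == "__main__":
    # Hypothetical 8-bit rows; the widget's default output has 64 attributes.
    rows = {
        "doc_a": [1, 0, 1, 1, 0, 0, 1, 0],
        "doc_b": [1, 0, 1, 0, 0, 0, 1, 0],
        "doc_c": [0, 1, 0, 0, 1, 1, 0, 1],
    }
    print(closest_pair(rows))
```

A small distance suggests that two documents share many shingles, while a distance close to the hash size suggests that they have little in common.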

Similarity-Hashing-Example.png


References

Charikar, M. (2002). Similarity estimation techniques from rounding algorithms. STOC '02: Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing, pp. 380-388.


