Difference between revisions of "Orange: Duplicate Detection"

From OnnoWiki
Jump to navigation Jump to search
Line 4: Line 4:
 
Detect & remove duplicates from a corpus.
 
Detect & remove duplicates from a corpus.
  
Inputs
+
==Input==
  
    Distances: A distance matrix.
+
Distances: A distance matrix.
  
Outputs
+
==Output==
  
    Corpus Without Duplicated: Corpus with duplicates removed.
+
Corpus Without Duplicated: Corpus with duplicates removed.
    Duplicates Cluster: Documents belonging to selected cluster.
+
Duplicates Cluster: Documents belonging to selected cluster.
    Corpus: Corpus with appended cluster labels.
+
Corpus: Corpus with appended cluster labels.
  
 
Duplicate Detection uses clustering to find duplicates in the corpus. It is great with the Twitter widget for removing retweets and other similar documents.
 
Duplicate Detection uses clustering to find duplicates in the corpus. It is great with the Twitter widget for removing retweets and other similar documents.
Line 20: Line 20:
 
[[File:Duplicate-Detection-stamped.png|center|200px|thumb]]
 
[[File:Duplicate-Detection-stamped.png|center|200px|thumb]]
  
    Information on unique and duplicate documents.
+
* Information on unique and duplicate documents.
    Linkage used for clustering (Single, Average, Complete, Weighted and Ward).
+
* Linkage used for clustering (Single, Average, Complete, Weighted and Ward).
    Distance threshold sets the similarity cutoff. The lower the value, the more similar the data instances have to be to belong to the same cluster. You can also set the cutoff by dragging the vertical line in the plot.
+
* Distance threshold sets the similarity cutoff. The lower the value, the more similar the data instances have to be to belong to the same cluster. You can also set the cutoff by dragging the vertical line in the plot.
    Cluster labels can be appended as attributes, class or metas.
+
* Cluster labels can be appended as attributes, class or metas.
    List of clusters at the selected threshold. They are sorted by size by default. Click on the cluster to observe its content on the output.
+
* List of clusters at the selected threshold. They are sorted by size by default. Click on the cluster to observe its content on the output.
  
 
==Contoh==
 
==Contoh==

Revision as of 11:08, 29 January 2020

Sumber: https://orange3-text.readthedocs.io/en/latest/widgets/duplicatedetection.html


Detect & remove duplicates from a corpus.

Input

Distances: A distance matrix.

Output

Corpus Without Duplicated: Corpus with duplicates removed.
Duplicates Cluster: Documents belonging to selected cluster.
Corpus: Corpus with appended cluster labels.

Duplicate Detection uses clustering to find duplicates in the corpus. It is great with the Twitter widget for removing retweets and other similar documents.

To set the level of similarity, drag the line vertical line left or right in the visualization. The further left the line, the more similar the documents have to be in order to be considered duplicates. You can also set the threshold manually in the control area.

Duplicate-Detection-stamped.png
  • Information on unique and duplicate documents.
  • Linkage used for clustering (Single, Average, Complete, Weighted and Ward).
  • Distance threshold sets the similarity cutoff. The lower the value, the more similar the data instances have to be to belong to the same cluster. You can also set the cutoff by dragging the vertical line in the plot.
  • Cluster labels can be appended as attributes, class or metas.
  • List of clusters at the selected threshold. They are sorted by size by default. Click on the cluster to observe its content on the output.

Contoh

This simple example uses iris data to find identical data instances. Load iris with the File widget and pass it to Distances. In Distances, use Euclidean distance for computing the distance matrix. Pass distances to Duplicate Detection.

It looks like cluster C147 contain three duplicate entries. Let us select it in the widget and observe it in a Data Table. Remember to set the output to Duplicates Cluster. IThe three data instances are identical. To use the data set without duplicates, use the first output, Corpus Without Duplicates.

The same procedure can be used also for corpora. Remember to use the Bag of Words between Corpus and Distances.

Duplicate-Detection-Example.png


Referensi

Pranala Menarik