Difference between revisions of "Orange: The Data"
Onnowpurbo (talk | contribs) |
Revision as of 10:33, 9 January 2020
Source: https://docs.biolab.si//3/data-mining-library/tutorial/data.html
This section describes how to load data in Orange. We also show how to explore the data, compute some basic statistics, and sample the data.
Data Input
Orange can read files in its native tab-delimited format, or load data from any of the major standard spreadsheet file types, such as CSV and Excel. The native format starts with a header row with feature (column) names. The second header row gives the attribute type, which can be continuous, discrete, time, or string. The third header row contains meta information that identifies dependent features (class), irrelevant features (ignore), or meta features (meta). A more detailed specification is available in Loading and saving data (io) (https://docs.biolab.si//3/data-mining-library/reference/data.io.html). Here are the first few lines of the dataset lenses.tab:
age       prescription  astigmatic  tear_rate  lenses
discrete  discrete      discrete    discrete   discrete
                                               class
young     myope         no          reduced    none
young     myope         no          normal     soft
young     myope         yes         reduced    none
young     myope         yes         normal     hard
young     hypermetrope  no          reduced    none
Values are tab-delimited. This dataset has four attributes (age of the patient, spectacle prescription, presence of astigmatism, and tear production rate) and an associated three-valued dependent variable encoding the lens prescription for the patient (hard contact lenses, soft contact lenses, no lenses). Feature descriptions may be abbreviated to one letter, so the header of this dataset could also read:
age       prescription  astigmatic  tear_rate  lenses
d         d             d           d          d
                                               c
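The three header rows can be read with the standard library alone. The following is a minimal sketch, not Orange's own loader: `parse_header` and the two abbreviation tables are hypothetical names introduced for illustration, and the one-letter forms other than `d` (type row) and `c` (flag row) are assumed by analogy with the header shown above.

```python
import csv
import io

# One-letter forms. Only "d" (type row) and "c" (flag row) appear in the text;
# the remaining letters are assumptions made for this sketch.
TYPE_ABBREVIATIONS = {"d": "discrete", "c": "continuous", "s": "string", "t": "time"}
FLAG_ABBREVIATIONS = {"c": "class", "m": "meta", "i": "ignore"}


def parse_header(text):
    """Return (columns, rows); columns is a list of (name, type, flag) tuples."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    names = next(reader)
    types = next(reader)
    flags = next(reader)
    # Trailing empty cells may be missing, so pad to the number of columns.
    types += [""] * (len(names) - len(types))
    flags += [""] * (len(names) - len(flags))
    columns = [
        (name, TYPE_ABBREVIATIONS.get(t, t), FLAG_ABBREVIATIONS.get(f, f))
        for name, t, f in zip(names, types, flags)
    ]
    return columns, list(reader)


sample = (
    "age\tprescription\tastigmatic\ttear_rate\tlenses\n"
    "d\td\td\td\td\n"
    "\t\t\t\tc\n"
    "young\tmyope\tno\treduced\tnone\n"
    "young\tmyope\tno\tnormal\tsoft\n"
)
columns, rows = parse_header(sample)
for name, kind, flag in columns:
    # An empty flag means the column is an ordinary attribute.
    print(name, kind, flag or "attribute")
```

An empty flag cell marks an ordinary attribute, which is why the third row of lenses.tab is blank except under the class column.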
The remaining rows of the file give the data. Note that there are 5 instances in the excerpt above. For the full dataset, check out or download lenses.tab to a target directory. You can also skip this step, as Orange comes preloaded with several demo datasets, lenses being one of them. Now, open a Python shell, import Orange and load the data:
>>> import Orange
>>> data = Orange.data.Table("lenses")
>>>
Note that for the file name no suffix is needed, as Orange checks if any files in the current directory are of a readable type. The call to Orange.data.Table creates an object called data that holds your dataset and information about the lenses domain:
>>> data.domain.attributes
(DiscreteVariable('age', values=['pre-presbyopic', 'presbyopic', 'young']),
 DiscreteVariable('prescription', values=['hypermetrope', 'myope']),
 DiscreteVariable('astigmatic', values=['no', 'yes']),
 DiscreteVariable('tear_rate', values=['normal', 'reduced']))
>>> data.domain.class_var
DiscreteVariable('lenses', values=['hard', 'none', 'soft'])
>>> for d in data[:3]:
...     print(d)
...
[young, myope, no, reduced | none]
[young, myope, no, normal | soft]
[young, myope, yes, reduced | none]
>>>
The following script wraps up everything we have done so far and lists the data instances with a soft prescription:
import Orange

data = Orange.data.Table("lenses")
print("Attributes:", ", ".join(x.name for x in data.domain.attributes))
print("Class:", data.domain.class_var.name)
print("Data instances", len(data))

target = "soft"
print("Data instances with %s prescriptions:" % target)
atts = data.domain.attributes
for d in data:
    if d.get_class() == target:
        print(" ".join(["%14s" % str(d[a]) for a in atts]))
Note that data is an object that holds both the data and information on the domain. We show above how to access attribute and class names, but the domain holds much more, including the feature types and the sets of values of categorical features.
Saving the Data
Data objects can be saved to a file:
>>> data.save("new_data.tab")
>>>
This time, we have to provide the file extension to specify the output format. The extension for Orange’s native data format is “.tab”. The following code saves only the data items with myope prescription:
import Orange

data = Orange.data.Table("lenses")
myope_subset = [d for d in data if d["prescription"] == "myope"]
new_data = Orange.data.Table(data.domain, myope_subset)
new_data.save("lenses-subset.tab")
We have created a new data table by passing the information on the structure of the data (data.domain) and a subset of data instances.
Exploration of the Data Domain
A data table stores information on data instances as well as on the data domain. The domain holds the names of attributes and optional classes, their types and, if categorical, their value names. The following code:
import Orange

data = Orange.data.Table("imports-85.tab")
n = len(data.domain.attributes)
n_cont = sum(1 for a in data.domain.attributes if a.is_continuous)
n_disc = sum(1 for a in data.domain.attributes if a.is_discrete)
print("%d attributes: %d continuous, %d discrete" % (n, n_cont, n_disc))

print(
    "First three attributes:",
    ", ".join(data.domain.attributes[i].name for i in range(3)),
)
print("Class:", data.domain.class_var.name)
outputs:
25 attributes: 14 continuous, 11 discrete
First three attributes: symboling, normalized-losses, make
Class: price
Orange’s objects often behave like Python lists and dictionaries, and can be indexed or accessed through feature names:
print("First attribute:", data.domain[0].name)
name = "fuel-type"
print("Values of attribute '%s': %s" % (name, ", ".join(data.domain[name].values)))
The output of the above code is:
First attribute: symboling
Values of attribute 'fuel-type': diesel, gas
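How a single container can answer to both an integer position and a feature name can be illustrated with a small stand-in class. `MiniDomain` and `Variable` below are hypothetical and do not reflect how Orange implements its domain; the sketch only mimics the two kinds of lookup used above.

```python
from collections import namedtuple

# Hypothetical stand-in for a feature descriptor: a name plus its value names.
Variable = namedtuple("Variable", ["name", "values"])


class MiniDomain:
    """Variables reachable by position (like a list) or by name (like a dict)."""

    def __init__(self, variables):
        self._variables = list(variables)
        # Index the same objects by name for dictionary-style access.
        self._by_name = {var.name: var for var in self._variables}

    def __len__(self):
        return len(self._variables)

    def __getitem__(self, key):
        if isinstance(key, str):
            return self._by_name[key]   # KeyError for an unknown name
        return self._variables[key]     # IndexError for a bad position


domain = MiniDomain([
    Variable("symboling", ()),  # value names omitted in this sketch
    Variable("fuel-type", ("diesel", "gas")),
])
print("First attribute:", domain[0].name)
print("Values of 'fuel-type':", ", ".join(domain["fuel-type"].values))
```

Both lookups return the same descriptor object, so a feature found by name can be used anywhere a feature found by position would be.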
Data Instances
A data table stores data instances (or examples). These can be indexed or traversed like any Python list. Data instances can be treated as vectors whose elements are accessed by index or by feature name.
import Orange

data = Orange.data.Table("iris")
print("First three data instances:")
for d in data[:3]:
    print(d)

print("25-th data instance:")
print(data[24])

name = "sepal width"
print("Value of '%s' for the first instance:" % name, data[0][name])
print("The 3rd value of the 25th data instance:", data[24][2])
The script above displays the following output:
First three data instances:
[5.100, 3.500, 1.400, 0.200 | Iris-setosa]
[4.900, 3.000, 1.400, 0.200 | Iris-setosa]
[4.700, 3.200, 1.300, 0.200 | Iris-setosa]
25-th data instance:
[4.800, 3.400, 1.900, 0.200 | Iris-setosa]
Value of 'sepal width' for the first instance: 3.500
The 3rd value of the 25th data instance: 1.900
The Iris dataset we have used above has four continuous attributes. Here’s a script that computes their mean:
import Orange

average = lambda x: sum(x) / len(x)

data = Orange.data.Table("iris")
print("%-15s %s" % ("Feature", "Mean"))
for x in data.domain.attributes:
    print("%-15s %.2f" % (x.name, average([d[x] for d in data])))
The above script also illustrates indexing of data instances with objects that store features; in d[x] variable x is an Orange object. Here’s the output:
Feature         Mean
sepal length    5.84
sepal width     3.05
petal length    3.76
petal width     1.20
A slightly more complicated, but also more interesting, script computes per-class averages:
import Orange

average = lambda xs: sum(xs) / float(len(xs))

data = Orange.data.Table("iris")
targets = data.domain.class_var.values
print("%-15s %s" % ("Feature", " ".join("%15s" % c for c in targets)))
for a in data.domain.attributes:
    dist = [
        "%15.2f" % average([d[a] for d in data if d.get_class() == c])
        for c in targets
    ]
    print("%-15s" % a.name, " ".join(dist))
Of the four features, petal width and length look quite discriminative for the type of iris:
Feature             Iris-setosa Iris-versicolor  Iris-virginica
sepal length               5.01            5.94            6.59
sepal width                3.42            2.77            2.97
petal length               1.46            4.26            5.55
petal width                0.24            1.33            2.03
Finally, here is a short script that computes the class distribution for another dataset:
import Orange
from collections import Counter

data = Orange.data.Table("lenses")
print(Counter(str(d.get_class()) for d in data))
Orange Datasets and NumPy
Orange datasets are actually wrapped NumPy arrays. Wrapping is performed to retain the information about the feature names and values, and NumPy arrays are used for speed and compatibility with different machine learning toolboxes, like scikit-learn, on which Orange relies. Let us display the values of these arrays for the first three data instances of the iris dataset:
>>> data = Orange.data.Table("iris")
>>> data.X[:3]
array([[ 5.1,  3.5,  1.4,  0.2],
       [ 4.9,  3. ,  1.4,  0.2],
       [ 4.7,  3.2,  1.3,  0.2]])
>>> data.Y[:3]
array([ 0.,  0.,  0.])
Notice that we access the arrays for attributes and class separately, using data.X and data.Y. Average values of attributes can then be computed efficiently by:
>>> import numpy as np
>>> np.mean(data.X, axis=0)
array([ 5.84333333,  3.054     ,  3.75866667,  1.19866667])
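Because the class values live in a separate array, per-class statistics reduce to boolean masking. The sketch below uses plain NumPy on small made-up arrays that stand in for data.X and data.Y; the numbers are invented for illustration and are not taken from iris.

```python
import numpy as np

# Toy stand-ins for data.X and data.Y (values invented for illustration).
X = np.array([[5.0, 3.0], [4.0, 3.5], [6.0, 2.5], [7.0, 3.5]])
Y = np.array([0.0, 0.0, 1.0, 1.0])

# For every class index, select the matching rows and average each column.
class_means = {int(c): X[Y == c].mean(axis=0) for c in np.unique(Y)}

for c, means in class_means.items():
    print(c, means)
```

The keys are class indices, as stored in the class array; mapping them back to value names would use the class variable's list of values.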
We can also construct a (classless) dataset from a NumPy array:
>>> X = np.array([[1,2], [4,5]])
>>> data = Orange.data.Table(X)
>>> data.domain
[Feature 1, Feature 2]
If we want to provide meaningful names to attributes, we need to construct an appropriate data domain:
>>> domain = Orange.data.Domain([Orange.data.ContinuousVariable("length"),
...                              Orange.data.ContinuousVariable("width")])
>>> data = Orange.data.Table(domain, X)
>>> data.domain
[length, width]
Here is another example, this time with the construction of a dataset that includes a numerical class and different types of attributes:
import numpy as np
import Orange

size = Orange.data.DiscreteVariable("size", ["small", "big"])
height = Orange.data.ContinuousVariable("height")
shape = Orange.data.DiscreteVariable("shape", ["circle", "square", "oval"])
speed = Orange.data.ContinuousVariable("speed")

# The first argument lists the attributes, the second is the class variable.
domain = Orange.data.Domain([size, height, shape], speed)

# Discrete values are given as indices into the variable's list of values.
X = np.array([[1, 3.4, 0], [0, 2.7, 2], [1, 1.4, 1]])
Y = np.array([42.0, 52.2, 13.4])

data = Orange.data.Table(domain, X, Y)
print(data)
Running this script yields:
[[big, 3.400, circle | 42.000],
 [small, 2.700, oval | 52.200],
 [big, 1.400, square | 13.400]
]
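In the arrays above, discrete values are written as numbers: the 1 in the size column is printed as big because it is the index of "big" in that variable's list of values. The plain-Python sketch below illustrates the mapping; `encode_row` and `decode_row` are hypothetical helpers, not Orange functions.

```python
size_values = ["small", "big"]
shape_values = ["circle", "square", "oval"]

# One entry per column; None marks a continuous column, stored as-is.
column_values = [size_values, None, shape_values]


def encode_row(row):
    """Replace each discrete value with the index of its name, as a float."""
    return [
        float(value) if names is None else float(names.index(value))
        for value, names in zip(row, column_values)
    ]


def decode_row(encoded):
    """Turn stored indices back into value names for display."""
    return [
        number if names is None else names[int(number)]
        for number, names in zip(encoded, column_values)
    ]


encoded = encode_row(["big", 3.4, "circle"])
decoded = decode_row([0.0, 2.7, 2.0])
print(encoded)
print(decoded)
```

The order of names in the value list therefore matters: swapping "small" and "big" would change the meaning of every stored index.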
Meta Attributes
Often, we wish to include descriptive fields in the data that will not be used in any computation (distance estimation, modeling), but will serve for identification or additional information. These are called meta attributes, and are marked with meta in the third header row:
name      hair  eggs  milk  backbone  legs  type
string    d     d     d     d         d     d
meta                                        class
aardvark  1     0     1     1         4     mammal
antelope  1     0     1     1         4     mammal
bass      0     1     0     1         0     fish
bear      1     0     1     1         4     mammal
Values of meta attributes and all other (non-meta) attributes are treated similarly in Orange, but stored in separate NumPy arrays:
>>> data = Orange.data.Table("zoo")
>>> data[0]["name"]
>>> data[0]["type"]
>>> for d in data:
...     print("{}/{}: {}".format(d["name"], d["type"], d["legs"]))
...
aardvark/mammal: 4
antelope/mammal: 4
bass/fish: 0
bear/mammal: 4
>>> data.X
array([[ 1.,  0.,  1.,  1.,  2.],
       [ 1.,  0.,  1.,  1.,  2.],
       [ 0.,  1.,  0.,  1.,  0.],
       [ 1.,  0.,  1.,  1.,  2.]])
>>> data.metas
array([['aardvark'],
       ['antelope'],
       ['bass'],
       ['bear']], dtype=object)
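The separation into attribute, class, and meta storage follows directly from the third header row. The following stdlib sketch shows the idea on two rows of the excerpt above; `split_by_role` is a hypothetical helper, and values are kept as strings rather than converted to the numeric form Orange stores.

```python
# Third header row of the zoo excerpt: "meta" for name, "class" for type,
# and an empty flag for ordinary attributes.
flags = ["meta", "", "", "", "", "", "class"]
rows = [
    ["aardvark", "1", "0", "1", "1", "4", "mammal"],
    ["bass", "0", "1", "0", "1", "0", "fish"],
]


def split_by_role(rows, flags):
    """Distribute each row's values over attribute, class, and meta lists."""
    X, Y, metas = [], [], []
    for row in rows:
        X.append([v for v, f in zip(row, flags) if f == ""])
        Y.append([v for v, f in zip(row, flags) if f == "class"])
        metas.append([v for v, f in zip(row, flags) if f == "meta"])
    return X, Y, metas


X, Y, metas = split_by_role(rows, flags)
print(X)
print(Y)
print(metas)
```

Keeping metas apart means that code operating on the attribute values never sees identifiers such as the animal's name.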
Meta attributes may be passed to Orange.data.Table after providing arrays for attribute and class values:
from Orange.data import Table, Domain
from Orange.data import ContinuousVariable, DiscreteVariable, StringVariable
import numpy as np

X = np.array([[2.2, 1625], [0.3, 163]])
Y = np.array([0, 1])
M = np.array([["houston", 10], ["ljubljana", -1]])

domain = Domain(
    [ContinuousVariable("population"), ContinuousVariable("area")],
    [DiscreteVariable("snow", ("no", "yes"))],
    [StringVariable("city"), StringVariable("temperature")],
)
data = Table(domain, X, Y, M)
print(data)
The script outputs:
[[2.200, 1625.000 | no] {houston, 10},
 [0.300, 163.000 | yes] {ljubljana, -1}
]
To construct a classless domain we could pass None for the class values.
Missing Values
Consider the following exploration of the dataset on votes of the US senate:
>>> import numpy as np
>>> data = Orange.data.Table("voting.tab")
>>> data[2]
[?, y, y, ?, y, ... | democrat]
>>> np.isnan(data[2][0])
True
>>> np.isnan(data[2][1])
False
This particular data instance has missing values (represented with ‘?’) for the first and the fourth attribute. In the original dataset file, the missing values are, by default, represented with a blank space. We can now examine each attribute and report the proportion of data instances for which this feature was undefined:
import numpy as np
import Orange

data = Orange.data.Table("voting.tab")
for x in data.domain.attributes:
    n_miss = sum(1 for d in data if np.isnan(d[x]))
    print("%4.1f%% %s" % (100.0 * n_miss / len(data), x.name))
First three lines of the output of this script are:
 2.8% handicapped-infants
11.0% water-project-cost-sharing
 2.5% adoption-of-the-budget-resolution
A one-liner that reports the number of data instances with at least one missing value is:
>>> sum(any(np.isnan(d[x]) for x in data.domain.attributes) for d in data)
203
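The same two counts can be reproduced on plain Python lists, which makes the logic easy to check by hand. This is a sketch on invented values, using `math.isnan` from the standard library in place of `np.isnan`; missing entries are represented by NaN, as in the examples above.

```python
import math

nan = float("nan")

# Four made-up instances with three features each; nan marks a missing value.
rows = [
    [nan, 1.0, 1.0],
    [0.0, 1.0, nan],
    [0.0, 0.0, 1.0],
    [1.0, nan, nan],
]
n_features = len(rows[0])

# Share of instances in which each feature is missing.
missing_share = [
    sum(math.isnan(row[j]) for row in rows) / len(rows)
    for j in range(n_features)
]

# Number of instances with at least one missing value.
rows_with_missing = sum(any(math.isnan(v) for v in row) for row in rows)

print(missing_share)
print(rows_with_missing)
```

A test such as `value == nan` cannot be used here, because NaN compares unequal to everything, including itself; that is why both the text and this sketch rely on an `isnan` function.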
Data Selection and Sampling
Besides the name of the data file, Orange.data.Table can accept the data domain and a list of data items and returns a new dataset. This is useful for any data subsetting:
import Orange

data = Orange.data.Table("iris.tab")
print("Dataset instances:", len(data))
subset = Orange.data.Table(data.domain, [d for d in data if d["petal length"] > 3.0])
print("Subset size:", len(subset))
The code outputs:
Dataset instances: 150
Subset size: 99
The subset inherits the data description (domain) from the original dataset. Changing the domain requires setting up a new domain descriptor, which is useful for any kind of feature selection:
import Orange

data = Orange.data.Table("iris.tab")
new_domain = Orange.data.Domain(
    list(data.domain.attributes[:2]), data.domain.class_var
)
new_data = Orange.data.Table(new_domain, data)

print(data[0])
print(new_data[0])
We could also construct a random sample of the dataset:
>>> import random
>>> sample = Orange.data.Table(data.domain, random.sample(data, 3))
>>> sample
[[6.000, 2.200, 4.000, 1.000 | Iris-versicolor],
 [4.800, 3.100, 1.600, 0.200 | Iris-setosa],
 [6.300, 3.400, 5.600, 2.400 | Iris-virginica]
]
or randomly sample the attributes:
>>> atts = random.sample(data.domain.attributes, 2)
>>> domain = Orange.data.Domain(atts, data.domain.class_var)
>>> new_data = Orange.data.Table(domain, data)
>>> new_data[0]
[5.100, 1.400 | Iris-setosa]
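Calls to `random.sample` draw a different sample on every run. When a repeatable draw is needed, or when the instances left out of the sample matter too, one option is to shuffle indices with a seeded generator. The sketch below uses only the standard library; the integers stand in for data instances, and the seed and sample size are arbitrary choices for illustration.

```python
import random

rng = random.Random(42)  # fixed, arbitrary seed so the draw is repeatable

rows = list(range(150))  # stand-ins for 150 data instances
indices = list(range(len(rows)))
rng.shuffle(indices)

n_sample = 30
sample = [rows[i] for i in indices[:n_sample]]  # the random sample
rest = [rows[i] for i in indices[n_sample:]]    # everything not sampled

print(len(sample), len(rest))
```

The two index lists could then be used to build two tables over the same domain, in the same way the list comprehension builds a subset earlier in this section.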