Orange: The Data

Source: https://docs.biolab.si//3/data-mining-library/tutorial/data.html

This section describes how to load data in Orange. We also show how to explore the data, compute some basic statistics, and sample the data.

Data Input

Orange can read files in its native tab-delimited format, or load data from any of the major standard spreadsheet file types, such as CSV and Excel. The native format starts with a header row with feature (column) names. The second header row gives the attribute type, which can be continuous, discrete, time, or string. The third header row contains meta information that identifies the dependent feature (class), irrelevant features (ignore), or meta features (meta). A more detailed specification is available in Loading and saving data (io). Here are the first few lines of the dataset lenses.tab:

age       prescription  astigmatic    tear_rate     lenses
discrete  discrete      discrete      discrete      discrete
                                                    class
young     myope         no            reduced       none
young     myope         no            normal        soft
young     myope         yes           reduced       none
young     myope         yes           normal        hard
young     hypermetrope  no            reduced       none

Values are tab-delimited. This dataset has four attributes (age of the patient, spectacle prescription, notion on astigmatism, and information on tear production rate) and an associated three-valued dependent variable that encodes the lens prescription for the patient (hard contact lenses, soft contact lenses, no lenses). Feature descriptions can also use a single letter, so the header of this dataset could also read:

age       prescription  astigmatic    tear_rate     lenses
d         d             d             d             d
                                                    c
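To make the three header rows concrete, the sketch below parses a small tab-delimited string with Python's csv module. It is not how Orange reads files; read_header, TYPE_ABBREV, and ROLE_ABBREV are hypothetical names, and only the two one-letter abbreviations shown above (d and c) are expanded, with anything else passed through unchanged.

```python
import csv
import io

# Sample text in the three-header-row layout (tab-separated).
SAMPLE = (
    "age\tprescription\tastigmatic\ttear_rate\tlenses\n"
    "d\td\td\td\td\n"
    "\t\t\t\tc\n"
    "young\tmyope\tno\treduced\tnone\n"
    "young\tmyope\tno\tnormal\tsoft\n"
)

# Hypothetical lookup tables covering only the abbreviations shown above.
TYPE_ABBREV = {"d": "discrete"}
ROLE_ABBREV = {"c": "class"}


def read_header(text):
    """Return (columns, rows); columns is a list of (name, type, role)."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    names = next(reader)   # first header row: feature names
    types = next(reader)   # second header row: feature types
    roles = next(reader)   # third header row: class / meta / ignore flags
    columns = []
    for name, typ, role in zip(names, types, roles):
        typ = TYPE_ABBREV.get(typ, typ)
        # An empty flag means the column is an ordinary attribute.
        role = ROLE_ABBREV.get(role, role) or "attribute"
        columns.append((name, typ, role))
    return columns, list(reader)


columns, rows = read_header(SAMPLE)
for name, typ, role in columns:
    print(name, typ, role)
print(len(rows), "data rows")
```

Here the last column is reported with the role class and the other four as plain attributes.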

In lenses.tab, the rows below the header give the data. Note that the excerpt above contains 5 instances. For the full dataset, check out or download lenses.tab to a target directory. You can also skip this step, as Orange is installed with several demo datasets, lenses.tab being one of them. Now open a Python shell, import Orange, and load the data:

>>> import Orange
>>> data = Orange.data.Table("lenses")
>>>

Note that the file name does not need a suffix, because Orange checks whether the current directory contains a file of a readable type. The call to Orange.data.Table creates an object called data that holds the dataset and information about the lenses domain:

>>> data.domain.attributes
(DiscreteVariable('age', values=['pre-presbyopic', 'presbyopic', 'young']),
 DiscreteVariable('prescription', values=['hypermetrope', 'myope']),
 DiscreteVariable('astigmatic', values=['no', 'yes']),
 DiscreteVariable('tear_rate', values=['normal', 'reduced']))
>>> data.domain.class_var
DiscreteVariable('lenses', values=['hard', 'none', 'soft'])
>>> for d in data[:3]:
...     print(d)
...
[young, myope, no, reduced | none]
[young, myope, no, normal | soft]
[young, myope, yes, reduced | none]
>>>

The following script wraps up everything we have done so far and lists the data instances with a soft prescription:

import Orange

data = Orange.data.Table("lenses")
print("Attributes:", ", ".join(x.name for x in data.domain.attributes))
print("Class:", data.domain.class_var.name)
print("Data instances", len(data))

target = "soft"
print("Data instances with %s prescriptions:" % target)
atts = data.domain.attributes
for d in data:
    if d.get_class() == target:
        print(" ".join(["%14s" % str(d[a]) for a in atts]))

Note that data is an object that holds both the data and information on the domain. We showed above how to access attribute and class names, but there is much more information there, including feature types, sets of values for categorical features, and more.

Saving the Data

Data objects can be saved to a file:

>>> data.save("new_data.tab")
>>>

This time, we have to provide the file extension to specify the output format. The extension for the native Orange data format is ".tab". The following code saves only the data items with a myope prescription:

import Orange

data = Orange.data.Table("lenses")
myope_subset = [d for d in data if d["prescription"] == "myope"]
new_data = Orange.data.Table(data.domain, myope_subset)
new_data.save("lenses-subset.tab")

We have created a new data table by passing the information on the structure of the data (data.domain) and a subset of data instances.
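For readers who want to see what such a saved file looks like on disk, here is a minimal stand-in that writes a filtered subset in the three-header-row layout using only the standard library. It does not reproduce Orange's own save logic; write_tab is a hypothetical helper, and the rows are a small hand-made excerpt rather than the full dataset.

```python
import csv
import io


def write_tab(stream, columns, rows):
    """Write names, types, and roles as three header rows, then the data."""
    writer = csv.writer(stream, delimiter="\t", lineterminator="\n")
    writer.writerow([name for name, _, _ in columns])
    writer.writerow([typ for _, typ, _ in columns])
    writer.writerow([role for _, _, role in columns])
    writer.writerows(rows)


# (name, type, role) for each column; an empty role marks a plain attribute.
columns = [
    ("prescription", "discrete", ""),
    ("tear_rate", "discrete", ""),
    ("lenses", "discrete", "class"),
]
rows = [
    ["myope", "reduced", "none"],
    ["hypermetrope", "normal", "soft"],
    ["myope", "normal", "hard"],
]

# Keep only the rows with a myope prescription, as in the script above.
myope_rows = [r for r in rows if r[0] == "myope"]

buffer = io.StringIO()  # in-memory file, so nothing is written to disk
write_tab(buffer, columns, myope_rows)
text = buffer.getvalue()
print(text)
```

Writing to an in-memory buffer keeps the example free of side effects; passing an open file object instead would produce a file on disk.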

Exploration of the Data Domain

A data table stores information on data instances as well as on the data domain. The domain holds the names of the attributes, the optional class, their types and, if categorical, the value names. The following code:

import Orange

data = Orange.data.Table("imports-85.tab")
n = len(data.domain.attributes)
n_cont = sum(1 for a in data.domain.attributes if a.is_continuous)
n_disc = sum(1 for a in data.domain.attributes if a.is_discrete)
print("%d attributes: %d continuous, %d discrete" % (n, n_cont, n_disc))

print(
    "First three attributes:",
    ", ".join(data.domain.attributes[i].name for i in range(3)),
)

print("Class:", data.domain.class_var.name)

outputs:

25 attributes: 14 continuous, 11 discrete
First three attributes: symboling, normalized-losses, make
Class: price

Orange objects often behave like Python lists or dictionaries, and can be indexed or accessed through feature names:

print("First attribute:", data.domain[0].name)
name = "fuel-type"
print("Values of attribute '%s': %s" % (name, ", ".join(data.domain[name].values)))

The output of the above code is:

First attribute: symboling
Values of attribute 'fuel-type': diesel, gas
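The dual list-like and dictionary-like access can be imitated in a few lines of plain Python. The class below is a toy written for illustration, not Orange's Domain implementation; MiniDomain and its index method are hypothetical names, and the feature names are borrowed from the example above.

```python
class MiniDomain:
    """Toy container that resolves a feature by position or by name."""

    def __init__(self, names):
        self._names = list(names)
        # Map each name to its position for dictionary-style lookup.
        self._index = {name: i for i, name in enumerate(self._names)}

    def __getitem__(self, key):
        if isinstance(key, int):
            return self._names[key]           # list-style access
        return self._names[self._index[key]]  # dictionary-style access

    def index(self, key):
        """Return the position of a feature given its name or position."""
        return key if isinstance(key, int) else self._index[key]

    def __len__(self):
        return len(self._names)


domain = MiniDomain(["symboling", "normalized-losses", "make", "fuel-type"])
print(domain[0])
print(domain["fuel-type"])
print(domain.index("make"))
```

Keeping a name-to-position dictionary next to the list makes both kinds of lookup constant time, at the cost of storing the names twice.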

Data Instance

A data table stores data instances (or examples). These can be indexed or traversed like any Python list. Data instances can be considered as vectors, accessed through element index or through feature name.

import Orange

data = Orange.data.Table("iris")
print("First three data instances:")
for d in data[:3]:
    print(d)

print("25-th data instance:")
print(data[24])

name = "sepal width"
print("Value of '%s' for the first instance:" % name, data[0][name])
print("The 3rd value of the 25th data instance:", data[24][2])

The script above displays the following output:

First three data instances:
[5.100, 3.500, 1.400, 0.200 | Iris-setosa]
[4.900, 3.000, 1.400, 0.200 | Iris-setosa]
[4.700, 3.200, 1.300, 0.200 | Iris-setosa]
25-th data instance:
[4.800, 3.400, 1.900, 0.200 | Iris-setosa]
Value of 'sepal width' for the first instance: 3.500
The 3rd value of the 25th data instance: 1.900

The Iris dataset we have used above has four continuous attributes. Here is a script that computes their mean:

average = lambda x: sum(x) / len(x)

data = Orange.data.Table("iris")
print("%-15s %s" % ("Feature", "Mean"))
for x in data.domain.attributes:
    print("%-15s %.2f" % (x.name, average([d[x] for d in data])))

The script above also illustrates indexing of data instances with objects that store features; in d[x] the variable x is an Orange object. Here is the output:

Feature         Mean
sepal length    5.84
sepal width     3.05
petal length    3.76
petal width     1.20

Slightly more complicated, but also more interesting, is code that computes per-class averages:

average = lambda xs: sum(xs) / float(len(xs)) 

data = Orange.data.Table("iris")
targets = data.domain.class_var.values
print("%-15s %s" % ("Feature", " ".join("%15s" % c for c in targets)))
for a in data.domain.attributes:
    dist = [
        "%15.2f" % average([d[a] for d in data if d.get_class() == c]) for c in targets
    ]
    print("%-15s" % a.name, " ".join(dist))

Of the four features, petal width and length look quite discriminative for the type of iris:

Feature             Iris-setosa Iris-versicolor  Iris-virginica
sepal length               5.01            5.94            6.59
sepal width                3.42            2.77            2.97
petal length               1.46            4.26            5.55
petal width                0.24            1.33            2.03
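The script above rescans the whole dataset once per class and attribute. As a possible alternative, the sketch below groups values by class in a single pass using plain Python structures. The rows are a handful of illustrative measurements (petal length, petal width, class label), not the full Iris dataset, so the means differ from the table above.

```python
from collections import defaultdict

# A few illustrative rows: (petal length, petal width, class label).
rows = [
    (1.4, 0.2, "Iris-setosa"),
    (1.3, 0.2, "Iris-setosa"),
    (4.7, 1.4, "Iris-versicolor"),
    (4.5, 1.5, "Iris-versicolor"),
]

sums = defaultdict(lambda: [0.0, 0.0])  # running sum per class and feature
counts = defaultdict(int)               # number of rows seen per class

for *values, label in rows:
    counts[label] += 1
    for i, v in enumerate(values):
        sums[label][i] += v

# Divide each running sum by the class count to get per-class means.
means = {
    label: [s / counts[label] for s in sums[label]]
    for label in sums
}
for label, m in means.items():
    print("%-16s" % label, " ".join("%6.2f" % x for x in m))
```

The single pass matters little for 150 rows, but it avoids repeated filtering when the number of classes or instances grows.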

Finally, here is a quick snippet that computes the class distribution for another dataset:

import Orange
from collections import Counter

data = Orange.data.Table("lenses")
print(Counter(str(d.get_class()) for d in data))

Orange Datasets and NumPy

Orange datasets are actually wrapped NumPy arrays. Wrapping is performed to retain the information about the feature names and values, and NumPy arrays are used for speed and compatibility with different machine learning toolboxes, like scikit-learn, on which Orange relies. Let us display the values of these arrays for the first three data instances of the iris dataset:

>>> data = Orange.data.Table("iris")
>>> data.X[:3]
array([[ 5.1,  3.5,  1.4,  0.2],
       [ 4.9,  3. ,  1.4,  0.2],
       [ 4.7,  3.2,  1.3,  0.2]])
>>> data.Y[:3]
array([ 0.,  0.,  0.])

Notice that we access the arrays for attributes and class separately, using data.X and data.Y. Average values of attributes can then be computed efficiently by:

>>> import numpy as np
>>> np.mean(data.X, axis=0)
array([ 5.84333333,  3.054     ,  3.75866667,  1.19866667])

We can also construct a (classless) dataset from a numpy array:

>>> X = np.array([[1,2], [4,5]])
>>> data = Orange.data.Table(X)
>>> data.domain
[Feature 1, Feature 2]

If we want to provide meaningful names to attributes, we need to construct an appropriate data domain:

>>> domain = Orange.data.Domain([Orange.data.ContinuousVariable("length"),
                                 Orange.data.ContinuousVariable("width")])
>>> data = Orange.data.Table(domain, X)
>>> data.domain
[length, width]

Here is another example, this time with the construction of a dataset that includes a numerical class and different types of attributes:

size = Orange.data.DiscreteVariable("size", ["small", "big"])
height = Orange.data.ContinuousVariable("height")
shape = Orange.data.DiscreteVariable("shape", ["circle", "square", "oval"])
speed = Orange.data.ContinuousVariable("speed")

domain = Orange.data.Domain([size, height, shape], speed)

X = np.array([[1, 3.4, 0], [0, 2.7, 2], [1, 1.4, 1]])
Y = np.array([42.0, 52.2, 13.4])

data = Orange.data.Table(domain, X, Y)
print(data)

Running this script yields:

[[big, 3.400, circle | 42.000],
 [small, 2.700, oval | 52.200],
 [big, 1.400, square | 13.400]

Meta Attributes

Often, we wish to include descriptive fields in the data that will not be used in any computation (distance estimation, modeling), but will serve for identification or additional information. These are called meta attributes, and are marked with meta in the third header row:

name	hair	eggs	milk	backbone	legs	type
string	d	d	d	d	d	d
meta						class
aardvark	1	0	1	1	4	mammal
antelope	1	0	1	1	4	mammal
bass	0	1	0	1	0	fish
bear	1	0	1	1	4	mammal

Values of meta attributes and all other (non-meta) attributes are treated similarly in Orange, but stored in separate numpy arrays:

>>> data = Orange.data.Table("zoo")
>>> data[0]["name"]
>>> data[0]["type"]
>>> for d in data:
...     print("{}/{}: {}".format(d["name"], d["type"], d["legs"]))
...
aardvark/mammal: 4
antelope/mammal: 4
bass/fish: 0
bear/mammal: 4
>>> data.X
array([[ 1.,  0.,  1.,  1.,  2.],
       [ 1.,  0.,  1.,  1.,  2.],
       [ 0.,  1.,  0.,  1.,  0.],
       [ 1.,  0.,  1.,  1.,  2.]])
>>> data.metas
array([['aardvark'],
       ['antelope'],
       ['bass'],
       ['bear']], dtype=object)
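The separation into attribute, class, and meta storage can be sketched without Orange. The code below routes columns by role and encodes each discrete value as the index of that value in a sorted list of distinct values. Sorted ordering is an assumption made for this illustration, and encode is a hypothetical helper; the rows use only a few columns of the zoo excerpt.

```python
# Column roles and rows follow the zoo excerpt above (a subset of columns).
names = ["name", "hair", "legs", "type"]
roles = ["meta", "attribute", "attribute", "class"]
rows = [
    ["aardvark", "1", "4", "mammal"],
    ["bass", "0", "0", "fish"],
    ["bear", "1", "4", "mammal"],
]


def encode(column):
    """Return (values, codes): codes index into the sorted distinct values."""
    values = sorted(set(column))
    return values, [float(values.index(v)) for v in column]


X_cols, Y, meta_cols = [], [], []
for j, role in enumerate(roles):
    column = [row[j] for row in rows]
    if role == "meta":
        meta_cols.append(column)         # kept as raw strings
    elif role == "class":
        class_values, Y = encode(column)
    else:
        X_cols.append(encode(column)[1])

# Transpose the per-column lists back into per-row lists.
X = [list(r) for r in zip(*X_cols)]
metas = [list(r) for r in zip(*meta_cols)]

print(X)
print(Y, class_values)
print(metas)
```

Because codes are positions in the list of distinct values, a legs value of 4 becomes 1.0 here, where only the values 0 and 4 occur. With more distinct values in the column, the same raw value would receive a different code.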

Meta attributes may be passed to Orange.data.Table after providing arrays for attribute and class values:

from Orange.data import Table, Domain
from Orange.data import ContinuousVariable, DiscreteVariable, StringVariable
import numpy as np

X = np.array([[2.2, 1625], [0.3, 163]])
Y = np.array([0, 1])
M = np.array([["houston", 10], ["ljubljana", -1]])

domain = Domain(
    [ContinuousVariable("population"), ContinuousVariable("area")],
    [DiscreteVariable("snow", ("no", "yes"))],
    [StringVariable("city"), StringVariable("temperature")],
)
data = Table(domain, X, Y, M)
print(data)

The script outputs:

[[2.200, 1625.000 | no] {houston, 10},
 [0.300, 163.000 | yes] {ljubljana, -1}

To construct a classless domain we could pass None for the class values.

Missing Values

Consider the following exploration of the dataset on votes of the US senate:

>>> import numpy as np
>>> data = Orange.data.Table("voting.tab")
>>> data[2]
[?, y, y, ?, y, ... | democrat]
>>> np.isnan(data[2][0])
True
>>> np.isnan(data[2][1])
False

The particular data instance included missing data (represented with ‘?’) for the first and the fourth attribute. In the original dataset file, the missing values are, by default, represented with a blank space. We can now examine each attribute and report the proportion of data instances for which the feature was undefined:

data = Orange.data.Table("voting.tab")
for x in data.domain.attributes:
    n_miss = sum(1 for d in data if np.isnan(d[x]))
    print("%4.1f%% %s" % (100.0 * n_miss / len(data), x.name))

The first three lines of the output of this script are:

 2.8% handicapped-infants
11.0% water-project-cost-sharing
 2.5% adoption-of-the-budget-resolution

A one-liner that reports the number of data instances with at least one missing value is:

>>> sum(any(np.isnan(d[x]) for x in data.domain.attributes) for d in data)
203
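The same NaN-based bookkeeping can be tried on a small hand-made table using only the standard library. The values below are invented for illustration and are not taken from voting.tab; the point is that a missing value is detected with an isnan test, because NaN never compares equal to anything, including itself.

```python
import math

NAN = float("nan")  # stand-in for a missing value

# Invented 4x4 table; NAN marks an undefined entry.
rows = [
    [NAN, 1.0, 1.0, NAN],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, NAN, 1.0, 1.0],
    [1.0, 1.0, 1.0, 0.0],
]

# Count missing entries column by column.
per_column = [
    sum(1 for row in rows if math.isnan(row[j]))
    for j in range(len(rows[0]))
]
for j, n_miss in enumerate(per_column):
    print("column %d: %4.1f%% missing" % (j, 100.0 * n_miss / len(rows)))

# Count rows that contain at least one missing entry.
rows_with_missing = sum(any(math.isnan(v) for v in row) for row in rows)
print("rows with at least one missing value:", rows_with_missing)

# An equality test cannot find NaN, which is why isnan is used above.
print(NAN == NAN)
```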

Data Selection and Sampling

Besides the name of the data file, Orange.data.Table can accept the data domain and a list of data items and returns a new dataset. This is useful for any data subsetting:

data = Orange.data.Table("iris.tab")
print("Dataset instances:", len(data))
subset = Orange.data.Table(data.domain, [d for d in data if d["petal length"] > 3.0])
print("Subset size:", len(subset))

The code outputs:

Dataset instances: 150
Subset size: 99

The subset inherits the data description (domain) from the original dataset. Changing the domain requires setting up a new domain descriptor. This is useful for any kind of feature selection:

data = Orange.data.Table("iris.tab")
new_domain = Orange.data.Domain(
    list(data.domain.attributes[:2]),
    data.domain.class_var
)
new_data = Orange.data.Table(new_domain, data)

print(data[0])
print(new_data[0])

We could also construct a random sample of the dataset:

>>> import random
>>> sample = Orange.data.Table(data.domain, random.sample(data, 3))
>>> sample
[[6.000, 2.200, 4.000, 1.000 | Iris-versicolor],
 [4.800, 3.100, 1.600, 0.200 | Iris-setosa],
 [6.300, 3.400, 5.600, 2.400 | Iris-virginica]
]

or randomly sample the attributes:

>>> atts = random.sample(data.domain.attributes, 2)
>>> domain = Orange.data.Domain(atts, data.domain.class_var)
>>> new_data = Orange.data.Table(domain, data)
>>> new_data[0]
[5.100, 1.400 | Iris-setosa]
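Random samples differ from run to run unless the generator is seeded. The sketch below shows one way to make both row sampling and attribute sampling repeatable with a dedicated random.Random instance. It works on plain tuples rather than an Orange table, and the seed value 42 is an arbitrary choice for illustration.

```python
import random

# Four Iris rows from the output shown earlier: four features, then class.
rows = [
    (5.1, 3.5, 1.4, 0.2, "Iris-setosa"),
    (4.9, 3.0, 1.4, 0.2, "Iris-setosa"),
    (4.7, 3.2, 1.3, 0.2, "Iris-setosa"),
    (4.8, 3.4, 1.9, 0.2, "Iris-setosa"),
]
names = ["sepal length", "sepal width", "petal length", "petal width"]

rng = random.Random(42)  # fixed seed: the same draws on every run

# Sample two rows without replacement.
row_sample = rng.sample(rows, 2)

# Sample two attribute positions; sorted() keeps their original order.
col_idx = sorted(rng.sample(range(len(names)), 2))

# Project every row onto the sampled attributes and keep the class label.
projected = [tuple(r[i] for i in col_idx) + (r[-1],) for r in rows]

print(row_sample)
print([names[i] for i in col_idx])
print(projected[0])
```

Using a separate generator object instead of the module-level functions keeps the seed from affecting other code that also draws random numbers.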

References

https://docs.biolab.si//3/data-mining-library/tutorial/data.html

Interesting Links

Orange