Datasets

A machine learning dataset is a collection of instances (or samples), each one described by a number of variables. In the case of tabular data, a dataset looks like a database table, where every column is a variable, and each row corresponds to a given instance. However, a dataset can also be non-tabular; for example, each instance can consist of a multivariate time-series, or an image.

When data is composed of different modalities, combining their statistical properties is non-trivial, since the modalities may be quite different in nature from one another.

The abstract representation of a multimodal dataset provided by this package is the AbstractMultiDataset.

MultiData.AbstractMultiDataset — Type

Abstract supertype for all multimodal datasets.

A concrete multimodal dataset should always provide accessors data, to access the underlying tabular structure (e.g., DataFrame) and grouped_variables, to access the grouping of variables (a vector of vectors of column indices).
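
As a minimal sketch of the interface described above, a concrete subtype only needs to store a table and a grouping and expose them through the two accessors. The type name SimpleMultiDataset and its field names are hypothetical and used for illustration only; the sketch assumes the DataFrames and MultiData packages are installed.

```julia
using DataFrames
using MultiData

# Hypothetical concrete type, for illustration only (not part of the package).
struct SimpleMultiDataset <: MultiData.AbstractMultiDataset
    df::DataFrame
    groups::Vector{Vector{Int}}
end

# The two accessors that the interface asks for.
MultiData.data(d::SimpleMultiDataset) = d.df
MultiData.grouped_variables(d::SimpleMultiDataset) = d.groups

df = DataFrame(:age => [30, 9], :name => ["Python", "Julia"])
smd = SimpleMultiDataset(df, [[1], [2]])

groups = MultiData.grouped_variables(smd)  # the grouping given at construction
table = MultiData.data(smd)                # the underlying DataFrame
```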

MultiData.grouped_variables — Function

grouped_variables(amd)::Vector{Vector{Int}}

Return the indices of the variables of an AbstractMultiDataset, grouped by modality. The grouping describes how the different modalities are composed from the underlying AbstractDataFrame structure.

See also data, AbstractMultiDataset.

MultiData.data — Function

data(amd)::AbstractDataFrame

Return the structure that underlies an AbstractMultiDataset.

See also grouped_variables, AbstractMultiDataset.

SoleBase.dimensionality — Function

dimensionality(df::AbstractDataFrame)

Return the dimensionality of a dataframe df.

If the dataframe is empty (no instances), :empty is returned.

If the dataframe has variables of different dimensionalities, :mixed is returned by default. This behavior can be controlled by setting the keyword argument force:

:no (default): return :mixed in case of mixed dimensionality
:max: return the greatest dimensionality
:min: return the lowest dimensionality
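
The rules above can be sketched in plain Julia. This is not the package's implementation: sketch_dimensionality and value_dim are hypothetical helpers, and the sketch assumes that a scalar (or string) has dimensionality 0 and an array has the dimensionality given by ndims, which is consistent with the examples on this page.

```julia
# Hypothetical helper: dimensionality of a single value.
value_dim(x) = x isa AbstractArray ? ndims(x) : 0

# Hypothetical re-implementation of the rules described above,
# operating on a plain collection of columns rather than a dataframe.
function sketch_dimensionality(columns; force::Symbol = :no)
    force in (:no, :max, :min) || throw(ArgumentError("unknown force value: $force"))
    # No instances: there is nothing to measure.
    all(isempty, columns) && return :empty
    dims = unique(value_dim(v) for col in columns for v in col)
    length(dims) == 1 && return first(dims)
    # Mixed dimensionalities: the outcome depends on `force`.
    force === :max && return maximum(dims)
    force === :min && return minimum(dims)
    return :mixed
end

scalar_cols = [[30, 9], ["Python", "Julia"]]
mixed_cols = [[30, 9], [[1.0, 2.0], [3.0, 4.0]]]

sketch_dimensionality(scalar_cols)
sketch_dimensionality(mixed_cols)
sketch_dimensionality(mixed_cols; force = :max)
```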

Unlabeled Datasets

In unlabeled datasets there is no labeling variable, and all of the variables (also called feature variables, or features) have equal role in the representation. These datasets are used in unsupervised learning contexts, for discovering internal correlation patterns between the features. Multimodal unlabeled datasets can be instantiated with MultiDataset.

MultiData.MultiDataset — Type

MultiDataset(df, grouped_variables)

Create a MultiDataset from an AbstractDataFrame df, initializing its modalities according to the grouping in grouped_variables.

grouped_variables is an AbstractVector of variable groupings, each of which is an AbstractVector of integers representing the indices of the variables selected for that modality.

Note that the order matters for both the modalities and the variables.

julia> df = DataFrame(
                  :age => [30, 9],
                  :name => ["Python", "Julia"],
                  :stat1 => [[sin(i) for i in 1:50000], [cos(i) for i in 1:50000]],
                  :stat2 => [[cos(i) for i in 1:50000], [sin(i) for i in 1:50000]]
              )
2×4 DataFrame
 Row │ age    name    stat1                              stat2                             ⋯
     │ Int64  String  Array…                             Array…                            ⋯
─────┼──────────────────────────────────────────────────────────────────────────────────────
   1 │    30  Python  [0.841471, 0.909297, 0.14112, -0…  [0.540302, -0.416147, -0.989992,… ⋯
   2 │     9  Julia   [0.540302, -0.416147, -0.989992,…  [0.841471, 0.909297, 0.14112, -0…

julia> md = MultiDataset([[2]], df)
● MultiDataset
   └─ dimensionalities: (0,)
- Modality 1 / 1
   └─ dimensionality: 0
2×1 SubDataFrame
 Row │ name
     │ String
─────┼────────
   1 │ Python
   2 │ Julia
- Spare variables
   └─ dimensionality: mixed
2×3 SubDataFrame
 Row │ age    stat1                              stat2
     │ Int64  Array…                             Array…
─────┼─────────────────────────────────────────────────────────────────────────────
   1 │    30  [0.841471, 0.909297, 0.14112, -0…  [0.540302, -0.416147, -0.989992,…
   2 │     9  [0.540302, -0.416147, -0.989992,…  [0.841471, 0.909297, 0.14112, -0…
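
The following sketch builds a dataset with two modalities and reads the grouping back through the accessors documented earlier. It assumes the DataFrames and MultiData packages are installed, uses the same argument order as the example above, and shortens the series to five points to keep the table small.

```julia
using DataFrames
using MultiData

df = DataFrame(
    :age => [30, 9],
    :name => ["Python", "Julia"],
    :stat1 => [[sin(i) for i in 1:5], [cos(i) for i in 1:5]],
    :stat2 => [[cos(i) for i in 1:5], [sin(i) for i in 1:5]],
)

# Modality 1 holds column 2 (name); modality 2 holds columns 3 and 4.
# Column 1 (age) is left out of every modality.
md = MultiDataset([[2], [3, 4]], df)

groups = MultiData.grouped_variables(md)  # column indices, one vector per modality
table = MultiData.data(md)                # the underlying tabular structure
```

Since the order matters, swapping the two inner vectors would describe a different dataset, with the series as the first modality.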

MultiDataset(df; group = :none)

Create a MultiDataset from an AbstractDataFrame df, automatically selecting modalities.

The selection of modalities can be controlled by the group argument, which can be:

:none (default): no modality will be created
:all: all variables will be grouped by their dimensionality
a list of dimensionalities which will be grouped.

Note: :all and :none are the only Symbols accepted by group.

TODO: fix passing a vector of Integer to group

TODO: rewrite examples

Examples

julia> df = DataFrame(
                  :age => [30, 9],
                  :name => ["Python", "Julia"],
                  :stat1 => [[sin(i) for i in 1:50000], [cos(i) for i in 1:50000]],
                  :stat2 => [[cos(i) for i in 1:50000], [sin(i) for i in 1:50000]]
              )
2×4 DataFrame
 Row │ age    name    stat1                              stat2                             ⋯
     │ Int64  String  Array…                             Array…                            ⋯
─────┼──────────────────────────────────────────────────────────────────────────────────────
   1 │    30  Python  [0.841471, 0.909297, 0.14112, -0…  [0.540302, -0.416147, -0.989992,… ⋯
   2 │     9  Julia   [0.540302, -0.416147, -0.989992,…  [0.841471, 0.909297, 0.14112, -0…

julia> md = MultiDataset(df)
● MultiDataset
   └─ dimensionalities: ()
- Spare variables
   └─ dimensionality: mixed
2×4 SubDataFrame
 Row │ age    name    stat1                              stat2                             ⋯
     │ Int64  String  Array…                             Array…                            ⋯
─────┼──────────────────────────────────────────────────────────────────────────────────────
   1 │    30  Python  [0.841471, 0.909297, 0.14112, -0…  [0.540302, -0.416147, -0.989992,… ⋯
   2 │     9  Julia   [0.540302, -0.416147, -0.989992,…  [0.841471, 0.909297, 0.14112, -0…


julia> md = MultiDataset(df; group = :all)
● MultiDataset
   └─ dimensionalities: (0, 1)
- Modality 1 / 2
   └─ dimensionality: 0
2×2 SubDataFrame
 Row │ age    name
     │ Int64  String
─────┼───────────────
   1 │    30  Python
   2 │     9  Julia
- Modality 2 / 2
   └─ dimensionality: 1
2×2 SubDataFrame
 Row │ stat1                              stat2
     │ Array…                             Array…
─────┼──────────────────────────────────────────────────────────────────────
   1 │ [0.841471, 0.909297, 0.14112, -0…  [0.540302, -0.416147, -0.989992,…
   2 │ [0.540302, -0.416147, -0.989992,…  [0.841471, 0.909297, 0.14112, -0…


julia> md = MultiDataset(df; group = [0])
● MultiDataset
   └─ dimensionalities: (0, 1, 1)
- Modality 1 / 3
   └─ dimensionality: 0
2×2 SubDataFrame
 Row │ age    name
     │ Int64  String
─────┼───────────────
   1 │    30  Python
   2 │     9  Julia
- Modality 2 / 3
   └─ dimensionality: 1
2×1 SubDataFrame
 Row │ stat1
     │ Array…
─────┼───────────────────────────────────
   1 │ [0.841471, 0.909297, 0.14112, -0…
   2 │ [0.540302, -0.416147, -0.989992,…
- Modality 3 / 3
   └─ dimensionality: 1
2×1 SubDataFrame
 Row │ stat2
     │ Array…
─────┼───────────────────────────────────
   1 │ [0.540302, -0.416147, -0.989992,…
   2 │ [0.841471, 0.909297, 0.14112, -0…
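
The grouping produced by group = :all in the examples above can be approximated in plain Julia. This is only a sketch of the idea, not the package's code: group_by_dimensionality and value_dim are hypothetical helpers, and each column is assumed to be non-empty and to hold values of a single dimensionality.

```julia
# Hypothetical helper: dimensionality of a single value.
value_dim(x) = x isa AbstractArray ? ndims(x) : 0

# Hypothetical helper: collect column indices by the dimensionality of
# their values, ordered from the lowest dimensionality to the highest.
function group_by_dimensionality(columns)
    groups = Dict{Int,Vector{Int}}()
    for (i, col) in enumerate(columns)
        d = value_dim(first(col))  # assumes a non-empty, uniform column
        push!(get!(groups, d, Int[]), i)
    end
    return [groups[d] for d in sort(collect(keys(groups)))]
end

columns = [
    [30, 9],                                           # age
    ["Python", "Julia"],                               # name
    [[sin(i) for i in 1:5], [cos(i) for i in 1:5]],    # stat1
    [[cos(i) for i in 1:5], [sin(i) for i in 1:5]],    # stat2
]

grouping = group_by_dimensionality(columns)
```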

MultiData._empty — Method

_empty(md)

Return a copy of a multimodal dataset with no instances.

Note: since the returned AbstractMultiDataset will be empty, its column types will be Any.

Labeled Datasets

In labeled datasets, one or more variables are considered to have special semantics with respect to the other variables; each of these labeling variables (or target variables) can be thought of as assigning a label to each instance, which is typically a categorical value (classification label) or a numerical value (regression label). Supervised learning methods can be applied to these datasets for modeling the target variables as a function of the feature variables.

As an extension of the AbstractMultiDataset, AbstractLabeledMultiDataset has an interface that can be implemented to represent multimodal labeled datasets.

MultiData.AbstractLabeledMultiDataset — Type

Abstract supertype for all labeled multimodal datasets (used in supervised learning).

Like any multimodal dataset, a concrete labeled multimodal dataset should always provide the accessors data, to access the underlying tabular structure (e.g., DataFrame) and grouped_variables, to access the grouping of variables. In addition to these, an implementation is required for labeling_variables, to access the indices of the labeling variables.

See also AbstractMultiDataset.

MultiData.labeling_variables — Function

labeling_variables(almd)::Vector{Int}

Return the indices of the labeling variables of the AbstractLabeledMultiDataset, with respect to the underlying AbstractDataFrame structure (see data).

See also grouped_variables, AbstractLabeledMultiDataset.

Multimodal labeled datasets can be instantiated with LabeledMultiDataset.

MultiData.LabeledMultiDataset — Type

LabeledMultiDataset(md, labeling_variables)

Create a LabeledMultiDataset by associating an AbstractMultiDataset with some labeling variables, specified as a column index (Int) or a vector of column indices (Vector{Int}).

Arguments

md is the original AbstractMultiDataset;
labeling_variables is an AbstractVector of integers indicating the indices of the variables that will be set as labels.

Examples

julia> lmd = LabeledMultiDataset(MultiDataset([[2],[4]], DataFrame(
           :id => [1, 2],
           :age => [30, 9],
           :name => ["Python", "Julia"],
           :stat => [[sin(i) for i in 1:50000], [cos(i) for i in 1:50000]]
       )), [1, 3])
● LabeledMultiDataset
   ├─ labels
   │   ├─ id: Set([2, 1])
   │   └─ name: Set(["Julia", "Python"])
   └─ dimensionalities: (0, 1)
- Modality 1 / 2
   └─ dimensionality: 0
2×1 SubDataFrame
 Row │ age
     │ Int64
─────┼───────
   1 │    30
   2 │     9
- Modality 2 / 2
   └─ dimensionality: 1
2×1 SubDataFrame
 Row │ stat
     │ Array…
─────┼───────────────────────────────────
   1 │ [0.841471, 0.909297, 0.14112, -0…
   2 │ [0.540302, -0.416147, -0.989992,…

MultiData.joinlabels! — Method

joinlabels!(lmd, [lbls...]; delim = "_")

On a labeled multimodal dataset, collapse the labeling variables identified by lbls into a single labeling variable of type String, by means of a join that uses delim as the string delimiter.

If lbls is not specified, this function joins all labels.

Each element of lbls can be an Integer indicating the index of the label, or a Symbol indicating the name of the labeling variable.

Note

The resulting labels will always be of type String.

Note

The resulting labeling variable will always be added as the last column in the underlying DataFrame.

Examples

julia> lmd = LabeledMultiDataset(
           MultiDataset(
               [[2],[4]],
               DataFrame(
                   :id => [1, 2],
                   :age => [30, 9],
                   :name => ["Python", "Julia"],
                   :stat => [[sin(i) for i in 1:50000], [cos(i) for i in 1:50000]]
               )
           ),
           [1, 3],
       )
● LabeledMultiDataset
   ├─ labels
   │   ├─ id: Set([2, 1])
   │   └─ name: Set(["Julia", "Python"])
   └─ dimensionalities: (0, 1)
- Modality 1 / 2
   └─ dimensionality: 0
2×1 SubDataFrame
 Row │ age
     │ Int64
─────┼───────
   1 │    30
   2 │     9
- Modality 2 / 2
   └─ dimensionality: 1
2×1 SubDataFrame
 Row │ stat
     │ Array…
─────┼───────────────────────────────────
   1 │ [0.841471, 0.909297, 0.14112, -0…
   2 │ [0.540302, -0.416147, -0.989992,…


julia> joinlabels!(lmd)
● LabeledMultiDataset
   ├─ labels
   │   └─ id_name: Set(["1_Python", "2_Julia"])
   └─ dimensionalities: (0, 1)
- Modality 1 / 2
   └─ dimensionality: 0
2×1 SubDataFrame
 Row │ age
     │ Int64
─────┼───────
   1 │    30
   2 │     9
- Modality 2 / 2
   └─ dimensionality: 1
2×1 SubDataFrame
 Row │ stat
     │ Array…
─────┼───────────────────────────────────
   1 │ [0.841471, 0.909297, 0.14112, -0…
   2 │ [0.540302, -0.416147, -0.989992,…
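
The joined values in the example above can be reproduced with Base Julia's join. This sketch only illustrates the string operation on two plain vectors; it does not call joinlabels!, and the variable names are hypothetical.

```julia
# Label columns as in the example above.
ids = [1, 2]
langs = ["Python", "Julia"]
delim = "_"

# Join the values row by row; every resulting label is a String.
joined = [join((id, lang), delim) for (id, lang) in zip(ids, langs)]

# The example output suggests the new variable's name is built the same way.
joined_name = join(("id", "name"), delim)
```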

MultiData.label — Method

label(lmd, j, i)

Return the value of the i-th labeling variable for the instance at index j in a labeled multimodal dataset.

MultiData.labeldomain — Method

labeldomain(lmd, i)

Return the domain of the i-th label of a labeled multimodal dataset.

MultiData.labels — Method

labels(lmd, i_instance)
labels(lmd)

Return the labels of the instance at index i_instance in a labeled multimodal dataset. A dictionary of the form labelname => value is returned.

If only the first argument is passed, the labels for all instances are returned.

MultiData.nlabelingvariables — Method

nlabelingvariables(lmd)

Return the number of labeling variables of a labeled multimodal dataset.
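
The accessors documented above can be tried on the labeled dataset from the earlier examples. This is a sketch that assumes the DataFrames and MultiData packages are installed; the functions are called with the MultiData. prefix so that the snippet does not depend on which names the package exports, and the series are shortened to five points.

```julia
using DataFrames
using MultiData

lmd = LabeledMultiDataset(
    MultiDataset(
        [[2], [4]],
        DataFrame(
            :id => [1, 2],
            :age => [30, 9],
            :name => ["Python", "Julia"],
            :stat => [[sin(i) for i in 1:5], [cos(i) for i in 1:5]],
        ),
    ),
    [1, 3],
)

n = MultiData.nlabelingvariables(lmd)         # number of labeling variables
idx = MultiData.labeling_variables(lmd)       # their column indices
first_labels = MultiData.labels(lmd, 1)       # label name => value for instance 1
first_id = MultiData.label(lmd, 1, 1)         # first labeling variable (id) of instance 1
name_domain = MultiData.labeldomain(lmd, 2)   # values taken by the second label (name)
```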

MultiData.setaslabeling! — Method

setaslabeling!(lmd, i)
setaslabeling!(lmd, var_name)

Set the i-th variable as a label.

The variable name can be passed as second argument instead of its index.

MultiData.unsetaslabeling! — Method

unsetaslabeling!(lmd, i)
unsetaslabeling!(lmd, var_name)

Remove the i-th labeling variable from the list of labels.

The variable name can be passed as second argument instead of its index.
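
A short sketch of adding and removing a labeling variable follows. It assumes the DataFrames and MultiData packages are installed. The columns are arranged so that the variable being toggled is both column 2 and the second labeling variable, which keeps the calls valid whichever way the index i is interpreted.

```julia
using DataFrames
using MultiData

lmd = LabeledMultiDataset(
    MultiDataset(
        [[3], [4]],
        DataFrame(
            :id => [1, 2],
            :name => ["Python", "Julia"],
            :age => [30, 9],
            :stat => [[sin(i) for i in 1:5], [cos(i) for i in 1:5]],
        ),
    ),
    [1],
)

# Column 2 (name) belongs to no modality; mark it as a label too.
MultiData.setaslabeling!(lmd, 2)
n_after_set = MultiData.nlabelingvariables(lmd)

# Undo the previous step.
MultiData.unsetaslabeling!(lmd, 2)
n_after_unset = MultiData.nlabelingvariables(lmd)
```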