Datasets
A machine learning dataset is a collection of instances (or samples), each one described by a number of variables. In the case of tabular data, a dataset looks like a database table, where every column is a variable and each row corresponds to a given instance. However, a dataset can also be non-tabular; for example, each instance can consist of a multivariate time series or an image.
When data is composed of different modalities, combining their statistical properties is non-trivial, since the modalities may be quite different in nature from one another.
The abstract representation of a multimodal dataset provided by this package is the AbstractMultiDataset.
MultiData.AbstractMultiDataset — Type
Abstract supertype for all multimodal datasets.
A concrete multimodal dataset should always provide the accessors data, to access the underlying tabular structure (e.g., DataFrame), and grouped_variables, to access the grouping of variables (a vector of vectors of column indices).
MultiData.grouped_variables — Function
grouped_variables(amd)::Vector{Vector{Int}}
Return the indices of the variables of an AbstractMultiDataset, grouped by modality. The grouping describes how the different modalities are composed from the underlying AbstractDataFrame structure.
See also data, AbstractMultiDataset.
MultiData.data — Function
data(amd)::AbstractDataFrame
Return the structure that underlies an AbstractMultiDataset.
See also grouped_variables, AbstractMultiDataset.
SoleBase.dimensionality — Function
dimensionality(df::AbstractDataFrame)
Return the dimensionality of a dataframe df.
If the dataframe has variables of various dimensionalities, :mixed is returned.
If the dataframe is empty (no instances), :empty is returned. This behavior can be controlled by setting the keyword argument force:
- :no (default): return :mixed in case of mixed dimensionality;
- :max: return the greatest dimensionality;
- :min: return the lowest dimensionality.
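The rules above can be illustrated with a small stand-alone sketch. It is not the SoleBase implementation: toy_dimensionality and value_dim are hypothetical helpers, the table is a plain NamedTuple of columns rather than a DataFrame, and a column's dimensionality is assumed to be the number of array dimensions of its first value (0 for scalars and strings, 1 for vectors).

```julia
# Hypothetical helper (an assumption for illustration): 0 for scalars and
# strings, 1 for vectors, 2 for matrices, and so on.
value_dim(x) = x isa AbstractArray ? ndims(x) : 0

# Hypothetical stand-in for `dimensionality`, working on a NamedTuple of columns.
function toy_dimensionality(columns::NamedTuple; force::Symbol = :no)
    # No columns or no instances: there is nothing to measure.
    (length(columns) == 0 || isempty(first(columns))) && return :empty
    # Dimensionality of each column, judged from its first value.
    dims = unique([value_dim(first(col)) for col in values(columns)])
    length(dims) == 1 && return dims[1]
    force == :max && return maximum(dims)
    force == :min && return minimum(dims)
    return :mixed
end

table = (age = [30, 9], stat = [[0.1, 0.2], [0.3, 0.4]])

toy_dimensionality(table)                 # :mixed
toy_dimensionality(table; force = :max)   # 1
toy_dimensionality(table; force = :min)   # 0
toy_dimensionality((age = [30, 9],))      # 0
toy_dimensionality((age = Int[],))        # :empty
```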
Unlabeled Datasets
In unlabeled datasets there is no labeling variable, and all of the variables (also called feature variables, or features) play an equal role in the representation. These datasets are used in unsupervised learning contexts, for discovering internal correlation patterns between the features. Multimodal unlabeled datasets can be instantiated with MultiDataset.
MultiData.MultiDataset — Type
MultiDataset(df, grouped_variables)
Create a MultiDataset from an AbstractDataFrame df, initializing its modalities according to the grouping in grouped_variables.
grouped_variables is an AbstractVector of variable groupings, each of which is an AbstractVector of integers representing the indices of the variables selected for that modality.
Note that the order matters for both the modalities and the variables.
julia> df = DataFrame(
:age => [30, 9],
:name => ["Python", "Julia"],
:stat1 => [[sin(i) for i in 1:50000], [cos(i) for i in 1:50000]],
:stat2 => [[cos(i) for i in 1:50000], [sin(i) for i in 1:50000]]
)
2×4 DataFrame
Row │ age name stat1 stat2 ⋯
│ Int64 String Array… Array… ⋯
─────┼──────────────────────────────────────────────────────────────────────────────────────
1 │ 30 Python [0.841471, 0.909297, 0.14112, -0… [0.540302, -0.416147, -0.989992,… ⋯
2 │ 9 Julia [0.540302, -0.416147, -0.989992,… [0.841471, 0.909297, 0.14112, -0…
julia> md = MultiDataset([[2]], df)
● MultiDataset
└─ dimensionalities: (0,)
- Modality 1 / 1
└─ dimensionality: 0
2×1 SubDataFrame
Row │ name
│ String
─────┼────────
1 │ Python
2 │ Julia
- Spare variables
└─ dimensionality: mixed
2×3 SubDataFrame
Row │ age stat1 stat2
│ Int64 Array… Array…
─────┼─────────────────────────────────────────────────────────────────────────────
1 │ 30 [0.841471, 0.909297, 0.14112, -0… [0.540302, -0.416147, -0.989992,…
2 │ 9 [0.540302, -0.416147, -0.989992,… [0.841471, 0.909297, 0.14112, -0…
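To make the role of the grouping concrete, here is a minimal sketch in plain Julia of how a vector of index groups maps onto column names, and which columns are left over as spare variables. The helpers modality_names and spare_indices are hypothetical and are not part of MultiData; they only mirror the behavior shown in the example above.

```julia
# Hypothetical helper: column names of each modality, in the order given.
modality_names(colnames, grouped) = [colnames[group] for group in grouped]

# Hypothetical helper: indices of the columns that belong to no modality.
function spare_indices(colnames, grouped)
    used = reduce(vcat, grouped; init = Int[])
    return setdiff(1:length(colnames), used)
end

colnames = [:age, :name, :stat1, :stat2]

modality_names(colnames, [[2]])           # [[:name]]
spare_indices(colnames, [[2]])            # [1, 3, 4]

# The order of both the groups and the indices inside them is preserved.
modality_names(colnames, [[4, 3], [1]])   # [[:stat2, :stat1], [:age]]
```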
MultiDataset(df; group = :none)
Create a MultiDataset from an AbstractDataFrame df, automatically selecting modalities.
The selection of modalities can be controlled by the group argument, which can be:
- :none (default): no modality will be created;
- :all: all variables will be grouped by their dimensionality;
- a list of dimensionalities which will be grouped.
Note: :all and :none are the only Symbols accepted by group.
TODO: fix passing a vector of Integer to group
TODO: rewrite examples
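As a rough illustration of what grouping by dimensionality means for group = :all, the following stand-alone sketch computes index groups from a NamedTuple of columns. It is an assumption-laden toy, not the package's code: toy_group_by_dimensionality and value_dim are hypothetical names, and dimensionality is judged from the first value of each column.

```julia
# Hypothetical helper: 0 for scalars and strings, 1 for vectors, and so on.
value_dim(x) = x isa AbstractArray ? ndims(x) : 0

# Group column indices by dimensionality, lowest dimensionality first,
# keeping the original column order inside each group.
function toy_group_by_dimensionality(columns::NamedTuple)
    dims = [value_dim(first(col)) for col in values(columns)]
    return [findall(==(d), dims) for d in sort(unique(dims))]
end

table = (
    age   = [30, 9],
    name  = ["Python", "Julia"],
    stat1 = [[0.1, 0.2], [0.3, 0.4]],
    stat2 = [[0.5, 0.6], [0.7, 0.8]],
)

groups = toy_group_by_dimensionality(table)   # [[1, 2], [3, 4]]
```

A vector of this shape is what the grouped_variables argument described earlier expects.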
Examples
julia> df = DataFrame(
:age => [30, 9],
:name => ["Python", "Julia"],
:stat1 => [[sin(i) for i in 1:50000], [cos(i) for i in 1:50000]],
:stat2 => [[cos(i) for i in 1:50000], [sin(i) for i in 1:50000]]
)
2×4 DataFrame
Row │ age name stat1 stat2 ⋯
│ Int64 String Array… Array… ⋯
─────┼──────────────────────────────────────────────────────────────────────────────────────
1 │ 30 Python [0.841471, 0.909297, 0.14112, -0… [0.540302, -0.416147, -0.989992,… ⋯
2 │ 9 Julia [0.540302, -0.416147, -0.989992,… [0.841471, 0.909297, 0.14112, -0…
julia> md = MultiDataset(df)
● MultiDataset
└─ dimensionalities: ()
- Spare variables
└─ dimensionality: mixed
2×4 SubDataFrame
Row │ age name stat1 stat2 ⋯
│ Int64 String Array… Array… ⋯
─────┼──────────────────────────────────────────────────────────────────────────────────────
1 │ 30 Python [0.841471, 0.909297, 0.14112, -0… [0.540302, -0.416147, -0.989992,… ⋯
2 │ 9 Julia [0.540302, -0.416147, -0.989992,… [0.841471, 0.909297, 0.14112, -0…
julia> md = MultiDataset(df; group = :all)
● MultiDataset
└─ dimensionalities: (0, 1)
- Modality 1 / 2
└─ dimensionality: 0
2×2 SubDataFrame
Row │ age name
│ Int64 String
─────┼───────────────
1 │ 30 Python
2 │ 9 Julia
- Modality 2 / 2
└─ dimensionality: 1
2×2 SubDataFrame
Row │ stat1 stat2
│ Array… Array…
─────┼──────────────────────────────────────────────────────────────────────
1 │ [0.841471, 0.909297, 0.14112, -0… [0.540302, -0.416147, -0.989992,…
2 │ [0.540302, -0.416147, -0.989992,… [0.841471, 0.909297, 0.14112, -0…
julia> md = MultiDataset(df; group = [0])
● MultiDataset
└─ dimensionalities: (0, 1, 1)
- Modality 1 / 3
└─ dimensionality: 0
2×2 SubDataFrame
Row │ age name
│ Int64 String
─────┼───────────────
1 │ 30 Python
2 │ 9 Julia
- Modality 2 / 3
└─ dimensionality: 1
2×1 SubDataFrame
Row │ stat1
│ Array…
─────┼───────────────────────────────────
1 │ [0.841471, 0.909297, 0.14112, -0…
2 │ [0.540302, -0.416147, -0.989992,…
- Modality 3 / 3
└─ dimensionality: 1
2×1 SubDataFrame
Row │ stat2
│ Array…
─────┼───────────────────────────────────
1 │ [0.540302, -0.416147, -0.989992,…
2 │ [0.841471, 0.909297, 0.14112, -0…
MultiData._empty — Method
_empty(md)
Return a copy of a multimodal dataset with no instances.
Note: since the returned AbstractMultiDataset will be empty, its column types will be Any.
Labeled Datasets
In labeled datasets, one or more variables are considered to have special semantics with respect to the other variables; each of these labeling variables (or target variables) can be thought of as assigning a label to each instance, which is typically a categorical value (classification label) or a numerical value (regression label). Supervised learning methods can be applied to these datasets to model the target variables as a function of the feature variables.
As an extension of AbstractMultiDataset, AbstractLabeledMultiDataset has an interface that can be implemented to represent multimodal labeled datasets.
MultiData.AbstractLabeledMultiDataset — Type
Abstract supertype for all labeled multimodal datasets (used in supervised learning).
Like any multimodal dataset, a concrete labeled multimodal dataset should always provide the accessors data, to access the underlying tabular structure (e.g., DataFrame), and grouped_variables, to access the grouping of variables. In addition to these, an implementation is required for labeling_variables, to access the indices of the labeling variables.
See also AbstractMultiDataset.
MultiData.labeling_variables — Function
labeling_variables(almd)::Vector{Int}
Return the indices of the labeling variables of an AbstractLabeledMultiDataset, with respect to the underlying AbstractDataFrame structure (see data).
See also grouped_variables, AbstractLabeledMultiDataset.
Multimodal labeled datasets can be instantiated with LabeledMultiDataset.
MultiData.LabeledMultiDataset — Type
LabeledMultiDataset(md, labeling_variables)
Create a LabeledMultiDataset by associating an AbstractMultiDataset with some labeling variables, specified as a column index (Int) or a vector of column indices (Vector{Int}).
Arguments
- md is the original AbstractMultiDataset;
- labeling_variables is an AbstractVector of integers indicating the indices of the variables that will be set as labels.
Examples
julia> lmd = LabeledMultiDataset(MultiDataset([[2],[4]], DataFrame(
:id => [1, 2],
:age => [30, 9],
:name => ["Python", "Julia"],
:stat => [[sin(i) for i in 1:50000], [cos(i) for i in 1:50000]]
)), [1, 3])
● LabeledMultiDataset
├─ labels
│ ├─ id: Set([2, 1])
│ └─ name: Set(["Julia", "Python"])
└─ dimensionalities: (0, 1)
- Modality 1 / 2
└─ dimensionality: 0
2×1 SubDataFrame
Row │ age
│ Int64
─────┼───────
1 │ 30
2 │ 9
- Modality 2 / 2
└─ dimensionality: 1
2×1 SubDataFrame
Row │ stat
│ Array…
─────┼───────────────────────────────────
1 │ [0.841471, 0.909297, 0.14112, -0…
2 │ [0.540302, -0.416147, -0.989992,…
MultiData.joinlabels! — Method
joinlabels!(lmd, [lbls...]; delim = "_")
On a labeled multimodal dataset, collapse the labeling variables identified by lbls into a single labeling variable of type String, by means of a join that uses delim as the string delimiter.
Unless specified otherwise, this function joins all labels.
Each element of lbls can be an Integer indicating the index of the label, or a Symbol indicating the name of the labeling variable.
Note: the resulting labels will always be of type String.
Note: the resulting labeling variable will always be added as the last column in the underlying DataFrame.
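The join itself can be pictured with a short stand-alone sketch. toy_joinlabels is a hypothetical helper that works on a NamedTuple of label columns and returns a new name and column; unlike joinlabels!, it does not modify any dataset, and naming the new column by joining the old names with delim is an assumption taken from the id_name label in the example output.

```julia
# Hypothetical helper: join the given label columns row by row into strings.
function toy_joinlabels(labelcols::NamedTuple; delim = "_")
    # Name of the new column: the old names joined by the delimiter.
    newname = Symbol(join(keys(labelcols), delim))
    ninstances = length(first(labelcols))
    joined = [
        join((string(col[i]) for col in values(labelcols)), delim)
        for i in 1:ninstances
    ]
    return newname => joined
end

toy_joinlabels((id = [1, 2], name = ["Python", "Julia"]))
# :id_name => ["1_Python", "2_Julia"]
```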
Examples
julia> lmd = LabeledMultiDataset(
MultiDataset(
[[2],[4]],
DataFrame(
:id => [1, 2],
:age => [30, 9],
:name => ["Python", "Julia"],
:stat => [[sin(i) for i in 1:50000], [cos(i) for i in 1:50000]]
)
),
[1, 3],
)
● LabeledMultiDataset
├─ labels
│ ├─ id: Set([2, 1])
│ └─ name: Set(["Julia", "Python"])
└─ dimensionalities: (0, 1)
- Modality 1 / 2
└─ dimensionality: 0
2×1 SubDataFrame
Row │ age
│ Int64
─────┼───────
1 │ 30
2 │ 9
- Modality 2 / 2
└─ dimensionality: 1
2×1 SubDataFrame
Row │ stat
│ Array…
─────┼───────────────────────────────────
1 │ [0.841471, 0.909297, 0.14112, -0…
2 │ [0.540302, -0.416147, -0.989992,…
julia> joinlabels!(lmd)
● LabeledMultiDataset
├─ labels
│ └─ id_name: Set(["1_Python", "2_Julia"])
└─ dimensionalities: (0, 1)
- Modality 1 / 2
└─ dimensionality: 0
2×1 SubDataFrame
Row │ age
│ Int64
─────┼───────
1 │ 30
2 │ 9
- Modality 2 / 2
└─ dimensionality: 1
2×1 SubDataFrame
Row │ stat
│ Array…
─────┼───────────────────────────────────
1 │ [0.841471, 0.909297, 0.14112, -0…
2 │ [0.540302, -0.416147, -0.989992,…
MultiData.label — Method
label(lmd, j, i)
Return the value of the i-th labeling variable for the instance at index i_instance in a labeled multimodal dataset.
MultiData.labeldomain — Method
labeldomain(lmd, i)
Return the domain of the i-th label of a labeled multimodal dataset.
MultiData.labels — Method
labels(lmd, i_instance)
labels(lmd)
Return the labels of the instance at index i_instance in a labeled multimodal dataset. A dictionary of type labelname => value is returned.
If only the first argument is passed, the labels for all instances are returned.
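The shape of these return values can be sketched without the package. The helpers below are hypothetical toys operating on a NamedTuple of label columns; using Symbol keys in the dictionary and returning a Set as the label domain are assumptions made for illustration, not confirmed details of MultiData.

```julia
# Hypothetical helper: labels of one instance, as a labelname => value dictionary.
toy_labels(labelcols::NamedTuple, i_instance::Integer) =
    Dict(name => col[i_instance] for (name, col) in pairs(labelcols))

# Hypothetical helper: labels of all instances, one dictionary per instance.
toy_labels(labelcols::NamedTuple) =
    [toy_labels(labelcols, i) for i in 1:length(first(labelcols))]

# Hypothetical helper: set of distinct values taken by the i-th label.
toy_labeldomain(labelcols::NamedTuple, i::Integer) = Set(labelcols[i])

labelcols = (id = [1, 2], name = ["Python", "Julia"])

toy_labels(labelcols, 2)        # Dict(:id => 2, :name => "Julia")
toy_labels(labelcols)           # a vector with one dictionary per instance
toy_labeldomain(labelcols, 2)   # Set(["Python", "Julia"])
```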
MultiData.nlabelingvariables — Method
nlabelingvariables(lmd)
Return the number of labeling variables of a labeled multimodal dataset.
MultiData.setaslabeling! — Method
setaslabeling!(lmd, i)
setaslabeling!(lmd, var_name)
Set the i-th variable as a label.
The variable name can be passed as the second argument instead of its index.
MultiData.unsetaslabeling! — Method
unsetaslabeling!(lmd, i)
unsetaslabeling!(lmd, var_name)
Remove the i-th labeling variable from the list of labels.
The variable name can be passed as the second argument instead of its index.