Main manual

NAME

Swift – Relational Data Converter

SYNOPSIS

swift-cli.py [source]
[-h] [-ss source_separator] [-ta target_attributes] [-i]
[-mv missing_value] [-snh] [-tnh] [-t [target]]
[-ts target_separator] [-o objects] [-n name]
[-cls classes] [-sf {csv,arff,dat,data,cxt,dtl}]
[-tf {csv,arff,dat,data,cxt,dtl}] [-c [rows_count]] [-p [rows_count]]
[-sl skipped_lines] [-se] [-scs source_cls_separator]
[-tcs target_cls_separator]
swift.py

DESCRIPTION

Swift – Relational Data Converter is a program for converting data files between six different formats. All accepted formats are text-based and can be converted to each other (including to the same format), which gives 36 possible conversions. Swift focuses on tabular data, where rows represent objects (instances) and columns represent their attributes (properties). Besides data conversion, the program also supports previewing the data, printing statistics about it, filtering, scaling and binary unpacking of attributes. All operations work with data of any size; the only limit is hard drive space (not RAM).

Swift provides a command-line interface (swift-cli.py) and a graphical user interface (swift.py), which is described in a separate GUI Manual. To get started quickly, go to the EXAMPLES section, which contains examples of common use cases.

Positional arguments

source
The name of the source file, which the program reads and processes, in one of the supported formats. The source must end with a valid format extension, or the optional --source_format argument must be used. If the source is omitted or equals "-", the program reads the input from stdin.

Optional arguments

-h, --help
Print the help message to the stdout and exit.
-t target, --target target
The name of the target file, which the program writes to, in one of the supported formats. The --target must end with a valid format extension, or the optional --target_format argument must be used. If the --target is omitted or equals "-", the program writes the output to stdout.
-ta target_attributes, --target_attributes target_attributes
The list of formulas separated by ";",
formulas ::= formula | (formula ";" formulas)
where a formula is the definition of an attribute in the target file. The formula is of the form:
formula ::= (new-names "=")? old-names ((":" type ("[" scale "]")?) | "[]")?
The first part of the formula:
(new-names "=")? old-names
where new-names and old-names are the lists of names separated by ",":
names ::= name | (name "," names)
where name is a word or an interval:
name ::= \w+ | ((\d+)? "-" (\d+)?) | "*"
has the following meaning: old-names refer to attributes in the source file by their names (if available) or indexes. New-names define the names of the attributes in the target file. If new-names are omitted, attributes in the target file have the same names as in the source file. New-names and old-names must have the same length, otherwise an error is produced.

The interval determines a range of indexes. If the lower bound is omitted, the range goes from zero to the upper bound. If the upper bound is omitted, the range goes from the lower bound to the maximum attribute index in the source file. If both bounds are omitted, the range goes from zero to the maximum attribute index in the source file (the alias "*" has the same meaning as "-").

Examples (eight attributes in the source file, indexes 0 to 7):
4-6 produces: 4, 5, 6
-5 produces: 0, 1, 2, 3, 4, 5
4- produces: 4, 5, 6, 7
- produces: 0, 1, 2, 3, 4, 5, 6, 7
* produces: 0, 1, 2, 3, 4, 5, 6, 7
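
The interval rules above can be sketched in Python. This is only an illustration, not Swift's actual code: the function name expand_name and its max_index parameter are hypothetical, and max_index is assumed to be 7 to match the examples above.

```python
def expand_name(name, max_index):
    """Expand an interval such as '4-6', '-5', '4-', '-' or '*' into indexes.

    Hypothetical helper: max_index is the highest attribute index
    available in the source file.
    """
    if name == "*":          # "*" is an alias of "-"
        name = "-"
    lower, sep, upper = name.partition("-")
    if not sep:
        raise ValueError("not an interval: %r" % name)
    start = int(lower) if lower else 0          # omitted lower bound -> 0
    stop = int(upper) if upper else max_index   # omitted upper bound -> max
    return list(range(start, stop + 1))


print(expand_name("4-6", 7))  # [4, 5, 6]
print(expand_name("-5", 7))   # [0, 1, 2, 3, 4, 5]
print(expand_name("4-", 7))   # [4, 5, 6, 7]
print(expand_name("*", 7))    # [0, 1, 2, 3, 4, 5, 6, 7]
```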

The second optional part of the formula

((":" type ("[" scale "]")?) | "[]")?
is composed of the attribute type,
type ::= "n" | "e" | "s" | ("d" ("/" date_format)?)
date_format ::= "F="? "'" .+ "'"
where characters are aliases of data types:
  • n = numeric - all real numbers
  • e = enumeration (nominal) - finite set of named values
  • s = string - word
  • d = date - the default date format is ISO 8601, which combines date and time: YYYY-MM-DDThh:mm:ss. Other date formats must be given using Python format codes,

and an optional definition of the scale or of the binary unpacking.

scale ::= num_scale | enum_scale | str_scale | date_scale | bin_vals
The scale is an expression which is evaluated with the value of the attribute. The result of the evaluation is True or False, represented as 1 and 0 respectively.

In the numeric scale,

num_scale ::= (var op num_val) | (num_val op var) | (num_val op var op num_val)
the numeric value is of the form:
num_val ::= int ("." int?)? ("e" int)?
int ::= ("+" | "-")? \d+
Examples of numeric value: 5, -10, +47, 2.46, -2.31, +98.31, -78.4e-48, 3e8
var ::= [a-z_]+
The variable (of the form above) stands for the attribute value being scaled. The operators have the same meaning as in many programming languages, such as C or Python.
op ::= "<" | ">" | "<=" | ">=" | "==" | "!="
Examples of numeric scales (variable is x): x!=10.3, -5<=x<=10, 50==x
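
To make the evaluation concrete, here is a minimal Python sketch of how a numeric scale could be evaluated. It is an assumption about the semantics described above, not Swift's implementation: the name eval_num_scale is hypothetical, the tokenizer does no validation of malformed input, and a missing value is passed as None and always yields 0.

```python
import operator
import re

OPS = {"<": operator.lt, ">": operator.gt, "<=": operator.le,
       ">=": operator.ge, "==": operator.eq, "!=": operator.ne}

# Operators first (two-character ones before "<" and ">"),
# then a variable, then a numeric value.
TOKEN = re.compile(
    r"(<=|>=|==|!=|<|>|[a-zA-Z_]+|[+-]?\d+(?:\.\d*)?(?:e[+-]?\d+)?)")


def eval_num_scale(scale, value):
    """Return 1 if the value satisfies the scale, otherwise 0."""
    if value is None:                 # a missing value always scales to 0
        return 0
    tokens = TOKEN.findall(scale)
    # Operands sit at even positions, operators at odd positions.
    operands = [value if t.isidentifier() else float(t)
                for t in tokens[0::2]]
    operators = tokens[1::2]
    # A chain such as -5<=x<=10 holds if every adjacent comparison holds.
    return int(all(OPS[op](left, right)
                   for op, left, right
                   in zip(operators, operands, operands[1:])))


print(eval_num_scale("-5<=x<=10", 3))   # 1
print(eval_num_scale("x!=10.3", 10.3))  # 0
print(eval_num_scale("50==x", 50))      # 1
```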

The enumeration scale

enum_scale ::= "'" \w+ "'"
is one of the enumeration values, which must be quoted. If the value of the attribute is exactly the same as the scale, the result of the scaling is True, otherwise False. Embedded quotes must be escaped with a backslash or doubled. For example, to scale the value 'foo' (with quotes), the expression '\'foo\'' or '''foo''' must be used.

The string scale

str_scale ::= "'" .+ "'"
is a quoted Python regular expression. If the regular expression matches any substring of the scaled value, the result of the scaling is True, otherwise False. Embedded quotes must be escaped with a backslash or doubled (the same as in the enumeration scale described above).

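Example of the string scale: 'foo[+*-]*bar', which matches "hellfoobar", "foo--barxyz" ... but doesn't match "foo/bar" ... (the hyphen is placed last in the character class, because Python rejects "[+-*]" as an invalid range).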
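
The substring-matching behaviour corresponds to Python's re.search. The sketch below illustrates it under that assumption; the helper name eval_str_scale is hypothetical, and the pattern is written with the hyphen last in the character class so that Python reads it as a literal character.

```python
import re


def eval_str_scale(pattern, value):
    """Return 1 if the pattern matches any substring of value, otherwise 0."""
    return int(re.search(pattern, value) is not None)


pattern = r"foo[+*-]*bar"
for value in ("hellfoobar", "foo--barxyz", "foo/bar"):
    print(value, eval_str_scale(pattern, value))
```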

The date scale

date_scale ::= ((var op date_val) | (date_val op var) | (date_val op var op date_val))
is exactly the same expression as the numeric scale, but the values must be dates in a valid date format.
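
A date scale can be pictured as parsing both sides with the attribute's date format and comparing the results. The following Python sketch handles a single comparison only; eval_date_scale and its parameters are hypothetical names, and the default format string is the Python equivalent of the ISO 8601 form mentioned earlier.

```python
import operator
from datetime import datetime

OPS = {"<": operator.lt, ">": operator.gt, "<=": operator.le,
       ">=": operator.ge, "==": operator.eq, "!=": operator.ne}


def eval_date_scale(value, op, bound, fmt="%Y-%m-%dT%H:%M:%S",
                    missing_value=None):
    """Evaluate 'value op bound' on dates; a missing value gives 0."""
    if value == missing_value:
        return 0
    left = datetime.strptime(value, fmt)    # raises ValueError on a bad date
    right = datetime.strptime(bound, fmt)
    return int(OPS[op](left, right))


# Roughly what d/'%Y-%m-%d'[x>='1991-01-01'] expresses:
print(eval_date_scale("1991-06-13", ">=", "1991-01-01", fmt="%Y-%m-%d"))  # 1
print(eval_date_scale("1990-04-23", ">=", "1991-01-01", fmt="%Y-%m-%d"))  # 0
```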

Binary values

bin_vals ::= ("0="? "'" .* "'" ",")? "1="? "'" .+ "'"
allow new binary values to be defined for a bivalent attribute. The default values are 1 (True) and 0 (False). This is useful for converting from a multivalent format (which contains a bivalent attribute whose values differ from 0 and 1) to a bivalent format without needless scaling (because the values are already bivalent). Note that this is semantically the same as using an enumeration scale for a bivalent attribute, but binary values are preferred because they are more time-efficient.

The alternative to the scale is binary unpacking (total binarization) of an attribute, which is requested by "[]" written right after the names. This creates a new attribute for every single value of the old attribute. Every new attribute is of the enumeration type with the scale: value. The order of the new attributes is the same as the order of values in the old (template) attribute. If binary unpacking is used for several attributes, for example two, then every new attribute of the first old attribute has a lower index than any new attribute of the second old attribute.

Example - binary unpacking of two attributes:

in
a x
b y
b z
a x
c y
out
1 0 0  1 0 0
0 1 0  0 1 0
0 1 0  0 0 1
1 0 0  1 0 0
0 0 1  0 1 0
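
The example above can be reproduced with a short Python sketch. It assumes that the values of each attribute are ordered by first appearance (which is consistent with the example) and, unlike Swift, it keeps all rows in memory; binary_unpack is a hypothetical name.

```python
def binary_unpack(rows):
    """Replace every attribute by one 0/1 column per distinct value."""
    columns = list(zip(*rows))
    # Distinct values of each column, in order of first appearance.
    values = [list(dict.fromkeys(column)) for column in columns]
    return [
        [int(cell == v) for cell, vals in zip(row, values) for v in vals]
        for row in rows
    ]


rows = [("a", "x"), ("b", "y"), ("b", "z"), ("a", "x"), ("c", "y")]
for unpacked in binary_unpack(rows):
    print(*unpacked)
```

All new columns of the first attribute come before those of the second one, as described above.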

Complete grammar of the --target_attributes argument:

formulas ::= formula | (formula ";" formulas)
formula ::= (names "=")? names ((":" type ("[" scale "]")?) | ("[" bin_vals "]") |"[]")?
names ::= name | (name "," names)
type ::= "n" | "e" | "s" | ("d" ("/" date_format)?)
scale ::= num_scale | enum_scale | str_scale | date_scale
name ::= \w+ | ((\d+)? "-" (\d+)?) | "*"
date_format ::= "F="? "'" .+ "'"
num_scale ::= (var op num_val) | (num_val op var) |
              (num_val op var op num_val)
enum_scale ::= "'" \w+ "'"
str_scale ::= "'" .+ "'"
date_scale ::= ((var op date_val) | (date_val op var) |
                (date_val op var op date_val))
bin_vals ::= ("0="? "'" .* "'" ",")? "1="? "'" .+ "'"
var ::= [a-zA-Z_]+
op ::= "<" | ">" | "<=" | ">=" | "==" | "!="
date_val ::= "'" .+ "'"
num_val ::= int ("." int?)? ("e" int)?
int ::= ("+" | "-")? \d+
Notes:

If the --target_attributes argument is omitted, all attributes from the source are processed.

If the --target_attributes argument begins with "-" (e.g. '-4,8,9'), it must be specified as: -ta='-4,8,9'.
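
The reason for the "=" form can be illustrated with Python's argparse module. That Swift parses its arguments with argparse is an assumption here; the parser below is a stand-alone sketch with only one option defined.

```python
import argparse

parser = argparse.ArgumentParser(prog="swift-cli.py")
parser.add_argument("-ta", "--target_attributes")

# With "=", the value is attached to the option, so its leading "-"
# is not mistaken for the start of another option.
args = parser.parse_args(["-ta=-4,8,9"])
print(args.target_attributes)  # -4,8,9
```

Passed as two separate tokens (-ta followed by -4,8,9), argparse would read the second token as an unknown option and report that -ta expects an argument.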

-c [rows_count], --convert [rows_count]
Default action. Converts the source to the --target. The optional argument rows_count defines how many rows of data should be processed; by default, all rows are processed.
-p [rows_count], --preview [rows_count]
Alternative action. Prints the desired number of rows from the source table data to stdout. By default 20 rows are printed, but this can be changed with the optional rows_count argument.
-i, --info
Prints statistics of the source and of each of its processed attributes. The statistics are of the form:
Relation name:
Objects count:
Attributes count:
====================

name:
index:
type: string/enumeration
values appearance:
    value: occurrences-count/total-count = %
	.
	.
	.

name:
index:
type: numeric/date
max: , min:
values appearance:
    value: occurrences-count/total-count = %
	.
	.
	.

.
.
.
The --info may be used as a single action (the statistics are printed to stdout) or in parallel with a conversion (if --target is stdout, a new file is produced, named the same as the source but with the extension .info). Using --info in parallel with a conversion saves time if both actions are required.
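
The "values appearance" lines can be thought of as a frequency count over the attribute's values. A minimal Python sketch, assuming the line format shown above (values_appearance is a hypothetical name and the ordering of the lines is not meant to mirror Swift's):

```python
from collections import Counter


def values_appearance(values):
    """Format 'value: occurrences-count/total-count = %' lines."""
    counts = Counter(values)
    total = len(values)
    return ["%s: %d/%d = %.2f%%" % (value, count, total,
                                    100.0 * count / total)
            for value, count in counts.items()]


for line in values_appearance(["foo", "bar", "foo", "foo"]):
    print("    " + line)
```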
-sf {csv,arff,dat,data,cxt,dtl}, --source_format {csv,arff,dat,data,cxt,dtl}
Specifies the source format using the extension. If the source doesn't have an extension, the --source_format is required. If the source has an extension and --source_format is also used, the --source_format overrides the source extension.
-tf {csv,arff,dat,data,cxt,dtl}, --target_format {csv,arff,dat,data,cxt,dtl}
Specifies the --target format using the extension. If the --target doesn't have an extension, the --target_format is required. If the --target has an extension and --target_format is also used, the --target_format overrides the --target extension.
-cls classes, --classes classes
Selects attributes from the source to be used as classes in the --target. They can be specified using intervals of attribute indexes, names of attributes, or a combination of both. --classes is of the form:
classes ::= element | (element "," classes)
element ::= interval | key
interval ::= ((\d+)? "-" (\d+)?)
key ::= \w+
This argument is relevant only when the --target format is C4.5 or DTL.
-ss source_separator, --source_separator source_separator
Specifies the separator of attributes in the source (it affects only the current action). --source_separator must be specified when the separator used in the source differs from the default separator of the file format.
-ts target_separator, --target_separator target_separator
Specifies the separator of attributes in the --target (it affects only the current action). Attributes in the --target will be separated with this new value.
-scs source_classes_separator, --source_cls_separator source_classes_separator
Specifies the separator of attributes and classes in the source (it affects only the current action). --source_cls_separator must be specified when the classes/attributes separator used in the source differs from the default classes/attributes separator of the file format. This argument is relevant only when the source format is DTL.
-tcs target_classes_separator, --target_cls_separator target_classes_separator
Specifies the separator of attributes and classes in the --target (it affects only the current action). Attributes and classes in the --target will be separated with this new value. This argument is relevant only when the --target format is DTL.
-sl skip_lines, --skip_lines skip_lines
Intervals of the form

intervals ::= interval | (interval "," intervals)
interval ::= ((\d+)? "-" (\d+)?) | \d+
determine the indexes of source lines that will be skipped in any operation.
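
A possible reading of the grammar above, sketched in Python. The helper names are hypothetical and zero-based line indexes are an assumption; the generator processes lines one by one, so the input does not have to fit in memory.

```python
def parse_intervals(spec):
    """Turn '0,2-3,8-' into (lower, upper) pairs; upper None = open-ended."""
    ranges = []
    for part in spec.split(","):
        lower, sep, upper = part.partition("-")
        if not sep:                       # a single index such as "7"
            ranges.append((int(lower), int(lower)))
        else:
            ranges.append((int(lower) if lower else 0,
                           int(upper) if upper else None))
    return ranges


def skip_lines(lines, spec):
    """Yield only the lines whose index is not covered by spec."""
    ranges = parse_intervals(spec)
    for index, line in enumerate(lines):
        if any(lo <= index and (hi is None or index <= hi)
               for lo, hi in ranges):
            continue
        yield line


print(list(skip_lines(["a", "b", "c", "d"], "2-")))  # ['a', 'b']
```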
-se, --skip_errors
Errors produced by invalid lines in the source are skipped. The program continues and the skipped errors are printed to stderr.
-n name, --name name
Specifies a new name of relation (data).
-mv missing_value, --missing_value missing_value
Specifies the value, which will be interpreted as an undefined value (None/NULL). The result of scaling the --missing_value is always False (0).
-o objects, --objects objects
The list of object names separated by comma:
objects ::= name | (name "," objects)
This argument is relevant only when the --target format is Burmeister. If --objects is omitted, indexes of objects are used.
-snh, --source_no_header
The first row in the CSV source is interpreted as an object (not as a header).
-tnh, --target_no_header
The CSV --target will have an object on the first line (not a header).

Errors

The errors below are produced by the program. Each error specification is of the form:
error_code: error_name
	error_description
When an error is raised, the program ends and returns the error code.
1: Swift Unknown Error
If you get this error, please report a bug with an error message printed below. Thank you.
2: Argument Error
Some required arguments are missing or are not specified correctly.
3: ARFF Header Error
A syntax error in the header of the ARFF source file.
4: DATA Header Error
A syntax error in the header of the DATA source file.
5: CSV Header Error
A syntax error in the header of the CSV source file.
6: DAT Header Error
A syntax error in the header of the DAT source file.
7: CXT Header Error
A syntax error in the header of the CXT source file.
8: ARFF Line Error
A syntax error in a line of the ARFF source file.
9: DATA Line Error
A syntax error in a line of the DATA source file.
10: CSV Line Error
A syntax error in a line of the CSV source file.
11: DAT Line Error
A syntax error in a line of the DAT source file.
12: CXT Line Error
A syntax error in a line of the CXT source file.
13: DTL Line Error
A syntax error in a line of the DTL source file.
14: Formula Error
A syntax error in a formula of the --target_attributes argument.
15: Formula Names Error
The count of new names and the count of old names aren't equal.
16: Sequence Error
A syntax error in a sequence (interval).
17: DATE Value Error
An invalid value of a date attribute.
18: NUMERIC Value Error
An invalid value of a numeric attribute.
19: STRING Value Error
An invalid value of a string attribute.
20: NOMINAL Value Error
An invalid value of a nominal (enumeration) attribute.
21: DATE Error
An invalid syntax of a date.
22: Formula Date Value/Format Error
The value and the format of the date don't match.
23: Formula Regular Expression Error
An invalid regular expression in the formula.
24: Formula Attribute Key Error
The key of the attribute doesn't exist.
25: Keyboard Interrupt Error
26: Bivalent Error
An invalid bivalent value in the data.
27: Broken Pipe Error
28: Names File Error
The .names and .data files (C4.5 format) aren't in the same directory.
29: DTL Header Error
A syntax error in the header of the DTL source file.
30: Not Enough Lines Error
The source file doesn't have enough lines. Some required part of the file is missing or the file is empty.
31: Class Key Error
The key of the class doesn't exist.

SUPPORTED FORMATS

Swift supports the following six formats: CSV, ARFF, DATA, CXT, DAT and DTL.

Comma-separated values (.csv)

The description of the format with examples can be found here. The default file format (class/attributes) separator is a comma.

Notes: An attribute separator inside a value must always be escaped with a backslash.
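
One way to picture this rule is splitting a line only on separators that are not preceded by a backslash. The Python sketch below is an illustration under that assumption (split_escaped is a hypothetical name); it does not handle an escaped backslash directly before a separator and does not trim whitespace.

```python
import re


def split_escaped(line, separator=","):
    """Split on unescaped separators and drop the escaping backslashes."""
    parts = re.split(r"(?<!\\)" + re.escape(separator), line)
    return [part.replace("\\" + separator, separator) for part in parts]


print(split_escaped("Smith\\, John,42"))  # ['Smith, John', '42']
```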

Attribute-Relation File Format (.arff)

The description of the format with examples can be found here. The default file format (class/attributes) separator is a comma.

Notes: The optional date format of a date attribute in the header must be specified with Python format codes (not in the Java SimpleDateFormat, as specified in the official documentation).

Example:
@relation birthdays
@attribute birthday date %Y-%m-%d
@data
1990-04-23
1993-12-03
1989-03-31

When converting from an ARFF file with relational attributes to another format, the relational attributes are linearized (using dot notation, see the example).

Example:
@attribute humidity relational
    @attribute absolute relational
        @attribute day numeric
        @attribute night numeric
    @end absolute
@end humidity
The relational attribute humidity (above) will be converted to the attributes (e.g. for a CSV target):
humidity.absolute.day, humidity.absolute.night
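
The linearization can be sketched as a recursive walk that joins nested names with dots. The nested (name, children) structure and the function name linearize are assumptions chosen for this illustration; Swift's internal representation may differ.

```python
def linearize(attributes, prefix=""):
    """Flatten nested (name, children) pairs into dotted attribute names.

    children is None for a plain attribute and a list of further
    (name, children) pairs for a relational attribute.
    """
    names = []
    for name, children in attributes:
        full_name = prefix + name
        if children is None:
            names.append(full_name)
        else:
            names.extend(linearize(children, full_name + "."))
    return names


humidity = [("humidity", [("absolute", [("day", None), ("night", None)])])]
print(linearize(humidity))
# ['humidity.absolute.day', 'humidity.absolute.night']
```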

C4.5 File Format (.data .names)

The description of the format with examples can be found here. The default file format (class/attributes) separator is a comma.

Notes: A class is handled exactly the same as an attribute in any conversion (a class can be scaled, totally binarized, ...). The key used in --target_attributes is the class index or the name: "class".

Burmeister (.cxt)

The description of the format with examples can be found here.

FIMI File Format (.dat)

The description of the format with examples can be found here. The default file format (class/attributes) separator is a white space.

Notes: Blank lines (in a FIMI source) are interpreted as objects with all attributes of the value 0 (False). Conversely, objects with all attributes of the value 0 (False) are written (to a FIMI target) as blank lines. This differs from the official format documentation, which ignores blank lines and objects with all attributes of the value 0.
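
The blank-line convention can be illustrated with a small Python round trip. The sketch assumes that a FIMI line lists the indexes of the attributes with the value 1, separated by white space; both function names are hypothetical.

```python
def fimi_line_to_row(line, attributes_count):
    """'0 2 3' -> [1, 0, 1, 1, 0]; a blank line gives an all-zero row."""
    indexes = {int(token) for token in line.split()}
    return [int(i in indexes) for i in range(attributes_count)]


def row_to_fimi_line(row):
    """[1, 0, 1, 1, 0] -> '0 2 3'; an all-zero row gives a blank line."""
    return " ".join(str(i) for i, value in enumerate(row) if value)


print(fimi_line_to_row("", 3))            # [0, 0, 0]
print(repr(row_to_fimi_line([0, 0, 0])))  # ''
```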

DTL File Format (.dtl)

This format is very similar to FIMI, but it supports specification of classes for each object. The file consists of rows, which are of the form:

attributes "|" classes
where the attributes part is exactly the same as the attributes in the FIMI format (the values are indexes of attributes with the value 1) and "|" separates attributes and classes as the default class/attributes separator. The classes part consists of values separated with the same separator as the attributes.

Example (class1={a,b}, class2={aa,bb}):
0 1 2 3 4|a bb
1 2 3 4|a aa
2 3 4|b bb
3 4|a bb
4|b bb

Classes are handled exactly the same as attributes in any conversion (a class can be scaled, totally binarized, ...). The key used in --target_attributes is the class index or the name: "class1", "class2", ...
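
Reading one DTL row can be sketched in Python as follows. The function name parse_dtl_line is hypothetical, the attribute count is assumed to be known in advance, and the default separators described above are used.

```python
def parse_dtl_line(line, attributes_count, cls_separator="|"):
    """Split a DTL row into a 0/1 attribute row and a list of classes."""
    attributes, _, classes = line.partition(cls_separator)
    indexes = {int(token) for token in attributes.split()}
    row = [int(i in indexes) for i in range(attributes_count)]
    return row, classes.split()


print(parse_dtl_line("1 2 3 4|a aa", 5))
# ([0, 1, 1, 1, 1], ['a', 'aa'])
```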

EXAMPLES

All examples in this section use the following sample data.

Sample data

CSV - example.csv
name,   birth_date, credits, study, sex
George, 1991-06-13, 54,      true,  man
Monica, 1990-04-23, 98,      false, woman
Mia,    ?,          87,      true,  woman
John,   1989-11-11, 91,      true,  man
DTL - example.dtl
0 1 2 3 4|a bb
1 2 3 4|a aa
2 3 4|b bb
3 4|a bb
4|b bb
DAT - example.dat
0
1
2
3
4
DATA
example.names
foo, bar.
age: continuous.
job: teacher, pilot, doctor.
work: discrete 2.
sport: ignore.
example.data
44, doctor,  1, foo
30, teacher, 0, bar
35, ?,       1, foo
31, pilot,   0, foo

Convert

This section is divided into nine subsections according to the arguments required in the particular conversion. Each subsection contains the list of required arguments, one illustrative example and the list of other conversions that can be performed similarly.

For quick navigation, you can use the following table of all possible conversions.

Simple Conversion

Required arguments:
DTL (example.dtl) to CSV
swift-cli.py example.dtl -t result.csv
result.csv

0,1,2,3,4,class1,class2
1,1,1,1,1,a,bb
0,1,1,1,1,a,aa
0,0,1,1,1,b,bb
0,0,0,1,1,a,bb
0,0,0,0,1,b,bb
Further possible conversions:
  • CSV to CSV
  • ARFF to ARFF
  • ARFF to CSV
  • DATA to CSV
  • DATA to ARFF
  • DAT to DAT
  • DAT to CSV
  • DAT to ARFF
  • CXT to CXT
  • CXT to CSV
  • CXT to DAT
  • CXT to ARFF
  • DTL to ARFF

Types specification

Required arguments:
CSV (example.csv) to ARFF
swift-cli.py example.csv -t result.arff -mv '?' -ta "name:s; 1:d/'%Y-%m-%d'; credits:n; work,gender=3,4:e" -n people
result.arff
@relation people

@attribute name string
@attribute birth_date date %Y-%m-%d
@attribute credits numeric
@attribute work { true,false }
@attribute gender { man,woman }

@data
George,1991-06-13,54,true,man
Monica,1990-04-23,98,false,woman
Mia,?,87,true,woman
John,1989-11-11,91,true,man

Classes specification

Required arguments:
DATA (example.data) to DATA
swift-cli.py example.data -t result.data -cls class
result.names
foo,bar.
age: continuous.
job: teacher,pilot,doctor.
work: 1,0.
class_prev: foo,bar.
result.data
44,doctor,1,foo,foo
30,teacher,0,bar,bar
35,?,1,foo,foo
31,pilot,0,foo,foo
Further possible conversions:
  • ARFF to DATA
  • DATA to DATA
  • DAT to DTL
  • DAT to DATA
  • CXT to DATA
  • CXT to DTL
  • DTL to DATA

Types, classes specification

Required arguments:
CSV (example.csv) to DATA
swift-cli.py example.csv -t result.data -mv '?' -ta "name:s; 1:d/'%Y-%m-%d'; credits:n; 3,4:e" -cls sex
result.names
man,woman.
name: discrete n.
birth_date: discrete n.
credits: continuous.
study: true,false.
sex: man,woman.
result.data
George,1991-06-13,54,true,man,man
Monica,1990-04-23,98,false,woman,woman
Mia,?,87,true,woman,woman
John,1989-11-11,91,true,man,man

Object names and attribute names specification

Required arguments:
DAT (example.dat) to CXT
swift-cli.py example.dat -t result.cxt -o foo,bar,foobar,barfoo -ta 'a=0;b=1;c=2;d=3'
result.cxt
B

4
4
foo
bar
foobar
barfoo
a
b
c
d
X...
.X..
..X.
...X

Scale

Required arguments:
CSV (example.csv) to DAT
swift-cli.py example.csv -mv '?' -t result.dat -ta "name:s['M.+a']; birth_date:d/'%Y-%m-%d'[x>='1991-01-01'];
             credits:n[50<=x<=90]; study[0='false', 1='true']; sex:e['man']"
result.dat
1 2 3 4
0
0 2 3
3 4
Further possible conversions:
  • ARFF to DAT
  • DATA to DAT
  • DTL to DAT

Scale and object names specification

Required arguments:
CSV (example.csv) to CXT
swift-cli.py example.csv -mv '?' -t result.cxt -o a,b,c,d -ta "name:s['M.+a'];
	     birth_date:d/'%Y-%m-%d'[x>='1991-01-01']; credits:n[50<=x<=90]; study[0='false', 1='true']; sex:e['man']"
result.cxt
B

4
5
a
b
c
d
name
birth_date
credits
study
sex
.XXXX
X....
X.XX.
...XX
Further possible conversions:
  • ARFF to CXT
  • DATA to CXT

Scale and classes specification

Required arguments:
CSV (example.csv) to DTL
swift-cli.py example.csv -mv '?' -t result.dtl -cls 3,4 -ta "name:s['M.+a'];
	     birth_date:d/'%Y-%m-%d'[x>='1991-01-01']; credits:n[50<=x<=90]; study[0='false', 1='true']; sex:e['man']"
result.dtl
1 2 3 4|true man
0|false woman
0 2 3|true woman
3 4|true man
Further possible conversions:
  • ARFF to DTL
  • DATA to DTL
  • DTL to DTL

Scale, object names and attribute names specification

Required arguments:
DTL (example.dtl) to CXT
swift-cli.py example.dtl -t result.cxt -ta "b1,b2,b3,b4,b5=0-4;class1:e['b'];class2:e['bb']" -o a,b,c,d,e
B

5
7
a
b
c
d
e
b1
b2
b3
b4
b5
class1
class2
XXXXX.X
.XXXX..
..XXXXX
...XX.X
....XXX

Binary Unpacking

CSV (example.csv) to CSV
swift-cli.py example.csv -t result.csv -ta "*[]" -tnh
result.csv
1,0,0,0,1,0,0,0,1,0,0,0,1,0,1,0
0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,1
0,0,1,0,0,0,1,0,0,0,1,0,1,0,0,1
0,0,0,1,0,0,0,1,0,0,0,1,1,0,1,0

Filter

CSV (example.csv) to CSV
swift-cli.py example.csv -t result.csv -ta "study;name" -sl "2-"
study,name
true,George
false,Monica

Info

DATA (example.data)
swift-cli.py example.data -i
Relation name:
Objects count: 4
Attributes count: 4
====================

name: age
index: 0
type: numeric
max: 44.0, min: 30.0
values appearance:
    44.0: 1/4 = 25.00%
    30.0: 1/4 = 25.00%
    35.0: 1/4 = 25.00%
    31.0: 1/4 = 25.00%

name: job
index: 1
type: nominal
values appearance:
    ?: 1/4 = 25.00% (none value)
    doctor: 1/4 = 25.00%
    teacher: 1/4 = 25.00%
    pilot: 1/4 = 25.00%

name: work
index: 2
type: nominal
values appearance:
    1: 2/4 = 50.00%
    0: 2/4 = 50.00%

name: class
index: 3
type: nominal
values appearance:
    bar: 1/4 = 25.00%
    foo: 3/4 = 75.00%

Preview

DAT (example.dat)
swift-cli.py example.dat -p
0 1 2 3
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1

INSTALLATION

Requirements

Make sure that the following dependencies are installed on your computer:

Download

This program can be obtained from the repository using git:
git clone git@github.com:gnovis/swift.git
or by the direct link.

Unpack

If you download the source through the direct link, you need to unpack the ZIP archive. For example by using the unzip program (on Linux):
cd ~/Downloads
unzip swift-master
mv swift-master swift

Finish

Go to the swift directory (cd path/to/swift/) and start using Swift with the swift-cli.py or swift.py scripts.

REPORTING BUGS

If you find a bug, please create an issue with a description via the issue tracker.

CONTRIBUTING

You are welcome to participate in the development of this project. Join us on GitHub.

AUTHORS

Created by Jan Nováček <novacekj5@gmail.com> and Jan Outrata <jan.outrata@upol.cz>.

Proofreaders

Veronika Nováčková <veronika.novackovaa@gmail.com>

WEBSITE

The project website: http://gnovis.github.io/swift/.

LICENSE

Swift is distributed under the GNU GPL v3.