Organization rules¶
To organize an unformatted dataset according to a specification, user
instructions are required. These can be mentioned as rules using
supported keywords. This document will list all supported rules that
will support the organization task. Multiple dataset rules can be
mentioned in a single config by separating them with ---
.
A minimal rule config to copy all files but one looks like this:
source: "/path/to/data"
destination: "/path/to/destination"
pattern: ".*"
tag_rules:
- name: filename
pattern: "(.*)\\."
- name: extension
pattern: "(\\.[^/\\\\]+)$"
skip:
- ignore_this_file
The above rules can be used to organize a dataset with organize()
:
# Organize according to a specification
specification.organize(rules_loaded_as_dict)
Top-level keys¶
source
¶
The root directory where data is present.
destination
¶
The directory in which to store the organized data.
pattern
¶
All source directory contents matching this will be considered for organization.
overwrite
¶
Boolean value indicating whether to overwrite data if exists.
add
¶
Add common files directly to each matched source directory
content. The add
key consists of a sequence of path
and
position
entries.
path
Path to the common file.
position
Position at which to add the common file.
Valid values are
content
andfellow
. They place the file to be added either inside or at the same level as the matched source directory content respectively.
copy_fellows
¶
Boolean value indicating whether to copy other files present at the same directory level.
skip
¶
Sequence of paths to ignore.
tag_rules
¶
All details regarding inferring tags from filenames in the source
directory sit inside the tag_rules
key. The tags inferred are used
for building the path according to the specification. The
tag_rules
key consists of a sequence of tag name
and
pattern
entries along with valid rules for each tag.
name
The name of the tag.
pattern
Regex pattern to retreive the tag value from file path.
It is required to have a mandatory single capturing group in the pattern. The captured value is used as the tag value if it passes the rules.
The valid rules that can be used are:
default
The default value to use if no value is captured via
pattern
.case
Accepts either
upper
orlower
and accordingly changes the case of the value captured.length
Mentions the valid length of captured string.
If
iffy_prepend
is provided, the value is validated again post prefix addition.iffy_prepend
If the length of the captured string does not equal the
length
rule, then theiffy_prepend
value is added as a prefix to the captured value.pad
Left pad the captured string using
character
till it is oflength
.Takes a mapping of scalars to scalars with
character
andlength
as keys.value
The value to use for the tag. Overrides captured values if necessary.
replace
If provided, the tag value is replaced by looking up a one-to-one mapping from a csv file.
Takes a sequence of mappings with keys
col
,with
, andfrom
to form the rule.col
Column name in the mapping that represents the tag. This will be looked up.
with
Column name in the mapping that will be used to replace the value.
from
Path to the mapping csv file.
Important
If a mandatory tag cannot be inferred from the file path, then path building fails and the file is not added to the organized dataset. More info on the tag that could not be inferred can be found in logs.