Organization rules¶
Organizing an unformatted dataset according to a specification requires
user instructions. These are written as rules using supported keywords.
This document lists all supported rules. Rules for multiple datasets can
be placed in a single config by separating them with ---.
A minimal rule config to copy all files but one looks like this:
source: "/path/to/data"
destination: "/path/to/destination"
pattern: ".*"
tag_rules:
  - name: filename
    pattern: "(.*)\\."
  - name: extension
    pattern: "(\\.[^/\\\\]+)$"
skip:
  - ignore_this_file
The above rules can be used to organize a dataset with organize():
# Organize according to a specification
specification.organize(rules_loaded_as_dict)
Top-level keys¶
source¶
The root directory where data is present.
destination¶
The directory in which to store the organized data.
pattern¶
All source directory contents matching this will be considered for organization.
overwrite¶
Boolean value indicating whether to overwrite data if it already exists.
add¶
Add common files directly to each matched source directory
content. The add key consists of a sequence of path and
position entries.
path
    Path to the common file.
position
    Position at which to add the common file. Valid values are content and fellow. They place the file to be added either inside or at the same level as the matched source directory content, respectively.
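The difference between the two position values can be pictured with a short path calculation. This is only a sketch of one reading of the description: the function name added_file_target and the example paths are hypothetical, and it assumes the matched content is a directory and that the common file keeps its own name.

```python
from pathlib import PurePosixPath


def added_file_target(matched: str, common_file: str, position: str) -> PurePosixPath:
    """Return where a common file would land for a given position value."""
    matched_path = PurePosixPath(matched)
    file_name = PurePosixPath(common_file).name
    if position == "content":
        # Inside the matched source directory content.
        return matched_path / file_name
    if position == "fellow":
        # At the same level as the matched source directory content.
        return matched_path.parent / file_name
    raise ValueError(f"unsupported position: {position!r}")


print(added_file_target("/path/to/data/subject1", "/common/README.txt", "content"))
# /path/to/data/subject1/README.txt
print(added_file_target("/path/to/data/subject1", "/common/README.txt", "fellow"))
# /path/to/data/README.txt
```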
copy_fellows¶
Boolean value indicating whether to copy other files present at the same directory level.
skip¶
Sequence of paths to ignore.
tag_rules¶
All details regarding inferring tags from filenames in the source
directory sit inside the tag_rules key. The tags inferred are used
for building the path according to the specification. The
tag_rules key consists of a sequence of tag name and
pattern entries along with valid rules for each tag.
name
    The name of the tag.
pattern
    Regex pattern to retrieve the tag value from the file path. The pattern must contain a single capturing group. The captured value is used as the tag value if it passes the rules.
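To see what a single capturing group yields, the two patterns from the minimal config can be tried with Python's re module. This is a sketch only: the helper capture_tag and the sample file name are hypothetical, and it assumes the pattern is searched anywhere in the path, which may differ from how the tool applies it. The YAML strings "(.*)\\." and "(\\.[^/\\\\]+)$" correspond to the raw Python strings below.

```python
import re
from typing import Optional

# Patterns from the minimal config, written as raw strings.
FILENAME_PATTERN = r"(.*)\."
EXTENSION_PATTERN = r"(\.[^/\\]+)$"


def capture_tag(pattern: str, path: str) -> Optional[str]:
    """Return the first capturing group, or None if the pattern does not match."""
    match = re.search(pattern, path)
    return match.group(1) if match else None


print(capture_tag(FILENAME_PATTERN, "image.png"))   # image
print(capture_tag(EXTENSION_PATTERN, "image.png"))  # .png
print(capture_tag(EXTENSION_PATTERN, "README"))     # None
```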
The valid rules that can be used are:
default
    The default value to use if no value is captured via pattern.
case
    Accepts either upper or lower and changes the case of the captured value accordingly.
length
    Specifies the valid length of the captured string. If iffy_prepend is provided, the value is validated again after the prefix is added.
iffy_prepend
    If the length of the captured string does not equal the length rule, then the iffy_prepend value is added as a prefix to the captured value.
pad
    Left-pads the captured string using character until it is of length. Takes a mapping of scalars to scalars with character and length as keys.
value
    The value to use for the tag. Overrides captured values if necessary.
replace
    If provided, the tag value is replaced by looking up a one-to-one mapping from a CSV file. Takes a sequence of mappings with keys col, with, and from to form the rule.
    col
        Column name in the mapping that represents the tag. This will be looked up.
    with
        Column name in the mapping that will be used to replace the value.
    from
        Path to the mapping CSV file.
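The replace rule amounts to a lookup from one CSV column to another. The sketch below illustrates the idea with Python's csv module. The column names old_id and new_id, the inline CSV text, and both helper functions are hypothetical, and returning unmapped values unchanged is an assumption rather than documented behavior.

```python
import csv
import io

# Hypothetical mapping contents; in a real rule the `from` key would
# point to a CSV file on disk instead.
MAPPING_CSV = """old_id,new_id
A12,001
B34,002
"""


def build_lookup(csv_text: str, col: str, with_: str) -> dict:
    """Map values in the `col` column to values in the `with` column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[col]: row[with_] for row in reader}


def replace_tag(value: str, lookup: dict) -> str:
    # Assumption: a value missing from the mapping is left unchanged.
    return lookup.get(value, value)


lookup = build_lookup(MAPPING_CSV, col="old_id", with_="new_id")
print(replace_tag("A12", lookup))  # 001
print(replace_tag("Z99", lookup))  # Z99
```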
Important
If a mandatory tag cannot be inferred from the file path, then path building fails and the file is not added to the organized dataset. More information on the tag that could not be inferred can be found in the logs.
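The way the default, case, pad, length, iffy_prepend, and value rules interact can be sketched in a few lines of Python. This is not the tool's implementation: the function apply_tag_rules is hypothetical, and the order in which the rules are applied, as well as value bypassing all other rules, are assumptions. A None result stands for a tag that could not be inferred.

```python
from typing import Optional


def apply_tag_rules(captured: Optional[str], rules: dict) -> Optional[str]:
    """Return the final tag value, or None if it cannot be inferred."""
    # `value` overrides anything captured by the pattern (assumed to skip other rules).
    if "value" in rules:
        return str(rules["value"])

    # Fall back to `default` when the pattern captured nothing.
    value = captured if captured else rules.get("default")
    if value is None:
        return None
    value = str(value)

    # `case` accepts "upper" or "lower".
    if rules.get("case") == "upper":
        value = value.upper()
    elif rules.get("case") == "lower":
        value = value.lower()

    # `pad` left-pads with a single character up to the given length.
    pad = rules.get("pad")
    if pad:
        value = value.rjust(int(pad["length"]), str(pad["character"]))

    # `length` validates the value; on a mismatch `iffy_prepend` is added
    # as a prefix and the result is validated again.
    length = rules.get("length")
    if length is not None and len(value) != int(length):
        prefix = rules.get("iffy_prepend")
        if prefix is not None:
            value = str(prefix) + value
        if len(value) != int(length):
            return None
    return value


print(apply_tag_rules("7", {"pad": {"character": "0", "length": 3}}))  # 007
print(apply_tag_rules("01", {"length": 3, "iffy_prepend": "S"}))       # S01
print(apply_tag_rules("1", {"length": 3, "iffy_prepend": "S"}))        # None
```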