NAME
    File::Name::Sanitize - Clean odd characters from file names

SYNOPSIS
        use File::Basename;         # for fileparse()
        use File::Copy;             # for move()
        use File::Name::Sanitize;

        # Make an object & set any|all the alphabets to lower case.
        my $sane = File::Name::Sanitize->make( case => 'lower' );

        # Set the non-alphanumeric characters (class) to keep.
        $sane->wanted_non_alphanum( '-._' );

        # Set to keep only the digits 0-7.
        $sane->wanted_num( '0-7' );

        # Set to keep only the lower case letters a-z.
        $sane->wanted_alpha( 'a-z' );

        # Names to be sanitized
        my @names =
          ( '--some-file' , 'ANOTHER\tfile' , 'An ObsCeNe \nAme' );

        # Now is the time to do some cleaning.
        foreach my $old ( @names ) {

          my ($dir , $file) = fileparse( $old , '' );

          # Each of these methods changes $file in place.
          $sane->change_case( $file );
          $sane->clean_unwanted( $file );
          $sane->hyphenate( $file );
          $sane->clean_wanted_non_alphanum( $file );

          my $new = $dir . $file;

          printf "old: %s -> new: %s\n" , $old , $new ;
          move( $old , $new)
            or die "Couldn't move '$old' to '$new': $!\n";
        }

DESCRIPTION
    File::Name::Sanitize generates a cleaned-up version of a given string.
    It does not do any actual file moving or renaming.

    It is an OO module. There is only a class constructor method
    (make()); no copier or cloner exists.

    Using the default settings and the sanitize() method, this module
    cleanses a (file) name in the following (loose) order ...

    *   All the characters not matching "[-.[:alpha:]\d]" are converted
        to *-*.

    *   Multiple occurrences of [-.] are changed to one.

    *   *-.* sequence is changed to a single *-*.

    *   Any sequence of *-.* is removed from the ends. If the name
        starts with *.*, the dot is preserved.

METHODS
    case()
        If *no input* is given, returns the currently set case for the
        alphabets. Otherwise, records the case change specification for
        future use and returns nothing. Valid values are *any*, *lower*,
        and *upper*; the default is *lower*.

    change_case()
        Input is a string to change the case of. It changes the case of
        the string in place; returns nothing.
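As a rough illustration of the default cleaning order listed under DESCRIPTION, the steps can be approximated with plain regular expressions. This is only a sketch, not the module's code: sketch_sanitize() is a hypothetical name, the case change is left out, and the step "-. sequence is changed to a single -" is read here as "any mixed run of '-' and '.' becomes one '-'", which is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper; NOT part of File::Name::Sanitize.
sub sketch_sanitize {
    my ($name) = @_;

    # Remember a leading dot so it can be restored at the end.
    my $leading_dot = $name =~ /^\./ ? '.' : '';

    # 1. Replace every character outside the default wanted set.
    $name =~ s/[^-.[:alpha:]\d]/-/g;

    # 2. Squeeze repeated '-' or repeated '.' to one character.
    $name =~ tr/-.//s;

    # 3. Assumption: a mixed run of '-' and '.' becomes one '-'.
    $name =~ s/[-.]{2,}/-/g;

    # 4. Trim '-' and '.' from both ends.
    $name =~ s/^[-.]+//;
    $name =~ s/[-.]+$//;

    return $leading_dot . $name;
}

print sketch_sanitize( '--some-file' ) , "\n";        # some-file
print sketch_sanitize( 'An ObsCeNe \nAme' ) , "\n";   # An-ObsCeNe-nAme
print sketch_sanitize( '.bash rc..' ) , "\n";         # .bash-rc
```

Unlike the module's methods, this helper returns a new string instead of changing its argument in place.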
    clean_wanted_non_alphanum()
        (Functionality subject to change.) Input is a string to operate
        on. Trims, in place, sequences of the characters returned by
        "wanted_non_alphanum()"; returns nothing.

    clean_unwanted()
        Input is a string to clean. Optionally, a substitute string for
        unwanted characters can be given; the default is *-*.
        Removes|Substitutes the unwanted characters of the given string
        in place; returns nothing. See the "wanted_non_alphanum()",
        "wanted_num()", and "wanted_alpha()" methods.

    exec_proc()
        Input is a hash of a code reference, a directory name, and a
        file name ...

            $sane->exec_proc
              ( 'dir'  => '/some/directory'
              , 'file' => 'some file'
              , 'code' => sub { ... }   # See below.
              ) ;

        The directory (in $dir, with trailing '/'), the file (in $file),
        and $_ are available to the code being executed. $_ is a
        (localised) copy of $file. The given code is executed under
        "no strict;". A directory, with trailing slash, and a file name
        are expected to be returned after the code execution. That is
        SIMILAR to ...

            sub {
              my ($dir , $file) = @_;
              my $old = $file;
              local $_ = $file;   # the localised copy described above

              { no strict;
                < GIVEN CODE IS EXECUTED HERE >
              }

              # Returns the changed file name
              return ($dir , $old ne $_ ? $_ : $file );
            }

    hyphenate()
        Input is a string to hyphenate. Puts a hyphen|dash between any
        consecutive sequences of alphabets & digits in the given string
        in place; returns nothing.

    make()
        Constructor; returns a File::Name::Sanitize object. Input can
        optionally be passed as a list/hash, but not as a hash
        reference. The options are ...

        *case*
            Desired case for a file name. A value of 'any' does not
            touch the case of the name. There is no provision for
            capitalization (e.g. Name-Like-This).

        *want-hyphenation* < 0 | 1 >
            Set to true to hyphenate sequences of alphabets and digits
            (base 10 numbers); the default is *0*, false. If set,
            'rot13tor' will be changed to 'rot-13-tor'.

        *wanted-non-alphanum*
            Specifies the non-alphabet, non-digit characters to be kept
            in a file name; the default is *-.*.
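The hyphenation described for hyphenate() and the *want-hyphenation* option, together with the "in place; returns nothing" convention used by several methods, can be sketched in plain Perl. The helper name sketch_hyphenate() is hypothetical, and using the @_ alias to change the caller's variable is an assumption about how such a method could work, not a statement about the module's internals.

```perl
use strict;
use warnings;

# Hypothetical helper; NOT part of File::Name::Sanitize.
sub sketch_hyphenate {
    # $_[0] is an alias to the caller's variable, so the substitution
    # below is visible to the caller without returning anything.
    # A '-' is inserted wherever a letter meets a digit or a digit
    # meets a letter.
    $_[0] =~ s/(?<=[[:alpha:]])(?=\d)|(?<=\d)(?=[[:alpha:]])/-/g;
    return;
}

my $name = 'rot13tor';
sketch_hyphenate( $name );
print "$name\n";    # rot-13-tor
```

Because the argument is changed through the alias, such a helper has to be called with a variable; passing a literal string would die with a "Modification of a read-only value" error.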
    sanitize()
        Inputs are a string to be cleaned and an optional replacement
        string to be substituted for unwanted characters; see
        *clean_unwanted()* for details on the replacement. It calls the
        following in some order: "change_case()", "hyphenate()",
        "clean_unwanted()", "clean_wanted_non_alphanum()"; returns the
        cleaned string.

    wanted_alpha()
        If *no input* is given, returns the alphabet characters to be
        kept in a file name. Otherwise, records the given string as the
        alphabets to keep in a file name; returns nothing.

    wanted_alpha_bycase()
        If *no input* is given, returns the alphabet characters to be
        kept in a file name, based on the value returned by "case()".
        Otherwise, records the given string as the alphabets to keep in
        a file name; returns nothing.

    wanted_non_alphanum()
        If *no input* is given, returns the sequence of non-alphanum
        characters to be kept in a file name. Otherwise, records the
        non-alphanum characters to keep in a file name; returns nothing.

    wanted_num()
        If *no input* is given, returns the digit characters to be kept
        in a file name. Otherwise, records the digits to keep in a file
        name; returns nothing.

    want_hyphenation()
        If *no input* is given, returns a truth value indicating whether
        hyphenation is requested. Otherwise, records the given truth
        value; returns nothing.

FUTURE PLANS
    I have converted the module to use POSIX character classes, i.e.
    changed *[a-z]* to *[[:alpha:]]*. I am planning to add the default
    locale of 'en_US.US_ASCII' to get the original behaviour for myself,
    and to allow the locale to be changed per user desire.

    I am not set on using only the locale to change the character
    classes. If in the meantime I find something else which works
    equally well, I will use that.

BUGS
    Currently, it is not possible to specify AND use fully formed
    character classes, say "[[:alpha:]]" or "[-.[:alpha:]]"; only the
    string enclosed by *[* and *]* can be given, as in "[:alpha:]" or
    "-.[:alpha:]".
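To make the point about bare class strings concrete, here is a small sketch of how one such string can yield both a "wanted" and an "unwanted" class for use with s///. The helper name classes_from() is hypothetical and the code is an illustration only, not the module's implementation.

```perl
use strict;
use warnings;

# Hypothetical helper; NOT part of File::Name::Sanitize.
# Takes the bare class body (no enclosing [ and ]) and wraps it twice:
# once as the wanted class, once as its negation.
sub classes_from {
    my ($wanted) = @_;
    return ( qr/[$wanted]/ , qr/[^$wanted]/ );
}

my ($keep , $drop) = classes_from( '-.[:alpha:]\d' );

( my $name = 'An ObsCeNe nAme!' ) =~ s/$drop/-/g;
print "$name\n";    # An-ObsCeNe-nAme-
```

Had the caller passed "[-.[:alpha:]\d]" instead, the negated form could not be built by simply adding a "^" in front of the string.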
    This limitation is due to the need to use the same string for both
    "tr//" and "s///" for wanted non-alphabets and non-numbers, and to
    deduce the unwanted characters from the given wanted characters.

    For wanted characters specified as "-.[:alpha:]\d", the regular
    expression for wanted characters becomes "[-.[:alpha:]\d]" and the
    one for unwanted characters becomes "[^-.[:alpha:]\d]". That would
    not be possible if the wanted characters had been specified as a
    valid regular expression character class.

SEE ALSO
    mmv(1), rename

VERSION and AUTHOR
    Parv, parv(at)pair(dot)com

    Version: 6.00
    Modified: Jul 30 2006

DISTRIBUTION AND SUCH
    This software is free to be used in any form only if proper credit
    is given. I am not responsible for any kind of damage or loss. Use
    it at your own risk.