#regex #sed #tool #command-line #command-line-tool #source-file #file-path

app spidior

A tool for handling sed-like substitution tasks where pesky source code semantics are getting in the way

2 releases

0.2.2 Feb 22, 2022
0.2.1 Feb 22, 2022

#2248 in Development tools

GPL-3.0 license

77KB
1.5K SLoC

spidior

spidior is a command line utility for performing sed-like substitutions on source code files that aims to augment regular expressions with semantic information parsed from the code it is operating on.

Status

Build Status

Building

Install a recent stable rust, clone this repo, and run cargo build.

Running

The following is the --help output for spidior, which shows you how to run it.

spidior 0.1.1
John Westhoff <johnjwesthoff@gmail.com>

USAGE:
    spidior [FLAGS] [OPTIONS]

FLAGS:
    -d, --dump        Whether we should just dump info without replacing
    -h, --help        Prints help information
    -i, --in-place    Whether we should edit files in place or print to stdout
    -I, --interactive Whether we are are interactively replacing things or not
    -n, --nfa         Whether we should print info about the regex nfa
    -r, --recursive   Whether we should search recursively
    -V, --version     Prints version information

OPTIONS:
    -p, --path <path>    The path to the files we are reading [default: .]
    -q, --query <query>  The query string for find/replace for each file we find in the input, required if `dump` is not set

If the --dump argument is used, rather than make any replacements, spidior will simply print out the findings of its lightwight parses from running on the files in the specified path. Otherwise, a query must be specified with either -q or --query.

Queries

Queries are very similar to sed s commands, and take the form ${LOCATION}${COMMAND}/${FIND}/${REPLACE}/${END} Where ${LOCATION} is where replacements should be allowed to take place (more on that below), ${COMMAND} is always s, $(FIND) is a regular expression, ${REPLACE} is a replacement, and ${END} is either nothing or the letter 'g' to allow multiple replacements on a given line.

Locations

A location can be one of several things:

  • % - anywhere in any file the path specifier includes
  • <path_suffix> - anywhere in any file whose path ends in path_suffix
  • {function} - anywhere in any file within a function named function
  • cA-B - anywhere in any file between the Ath (inclusive) and Bth (exclusive) character in the file
  • lA-B - anywhere in any file between the Ath (inclusive) and Bth (exclusive) line in the file

Locations can also be grouped using parens, unioned with |, intersected with &, and negated with ^. Why ^ instead of !? Well I figured since sets in most regex interpreters use ^ for negation it made sense here.

Regex

Regexes follow standard sedlike syntax, and support the following operations:

  • Basic regex operations (concatenation, conjunction, and star [and also plus])
  • Grouping with parens
  • Sets and negative sets, but only ranges and explicit characters (e.g. [a-z] or [^xyz] but not \w or [[:upper:]])
  • And most importantly, special queries about identifiers within input programs
    • Currently these queries are put between double square brackets, with a comma separate list of criteria
      • The supported criteria are name=$NAME where $NAME is the name of the identifier you are grepping for, type=$TYPE where $TYPE is the type of the identifier you are grepping for, and pos=$POS:$LEN where $POS is the position into the string to match on for length $LEN.

Replacements

A replacement is a string literal that may include backreferences to groups using a backslash followed by a number.

Example

Consider this input file, identifiers.java:

public class LightningOvercharge extends Lightning {
    int charge = 0;
    public LightningOvercharge() {
        charge = 0;
    }

    double number;
    @Override
    public void onSpawn(Session me) {
        number = 1;
        me.x = 0;
    }
}

In the onSpawn method, the Session input parameter should not be named me, so let's fix that and change it to sess. We can run the following command: spidior -p identifiers.java '%s/[[type=Session]]/sess/g

The result of this command will be:

public class LightningOvercharge extends Lightning {
    int charge = 0;
    public LightningOvercharge() {
        charge = 0;
    }

    double number;
    @Override
    public void onSpawn(Session sess) {
        number = 1;
        sess.x = 0;
    }
}

Similarly we can run: spidior -p identifiers.java '%s/[[type=double,name=number]]/spawnFlag/g to change the previous result into:

public class LightningOvercharge extends Lightning {
    int charge = 0;
    public LightningOvercharge() {
        charge = 0;
    }

    double spawnFlag;
    @Override
    public void onSpawn(Session sess) {
        spawnFlag = 1;
        sess.x = 0;
    }
}

Note that this changed both the declaration and the usage of the variable number.

Lightweight Parsers

Powering spidior is a set of language-specific lightweight parsers. Currently, spidior requires the ability to parse function declarations, and identifier declaration and usage in order to support operating a language. Right now only a "C-like" parser is written, and it is very overly-enthusiastic - it identifies many things as identifiers that are, in fact, not identifiers. In practice this ends up being OK, because its mistakes end up including keywords as either the type of the name of the identifier, so no real-world replace operation would be foiled by this overzealousness.

As an example, here is the result of running spidior --dump -p identifiers.java:

[{"filename":"identifiers.java","functions":[{"name":"LightningOvercharge","start":507,"end":534},{"name":"onSpawn","start":605,"end":671}],"identifiers":[{"name":"com","type_name":"static","start":67,"end":70},{"name":"com","type_name":"static","start":232,"end":235},{"name":"com","type_name":"static","start":273,"end":276},{"name":"com","type_name":"static","start":316,"end":319},{"name":"com","type_name":"static","start":361,"end":364},{"name":"LightningOvercharge","type_name":"class","start":414,"end":433},{"name":"charge","type_name":"int","start":462,"end":468},{"name":"charge","type_name":"int","start":517,"end":523},{"name":"number","type_name":"double","start":547,"end":553},{"name":"me","type_name":"Session","start":601,"end":603},{"name":"number","type_name":"double","start":615,"end":621},{"name":"me","type_name":"Session","start":635,"end":637},{"name":"me","type_name":"Session","start":635,"end":637}]}]

It correctly identifies the two functions in the source file, but it finds far many variables than actually are real - it found quite a few uses of the "variable" com of the "type" static. Again in reality you would never try to replace on identifiers of type static since that isn't a type, so this isn't an immediate issue.

Dependencies

~5–15MB
~176K SLoC