Text Parser to Normalized CSV in Python
Problem • Approach • Solution • Requirements • How To Run • Technical • Notes • To-Do
Problem frame: there is no comprehensive inventory of SAS scripts, neither of the files themselves nor of their contents. For example, questions like the following cannot be answered:
- How many SAS scripts are there across multiple directories, when were they created and last accessed?
- How can I find out which SAS scripts reference a specific data table?
- How can I find which SAS scripts reference / import other SAS scripts, and which ones they reference?
- How can I find SAS scripts that reference a particular SQL table column?
- How can I find SAS scripts that have hardcoded dates?
- and so on...
Attempting to predict every question about code content is intractable. A better strategy is a generalized one: let a developer write parsers that generate queryable results, so that specific questions can be answered as they arise.
One way to approach this is with commands the OS already provides. For example, if you wanted the line count of each SAS script in a directory, you might use the wc command; if you wanted to know which scripts contain hardcoded dates, you could use sed with a regular expression. But this approach gets messy, and the results are not in a uniform format suitable for aggregate queries.
The approach used here is more generalized and consistent, works across a range of operating systems, and requires only knowledge of Python and SQL. It consists of two parts:
- a main routine that executes any number of predefined [Python] parser functions on a directory (and its sub-directories). The main routine runs the list of parsers sequentially on each file of the specified type and then writes the results to CSV files in an output directory, also specified on the command line.
- parser functions that mine information from one or more [text] scripts, returning it in a format (CSV) that can be ingested by most databases and then queried for various aspects of the parsed information.
The solution consists of two Python scripts:
- sas_parser2.py
- parse_functions.py
where sas_parser2.py contains the main routine for:
- ingesting the CLI parameters (see the sketch below)
- inserting the list of parse functions to be executed into the dim_func dataframe
- reading the directory and building the list of files to be inspected
- inserting the file list into the summary dataframe
- executing the parse functions one at a time on each file
- inserting the results of each parse function into the detail dataframe
- writing out each of the dataframes to its respective CSV file
and parse_functions.py contains the individual parse functions.
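A minimal sketch of how the CLI ingestion could look, assuming argparse and the -i/-t/-o flags shown under How To Run (the actual sas_parser2.py may differ):

```python
import argparse

def parse_cli_args():
    """Sketch: parse the -i/-t/-o flags described under How To Run."""
    parser = argparse.ArgumentParser(
        description="Run parser functions over a directory of scripts")
    parser.add_argument("-i", dest="input_dir", required=True,
                        help="input directory of scripts")
    parser.add_argument("-t", dest="file_type", required=True,
                        help="file extension to match, e.g. 'sas'")
    parser.add_argument("-o", dest="output_dir", required=True,
                        help="output directory for the CSV files")
    return parser.parse_args()
```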
summary
column name | type |
---|---|
summ_idx | integer |
f_name | text |
dir_path | text |
create_dt | datetime w tz |
modified_dt | datetime w tz |
detail
column name | type |
---|---|
summ_idx | integer |
func_idx | integer |
func_value | text |
dim_func
column name | type |
---|---|
func_idx | integer |
func_name | text |
Relationship: each detail row joins to one summary row via summ_idx and to one dim_func row via func_idx; summary and dim_func each have a one-to-many relationship with detail.
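For illustration, a minimal sketch of how the three dataframes could be declared with these schemas (column names from the tables above; the actual initialization in sas_parser2.py may differ):

```python
import pandas as pd

# empty dataframes matching the schemas above (a sketch, not the actual source)
dim_func_df = pd.DataFrame(columns=["func_idx", "func_name"])
summary_df = pd.DataFrame(columns=["summ_idx", "f_name", "dir_path",
                                   "create_dt", "modified_dt"])
detail_df = pd.DataFrame(columns=["summ_idx", "func_idx", "func_value"])
```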
A Python 3.x environment with the following packages installed:
- pandas
- tqdm
There are two high-level steps to working with the parser:
- run the parser code on the directory of files
- import the parser output CSVs into a database for analysis
Note: Analysis of the output is beyond the scope of this document.
To run the code with pre-selected parsers, first clone the repository, activate the Python environment, and then run any of the following:
```shell
# command and example command line arguments for sas files
# -i is the input directory, -t is the file extension, -o is the output directory
> python sas_parser2.py -i 'some_directory_of_scripts' -t 'sas' -o 'results'
```
On execution, the script displays a progress bar (via tqdm) as it runs each parser over the specified directory of files.
Note: If you only want specific parsers to execute, edit the sas_parser2.py file and modify the functions_to_apply variable:
```python
# List of functions to apply
functions_to_apply = [
    get_file_info,
    count_lines,
    count_sql,
    get_sql_code,
    get_libname_lines,
    count_exports,
    count_null_ds,
    find_date_lines,
    find_file_references,
]
```
If there are no errors, three CSV files are written to the output directory with the following names:
- dim_func_yyyymmddhhmmss.csv
- summary_yyyymmddhhmmss.csv
- detail_yyyymmddhhmmss.csv
where yyyymmddhhmmss is the datetime the command was run.
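For reference, a suffix in that format can be produced like this (an illustration; the actual code may differ):

```python
from datetime import datetime

ts = datetime.now().strftime("%Y%m%d%H%M%S")  # e.g. '20240131094215'
detail_path = f"detail_{ts}.csv"              # detail_yyyymmddhhmmss.csv
```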
The easiest way to work with the sas_parser2.py output CSV files is to import them into a database and query for the details you are interested in. Refer to the relationship between the CSV files in the Solution section.
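As one option, here is a hedged sketch of loading the three CSVs into SQLite with pandas (the paths, timestamp, and database file name are placeholders):

```python
import sqlite3
import pandas as pd

con = sqlite3.connect("parser_results.db")  # hypothetical database file

# one table per output CSV; the timestamp suffix is dropped from the table name
for name in ("dim_func", "summary", "detail"):
    df = pd.read_csv(f"results/{name}_20240131094215.csv")  # placeholder file
    df.to_sql(name, con, if_exists="replace", index=False)

con.close()
```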
Assuming you have already imported the CSV files into a database, with table names corresponding to the CSV file names (minus the timestamp suffix), here is an example query:
Q: List all the files in the directory (and sub-directories) sorted by lines-of-code count (descending)
```sql
select summary.f_name as file_name, summary.dir_path, cast(detail.func_value as int) as line_ct
from summary
join detail on summary.summ_idx = detail.summ_idx
join dim_func on detail.func_idx = dim_func.func_idx
where dim_func.func_name = 'count_lines'
order by cast(detail.func_value as int) desc;
```
Q: List files referencing an external file (in the input directory) named: master_script.sas
```sql
select
    summary.f_name as file_name,
    summary.dir_path,
    summary.create_dt,
    summary.modified_dt,
    dim_func.func_name,
    detail.func_value
from summary
join detail on summary.summ_idx = detail.summ_idx
join dim_func on detail.func_idx = dim_func.func_idx
where
    dim_func.func_name in ('find_file_references')
    and upper(detail.func_value) like '%MASTER_SCRIPT.SAS%';
```
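Alternatively, the same joins can be done directly in pandas without a database; a rough sketch (file names are placeholders):

```python
import pandas as pd

summary = pd.read_csv("results/summary_20240131094215.csv")
detail = pd.read_csv("results/detail_20240131094215.csv")
dim_func = pd.read_csv("results/dim_func_20240131094215.csv")

# equivalent of the SQL joins above
joined = detail.merge(summary, on="summ_idx").merge(dim_func, on="func_idx")

# rows from find_file_references that mention master_script.sas
mask = ((joined["func_name"] == "find_file_references")
        & joined["func_value"].str.upper().str.contains("MASTER_SCRIPT.SAS",
                                                        regex=False))
print(joined.loc[mask, ["f_name", "dir_path", "create_dt",
                        "modified_dt", "func_value"]])
```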
Parsing is performed by two Python scripts:
- sas_parser2.py - directory reader, CSV writer and main routine
- parse_functions.py - module of parsing functions called from the main routine in sas_parser2.py
sas_parser2.py uses three pandas dataframes (dim_func_df, summary_df and detail_df) to store the data that is written out to the corresponding CSV files at the end.
At a high level, the script performs the following operations:
1. parse the CLI arguments
2. insert the parsing functions into the dim_func_df dataframe
3. read the directory's files into a list
4. iterate over the file list:
   - 4a. insert the file info into summary_df
   - 4b. for each parsing function:
     - evaluate the file with the parsing function
     - insert the parsing function's results into the detail_df dataframe
5. write out the CSV files
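A minimal sketch of that flow, reusing names from earlier sections (parse_cli_args and functions_to_apply are assumed; the real sas_parser2.py differs in details such as the timestamped file names):

```python
import os
import pandas as pd
from tqdm import tqdm

def main():
    args = parse_cli_args()                                # step 1

    # step 2: one dim_func row per parser function
    dim_func_df = pd.DataFrame({
        "func_idx": range(len(functions_to_apply)),
        "func_name": [f.__name__ for f in functions_to_apply]})

    # step 3: walk the input directory and collect matching files
    files = [os.path.join(root, name)
             for root, _, names in os.walk(args.input_dir)
             for name in names if name.endswith("." + args.file_type)]

    summary_rows, detail_rows = [], []
    for summ_idx, path in enumerate(tqdm(files)):          # step 4
        stat = os.stat(path)                               # step 4a: file info
        summary_rows.append({
            "summ_idx": summ_idx,
            "f_name": os.path.basename(path),
            "dir_path": os.path.dirname(path),
            "create_dt": stat.st_ctime,   # creation vs metadata-change varies by OS
            "modified_dt": stat.st_mtime})
        for func_idx, func in enumerate(functions_to_apply):   # step 4b
            _, values = func(path)
            detail_rows.extend({"summ_idx": summ_idx,
                                "func_idx": func_idx,
                                "func_value": str(v)} for v in values)

    # step 5: write the three CSVs (timestamp suffixes omitted in this sketch)
    dim_func_df.to_csv(os.path.join(args.output_dir, "dim_func.csv"), index=False)
    pd.DataFrame(summary_rows).to_csv(
        os.path.join(args.output_dir, "summary.csv"), index=False)
    pd.DataFrame(detail_rows).to_csv(
        os.path.join(args.output_dir, "detail.csv"), index=False)
```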
parse_functions.py contains all the parse functions called from sas_parser2.py. The general pattern used by the parse functions is:
- call the parse function with the file path
- perform the 'parse' code
- return the function name [string] and corresponding value [string, int, tuple, list, etc]
```python
def count_lines(file_path):
    """Count the number of lines in a file.

    Args:
        file_path (string): full path to the file

    Returns:
        tuple: the function name and a one-element list with the line count
    """
    with open(file_path, 'r', encoding='cp1252') as file:  # Open the file
        lines = file.readlines()  # Read all lines into a list
    return ("count_lines", [len(lines)])  # Return name and line count
```
Of course, you could call a parse function with any number of arguments, but be aware that this must be handled in the sas_parser2.py main routine. For example, the find_file_references function takes two arguments, so there is a conditional statement that checks whether the function takes one or two arguments and calls it accordingly.
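A hedged sketch of one way to implement that dispatch, using inspect to count a function's parameters (the actual conditional in sas_parser2.py may work differently; run_parser and file_name_list are illustrative names):

```python
import inspect

def run_parser(func, file_path, file_name_list):
    """Call a parse function with one or two arguments based on its signature."""
    n_params = len(inspect.signature(func).parameters)
    if n_params == 2:
        # e.g. find_file_references also needs the list of known file names
        return func(file_path, file_name_list)
    return func(file_path)
```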
- Current performance, with 9 parser functions evaluated, is about 600 files per minute
Things to consider adding or improving on:
- parallelizing the main parse routine with the standard library's multiprocessing package (see the sketch after this list)
- passing the script's text to each parser instead of the file name, to avoid opening the file once per parser
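For the parallelization item, a hedged sketch of how the per-file work might be fanned out across processes; process_file and run_parallel are hypothetical helpers, and functions_to_apply is the list from sas_parser2.py:

```python
from multiprocessing import Pool

def process_file(path):
    """Hypothetical helper: run every parser on one file, return its rows."""
    rows = []
    for func_idx, func in enumerate(functions_to_apply):
        _, values = func(path)
        rows.extend((func_idx, str(v)) for v in values)
    return path, rows

def run_parallel(files):
    """Fan the file list out across a pool of worker processes."""
    results = {}
    with Pool() as pool:
        for path, rows in pool.imap_unordered(process_file, files):
            results[path] = rows  # merge into detail_df afterwards
    return results
```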