Recently I have been playing with some ideas about applying static analysis to Python and building a Python editor in JetBrains MPS.

To do any of this I first need to build a model of Python code. We have recently seen how to parse Python code, but we still need to consider all the packages our code uses. Some of those could be builtin or implemented through C extensions, which means we do not have Python code for them. In this post I look into retrieving a list of all modules and then inspecting their contents.

My strategy is to rely on reflection, writing the scripts in Python. I will then invoke those scripts from inside JetBrains MPS (and so from Java code). However, that is the topic of a future post.

Listing modules

Listing top-level modules is relatively easy if you know how to do it. This script prints a list of all top-level modules:

import pkgutil

for p in pkgutil.iter_modules():
    print(p[1])
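One caveat, as far as I can tell: pkgutil.iter_modules() only reports what it finds through sys.path, so modules compiled into the interpreter itself (sys, for example) do not show up. A minimal sketch that merges in sys.builtin_module_names could look like this. It assumes Python 3.6 or later, and the function name list_top_level_modules is mine, purely for illustration:

```python
import pkgutil
import sys


def list_top_level_modules():
    """Return the sorted names of all top-level modules the interpreter can see."""
    # Modules and packages discoverable through the finders on sys.path
    names = {info.name for info in pkgutil.iter_modules()}
    # Modules compiled into the interpreter itself, which have no file on sys.path
    names.update(sys.builtin_module_names)
    return sorted(names)


if __name__ == "__main__":
    for name in list_top_level_modules():
        print(name)
```

Using a set before sorting avoids printing a name twice when the same module is visible from more than one place.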

Now we need to look inside modules to find sub-modules. For performance reasons I want to do that only when it is needed:

import pkgutil
import sys

def explore_package(module_name):
    # The loader tells us where the package lives on disk
    loader = pkgutil.get_loader(module_name)
    for sub_module in pkgutil.walk_packages([loader.filename]):
        _, sub_module_name, _ = sub_module
        qname = module_name + "." + sub_module_name
        print(qname)
        # Recurse to list whatever this sub-module contains
        explore_package(qname)

explore_package(sys.argv[1])

For example, for xml I get:

xml.dom
xml.dom.NodeFilter
xml.dom.domreg
xml.dom.expatbuilder
xml.dom.minicompat
xml.dom.minidom
xml.dom.pulldom
xml.dom.xmlbuilder
xml.etree
xml.etree.ElementInclude
xml.etree.ElementPath
xml.etree.ElementTree
xml.etree.cElementTree
xml.parsers
xml.parsers.expat
xml.sax
xml.sax._exceptions
xml.sax.expatreader
xml.sax.handler
xml.sax.saxutils
xml.sax.xmlreader
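For what it is worth, here is how I would sketch the same lazy exploration on Python 3, where I would not count on the loader having a filename attribute. This variant imports the package and reads its __path__ instead. The name iter_submodules is made up for this example, and keep in mind that importing a package runs its initialization code:

```python
import importlib
import pkgutil


def iter_submodules(package_name):
    """Yield the qualified names of all sub-modules, recursing lazily."""
    package = importlib.import_module(package_name)
    # Plain modules have no __path__, so there is nothing to look into
    search_path = getattr(package, "__path__", None)
    if search_path is None:
        return
    for info in pkgutil.iter_modules(search_path):
        qname = package_name + "." + info.name
        yield qname
        # Only sub-packages can contain further sub-modules
        if info.ispkg:
            yield from iter_submodules(qname)


if __name__ == "__main__":
    for name in iter_submodules("xml"):
        print(name)
```

Because it is a generator, nothing below a sub-package is imported until the caller actually asks for the next name.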

Examining module contents and recognizing functions

Now given a module I need to list all its contents. I can load the module by name and iterate over it, printing information about the elements found.
I want to distinguish between classes, submodules (which I will ignore for now), functions and simple values.
Builtin functions need to be treated differently: to access their information I need to parse their documentation. Not cool, not cool at all.

import sys
import inspect

def describe_builtin(obj):
    """ Describe a builtin function """
    # Built-in functions cannot be inspected by
    # inspect.getargspec. We have to try and parse
    # the __doc__ attribute of the function.
    docstr = obj.__doc__
    args = ''
    if docstr:
        items = docstr.split('\n')
        if items:
            func_descr = items[0]
            s = func_descr.replace(obj.__name__,'')
            idx1 = s.find('(')
            idx2 = s.find(')',idx1)
            if idx1 != -1 and idx2 != -1 and (idx2>idx1+1):
                args = s[idx1+1:idx2]
    return args

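package_name = sys.argv[1].strip()
# A non-empty fromlist makes __import__ return the named sub-module
# (e.g. xml.dom) instead of the top-level package (xml).
mymodule = __import__(package_name, fromlist=['foo'])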

for element_name in dir(mymodule):
    element = getattr(mymodule, element_name)
    if inspect.isclass(element):
        print("class %s" % element_name)
    elif inspect.ismodule(element):
        # Sub-modules are ignored for now
        pass
    elif hasattr(element, '__call__'):
        if inspect.isbuiltin(element):
            sys.stdout.write("builtin_function %s" % element_name)
            data = describe_builtin(element)
            data = data.replace("[", " [")
            data = data.replace("  [", " [")
            data = data.replace(" [, ", " [")
            sys.stdout.write(data.replace(", ", " "))
            print("")
        else:
            try:
                data = inspect.getargspec(element)
                sys.stdout.write("function %s" % element_name)
                for a in data.args:
                    sys.stdout.write(" ")
                    sys.stdout.write(a)
                if data.varargs:
                    sys.stdout.write(" *")
                    sys.stdout.write(data.varargs)
                print("")
            except (TypeError, ValueError):
                # Skip callables that getargspec cannot introspect
                pass
    else:
        print("value %s" % element_name)

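This is what I get for the module os. Note that for builtin functions the script writes the arguments right after the name, with no separator, so "accesspath mode" stands for access(path, mode):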

value EX_CANTCREAT
value EX_CONFIG
value EX_DATAERR
value EX_IOERR
value EX_NOHOST
value EX_NOINPUT
value EX_NOPERM
value EX_NOUSER
value EX_OK
value EX_OSERR
value EX_OSFILE
value EX_PROTOCOL
value EX_SOFTWARE
value EX_TEMPFAIL
value EX_UNAVAILABLE
value EX_USAGE
value F_OK
value NGROUPS_MAX
value O_APPEND
value O_ASYNC
value O_CREAT
value O_DIRECT
value O_DIRECTORY
value O_DSYNC
value O_EXCL
value O_LARGEFILE
value O_NDELAY
value O_NOATIME
value O_NOCTTY
value O_NOFOLLOW
value O_NONBLOCK
value O_RDONLY
value O_RDWR
value O_RSYNC
value O_SYNC
value O_TRUNC
value O_WRONLY
value P_NOWAIT
value P_NOWAITO
value P_WAIT
value R_OK
value SEEK_CUR
value SEEK_END
value SEEK_SET
value ST_APPEND
value ST_MANDLOCK
value ST_NOATIME
value ST_NODEV
value ST_NODIRATIME
value ST_NOEXEC
value ST_NOSUID
value ST_RDONLY
value ST_RELATIME
value ST_SYNCHRONOUS
value ST_WRITE
value TMP_MAX
value WCONTINUED
builtin_function WCOREDUMPstatus
builtin_function WEXITSTATUSstatus
builtin_function WIFCONTINUEDstatus
builtin_function WIFEXITEDstatus
builtin_function WIFSIGNALEDstatus
builtin_function WIFSTOPPEDstatus
value WNOHANG
builtin_function WSTOPSIGstatus
builtin_function WTERMSIGstatus
value WUNTRACED
value W_OK
value X_OK
class _Environ
value __all__
value __builtins__
value __doc__
value __file__
value __name__
value __package__
function _execvpe file args env
function _exists name
builtin_function _exitstatus
function _get_exports_list module
function _make_stat_result tup dict
function _make_statvfs_result tup dict
function _pickle_stat_result sr
function _pickle_statvfs_result sr
function _spawnvef mode file args env func
builtin_function abort
builtin_function accesspath mode
value altsep
builtin_function chdirpath
builtin_function chmodpath mode
builtin_function chownpath uid gid
builtin_function chrootpath
builtin_function closefd
builtin_function closerangefd_low fd_high
builtin_function confstrname
value confstr_names
builtin_function ctermid
value curdir
value defpath
value devnull
builtin_function dupfd
builtin_function dup2old_fd new_fd
value environ
class error
function execl file *args
function execle file *args
function execlp file *args
function execlpe file *args
builtin_function execvpath args
builtin_function execvepath args env
function execvp file args
function execvpe file args env
value extsep
builtin_function fchdirfildes
builtin_function fchmodfd mode
builtin_function fchownfd uid gid
builtin_function fdatasyncfildes
builtin_function fdopenfd [mode='r' [bufsize]]
builtin_function fork
builtin_function forkpty
builtin_function fpathconffd name
builtin_function fstatfd
builtin_function fstatvfsfd
builtin_function fsyncfildes
builtin_function ftruncatefd length
builtin_function getcwd
builtin_function getcwdu
builtin_function getegid
function getenv key default
builtin_function geteuid
builtin_function getgid
builtin_function getgroups
builtin_function getloadavg
builtin_function getlogin
builtin_function getpgidpid
builtin_function getpgrp
builtin_function getpid
builtin_function getppid
builtin_function getresgid
builtin_function getresuid
builtin_function getsidpid
builtin_function getuid
builtin_function initgroupsusername gid
builtin_function isattyfd
builtin_function killpid sig
builtin_function killpgpgid sig
builtin_function lchownpath uid gid
value linesep
builtin_function linksrc dst
builtin_function listdirpath
builtin_function lseekfd pos how
builtin_function lstatpath
builtin_function majordevice
builtin_function makedevmajor minor
function makedirs name mode
builtin_function minordevice
builtin_function mkdirpath [mode=0777]
builtin_function mkfifofilename [mode=0666]
builtin_function mknodfilename [mode=0600 device]
value name
builtin_function niceinc
builtin_function openfilename flag [mode=0777]
builtin_function openpty
value pardir
builtin_function pathconfpath name
value pathconf_names
value pathsep
builtin_function pipe
builtin_function popencommand [mode='r' [bufsize]]
function popen2 cmd mode bufsize
function popen3 cmd mode bufsize
function popen4 cmd mode bufsize
builtin_function putenvkey value
builtin_function readfd buffersize
builtin_function readlinkpath
builtin_function removepath
function removedirs name
builtin_function renameold new
function renames old new
builtin_function rmdirpath
value sep
builtin_function setegidgid
builtin_function seteuiduid
builtin_function setgidgid
builtin_function setgroupslist
builtin_function setpgidpid pgrp
builtin_function setpgrp
builtin_function setregidrgid egid
builtin_function setresgidrgid egid sgid
builtin_function setresuidruid euid suid
builtin_function setreuidruid euid
builtin_function setsid
builtin_function setuiduid
function spawnl mode file *args
function spawnle mode file *args
function spawnlp mode file *args
function spawnlpe mode file *args
function spawnv mode file args
function spawnve mode file args env
function spawnvp mode file args
function spawnvpe mode file args env
builtin_function statpath
builtin_function stat_float_times [newval]
class stat_result
builtin_function statvfspath
class statvfs_result
builtin_function strerrorcode
builtin_function symlinksrc dst
builtin_function sysconfname
value sysconf_names
builtin_function systemcommand
builtin_function tcgetpgrpfd
builtin_function tcsetpgrpfd pgid
builtin_function tempnam [dir [prefix]]
builtin_function times
builtin_function tmpfile
builtin_function tmpnam
builtin_function ttynamefd
builtin_function umasknew_mask
builtin_function uname
builtin_function unlinkpath
builtin_function unsetenvkey
builtin_function urandomn
builtin_function utimepath (atime mtime
builtin_function wait
builtin_function wait3options
builtin_function wait4pid options
builtin_function waitpidpid options
function walk top topdown onerror followlinks
builtin_function writefd strin

Of course, for functions I want to build a model of their interface (which parameters they take, which ones are optional, which ones are variadic and so on). We have the information needed here; it is just a matter of transforming it into a representable form.
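As a rough idea of what such a model could look like, here is a minimal sketch for Python 3 based on inspect.signature. The names describe_parameters and example are hypothetical, and the four kinds ("required", "optional", "var_positional", "var_keyword") are just one possible encoding. When no signature is available, which can happen for some builtins, it returns None and docstring parsing remains the fallback:

```python
import inspect


def describe_parameters(func):
    """Return (name, kind) pairs for func's parameters, or None if unavailable."""
    try:
        signature = inspect.signature(func)
    except (TypeError, ValueError):
        # No introspectable signature: the caller has to fall back to the docstring
        return None
    described = []
    for param in signature.parameters.values():
        if param.kind == inspect.Parameter.VAR_POSITIONAL:
            kind = "var_positional"  # *args
        elif param.kind == inspect.Parameter.VAR_KEYWORD:
            kind = "var_keyword"  # **kwargs
        elif param.default is inspect.Parameter.empty:
            kind = "required"
        else:
            kind = "optional"
        described.append((param.name, kind))
    return described


def example(first, second=None, *rest, **options):
    """Hypothetical function, used only to exercise describe_parameters."""


if __name__ == "__main__":
    print(describe_parameters(example))
```

A list of pairs keeps the parameter order, which matters for positional arguments, and is easy to print in a format that Java code can read back.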

Conclusions

I still need to build a model of the imported classes, but I am starting to have a decent model of the elements I can import in my Python code. This would make it easy to verify which import statements are valid. Of course this can be used in combination with virtualenvs and requirements files: given a list of requirements, I would install them in a virtualenv and build the model of the modules available in that virtualenv. I could then statically verify which imports would work in that context.
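To give a feeling for that last check, here is a small sketch, assuming Python 3 and assuming it runs with the interpreter of the virtualenv in question. It parses a piece of source code with ast, collects the absolute imports and asks importlib.util.find_spec whether each top-level module can be found, without executing the code under analysis. The function names is_importable and find_unresolved_imports are invented for this example, and it only looks at the top-level module of each import:

```python
import ast
import importlib.util


def is_importable(top_level_name):
    """Tell whether a top-level module can be found, without importing it."""
    try:
        return importlib.util.find_spec(top_level_name) is not None
    except ValueError:
        # Raised when the module is already loaded but carries no spec
        return True


def find_unresolved_imports(source):
    """Return the top-level modules imported by source that cannot be found."""
    unresolved = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0:
            names = [node.module]
        else:
            # Not an import, or a relative import that needs package context
            continue
        for name in names:
            top_level = name.split(".")[0]
            if top_level not in unresolved and not is_importable(top_level):
                unresolved.append(top_level)
    return unresolved


if __name__ == "__main__":
    print(find_unresolved_imports("import os\nimport surely_missing_module_xyz\n"))
```

Checking sub-modules and imported names as well would need the per-module model described in this post; this sketch stops at the first component of the dotted name.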