Python Packaging Do's and Don'ts

July 15, 2014 at 05:20 PM | categories: Python, Mozilla | View Comments

Are you someone who casually interacts with Python but don't know the inner workings of Python? Then this post is for you. Read on to learn why some things are the way they are and how to avoid making some common mistakes.

Always use Virtualenvs

It is an easy trap to view virtualenvs as an obstacle, a distraction towards accomplishing something. People see me adding virtualenvs to build instructions and they say I don't use virtualenvs, they aren't necessary, why are you doing that?

A virtualenv is effectively an overlay on top of your system Python install. Creating a virtualenv can be thought of as copying your system Python environment into a local location. When you modify virtualenvs, you are modifying an isolated container. Modifying virtualenvs has no impact on your system Python.

A goal of a virtualenv is to isolate your system/global Python install from unwanted changes. When you accidentally make a change to a virtualenv, you can just delete the virtualenv and start over from scratch. When you accidentally make a change to your system Python, it can be much, much harder to recover from that.

Another goal of virtualenvs is to allow different versions of packages to exist. Say you are working on two different projects and each requires a specific version of Django. With virtualenvs, you install one version in one virtualenv and a different version in another virtualenv. Things happily coexist because the virtualenvs are independent. Contrast with trying to manage both versions of Django in your system Python installation. Trust me, it's not fun.

Casual Python users may not encounter scenarios where virtualenvs make their lives better... until they do, at which point they realize their system Python install is beyond saving. People who eat, breath, and die Python run into these scenarios all the time. We've learned how bad life without virtualenvs can be and so we use them everywhere.

Use of virtualenvs is a best practice. Not using virtualenvs will result in something unexpected happening. It's only a matter of time.

Please use virtualenvs.

Never use sudo

Do you use sudo to install a Python package? You are doing it wrong.

If you need to use sudo to install a Python package, that almost certainly means you are installing a Python package to your system/global Python install. And this means you are modifying your system Python instead of isolating it and keeping it pristine.

Instead of using sudo to install packages, create a virtualenv and install things into the virtualenv. There should never be permissions issues with virtualenvs - the user that creates a virtualenv has full realm over it.

Never modify the system Python environment

On some systems, such as OS X with Homebrew, you don't need sudo to install Python packages because the user has write access to the Python directory (/usr/local in Homebrew).

For the reasons given above, don't muck around with the system Python environment. Instead, use a virtualenv.

Beware of the package manager

Your system's package manager (apt, yum, etc) is likely using root and/or installing Python packages into the system Python.

For the reasons given above, this is bad. Try to use a virtualenv, if possible. Try to not use the system package manager for installing Python packages.

Use pip for installing packages

Python packaging has historically been a mess. There are a handful of tools and APIs for installing Python packages. As a casual Python user, you only need to know of one of them: pip.

If someone says install a package, you should be thinking create a virtualenv, activate a virtualenv, pip install <package>. You should never run pip install outside of a virtualenv. (The exception is to install virtualenv and pip itself, which you almost certainly want in your system/global Python.)

Running pip install will install packages from PyPI, the Python Packaging Index by default. It's Python's official package repository.

There are a lot of old and outdated tutorials online about Python packaging. Beware of bad content. For example, if you see documentation that says use easy_install, you should be thinking, easy_install is a legacy package installer that has largely been replaced by pip, I should use pip instead. When in doubt, consult the Python packaging user guide and do what it recommends.

Don't trust the Python in your package manager

The more Python programming you do, the more you learn to not trust the Python package provided by your system / package manager.

Linux distributions such as Ubuntu that sit on the forward edge of versions are better than others. But I've run into enough problems with the OS or package manager maintained Python (especially on OS X), that I've learned to distrust them.

I use pyenv for installing and managing Python distributions from source. pyenv also installs virtualenv and pip for me, packages that I believe should be in all Python installs by default. As a more experienced Python programmer, I find pyenv just works.

If you are just a beginner with Python, it is probably safe to ignore this section. Just know that as soon as something weird happens, start suspecting your default Python install, especially if you are on OS X. If you suspect trouble, use something like pyenv to enforce a buffer so the system can have its Python and you can have yours.

Recovering from the past

Now that you know the preferred way to interact with Python, you are probably thinking oh crap, I've been wrong all these years - how do I fix it?

The goal is to get a Python install somewhere that is as pristine as possible. You have two approaches here: cleaning your existing Python or creating a new Python install.

To clean your existing Python, you'll want to purge it of pretty much all packages not installed by the core Python distribution. The exception is virtualenv, pip, and setuptools - you almost certainly want those installed globally. On Homebrew, you can uninstall everything related to Python and blow away your Python directory, typically /usr/local/lib/python*. Then, brew install python. On Linux distros, this is a bit harder, especially since most Linux distros rely on Python for OS features and thus they may have installed extra packages. You could try a similar approach on Linux, but I don't think it's worth it.

Cleaning your system Python and attempting to keep it pure are ongoing tasks that are very difficult to keep up with. All it takes is one dependency to get pulled in that trashes your system Python. Therefore, I shy away from this approach.

Instead, I install and run Python from my user directory. I use pyenv. I've also heard great things about Miniconda. With either solution, you get a Python in your home directory that starts clean and pure. Even better, it is completely independent from your system Python. So if your package manager does something funky, there is a buffer. And, if things go wrong with your userland Python install, you can always nuke it without fear of breaking something in system land. This seems to be the best of both worlds.

Please note that installing packages in the system Python shouldn't be evil. When you create virtualenvs, you can - and should - tell virtualenv to not use the system site-packages (i.e. don't use non-core packages from the system installation). This is the default behavior in virtualenv. It should provide an adequate buffer. But from my experience, things still manage to bleed through. My userland Python install is extra safety. If something wrong happens, I can only blame myself.

Conclusion

Python's long and complicated history of package management makes it very easy for you to shoot yourself in the foot. The long list of outdated tutorials on The Internet make this a near certainty for casual Python users. Using the guidelines in this post, you can adhere to best practices that will cut down on surprises and rage and keep your Python running smoothly.

Read and Post Comments

Aggregating Version Control Info at Mozilla

January 21, 2014 at 10:50 AM | categories: Git, Mercurial, Mozilla, Python | View Comments

Over the winter break, I set out on an ambitious project to create a service to help developers and others manage the flury of patches going into Firefox. While the project is far from complete, I'm ready to unleash the first part of the project upon the world.

If you point your browsers to moztree.gregoryszorc.com, you'll hopefully see some documentation about what I've built. Source code is available and free, of course. Patches very welcome.

Essentially, I built a centralized indexing service for version control repositories with Mozilla's extra metadata thrown in. I tell it what repositories to mirror, and it clones everything, fetches data such as the pushlog and Git SHA-1 mappings, and stores everything in a central database. It then exposes this aggregated data through world-readable web services.

Currently, I have the service indexing the popular project branches for Firefox (central, aurora, beta, release, esr, b2g, inbound, fx-team, try, etc). You can view the full list via the web service. As a bonus, I'm also serving these repositories via hg.gregoryszorc.com. My server appears to be significantly faster than hg.mozilla.org. If you want to use it for your daily needs, go for it. I make no SLA guarantees, however.

I'm also using this service as an opportunity to experiment with alternate forms of Mercurial hosting. I have mirrors of mozilla-central and the try repository with generaldelta and lz4 compression enabled. I may blog about what those are eventually. The teaser is that they can make Mercurial perform a lot faster under some conditions. I'm also using ZFS under the hood to manage repositories. Each repository is a ZFS filesystem. This means I can create repository copies on the server (user repositories anyone?) at a nearly free cost. Contrast this to the traditional method of full clones, which take lots of time, memory, CPU, and storage.

Anyway, some things you can do with the existing web service:

  • Obtain metadata about Mercurial changesets. Example.
  • Look up metadata about Git commits. Example.
  • Obtain a SPORE descriptor describing the web service endpoints. This allows you to auto-generate clients from descriptors. Yay!

Obviously, that's not a lot. But adding new endpoints is relatively straightforward. See the source. It's literally as easy as defining a URL mapping and writing a database query.

The performance is also not the best. I just haven't put in the effort to tune things yet. All of the querying hits the database, not Mercurial. So, making things faster should merely be a matter of database and hosting optimization. Patches welcome!

Some ideas that I haven't had time to implement yet:

  • Return changests in a specific repository
  • Return recently pushed changesets
  • Return pushes for a given user
  • Return commits for a given author
  • Return commits referencing a given bug
  • Obtain TBPL URLs for pushes with changeset
  • Integrate bugzilla metadata

Once those are in place, I foresee this service powering a number of dashboards. Patches welcome.

Again, this service is only the tip of what's possible. There's a lot that could be built on this service. I have ideas. Others have ideas.

The project includes a Vagrant file and Puppet manifests for provisioning the server. It's a one-liner to get a development environment up and running. It should be really easy to contribute to this project. Patches welcome.

Read and Post Comments

Why do Projects Support old Python Releases

January 08, 2014 at 05:00 PM | categories: Python, Mozilla | View Comments

I see a number of open source projects supporting old versions of Python. Mercurial supports 2.4, for example. I have to ask: why do projects continue to support old Python releases?

Consider:

  • Python 2.4 was last released on December 19, 2008 and there will be no more releases of Python 2.4.
  • Python 2.5 was last released on May 26, 2011 and there will be no more releases of Python 2.5.
  • Python 2.6 was last released on October 29, 2013 and there will be no more releases of Python 2.6.
  • Everything before Python 2.7 is end-of-lifed
  • Python 2.7 continues to see periodic releases, but mostly for bug fixes.
  • Practically all of the work on CPython is happening in the 3.3 and 3.4 branches. Other implementations continue to support 2.7.
  • Python 2.7 has been available since July 2010
  • Python 2.7 provides some very compelling language features over earlier releases that developers want to use
  • It's much easier to write dual compatible 2/3 Python when 2.7 is the only 2.x release considered.
  • Python 2.7 can be installed in userland relatively easily (see projects like pyenv).

Given these facts, I'm not sure why projects insist on supporting old and end-of-lifed Python releases.

I think maintainers of Python projects should seriously consider dropping support for Python 2.6 and below. Are there really that many people on systems that don't have Python 2.7 easily available? Why are we Python developers inflicting so much pain on ourselves to support antiquated Python releases?

As a data point, I successfully transitioned Firefox's build system from requiring Python 2.5+ to 2.7.3+ and it was relatively pain free. Sure, a few people complained. But as far as I know, not very many new developers are coming in and complaining about the requirement. If we can do it with a few thousand developers, I'm guessing your project can as well.

Update 2014-01-09 16:05:00 PST: This post is being discussed on Slashdot. A lot of the comments talk about Python 3. Python 3 is its own set of considerations. The intended focus of this post is strictly about dropping support for Python 2.6 and below. Python 3 is related in that porting Python 2.x to Python 3 is much easier the higher the Python 2.x version. This especially holds true when you want to write Python that works simultaneously in both 2.x and 3.x.

Read and Post Comments

Python Package Providing Clients for Mozilla Services

January 06, 2014 at 10:45 AM | categories: Python, Mozilla | View Comments

I have a number of Python projects and tools that interact with various Mozilla services. I had authored clients for all these services as standalone Python modules so they could be reused across projects.

I have consolidated all these Python modules into a unified source control repository and have made the project available on PyPI. You can install it by running:

$ pip install mozautomation

Currently included in the Python package are:

  • A client for treestatus.mozilla.org
  • Module for extracting cookies from Firefox profiles (useful for programmatically getting Bugzilla auth credentials).
  • A client for reading and interpretting the JSON dumps of automation jobs
  • An interface to a SQLite database to manage associations between Mercurial changesets, bugs, and pushes.
  • Rudimentary parsing of commit messages to extract bugs and reviewers.
  • A client to obtain information about Firefox releases via the releases API
  • A module defining common Firefox source repositories, aliases, logical groups (e.g. twigs and integration trees), and APIs for fetching pushlog data.
  • A client for the self serve API

Documentation and testing is currently sparse. Things aren't up to my regular high quality standard. But something is better than nothing.

If you are interested in contributing, drop me a line or send pull requests my way!

Read and Post Comments

Python Bindings Updates in Clang 3.1

May 14, 2012 at 12:05 AM | categories: Python, Clang, compilers | View Comments

Clang 3.1 is scheduled to be released any hour now. And, I'm proud to say that I've contributed to it! Specifically, I've contributed improvements to the Python bindings, which are an interface to libclang, the C interface to Clang.

Since 3.1 is being released today, I wanted to share some of the new features in this release. An exhaustive list of newly supported APIs is available in the release notes.

Diagnostic Metadata

Diagnostics are how Clang represents warnings and errors during compilation. The Python bindings now allow you to get at more metadata. Of particilar interest is Diagnostic.option. This property allows you to see the compiler flag that triggered the diagnostic. Or, you could query Diagnostic.disable_option for the compiler flag that would silence this diagnostic.

These might be useful if you are analyzing diagnostics produced by the compiler. For example, you could parse source code using the Python bindings and collect aggregate information on all the diagnostics encountered.

Here is an example:

from clang.cindex import Index

index = Index.create()
tu = index.parse('hello.c')

for diag in tu.diagnostics:
    print diag.severity
    print diag.location
    print diag.spelling
    print diag.option

Or, if you are using the Python bindings from trunk:

from clang.cindex import TranslationUnit

tu = TranslationUnit.from_source('hello.c')
...

Sadly, the patch that enabled this simpler usage did not make the 3.1 branch.

Finding Entities from Source Location

Two new APIs, SourceLocation.from_position and Cursor.from_location, allow you to easily extract a cursor in the AST from any arbitrary point in a file.

Say you want to find the element in the AST that occupies column 6 of line #10 in the file foo.c:

from clang.cindex import Cursor
from clang.cindex import Index
from clang.cindex import SourceLocation

index = Index.create()
tu = index.parse('foo.c')

f = File.from_name(tu, 'foo.c')

location = SourceLocation.from_position(tu, f, 10, 6)
cursor = Cursor.from_location(tu, location)

Of course, you could do this by iterating over cursors in the AST until one with the desired source range is found. But, that would involve more API calls.

I would like to say that these APIs feel klunky to me. There is lots of redundancy in there. In my opinion, there should just be a TranslationUnit.get_cursor(file='foo.c', line=10, column=6) that does the right thing. Maybe that will make it into a future release. Maybe it won't. After all, the Python bindings are really a thin wrapper around the C API and an argument can be made that there should be minimal extra logic and complexity in the Python bindings. Time will tell.

Type Metadata

It is now possible to access more metadata on Type instances. For example, you can:

  • See what the elements of an array are using Type.get_array_element_type
  • See how many elements are in a static array using Type.get_array_element_count
  • Determine if a function is variadic using Type.is_function_variadic
  • Inspect the Types of function arguments using Type.argument_types

In this example, I will show how to iterate over all the functions declared in a file and to inspect their arguments.

from clang.cindex import CursorKind
from clang.cindex import Index
from clang.cindex import TypeKind

index = Index.create()
tu = index.parse('hello.c')

for cursor in tu.cursor.get_children():
    # Ignore AST elements not from the main source file (e.g.
    # from included files).
    if not cursor.location.file or cursor.location.file.name != 'hello.c':
        continue

    # Ignore AST elements not a function declaration.
    if cursor.kind != CursorKind.FUNCTION_DECL:
        continue

    # Obtain the return Type for this function.
    result_type = cursor.type.get_result()

    print 'Function: %s' % cursor.spelling
    print '  Return type: %s' % result_type.kind.spelling
    print '  Arguments:'

    # Function has no arguments.
    if cursor.type.kind == TypeKind.FUNCTIONNOPROTO:
        print '    None'
        continue

    for arg_type in cursor.argument_types():
        print '    %s' % arg_type.kind.spelling

This example is overly simplified. A more robust solution would also inspect the Type instances to see if they are constants, check for pointers, check for variadic functions, etc.

An example application of these APIs is to build a tool which automatically generated ctypes or similar FFI bindings. Many of these tools today use custom parsers. Why invent a custom (and likely complex) parser when you can call out into Clang and have it to all the heavy lifting for you?

Future Features

As I write this, there are already a handful of Python binding features checked into Clang's SVN trunk that were made after the 3.1 branch was cut. And, I'm actively working at integrating many more.

Still to come to the Python bindings are:

  • Better memory management support (currently, not all references are kept everywhere, so it is possible for a GC to collect and dispose of objects that should be alive, even though they are not in scope).
  • Support for token API (lexer output)
  • More complete coverage of Cursor and Type APIs
  • More friendly APIs

I have a personal goal for the Python bindings to cover 100% of the functionality in libclang. My work towards that goal is captured in my python features branch on GitHub. I periodically clean up a patch, submit it for review, apply feedback, and commit. That branch is highly volatile and I do rebase. You have been warned.

Furthermore, I would like to add additional functionality to libclang [and expose it to Python]. For example, I would love for libclang to support code generation (i.e. compiling), not just parsing. This would enable all kinds of nifty scenarios (like channeling your build system's compiler calls through a proxy which siphons off metadata such as diagnostics).

Credits and Getting Involved

I'm not alone in my effort to improve Clang's Python bindings. Anders Waldenborg has landed a number of patches to add functionality and tests. He has also been actively reviewing patches and implementing official LLVM Python bindings! On the reviewing front, Manuel Klimek has been invaluable. I've lost track of how many bugs he's caught and good suggestions he's made. Tobias Grosser and Chandler Carruth have also reviewed their fair share of patches and handled community contributions.

If you are interested in contributing to the Python bindings, we could use your help! You can find me in #llvm as IndyGreg. If I'm not around, the LLVM community is generally pretty helpful, so I'm sure you'll get an answer. If you prefer email, send it to the cfe-dev list.

If you have any questions, leave them in the comments or ask using one of the methods above.

Read and Post Comments