
Raw Packet Manipulation with Scapy

Installation

Scapy is a Python-based packet manipulation tool with a number of useful features for those looking to perform raw TCP/IP requests and analysis. To get Scapy installed in your environment, the best options are either to build it from the distributed zip of the current version or to use one of the pre-built packages available for Red Hat and Debian derived Linux distributions.

Using Scapy

When getting started with Scapy, it's useful to understand how the different aspects of a connection are encapsulated in the Python syntax. Here is an example of creating a simple IP request:

Welcome to Scapy (2.2.0)
>>> a=IP(ttl=10)
>>> a
<IP  ttl=10 |>
>>> a.dst="10.1.0.1"
>>> a
<IP  ttl=10 dst=10.1.0.1 |>
>>> a.src
'10.1.0.2'
>>> a.ttl
10

In this case I created a single request pointed from one host on my network to the default gateway on the same network. Scapy gives you the ability to create any TCP/IP request in raw form. There is a huge number of options that can be applied, as well as a huge number of packet types defined. The documentation covering these options and packet types is available on the main Scapy site.
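
As a quick illustration of how layers are stacked, here is a minimal sketch (the gateway address is just a placeholder) that builds an ICMP echo request on top of an IP layer and waits for a single reply:

from scapy.all import IP, ICMP, sr1

# Stack an ICMP echo request on top of an IP layer with the "/" operator.
ping = IP(dst="10.1.0.1") / ICMP()

# Send the packet and wait up to 2 seconds for the first reply.
reply = sr1(ping, timeout=2)
if reply:
    reply.show()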

Creating custom scripts with Scapy

Using Scapy within Python, rather than as a standalone application, allows you to create more complex packets, send them, and then parse the responses that come back. Here is a simple tester script example in which I initiate an HTTP 1.1 request:

#! /usr/bin/env python
import logging
logging.getLogger("scapy").setLevel(1)

from scapy.all import *

def make_test(x, y):
    # x is the destination address, y is the Host header value.
    # Terminate the header block with a blank line so the request is complete.
    request = "GET / HTTP/1.1\r\nHost: " + y + "\r\n\r\n"
    # TCP() defaults to dport=80, so the request goes to the HTTP port.
    p = IP(dst=x)/TCP()/request
    # sr1() sends the packet and returns the first answer received.
    out = sr1(p)
    if out:
        out.show()

if __name__ == "__main__":
    interact(mydict=globals(), mybanner="Scapy HTTP Tester")

Within this script is the make_test function, which takes the destination address and Host header string as parameters. The script attempts to send the HTTP GET request to that address with the proper Host header set. If the request is successful, it prints out the details of the response packet. It would also be possible to perform more complex analysis of this response packet using the built-in psdump and pdfdump functions, which create a human-readable analysis of the packet in PostScript and PDF respectively.

Welcome to Scapy (2.2.0)
Scapy HTTP Tester
>>> make_test("www.google.com","google.com")
Begin emission:
...Finished to send 1 packets.
.*
Received 5 packets, got 1 answers, remaining 0 packets
###[ IP ]###
  version= 4L
  ihl= 5L
  tos= 0x20
  len= 56
  id= 64670
  flags=
  frag= 0L
  ttl= 42
  proto= tcp
  chksum= 0x231b
  src= 74.125.28.103
  dst= 10.1.0.2
  \options\
###[ TCP ]###
     sport= http
     dport= ftp_data
     seq= 1130043850
     ack= 1
     dataofs= 9L
     reserved= 0L
     flags= SA
     window= 42900
     chksum= 0x8c7e
     urgptr= 0
     options= [('MSS', 1430), (254, '\xf9\x89\xce\x04bm\x13\xd3)\xc8')]
>>>
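
For instance, back in the same interactive session, a PDF report of a response packet could be generated with something like the following (the variable name and output filename are only illustrative, and pdfdump relies on the PyX library being installed):

>>> response = sr1(IP(dst="www.google.com")/TCP()/"GET / HTTP/1.1\r\nHost: google.com\r\n\r\n")
>>> response.pdfdump("response.pdf")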

Conclusions

Scapy is a powerful tool, if a bit daunting in syntax initially. Creating raw TCP/IP packets systematically will probably challenge most people's understanding of the TCP/IP stack (it certainly did mine!) but exposing this level of configuration has serious advantages. Full control of the requests and responses, as well as the ability to add custom Python logic, allows Scapy to become a packet foundry which you can use for things like unit testing of web applications, verification of the state of an unknown network, and so on. I will definitely be using Scapy in the future when performing raw HTTP testing of web applications.

RailsConf 2015 for the non-Attendee

This blog post is really for myself. Because I had the unique experience of bringing a baby to a conference, I made an extra effort to talk to other attendees about what sessions shouldn't be missed. Here are the top takeaways from the conference that I recommend (in no particular order):

Right now, the videos are all unedited from the Confreaks live stream of the keynote/main room, and I'll update as the remaining videos become available.

A Message From the Sponsors

My husband: My conferences never have giveaways that cool.
Me: That's because you work in academia.

You can read the full list of sponsors here, but I wanted to comment further:

Hired was the diamond sponsor this year, and they ran a ping pong tournament. The winner received $2000, and the runners-up received $1000, $500, and $250. The final match was heavily attended and competitive! Practice up for next year?

Engine Yard also put on a really fun scavenger hunt using Scavify. Since I couldn't attend the multiple parties going on at night, this was a nice way to participate and socialize.

Protect Interchange Passwords with Bcrypt

Interchange default configurations have not done a good job of keeping up with the best available password security for its user accounts. Typically, there are two account profiles associated with a standard Interchange installation: one for admin users (access table) where the password is stored using Perl's crypt() command (bad); and one for customers (userdb) where the password isn't encrypted at all (even worse). Other hashing algorithms have long been available (MD5, salted MD5, SHA1) but are not used by default and have for some time not been useful protection. Part of this is convenience (tools for retrieving passwords and ability to distribute links into user assets) and part is inertia. And no small part was the absence of a strong cryptographic option for password storage until the addition of Bcrypt to the account management module.

The challenge we face in protecting passwords is that hardware continues to advance at a rapid rate, and with more computational power and storage capacity, brute-force attacks become increasingly effective and widely available. Jeff Jarmoc's Enough with the Salts provides some excellent discussion and background on the subject. To counter the changing landscape, the main line of defense moves toward ensuring that the work required to create and test a given stored password is too expensive, too time-consuming, for brute-force attacks to be profitable.

One of the best options for handling encrypted password storage with a configurable "hardware cost" is Bcrypt. We chose to integrate Bcrypt into Interchange over other options primarily because of its long history of operation with no known exploits, and its cost parameter that allows an administrator to advance the work required to process a password slowly over time as hardware continues to increase in efficiency. The cost feature introduces an exponential increase in calculation iterations (i.e., required processing power and time) as powers of 2, from 1 (2 iterations, essentially no cost) to 31 (2^31, or 2,147,483,648 iterations). Ideally an administrator would want to identify a cost that causes no perceptible penalty to end users, but would be such a burden to any brute-force attack as to have no worthwhile return on investment to crack.
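
As a rough illustration of how the cost parameter translates into work, here is a small sketch using the Python bcrypt library (not Interchange's Perl implementation, just a convenient way to watch the timings); each increment of the cost doubles the time needed to hash, and therefore to brute-force, a password:

import time
import bcrypt

password = b"correct horse battery staple"

for cost in (10, 12, 14):
    start = time.time()
    hashed = bcrypt.hashpw(password, bcrypt.gensalt(rounds=cost))
    print("cost=%d took %.2f seconds" % (cost, time.time() - start))

# Verifying a login attempt against the stored hash:
assert bcrypt.checkpw(password, hashed)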

Converting an existing user base from any of the existing encryption schemes to Bcrypt is trivial in Interchange. The existing UserDB profile is changed to the "bcrypt" option and the "promote" boolean set to true. Promote allows your users to continue to validate against their existing stored password, but on their next access it will upgrade their storage to the Bcrypt password. In the meantime, a backend process could be developed using the construct_bcrypt() routine in Vend::UserDB to update all outstanding accounts before they are upgraded organically.

If the switch on the front end involves going from no encryption to any encrypted storage, including Bcrypt, and your front end uses the default tools for retrieving lost passwords, you'll also need to construct some new code for resetting passwords instead. There is no such facility for the admin, and since the admin accounts are typically far more valuable than the front end accounts, making the change for the admin should be the first priority and should require the least effort.

Switching accounts to Bcrypt password storage is a simple, effective means of increasing protection of your users' and your business's information. Every bit as importantly, it also helps protect your business's reputation, which can be severely damaged by a data breach. Lastly, in particular for your admin accounts, Bcrypt password storage is useful in meeting PCI DSS requirements for strong password hashing.

How to Bring a Baby to a Tech Conference

Last week, I brought my 4 month old to RailsConf. In a game-day decision, rather than drag a two-year-old and my husband along on the ~5 hour drive and send the dogs to boarding, we decided it would ultimately be easier on everyone (except maybe me) if I attended the conference with the baby, especially since a good amount of the conference would be live-streamed.


Daily morning photos at the conference.

While I was there, I was asked often how it was bringing a baby to a conference, so I decided to write a blog post. As with all parenting advice, the circumstances are a strong factor in how the experience turned out. RailsConf is a casual three-day multi-track tech conference with many breaks and social events – it's as much about socialization as it is about technical know-how. This is not my first baby and not my first time at RailsConf, so I had some idea of what I might be getting into. Minus a few minor squeaks, baby Skardal was sleeping or sitting happily throughout the conference.

Here's what I [qualitatively] perceived to be the reaction of others attending the conference to baby Skardal:

In list form:

  • Didn't Notice: Probably about 50% didn't notice I had a baby, especially when she was sleeping soundly in the carrier.
  • Stares: Around 50% may have stared. See below for more.
  • Joke: A really small percentage of people made some variation of the joke "Starting her early, eh?"
  • Conversation: An equally small percentage of people started a conversation with me, which often led to more technical talk.

Here are some reasons I imagined behind the staring:

  • That baby is very cute (I'm not biased at all!)
  • Shock
  • Wonder if day care is provided (No, it wasn't. But with a 4 month old who hasn't been in day care yet, I probably wouldn't have used it.)
  • Too hungover to politely not stare

Pros & Cons

And here is how I felt after the conference, regarding pros and cons on bringing the baby:

Pros
  • A baby is a good conversation starter, which is beneficial in a generally introverted crowd.
  • I was reminded that there are helpful, nice people in the world; plenty of them offered to help.
  • The baby was happiest staying with me.
Cons
  • Because children were the focus of many conversations, I missed out on talking shop a bit.
  • It's tiring, but in the same way that all parenting is.
  • I couldn't participate in all of the social/evening activities of the conference.
  • Staring generally makes me feel uncomfortable.

Tips

And finally, some tips:

  • Plan ahead:
    • Review the sessions in advance and pick out ones you want to attend because you may not have time to do that on the fly.
    • Walk (or travel) the route from your hotel to the conference so you know how long it will take and if there will be challenges.
  • Be agile and adapt. Most parents are already probably doing this with a 4 month old.
  • Manage your expectations:
    • Expect the conference with a baby to be tiring & challenging at times.
    • Expect stares.
    • Expect you won't make it to every session you want, so make a point of talking to others to find out their favorite sessions.
  • If not provided, ask conference organizers for access to a nursing or stashing room.
  • Bring baby gear options: carrier, stroller, bouncy seat, etc.
  • Research food delivery options ahead of time.
  • Order foods that are easy to eat with one hand. Again, another skill parents of a 4 month old may have developed.
  • Sit or stand in the back of talks.

While in these circumstances I think we made the right decision, I look forward to attending more conferences sans-children.

RailsConf 2015 - Atlanta: Day Three

Today, RailsConf concluded here in Atlanta. The day started with the reveal of this year's Ruby Heroes, followed by a Rails Core panel. Watch the video here.

On Trailblazer

One interesting talk I attended was "See You on The Trail" by Nick Sutterer, sponsored by Engine Yard, in which he introduced Trailblazer. Trailblazer is an abstraction layer on top of Rails that introduces a few additional layers that build on the MVC convention. I appreciated several of the arguments he made during the talk:

  • MVC is a simple level of abstraction that allows developers to get up and running efficiently. The problem is that everything goes into those three buckets, and as the application gets more complex, the simplified structure of MVC doesn't answer how to organize logic like authorization and validation.
  • Nick made the argument that DHH is wrong when he says that microservices are the answer to troublesome monolithic apps. Nick's answer is a more structured, organized OO application.
  • Rails devs often say "Rails is simple", but Nick made the argument that Rails is easy (subjective) but not simple (objective). While Rails follows convention with the idea that transitioning between developers on a project should be easy if conventions have been followed, in actuality there is still so much room for interpretation in how and where to organize business logic for a complex Rails application that transitions between developers are less straightforward and not simple.
  • Complex Rails tends to include fat models (as opposed to fat controllers), and views with [not-so-helpful] helpers and excessive rendering logic.
  • Rails doesn't introduce convention on where dispatch, authorization, validation, business logic, and rendering logic should live.
  • Trailblazer, an open source framework, introduces a new abstraction layer to introduce conventions for some of these steps. It includes Cells to encapsulate the OO approach in views, and Operations to deserialize and validate without touching the model.

There was a Trailblazer demo during the talk, but as I mentioned above, the takeaway for me is that, rather than the specific technical implementation, this buzzworthy debate over microservices is really about good code organization and conventions for increasingly complex applications, conventions that encourage readability and maintainability on the development side.

I went to a handful of other decent talks today and will include a summary of my RailsConf experience sharing links to popular talks on this blog.

RailsConf 2015 - Atlanta: Day Two

It's day 2 of RailsConf 2015 in Atlanta! I made it through day 1!

The day started with Aaron Patterson's keynote (watch it here). He covered features he's been working on including auto parallel testing, cache compiled views, integration test performance, and "soup to nuts" performance. Aaron is always good at starting his talk with self-deprecation and humor followed by sharing his extensive performance work supported by lots of numbers.

On Hiring

One talk I attended today was "Why We're Bad At Hiring (And How To Fix It)" by @kerrizor of Living Social (slides here, video here). I was originally planning on attending a different talk, but a fellow conference attendee suggested this one. A few gems (not Ruby gems) from this talk were:

  • Imagine your company as a small terrarium. If you are a very small team, hiring one person can drastically affect the environment, whereas at a larger company a single hire is less influential. I liked this analogy.
  • Stay away from monocultures (e.g. the banana monoculture) and avoid hiring employees just like you.
  • Understand how your hiring process may bias you to reject specific candidates. For example, requiring a GitHub account may bias you against applicants working at a company that can't share code (e.g. where security clearance is required). Another example: requiring open source contributions may bias you against candidates with very little free time outside of their current job.
  • The interview process should be well organized and well communicated. Organization and communication demonstrate confidence in the hiring process.
  • Hypothetical scenarios or questions are not a good idea. I've been a believer in this since reading some of Malcolm Gladwell's books, where he discusses how circumstances are such a strong influence on behavior.
  • Actionable examples that are better than hypothetical scenarios include:
    1. ask an applicant to plan an app out (e.g. let's plan out strategy for an app that does x)
    2. ask an applicant to pair program with a senior developer
    3. ask the applicant to give a lightning talk or short presentation to demonstrate communication skills
  • After a rejected interview, think about what specifically might change your mind about the candidate.
  • Also communicate the reapplication process.
  • Improve your process by measuring, with the goal of preventing false negatives. One actionable item here is to keep tabs on people – are there any developers you didn't hire that went on to become very successful, and what did you miss?
  • Read this book.

Interview practices that Kerri doesn't recommend include looking at GPA/SAT/ACT scores, requiring a pull request to apply, speed interviews, giving puzzle questions, whiteboard coding, and FizzBuzz.

While I'm not extremely involved in the hiring processes at End Point, I am interested in the topic of growing talent within a company. The notes specifically related to identifying your own hiring biases were compelling.

Testing

I also attended a few talks on testing. My favorite little gem from one of these talks was the idea that when writing tests, one should try to balance readability, maintainability, and performance:

Eduardo Gutierrez gave a talk on Capybara in which he went through explicit examples of balancing maintainability, readability, and performance. I'll update this post to include links to these talks as they become available; here are the videos & slides so far:

RailsConf 2015 - Atlanta: Day One

I'm here in Atlanta for my sixth RailsConf! RailsConf has always been a conference I enjoy attending because it includes a wide spectrum of talks and people. The days are long, but rich with nitty gritty tech details, socializing, and broader topics such as the social aspects of coding. Today's keynote started with DHH discussing the shift towards microservices to support different types of integrated systems, and then transitioned to cover code snippets of what's to come in Rails 5, to be released this year. Watch the keynote here.

Open Source & Being a Hero

One of the talks I was really looking forward to attending was "Don't Be a Hero - Sustainable Open Source Dev" by Lillie Chilen (slides here, video here), because of my involvement in open source (with Piggybak, RailsAdminImport, Annotator, and Spree, another Ruby on Rails ecommerce framework). In the case of RailsAdminImport, I found a need for a plugin to RailsAdmin, developed it for a client, and then released it into the open source world with no plans to maintain a community. I've watched as it's been forked by a handful of users who needed the same functionality, but I most recently gave another developer commit & Rubygems access, since I have historically been a horrible maintainer of the project. I can leverage some of Lillie's advice to help build a group of contributors & maintainers for this project, since it's not something I plan to put a ton of time into.

With Piggybak, while I haven't developed any new features for it in a while, I released it into the open source world with the intention of spending time maintaining a community after being involved in the Spree community. Piggybak was most recently upgraded to Rails 4.2.

Lillie's talk covered actionable items you can do if you find yourself in a hero role in an open source project. She explained that while there are some cool things about being a hero, or a single maintainer on a project, ultimately you are also the single point of failure of the project and your users are in trouble if you get eaten by a dinosaur (or get hit by a bus).

Here are some of these actionable items for recovering from hero status:

  1. Start with the documentation on how to get the app running, how to run tests, and how to contribute. Include comments on your workflow, such as if you like squashed commits or how documentation should look.
  2. Write down everything you do as a project maintainer and try to delegate. You might not realize all the little things you do for the project until you write them down.
  3. Try to respond quickly to requests for code reviews (or pull requests). Lillie referenced a study that mentioned if a potential contributor receives a code review within 48 hours, they are much more likely to come back, but if they don't hear back within 7 days, there is little chance they will continue to be involved.
  4. Recruit collaborators through targeted outreach. There will be a different audience of collaborators depending on whether your open source tool is an app or a library.
  5. Manage your own expectations for contributors. Understand the motivations of contributors and try to figure out ways to encourage specific deliverables.
  6. Have regular retrospectives to analyze what's working and what's not, and encourage introspection.

While Lillie also covered several things that you can do as a contributor, I liked the focus on actionable tasks here for owners of projects. The ultimate goal should be to find other collaborators, grow a team, and figure out what you can do to encourage people to progress in the funnel and transition from user to bug fixer to contributor to maintainer. I can certainly relate to being the single maintainer on an open source project (acting as a silo), with no clear plan as to how to grow the community.

Other Hot Topics

A couple of other hot topics that came up in a few talks were microservices and Docker. I find there are hot topics like this at every RailsConf, so if the trend continues, I'll dig deeper into these topics.

What Did I Miss?

I always like to ask what talks people found memorable throughout the day in case I want to look back at them later. Below are a few from today. I'd like to revisit these later & I'll update to include the slides when I find them.

The `name' attribute is required in cookbook metadata: Solving a Vagrant/Chef Provisioning Issue

When Vagrant/Chef Provisioning Goes South

I recently ran into the following error when provisioning a new vagrant machine via the `vagrant up` command:

[2015-04-21T17:10:35+00:00] FATAL: Stacktrace dumped to /var/chef/cache/chef-stacktrace.out
[2015-04-21T17:10:35+00:00] ERROR: Cookbook loaded at path(s) [/tmp/vagrant-chef/path/to/my-cookbook] has invalid metadata: The `name' attribute is required in cookbook metadata
[2015-04-21T17:10:35+00:00] FATAL: Chef::Exceptions::ChildConvergeError: Chef run process exited unsuccessfully (exit code 1)
  

After some googling and digging I learned that version 12 of chef-client introduced a breaking change: from version 12 on, every cookbook requires a name attribute in its metadata.rb file. A quick grep through the metadata.rb files in the project revealed several did not include name attributes. You would be correct at this point to suggest I could have added name attributes to the cookbook metadata files and been done with this. However, in this case I was a consumer of the cookbooks and was not involved in maintaining them, so an alternate solution was preferable.

Selecting a Specific Version of Chef in Vagrant

My idea for a solution was to install the most recent chef-client release prior to version 12. I was not sure how to do this initially, but along the way I learned that by default Vagrant will install the most recent release of chef-client. The Vagrant documentation for Chef provisioners described what I needed to do: the Chef version can be specified in the config.vm.provision block in the Vagrantfile:

config.vm.provision :chef_solo do |chef|
  chef.version = "11.18"
  chef.cookbooks_path = "cookbooks"
  chef.data_bags_path = "data_bags"

  # List of recipes to run
  chef.add_recipe "vagrant_main::my_project"
end

With this configuration change, chef-client 11.18 completed the provisioning step successfully.

Handling databases in dev environments for web development

One of the biggest problems for web development environments is copying large amounts of data. Every time a new environment is needed, all that data needs to be copied. Source code should be tracked in version control software, and so copying it should be a simple matter of checking it out from the code repository. So that is usually not the problem. The main problem area is database data. This can be very large, take a long time to copy, and can impact the computers involved in the copy (usually the destination computer gets hammered with IO which makes load go high).

Often databases for development environments are created by copying a database dump from production and then importing that database dump. And since database dumps are text, they can be highly compressed, which can result in a relatively small file to copy over. But the import of the dump can still take lots of time and cause high load on the dev computer as it rebuilds tables and indexes. As long as your data is relatively small, this process may be perfectly acceptable.

Your database WILL get bigger

At some point though your database will get so big that this process will take too long and cause too much load to be acceptable.

To address the problem you can try to reduce the amount of data involved by dumping only a portion of the database instead of all of it, or possibly by using some "dummy sample data" instead. These techniques may work if you don't care that development environments no longer have the same data as production. However, one serious problem with this is that a bug or behavior found in production can't be replicated in a development environment because the data involved isn't the same. For example, say a customer can't check out on the live site, but you can't replicate the bug in your development environment in order to fix it. The root cause of the problem could be a bug in the code handling certain products that are out of stock, and since the dev database doesn't have the same data, finding and fixing these types of problems becomes a lot harder.

Snapshots

Another option is to use file system snapshots, like LVM snapshots, to quickly make clones of the database without needing to import the database dump each time. This works great if development environments live on the same server, or at least the development databases live on the same server. You would need to create a volume to hold a single copy of the database; this copy would be the origin for all snapshots. Then for each development environment, you could snapshot the origin volume, mount it read-write in a place accessible by the developer, customize the database configuration (like setting a unique port number to listen on), and then start up the database. This then provides a clone of the entire database in a tiny fraction of the time and uses less disk space and other system resources too.
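
Here is a hedged sketch of that per-developer clone workflow, scripted in Python. The volume group, volume names, mount points, and the use of PostgreSQL are all assumptions for illustration; adapt them to your own layout.

#!/usr/bin/env python
import subprocess

VG = "vg0"
ORIGIN = "dbdata"   # the volume holding the single origin copy of the database

def run(cmd):
    print("+ " + " ".join(cmd))
    subprocess.check_call(cmd)

def create_dev_clone(name, port):
    snap = "dev_%s" % name
    mountpoint = "/srv/devdb/%s" % name

    # 1. Snapshot the origin volume (copy-on-write, so this is fast).
    run(["lvcreate", "--snapshot", "--name", snap,
         "--size", "10G", "/dev/%s/%s" % (VG, ORIGIN)])

    # 2. Mount it read-write where the developer's database can use it.
    run(["mkdir", "-p", mountpoint])
    run(["mount", "/dev/%s/%s" % (VG, snap), mountpoint])

    # 3. Customize the clone's configuration (unique port) and start it up.
    #    This assumes the data directory sits at the root of the volume.
    with open("%s/postgresql.conf" % mountpoint, "a") as conf:
        conf.write("\nport = %d\n" % port)
    run(["pg_ctl", "-D", mountpoint, "start"])

if __name__ == "__main__":
    create_dev_clone("alice", 6001)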

In using snapshots there are some things you'll need to be careful about. Snapshots are usually created using copy-on-write tables. The more snapshots mounted read-write, the more IO overhead is involved for the volumes involved. For this reason it is important that writes to the origin volume be avoided as much as possible while the snapshots are open. Also, snapshots that get a lot of writes can fill up their copy-on-write table, and depending on the file system and database that you are using this can be a big problem. So it is important to monitor each open snapshot for how full it is and increase its size if needed so it doesn't fill up. Updating the origin database will require shutting down and removing all snapshots first, then updating the origin database, then creating and mounting all the snapshots again. This is because all the copy-on-write tables would fill up if you tried to update the origin while the snapshots are open.

Using snapshots like this may sound more complicated, and it is, but the processes involved can be scripted and automated and the benefits can be pretty significant if you have several developers and a lot of data to copy.

Nvidia: Invalid or Corrupted Push Buffer Stream

As a high-performance video rendering appliance, the Liquid Galaxy requires really good video cards -- better than your typical on-board integrated video cards. Despite ongoing attempts by competitors to displace them, Nvidia remains the best choice for high-end video, if you use the proprietary Nvidia driver for Linux.

In addition to providing regular security and system updates, End Point typically provides advanced remote monitoring of our customers' systems for issues such as unanticipated application behavior, driver issues, and hardware errors. One particularly persistent issue presents as an error with an Nvidia kernel module.  Unfortunately, relying on proprietary Nvidia drivers so as to maintain an acceptable performance level limits the available diagnostic information and options for resolution.

The issue presents when the system ceases all video output functions as Xorg crashes. The kernel log contains the following error message:

2015-04-14T19:59:00.000083+00:00 lg2 kernel: [  719.850677] NVRM: Xid (0000:01:00): 32, Channel ID 00000003 intr 02000000

The message is repeated approximately 11000 times every second until the disk fills and the ability to log in to the system is lost. The only known resolution at this time is to power-cycle the affected machine. In the error state, the module cannot be removed from the kernel, which also prevents Linux from shutting down properly. All affected systems were running some version of Ubuntu x86-64. The issue seems to be independent of driver version, but is at least present in 343.36 and 340.65, and affects all Geforce cards. Quadro cards seem unaffected.

The Xid message in the kernel log contains an error code that provides a little more information. The Nvidia docs list the error as "Invalid or corrupted push buffer stream". Possible causes listed include driver error, system memory corruption, bus error, thermal error, or frame buffer error. All affected systems were equipped with ECC RAM and were within normal operating temperature range when the issue presented.

Dealing with bugs like these can be arduous, but until they can be fixed, we cope by monitoring and responding to problems as quickly as possible.
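
Even a very small log watchdog helps here, catching the condition before the flood of messages fills the disk. Below is a hedged sketch of the idea; the log path and the alert action are placeholders, not our actual monitoring stack.

#!/usr/bin/env python
import time

LOG = "/var/log/kern.log"
PATTERN = "NVRM: Xid"

def alert(line):
    # Placeholder: in practice this would page an on-call admin.
    print("Nvidia Xid error detected: %s" % line.strip())

def watch(path):
    with open(path) as f:
        f.seek(0, 2)               # start at the end of the file, like `tail -f`
        while True:
            line = f.readline()
            if not line:
                time.sleep(1)
                continue
            if PATTERN in line:
                alert(line)
                return             # one alert is enough; the message floods the log

if __name__ == "__main__":
    watch(LOG)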

Joe Mastey at Mountain West Ruby Conference 2015

A conversation with a co-worker today about the value of improving one's professional skills reminded me of Joe Mastey's talk he gave at the 2015 Mountain West Ruby Conference. That then reminded me that I had never finished my write up on that conference. Blogger won't let me install harp music and an animated soft focus flashback overlay, so please just imagine it's the day after the conference when you're reading this. "That reminds me of the time..."

I've just finished my second MWRC and I have to give this one the same 5-star rating I gave last year's. There were a few small sound glitches here and there, but overall the conference was well-run, inclusive, and packed with great speakers and interesting topics. Rather than summarizing each talk, I want to dig into the one most relevant to my interests: "Building a Culture of Learning" by Joe Mastey.

I was excited to catch Joe's talk because learning and teaching have always been very interesting to me, regardless of the particular discipline. I find it incredibly satisfying to improve upon my own learning skills, as well as to improve my teaching skills by teasing out how different individuals learn best and then speaking to that. There's magic in that one-on-one interaction when everything comes together just right. I just really dig that.

Joe's work as the Manager of Internal Learning at Enova has gone way beyond the subtleties of the one-on-one. He's taken the not-so-simple acts of learning and training, and scaled them up in an environment that does not sound, on paper, like it would support it. He's created a culture of learning ("oh hey they just said the title of the movie in the movie!") in a financial company that's federally regulated, saw huge growth due to an IPO, and had very real business-driven deadlines for shipping their software.

Joe broke his adventure down into three general phases after refreshingly admitting that "YMMV" and that you can't ignore the existing corporate culture when trying to build a culture of learning within.

Phase 1 - Building Credibility

I would hazard a guess that most software development shops are perpetually at Phase 1: learning is mostly ad hoc, picked up in the course of one's daily work, and there are few if any people pushing for more formal training. People probably agree that training is important, but the mandate has not come down from the CTO, and there's "no time for training" because there's so much work to do.

How did Joe help his company evolve past Phase 1? Well, he did a lot of things that I think many devs would be happy to get just one or two of at their company. My two favorites from his list probably appeal to polar opposite personality types, but that's part of why I like them.

My first favorite is that books are all-you-can-eat. If a developer asks Joe for a tech book, he'll say yes, and he'll buy a bunch of extra copies for the office. I like having a paper book to read through to get up to speed on a topic, ideally away from my desk and the computer screen. I've also found that for some technologies, the right book can be faster and less frustrating than potentially spotty documentation online.

My second favorite is how Joe implemented "new hire buddies." Each new hire is teamed up with an experienced dev from a different team. Having a specific person to talk to, and get their perspective on company culture, can really help people integrate into the culture much more quickly. When I joined End Point in 2010, I worked through our new hire "boot camp" training like all new hires. I then had the occasionally-maddening honor of working directly with one of the more senior evil super-geniuses at End Point on a large project that I spent 100% of my time on. He became my de facto new hire buddy and I could tell that despite the disparity in our experience levels, being relatively joined at the hip with someone like that improved my ramp-up and cultural integration time greatly.

Phase 2 - Expand Reach and Create Impact

If my initial guess about Phase 1 is correct, it follows that dev shops in Phase 2 are more rare: more people are learning more, more people are driving that learning, but it's still mostly focused on new hires and the onboarding process.

Phase 2 is where Joe's successful efforts are a little more intimidating to me, especially given my slightly introverted nature. The efforts here scale up and get more people speaking publicly, both internally and externally. It starts with a more formal onboarding process, and grows to things like weekly tech talks and half-day internal workshops. Here is where I start to make my "yeah, but…" face. We all have it. It's the face you make when someone says something you don't think can work, and you start formulating your rebuttal immediately. E.g. "Yeah, but how do you get management and internal clients to be OK with ‘shutting down development for half a day' for training?" Joe does mention the danger of being perceived as "wasting too much time." You'll want to be sure you get ahead of that and communicate the value of what you're spending "all that dev time on."

Phase 3 - Shift The Culture

It would be interesting to know how many shops are truly in Phase 3 because it sounds pretty intense: learning is considered part of everyone's job, the successes from the first two phases help push the culture of learning to think and act bigger, the acts of learning and training others are part of job descriptions, and things like FOSS contributions and that golden unicorn of "20% personal project time" actually happen on company time. Joe describes the dangers or downsides of Phase 3 in a bit of a "with great power comes great responsibility" way. I've never personally worked somewhere that's in Phase 3, but it makes sense that the increased upside comes with increased (potential) downside.

At End Point, we have some elements of all three phases, but we're always looking to improve. Joe's talk at MWRC 2015 has inspired me to work on expanding our own culture of learning. I think his talk is also going to serve as a pretty good road-map on how to get to the next phase.

RailsConf 2015: Coming Soon

Next week I'm headed to RailsConf 2015 in Atlanta, my sixth RailsConf, with the whole family in tow:


The gang. Note: Dogs will not be attending conference.

This will be a new experience, since my husband will be juggling two kids while I attend the daily sessions. It makes sense to go into the conference fairly organized to aid in the kid juggling, right? So I've picked out a few sessions that I'm looking forward to attending. Here they are:

RailsConf is a multi-track conference, with tracks including Distributed Systems, Culture, Growing Talent, Testing, APIs, Front End, Crafting Code, JavaScript, and Data & Analytics. There are also Beginner and Lab tracks, which might be suitable to those looking for a learning & training oriented experience. As you might be able to tell, the sessions I'm interested in cover a mix of performance, open source, and front-end dev. As I've become a more experienced Rails developer, RailsConf has been more about seeing what's going on in the Rails community and what the future holds, and less about the technical nitty-gritty or training sessions.

Stay tuned for a handful of blog posts from the conference!

RubyConf India 2015

The 6th edition of RubyConf India was held in Goa (in my opinion, one of the most amazing places in India). The talks covered various topics, mainly related to Ruby in general and Ruby on Rails.

Aaron Patterson (a core member of the Ruby and Rails teams) gave a very interesting talk about pair programming, benchmarking integration tests versus controller tests, and precompiling views to increase speed in Rails 5.

Christophe Philemotte presented a wonderful topic, "Diving in the unknown depths of a project", drawing on his experience contributing to the Rails project. He mentioned that 85% of a developer's time is spent reading code and only 15% is spent writing it. So he explained a work process for making effective use of a developer's time, one that should adapt well to any kind of development. Here is the list of steps he explained:

  1. Goal (ex: bug fixing, implement new feature, etc… )
  2. Map (ex: code repository, documentation, readme, etc…)
  3. Equipment (ex: Editor, IDE) and Dive (read, write, run and use)
  4. Next Task

Rajeev from ThoughtWorks talked about "Imperative vs Functional programming" and interesting concepts in Haskell which can be implemented in Ruby, such as function composition, lazy evaluation, thunks, higher order functions, currying, pure functions and immutable objects.

Aakash from C42 Engineering talked about an interesting piece of future web components called "Shadow DOM", which has interoperability, encapsulation, and portability features built in. He also mentioned Polymer as a project for building custom elements on top of Shadow DOM.

Vipul and Prathamesh from BigBinary showed an experimental project, "Building an ORM with AReL", called Torm (Tiny Object Relation Mapping), which aims to give more control over the ORM process in Rails.

Smit Shah from Flipkart gave a talk on "Resilient by Design", which covered design patterns such as: 1. bounding (changing the default timeouts), 2. circuit breakers, and 3. failing fast.

Christopher Rigor shared some awesome information about "Cryptography for Rails Developers". He explained concepts like public key cryptography, symmetric cryptography, and SSL/TLS versions. He recommended we all use TLS 1.2 and AES-GCM in production to keep applications more secure.

Eleanor McHugh, a British hacker, gave an important talk on "Privacy is always a requirement". The gist of the talk was to keep security tight by encrypting all transports, encrypting all passwords, providing two-factor authentication, encrypting all storage, and anchoring trust internally. Data won't be kept safe by privacy or trust or contract in the broken internet world.

Laurent Sansonetti, who works on RubyMotion, gave a demo of a Flappy Bird-style game he created using RubyMotion, with a code walkthrough. RubyMotion is used to develop native mobile applications for the iOS, OS X, and Android platforms. It provides access to the Objective-C and Android APIs, and the whole build process is Rake-based.

Shadab Ahmed gave a wonderful demo of the 'aggrobot' gem, used to perform easy aggregations. aggrobot runs on top of ActiveRecord and also works directly with the database to provide good performance on query results.

Bryan Helmkamp, founder and CEO of Code Climate, spoke about "Rigorous Deployment" using a few wonderful tools. The 'rollout' gem is extremely helpful for deploying to specific users, deploying specific branches, and so on. With it there is less need for a staging environment, and it helps avoid the 'it works in staging, not production' problem.

Erik Michaels-Ober gave an awesome talk on "Writing Fast Ruby" (must-see slides) with information about how to tweak ordinary code to improve the performance of an application. He presented benchmark results for pairs of code versions, showing both the working logic and the execution time. As a rule of thumb, he suggested that a performance-motivated change should show at least a 12% improvement compared to the current code.

The final presenter, Terence Lee from the Ruby Security Team, gave a talk titled "Ruby & You" which summarized all the talks and gave information on the Ruby Security Team and its contribution to the Ruby community. He suggested everyone keep their Ruby version up to date to get the latest security patches and avoid vulnerabilities. He also encouraged the audience to submit bug reports to the official Ruby ticketing system because, quoting him, "Twitter is not bug tracker".

It was really fascinating to interact with like-minded people, and I was very happy to leave the conference with many interesting new ideas and perspectives on Ruby and RoR, along with some new techie friends.

New NoSQL benchmark: Cassandra, MongoDB, HBase, Couchbase

Today we are pleased to announce the results of a new NoSQL benchmark we did to compare scale-out performance of Apache Cassandra, MongoDB, Apache HBase, and Couchbase. This represents work done over 8 months by Josh Williams, and was commissioned by DataStax as an update to a similar 3-way NoSQL benchmark we did two years ago.

The database versions we used were Cassandra 2.1.0, Couchbase 3.0, MongoDB 3.0 (with the Wired Tiger storage engine), and HBase 0.98. We used YCSB (the Yahoo! Cloud Serving Benchmark) to generate the client traffic and measure throughput and latency as we scaled each database server cluster from 1 to 32 nodes. We ran a variety of benchmark tests that included load, insert heavy, read intensive, analytic, and other typical transactional workloads.

We avoided using small datasets that fit in RAM, and included single-node deployments only for the sake of comparison, since those scenarios do not exercise the scalability features expected from NoSQL databases. We performed the benchmark on Amazon Web Services (AWS) EC2 instances, with each test being performed three separate times on three different days to avoid unreproducible anomalies. We used new EC2 instances for each test run to further reduce the impact of any “lame instance” or “noisy neighbor” effect on any one test.

Which database won? It was pretty overwhelmingly Cassandra. One graph serves well as an example. This is the throughput comparison in the Balanced Read/Write Mix:

Our full report, Benchmarking Top NoSQL Databases, contains full details about the configurations, and provides this and other graphs of performance at various node counts. It also provides everything needed for others to perform the same tests and verify in their own environments. But beware: Your AWS bill will grow pretty quickly when testing large numbers of server nodes using EC2 i2.xlarge instances as we did!

Earlier this morning we also sent out a press release to announce our results and the availability of the report.

Update: See our note about updated test runs and revised report as of June 4, 2015.

Happy 10th birthday, Git!

Git’s birthday was yesterday. It is now 10 years old! Happy birthday, Git!

Git was born on 7 April 2005, as its creator Linus Torvalds recounted in a 2007 mailing list post. At least if we consider the achievement of self-hosting to be “birth” for software like this. :)

Birthdays are really arbitrary moments in time, but they give us a reason to pause and reflect back. Why is Git a big deal?

Even if Git were still relatively obscure, for any serious software project to survive a decade and still be useful and maintained is an accomplishment. But Git is not just surviving.

Over the past 5–6 years, Git has become the standard version control system in the free software / open source world, and more recently, it is becoming the default version control system everywhere, including in the proprietary software world. It is amazing to consider how fast it has overtaken the older systems, and won out against competing newer systems too. It is not unreasonable these days to expect anyone who does software development, and especially anyone who claims to be familiar with version control systems, to be comfortable with Git.

So how did I get to be friends with Git, and end up at this birthday celebration?

After experimenting with Git and other distributed version control systems for a while in early and mid-2007, I started using Git for real work in July 2007. That is the earliest commit date in one of my personal Git repositories (which was converted from an earlier CVS repository I started in 2000). Within a few weeks I was in love with Git. It was so obviously vastly superior to CVS and Subversion, which I had mostly used before. It offered so much more power, control, and flexibility. The learning curve was real but tractable, and it was so much easier to prevent or repair mistakes that I didn’t mind the retraining at all.

So I’m sounding like a fanboy. What was so much better?

First, the design. A distributed system where all commits were objects with a SHA-1 hash to identify them and the parent commit(s). Locally editable history. Piecemeal committing thanks to the staging power of the Git index. Cheap and quick branching. Better merging. A commit log that was really useful. Implicit rename tracking. Easy tagging and commit naming. And nothing missing from other systems that I needed.

Next, the implementation. Trivial setup, with no political and system administrative fuss for client or server. No messing with users and permissions and committer identities, just name & email address like we’re all used to. An efficient wire protocol. Simple ssh transport for pushes and/or pulls of remote repositories, if needed. A single .git directory at the repository root, rather than RCS, CVS, or .svn directories scattered throughout the checkout. A simple .git/config configuration file. And speed, so much speed, even for very large repositories with lots of binary blobs.

The speed is worth talking about more.

The speed of Git mattered, and was more than just a bonus. It proved true once again the adage that a big enough quantitative difference becomes a qualitative difference. Some people believed that speed of operations wasn’t all that important, but once you are able to complete your version control tasks so quickly that they’re not at all bothersome, it changes the way you work.

The ease of setting up an in-place repository on a whim, without worrying about where or if a central repository would ever be made, let alone wasting any time with access control, is a huge benefit. I used to administer CVS and Subversion repositories and life is so much better with the diminished role a “Git repository administrator” plays now.

Cheap topic branches for little experiments are easy. Committing every little thing separately makes sense if I can later reorder or combine or split my commits, and craft a sane commit before pushing it out where anyone else sees it.

Git subsumed everything else for us.

RCS, despite its major limitations, stuck around because CVS and Subversion didn’t do the nice quick in-place versioning that RCS does. That kind of workflow is so useful for a system administrator or ad-hoc local development work. But RCS has an ugly implementation and is based on changing single files, not sets. It can’t be promoted to real distributed version control later if needed. At End Point we used to use CVS, Subversion, and SVK (a distributed system built on top of Subversion), and also RCS for those cases where it still proved useful. Git replaced them all.

Distributed is better. Even for those who mostly use Git working against a central repository.

The RCS use case was a special limited subset of the bigger topic of distributed version control, which many people resisted and thought was overkill, or out of control, or whatever. But it is essential, and was key to fixing a lot of the problems of CVS and Subversion. Getting over the mental block of Git not having a single sequential integer revision number for each commit as Subversion did was hard for some people, but it forces us to confront the new reality: Commits are their own objects with an independent identity in a distributed world.

When I started using Git, Mercurial and Bazaar were the strongest distributed version control competitors. They were roughly feature-equivalent, and were solid contenders. But they were never as fast or compact on disk, and didn’t have Git’s index, cheap branching, stashing, or so many other niceties.

Then there is the ecosystem.

GitHub arrived on the scene as an unnecessary appendage at first, but its ease of use and popularity, and social coding encouragement, quickly made it an essential part of the Git community. It turned the occasional propagandistic accusation that Git was antisocial and would encourage project forks, into a virtue, by calling everyone’s clone a “fork”.

Over time GitHub has played a major role in making Git as popular as it is. Bypassing the need to set up any server software at all to get a central Git repository going removed a hurdle for many people. GitHub is a centralized service that can go down, but that is no serious risk in a distributed system where you generally have full repository mirrors all over the place, and can switch to other hosting any time if needed.

I realize I’m gushing praise embarrassingly at this point. I find it is warranted, based on my nearly 8 years of using Git and with good familiarity with the alternatives, old and new.

Thanks, Linus, for Git. Thanks, Junio C Hamano, who has maintained the Git open source project since early on.

Presumably someday something better will come along. Until then let’s enjoy this rare period of calm where there is an obvious winner to a common technology question and thus no needless debate before work can begin. And the current tool stability means we don’t have to learn new version control skills every few years.

To commemorate this 10-year birthday, Linux.com interviewed Linus and it is a worthwhile read.

You may also be interested to read more in the Wikipedia article on Git or the Git wiki’s history of Git.

Finally, an author named Stephen has written an article called The case for Git in 2015, which revisits the question of which version control system to use as if it were not yet a settled question. It has many good reminders of why Git has earned its prominent position.

PgConf 2015 NYC Recap

I recently got back from PGConf 2015 NYC.  It was an invigorating, fun experience, both attending and speaking at the conference.

What follows is a brief summary of some of the talks I saw, as well as some insights/thoughts:

On Thursday:

"Managing PostgreSQL with Puppet" by Chris Everest.  This talk covered experiences by CoverMyMeds.com staff in deploying PostgreSQL instances and integrating with custom Puppet recipes.

"A TARDIS for your ORM - application level timetravel in PostgreSQL" by Magnus Hagander. Demonstrated how to construct a mirror schema of an existing database and manage (via triggers) a view of how data existed at some specific point in time.  This system utilized range types with exclusion constraints, views, and session variables to generate a similar-structured schema to be consumed by an existing ORM application.

"Building a 'Database of Things' with Foreign Data Wrappers" by Rick Otten.  This was a live demonstration of building a custom foreign data wrapper to control such attributes as hue, brightness, and on/off state of Philips Hue bulbs.  Very interesting live demo, nice audience response to the control systems.  Used a python framework to stub out the interface with the foreign data wrapper and integrate fully.

"Advanced use of pg_stat_statements: Filtering, Regression Testing & More" by Lukas Fittl.  Covered how to use the pg_stat_statements extension to normalize queries and locate common performance statistics for the same query.  This talk also covered the pg_query tool/library, a Ruby tool to parse/analyze queries offline and generate a JSON object representing the query.  The talk also covered the example of using a test database and the pg_stat_statements views/data to perform query analysis to theorize about planning of specific queries without particular database indexes, etc.

On Friday:

"Webscale's dead! Long live Postgres!" by Joshua Drake.  This talk covered improvements that PostgreSQL has made over the years, specific technologies that they have incorporated such as JSON, and was a general cheerleading effort about just how awesome PostgreSQL is.  (Which of course we all knew already.)  The highlight of the talk for me was when JD handed out "prizes" at the end for knowing various factoids; I ended up winning a bottle of Macallan 15 for knowing the name of the recent departing member of One Direction.  (Hey, I have daughters, back off!)

"The Elephants In The Room: Limitations of the PostgreSQL Core Technology" by Robert Haas.  This was probably the most popular talk that I attended.  Robert is one of the major developers of the PostgreSQL team, and is heavily knowledgeable in the PostgreSQL internals, so his opinions of the existing weaknesses carry some weight.  This was an interesting look forward at possible future improvements and directions the PostgreSQL project may take.  In particular, Robert looked at the IO approach Postgres currently take and posits a Direct IO idea to give Postgres more direct control over its own IO scheduling, etc.  He also mentioned the on-disk format being somewhat suboptimal, Logical Replication as an area needing improvement, infrastructure needed for Horizontal Scalability and Parallel Query, and integrating Connection Pooling into the core Postgres product.

"PostgreSQL Performance Presentation (9.5devel edition)" by Simon Riggs.  This talked about some of the improvements in the 9.5 HEAD; in particular looking at the BRIN index type, an improvement in some cases over the standard btree index method.  Additional metrics were shown and tested as well, which demonstrated Postgres 9.5's additional performance improvements over the current version.

"Choosing a Logical Replication System" by David Christensen.  As the presenter of this talk, I was also naturally required to attend as well.  This talk covered some of the existing logical replication systems including Slony and Bucardo, and broke down situations where each has strengths.

"The future of PostgreSQL Multi-Master Replication" by Andres Freund.  This talk primarily covered the upcoming BDR system, as well as the specific infrastructure changes in PostgreSQL needed to support these features, such as logical log streaming.  It also looked at the performance characteristics of this system.  The talk also wins for the most quote-able line of the conference:  "BDR is spooning Postgres, not forking", referring to the BDR project's commitment to maintaining the code in conjunction with core Postgres and gradually incorporating this into core.

As part of the closing ceremony there were lightning talks as well: quick-paced talks (maximum of 5 minutes) covering a variety of interesting, fun, and sometimes silly topics.  Memorable ones included a talk about using Postgres/PostGIS to extract data about various ice cream-related check-ins on Foursquare, and another which proposed a generic (albeit impractical) way to search across all text fields in a database of unknown schema to find instances of key data.

As always, it was good to participate in the PostgreSQL community, and I look forward to seeing everyone again at future conferences.

Manage Python Script Options

Some time ago I was working on a simple Python script. What the script did is not very important for this article. What is important is the way it parsed arguments, and the way I managed to improve it.

All the examples below look similar to that script; however, I cut most of the code and changed the sensitive information, which I cannot publish.

The main ideas for the options management are:

  • The script reads all config values from a config file, which is a simple ini file.
  • The script values can be overwritten by the command line values.
  • There are special command line arguments which don't exist in the config file, such as:
    • --help - shows the help on the command line
    • --create-config - creates a new config file with default values
    • --config - the path to the config file which should be used
  • If there is no value for a setting in either the config file or the command line arguments, then a default value should be used.
  • The option names in the configuration file and on the command line must be the same. If there is repo-branch in the ini file, then there must be --repo-branch on the command line. However, the variable where the value is stored in Python will be named repo_branch, as we cannot use - in a variable name.

The Basic Implementation

The basic config file is:

[example]
repo-branch = another

The basic implementation was:

#!/usr/bin/env python

import sys
import argparse
import ConfigParser

import logging
logger = logging.getLogger("example")
logger.setLevel(logging.DEBUG)

ch = logging.StreamHandler()
formatter = logging.Formatter('%(asctime)s - %(name)s : %(lineno)s - %(levelname)s - %(message)s')
ch.setFormatter(formatter)
logger.addHandler(ch)

class Options:

    def __init__(self, args):
        self.parser = argparse.ArgumentParser(description="Example script.")
        self.args = args

        self.parser.add_argument("--create-config",
                                 dest="create_config",
                                 help="Create configuration file with default values")

        self.parser.add_argument("--config",
                                 dest="config",
                                 default="/tmp/example.cfg",
                                 help="Path to example.cfg")

        self.parser.add_argument("--repo-branch",
                                 dest="repo_branch",
                                 default="something",
                                 help="git branch OR git tag from which to build")

        # HERE COME OVER 80 LINES WITH DECLARATION OF THE NEXT 20 ARGUMENTS

        self.options = self.parser.parse_args()
        print "repo-branch from command line is: {}".format(self.options.repo_branch)



    def get_options(self):
        return self.options

    def get_parser(self):
        return self.parser

class UpgradeService():

    def __init__(self, options):
        if not options:
            exit(1)
        self.options = options
        if self.options.config:
            self.config_path = self.options.config
            self.init_config_file()
        self.init_options()

    def init_config_file(self):
        """ This function is to process the values provided in the config file """

        self.config = ConfigParser.RawConfigParser()
        self.config.read(self.config_path)

        self.repo_branch = self.config.get('example', 'repo-branch')

        # HERE COME OVER 20 LINES LIKE THE ABOVE

        print "repo-branch from config is: {}".format(self.repo_branch)

    def init_options(self):
        """ This function is to process the command line options.
            Command line options always override the values given in the config file.
        """
        if self.options.repo_branch:
            self.repo_branch = self.options.repo_branch

        # HERE COME OVER 20 LINES LIKE THE TWO ABOVE

    def run(self):
        pass

if __name__ == "__main__":
    options = Options(sys.argv).get_options()
    upgrade_service = UpgradeService(options)

    print "repo-branch value to be used is: {}".format(upgrade_service.repo_branch)
    upgrade_service.run()

The main idea of this code was:

  • All the command line arguments parsing is done in the Options class.
  • The UpgradeService class reads the ini file.
  • The values from the Options class and the ini file are merged into the UpgradeService fields. So a config option like repo-branch will be stored in the upgrade_service.repo_branch field.
  • The upgrade_service.run() method does all the script's magic, however this is not important here.

This way I can run the script with:

  • ./example.py - which will read the config file from /tmp/example.cfg, and the repo_branch should contain another.
  • ./example.py --config=/tmp/a.cfg - which will read the config from the /tmp/a.cfg.
  • ./example.py --help - which will show the help (this is automatically supported by the argparse module).
  • ./example.py --repo-branch=1764 - and the repo_branch variable should contain 1764.

The Problems

First of all, there is a lot of repeated code and repeated option names, and repeated code is a great way to introduce bugs. Each option name is mentioned in the command line argument parser (the add_argument calls). It is repeated later in the config file parser (the config.get calls). The variable name used for storing each value is also repeated: first in the argparse declaration (the dest argument), then in the init_options function. The conditional assignment in init_options is repeated for each option, and for some options it is a little bit different.

This makes the code hard to update when we change an option name or want to add a new one.

Another issue is simple typos. There is no check that an option in the config file is a valid one. When a user, by mistake, writes repo_branch instead of repo-branch in the config file, it will be silently ignored.

The Bug

One question: can you spot the bug in the code?

The problem is that the script reads the config file and then overwrites all the values with the command line ones. What if there is no command line argument for --repo-branch? Then the argparse default will be used, and it will overwrite the config value.

./example.py --config=../example.cfg
repo-branch from command line is: something
repo-branch from config is: another
repo-branch value to be used is: something

Fixing Time

The code for the two implementations (the one described above and the one described below) can be found on GitHub:

I tried to implement a better solution. It should fix the bug, inform the user about bad config values, be easier to change later, and give the same result: the values should be available as UpgradeService fields.

The Options class is not that bad; we need to store the argparse configuration somewhere. I'd just like to have the option names and default values declared in one place, without repeating them all over the code.

I kept the Options class; however, I moved all the default values to a separate dictionary. There is no default value for any option in the argparse configuration. So now, if there is no command line option, e.g. for --repo-branch, the repo_branch field in the object returned by Options.get_options() will be None.
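
As a quick illustration of the behaviour this relies on (a minimal sketch, not part of the script), an argument declared without a default comes back as None when it is not passed:

import argparse

p = argparse.ArgumentParser()
p.add_argument("--repo-branch", dest="repo_branch")        # note: no default

print p.parse_args([]).repo_branch                         # None - nothing was passed
print p.parse_args(["--repo-branch", "1764"]).repo_branch  # 1764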

After the changes, this part of the code is:

DEFAULT_VALUES = dict(
    config="/tmp/example.cfg",
    repo_branch="something",
)

class Options:

    def __init__(self, args):
        self.parser = argparse.ArgumentParser(description="Example script.")
        self.args = args

        self.parser.add_argument("--create-config",
                                 dest="create_config",
                                 help="Create configuration file with default values")

        self.parser.add_argument("--config",
                                 dest="config",
                                 help="Path to example.cfg")

        self.parser.add_argument("--repo-branch",
                                 dest="repo_branch",
                                 help="git branch OR git tag from which to build")

        # Here come the next ~20 argument declarations

        self.options = self.parser.parse_args()
        print "repo-branch from command line is: {}".format(self.options.repo_branch)

    def get_options(self):
        return self.options

    def get_parser(self):
        return self.parser

So now I have a dictionary with the default values. If I also had a dictionary with the config values and a dictionary with the command line ones, it would be quite easy to merge and compare them.

Get Command Line Options Dictionary

First, let's make a dictionary with the command line values. This could be as simple as:

def parse_args():
    return Options(sys.argv).get_options().__dict__

However, there are two things to remember:

  • There is the --create-config option which should be supported, and this is the best place to do it.
  • The argument names in the dictionary returned by __dict__ will have underscores instead of dashes (see the short sketch after this list).
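
A minimal sketch of that conversion (not part of the script):

import argparse

p = argparse.ArgumentParser()
p.add_argument("--repo-branch")   # dest is derived automatically

args = p.parse_args(["--repo-branch", "1764"])
print args.__dict__               # {'repo_branch': '1764'} - the dash became an underscore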

So let's add creation of the new config file:

def parse_args():
    """ Parses the command line arguments, and returns dictionary with all of them.

    The arguments have dashes in the names, but they are stored in fields with underscores.

    :return: arguments
    :rtype: dictionary
    """
    options = Options(sys.argv).get_options()
    result = options.__dict__
    logger.debug("COMMAND LINE OPTIONS: {}".format(result))

    if options.create_config:
        logger.info("Creating configuration file at: {}".format(options.create_config))
        with open(options.create_config, "w") as c:
            c.write("[{}]\n".format("example"))
            for key in sorted(DEFAULT_VALUES.keys()):
                value = DEFAULT_VALUES[key]
                c.write("{}={}\n".format(key, value or ""))
        exit(0)
    return result

The above function first gets the options from an Options object and converts them to a dictionary. If the create_config option is set, it creates the config file and exits; otherwise it returns the dictionary with the values.
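
As a usage sketch (the path is just an example), assuming the keys are written back with dashes as in the comment above, running:

./example.py --create-config=/tmp/new.cfg

should write a /tmp/new.cfg roughly like this, and then exit:

[example]
config=/tmp/example.cfg
repo-branch=something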

Get Config File Dictionary

Converting the config file to a dictionary is also quite simple. However, what we get is a dictionary with keys exactly as they are written in the config file. These will contain dashes, like repo-branch, while the other dictionaries use underscores, like repo_branch, so I also convert all the keys to use underscores instead of dashes.

CONFIG_SECTION_NAME = "example"
def read_config(fname, section_name=CONFIG_SECTION_NAME):
    """ Reads a configuration file.

    Here the field names contain dashes, while in the args parser
    and in the default values we have underscores,
    so the dashes are additionally converted to underscores here.

    :param fname: name of the config file
    :param section_name: name of the config section to read
    :return: dictionary with the config file content
    :rtype: dictionary
    """
    config = ConfigParser.RawConfigParser()
    config.read(fname)

    result = {key.replace('-','_'):val for key, val in config.items(section_name)}
    logger.info("Read config file {}".format(fname))
    logger.debug("CONFIG FILE OPTIONS: {}".format(result))
    return result

And yes, I'm using a dictionary comprehension there.
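
A quick check, assuming /tmp/example.cfg contains the [example] section shown at the beginning:

config_values = read_config("/tmp/example.cfg")
print config_values   # {'repo_branch': 'another'} - note the underscore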

Merging Time

Now I have three dictionaries with configuration options:

  • The DEFAULT_VALUES.
  • The config values, returned by the read_config function.
  • The command line values, returned by the parse_args function.

And I need to merge them. Merging cannot be done automatically, as I need to:

  • Get the DEFAULT_VALUES.
  • Overwrite or add values read from the config file.
  • Overwrite or add values from the command line, but only if the values are not None, which is the default value when an argument is not set.
  • At the end I want to return an object, so I can access the options as settings.branch_name instead of settings['branch_name'].

For merging I created a generic function: it merges the first dictionary with the second, and can use a third dictionary of default values as the starting point.

At the end it uses a namedtuple to get a nice object with field names taken from the keys and filled with the merged dictionary values.

def merge_options(first, second, default={}):
    """
    This function merges the first argument dictionary with the second.
    The second overrides the first.
    Then it merges the default with the already merged dictionary.

    This is needed because, if the user sets an option `a` in the config file
    and does not provide a value for it on the command line,
    a command line default would otherwise override the config one.

    With the three-dictionary solution, the algorithm is:
    * get the default values
    * update with the values from the config file
    * update with the command line options, but only for the values
      which are not None (all not set command line options will have None)

    As it is easier and nicer to use the code like:
        options.path
    then:
        options['path']
    the merged dictionary is then converted into a namedtuple.

    :param first: first dictionary with options
    :param second: second dictionary with options
    :return: object with both dictionaries merged
    :rtype: namedtuple
    """
    from collections import namedtuple
    options = dict(default)  # copy, so the caller's dictionary (e.g. DEFAULT_VALUES) is not modified
    options.update(first)
    options.update({key:val for key,val in second.items() if val is not None})
    logger.debug("MERGED OPTIONS: {}".format(options))
    return namedtuple('OptionsDict', options.keys())(**options)
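
To make the precedence concrete, here is a small sanity check (hypothetical values, not part of the script), reproducing the original bug scenario: the config file sets repo-branch, the command line sets nothing:

config_values  = {"repo_branch": "another"}             # read from the config file
cmdline_values = {"repo_branch": None, "config": None}  # nothing set on the command line

merged = merge_options(config_values, cmdline_values, DEFAULT_VALUES)

print merged.repo_branch   # another - the config value wins over the default
print merged.config        # /tmp/example.cfg - the default, nothing else set it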

Dictionary Difference

The last utility function I need is something to compare dictionaries. I think it is a great idea to inform the user about a strange option name in the config file. Let's assume that:

  • The main list of the options is the argparse option list.
  • The config file can contain fewer options, but cannot contain options which are not in the argparse list.
  • There are some options which can be in the command line, but cannot be in the config file, like --create-config.

The main idea behind the function is to convert the keys of the dictionaries to sets and then take the set differences. This must be done for the setting names in both directions:

  • config.keys - commandline.keys - if the result is not an empty set, then it is an error
  • commandline.keys - config.keys - if the result is not an empty set, then we should just show some information about this

The function below takes two arguments, first and second, and returns a tuple like (first-second, second-first). There is also a third argument: a list of keys which should be ignored, like create_config.

def dict_difference(first, second, omit_keys=[]):
    """
    Calculates the difference between the keys of the two dictionaries,
    and returns a tuple with the differences.

    :param first:     the first dictionary to compare
    :param second:    the second dictionary to compare
    :param omit_keys: the keys which should be omitted,
                      as for example we know that it's fine that one dictionary
                      will have this key, and the other won't

    :return: The keys which are different between the two dictionaries.
    :rtype: tuple (first-second, second-first)
    """
    keys_first = set(first.keys())
    keys_second = set(second.keys())
    keys_f_s = keys_first - keys_second - set(omit_keys)
    keys_s_f = keys_second - keys_first - set(omit_keys)

    return (keys_f_s, keys_s_f)
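
For example (hypothetical dictionaries, mirroring how the function is used later):

cmdline_values = {"create_config": None, "config": None, "repo_branch": "1764"}
config_values  = {"repo_branch": "another", "unknown_option": "oops"}   # note the bogus key

only_in_cmdline, only_in_config = dict_difference(cmdline_values, config_values, ["create_config"])

print only_in_cmdline   # set(['config']) - missing from the config file, the default will be used
print only_in_config    # set(['unknown_option']) - not a supported option, this is an error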

Build The Options

And now for the final piece: the main function for building the options, which uses all the code above. This function:

  • Gets a dictionary with command line options from the parse_args function.
  • Finds the path to the config file (from the command line, or from the default value).
  • Reads the dictionary with config file options from the read_config function.
  • Calculates the differences between the dictionaries using the dict_difference function.
  • Prints information about the options which can be set in the config file, but are not set currently. Those options are in the Options class, but are not in the config file.
  • Prints information about the options which are in the config file, but shouldn't be there, because they are not declared in the argparse options list, in the Options class.
  • If there are any options which cannot be in the config file, the script exits with an error code.
  • Finally, it merges all three dictionaries using the merge_options function and returns the namedtuple.
    """
    Builds an object with the merged opions from the command line arguments,
    and the config file.

    If there is an option on the command line which doesn't exist in the config file,
    then the default value will be used. That's fine, the script
    will just print some info about that.

    If there is an option in the config file which doesn't exist on the command line,
    then it looks like an error. This time the script will show it as an error
    and will exit.

    If the same option is set on the command line and in the config file,
    then the command line one overrides the config one.
    """

    options = parse_args()
    config = read_config(options['config'] or DEFAULT_VALUES['config'])

    (f, s) = dict_difference(options, config, COMMAND_LINE_ONLY_ARGS)
    if f:
        for o in f:
            logger.info("There is an option, which is missing in the config file,"
                        "that's fine, I will use the value: {}".format(DEFAULT_VALUES[o]))
    if s:
        logger.error("There are options, which are in the config file, but are not supported:")
        for o in s:
            logger.error(o)
        exit(2)

    merged_options = merge_options(config, options, DEFAULT_VALUES)
    return merged_options

Other Changes

There are some additional changes. I had to add a list of the command line arguments which are allowed to be missing from the config file:

COMMAND_LINE_ONLY_ARGS = ["create_config"]

The UpgradeService class is much simpler now:

class UpgradeService():

    def __init__(self, options):
        if not options:
            exit(1)
        self.options = options

    def run(self):
        pass

The runner part also changed a little bit:

if __name__ == "__main__":
    options = build_options()
    upgrade_service = UpgradeService(options)

    print "repo-branch value to be used is: {}".format(upgrade_service.options.repo_branch)
    upgrade_service.run()

The main difference between the two implementations is that in the first the options could be accessed as upgrade_service.repo_branch, while in the second they need to be accessed as upgrade_service.options.repo_branch.
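
With the new implementation, and assuming the same ../example.cfg as before, the test case from the beginning behaves as expected; the output (with the logger lines trimmed) should look roughly like:

./example.py --config=../example.cfg
repo-branch from command line is: None
...
repo-branch value to be used is: another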

Full Code

The code for the two implementations can be found on GitHub: