Welcome to End Point’s blog

Ongoing observations by End Point people

In Our Own Words

What do our words say about us?

Recently, I came across Wordle, a Java-based Google App Engine application that generates word clouds from websites and raw text. I wrote a cute little rake task to grab text from our blog to plug into Wordle. The rake task fetches the blog's RSS feed, parses it with REXML, strips HTML tags from each post's title and description, and lowercases the result. The task also applies a bit of aliasing, since we use Postgres, PostgreSQL, and pg interchangeably in our blog.

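require 'open-uri'
require 'rexml/document'

task :wordle => :environment do
  # Fetch the blog's RSS feed (up to 999 posts).
  url = 'http://blog.endpoint.com/feeds/posts/default?alt=rss&max-results=999'
  data = URI.open(url, 'User-Agent' => 'Ruby-Wget').read
  doc = REXML::Document.new(data)
  text = ''
  doc.root.each_element('//item') do |item|
    # Strip HTML tags from each post's description and title.
    text += item.elements['description'].text.gsub(/<\/?[^>]*>/, "") + ' '
    text += item.elements['title'].text.gsub(/<\/?[^>]*>/, "") + ' '
  end
  # Lowercase, turn periods into spaces, drop blank lines, and alias
  # space-delimited "postgres" and "pg" to "postgresql".
  text = text.downcase
    .gsub(/\./, ' ')
    .gsub(/^\n/, '')
    .gsub(/ postgres /, ' postgresql ')
    .gsub(/ pg /, ' postgresql ')
  # Write the cleaned text to the file named by the filename environment variable.
  File.open(ENV['filename'], 'w') { |file| file.puts text }
end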
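
To see what the space-delimited aliasing does and does not catch, here is a minimal sketch that runs the same cleanup chain on a sample string. The helper names `normalize` and `normalize_with_boundaries` and the sample sentence are hypothetical, made up for illustration; the second helper is one possible alternative using word boundaries, not what the rake task actually does.

```ruby
# Hypothetical helper mirroring the cleanup chain in the rake task.
def normalize(text)
  text.downcase
      .gsub(/\./, ' ')
      .gsub(/^\n/, '')
      .gsub(/ postgres /, ' postgresql ')
      .gsub(/ pg /, ' postgresql ')
end

# Hypothetical variant: word boundaries instead of literal spaces.
# "_" counts as a word character in Ruby regexes, so identifiers such as
# pg_log and something_pg are left untouched.
def normalize_with_boundaries(text)
  text.downcase
      .gsub(/\./, ' ')
      .gsub(/^\n/, '')
      .gsub(/\b(?:postgres|pg)\b/, 'postgresql')
end

# Made-up sample text, not taken from the blog.
sample = "We love Postgres, and pg is great. Check pg_log for details."

puts normalize(sample)
puts normalize_with_boundaries(sample)
```

With the space-delimited version, "postgres," keeps its original spelling because a comma follows it, while the standalone "pg" is aliased. The word-boundary variant aliases both and still leaves pg_log alone. It would, however, also rewrite hyphenated strings such as "pg-something", so whether it is the better trade-off depends on the text being cleaned.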

So, you tell me: Do you think we write like engineers? How well does this word cloud represent our skillset?

2 comments:

Jon Jensen said...

That's really cool.

The script's attempt to combine PostgreSQL with Postgres & Pg doesn't seem to have worked since those all appear separately in the word cloud. When I changed that locally it of course made PostgreSQL even more prominent. :)

To answer your question about whether it represents our skillset: No, it's not very balanced. We don't blog nearly enough about Perl, Interchange, JavaScript, security, networking, or operations topics to represent those major parts of what we do.

Let's re-run it in a year and see whether the balance has changed!

Steph Skardal said...

Jon, yes, the script didn't clean up every instance of pg and Postgres: as you might notice, I only did the substitution when those strings were surrounded by spaces. I did this intentionally to be minimally invasive, because strings like pg_log or something_pg occur throughout our blog.