
Showing Your Client Measurable Progress with Video Updates

I’ve recently started working on a big new project, the kind everyone dreams of: a nice budget, a great client, and lots of fun stuff to dig into. Along the way I stumbled on a fun and easy way to give the client a weekly update on progress: videos!

I used the Screencastify extension for Google Chrome to capture my screen, along with voice narration, to give the client a visual update of what we accomplished this week. I have been using Screencastify Lite and was really happy with the end product. Combined with the device mode in Chrome’s developer tools, you can toggle between the site’s desktop and mobile views.

This is a great way to show a client measurable progress without the pain of walking them through an unfinished site, letting them dig into lots of broken links, and having them inevitably stumble on more questions than answers. Plus, your client can digest the content on their own schedule, skip the hassle of scheduling a face-to-face demo, and take some time to compose their feedback.

Next time you need an easy, fast way to show progress on a project, try a video!

Report on The Perl Conference 2016

In June, I traveled to Orlando, Florida to attend the event formerly known as Yet Another Perl Conference (or YAPC::NA), now known as The Perl Conference. This was my second time attending this conference (my first was back in 2007).

Conferences are a great place to learn how others are using various tools, hear about new features, and interact with the community. If you are speaking, it's a great opportunity to brush up on your subject, which was true for me in the extreme, as I was able to give a talk on the PostgreSQL database, which I hadn't used in a long time (more on that later).

The conference name

The event organizers were able to license the name The Perl Conference from O'Reilly Media, as O'Reilly doesn't hold conferences by this name anymore. This name is now preferred over "YAPC" as it is more friendly to newcomers and more accurately describes the conference. More on the name change.

Notes from the conference

Over the three days of the conference, I was able to take in many talks. Here are some of my more interesting notes from various sessions:

  • MetaCPAN is the best way to browse and search for Perl modules. Anyone can help with development of this fine project, via their GitHub.
  • Ricardo Signes says "use subroutine signatures!" They are "experimental", but they are here to stay.
  • Perl6 is written in Perl6 (and something called "Not Quite Perl"). This allows one to read the source to figure out how something is done. (There were many talks on Perl6, which is viewed as a different programming language, not a replacement for Perl5.)
  • jq is a command-line utility that can pretty-print JSON (non-Perl, but nice!)
  • Ricardo Signes gave a talk on encoding that was over my head, but very interesting.
  • The presenter of "Emacs as a Perl IDE" couldn't attend, so Damian Conway stepped in with a talk on Vim as a Perl IDE.

From John Anderson's talk:

  • Just say "no" to system Perl. Use plenv or the like.
  • There's a DuckDuckGo bang command for searching MetaCPAN: !cpan [module]
  • Use JSON::MaybeXS over the plain JSON module.
  • Use Moo for object-oriented programming in Perl, or Moose if you must.
  • Subscribe to Perl Weekly
  • Submit your module on PrePAN first, to receive feedback before posting to CPAN.

Lee Johnson gave a talk called Battling a legacy schema with DBIx::Class (video). Key takeaways:

  • When should I use DBIC?
  • Something that has grown organically could be considered "legacy," as it accumulates technical debt
  • With DBIC, you can start to manage that debt by adding relationships to your model, even if they aren't in your database
  • RapidApp's rdbic can help you visualize an existing database

D. Ruth Bavousett spoke on Perl::Critic, which is a tool for encouraging consistency. Basically, Perl::Critic looks at your source code and makes suggestions for improvement. These suggestions are governed by "policies"; you can enable or disable any of them, or even write new ones. One suggestion was to create a Git hook that runs the perlcritic command when code is committed to the source code repository (possibly using App::GitHooks). End Point has its own perlcritic configuration, which I have started trying to use more.
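
For a flavor of how that looks in practice, here is a minimal sketch (the module path and the disabled policy are just examples, not End Point's actual configuration): you run the perlcritic command against your code, and keep a .perlcriticrc in the project root to set a default severity and turn individual policies off.

## Check a module from the command line (severity 5 is the gentlest, 1 the harshest):
perlcritic --severity 3 lib/My/Module.pm

## An example .perlcriticrc, setting a default severity and disabling one policy:
severity = 3
[-Subroutines::ProhibitExplicitReturnUndef]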

Logan Bell shared Strategies for leading a remote team. Some of the tools and techniques he uses include:

  • tmate for terminal sharing
  • HipChat, with a chat room just for complaining called "head to desk"
  • Holds one-on-one meetings every Monday for those he works with and directs
  • Has new team members work on-site with another team member their first week or so, to help understand personalities that don't often come across well over chat
  • Tries to have in-person meetings every quarter or at least twice a year, to bring the team together

My talk

Finally, my own talk was titled Stranger in a Strange Land: PostgreSQL for MySQL users (video). I hadn't used Postgres in about seven years, and I wanted to get re-acquainted with it, so naturally, I submitted a talk on it to spur myself into action!

In my talk, I covered:

  • the history of MySQL and Postgres
  • how to pronounce "PostgreSQL"
  • why one might be preferred over the other
  • how to convert an existing database to Postgres
  • and some tools and tips.

I enjoyed giving this talk, and hope others found it helpful. All in all, The Perl Conference was a great experience, and I hope to continue attending in the future!

All videos from this year's conference

Liquid Galaxy Featured on Reef Builders

The Liquid Galaxy was recently featured on the front page of Reef Builders, the original and longest-running saltwater fish blog and a leading source of aquarium news. Reef Builders writer Nicole Helgason wrote the story, which can be found here.

The Liquid Galaxy is an amazing tool for aquariums (not to mention other kinds of museums, offices, educational institutions, and more) around the world. It is particularly effective for aquariums because of the underwater imagery shown on the system, as well as End Point's Content Management System, which gives users the opportunity to tell underwater "stories" with supporting images, videos, and content. We have deployed to aquariums and science museums throughout the US, Europe, and East Asia.

The Liquid Galaxy lets visitors explore coral reefs and underwater environments the exact same way they navigate Street View (it's the same app and data set) with a full set of screens to give a totally immersive experience. While viewing the dazzling immersive display, the user can make use of the Liquid Galaxy touchpad and 3D joystick to look around and navigate through the display.

A video demonstrating how the Liquid Galaxy is utilized in aquariums can be found below. If you're interested in learning more about Liquid Galaxy for your aquarium, please contact us here for more information.

Case Study: Responsive Design Return on Investment

Responsive design has been a hot topic in the e-commerce world for several years now. End Point has worked on many sites over the last few years to transition clients to a responsive website design. While many large retailers have already made the transition, many smaller e-commerce sites are still on an older design model, and I would like to show that the return on investment for those stores is still noteworthy.

The lead of our Interchange team, Greg Hanson, led a responsive design project and I’d like to summarize that work on our blog. For confidentiality, I am leaving out the client’s name in this case.

Why Go Responsive?

There are two main reasons every e-commerce website, even a small one, needs to become responsive:

  • Your customers
  • Google

The march toward mobile sales at this point is undeniable and unstoppable. As more and more people become comfortable using their phones and tablets to make purchases, this market share will only grow. Also, Google has begun de-prioritizing websites that do not cater to mobile users. If you are waiting to go responsive because of budget, you might be surprised to learn how dramatically mobile revenue increased for the customer in this case.

Goals

This client is a small e-commerce business with a heavy ‘on’ season and an ‘off’ season during which the business owners could focus on this project. They wanted:

  • To accommodate the increasing mobile user base; they knew it was there from looking at their Google Analytics.
  • To increase revenue from mobile users. They could see that they had a 10% mobile user base, but they were converting only a small percentage.

End Point’s strategy

Our strategy with this small client, to minimize cost and impact to the business, was to:

Use Bootstrap

Bootstrap is one of many front-end frameworks that let you create a design template and roll it out to all the pages within a website without having to re-code each and every page. This dramatically speeds up the work and decreases cost.

Break up the work into segments

In this case, we had a three phase approach:

  • Phase I: redesign the site using Bootstrap, but still at a fixed width
  • Phase II: convert the checkout sequence to responsive
  • Phase III: make the entire site responsive

The Project

Converting the site to Bootstrap was the biggest chunk of the time and money spent on this project. Since we knew this would be the foundation for the changes to come, we took our time getting it right, keeping the client involved every step of the way. We didn’t want to get in the way of their busy season either, so we completed that phase and waited half a year to begin the next piece.

The second step was updating the checkout sequence to be responsive, since this was arguably the hardest part of using the non-responsive site on a mobile device. The client considered this piece top priority. Plus, since it was only a few pages, it allowed us to better understand the scope of coding the rest of the site and to give the client an accurate picture of the budget.

Last, once we had a responsive checkout sequence, we changed the rest of the internal pages to be responsive and rolled out the entire site as responsive and mobile friendly.

Results

The first time we looked at the analytics following the responsive conversion, we were SHOCKED! Comparing a small sample period against the same period just a year prior, we saw a 280% increase in mobile purchases.


The timeframe for these comparison numbers was August 2014, before the responsive transition, and August 2015, after the transition.

To sanity check those numbers, we recently re-ran some of the analytics: as of May 2016, revenue from mobile users is still up over 90%, and the client’s total year-over-year revenue is up 12%.


We also ran some numbers on the growth of mobile revenue as a percentage of overall revenue through this process. Two years ago mobile revenue was less than 1% of the total; it is now near 12%.


I didn’t go into all the ways to slice and dice the numbers in Google Analytics here. However, I’m happy to speak with you if you want to know more about mobile users, this client’s data, or how responsive design might help grow your e-commerce business.

The Lesson

The lesson here is that any store will benefit from going responsive. Having an experienced partner like End Point that can work with your budget and timeline makes this a doable project for a store of any size.

Disabling Postgres constraints for pg_dump

Constraints in Postgres are very powerful and versatile: not only are foreign keys, primary keys, and column uniqueness implemented internally via constraints, but you may create your own quite easily (at both the column and table level). Most of the time constraints are simply set and forgotten, but there is one situation in which they can become a problem: copying the database using the pg_dump program.
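
For example, here is a quick sketch (with made-up table and column names) showing the difference: a column-level constraint is written inline with its column, while a table-level constraint stands on its own and may reference several columns.

CREATE TABLE order_items (
    quantity    integer CHECK (quantity > 0),         -- column-level constraint
    unit_price  numeric,
    discount    numeric,
    CHECK (discount >= 0 AND discount <= unit_price)  -- table-level constraint
);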

The issue is that constraints are usually added *before* the data is copied to the new table via the COPY command. This means the constraint fires for each added row, to make sure that the row passes the conditions of the constraint. If the data is not valid, however, the COPY will fail, and you will not be able to load the output of your pg_dump into a new database. Further, there may be a non-trivial performance hit doing all that validation. Preventing the constraint from firing may provide a significant speed boost, especially for very large tables with non-trivial constraints.

Let's explore one way to work around the problem of pg_dump failing to work because some of the data is not valid according to the logic of the constraints. While it would be quicker to make some of these changes on the production system itself, corporate inertia, red tape, and the usual DBA paranoia means a better way is to modify a copy of the database instead.

For this example, we will first create a sample "production" database and give it a simple constraint. This constraint is based on a function, to both emulate a specific real-world example we came across for a client recently, and to allow us to easily create a database in which the data is invalid with regards to the constraint:

dropdb test_prod; createdb test_prod
pgbench test_prod -i -n
creating tables...
100000 of 100000 tuples (100%) done (elapsed 0.82 s, remaining 0.00 s)
set primary keys...
done.
psql test_prod -c 'create function valid_account(int) returns bool language sql immutable as $$ SELECT $1 > 0$$;'
CREATE FUNCTION
psql test_prod -c 'alter table pgbench_accounts add constraint good_aid check ( valid_account(aid) )'
ALTER TABLE

Note that the constraint was added without any problem, as all of the values in the aid column satisfy the function: each one is greater than zero. Now let's tweak the function so that it no longer represents a valid, up-to-date constraint for all of the data in the table:

## Verify that the constraint is working - we should get an error:
psql test_prod -c 'update pgbench_accounts set aid = -1 where aid = 1'
ERROR:  new row for relation "pgbench_accounts" violates check constraint "good_aid"
DETAIL:  Failing row contains (-1, 1, 0,                                         ...).

## Modify the function to disallow account ids under 100. No error is produced!
psql test_prod -c 'create or replace function valid_account(int) returns bool language sql volatile as $$ SELECT $1 > 99$$'
CREATE FUNCTION

## The error is tripped only when we violate it afresh:
psql test_prod -c 'update pgbench_accounts SET aid=125 WHERE aid=125'
UPDATE 1
psql test_prod -c 'update pgbench_accounts SET aid=88 WHERE aid=88'
ERROR:  new row for relation "pgbench_accounts" violates check constraint "good_aid"
DETAIL:  Failing row contains (88, 1, 0,                                         ...).

The volatility was changed from IMMUTABLE to VOLATILE simply to demonstrate that a function called by a constraint is not bound to any particular volatility, although it *should* always be IMMUTABLE. In this example, it is a moot point, as our function could remain immutable and still be "invalid" for some rows in the table. Owing to our function changing its logic, we now have a situation in which the output of a regular pg_dump cannot be loaded:

dropdb test_upgraded; createdb test_upgraded
pg_dump test_prod | psql test_upgraded -q
ERROR:  new row for relation "pgbench_accounts" violates check constraint "good_aid"
DETAIL:  Failing row contains (1, 1, 0,                                          ...).
CONTEXT:  COPY pgbench_accounts, line 1: "1             1   0          "
## Ruh roh!

Time for a workaround. When a constraint is created, it may be declared as NOT VALID, which simply means that it makes no promises about the *existing* data in the table, but will start constraining any data changed from that point forward.

Of particular importance is the fact that pg_dump can dump things into three sections: "pre-data", "data", and "post-data". When a normal constraint is dumped, it will go into the pre-data section, and cause the problems seen above when the data is loaded. However, a constraint that has been declared NOT VALID will appear in the post-data section, which allows the data to load, as the constraint is not declared until after the "data" section has been loaded. Thus, our workaround will be to move constraints from the pre-data to the post-data section. First, let's confirm the state of things by making some dumps from the production database:

pg_dump test_prod --section=pre-data -x -f test_prod.pre.sql
pg_dump test_prod --section=post-data -x -f test_prod.post.sql
## Confirm that the constraint is in the "pre" section:
grep good_aid test*sql
test_prod.pre.sql:    CONSTRAINT good_aid CHECK (valid_account(aid))

There are a few ways around this constraint issue, but here is one that I like as it makes no changes at all to production, and produces valid SQL files that may be used over and over.

dropdb test_upgraded; createdb test_upgraded
## Note that --schema-only is basically the combination of pre-data and post-data
pg_dump test_prod --schema-only | psql test_upgraded -q
## Save a copy so we can restore these to the way we found them later:
psql test_upgraded -c "select format('update pg_constraint set convalidated=true where conname=%L and connamespace::regnamespace::text=%L;', \
  conname, nspname) from pg_constraint c join pg_namespace n on (n.oid=c.connamespace) \
  where contype ='c' and convalidated" -t -o restore_constraints.sql
## Yes, we are updating the system catalogs. Don't Panic!
psql test_upgraded -c "update pg_constraint set convalidated=false where contype='c' and convalidated"
UPDATE 3
## Why 3? The information_schema "schema" has two harmless constraints
pg_dump test_upgraded --section=pre-data -x -f test_upgraded.pre.sql
pg_dump test_upgraded --section=post-data -x -f test_upgraded.post.sql
## Verify that the constraint has been moved to the "post" section:
grep good test*sql
test_prod.pre.sql:    CONSTRAINT good_aid CHECK (valid_account(aid))
test_upgraded.post.sql:-- Name: good_aid; Type: CHECK CONSTRAINT; Schema: public; Owner: greg
test_upgraded.post.sql:    ADD CONSTRAINT good_aid CHECK (valid_account(aid)) NOT VALID;
## Two diffs to show the inline (pre) versus ALTER TABLE (post) constraint creations:
$ diff -u1 test_prod.pre.sql test_upgraded.pre.sql 
--- test_prod.pre.sql        2016-07-04 00:10:06.676766984 -0400
+++ test_upgraded.pre.sql    2016-07-04 00:11:07.978728477 -0400
@@ -54,4 +54,3 @@
     abalance integer,
-    filler character(84),
-    CONSTRAINT good_aid CHECK (valid_account(aid))
+    filler character(84)
 )

$ diff -u1 test_prod.post.sql test_upgraded.post.sql 
--- test_prod.post.sql        2016-07-04 00:11:48.683838577 -0400
+++ test_upgraded.post.sql    2016-07-04 00:11:57.265797869 -0400
@@ -17,2 +17,10 @@
 
+--
+-- Name: good_aid; Type: CHECK CONSTRAINT; Schema: public; Owner: greg
+--
+
+ALTER TABLE pgbench_accounts
+    ADD CONSTRAINT good_aid CHECK (valid_account(aid)) NOT VALID;
+
+
 SET default_tablespace = '';

Now we can simply sandwich our data load between the new pre and post files, and avoid having the constraints interfere with the data load portion at all:

dropdb test_upgraded; createdb test_upgraded
psql test_upgraded -q -f test_upgraded.pre.sql
pg_dump test_prod --section=data | psql test_upgraded -q
psql test_upgraded -q -f test_upgraded.post.sql
## As the final touch, restore all the constraints we changed to exactly how they were before:
psql test_upgraded -f restore_constraints.sql

A final sanity check is always a good idea, to make sure the two databases are identical, despite our system catalog tweaking:

diff -s <(pg_dump test_prod) <(pg_dump test_upgraded)
Files /dev/fd/63 and /dev/fd/62 are identical

Although we declared a goal of having the upgraded database match production as closely as possible, you can always not apply that final restore_constraints.sql file and leave the constraints as NOT VALID, which is a better reflection of the reality of things. It also means you will not have to go through this rigmarole again, as those constraints shall forevermore be put into the post-data section when doing a pg_dump (unless someone runs the ALTER TABLE ... VALIDATE CONSTRAINT ... command!).

While there is no direct way to disable constraints when loading data, using this pre-data to post-data trick can not only boost data load times, but get you out of a potential jam when your data is invalid!

Book review: “Two Scoops of Django: Best Practices for Django 1.8”

Two Scoops of Django: Best Practices for Django 1.8 is, shockingly, a book about best practices. It’s not a Django library reference, or a book about Django fundamentals, or tips and tricks. It’s a book designed to help web developers, novices and experts alike, avoid common pitfalls in every stage of the web development process, specifically the process of developing with Django.

The book can be used both as a reference and as a cover-to-cover guide to best practices. I’ve done both and found it to be enjoyable, accessible, and educational when read cover-to-cover, and a valuable reference when setting up a new Django project or doing general Django development. It covers a huge range of material, answering questions like:

  • Where should I store secret keys?
  • Should I use virtualenv?
  • How should I structure my Django project?
  • When should I use ‘blank’ and ‘null’ in model fields?
  • Should I use Class-Based Views or Function-Based Views?
  • How should I structure my URLConfs?
  • When should I use Forms?
  • Where should templates be stored?
  • Should I write custom template tags and filters?
  • What package should I use to create a REST API?
  • What core components should I replace?
  • How can I modify or replace the User and authentication system?
  • How should I test my app?
  • How can I improve performance?
  • How can I keep my app secure?
  • How do I properly use the logging library?
  • How do I deploy to a Platform as a Service or my own server(s)?
  • What can I do to improve the debugging process?
  • What are some good third-party packages?
  • Where can I find good documentation and tutorials?
  • Where should I go to ask more questions?

The question, then, is whether or not this book delivers this information well. For the most part, it does. It's important to recognize that the book doesn't cover any of these subjects in great detail, but it does a great job explaining the "why" behind the simple rules it establishes, and it references online resources that let you go much deeper into each subject. It also clearly marks its bad examples, making it very easy to see whether you are doing, or were planning to do, something badly. The writing style is very accessible and straightforward; I read large portions of the book during breakfast or lunch.

Two sections stood out to me as being very helpful for my own projects. First is Chapter 4, Fundamentals of Django App Design, which explained better than any resource I’ve found yet exactly what a Django “app” (as in ./manage.py startapp polls) should be used for. It explains what an app should or shouldn’t have in it, how much an app should do, when you should break into separate apps, and more.

The next section that really helped me was Chapter 6, Model Best Practices, which explained things like what code should and shouldn’t be in a model, how to use migrations and managers, and ModelFields that should be avoided. Perhaps the most useful part of that chapter is a table in the section “When to use Null and Blank,” which makes for an easy and quick reference to which fields go well with the null and blank parameters and when you should use both or neither.
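
To give a rough flavor of that guidance (this is my own illustration of the common Django convention, not a reproduction of the book's table): string-based fields are usually declared with blank=True alone, so that empty values are stored as empty strings rather than NULL, while optional non-string fields pair null=True with blank=True.

from django.db import models

class Profile(models.Model):
    # String-based field: allow it to be left empty in forms, store '' rather than NULL
    nickname = models.CharField(max_length=50, blank=True)

    # Optional non-string field: allow NULL in the database and empty values in forms
    birthday = models.DateField(null=True, blank=True)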

The only real problem I had with Two Scoops of Django was that the illustrations rarely felt necessary or helpful. The majority of them are ice cream-themed jokes that aren’t particularly funny. Overall, I really enjoyed this book, and I definitely recommend it to anybody who is doing, or is interested in doing, serious Django development.

Cesium on the Liquid Galaxy

Data visualization continues to evolve, with ever more complex data sets openly available and a corresponding increase in the pace of visualization tools. For mapping GIS data, the Cesium app is gaining quite a bit of traction. As we continue to branch out with new functionality and visualization apps for the Liquid Galaxy, we wanted to try Cesium as well.


Cesium is written entirely in JavaScript using WebGL and offers some nice advantages over other engines: it's open source, it's flexible, and it's quick. It can accept an array of points, shapefiles, 3D models, and even KML. The JavaScript then chews these up and delivers a nice, consistent 3D environment that we can fly through with the SpaceNav controller, set scenes in a presentation to tell a story, or mix together with video or graphic popups for a fully immersive multimedia experience. That openness and flexibility make it accessible for building many different kinds of data visualizations and interactions. There are a lot of new startups exploiting this platform, and we welcome the opportunity to work with them.

As we've written previously, the main advantage of the Liquid Galaxy platform is the ability to adjust the viewing angle on each screen to match its physical offset, avoiding (as much as possible) artificial distortions, fisheye views, or image stretching. The trickiest bit of this project was setting up the distributed camera driver, which takes input from the SpaceNav controller and correctly aligns the view position for each of the geometrically offset displays. Once the math is worked out, it's relatively quick work to put the settings into a rosbridge WebSockets driver. Once again, we're really enjoying the flexibility that the ROS architecture grants this system.

Looking forward, we anticipate this can open up many more visualizations for the Liquid Galaxy. As we continue to roll out in corporate, educational, and archival environments such as real estate brokerages, hospitality service providers, universities, and museums, the Cesium platform will offer yet another way for our customers to visualize and interact with their data.

Liquid Galaxy on PBS

PBS recently aired a segment about the Liquid Galaxy! Just before we presented at New York Tech Meetup, we were interviewed about the Liquid Galaxy for SciTech Now, a PBS publication. The interview took place in NYU’s Skirball Center For The Performing Arts, which is where New York Tech Meetup takes place every month.



The Liquid Galaxy segment, which can be viewed above, features Ben Goldstein and me talking with complementary visuals playing at the same time.

Ben Goldstein opens the segment by talking about how the Liquid Galaxy is a panoramic system that engages your peripheral vision, and is immersive in that way.

I go on to add that the system consists of large paneled screens set up in an arc around the viewer. The Liquid Galaxy includes a touchscreen and a 3D joystick that allow users to fly around the world. From there, with End Point's Content Management System, users can add images, videos, KML, and other overlays to add interactivity and build custom presentations on the system. Thus far, the Liquid Galaxy has been particularly popular in real estate, museums, aquariums, research libraries, hospitality, and travel.

Ben concludes the segment by talking about how the system kind of plays "follow the leader". Navigation occurs on the central display, while the other displays are configured at appropriate geometric offsets. The other displays pull down their appropriate section of the world so that the viewer can see the world in an immersive panoramic view all at once.

We hope you enjoy our segment!


Scrape web content with PHP (no API? no problem)

Introduction

There is a lot of data flowing everywhere: unstructured, not particularly useful pieces of data moving here and there. Capturing, structuring, and processing this data can be really expensive. There are companies making billions of dollars just by scraping web content and showing it in a nice form.

Another reason for doing this is, for example, the lack of an API on a source website. In that case, scraping is the only way to get the data that you need to process.

Today I will show you how to get web data using PHP, and that it can be as easy as pie.

Just do it

There are multiple scraping libraries ready to use. I can recommend one of them: PHP Simple HTML DOM Parser. It's extremely easy to start with, the initial cost is almost nothing, and it's open source too.

First, download the library from its official site: http://sourceforge.net/project/showfiles.php?group_id=218559. You can use a Composer version too; it's here: https://github.com/sunra/php-simple-html-dom-parser.

Let's say that you have downloaded this file already. It's just a single PHP file called simple_html_dom.php. Create a new PHP file called scraper.php and include the library like this:

<?php

require('simple_html_dom.php');

In our example, we will scrape the top 10 trending YouTube videos and create a nice array of names and links out of them. We will use this link: https://www.youtube.com/feed/trending?gl=GB.

We need to grab this page first. Using PHP, it's just one additional line in our script:

<?php

require('simple_html_dom.php');

// Create DOM from URL or file
$html = file_get_html('https://www.youtube.com/feed/trending?gl=GB');

This creates a PHP object representing the structure of the YouTube page.

Next, look at the YouTube page to find the repeating structure used for the list of videos. It's best to use Chrome's developer tools and its HTML inspector. At the time of writing this post (it can change in the future, of course) it looks like this:

<ul class="expanded-shelf-content-list has-multiple-items">
 <li class="expanded-shelf-content-item-wrapper">...</li>
 <li class="expanded-shelf-content-item-wrapper">...</li>
 <li class="expanded-shelf-content-item-wrapper">...</li>
 ...
</ul>

Thanks, Google! This time it will be easy. Sometimes the structure of a page lacks classes and ids, and it's more difficult to select exactly what we need.

Now, for each expanded-shelf-content-item-wrapper item, we need to find its title and URL. Using the developer tools again, this is easy to achieve:

<a 
 class="yt-uix-sessionlink yt-uix-tile-link yt-ui-ellipsis yt-ui-ellipsis-2 spf-link " 
 dir="ltr" 
 aria-describedby="description-id-284683" 
 title="KeemStar Swatted My Friend." 
 href="/watch?v=oChvoP8zEBw">
 KeemStar Swatted My Friend
</a>

Jackpot! We have both things that we need in the same HTML tag. Now, let's grab this data:

<?php

require('simple_html_dom.php');

// Create DOM from URL or file
$html = file_get_html('https://www.youtube.com/feed/trending?gl=GB');

// creating an array of elements
$videos = [];

// Find top ten videos
$i = 1;
foreach ($html->find('li.expanded-shelf-content-item-wrapper') as $video) {
        if ($i > 10) {
                break;
        }

        // Find item link element 
        $videoDetails = $video->find('a.yt-uix-tile-link', 0);

        // get title attribute
        $videoTitle = $videoDetails->title;

        // get href attribute
        $videoUrl = 'https://youtube.com' . $videoDetails->href;

        // push to a list of videos
        $videos[] = [
                'title' => $videoTitle,
                'url' => $videoUrl
        ];

        $i++;
}

var_dump($videos);

Look, it's as simple as using CSS selectors. What did we just do? First, we extracted all the video items and started looping through them here:

foreach ($html->find('li.expanded-shelf-content-item-wrapper') as $video) {

Then we extracted the title and URL of each video item here:

// Find item link element 
$videoDetails = $video->find('a.yt-uix-tile-link', 0);

// get title attribute
$videoTitle = $videoDetails->title;

At the end, we push an array with the scraped data onto our list of videos and dump it. The result looks like this:

array(10) {
  [0]=>
  array(2) {
    ["title"]=>
    string(90) "Enzo Amore & Big Cass help John Cena even the odds against The Club: Raw, July 4, 2016"
    ["url"]=>
    string(39) "https://youtube.com/watch?v=940-maoRY3c"
  }
  [1]=>
  array(2) {
    ["title"]=>
    string(77) "Loose Women Reveal Sex Toys Confessions In Hilarious Discussion | Loose Women"
    ["url"]=>
    string(39) "https://youtube.com/watch?v=Xxzy_bZwNcI"
  }
  [2]=>
  array(2) {
    ["title"]=>
    string(51) "Tinie Tempah - Mamacita ft. Wizkid (Official Video)"
    ["url"]=>
    string(39) "https://youtube.com/watch?v=J4GQxzUdZNo"
  }
  [3]=>
  array(2) {
    ["title"]=>
    string(54) "Michael Gove's Shows you What's Under his Kilt"
    ["url"]=>
    string(39) "https://youtube.com/watch?v=GIpVBLDky30"
  }
  [4]=>
  array(2) {
    ["title"]=>
    string(25) "Deception, Lies, and CSGO"
    ["url"]=>
    string(39) "https://youtube.com/watch?v=_8fU2QG-lV0"
  }
  [5]=>
  array(2) {
    ["title"]=>
    string(68) "Last Week Tonight with John Oliver: Independence Day (Web Exclusive)"
    ["url"]=>
    string(39) "https://youtube.com/watch?v=IQwMCQFgQgo"
  }
  [6]=>
  array(2) {
    ["title"]=>
    string(21) "Last Week I Ate A Pug"
    ["url"]=>
    string(39) "https://youtube.com/watch?v=TTk5uQL2oO8"
  }
  [7]=>
  array(2) {
    ["title"]=>
    string(59) "PEP GUARDIOLA VS NOEL GALLAGHER | Exclusive First Interview"
    ["url"]=>
    string(39) "https://youtube.com/watch?v=ZWE8qkmhGmc"
  }
  [8]=>
  array(2) {
    ["title"]=>
    string(78) "Skins, lies and videotape - Enough of these dishonest hacks. [strong language]"
    ["url"]=>
    string(39) "https://youtube.com/watch?v=8z_VY8KZpMU"
  }
  [9]=>
  array(2) {
    ["title"]=>
    string(62) "We Are America ft. John Cena | Love Has No Labels | Ad Council"
    ["url"]=>
    string(39) "https://youtube.com/watch?v=0MdK8hBkR3s"
  }
}

Isn't it easy?

The end

I have some advice if you want this kind of script to process the same page regularly (a short sketch of the first two points follows the list):

  • set the user agent header to simulate a real web browser request,
  • make calls with a random delay to avoid blacklisting from a web server,
  • use PHP 7,
  • try to optimize the script as much as possible.
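
Here is a rough sketch of the first two points: we send a custom User-Agent header via a stream context and sleep a random number of seconds between requests. The header string and delay range are arbitrary choices; str_get_html comes from the same simple_html_dom.php library.

<?php

require('simple_html_dom.php');

// Pretend to be a regular browser by sending a User-Agent header
$context = stream_context_create([
    'http' => [
        'header' => "User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36\r\n"
    ]
]);

// Wait a random 2-10 seconds between requests to avoid hammering the server
sleep(rand(2, 10));

// Fetch the page with the custom context, then parse it as before
$rawHtml = file_get_contents('https://www.youtube.com/feed/trending?gl=GB', false, $context);
$html = str_get_html($rawHtml);
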
You can use this script in production code but, to be honest, it's not the most optimal approach. If you are not satisfied with it, code your own :-).

Nice documentation is located here: http://simplehtmldom.sourceforge.net/

WAL-based Estimates For When a Record Was Changed

I originally titled this: Inferring Record Timestamps by Analyzing PITR Streams for Transaction Commits and Cross-Referencing Tuple xmin Values. But that seemed a little long, though it does sum up the technique.

In other words, it's a way to approximate an updated_at timestamp column for your tables when you didn't have one in the first place.

PostgreSQL stores the timestamp of a transaction's commit in the transaction log. If you have a hot standby server, you can see the value for the most recently applied transaction as the output of the pg_last_xact_replay_timestamp() function. That's useful for estimating replication lag, but I hadn't seen any other uses for it, at least until I came up with the hypothesis that all the available values could be extracted wholesale and matched with the transaction IDs stored along with every record.

If you're on 9.5, there's the track_commit_timestamps setting in postgresql.conf, which, combined with the pg_xact_commit_timestamp(xid) function, gives a similar result. But it can't be turned on retroactively.
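
For reference, the 9.5 approach looks roughly like this (a sketch; the table name is made up, and only transactions committed after the setting takes effect get a timestamp):

## In postgresql.conf (requires a server restart):
track_commit_timestamps = on

## Then, for rows whose transactions committed after the setting took effect:
SELECT pg_xact_commit_timestamp(xmin), * FROM some_table;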

The WAL-based technique described here can, sort of, so long as you still have those transaction logs. If you're doing Point-In-Time Recovery you're likely to have at least some of them, especially the more recent ones.

I tested this technique on a pgbench database on stock PostgreSQL 9.4, apart from the following postgresql.conf settings that (sort of) turn on WAL archival -- or at least make sure the WAL segments are kept around:

wal_level = archive
archive_mode = on
archive_command = '/bin/false'

We'll be using the pg_xlogdump binary to parse those WAL segments, available from 9.3 on. If you're on an earlier version, the older xlogdump code will work.

Once pgbench has generated some traffic, it's time to see what's contained in the WAL segments we have available. Since I have them all, I went all the way back to the beginning.

$ pg_xlogdump -p pg_xlog/ --start 0/01000000 --rmgr=Transaction
rmgr: Transaction len (rec/tot):     12/    44, tx:          3, lsn: 0/01006A58, prev 0/01005CA8, bkp: 0000, desc: commit: 2016-05-15 22:32:32.593404 EDT
rmgr: Transaction len (rec/tot):     12/    44, tx:          4, lsn: 0/01008BC8, prev 0/01008A60, bkp: 0000, desc: commit: 2016-05-15 22:32:32.664374 EDT
rmgr: Transaction len (rec/tot):     12/    44, tx:          5, lsn: 0/01012EA8, prev 0/01012E58, bkp: 0000, desc: commit: 2016-05-15 22:32:32.668813 EDT
(snip)
rmgr: Transaction len (rec/tot):     12/    44, tx:       1746, lsn: 0/099502D0, prev 0/099501F0, bkp: 0000, desc: commit: 2016-05-15 22:55:12.711794 EDT
rmgr: Transaction len (rec/tot):     12/    44, tx:       1747, lsn: 0/09951530, prev 0/09951478, bkp: 0000, desc: commit: 2016-05-15 22:55:12.729122 EDT
rmgr: Transaction len (rec/tot):     12/    44, tx:       1748, lsn: 0/099518D0, prev 0/099517F0, bkp: 0000, desc: commit: 2016-05-15 22:55:12.740823 EDT
pg_xlogdump: FATAL:  error in WAL record at 0/99518D0: record with zero length at 0/9951900

The last line just indicates that we've hit the end of the transaction log records; it's written to stderr, so it can be ignored. Otherwise, that output contains everything we need; we just need to shift the components around so we can read it back into Postgres. Something like this did the trick for me, and let me import it directly:

$ pg_xlogdump -p pg_xlog/ --start 0/01000000 --rmgr=Transaction | awk -v Q=\' '{sub(/;/, ""); print $8, Q$17, $18, $19Q}' > xids

postgres=# CREATE TABLE xids (xid xid, commit timestamptz);
CREATE TABLE
postgres=# \copy xids from xids csv
COPY 1746

At which point it's a simple join to pull in the commit timestamp records:

postgres=# select xmin, aid, commit from pgbench_accounts inner join xids on pgbench_accounts.xmin = xids.xid;
 xmin |  aid   |            commit             
------+--------+-------------------------------
  981 | 252710 | 2016-05-15 22:54:34.03147-04
 1719 | 273905 | 2016-05-15 22:54:35.622406-04
 1183 | 286611 | 2016-05-15 22:54:34.438701-04
 1227 | 322132 | 2016-05-15 22:54:34.529027-04
 1094 | 331525 | 2016-05-15 22:54:34.26477-04
 1615 | 383361 | 2016-05-15 22:54:35.423995-04
 1293 | 565018 | 2016-05-15 22:54:34.688494-04
 1166 | 615272 | 2016-05-15 22:54:34.40506-04
 1503 | 627740 | 2016-05-15 22:54:35.199251-04
 1205 | 663690 | 2016-05-15 22:54:34.481523-04
 1585 | 755566 | 2016-05-15 22:54:35.368891-04
 1131 | 766042 | 2016-05-15 22:54:34.33737-04
 1412 | 777969 | 2016-05-15 22:54:34.953517-04
 1292 | 818934 | 2016-05-15 22:54:34.686746-04
 1309 | 940951 | 2016-05-15 22:54:34.72493-04
 1561 | 949802 | 2016-05-15 22:54:35.320229-04
 1522 | 968516 | 2016-05-15 22:54:35.246654-04