When I work on SEO reports, I often start by looking at the pages indexed in Google. I just want a simple list of the URLs indexed by the *GOOG*. I use this list to get a general sense of navigation, look for duplicate content, and examine initial counts of the different types of pages indexed.
Yesterday, I finally got around to figuring out a command line solution to generate this indexation list. Here's how to do it from the command line, using http://www.endpoint.com/ as an example:
Grab the search results using the "site:" operator and make sure you run an advanced search that shows 100 results. The URL will look something like: http://www.google.com/search?num=100&as_sitesearch=www.endpoint.com
But it will likely have lots of other query parameters of lesser importance [to us]. Save the search results page as search.html.
Run the following command:
sed 's/<h3 class="r">/\n/g; s/class="l"/LINK\n/g' search.html | grep LINK | sed 's/<a href="\|" LINK//g'
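The same extraction can be done in a slightly more direct way with grep -o, which prints only the matched portion of each line. This is a sketch under the same assumption the sed version makes: that Google's result markup wraps each link as &lt;a href="URL" class="l"&gt; inside an &lt;h3 class="r"&gt; heading.

```shell
# Pull out each result anchor, then strip the surrounding markup.
# Assumes the saved page marks result links with class="l", as above.
grep -o '<a href="[^"]*" class="l"' search.html | sed 's/<a href="//; s/" class="l"//'
```

Either version works on the saved search.html; grep -o just avoids the intermediate LINK markers.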
There you have it. Interestingly enough, the order of pages can be an indicator of which pages rank well. Typically, pages with higher PageRank will be near the top, although I have seen some strange exceptions. End Point's indexed pages:
http://www.endpoint.com/
http://www.endpoint.com/clients
http://www.endpoint.com/team
http://www.endpoint.com/services
http://www.endpoint.com/sitemap
http://www.endpoint.com/contact
http://www.endpoint.com/team/selena_deckelmann
http://www.endpoint.com/team/josh_tolley
http://www.endpoint.com/team/steph_powell
http://www.endpoint.com/team/ethan_rowe
http://www.endpoint.com/team/greg_sabino_mullane
http://www.endpoint.com/team/mark_johnson
http://www.endpoint.com/team/jeff_boes
http://www.endpoint.com/team/ron_phipps
http://www.endpoint.com/team/david_christensen
http://www.endpoint.com/team/carl_bailey
http://www.endpoint.com/services/spree
...
For the site I examined yesterday, I saved the pages as one.html, two.html, three.html and four.html because the site had about 350 results. I wrote a simple script to concatenate all the results:
#!/bin/bash
# Append the URLs extracted from each saved results page to results.txt.
rm -f results.txt
for ARG in "$@"
do
    sed 's/<h3 class="r">/\n/g; s/class="l"/LINK\n/g' "$ARG" | grep LINK | sed 's/<a href="\|" LINK//g' >> results.txt
done
And I called the script above with:
./list_google_index.sh one.html two.html three.html four.html
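Once results.txt exists, a couple of quick checks give a rough indexation count and catch any URLs that appeared on more than one results page (results.txt here is the file the script above writes):

```shell
# Total URLs collected across all saved results pages.
wc -l < results.txt

# Unique URLs, in case the same page showed up twice.
sort -u results.txt | wc -l
```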
This solution is neither scalable nor particularly elegant, but it's good for a quick and dirty list of pages indexed by the *GOOG*. I've worked with the WWW::Google::PageRank module before, and there are restrictions on API request limits and frequency, so I would strongly advise against writing a script that makes repeated requests to Google. I'll likely use the script described above for sites with fewer than 1,000 pages indexed. There may be other solutions out there for listing pages indexed by Google, but as I said, I was going for a quick and dirty approach.
Remember not to get eaten by the Google Monster
Learn more about End Point's technical SEO services.