When I work on SEO reports, I often start by looking at the pages indexed in Google. I just want a simple list of the URLs indexed by the *GOOG*. I use this list to get a general idea of the site's navigation, look for duplicate content, and get initial counts of the different types of pages indexed.
Yesterday, I finally got around to figuring out a command-line solution to generate this indexation list. Here's how it works, using http://www.endpoint.com/ as an example:
Step 1
Grab the search results using the "site:" operator and make sure you run an advanced search that shows 100 results. The URL will look something like:
http://www.google.com/search?num=100&as_sitesearch=www.endpoint.com
It will likely have lots of other query parameters, but those matter less to us. Save the search results page as search.html.
Step 2
Run the following command:
sed 's/<h3 class="r">/\n/g; s/class="l"/LINK\n/g' search.html | grep LINK | sed 's/<a href="\|" LINK//g'
There you have it. Interestingly enough, the order of pages can be an indicator of which pages rank well. Typically, pages with higher PageRank will be near the top, although I have seen some strange exceptions. End Point's indexed pages:
http://www.endpoint.com/
http://www.endpoint.com/clients
http://www.endpoint.com/team
http://www.endpoint.com/services
http://www.endpoint.com/sitemap
http://www.endpoint.com/contact
http://www.endpoint.com/team/selena_deckelmann
http://www.endpoint.com/team/josh_tolley
http://www.endpoint.com/team/steph_powell
http://www.endpoint.com/team/ethan_rowe
http://www.endpoint.com/team/greg_sabino_mullane
http://www.endpoint.com/team/mark_johnson
http://www.endpoint.com/team/jeff_boes
http://www.endpoint.com/team/ron_phipps
http://www.endpoint.com/team/david_christensen
http://www.endpoint.com/team/carl_bailey
http://www.endpoint.com/services/spree
...
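To see what the Step 2 pipeline does at each stage, here is a small sketch that runs it on a hand-written sample. The sample markup is an assumption modeled on the two patterns the command matches (<h3 class="r"> and class="l"); it is not copied from a real results page. The function name extract_links is made up for illustration, and the \n and \| syntax assumes GNU sed.

```shell
#!/bin/bash
# Sketch only: the same sed | grep | sed pipeline as Step 2, wrapped in a
# hypothetical helper so it can be tried on a sample file. Assumes GNU sed.
extract_links() {
  # 1) break the page into one line per result, tagging link lines with LINK
  # 2) keep only the tagged lines
  # 3) strip the markup on either side of the URL
  sed 's/<h3 class="r">/\n/g; s/class="l"/LINK\n/g' "$1" \
    | grep LINK \
    | sed 's/<a href="\|" LINK//g'
}

# Hand-written stand-in for search.html (assumed markup, two results).
sample=$(mktemp)
cat > "$sample" <<'EOF'
<div><h3 class="r"><a href="http://www.endpoint.com/" class="l">End Point</a></h3></div><div><h3 class="r"><a href="http://www.endpoint.com/team" class="l">Team</a></h3></div>
EOF

extract_links "$sample"
```

If the markup of the results page changes, the two patterns in the first sed call are the part that would need adjusting.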
For the site I examined yesterday, I saved the pages as one.html, two.html, three.html and four.html because the site had about 350 results. I wrote a simple script to concatenate all the results:
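#!/bin/bash
# Start fresh; -f avoids an error when results.txt does not exist yet.
rm -f results.txt
# Quote "$@" and "$ARG" so file names containing spaces are handled correctly.
for ARG in "$@"
do
  sed 's/<h3 class="r">/\n/g; s/class="l"/LINK\n/g' "$ARG" | grep LINK | sed 's/<a href="\|" LINK//g' >> results.txt
done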
And I called the script above with:
./list_google_index.sh one.html two.html three.html four.html
This solution is neither scalable nor particularly elegant, but it's good for a quick and dirty list of pages indexed by the *GOOG*. I've worked with the WWW::Google::PageRank module before, and there are restrictions on API request limits and frequency, so I would strongly advise against writing a script that makes repeated requests to Google. I'll likely use the script described above for sites with fewer than 1,000 pages indexed. There may be other solutions out there to list pages indexed by Google, but as I said, I was going for a quick and dirty approach.
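Once results.txt exists, a rough count of the different types of pages indexed can be had by grouping the URLs on their first path segment. This is only a sketch: the function name count_sections and the sample data are made up for illustration, and it assumes one URL per line in the http://host/section/page shape shown earlier.

```shell
#!/bin/bash
# Sketch only: tally URLs by first path segment. Splitting
# "http://host/section/page" on "/" puts the section in field 4;
# the home page has no section, so it is labeled "(home)".
count_sections() {
  awk -F/ '{ s = ($4 == "" ? "(home)" : $4); n[s]++ }
           END { for (s in n) print n[s], s }' "$1" | sort -rn
}

# Hypothetical sample standing in for results.txt.
sample=$(mktemp)
cat > "$sample" <<'EOF'
http://www.endpoint.com/
http://www.endpoint.com/clients
http://www.endpoint.com/team
http://www.endpoint.com/team/selena_deckelmann
http://www.endpoint.com/team/josh_tolley
http://www.endpoint.com/services
http://www.endpoint.com/services/spree
EOF

count_sections "$sample"
```

Running sort results.txt | uniq -d beforehand is a quick way to spot any URL that shows up more than once across the saved pages.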
Remember not to get eaten by the Google Monster
Learn more about End Point's technical SEO services.

8 comments:
I'd suggest using curl and bash's {} grouping operators to stream content to sed like this. You could also check out my post on poor man's concurrency with bash to run several of these processes at once.
{ for i in http://www.google.com http://www.backcountry.com;do curl $i;done; } | sed -e '/stuff/d'
Hi Shane,
Thanks for the suggestion. However, running:
{ for i in "http://www.google.com/search?num=100&as_sitesearch=www.endpoint.com"; do wget $i; done; } | ...
or
{ for i in "http://www.google.com/search?num=100&as_sitesearch=www.endpoint.com"; do curl $i; done; } | ...
triggers a 403 (forbidden) response, so it would require some hacking to get around forbidden script requests to Google. Perhaps I'll play around with the User Agent settings and find a way to successfully make curl requests in the future.
~Steph
Good point. Google seems to require the user agent string. Using something as simple as
curl -A 'mozilla' 'http://www.google.com/search?num=100&as_sitesearch=www.endpoint.com'
seems to work. The Linkscape API is also a great tool for this sort of thing.
Yes, I've been thinking about working with the Linkscape API. According to the API docs, from the free (limited) API, you can grab:
* the mozRank of the page requested
* the number of external, juice-passing links
* the subdomain mozrank
* The total number of links (coming soon!)
* Domain Authority (coming soon!)
* Page Authority (coming soon!)
* The top 500 links sorted by Page Authority (coming soon!)
* The top 3 linking domains sorted by Domain Authority (coming soon!)
* The top 3 anchor texts to the site or page (coming soon!)
I would love to integrate the Linkscape API into my SEO workflow.
This may be a stupid question, but how do I run the command?
Is it done through the browser?
Worked great just by typing the command where you would otherwise enter the URL address in your browser window.
Hi Steph,
I ran the sed command in the terminal but don't know how to view the list of URLs you showed.
Would you have any advice for a novice???
Thanks,
Daniel
Reiki,
The results of the sed command should output directly into the terminal. If you want, you can output them into a file by appending "> filename" to the end of the command, and using a text editor (vi, emacs, notepad, gedit, etc.) to read the file.
~Steph