List Google Pages Indexed for SEO: Two Step How To

Whenever I work on SEO reports, I start by looking at the pages indexed in Google. I just want a simple list of the URLs indexed by the *GOOG*. I usually use this list to get a general idea of the site's navigation, look for duplicate content, and examine initial counts of the different types of pages indexed.

Yesterday, I finally got around to figuring out a command-line solution to generate this indexation list. Here's how it works, using End Point's site as an example:

Step 1

Grab the search results using the "site:" operator and make sure you run an advanced search that shows 100 results. The URL will look something like:
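
For illustration, with example.com standing in for the site being audited (q carries the site: query and num=100 asks for 100 results per page):

http://www.google.com/search?q=site:example.com&num=100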

But it will likely have lots of other query parameters of lesser importance [to us]. Save the search results page as search.html.

Step 2

Run the following command:

sed 's/<h3 class="r">/\n/g; s/class="l"/LINK\n/g' search.html | grep LINK | sed 's/<a href="\|" LINK//g' 
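
In case the one-liner is hard to parse, here is the same pipeline split across lines with a comment on each stage (the "r" and "l" class names are simply what Google's result markup used at the time, so they may change):

# put each result heading on its own line and mark each result link with LINK
sed 's/<h3 class="r">/\n/g; s/class="l"/LINK\n/g' search.html |
# keep only the marked lines
grep LINK |
# strip the leading <a href=" and the trailing " LINK, leaving just the URL
sed 's/<a href="\|" LINK//g'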

There you have it. Interestingly enough, the order of pages can be an indicator of which pages rank well. Typically, pages with higher PageRank will be near the top, although I have seen some strange exceptions. End Point's indexed pages:

For the site I examined yesterday, I saved the pages as one.html, two.html, three.html and four.html because the site had about 350 results. I wrote a simple script to concatenate all the results:


# Pull the indexed URLs out of each saved results page and append them to results.txt
rm -f results.txt

for ARG in "$@"
do
        sed 's/<h3 class="r">/\n/g; s/class="l"/LINK\n/g' "$ARG" | grep LINK | sed 's/<a href="\|" LINK//g' >> results.txt
done

And I called the script above with:

./ one.html two.html three.html four.html
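
To sanity-check the output against the roughly 350 results Google reported, a quick line count of results.txt (the file the script writes) does the trick:

wc -l results.txt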

This solution isn't scalable, nor is it particularly elegant, but it's good for a quick and dirty list of pages indexed by the *GOOG*. I've worked with the WWW::Google::PageRank module before, and there are restrictions on API request limits and frequency, so I would highly advise against writing a script that makes repeated requests to Google. I'll likely use the script described above for sites with fewer than 1,000 pages indexed. There may be other solutions out there for listing pages indexed by Google, but as I said, I was going for a quick and dirty approach.

Remember not to get eaten by the Google Monster

Learn more about End Point's technical SEO services.


Shane M Hansen said...

I'd suggest using curl and bash's {} grouping operators to stream content to sed like this. You could also check out my post on poor man's concurrency with bash to run several of these processes at once.

{ for i in;do curl $i;done; } | sed -e '/stuff/d'

Steph Powell said...

Hi Shane,

Thanks for the suggestion. However, running:

{ for i in ""; do wget $i; done; } | ...

or

{ for i in ""; do curl $i; done; } | ...

triggers a 403 (forbidden) response, so it would require some hacking to get around forbidden script requests to Google. Perhaps I'll play around with the User Agent settings and find a way to successfully make curl requests in the future.


Shane M Hansen said...

Good point. Google seems to require a user agent string. Using something as simple as
curl -A 'mozilla' ''

seems to work. The Linkscape API is a great tool for this sort of thing, too.
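
Putting the two ideas together, something like this ought to work, assuming example.com stands in for the site being audited and Google's markup still matches the patterns in the post:

curl -A 'mozilla' 'http://www.google.com/search?q=site:example.com&num=100' | sed 's/<h3 class="r">/\n/g; s/class="l"/LINK\n/g' | grep LINK | sed 's/<a href="\|" LINK//g'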

Steph Powell said...

Yes, I've been thinking about working with the Linkscape API. According to the API docs, from the free (limited) API, you can grab:
* the mozRank of the page requested
* the number of external, juice-passing links
* the subdomain mozRank
* the total number of links (coming soon!)
* Domain Authority (coming soon!)
* Page Authority (coming soon!)
* the top 500 links sorted by Page Authority (coming soon!)
* the top 3 linking domains sorted by Domain Authority (coming soon!)
* the top 3 anchor texts to the site or page (coming soon!)

I would love to integrate the Linkscape API into my SEO workflow.

SEO Melbourne said...

This may be a stupid question, but how do I run the command?

Is it done through the browser?

Robert said...

Worked great just by typing the command where you would otherwise enter the URL address in your browser window.

Reiki Vancouver said...

Hi Steph,

I ran the sed command in the terminal but don't know how to view the output of the URLs you showed.

Would you have any advice for a novice???


Steph Skardal said...


The results of the sed command should print directly to the terminal. If you want, you can send them to a file by appending "> filename" to the end of the command and then use a text editor (vi, emacs, notepad, gedit, etc.) to read the file.
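
For example, with the command from the post (urls.txt is just an arbitrary filename):

sed 's/<h3 class="r">/\n/g; s/class="l"/LINK\n/g' search.html | grep LINK | sed 's/<a href="\|" LINK//g' > urls.txt
vi urls.txt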