The object being observed is always changed
by the instrument of observation.
- Unknown (paraphrased by SlingShot)
The quote at the beginning of this page could not
be more relevant to this study of the American Road Cycling
usage logs. I read something similar more than 30 years ago, but
I really gained a greater appreciation for it
during this review of web site traffic. I
believe my paraphrase does the thought justice, but that may just be
my eyes making it look better to me.
The quote's relevance to the study of the logs comes
from the fact that when I first started the
ATTENDANCE RECORDS, I had
only the barest evidence that anybody was even reading the
American Road Cycling web site at all. Then after I started keeping records,
it appeared that the act of keeping and publishing them had itself become an
attraction, so I cannot be sure how much of my observation was based
on typical browsing patterns, or how much the traffic was skewed by publishing those patterns themselves.
Previously, a few people had mentioned to me that
American Road Cycling had become
all the rage, but I assumed this was just the perception of a few
people who all knew each other, and who were
merely reinforcing this belief by talking amongst themselves. I
figured there were only about a half
dozen somewhat regular readers, or a baker's dozen tops.
Most of the feedback was about how everybody
really loved
ROAD RASH COMICS.
It was a sentiment I totally agreed with and
considered Road Rash to be the only truly excellent part of American Road
Cycling, making it all worth the effort, a serendipitous boon
which made publishing the site easy.
I originally began the site as a practical
study to: 1) Show
Paul Latrine
how quickly a fully functional
web site can be put together using "light
structures" (my own term), and 2) add support to my general thesis
that all things Internet are mostly a waste of time for somebody actually
involved in doing something significant in the real world.
Almost as an aside, there was also that little matter of
one of my newsletter articles being
CENSORED BY THE TALIBAN,
then afterward, during my protest, my being personally harassed by parts of the OCBC
leadership. This censorship and harassment became the focal point of
early American Road Cycling activity. Happily, all harassment
has stopped, and the web site has supplanted the Taliban's censorship with
freedom of nonsense.
But back to the matter, that little bit of feedback I
started receiving almost immediately on publishing American Road Cycling,
even such a small amount, was
vastly more than I had received in the previous decade of my
constructing and publishing three dozen other
web sites. Those sites began in 1993, the year
the World Wide Web began.
Those other sites were done gratis just like American Road Cycling.
The exceptions were
Endico (which foots the bill here) and
Equipoise, which wasn't meant to be gratis; it just turned out that
way.
The earliest of these sites were established on my belief
about "what could be" (or rather, what I knew surely would be), while later
ones were put together as caveats to help other people understand what the
Internet is totally incapable of providing, despite widespread hype to the
contrary.
In light of my previous online experiences, the
amount of feedback that was coming in about American Road Cycling
was astonishing.
The final trigger that made me take a closer look was
after I mentioned Terry Bowden on the home page, and he reported back
very soon after that someone had told him he was now famous on the
Internet. I had not known Terry before the mentioning, and did not
know (even by name) the person he reported as saying something to him
about it. It was clearly time to take a closer look at how many
people were actually viewing the site. I still figured it was about
6 to 13 people tops.
Hit Counter: Before reading on,
go back to the home page, pull the page down till the
Hit Counter is in the middle of the screen, then hit your
browser's Refresh button several times. You will notice the Hit
Counter is going backwards, starting from some astronomical
number that I put in myself. The hit counter has been functioning
this way for most of the life of American Road Cycling, but
Grant Salter
is the only person who has ever mentioned noticing
it.
The enormous number of "hits" aside, after
speaking to Terry, and after Lynn Meyer reported this guy
Dan (Palletman) McNeilly
was also reading American Road Cycling a lot,
I merely hoped to begin a study to confirm that about a half dozen people were showing up somewhat
regularly, so I started taking a close look at the log files.
It is not an easy task to decide what the data
contained in a raw log file means. Plus, from the little I knew
about it, I could tell that all the supposed "Web Traffic Reports"
that were commercially available were reporting nonsense to the
people who relied on them. These pitiful reports are a good thing
for people selling web space, because they allow them to grossly overstate
the amount of traffic a site is receiving. People are so accustomed
to doing advertising that cannot be effectively tracked, they don't
question these web usage reports very much.
So I developed my own system and, in that process,
found that it is even harder than I expected to get the truth out.
If not for the number of actual human beings I knew were visiting
the site, I would still be in the dark about it.
Here's how I figured it out.
First, a question. Do you want to know what kind of stuff your
government is trying to get Google to cough up?
TAKE A LOOK AT A RAW DATA FILE.
The file linked above contains the American Road
Cycling log entries
for only the date 01/27/06. That day was chosen because it had the lowest
traffic recorded for the month, but enough complexity remains to make the point.
Security sensitive numbers have been redacted, so be aware that the
uncensored version looks even more complex. Scroll down the
log entries while thinking about the character Cypher (Joe
Pantoliano) as he watches the raw data from The Matrix and
says to Neo, "Why oh why didn't I take the BLUE pill?"
Why indeed.
Fortunately I knew some of the actual people who
were attached to some of the IP#'s in the logs. IP stands for
Internet Protocol, and the numbers are in this
format: xxx.xxx.xxx.xxx
Every person on the web has at least one IP#; it is
the only way a connection can be made. Somewhere there is a data
table that links each and every IP# to the person who is using it. That's how
Bad Boys become Caught Boys.
There are cloaking technologies, but everything made
can be broken (though not necessarily by me), so I was extremely
excited to get the numbers of "known" browsers who were returning
on a regular basis. Really, there is no way to explain to somebody
who doesn't already know what a wonderful opportunity this was.
I had been waiting for more than a decade.
Still, just knowing the numbers isn't enough. There
is a great deal of complexity that can be removed by importing the
raw data into an Access database, and establishing filters to get
rid of noise. Such as the following:
Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html)
This means the "person" connecting is actually a
Google web search bot; that is to say, a web robot that is
automatically scanning the Internet gathering information and
content from pages. Most of the traffic on a web server comes from
such bots, not from humans. So once a bot is identified as "well
behaved" (meaning it doesn't do bad stuff like download whole sites
over and over, fill the logs with spam, steal images and other
things you don't want stolen, etc.), one can filter its entries from
the log file, and that makes the log much more readable.
Establishing good filters is an iterative process, and
it takes constant tweaking. Again, it is almost impossible to do
without having good references at the outset regarding "who's a bot, and who's not."
It is really a great help to be able to look at a number and say, "Oh,
that's just Lynn, it's ok. She's allowed to hit
Mary Ellen's Birthday Countdown as many times as she wants. No
problem."
Once you know a few peoples' IP#'s, you can start
filtering using other criteria such as, "drop the log entries
for all the images which get
loaded every time somebody hits a page." You can get rid of things
like:
/images/weblogo.jpg
/images/new.gif
/images/t_00145b.jpg
/_vti_bin/fpcount.exe/ Page=index.htm|Image=4|Digits=10
/images/patriot.jpg
/images/t_support.jpg
/images/t_ID.jpg
/images/poster.gif
...which just happen to be the images that are
loaded (plus the hit counter decrement) every time somebody accesses
the American Road Cycling home page. You take this stuff out
of the report, and it gets a lot easier to read. But finding which
things are important to see, and which things should be left out of
the report, takes a lot of time, trial, and error.
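In Python terms, this second pass amounts to nothing more than a prefix check. This is my illustration, not the actual Access query, and the page names other than those listed above are made up for the example:

```python
# Sketch of the image filter described above: drop requests for page
# furniture that loads with every page view. The prefixes come from
# the home-page list; "/roadrash.htm" is an invented page name.
NOISE_PREFIXES = ("/images/", "/_vti_bin/fpcount.exe")

def is_noise(path):
    """True for images and the hit-counter call, which say nothing
    about which page a person chose to read."""
    return path.startswith(NOISE_PREFIXES)

requested = [
    "/index.htm",
    "/images/weblogo.jpg",
    "/images/new.gif",
    "/_vti_bin/fpcount.exe/ Page=index.htm|Image=4|Digits=10",
    "/roadrash.htm",
]
pages = [p for p in requested if not is_noise(p)]
print(pages)  # ['/index.htm', '/roadrash.htm']
```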
Here is my current full list of filters:
ACCESS QUERY LOG FILTERS
Once again,
security sensitive numbers have been redacted. You'll also note that
several shortcuts have been used to filter whole types of entries. I
really shouldn't publish this list, because it can give someone who
has mal-intent a shortcut to finding how best to hide from
SlingShot's eagle eye.
I also use numerous filters with automated sorting,
deleting and renaming macros to get the raw data into an ever more
readable format even before applying the filters above. But that's
beyond the scope of this discussion.
Here's a picture of the final screen I use for logs
review:
LOGS TRACKER
The top list is the filtered log events. The bottom
list holds the copy of selected data for tracking; it is where I
keep track of observed IP#'s and the browsing behaviors they
exhibit. If it is an Unknown Viewer, 3 return visits of
human-like behavior get it a UV number. UV11 is circled
in red on the screen shot.
Tucked between the two lists is the note clipboard I use to keep
track of what point in the logs I have reviewed, such as ARC,
209.210.33.75, 01/30/06, 23:58:22. That way I know where to
start the next review.
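The Unknown Viewer rule lends itself to a short sketch. The Python below is purely an illustration of the bookkeeping described above, not the actual Access mechanism; the IP# and the starting number are invented for the example:

```python
# Illustration of the Unknown Viewer rule: 3 return visits of
# human-like behavior earn an IP# a UV number. The IP# and the
# starting number here are invented for the example.
from collections import defaultdict

human_visits = defaultdict(int)  # IP# -> human-like visits observed
uv_numbers = {}                  # IP# -> assigned UV label
next_uv = 11

def record_visit(ip, looks_human):
    """Count a visit; assign a UV number at the third human-like one."""
    global next_uv
    if looks_human and ip not in uv_numbers:
        human_visits[ip] += 1
        if human_visits[ip] >= 3:
            uv_numbers[ip] = "UV%d" % next_uv
            next_uv += 1
    return uv_numbers.get(ip)

label = None
for _ in range(3):
    label = record_visit("152.163.100.7", looks_human=True)
print(label)  # UV11
```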
Besides knowing which things can be filtered out, it
is really important to figure out which things are good to keep
track of, such as: Referrers. This is a record of whatever page it
was that linked the viewer to your site. Below is the Referrer entry
for somebody who arrived at American Road Cycling after
having performed a Yahoo search for "Bela Caroli":
http://search.yahoo.com/search?p=bela+caroli&fr=FP-tab-web-t&toggle=1&cop=&ei=UTF-8
You can plug the above into your browser's URL
field to see just what led that person to the site. Search engine
results are to some degree variable, so things might have changed by
the time you take a look at it. Otherwise, the information derived
this way is used to make necessary changes, so the next person
doesn't show up, waste bandwidth needlessly, and come away with a bad
feeling about your name because it got associated with something that
wasted their time.
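Pulling the useful part out of a Referrer entry like the one above is mechanical; a few lines of Python's standard library recover both the search engine and the exact phrase searched:

```python
# Recovering the search engine and search phrase from the
# Referrer entry shown above, using only the standard library.
from urllib.parse import urlparse, parse_qs

referrer = ("http://search.yahoo.com/search?p=bela+caroli"
            "&fr=FP-tab-web-t&toggle=1&cop=&ei=UTF-8")

parts = urlparse(referrer)
query = parse_qs(parts.query)  # decodes the "+" back into a space

print(parts.netloc)   # search.yahoo.com
print(query["p"][0])  # bela caroli
```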
Other things you can learn by keeping certain items
in the report include, for example, that
Dan (Palletman) McNeilly
uses a Firefox browser. That sort of stuff is a comfort, because if
something I do is incompatible with that browser, I'm sure Palletman
is going to yell, "foul," and I'll be put on notice.
On the other hand, some people use this sort of
information to merely attract as much traffic as is absolutely possible to
a site, rational or not, just so they can report, "Wow, thousands
and thousands and thousands of hits. That's worth a lot. Pay me
more!"
A couple times a day, I log onto my server, grab a
copy of the log file, run it through the filtering process, check to
see who showed up, take a look at how effectively American Road Cycling
seems to be working for them, and check them onto the
ATTENDANCE RECORDS.
Reviewing everybody's browsing this month has given me a solid
ability to look at logs and see right away who's a human, and
who's a bot. The biggest improvement is in my ability to identify a
person who has arrived through an AOL connection which assigns
several IP#'s at the same time, so their number keeps changing as
they click from page to page. Being able to do all this really helps
me when I review the Endico logs, and those logs are important,
because Mary's painting sales buy me computers, Ottrotts, and trips
to Florida.
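The AOL situation can be approximated with a crude heuristic: group requests that share an IP# prefix and an identical browser string. The Python below is purely illustrative (real AOL pools don't always share a neat prefix, and the hits shown are invented), not the actual procedure:

```python
# Crude illustration of handling a visitor whose IP# keeps changing:
# group hits by IP# prefix plus browser string. The addresses and
# browser strings here are invented for the example.
def visitor_key(ip, browser):
    prefix = ".".join(ip.split(".")[:2])  # first two octets of the pool
    return (prefix, browser)

hits = [
    ("205.188.116.5", "Mozilla/4.0 (compatible; MSIE 6.0; AOL 9.0)"),
    ("205.188.117.9", "Mozilla/4.0 (compatible; MSIE 6.0; AOL 9.0)"),
    ("209.210.33.75", "Mozilla/5.0 (Windows; Firefox/1.5)"),
]
visitors = {visitor_key(ip, browser) for ip, browser in hits}
print(len(visitors))  # 2 -- the two AOL hits collapse into one visitor
```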
In the final analysis, it turns out my first
impressions were almost totally correct. There are about a half
dozen regular readers of American Road Cycling, about a dozen counting those
who show up once in a while, and at the very most two dozen counting rare
or one-time arrivals. Once in a while there's a newbie who happens
onto the site, but just like we already knew, the status of "American
road cycling" is pretty sad. But at least I'm sure about those
numbers now,
not just guessing.
An unscrupulous Internet provider could easily
reinterpret the raw data, and truthfully state that there were 23,488
unique viewings of the American Road Cycling web site this month. The fact of the matter
would be that this "truth" represents a readership of less than a dozen
regulars, a half dozen more irregulars, and maybe a dozen tourists
happening through from time to time. The rest is bots and nonsense. I can report
this, because I'm not making any money on your belief that a lot is
happening with American Road Cycling.
But this is an astounding number
of people, considering what I've seen with 36 other sites that I've done
over the last 13 years. One of them,
Equipoise, is even one of the earliest equine sites on the web,
and remains one of the best run and most unique horse sites, while
holding the title as the world's first "catch engine"—which is the
opposite of a search engine, and maybe more useful.
A somewhat lesser degree of lying occurs when web
hosting services provide generic reports that pretend to distinguish
people from bots, while overstating the degree of reliability for
"unique viewings."
During the course of this month's study, I wound up
looking at numerous web log reports throughout the Internet and was
astonished at the general level of misinformation they could be
providing to uninformed site owners. Knowing who the
bots are is only the first step in understanding the logs output.
Disregarding for the moment that bots often change
IP#'s, along with the other information they carry onto a site that
might be used to filter the output of generic automated reports,
standardized web reports are still unlikely to ever be able to deal
with the vast array of digital processes that routinely access a web
server without human intervention. The hardest task may be
effectively deciding which elements on a site are best left off
the final report, and then using the information gathered during the
reporting process to cycle back into the design and development
process in order to continually improve the site as a whole.
The only way to make a site more understandable in
terms of what the actual human traffic flow is, and what it really
means, is to have a human constantly review and fine tune the
reporting process, then match it back into changes made on the site,
and iterate that process again, and again.
If anybody reads to this point and wants to know more
about it, let me know, and I'll continue.
WELL, SOMEBODY DID COMMENT...
|