Chili Pepper Design

Web development relleno

Web Analytics Tracking With JavaScript, a Tracking Pixel, and PHP


Articles:

  • Basic JavaScript tracking theory
      – what data is available to collect
      – why JS vs server side analytics (include differences in the data you can collect)
          – IP is only available on the server side
          – why do cookies client side if you can also access them SS? no reason for 1st party, but if you are collecting analytics on a different host than the one serving the page, you won’t have access to the cookies on the SS
      – referrer: need to get it from JS, because the analytics request will have the current page as its referrer!
      – sometimes you have a choice: the UserAgent is available on CS and SS, and the same UA should be present on the gif request, so no reason to send it from JS
      – try to reduce the amount of information being sent (for speed)
      – 3 levels of info: page, session, user
  • Asynchronous command queue
  • Sending data to the server
      – tracking pixel
      – iframe post?
      – ajax request?
      – PHP server side tracking pixel implementation
          – why not send an empty response instead of a gif? http://stackoverflow.com/questions/6638504/why-serve-1x1-pixel-gif-web-bugs-data-at-all
      – client timestamps (should be UTC, so timezone is irrelevant)
      – put it on a separate subdomain from the site: more modular this way (can host the tracking collector separately)
  • Unique visitor tracking methods
      – cookies (limitations)
      – fingerprinting
      – ETags
      – DoNotTrack
          – legal implications
          – we are de-identifying and not setting cookies if DNT is set, but are still recording the hit
  • Visitor and session cookies
      – information in each cookie (and what it’s for)
      – expiration times
      – performance (small cookies)
      – JS implementation
  • Browser capabilities
      – screen resolution, colors, viewport size, etc
      – Java and Flash detection
      – language, timezone
      – UserAgent parsing
  • Extra segmentation with custom variables and campaign tracking
      – variables at the page, session and user level
      – Google campaign variables: more referral context
      – campaign cookies / attribution (remember how the visitor first arrived at the site? or how they most recently arrived here?)
  • Server side stats you can calculate
      – time on site
      – pages per session
      – uniques per time period
  • Server side IP geocoding
      – the basic principles (IP blocks used by ISPs in certain geographic areas)
      – the databases available
      – accuracy issues (IPs do change, mobile IPs, etc)
  • Client side page load speed
      – HTML5 Navigation Timing
      – https://support.google.com/analytics/answer/1205784?hl=en
      – http://www.html5rocks.com/en/tutorials/webperformance/basics/
      – http://www.igvita.com/2012/04/04/measuring-site-speed-with-navigation-timing/
      – http://calendar.perfplanet.com/2011/a-practical-guide-to-the-navigation-timing-api/
      – track page load speed on every request? or just a sample of loads?
          – have a 2nd event fire off with the performance data
  • Client side DOM interaction tracking
      – clicks, etc
      – heatmaps
      – external link exits
  • Security / rate limiting
      – for security, code the allowed referrers to only be certain approved domains?
          – this should help prevent DDOS kinds of things, and other abuse, and placing the code

We needed to implement analytics at SplashLab to track the engagement and effectiveness of the social applications we create for our clients, in an attempt to measure the return on investment of their social marketing efforts.

Many of our applications are on Facebook, which provides some analytics (called “Insights”), but many of our applications are not on Facebook at all, or have a non-Facebook component. Facebook Insights are somewhat limited as well, focusing on the overall performance of your Facebook Page and Posts, and not providing much data about the application itself. Some of our applications share a Facebook Application as well, meaning the data from Facebook can muddle multiple clients’ data together.

We could also use 3rd party services like Google Analytics to provide some of the data, but we wanted more flexibility and control. 3rd party services greatly reduce the overall work, but integration can take a surprising amount of effort sometimes, and you are always at risk of the company you rely on changing their pricing structure, or not being reliable, or going out of business altogether (maybe not as much of a concern with Google Analytics, specifically).

So we are building our own solution, borrowing liberally in the name of flattery from all of the existing services I can find information on, in an attempt to follow current (undocumented) best practices.

With any analytics application you are attempting to answer the basic questions of “who, what, when, where, why and how (and… which?)”. With a web application that means collecting as much data as you can about who the end user is, what they are requesting from your server, when they requested it, and how they are interacting with the content you return. The where part doesn’t really apply I guess, except perhaps in ascertaining the location of the who. And you can only infer why they are requesting it to a very small extent.

You have a limited number of methods available for collecting this information on a web app. These include:

  • Data from the end user’s HTTP request for your resource
  • Data from JavaScript running in the user’s web browser
  • Inferred data from persistent storage on the client’s computer, such as a Cookie

The bulk of the data comes from the HTTP request, which doesn’t contain a whole lot of information, but some of it is quite informative. HTTP request data (see the header field definitions in RFC 2616: http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html) usually includes:

  • The name (URL) of the resource requested
  • The time of the request (this is really just inferred on the server side when fulfilling the request - “What time is it now?”)
  • The URL of the previous page the request originated from (“referer”)
  • The client’s IP address
  • The program the user is requesting the resource with - usually a web browser (“user-agent”)
  • The preferred language accepted by the user (“accept-language”)

There is also usually a bunch of other boring information about caching and gzip compression, which isn’t something most people are interested in knowing about - it lacks human interest.

It’s really only about six pieces of information, but some of it is quite juicy. It answers what is being requested, when it’s being requested, and although the data is not always accurate, it fills in a few parts of the who behind the request: where they came from (referer), where they are (IP address), which browser they are using (user-agent - useful for determining if they are on a mobile device or not), and which language they prefer to speak (accept-language).
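To make that concrete, here is a rough sketch of pulling those fields straight off an incoming request. The collector described in this post is PHP; this is just a Node-style illustration of where the data lives, and the field names in the hit object are my own.

    // Minimal Node-style sketch: everything below comes straight from the HTTP request.
    var http = require('http');

    http.createServer(function (req, res) {
        var hit = {
            resource: req.url,                                // what was requested, including the query string
            time: new Date().toISOString(),                   // "when" is just the server clock at request time
            referer: req.headers['referer'] || '',            // the previous page (often missing or stripped)
            ip: req.socket.remoteAddress,                     // client IP (or the proxy's IP)
            userAgent: req.headers['user-agent'] || '',       // browser / device hints
            language: req.headers['accept-language'] || ''    // preferred language(s)
        };
        console.log(hit);   // a real collector would write this to a log or database

        res.writeHead(200, { 'Content-Type': 'text/plain' });
        res.end('ok');
    }).listen(8080);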

The IP address can be particularly useful. Technically an IP address is just the “virtual” location of a computer (not that interesting), which has no bearing on the physical location of the computer (which is more interesting). However, on the Internet as it currently exists, IP addresses are usually assigned to servers which exist in a physical location and don’t move around very often. This fact has been exploited, and databases exist which map IP addresses to known physical locations. The accuracy is far from 100%, and at best the reliable accuracy is only to the granularity of cities, but on the whole it yields very interesting data.
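As a sketch of how such a lookup might work on the collector side (assuming a Node-style collector and the geoip-lite package, neither of which this post commits to):

    // Sketch only: geoip-lite is one freely downloadable IP-to-location database
    // for Node (built on MaxMind data).
    var geoip = require('geoip-lite');

    var geo = geoip.lookup('207.97.227.239');   // returns null if the IP isn't in the database
    if (geo) {
        // e.g. geo.country === 'US', plus region, city and [lat, lon]
        console.log(geo.country, geo.region, geo.city, geo.ll);
        // City-level accuracy at best; mobile and frequently reassigned IPs will be wrong.
    }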

JS Implementation details:

  • An asynchronous loader creates the script tag (with the correct protocol)
      – discussion of async loading, and which browsers support it properly:
          – http://friendlybit.com/js/lazy-loading-asyncronous-javascript/ (this post says it should be in the onload callback, or it will block in some browsers?)
          – http://mrcoles.com/blog/google-analytics-asynchronous-tracking-how-it-work/
  • The tracking event queue object is created separately; when the JS is asynchronously loaded, it gets the object and processes any events already queued
      – thereafter, the queue is not simply an array, but an object which immediately processes any new events pushed to it
  • Once all of the event data is collected, it is sent to the server via GET parameters on a gif image request (the “tracking pixel” web bug, or beacon)
      – this is done without actually inserting the image into the DOM, but with a JS image preload
      – there is a maximum GET request length, and if that is exceeded a POST is made instead, from a hidden iframe that is created
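A minimal sketch of that pattern, assuming placeholder names throughout (_tq for the command queue, track.example.com/tracker.js for the script, collect.gif for the pixel):

    // On the tracked page: a stub command queue plus an async script loader.
    var _tq = _tq || [];
    _tq.push(['trackPageview']);   // queued before tracker.js has loaded

    (function () {
        var s = document.createElement('script');
        s.async = true;
        s.src = ('https:' === document.location.protocol ? 'https://' : 'http://') +
                'track.example.com/tracker.js';
        var first = document.getElementsByTagName('script')[0];
        first.parentNode.insertBefore(s, first);
    })();

    // Inside tracker.js: drain whatever was queued as a plain array, then swap
    // the array out for an object whose push() runs commands immediately.
    (function () {
        function sendHit(params) {
            var pairs = [];
            for (var key in params) {
                if (params.hasOwnProperty(key)) {
                    pairs.push(encodeURIComponent(key) + '=' + encodeURIComponent(params[key]));
                }
            }
            var src = '//track.example.com/collect.gif?' + pairs.join('&') +
                      '&cb=' + new Date().getTime();   // cache buster
            if (src.length < 2000) {
                new Image().src = src;   // the "tracking pixel": a GET the browser never renders
            } else {
                // Too long for a GET: fall back to a POST submitted from a hidden iframe.
            }
        }

        var commands = {
            trackPageview: function () {
                sendHit({ t: 'pageview', dl: document.location.href, dr: document.referrer });
            },
            trackEvent: function (category, action) {
                sendHit({ t: 'event', ec: category, ea: action });
            }
        };

        function run(args) {
            var name = args[0];
            if (commands[name]) { commands[name].apply(null, args.slice(1)); }
        }

        var queued = window._tq instanceof Array ? window._tq : [];
        window._tq = { push: run };   // later pushes execute right away
        for (var i = 0; i < queued.length; i++) { run(queued[i]); }
    })();

The important trick is in the last three lines: the plain array that buffered commands before the script arrived is swapped out for an object whose push() executes immediately, so page code never has to care whether the tracker has loaded yet.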

Session and cookie stuff? Google sets its cookies client side and sends the cookie contents along with the hit because it’s a 3rd party service, so it can’t directly access the 1st party cookie data from the server. We are tracking everything first-party, but if we want to deploy the analytics on a custom server, we’ll need this too.
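A sketch of that first-party version, with made-up cookie names (_ca_uid for the visitor, _ca_sid for the session): the page’s JS reads the cookies and forwards their values on the pixel request, since a collector on another host can’t see this domain’s cookies itself.

    function readCookie(name) {
        var match = document.cookie.match(new RegExp('(?:^|; )' + name + '=([^;]*)'));
        return match ? decodeURIComponent(match[1]) : null;
    }

    var visitorId = readCookie('_ca_uid');   // long-lived unique-visitor cookie
    var sessionId = readCookie('_ca_sid');   // short-lived session cookie

    // Appended to the tracking pixel URL along with the rest of the hit data:
    var cookieParams = '&uid=' + encodeURIComponent(visitorId || '') +
                       '&sid=' + encodeURIComponent(sessionId || '');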

TODO: set a flag or something when recording hits on the server which don’t have cookie information.

Cookie alternatives:

  • browser fingerprinting
  • Flash cookies
  • local storage?
  • ETags?
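For example, a rough localStorage fallback for the visitor ID (made-up key name, reusing the readCookie helper from the sketch above):

    function getVisitorId() {
        var id = readCookie('ca_visitor_id');
        if (!id && window.localStorage) {
            id = localStorage.getItem('ca_visitor_id');   // survives even if the cookie was cleared
        }
        if (!id) {
            // Simple timestamp + random ID, just for illustration
            id = new Date().getTime() + '.' + Math.floor(Math.random() * 1e9);
            document.cookie = 'ca_visitor_id=' + id + '; path=/; max-age=' + (60 * 60 * 24 * 365 * 2);
            if (window.localStorage) { localStorage.setItem('ca_visitor_id', id); }
        }
        return id;
    }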

Privacy legislation about cookies and Do Not Track policies… some companies, like KISSmetrics, are facing lawsuits over trying to work around cookie tracking with things like ETags. Potentially, any tracking strategy which does not give users control over tracking (like cookies do) could open you up to legal action, but all of this really just affects unique-visitor tracking, not general hit tracking.

What design pattern best describes the command queue system? (http://addyosmani.com/resources/essentialjsdesignpatterns/book) I guess it’s a Command pattern, with the queue acting as the broker.

Use JS prototype inheritance for the events? That way all event types inherit from a base event carrying the same information (user_id, tab_id, etc).
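A quick sketch of what that could look like (all names illustrative):

    // Base event carries the fields every hit shares; specific event types
    // inherit from it via the prototype chain.
    function BaseEvent(userId, tabId) {
        this.user_id = userId;
        this.tab_id = tabId;
        this.timestamp = new Date().toISOString();   // client timestamp, already UTC
    }
    BaseEvent.prototype.toParams = function () {
        return { uid: this.user_id, tid: this.tab_id, ts: this.timestamp, t: this.type };
    };

    function PageviewEvent(userId, tabId, url) {
        BaseEvent.call(this, userId, tabId);   // shared fields set by the base "constructor"
        this.type = 'pageview';
        this.url = url;
    }
    PageviewEvent.prototype = Object.create(BaseEvent.prototype);
    PageviewEvent.prototype.constructor = PageviewEvent;

    var hit = new PageviewEvent('u123', 't1', document.location.href);
    // hit.toParams() is inherited from BaseEvent.prototype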

Browser fingerprinting? As an enhancement to unique-visitor counting (for when cookies are disabled): https://wiki.mozilla.org/Fingerprinting
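A deliberately naive sketch of the idea; real fingerprinting uses many more signals, and this would only be a tie-breaker for visitors without cookies, never a primary ID.

    function fingerprint() {
        // Concatenate a few stable-ish browser attributes…
        var parts = [
            navigator.userAgent,
            navigator.language || '',
            screen.width + 'x' + screen.height + 'x' + screen.colorDepth,
            new Date().getTimezoneOffset(),
            navigator.plugins ? navigator.plugins.length : 0
        ].join('|');

        // …and hash them (djb2-style string hash, just to get a compact value to send).
        var hash = 5381;
        for (var i = 0; i < parts.length; i++) {
            hash = ((hash << 5) + hash + parts.charCodeAt(i)) | 0;
        }
        return hash >>> 0;
    }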

make note of custom variable “levels” - page, visitor, session
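Google’s classic analytics exposes this as _setCustomVar(index, name, value, scope), where scope 1 = visitor, 2 = session and 3 = page; a similar (hypothetical) command on our own queue might look like:

    _tq.push(['setCustomVar', 1, 'memberType', 'premium', 1]);       // visitor level
    _tq.push(['setCustomVar', 2, 'loginState', 'loggedIn', 2]);      // session level
    _tq.push(['setCustomVar', 3, 'articleCategory', 'recipes', 3]);  // page level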

Campaign tracking? Piggyback on the same variables Google uses:

  • utm_source: (newsletter|google|etc)
  • utm_medium: (email|cpc|fbPost|etc)
  • utm_term: (running+shoes|etc)
  • utm_content: (adVersion1|adVersion2) // used for A/B testing
  • utm_campaign: (spring_sale)
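A sketch of pulling those parameters off the landing page URL so they can be stored in a campaign/attribution cookie (plain string parsing, no library assumed):

    function getCampaignParams(queryString) {
        var wanted = ['utm_source', 'utm_medium', 'utm_term', 'utm_content', 'utm_campaign'];
        var out = {};
        var pairs = (queryString || document.location.search).replace(/^\?/, '').split('&');
        for (var i = 0; i < pairs.length; i++) {
            var kv = pairs[i].split('=');
            var key = decodeURIComponent(kv[0] || '');
            for (var j = 0; j < wanted.length; j++) {
                if (key === wanted[j]) {
                    out[key] = decodeURIComponent((kv[1] || '').replace(/\+/g, ' '));
                }
            }
        }
        return out;
    }

    // '?utm_source=newsletter&utm_medium=email&utm_campaign=spring_sale'
    // -> { utm_source: 'newsletter', utm_medium: 'email', utm_campaign: 'spring_sale' }
    // Stored in a campaign cookie, this answers how the visitor first arrived at the site.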

Performance

Google tracks:

    a.domainLookupEnd - a.domainLookupStart          // DNS
    a.connectEnd - a.connectStart                    // connect time (TCP)
    a.responseStart - a.requestStart                 // time between requesting data and the beginning of the response
    a.responseEnd - a.responseStart                  // time to receive all request data back (base page)
    a.fetchStart - a.navigationStart                 // time to start of request, including redirect time

    // 3 times, from the start of navigation to:
    a.domInteractive - a.navigationStart             // (1) time until the DOM is ready to be interacted with (just after response, start of processing)
    a.domContentLoadedEventStart - a.navigationStart // (2) time until the DOM is fully loaded and parsed, but images/css not ready yet
    a.loadEventStart - a.navigationStart             // (3) end of page processing, start of the onload event when images and css are loaded

Other suggestions:

    a.responseStart - a.connectEnd      // time to first byte / "request" block
    a.loadEventStart - a.responseEnd    // front end / page "processing"
    a.responseEnd - a.fetchStart        // network latency
    a.loadEventEnd - a.responseEnd      // time taken for page load after data received
    a.loadEventEnd - a.navigationStart  // entire load from start to end, including images and everything
    a.redirectEnd - a.redirectStart     // redirect time

Piwik tracks:

    a.responseEnd - a.requestStart      // "generation time": server request + server response time

Breakdown from the chart:

    a.redirectEnd - a.redirectStart             // redirect time
    a.domainLookupStart - a.fetchStart          // app cache
    a.domainLookupEnd - a.domainLookupStart     // DNS
    a.connectEnd - a.connectStart               // TCP
    a.responseStart - a.requestStart            // request
    a.responseEnd - a.responseStart             // response
    a.loadEventStart - a.responseEnd            // client side processing
    a.loadEventEnd - a.loadEventStart           // images etc loading

The ones I want to record:

    a.redirectEnd - a.redirectStart                   // redirect time
    a.connectEnd - a.connectStart                     // TCP connect time
    a.domainLookupEnd - a.domainLookupStart           // DNS resolution time
    a.responseStart - a.requestStart                  // server request (time to get the first data back from the server)
    a.responseEnd - a.responseStart                   // server response (time to get all data from the server)
    a.loadEventStart - a.responseEnd                  // client side, including images/css
    a.domContentLoadedEventStart - a.navigationStart  // time until the DOM is fully loaded and parsed, but no images/css yet
    a.loadEventStart - a.navigationStart              // time until the DOM is fully loaded and parsed, with images/css

    a.fetchStart - a.navigationStart                  // time to start of request, including redirect time?

    // loadEventEnd, to see how long all assets take as well?
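A sketch of collecting those deltas and firing them off as the second event mentioned in the outline (the 'trackTiming' command is made up):

    // Read the Navigation Timing deltas after onload, inside a setTimeout so
    // that loadEventEnd has been filled in, then push a second hit.
    window.addEventListener('load', function () {
        setTimeout(function () {
            if (!window.performance || !window.performance.timing) { return; }   // no Navigation Timing support
            var a = window.performance.timing;
            _tq.push(['trackTiming', {
                redirect: a.redirectEnd - a.redirectStart,
                dns:      a.domainLookupEnd - a.domainLookupStart,
                tcp:      a.connectEnd - a.connectStart,
                request:  a.responseStart - a.requestStart,
                response: a.responseEnd - a.responseStart,
                frontend: a.loadEventStart - a.responseEnd,
                domReady: a.domContentLoadedEventStart - a.navigationStart,
                pageLoad: a.loadEventStart - a.navigationStart
            }]);
        }, 0);
    }, false);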

Do Not Track handling:

  • check every time? or check at bootstrap/queue load time?
  • how are exceptions handled? how does the browser inform the page of an exception? does the flag show a different status for the domain?
      – I think it does: http://www.w3.org/TR/tracking-dnt/#js-dom
  • what is navigator.storeSiteSpecificTrackingException? I think it can be used to request an exemption: var retVal = navigator.storeSiteSpecificTrackingException(args);
      – http://msdn.microsoft.com/en-us/library/ie/dn254981(v=vs.85).aspx
      – http://msdn.microsoft.com/en-us/library/ie/dn265021(v=vs.85).aspx
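A sketch of the bootstrap-time check, de-identifying rather than dropping the hit as noted above (the property name and value vary by browser):

    function dntEnabled() {
        var dnt = navigator.doNotTrack || window.doNotTrack || navigator.msDoNotTrack;
        return dnt === '1' || dnt === 'yes';
    }

    if (dntEnabled()) {
        // Per the policy above: still record the hit, but de-identify it and set no cookies.
    }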
