So, after what seems like a long time just trying to learn R on my own (I’m well into the Coursera / Johns Hopkins Data Science specialization), I finally came across a work-related problem I could try some very simple scripting on. Like a pre-schooler eager to show his work, here’s a blog post 🙂
Hoping others find it useful, or can point to a better way of doing this.
One of the products I work with analyzes HTTP[S] sessions, looking for malicious application-level behaviour (really cool stuff IMHO, but not the focus of this post). To do that, we need to be able to reconstruct the entire HTTP stream based on network traffic captured with a variety of methods (SPAN ports, network taps, monitoring switches, or in some cases cloned traffic from a load balancer). The “quality” of that captured stream is key – if we lose too many packets, we can’t reliably follow the TCP streams, which means we may miss user clicks on the site.
Note that this is different from analyzing web server logs (Apache, nginx, …): those log files often contain only partial information about a click, whereas a full traffic capture offers much, much more.
One of the ways we analyze that quality is by estimating how many sessions had “lost” packets during a monitoring window. This requires a little bit of tinkering, as some lost packets may belong to sessions that were already in flight when the window started, so simply counting lost packets as a proportion of the total would be misleading. Instead, we need to count how many NEW sessions in our monitoring window have had lost packets.
Using whatever method you prefer, capture traffic in pcap format. This is often done with a server plugged in the capture destination, using tcpdump to write contents to a file.
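For example, a capture for a five-minute monitoring window might look like the command below (the interface name and window length are illustrative placeholders; tcpdump needs root privileges):

```shell
# Illustrative capture command -- eth1 and the 300 s window are placeholders.
# -i eth1 : the interface receiving the SPAN/tap/cloned traffic
# -s 0    : capture full packets, not just headers
# -G 300  : rotate the output file every 300 seconds...
# -W 1    : ...but stop after the first file, i.e. one 5-minute window
# -w      : write raw packets in pcap format
tcpdump -i eth1 -s 0 -G 300 -W 1 -w capture.pcap
```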
Open up Wireshark on a separate PC and load the pcap file.
Then, conduct two separate analyses:
- First, we create a list of all TCP sessions of interest that started within the monitoring window. We do this by applying the following filter:
tcp.flags.syn==1 && tcp.flags.ack==0
then exporting the resulting list (select File->Export Packet Dissections…) as a CSV (call it “Sessions.csv”) without packet details.
The output looks something like this (sanitized):
"1195","0.134502000","10.8.15.216","192.168.123.82","TCP","62","50680 > 443 [SYN] Seq=0 Win=4380 Len=0 MSS=1460 SACK_PERM=1"
- Now, we create a list of all events where Wireshark detected missing TCP segments. This can be done with this filter (it flags the “TCP ACKed unseen segment” events shown in the sample output below):
tcp.analysis.ack_lost_segment
Again, export it to a CSV (call it “Lost.csv”) as above without adding packet details.
This is what the [sanitized] output looks like:
"1031","0.114669000","10.1.205.60","192.168.123.134","TCP","66","[TCP ACKed unseen segment] 49856 > 443 [ACK] Seq=757 Ack=35816 Win=45192 Len=0 TSval=1428649841 TSecr=188814810"
At first, I used a ‘quick & dirty’ approach using Excel(!) to compare these files, but that is not repeatable. Let’s try a little R…
Now that we have the two CSV files, the R script below tells us exactly what we need to know – what percentage of NEW TCP sessions, started within the monitoring window, had at least one “lost segment”.
(Notice that the “webservers” variable is just something I used to filter out unwanted traffic from the pcap and has been sanitized in the example below.)
library(dplyr)
library(stringr)

lostfile    <- "Lost.csv"
sessionfile <- "Sessions.csv"
webservers  <- c("192.168.80")   # destination prefix used to filter out unwanted traffic

lost     <- read.csv(lostfile, stringsAsFactors = FALSE)
sessions <- read.csv(sessionfile, stringsAsFactors = FALSE)

# Build a "source IP:port" key for every new session...
df_sessions <- sessions %>%
  filter(grepl(webservers, Destination)) %>%
  mutate(SrcPort = gsub(" >", "", str_extract(Info, "(\\d+) >"))) %>%
  mutate(SrcSocket = paste(Source, ":", SrcPort, sep = ""))

# ...and the same key for every lost-segment event
df_lost <- lost %>%
  filter(grepl(webservers, Destination)) %>%
  mutate(SrcPort = gsub(" >", "", str_extract(Info, "(\\d+) >"))) %>%
  mutate(SrcSocket = paste(Source, ":", SrcPort, sep = ""))

# Sessions appearing in both lists had at least one lost segment
badsessions    <- intersect(df_sessions$SrcSocket, df_lost$SrcSocket)
df_badsessions <- df_sessions[df_sessions$SrcSocket %in% badsessions, ]

n_sess <- nrow(df_sessions)
n_bad  <- nrow(df_badsessions)
print(paste("Total Sessions:", n_sess))
print(paste("Bad Sessions:", n_bad))
print(paste("Percentage:", round(n_bad * 100 / n_sess, digits = 2)))
Results and Conclusion
Running the script prints the total number of new sessions, the number that had at least one lost segment, and the resulting percentage.
Simple and to the point.
I’m sure there are better ways to achieve the same goal: some tshark foo and scripting are obvious candidates, but in many cases we need to keep the initial capture process as simple as possible. Asking for a pcap for us to process is as easy as it gets.
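For what it’s worth, the comparison itself can also be sketched in plain shell, given the same two Wireshark CSV exports. This assumes no embedded commas inside the quoted fields (true for the Info strings shown in this post, but not guaranteed in general), and that the source port is the first number before “ >” in the Info column:

```shell
#!/bin/sh
# A plain-shell sketch of the same comparison the R script performs.
# Assumes Sessions.csv and Lost.csv are Wireshark CSV exports as above,
# with no embedded commas inside the quoted fields.

# Build a sorted, de-duplicated list of "SourceIP:SourcePort" keys from a
# Wireshark CSV export ($3 = Source, $7 = Info; port is first number
# before " >", which also skips the header row).
socketlist() {
  awk -F'","' '{
    if (match($7, /[0-9]+ >/))
      print $3 ":" substr($7, RSTART, RLENGTH - 2)
  }' "$1" | sort -u
}

if [ -f Sessions.csv ] && [ -f Lost.csv ]; then
  socketlist Sessions.csv > sessions.keys
  socketlist Lost.csv     > lost.keys
  total=$(wc -l < sessions.keys | tr -d ' ')
  bad=$(comm -12 sessions.keys lost.keys | wc -l | tr -d ' ')
  echo "Total Sessions: $total"
  echo "Bad Sessions: $bad"
  awk -v t="$total" -v b="$bad" 'BEGIN { printf "Percentage: %.2f\n", 100 * b / t }'
fi
```

(Note this version omits the “webservers” destination filter from the R script; a `grep` on the Destination column could be added before `sort`.)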
So, I’m excited to finally have had the opportunity to use R for a ‘real-life’ scenario that I can share. Let me know how I can do better next time.