\documentclass[12pt]{article}
\usepackage[margin=1in]{geometry}
\usepackage{amsmath,amsthm,amssymb,amsfonts}
\usepackage{graphicx,epstopdf,fancyhdr,xspace,algorithm,algorithmic}
\usepackage[space]{grffile}
\pagestyle{fancy}
\usepackage{epsfig}
\usepackage[space]{grffile}
\usepackage{titlesec}
\usepackage[table]{xcolor}
\usepackage[hyphens]{url}
\usepackage{enumitem} % package used to generate bulleted lists
\usepackage{tikz}
\usetikzlibrary{arrows,automata}
\theoremstyle{definition}
\renewcommand{\epsilon}{\varepsilon}
\newtheorem{problem}{Problem} %
\cfoot{\thepage}
\renewcommand{\headrulewidth}{0.4pt}
\renewcommand{\headwidth}{\textwidth}
%\renewcommand{\footwidth}{\textwidth}
%\addtolength{\headwidth}{\marginparwidth}
%\addtolength{\headwidth}{\marginparsep}
\renewcommand{\footrulewidth}{0.4pt}
\newcommand{\gradingcomment}[1]{{ \color{blue} #1} }
\newcommand{\mymin}{\ensuremath{\mbox{\sc FindMin}}\xspace}
\newcommand{\ms}{\ensuremath{\mbox{\sc Mergesort}}\xspace}
\newcommand{\qs}{\ensuremath{\mbox{\sc Quicksort}}\xspace}
\newtheorem{claim}{Claim}
\newtheorem{definition}{Definition}
\newtheorem{theorem}{Theorem}
\newtheorem{lemma}{Lemma}
\newtheorem{observation}{Observation}
\newtheorem{question}{Problem}
\titleformat*{\section}{\large\bfseries}
\newcommand{\N}{\mathbb N} %you can define shorthands this way, here we are defined \N as a shorthand for math bold N for natural numbers
\newenvironment{solution}{\bigskip\noindent{\em Solution.} \ignorespaces}{}
\usepackage{hyperref}
\hypersetup{
colorlinks=true,
linkcolor=blue,
filecolor=magenta,
urlcolor=cyan,
}
\urlstyle{same}
\PassOptionsToPackage{hyphens}{url}
\newcommand{\homework}[6]{
\pagestyle{myheadings}
\thispagestyle{plain}
\newpage
\setcounter{page}{1}
\noindent
\begin{center}
\framebox{
\vbox{\vspace{2mm}
\hbox to 6.28in { {\bf CS358: Applied Algorithms\hfill} }
\vspace{6mm}
\hbox to 6.28in { {\Large \hfill #1 \normalsize{(#2)} \hfill} }
\vspace{6mm}
\hbox to 6.28in { {\it Instructor: #3 \hfill } }
%\hbox to 6.28in { {\it TA: #4 \hfill #6}}
% \vspace{2mm}
}
}
\end{center}
\markboth{#1}{#1}
\vspace*{4mm}
}
\definecolor{grayX}{gray}{0.95}
\newcommand{\E}{\text{E}} %useful for making capital Es in math mode now that we're using expectation
\begin{document}
\homework{Assignment 4: Streaming}{due 10/20/2021 10PM EDT}{Sam McCauley}{}{}{}
\section*{Instructions}
All submissions are to be done through github. This process is detailed in the handout ``Handing In Assignments'' on the course website. Answers to the questions below should be submitted by editing this document. All places where you are expected to fill in solution are marked in comments with ``FILL IN.''
Please contact me at \url{srm2@cs.williams.edu} if you have any questions or find any problems with the assignment materials.
\section*{Problem Description}
In this problem, we will be analyzing a very long novel: ``In Search of Lost Time,'' by Marcel Proust.\footnote{This novel is, I understand, very popular and well-regarded. However, it was chosen for this class mostly because its copyright has expired.} This novel contains about a million words (including duplicates), and the text file given for this assignment is about 7MB. Nonetheless, we will be using very small streaming data structures to analyze this file with just a single pass over the data---the first using a handful of kilobytes of space, the second using just 32 bytes.
In this assignment, you will be building two data structures.
First, you will build a Count-Min Sketch data structure. All words in ``In Search of Lost Time'' will be inserted into the Count-Min Sketch. At the end, your data structure will be queried with some of the most common words in the novel: how many times does this word appear? The testing program will compare your output to the actual count of each word; your data structure should always overestimate the count, but give reasonably similar values.
Second, you will build a HyperLogLog data structure. Again, all words in ``In Search of Lost Time'' will be inserted into it. At the end, your data structure will be queried to find out approximately how many unique words occurred in the novel. HyperLogLog uses an incredibly small amount of space, so it is likely that your data structure will have some error. However, it should usually be reasonably close to the correct value.
\bigskip
\textsc{Input:} \texttt{test.out} is given three arguments. The first is a text document in ASCII format.\footnote{As with last time, this means there are no accented characters.} The second is a text document, where each line contains a word from the first text document, followed by a space, followed by the number of times that word appears in the first document. The final argument is an integer denoting the number of unique words in the original text document.
To run your program on ``In Search of Lost Time,'' you would use the following input:
\begin{verbatim}
./test.out proust.txt words.txt 36372
\end{verbatim}
The following input may be useful for testing:
\begin{verbatim}
./test.out proustShort.txt wordsShort.txt 125
\end{verbatim}
\bigskip
\noindent
\textsc{Output:} This assignment is unique in this course in that a single answer is not usually marked as correct or incorrect.\footnote{This choice is due to two concerns. First, we're using randomness, so some error is to be expected. (One could run the program multiple times and perform statistical tests, but that's very much contrary to the spirit of these structures.) Second, these structures are fairly inconsistent: for example, a cuckoo filter will almost always have approximately the same false positive rate on a large dataset; a CMS or HLL may not.} Instead, the testing program will output, for each word in the second text file, the actual number of occurrences of the word in the text compared to the number output by your Count-Min Sketch. Furthermore, the testing program will output the number of unique words predicted by your HyperLogLog data structure, compared to the actual number of unique words.
\bigskip
\noindent
\textsc{Interpreting the output:} For the large output, the CMS should generally answer most word queries within $1000$ of the correct value. Almost all CMS answers should be within $1500$ of the correct value.
The HLL overall estimation of the number of words should almost always be between $25000$ and $50000$.
You should use several different seeds to check that your answers satisfy these bounds.
\bigskip
\noindent
\textsc{Functions:} This assignment is, broadly, structured much like the last assignment. The functions for both data structures already exist, and you must fill them in. The code in \texttt{test.c} will perform the above tests using the functions you provide.
\texttt{cms.c} and \texttt{cms.h} contain the code for the Count-Min Sketch data structure. \texttt{hll.c} and \texttt{hll.h} contain the code for the HyperLogLog data structure.
\noindent
Here is a list of the functions and how they are used:
\begin{verbatim}
void cms_instantiate(Cms* cms)
\end{verbatim}
This function is called before any other calls to the cms functions. You can think of it like a constructor. It should set constants and allocate memory. \texttt{cms} is a pointer to a struct that I found useful (you can change this to not use a struct if you wish). You do not need to edit this function if you don't want to; the version in the assignment is the version I used.
\begin{verbatim}
void cms_insert(char* word, int length, Cms* cms)
\end{verbatim}
This function inserts a new word (given by \texttt{word}) into the filter. \texttt{length} is the length of the word, and \texttt{cms} is a pointer to the Count-Min Sketch we want to insert to.
\texttt{test.c} will insert each word in the first document into the Count-Min Sketch by calling this function. These inserts will all occur before any call to \texttt{filter\_lookup}.
\begin{verbatim}
int cms_lookup(char* word, int length, Cms* cms)
\end{verbatim}
This function looks up a word (given by \texttt{word}) in the sketch pointed to by \texttt{cms}. It returns an estimate of how often \texttt{word} (which has length \texttt{length}) occurs in the first document.
\begin{verbatim}
void hll_instantiate(Hll* hll)
\end{verbatim}
This function is called before any other calls to the hll functions. You can think of it like a constructor. It should set constants and allocate memory. \texttt{hll} is a pointer to a struct that I found useful\footnote{This is a bit wasteful, unlike the CMS and cuckoo filter. You could easily have the seed as a global constant and just pass around the table of counters rather than storing this struct.} (you can change this to not use a struct if you wish). You do not need to edit this function if you don't want to; the version in the assignment is the version I used.
\begin{verbatim}
void hll_insert(char* word, int length, Hll* hll)
\end{verbatim}
This function inserts a new word (given by \texttt{word}) into the HyperLogLog data structure \texttt{hll}. \texttt{length} is the length of the word, and \texttt{hll} is a pointer to the structure we want to insert to.
\texttt{test.c} will insert each word in the first document into the HyperLogLog data structure by calling this function.
\begin{verbatim}
int hll_estimate(Hll* hll)
\end{verbatim}
This function asks for an estimate of how many unique words have been inserted into \texttt{hll}.
\bigskip
\noindent
\textsc{Count-Min Sketch Parameters:} Your CMS should have 4 rows, each of 300 entries. Each entry should be of 32 bits.\footnote{This is wasteful! Unfortunately, 16 bit integers are JUST small enough to barely overflow on this data. Using 18 or 19 bit entries would almost certainly be ideal.}
\bigskip
\noindent
\textsc{HyperLogLog Parameters:} Your HyperLogLog data structure should keep track of 32 counters, each of length 8 bits. For 32 counters, the bias constant is .697. (The bias constants are included in \texttt{hll.h}.)
\newpage
\section*{Questions}
\textbf{Code}~(50 points). Implement a Count-Min Sketch and a HyperLogLog counter, each as described above. You do not need to describe your implementation---but I left some space below in case there's something that you think would be helpful to state.
\begin{solution}
%FILL IN YOUR ANSWER HERE
\end{solution}
\bigskip
\begin{question}[10 points]
Let $X$ be a positive random variable (i.e. $X$ only takes on values that are $\geq 0$). Show that
\[
\Pr\left[ X \geq e\cdot \E[X] \right] \leq 1/e.
\]
\textit{Hint:} Use the definition of expectation (and the assumption that $X$ is always positive). It may also be useful to write $\Pr[X \geq e\cdot \E[X]] = \sum_{i = e\cdot\E[X]}^\infty \Pr[X = i]$.
\end{question}
\begin{solution}
%FILL IN YOUR ANSWER HERE
\end{solution}
\bigskip
\begin{question}[20 points]
Consider a situation where I have a stream of 1,001,000 elements. 1,000,000 of the elements $a_1,\ldots a_{1000000}$ only appear once, but one element, $q$, appears 1000 times throughout the stream.
For each of the following use cases,
give the appropriate parameters\footnote{Here I am asking about the parameters you should use for your \emph{code}: number table entries, number of rows, etc. Determining $\epsilon$ and $\delta$ is likely an important part of your answer, but it is not the final solution.} for a correct
Count-Min Sketch with the smallest possible space.
Be sure to answer:
\begin{itemize}
\item How many rows should I have? ~~ (\texttt{numRepetitions})
\item How many entries should I have in each row? ~~ (\texttt{numEntries})
\item How large should each entry be in bits?\footnote{Please give an exact answer, even if \texttt{C} does not support variables of this size.} ~~ (The size of each entry in \texttt{table}, in bits)
\end{itemize}
\bigskip
\textbf{(a)} After inserting all stream elements into my CMS, I query an element $a_i$, and I receive an answer $o_i$, and I query $q$, and receive an answer $o_q$.
How should I set the parameters of my Count-Min Sketch so that at least $90\%$ of the time, $o_i \leq o_q$? That is to say, how do I set my parameters so that $90\%$ of the time, the CMS accurately returns that $q$ was more common than $a_i$?
You do not need to give a precise answer; giving parameters that work is OK even if slightly different parameters are more efficient. But, you should explain how you got your answer, and why it attains the required performance.
\textit{Hint:} One way to do this is to come up with constants $c_q$ and $c_i$ such that for your parameters you can show that:
\begin{enumerate}[noitemsep, topsep=0pt]
\item $c_i \leq c_q$, and
\item 90\% of the time, $o_i \leq c_i$ and $c_q \leq o_q$.
\end{enumerate}
\begin{solution}
%FILL IN YOUR ANSWER HERE
\end{solution}
\bigskip
\textbf{(b)} After inserting all stream elements into my CMS, I query all elements $a_1,\ldots a_{1000000}$ to obtain answers $o_1,\ldots o_{1000000}$, as well as querying $q$ to obtain an answer $o_q$.
How should I set the parameters of my Count-Min Sketch so that at least $90\%$ of the time, $\max_i{o_i} \leq o_q$? That is to say, how do I set my parameters so that $90\%$ of the time, the CMS accurately returns that $q$ was the most common out of all elements queried?
\textit{Hint}: You probably want to use the union bound on part (b) of this question, combining it with the answer to part (a).
\end{question}
\begin{solution}
%FILL IN YOUR ANSWER HERE
\end{solution}
\bigskip
\begin{question}[10 points]
Let's say I create a HyperLogLog structure $H_1$ for a stream $a_1,\ldots a_n$ and a second HyperLogLog structure $H_2$ for a stream $b_1,\ldots b_n$. Assume that all $a_i$ and $b_i$ are in the same universe $U$, and assume that identical parameters and hash functions are used to create $H_1$ and $H_2$.
Describe how to use $H_1$ and $H_2$ to create a new HyperLogLog structure $H_3$ that can estimate the number of unique items in the concatenation of the two streams: $a_1,\ldots a_n, b_1,\ldots, b_n$. (You should build $H_3$ using only $H_1$ and $H_2$, without seeing the stream again.)
\end{question}
\begin{solution}
%FILL IN YOUR ANSWER HERE
\end{solution}
\bigskip
\begin{question}[10 points]
%Is the Count-Min Sketch useful in each of the following situations---that is to say, can it give worst-case guarantees? (In many of these situations, it is not the \emph{most} effective data structure. I am asking if it is able to give guarantees on the answer to each question.)
The following short questions ask about different situations in which you can use a Count-Min Sketch. For each, say how many entries you would need in each column (in other words, what value of $\epsilon$ you would need asymptotically) to obtain the correct answer in a single row with constant probability. In each case, assume that there are $N$ total insertions into the CMS.
%, taken from $X$ unique items. (That is to say, $X = |\{x_i ~|~ i\in\{1,\ldots, N\} \}|$.)
\textbf{For each explanation, please give a brief explanation ($\approx$ one sentence) as to why---and bear in mind that I am only expecting an asymptotic answer.}
\bigskip
\textbf{(a)} A stream consists of two kinds of items: items of the first kind appear once each, and items of the second kind appear twice each. Can a Count-Min Sketch determine if a given item appeared once or twice? How many entries would you need in each row of the CMS in order for it to provide the correct answer with constant probability?
\begin{solution}
%FILL IN YOUR ANSWER HERE
\end{solution}
\bigskip
\textbf{(b)} In a stream of items, one appears the majority of the time (more than half the items are the same). Can the Count-Min Sketch determine the majority item? How many entries would you need in each row of the CMS in order for it to provide the correct answer with constant probability?
\end{question}
\begin{solution}
%FILL IN YOUR ANSWER HERE
\end{solution}
\end{document}