dpmon
Classes
- class dpmon.DPMon(path, data_format, accountant, engine='local', spark=None, direction='outgoing', ipasn_db=None, head=None)
The main class for DPMon. Create an object of the DPMon class to make private queries.
- Parameters:
path (str) – The path to the data to be analyzed. Can be a string. When using the local engine, path can also be a list of paths. When using the spark engine, the path is in the Spark format and can therefore include * and {...} expressions.
data_format (str) – The data format: tstat or nfdump.
accountant (diffprivlib.BudgetAccountant) – A DiffPrivLib BudgetAccountant that specifies the privacy budget, limiting the information that can be extracted from the data. Create one, for example, with: diffprivlib.BudgetAccountant(epsilon=1.0)
engine (str) – Engine to be used: local or spark. Default: "local"
spark (spark.sql.SparkSession) – If engine = "spark", you must provide a SparkSession as the Spark entry point. Default: None
direction (str) – Whether to focus on outgoing flows (those issued by internal clients to the Internet) or ingoing flows (those issued by any Internet endpoint towards an internal client). See the documentation for an explanation. Default: "outgoing"
ipasn_db (str) – The path of a file in pyasn format, used to map IP addresses to the corresponding ASN. If the file is provided, it is possible to make queries based on ASN, e.g., the volume to a specific ASN. Default: None
head (int) – Truncate the data to head lines. Useful for debugging. Default: None
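A minimal construction sketch, assuming the dpmon and diffprivlib packages are installed. The helper name make_monitor and its total_epsilon argument are hypothetical; the constructor arguments follow the signature documented above.

```python
def make_monitor(path, data_format="tstat", total_epsilon=1.0):
    """Build a DPMon object on the local engine (hypothetical helper)."""
    # Imported lazily so this sketch can be loaded without the packages installed.
    import diffprivlib
    import dpmon

    # The accountant caps the total privacy budget shared by all queries.
    accountant = diffprivlib.BudgetAccountant(epsilon=total_epsilon)
    return dpmon.DPMon(
        path,
        data_format,
        accountant,
        engine="local",
        direction="outgoing",
    )
```

A call such as make_monitor("path/to/tstat/logs") (a placeholder path) would then return an object on which the query methods below can be invoked.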
- flow_feature(feature, metric, ip=None, asn=None, domain=None, epsilon=1.0, percent=None, bins=10, range=None)
Extract statistics on a flow feature (i.e., a log’s column). This can be used only with Tstat data and on a subset of columns. DPMon first computes the average per-user value of the flow feature, then applies the requested statistic (mean, standard deviation, histogram or percentile) across the per-user averages. This behavior is mandatory because Differential Privacy must protect users, not flows. You can filter by server IP, ASN or domain. The three filters are considered together (i.e., they form an AND clause).
- Parameters:
feature (str) – The flow feature to compute statistics on.
metric (str) – The metric to compute. Can be mean, std, histogram or percentile.
percent (float) – The percentile to compute in case metric==percentile.
ip (str) – The server IP to filter. Default: None
asn (int) – The server ASN to filter. Default: None
domain (str) – The server domain to filter. Default: None
epsilon (float) – The privacy budget to allocate for the query. Default: 1.0
bins (int) – Number of bins of the histogram in case metric==histogram. Default: 10
range ((float,float)) – The lower and upper range of the bins of the histogram in case metric==histogram. Default: None
- Returns:
The desired statistic
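The two-step aggregation described for flow_feature can be illustrated without any noise. The sketch below is not DPMon's implementation: the function name, the (user, value) input layout and the use of the population standard deviation are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def per_user_statistic(flows, metric="mean"):
    """flows: iterable of (user, value) pairs. Returns a statistic of per-user averages."""
    values_by_user = defaultdict(list)
    for user, value in flows:
        values_by_user[user].append(value)
    # Step 1: one average per user, so each user contributes a single value.
    user_averages = [mean(values) for values in values_by_user.values()]
    # Step 2: the requested statistic across users (no noise is added here).
    if metric == "mean":
        return mean(user_averages)
    if metric == "std":
        return pstdev(user_averages)
    raise ValueError(f"unsupported metric: {metric}")


flows = [("alice", 10.0), ("alice", 30.0), ("bob", 40.0)]
result = per_user_statistic(flows, "mean")
```

Here alice averages 20.0 and bob 40.0, so the per-user mean is 30.0, whereas a plain mean over the three flows would weight alice twice.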
- user_count_specific(ip=None, asn=None, domain=None, epsilon=1.0)
Compute the number of users on a given server IP, domain or ASN. In other words, it computes the number of users who, at least once, issued a flow to an endpoint with the given characteristics. The three filters are considered together (i.e., they form an AND clause). It returns two counts: the users who match the filter and the users who do not (see Returns for the order).
- Parameters:
ip (str) – The server IP to filter. Default: None
asn (int) – The server ASN to filter. Default: None
domain (str) – The server domain to filter. Default: None
epsilon (float) – The privacy budget to allocate for the query. Default: 1.0
- Returns:
A tuple (a, b), where a is the number of users who never matched the filter and b the number of users who did at least once.
- Return type:
tuple
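A noise-free sketch of the counting logic, assuming flows are available as dictionaries with user, ip, asn and domain keys (a hypothetical layout, not DPMon's internal one). The return order follows the Returns entry above: non-matching users first, matching users second.

```python
def count_users(flows, ip=None, asn=None, domain=None):
    """Return (non_matching_users, matching_users) for the given AND filter."""
    filters = {"ip": ip, "asn": asn, "domain": domain}
    # Only the filters that were actually set take part in the AND clause.
    active = {key: value for key, value in filters.items() if value is not None}
    all_users, matching_users = set(), set()
    for flow in flows:
        all_users.add(flow["user"])
        # A user matches if at least one of their flows satisfies every active filter.
        if all(flow[key] == value for key, value in active.items()):
            matching_users.add(flow["user"])
    return len(all_users) - len(matching_users), len(matching_users)


flows = [
    {"user": "u1", "ip": "192.0.2.1", "asn": 64500, "domain": "example.com"},
    {"user": "u1", "ip": "192.0.2.9", "asn": 64501, "domain": "example.org"},
    {"user": "u2", "ip": "192.0.2.1", "asn": 64500, "domain": "example.net"},
    {"user": "u3", "ip": "192.0.2.9", "asn": 64501, "domain": "example.org"},
]
```

Filtering on ip alone matches u1 and u2; adding domain="example.com" narrows the match to u1, because both conditions must hold on the same flow.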
- volume_historam(volume_direction='ingoing', count_flows=False, bins=10, range=None, epsilon=1.0)
Compute a histogram of the per-user traffic volume
- Parameters:
volume_direction (str) – Whether to compute ingress ("ingoing") or egress ("outgoing") volume, in bytes. Default: "ingoing"
count_flows (bool) – Count the number of flows instead of volume. If set, "volume_direction" is ignored. Default: False
bins (int) – Number of bins of the histogram. Default: 10
range ((float,float)) – The lower and upper range of the bins. Default: None
epsilon (float) – The privacy budget to allocate for the query. Default: 1.0
- Returns:
A tuple (histo, bin_edges), where histo is the histogram and bin_edges the bin boundaries
- Return type:
tuple
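To make the (histo, bin_edges) return shape concrete, here is a noise-free sketch that sums bytes per user and bins the totals. The function name, the (user, n_bytes) input layout and the binning convention (equal-width bins, last bin closed on the right) are assumptions; DPMon's own binning may differ.

```python
def per_user_histogram(flows, bins=10, value_range=None):
    """flows: iterable of (user, n_bytes) pairs. Returns (histo, bin_edges)."""
    volume = {}
    for user, n_bytes in flows:
        volume[user] = volume.get(user, 0) + n_bytes
    totals = list(volume.values())
    low, high = value_range if value_range is not None else (min(totals), max(totals))
    if high <= low:
        raise ValueError("the range must have a positive width")
    width = (high - low) / bins
    bin_edges = [low + i * width for i in range(bins + 1)]
    histo = [0] * bins
    for total in totals:
        if low <= total <= high:
            # Values equal to the upper bound fall into the last bin.
            index = min(int((total - low) / width), bins - 1)
            histo[index] += 1
    return histo, bin_edges


flows = [("u1", 100), ("u1", 200), ("u2", 50), ("u3", 1000)]
histo, bin_edges = per_user_histogram(flows, bins=4, value_range=(0, 1000))
```

Each user is counted once, in the bin of their total volume, so histo has bins entries and bin_edges has bins + 1.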
- volume_historam_specific(volume_direction='ingoing', count_flows=False, bins=10, range=None, ip=None, asn=None, domain=None, epsilon=1.0)
Compute a histogram of the per-user traffic volume on given server IP, domain or ASN. The three filters are considered together (i.e., they form an AND clause).
- Parameters:
volume_direction (str) – Whether to compute ingress ("ingoing") or egress ("outgoing") volume, in bytes. Default: "ingoing"
count_flows (bool) – Count the number of flows instead of volume. If set, "volume_direction" is ignored. Default: False
bins (int) – Number of bins of the histogram. Default: 10
range ((float,float)) – The lower and upper range of the bins. Default: None
ip (str) – The server IP to filter. Default: None
asn (int) – The server ASN to filter. Default: None
domain (str) – The server domain to filter. Default: None
epsilon (float) – The privacy budget to allocate for the query. Default: 1.0
- Returns:
A tuple (histo, bin_edges), where histo is the histogram and bin_edges the bin boundaries
- Return type:
tuple
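A possible way to call this method and pair each count with its bin boundaries. The helper name domain_volume_histogram is hypothetical, monitor is assumed to be an already constructed DPMon object, and the bin count and byte range are illustrative values rather than recommendations.

```python
def domain_volume_histogram(monitor, domain, epsilon=0.1):
    """Query a per-user ingress-volume histogram restricted to one server domain."""
    histo, bin_edges = monitor.volume_historam_specific(
        volume_direction="ingoing",
        bins=5,
        range=(0, 1_000_000),  # illustrative byte range
        domain=domain,
        epsilon=epsilon,
    )
    # Pair each count with its (lower, upper) bin boundaries.
    return list(zip(zip(bin_edges[:-1], bin_edges[1:]), histo))
```

Each element of the returned list has the form ((lower, upper), count), which is convenient for printing or plotting.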
- volume_on_asn(asn, volume_direction='ingoing', count_flows=False, epsilon=1.0)
Obtain the traffic volume to/from a specific Autonomous System
- Parameters:
asn (int) – The AS number to query
volume_direction (str) – Whether to compute ingress ("ingoing") or egress ("outgoing") volume, in bytes. Default: "ingoing"
count_flows (bool) – Count the number of flows instead of volume. If set, "volume_direction" is ignored. Default: False
epsilon (float) – The privacy budget to allocate for the query. Default: 1.0
- Returns:
The volume in bytes or the number of flows
- Return type:
int
- volume_on_domain(domain, volume_direction='ingoing', count_flows=False, epsilon=1.0)
Obtain the traffic volume to/from a specific domain
- Parameters:
domain (str) – The domain name to query
volume_direction (str) – Whether to compute ingress ("ingoing") or egress ("outgoing") volume, in bytes. Default: "ingoing"
count_flows (bool) – Count the number of flows instead of volume. If set, "volume_direction" is ignored. Default: False
epsilon (float) – The privacy budget to allocate for the query. Default: 1.0
- Returns:
The volume in bytes or the number of flows
- Return type:
int
- volume_on_domain_pattern(pattern, volume_direction='ingoing', count_flows=False, epsilon=1.0)
Obtain the traffic volume to/from a specific domain pattern. The function searches for flows to a domain that matches the given pattern. Patterns are defined in the SQL style: for example, the % character represents any string of zero or more characters, while _ represents any single character. This function is useful to obtain the traffic volume by second-level domain, e.g., %.googlevideo.com
- Parameters:
pattern (str) – The domain pattern (in SQL syntax) to query
volume_direction (str) – Whether to compute ingress ("ingoing") or egress ("outgoing") volume, in bytes. Default: "ingoing"
count_flows (bool) – Count the number of flows instead of volume. If set, "volume_direction" is ignored. Default: False
epsilon (float) – The privacy budget to allocate for the query. Default: 1.0
- Returns:
The volume in bytes or the number of flows
- Return type:
int
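The SQL-style wildcards described for pattern can be checked locally before spending privacy budget. The sketch below translates a pattern into a regular expression; the helper name like_match is hypothetical, and case-sensitive matching is an assumption, since the engine's exact behavior is not stated here.

```python
import re


def like_match(pattern, text):
    """Return True if text matches the SQL LIKE-style pattern."""
    parts = []
    for char in pattern:
        if char == "%":
            parts.append(".*")  # any string of zero or more characters
        elif char == "_":
            parts.append(".")   # exactly one character
        else:
            parts.append(re.escape(char))  # everything else is literal
    return re.fullmatch("".join(parts), text, flags=re.DOTALL) is not None
```

Note that %.googlevideo.com requires a dot before googlevideo.com, so under this translation it matches subdomains such as r1.googlevideo.com but not the bare googlevideo.com.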
- volume_on_ip(ip, volume_direction='ingoing', count_flows=False, epsilon=1.0)
Obtain the traffic volume to/from a specific external IP address
- Parameters:
ip (str) – The IP address to query
volume_direction (str) – Whether to compute ingress ("ingoing") or egress ("outgoing") volume, in bytes. Default: "ingoing"
count_flows (bool) – Count the number of flows instead of volume. If set, "volume_direction" is ignored. Default: False
epsilon (float) – The privacy budget to allocate for the query. Default: 1.0
- Returns:
The volume in bytes or the number of flows
- Return type:
int
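Because every query takes its own epsilon while the accountant bounds the total, one way to run several queries is to split a fixed budget among them. The sketch below assumes that the epsilons of successive queries add up against the accountant's limit; the helper name volumes_for_ips is hypothetical, and monitor is assumed to be an already constructed DPMon object.

```python
def volumes_for_ips(monitor, ips, total_epsilon=1.0):
    """Query the ingress volume of several server IPs, splitting one budget evenly."""
    if not ips:
        return {}
    # Each query receives an equal share of the overall budget.
    per_query_epsilon = total_epsilon / len(ips)
    return {
        ip: monitor.volume_on_ip(
            ip, volume_direction="ingoing", epsilon=per_query_epsilon
        )
        for ip in ips
    }
```

With four IP addresses and total_epsilon=1.0, each call is issued with epsilon=0.25. Smaller per-query budgets generally mean noisier answers, so the number of queries is a trade-off against accuracy.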