dpmon

Classes

DPMon(path, data_format, accountant[, ...])

The main class for DPMon.

class dpmon.DPMon(path, data_format, accountant, engine='local', spark=None, direction='outgoing', ipasn_db=None, head=None)

The main class for DPMon. Create an object of the DPMon class to make private queries.

Parameters:
  • path (str) – The path to the data to be analyzed. When using the local engine, path can be a single string or a list of paths. When using the spark engine, the path is in the Spark format, and can therefore include * and {...} expressions.

  • data_format (str) – The data format. Must be either tstat or nfdump.

  • accountant (diffprivlib.BudgetAccountant) – A DiffPrivLib BudgetAccountant that specifies the privacy budget to limit the information it is possible to extract from the data. Create, for example, with: diffprivlib.BudgetAccountant(epsilon=1.0)

  • engine (str) – Engine to be used: local or spark. Default: "local"

  • spark (spark.sql.SparkSession) – In case engine = "spark", you must provide a SparkSession as a Spark entrypoint. Default: None

  • direction (str) – Whether to focus on outgoing flows (those issued by internal clients to the Internet) or ingoing flows (those issued by any Internet endpoint towards an internal client). See the documentation for an explanation. Default: "outgoing"

  • ipasn_db (str) – the path of a file in pyasn format, used to map IP addresses to the corresponding ASN. If the file is provided, it is possible to make queries based on ASN - e.g., the volume to a specific ASN. Default: None

  • head (int) – Truncate the data to head lines. Useful for debugging. Default: None
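
A minimal construction sketch, assuming the dpmon and diffprivlib packages are installed. The helper name make_monitor, its total_epsilon argument and the head value are illustrative choices, not part of the library. Imports are deferred into the function so the snippet loads even where the packages are missing.

```python
def make_monitor(path, total_epsilon=1.0):
    """Build a DPMon object over Tstat logs using the local engine.

    `make_monitor` and `total_epsilon` are hypothetical names used only
    for this example.
    """
    import diffprivlib
    import dpmon

    # The accountant holds the overall privacy budget for the queries.
    accountant = diffprivlib.BudgetAccountant(epsilon=total_epsilon)
    return dpmon.DPMon(
        path,                  # a str, or a list of paths with the local engine
        "tstat",               # data_format: "tstat" or "nfdump"
        accountant,
        engine="local",
        direction="outgoing",
        head=1000,             # truncate the input while debugging
    )
```

Calling make_monitor with the path of a Tstat log should return an object on which the query methods documented below can be invoked.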

flow_feature(feature, metric, ip=None, asn=None, domain=None, epsilon=1.0, percent=None, bins=10, range=None)

Extract statistics on a flow feature (i.e., a column of the log). This can be used only with Tstat data and on a subset of columns. DPMon first computes the per-user average of the flow feature, then applies the requested statistic (for example mean, standard deviation or percentile) across those per-user averages. This two-step behavior is mandatory because Differential Privacy must protect users, not flows. You can filter by server IP, ASN or domain. The three filters are considered together (i.e., they form an AND clause).

Parameters:
  • feature (str) – The flow feature to compute statistics on.

  • metric (str) – The metric to compute. Can be mean, std, histogram or percentile

  • percent (float) – The percentile to compute in case metric==percentile

  • ip (str) – The server IP to filter. Default: None

  • asn (int) – The server ASN to filter. Default: None

  • domain (str) – The server domain to filter. Default: None

  • epsilon (float) – The privacy budget to allocate for the query. Default: 1.0

  • bins (int) – Number of bins of the histogram in case metric==histogram. Default: 10

  • range ((float,float)) – The lower and upper range of the bins of the histogram in case metric==histogram. Default: None

Returns:

The desired statistic
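
The two-stage aggregation described for flow_feature can be sketched in plain Python. This is a non-private illustration only: it adds no noise and is not DPMon's implementation. The function name and the sample data are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def per_user_then_mean(flows):
    """flows: iterable of (user, value) pairs. Returns the mean of per-user means."""
    values_by_user = defaultdict(list)
    for user, value in flows:
        values_by_user[user].append(value)
    # Stage 1: collapse each user's flows into a single average value.
    per_user = [mean(values) for values in values_by_user.values()]
    # Stage 2: apply the requested statistic across users (here, the mean).
    return mean(per_user)


flows = [("alice", 10.0), ("alice", 30.0), ("bob", 40.0)]
result = per_user_then_mean(flows)  # alice averages 20.0, bob 40.0
```

Because each user contributes exactly one value to the second stage, a user with many flows weighs the same as a user with few. This is why the result (30.0 here) can differ from the plain mean over all flows.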

user_count_specific(ip=None, asn=None, domain=None, epsilon=1.0)

Compute the number of users on a given server IP, domain or ASN. In other words, it computes the number of users who, at least once, issued a flow to an endpoint with the given characteristics. The three filters are considered together (i.e., they form an AND clause). It returns both the number of users matching the filter and the number of users not matching it.

Parameters:
  • ip (str) – The server IP to filter. Default: None

  • asn (int) – The server ASN to filter. Default: None

  • domain (str) – The server domain to filter. Default: None

  • epsilon (float) – The privacy budget to allocate for the query. Default: 1.0

Returns:

A tuple (a, b), where a is the number of users who never matched the filter and b the number of users who did at least once.

Return type:

tuple
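
The AND semantics of the three filters can be illustrated with a small non-private sketch. It assumes that a filter left as None is ignored, and it orders the tuple as in the Returns description above. The function name, the record layout and the sample values are hypothetical.

```python
def count_users(flows, ip=None, asn=None, domain=None):
    """flows: iterable of dicts with keys 'user', 'ip', 'asn', 'domain'.

    Returns (non_matching, matching). Filters left as None are ignored;
    the remaining ones must all hold on the same flow (AND clause).
    """
    all_users, matching_users = set(), set()
    for flow in flows:
        all_users.add(flow["user"])
        if ((ip is None or flow["ip"] == ip)
                and (asn is None or flow["asn"] == asn)
                and (domain is None or flow["domain"] == domain)):
            # One matching flow is enough for the user to count as matching.
            matching_users.add(flow["user"])
    return len(all_users) - len(matching_users), len(matching_users)


flows = [
    {"user": "alice", "ip": "192.0.2.1", "asn": 64500, "domain": "a.example.org"},
    {"user": "alice", "ip": "192.0.2.2", "asn": 64501, "domain": "b.example.org"},
    {"user": "bob", "ip": "192.0.2.2", "asn": 64501, "domain": "b.example.org"},
    {"user": "carol", "ip": "192.0.2.1", "asn": 64500, "domain": "a.example.org"},
]
by_asn = count_users(flows, asn=64501)
by_ip_and_asn = count_users(flows, ip="192.0.2.2", asn=64500)
```

In the second call no single flow satisfies both conditions, so no user matches even though each condition is met by some flow.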

volume_historam(volume_direction='ingoing', count_flows=False, bins=10, range=None, epsilon=1.0)

Compute a histogram of the per-user traffic volume

Parameters:
  • volume_direction (str) – Whether to compute ingress ("ingoing") or egress ("outgoing") volume, in bytes. Default: "ingoing"

  • count_flows (bool) – Count the number of flows instead of volume. If set, "volume_direction" is ignored. Default: False

  • bins (int) – Number of bins of the histogram. Default: 10

  • range ((float,float)) – The lower and upper range of the bins. Default: None

  • epsilon (float) – The privacy budget to allocate for the query. Default: 1.0

Returns:

A tuple (histo, bin_edges), where histo is the histogram and bin_edges the boundaries

Return type:

tuple
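
As a rough, non-private sketch of what a per-user volume histogram contains, the snippet below sums bytes per user and then bins the per-user totals into equal-width bins. It adds no noise and is not DPMon's implementation. The name per_user_histogram is hypothetical, and the parameter is called value_range here only to avoid shadowing Python's built-in range.

```python
from collections import defaultdict


def per_user_histogram(flows, bins=10, value_range=None):
    """flows: iterable of (user, nbytes). Returns (histo, bin_edges)."""
    volume = defaultdict(int)
    for user, nbytes in flows:
        volume[user] += nbytes          # one total per user
    totals = list(volume.values())
    low, high = value_range if value_range is not None else (min(totals), max(totals))
    if high == low:
        high = low + 1                  # avoid zero-width bins
    width = (high - low) / bins
    bin_edges = [low + i * width for i in range(bins + 1)]
    histo = [0] * bins
    for total in totals:
        if low <= total <= high:
            # The last bin also includes its upper edge.
            index = min(int((total - low) / width), bins - 1)
            histo[index] += 1
    return histo, bin_edges


flows = [("a", 100), ("a", 300), ("b", 50), ("c", 950), ("d", 1000)]
histo, bin_edges = per_user_histogram(flows, bins=4, value_range=(0, 1000))
```

Each user is counted exactly once, in the bin that contains that user's total volume. Totals outside the requested range are dropped in this sketch.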

volume_historam_specific(volume_direction='ingoing', count_flows=False, bins=10, range=None, ip=None, asn=None, domain=None, epsilon=1.0)

Compute a histogram of the per-user traffic volume on given server IP, domain or ASN. The three filters are considered together (i.e., they form an AND clause).

Parameters:
  • volume_direction (str) – Whether to compute ingress ("ingoing") or egress ("outgoing") volume, in bytes. Default: "ingoing"

  • count_flows (bool) – Count the number of flows instead of volume. If set, "volume_direction" is ignored. Default: False

  • bins (int) – Number of bins of the histogram. Default: 10

  • range ((float,float)) – The lower and upper range of the bins. Default: None

  • ip (str) – The server IP to filter. Default: None

  • asn (int) – The server ASN to filter. Default: None

  • domain (str) – The server domain to filter. Default: None

  • epsilon (float) – The privacy budget to allocate for the query. Default: 1.0

Returns:

A tuple (histo, bin_edges), where histo is the histogram and bin_edges the boundaries

Return type:

tuple

volume_on_asn(asn, volume_direction='ingoing', count_flows=False, epsilon=1.0)

Obtain the traffic volume to/from a specific Autonomous System

Parameters:
  • asn (int) – The AS number to query

  • volume_direction (str) – Whether to compute ingress ("ingoing") or egress ("outgoing") volume, in bytes. Default: "ingoing"

  • count_flows (bool) – Count the number of flows instead of volume. If set, "volume_direction" is ignored. Default: False

  • epsilon (float) – The privacy budget to allocate for the query. Default: 1.0

Returns:

The volume in bytes or the number of flows

Return type:

int

volume_on_domain(domain, volume_direction='ingoing', count_flows=False, epsilon=1.0)

Obtain the traffic volume to/from a specific domain

Parameters:
  • domain (str) – The domain name to query

  • volume_direction (str) – Whether to compute ingress ("ingoing") or egress ("outgoing") volume, in bytes. Default: "ingoing"

  • count_flows (bool) – Count the number of flows instead of volume. If set, "volume_direction" is ignored. Default: False

  • epsilon (float) – The privacy budget to allocate for the query. Default: 1.0

Returns:

The volume in bytes or the number of flows

Return type:

int

volume_on_domain_pattern(pattern, volume_direction='ingoing', count_flows=False, epsilon=1.0)

Obtain the traffic volume to/from domains matching a specific pattern. The function searches for flows to a domain that matches the given pattern. Patterns are defined in the SQL style: for example, the % character represents any string of zero or more characters, while _ represents any single character. This function is useful to obtain the traffic volume by second-level domain, e.g., %.googlevideo.com

Parameters:
  • pattern (str) – The domain pattern (in SQL syntax) to query

  • volume_direction (str) – Whether to compute ingress ("ingoing") or egress ("outgoing") volume, in bytes. Default: "ingoing"

  • count_flows (bool) – Count the number of flows instead of volume. If set, "volume_direction" is ignored. Default: False

  • epsilon (float) – The privacy budget to allocate for the query. Default: 1.0

Returns:

The volume in bytes or the number of flows

Return type:

int
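
To make the SQL-style pattern semantics concrete, the sketch below translates a pattern into a Python regular expression and tests a few domain names. It only illustrates how % and _ behave. It is not how DPMon evaluates patterns, it does not handle escape characters, and the helper names are hypothetical.

```python
import re


def like_to_regex(pattern):
    """Translate a SQL LIKE-style pattern into a compiled regular expression."""
    parts = []
    for char in pattern:
        if char == "%":
            parts.append(".*")           # any string of zero or more characters
        elif char == "_":
            parts.append(".")            # exactly one character
        else:
            parts.append(re.escape(char))
    return re.compile("".join(parts), re.DOTALL)


def like_match(pattern, domain):
    """Return True if the whole domain name matches the pattern."""
    return like_to_regex(pattern).fullmatch(domain) is not None


subdomain = like_match("%.googlevideo.com", "r3.googlevideo.com")
bare = like_match("%.googlevideo.com", "googlevideo.com")
single = like_match("r_.example.org", "r12.example.org")
```

Note that %.googlevideo.com requires the literal dot, so it matches subdomains but not the bare name googlevideo.com.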

volume_on_ip(ip, volume_direction='ingoing', count_flows=False, epsilon=1.0)

Obtain the traffic volume to/from a specific external IP address

Parameters:
  • ip (str) – The IP address to query

  • volume_direction (str) – Whether to compute ingress ("ingoing") or egress ("outgoing") volume, in bytes. Default: "ingoing"

  • count_flows (bool) – Count the number of flows instead of volume. If set, "volume_direction" is ignored. Default: False

  • epsilon (float) – The privacy budget to allocate for the query. Default: 1.0

Returns:

The volume in bytes or the number of flows

Return type:

int