helpers module

helpers.get_avg_reward(df, seeds, num_repeats)

Simulate the influence propagation using the IC model num_repeats times and average the number of influenced nodes

Parameters

df (pandas.DataFrame) – The graph to run the IC model on, as a DataFrame. Each row represents one edge, with columns named “source”, “target”, and “probab”. The “probab” column contains the activation probability.
seeds (list, pandas.Series) – A list of the nodes to start propagating from.
num_repeats (int) – Specifies how many times we want to simulate the propagation with IC.

Returns

avg_reward – Number showing how many nodes were influenced on average

Return type

float
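
To make the averaging concrete, here is a minimal sketch of an Independent Cascade simulation repeated num_repeats times. It is not the implementation behind get_avg_reward: the names simulate_ic and average_reward are hypothetical, edges are plain (source, target, probab) tuples standing in for DataFrame rows, and counting the seed nodes as influenced is an assumption.

```python
import random
from collections import defaultdict


def simulate_ic(edges, seeds, rng):
    """Run one Independent Cascade pass and return the set of activated nodes."""
    out_edges = defaultdict(list)
    for source, target, probab in edges:
        out_edges[source].append((target, probab))

    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for node in frontier:
            for target, probab in out_edges[node]:
                # A newly activated node gets one attempt per outgoing edge.
                if target not in active and rng.random() < probab:
                    active.add(target)
                    next_frontier.append(target)
        frontier = next_frontier
    return active


def average_reward(edges, seeds, num_repeats, seed=0):
    """Average the number of activated nodes over num_repeats simulations."""
    rng = random.Random(seed)
    total = sum(len(simulate_ic(edges, seeds, rng)) for _ in range(num_repeats))
    return total / num_repeats


# Probabilities of 1.0 and 0.0 make this toy graph deterministic.
edges = [(1, 2, 1.0), (2, 3, 1.0), (3, 4, 0.0)]
avg = average_reward(edges, seeds=[1], num_repeats=10)
```

With real activation probabilities the result varies from run to run, which is why a larger num_repeats gives a steadier estimate.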

helpers.get_stats_reward(df, seeds, num_repeats)

Simulate the influence propagation using the IC model num_repeats times and report the mean and standard deviation of the number of influenced nodes

Parameters

df (pandas.DataFrame) – The graph to run the IC model on, as a DataFrame. Each row represents one edge, with columns named “source”, “target”, and “probab”. The “probab” column contains the activation probability.
seeds (list, pandas.Series) – A list of the nodes to start propagating from.
num_repeats (int) – Specifies how many times we want to simulate the propagation with IC.

Returns

avg_reward (float) – The average number of nodes influenced across the num_repeats runs
std_reward (float) – Standard deviation of the number of influenced nodes across the num_repeats runs
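
As a small illustration of the two returned values, the sketch below summarizes a list of per-run rewards. The function name summarize_rewards and the sample rewards are hypothetical, and the use of the population standard deviation is an assumption, since the documentation does not say which estimator is used.

```python
import statistics


def summarize_rewards(rewards):
    """Return (mean, population standard deviation) of per-run rewards."""
    avg_reward = statistics.mean(rewards)
    # Assumption: population standard deviation; a sample estimate
    # (statistics.stdev) would divide by n - 1 instead of n.
    std_reward = statistics.pstdev(rewards)
    return avg_reward, std_reward


# Hypothetical rewards from eight simulated runs.
rewards = [2, 4, 4, 4, 5, 5, 7, 9]
avg_reward, std_reward = summarize_rewards(rewards)
```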

helpers.run_algorithm(setup_dict)

Run an algorithm in parallel

This is meant to be used together with joblib.Parallel to run different IM algorithms at the same time.

Parameters

setup_dict (dict) – A dictionary containing the following keys:
* “function”: the function to run
* “algo_name”: the name of the algorithm represented by the function
* “args”: positional arguments for that function
* “kwargs”: keyword arguments for that function

Returns

result_dict – A dictionary containing the following keys:
* “results”: the results generated by the function
* “algo_name”: the name of the algorithm represented by the function
* “kwargs”: the keyword arguments passed to the function

Return type

dict

Examples

>>> import joblib
>>> from helpers import run_algorithm
>>> setup_array = []
>>> setup_array.append(
...    {
...        "algo_name": "timlinucb",
...        "function": timlinucb_parallel_t,
...        "args": [DATASET, DATASET_FEATS, DATASET_TIMES, DATASET_NODES],
...        "kwargs": {
...            "num_seeds": NUM_SEEDS_TO_FIND,
...            "num_repeats_oim": OPTIMAL_NUM_REPEATS_OIM_TLU,
...            "num_repeats_oim_reward": OPTIMAL_NUM_REPEATS_REW_TLU,
...            "sigma": OPTIMAL_SIGMA_TLU,
...            "c": OPTIMAL_C_TLU,
...            "epsilon": OPTIMAL_EPS_TLU,
...        },
...    }
... )
>>> results_array = joblib.Parallel(n_jobs=len(setup_array))(
...     joblib.delayed(run_algorithm)(setup_dict) for setup_dict in setup_array
... )
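
The input and output dictionaries described above suggest a thin wrapper around the supplied function. The following is a minimal sketch of such a wrapper under that reading, not the module's actual code; run_algorithm_sketch and toy_algorithm are hypothetical names used only for illustration.

```python
def run_algorithm_sketch(setup_dict):
    """Call setup_dict["function"] and package its output with the metadata."""
    results = setup_dict["function"](*setup_dict["args"], **setup_dict["kwargs"])
    return {
        "results": results,
        "algo_name": setup_dict["algo_name"],
        "kwargs": setup_dict["kwargs"],
    }


def toy_algorithm(values, scale=1):
    """Hypothetical stand-in for an IM algorithm."""
    return [value * scale for value in values]


setup = {
    "algo_name": "toy",
    "function": toy_algorithm,
    "args": [[1, 2, 3]],
    "kwargs": {"scale": 2},
}
result = run_algorithm_sketch(setup)
```

Returning the algorithm name and keyword arguments alongside the results lets the caller tell the entries apart once several algorithms have been run side by side.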

helpers.run_ic_eff(df_graph, seed_nodes)

Simulate the influence propagation using the IC model

Parameters

df_graph (pandas.DataFrame) – The graph to run the IC model on, as a DataFrame. Each row represents one edge, with columns named “source”, “target”, and “probab”. The “probab” column contains the activation probability.
seed_nodes (list, pandas.Series) – A list of the nodes to start propagating from.

Returns

results – A tuple of the following numpy arrays:
* Affected nodes
* Activated edges
* Observed edges

Return type

tuple
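
The three returned arrays can be illustrated with a small standard-library sketch. It is not the implementation of run_ic_eff: the name run_ic_sketch is hypothetical, it returns lists rather than numpy arrays, and it assumes that an edge counts as "observed" when its source node became active and as "activated" when its activation attempt succeeded.

```python
import random


def run_ic_sketch(edges, seed_nodes, rng):
    """Return (affected nodes, activated edges, observed edges) for one IC run."""
    active = set(seed_nodes)
    frontier = list(seed_nodes)
    activated_edges, observed_edges = [], []
    while frontier:
        next_frontier = []
        for node in frontier:
            for source, target, probab in edges:
                if source != node:
                    continue
                # Assumption: every outgoing edge of an active node is observed.
                observed_edges.append((source, target))
                if rng.random() < probab:
                    activated_edges.append((source, target))
                    if target not in active:
                        active.add(target)
                        next_frontier.append(target)
        frontier = next_frontier
    return sorted(active), activated_edges, observed_edges


# Probabilities of 1.0 and 0.0 make this toy graph deterministic.
edges = [(1, 2, 1.0), (1, 3, 0.0), (2, 4, 1.0)]
affected, activated, observed = run_ic_sketch(edges, [1], random.Random(0))
```

In this toy run the edge (1, 3) is observed but not activated, which is the distinction the last two arrays capture.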

helpers.run_ic_nodes(df_graph, seed_nodes)

Simulate the influence propagation using the IC model

Parameters

df_graph (pandas.DataFrame) – The graph to run the IC model on, as a DataFrame. Each row represents one edge, with columns named “source”, “target”, and “probab”. The “probab” column contains the activation probability.
seed_nodes (list, pandas.Series) – A list of the nodes to start propagating from.

Returns

affected_nodes – Nodes influenced by propagating the seed nodes.

Return type

numpy.ndarray

helpers.tim(df, num_nodes, num_edges, num_inf, epsilon, temp_dir='temp_dir', out_pattern=re.compile('Selected k SeedSet: (.+?) \\n'))

Run the Offline IM algorithm, TIM

Parameters

df (pandas.DataFrame) – The graph to run TIM on, as a DataFrame. Each row represents one edge, with columns named “source”, “target”, and “probab”. The “probab” column contains the activation probability.
num_nodes (int) – Number of nodes to pass into TIM.
num_edges (int) – Number of edges to pass into TIM.
num_inf (int) – Number of seed nodes to find.
epsilon (float) – A hyperparameter for TIM. Refer to the paper for more details. [1]
temp_dir (str, optional) – A temporary directory to run TIM in. Default: “temp_dir”
out_pattern (re.Pattern, optional) – Regex pattern that extracts the TIM results from its output. Default: re.compile('Selected k SeedSet: (.+?) \\n')

Returns

seeds – The seed nodes found by TIM that maximize influence

Return type

list

References

1: Tang, Youze, Xiaokui Xiao, and Yanchen Shi. “Influence maximization: Near-optimal time complexity meets practical efficiency.” Proceedings of the 2014 ACM SIGMOD international conference on Management of data. 2014.
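
The default out_pattern can be tried on its own to see what it captures. In the sketch below the pattern is taken from the signature above, while sample_output is a hypothetical string made up for illustration (not real TIM output), and converting the captured node ids to int is an assumption.

```python
import re

# Default pattern from the function signature.
out_pattern = re.compile("Selected k SeedSet: (.+?) \\n")

# Hypothetical program output used only to exercise the pattern.
sample_output = "some log line\nSelected k SeedSet: 12 7 45 \nanother log line\n"

match = out_pattern.search(sample_output)
# The captured group holds the space-separated node ids.
seeds = [int(node) for node in match.group(1).split()] if match else []
```

The pattern expects a space before the newline, so output that lacks the trailing space would need a different out_pattern.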

helpers.tim_parallel(df, num_nodes, num_edges, num_inf, epsilon, tim_file='tim', temp_dir='temp_dir', out_pattern=re.compile('Selected k SeedSet: (.+?) \\n'))

Run the Offline IM algorithm, TIM, in parallel

Parameters

df (pandas.DataFrame) – The graph to run TIM on, as a DataFrame. Each row represents one edge, with columns named “source”, “target”, and “probab”. The “probab” column contains the activation probability.
num_nodes (int) – Number of nodes to pass into TIM.
num_edges (int) – Number of edges to pass into TIM.
num_inf (int) – Number of seed nodes to find.
epsilon (float) – A hyperparameter for TIM. Refer to the paper for more details. [1]
temp_dir (str, optional) – A temporary directory to run TIM in. Default: “temp_dir”
tim_file (str, optional) – Path to the TIM executable to use. This parameter exists because parallel processing requires separate copies of the TIM executable, so that the processes do not contend for a single file. Default: “tim”
out_pattern (re.Pattern, optional) – Regex pattern that extracts the TIM results from its output. Default: re.compile('Selected k SeedSet: (.+?) \\n')

Returns

seeds – The seed nodes found by TIM that maximize influence

Return type

list

References

1: Tang, Youze, Xiaokui Xiao, and Yanchen Shi. “Influence maximization: Near-optimal time complexity meets practical efficiency.” Proceedings of the 2014 ACM SIGMOD international conference on Management of data. 2014.

helpers.tim_t(df_edges, nodes, times, num_seeds=5, num_repeats_reward=20, epsilon=0.4)

Run the Offline IM algorithm, TIM, on every time step in a network

Parameters

df_edges (pandas.DataFrame) – The graph to run TIM on, as a DataFrame. Each row represents one edge, with columns named “source”, “target”, “probab”, and “day”. The “probab” column contains the activation probability, and “day” should correspond to the days specified in times.
nodes (pandas.Series, list) – A sorted list of all unique node ids in the graph.
times (pandas.Series, list) – A list of the times to run the algorithm on. This is useful if we don’t want to run TIM on every single time step in the graph.
num_seeds (int, optional) – Number of seed nodes to find. Default: 5
num_repeats_reward (int, optional) – Number of times we will try propagating the obtained seed nodes using the IC model to get the reward. The reward is then averaged over the runs. Default: 20
epsilon (float, optional) – A hyperparameter for TIM. Refer to the paper for more details. [1] Default: 0.4

Returns

results – A DataFrame with the following columns:
* time: the time step of the result
* reward: the average reward obtained over num_repeats_reward runs
* selected: the list of selected seed nodes

Return type

pandas.DataFrame

References

1: Tang, Youze, Xiaokui Xiao, and Yanchen Shi. “Influence maximization: Near-optimal time complexity meets practical efficiency.” Proceedings of the 2014 ACM SIGMOD international conference on Management of data. 2014.
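
The per-time-step loop and the shape of the results can be sketched as follows. This is an illustration only: pick_by_out_degree is a placeholder seed selector and not TIM, run_per_time_step and the estimate_reward callable are hypothetical, and selecting the edges whose “day” equals the time step is an assumption (the documentation does not say whether earlier days are included).

```python
from collections import Counter


def pick_by_out_degree(edges, num_seeds):
    """Placeholder seed selector (NOT TIM): nodes with the most outgoing edges."""
    counts = Counter(source for source, _target, _probab in edges)
    return [node for node, _count in counts.most_common(num_seeds)]


def run_per_time_step(edges, times, num_seeds, estimate_reward):
    """Build one result row (time, reward, selected) per requested time step."""
    results = []
    for t in times:
        # Assumption: only edges whose "day" equals t belong to time step t.
        edges_t = [(s, tgt, p) for s, tgt, p, day in edges if day == t]
        selected = pick_by_out_degree(edges_t, num_seeds)
        results.append(
            {"time": t, "reward": estimate_reward(edges_t, selected), "selected": selected}
        )
    return results


# Rows are (source, target, probab, day) tuples standing in for DataFrame rows.
edges = [(1, 2, 0.5, 0), (1, 3, 0.5, 0), (2, 3, 0.5, 0), (3, 1, 0.5, 1)]
rows = run_per_time_step(
    edges,
    times=[0, 1],
    num_seeds=1,
    # Trivial stand-in for an averaged IC reward estimate.
    estimate_reward=lambda edges_t, selected: float(len(selected)),
)
```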

helpers.tim_t_parallel(df_edges, nodes, times, num_seeds=5, num_repeats_reward=20, epsilon=0.4, process_id=1)

Run the Offline IM algorithm, TIM, on every time step in a network in parallel

Parameters

df_edges (pandas.DataFrame) – The graph to run TIM on, as a DataFrame. Each row represents one edge, with columns named “source”, “target”, “probab”, and “day”. The “probab” column contains the activation probability, and “day” should correspond to the days specified in times.
nodes (pandas.Series, list) – A sorted list of all unique node ids in the graph.
times (pandas.Series, list) – A list of the times to run the algorithm on. This is useful if we don’t want to run TIM on every single time step in the graph.
num_seeds (int, optional) – Number of seed nodes to find. Default: 5
num_repeats_reward (int, optional) – Number of times we will try propagating the obtained seed nodes using the IC model to get the reward. The reward is then averaged over the runs. Default: 20
epsilon (float, optional) – A hyperparameter for TIM. Refer to the paper for more details. [1] Default: 0.4
process_id (int or str, optional) – An identifier used in distinguishing the temporary TIM executable from others. Default: 1

Returns

results – A DataFrame with the following columns:
* time: the time step of the result
* reward: the average reward obtained over num_repeats_reward runs
* selected: the list of selected seed nodes

Return type

pandas.DataFrame

References

1: Tang, Youze, Xiaokui Xiao, and Yanchen Shi. “Influence maximization: Near-optimal time complexity meets practical efficiency.” Proceedings of the 2014 ACM SIGMOD international conference on Management of data. 2014.

helpers.tim_t_parallel_run(df_edges, nodes, times, num_seeds=5, num_repeats_reward=20, epsilon=0.4, process_id=1, max_jobs=-2, hide_tqdm=True)

Run the Offline IM algorithm, TIM, on every time step in a network in parallel

Unlike tim_t_parallel, which is designed to be one part of a parallel pipeline, tim_t_parallel_run itself executes TIM in parallel.

Parameters

df_edges (pandas.DataFrame) – The graph to run TIM on, as a DataFrame. Each row represents one edge, with columns named “source”, “target”, “probab”, and “day”. The “probab” column contains the activation probability, and “day” should correspond to the days specified in times.
nodes (pandas.Series, list) – A sorted list of all unique node ids in the graph.
times (pandas.Series, list) – A list of the times to run the algorithm on. This is useful if we don’t want to run TIM on every single time step in the graph.
num_seeds (int, optional) – Number of seed nodes to find. Default: 5
num_repeats_reward (int, optional) – Number of times we will try propagating the obtained seed nodes using the IC model to get the reward. The reward is then averaged over the runs. Default: 20
epsilon (float, optional) – A hyperparameter for TIM. Refer to the paper for more details. [1] Default: 0.4
process_id (int or str, optional) – An identifier used in distinguishing the temporary TIM executable from others. Default: 1

Returns

results – A DataFrame with the following columns:
* time: the time step of the result
* reward: the average reward obtained over num_repeats_reward runs
* selected: the list of selected seed nodes

Return type

pandas.DataFrame

References

1: Tang, Youze, Xiaokui Xiao, and Yanchen Shi. “Influence maximization: Near-optimal time complexity meets practical efficiency.” Proceedings of the 2014 ACM SIGMOD international conference on Management of data. 2014.

helpers.tqdm_joblib(tqdm_object)

Context manager that patches joblib to report progress into the tqdm progress bar given as an argument

Parameters: tqdm_object (Object) – The tqdm object to parallelize