helpers module
-
helpers.get_avg_reward(df, seeds, num_repeats)
Simulate influence propagation with the Independent Cascade (IC) model and return the average reward
- Parameters
df (pandas.DataFrame) – The graph we run the IC on, in the form of a DataFrame. A row represents one edge in the graph, with columns being named “source”, “target”, “probab”. “probab” column contains the activation probability.
seeds (list, pandas.Series) – A list of the nodes to start propagating from.
num_repeats (int) – Specifies how many times we want to simulate the propagation with IC.
- Returns
avg_reward – Number showing how many nodes were influenced on average
- Return type
float
-
helpers.get_stats_reward(df, seeds, num_repeats)
Simulate influence propagation with the IC model and return the mean and standard deviation of the reward
- Parameters
df (pandas.DataFrame) – The graph we run the IC on, in the form of a DataFrame. A row represents one edge in the graph, with columns being named “source”, “target”, “probab”. “probab” column contains the activation probability.
seeds (list, pandas.Series) – A list of the nodes to start propagating from.
num_repeats (int) – Specifies how many times we want to simulate the propagation with IC.
- Returns
avg_reward (float) – Number showing how many nodes were influenced on average
std_reward (float) – Standard deviation of the reward across the simulated runs
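The averaging described above can be sketched as follows. This is a minimal illustration, not the module's implementation: `reward_stats` and `simulate_once` are hypothetical names, the toy simulation stands in for a real IC run, and the choice of the population standard deviation is an assumption.

```python
import random
import statistics


def reward_stats(simulate_once, num_repeats):
    """Return (mean, std) of the reward over num_repeats simulations.

    simulate_once is a hypothetical zero-argument callable that runs one
    propagation and returns the number of influenced nodes.
    """
    rewards = [simulate_once() for _ in range(num_repeats)]
    # Population standard deviation: an assumption, the module may differ.
    return statistics.mean(rewards), statistics.pstdev(rewards)


# Toy stand-in for one IC run: each of 10 nodes is influenced
# independently with probability 0.3.
rng = random.Random(0)


def toy_run():
    return sum(rng.random() < 0.3 for _ in range(10))


avg_reward, std_reward = reward_stats(toy_run, num_repeats=100)
print(avg_reward, std_reward)
```

A larger num_repeats gives a more stable average at the cost of proportionally more simulation time.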
-
helpers.run_algorithm(setup_dict)
Run an algorithm as one job of a parallel batch
Intended for use with joblib.Parallel to run different influence maximization (IM) algorithms at the same time.
- Parameters
setup_dict (dict) – A dictionary containing the following keys:
* “function” : the function to run
* “algo_name” : name of the algorithm represented by the function
* “args” : positional arguments for that function
* “kwargs” : keyword arguments for that function
- Returns
result_dict – A dictionary containing the following keys:
* “results” : results generated by the function
* “algo_name” : name of the algorithm represented by the function
* “kwargs” : keyword arguments passed to the function
- Return type
dict
Examples
>>> import joblib
>>> setup_array = []
>>> setup_array.append(
...     {
...         "algo_name": "timlinucb",
...         "function": timlinucb_parallel_t,
...         "args": [DATASET, DATASET_FEATS, DATASET_TIMES, DATASET_NODES],
...         "kwargs": {
...             "num_seeds": NUM_SEEDS_TO_FIND,
...             "num_repeats_oim": OPTIMAL_NUM_REPEATS_OIM_TLU,
...             "num_repeats_oim_reward": OPTIMAL_NUM_REPEATS_REW_TLU,
...             "sigma": OPTIMAL_SIGMA_TLU,
...             "c": OPTIMAL_C_TLU,
...             "epsilon": OPTIMAL_EPS_TLU,
...         },
...     }
... )
>>> results_array = joblib.Parallel(n_jobs=len(setup_array))(
...     joblib.delayed(run_algorithm)(setup_dict) for setup_dict in setup_array
... )
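Going only by the parameter and return descriptions above, a wrapper with this contract could look like the sketch below. It is an assumption about the behaviour, not the module's source; `run_algorithm_sketch` and `toy_algorithm` are hypothetical names used for illustration.

```python
def run_algorithm_sketch(setup_dict):
    """Call setup_dict["function"] and package its output with its metadata."""
    results = setup_dict["function"](*setup_dict["args"], **setup_dict["kwargs"])
    return {
        "results": results,
        "algo_name": setup_dict["algo_name"],
        "kwargs": setup_dict["kwargs"],
    }


def toy_algorithm(values, scale=1):
    """Hypothetical stand-in for an IM algorithm."""
    return [v * scale for v in values]


result = run_algorithm_sketch(
    {
        "function": toy_algorithm,
        "algo_name": "toy",
        "args": [[1, 2, 3]],
        "kwargs": {"scale": 2},
    }
)
print(result)
```

Returning the algorithm name and keyword arguments next to the results keeps each entry of the list produced by joblib.Parallel self-describing, so it can be matched back to its configuration.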
-
helpers.run_ic_eff(df_graph, seed_nodes)
Simulate the influence propagation using the IC model
- Parameters
df_graph (pandas.DataFrame) – The graph we run the IC on, in the form of a DataFrame. A row represents one edge in the graph, with columns being named “source”, “target”, “probab”. “probab” column contains the activation probability.
seed_nodes (list, pandas.Series) – A list of the nodes to start propagating from.
- Returns
results – A tuple of the following numpy arrays:
* Affected nodes
* Activated edges
* Observed edges
- Return type
tuple
-
helpers.run_ic_nodes(df_graph, seed_nodes)
Simulate the influence propagation using the IC model
- Parameters
df_graph (pandas.DataFrame) – The graph we run the IC on, in the form of a DataFrame. A row represents one edge in the graph, with columns being named “source”, “target”, “probab”. “probab” column contains the activation probability.
seed_nodes (list, pandas.Series) – A list of the nodes to start propagating from.
- Returns
affected_nodes – Nodes influenced by propagating the seed nodes.
- Return type
numpy.array
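For readers unfamiliar with the IC model, one propagation run can be sketched in plain Python as below. This is a textbook-style illustration under stated assumptions, not the module's code: `ic_nodes_sketch` is a hypothetical name, and it takes (source, target, probab) tuples rather than a DataFrame (rows could be supplied with `df_graph[["source", "target", "probab"]].itertuples(index=False)`).

```python
import random
from collections import defaultdict, deque


def ic_nodes_sketch(edges, seed_nodes, rng=random):
    """Run one Independent Cascade simulation and return the influenced nodes.

    edges is an iterable of (source, target, probab) tuples.
    """
    out_edges = defaultdict(list)
    for source, target, probab in edges:
        out_edges[source].append((target, probab))

    active = set(seed_nodes)
    frontier = deque(seed_nodes)
    while frontier:
        node = frontier.popleft()
        # A newly activated node gets exactly one activation attempt
        # along each of its outgoing edges.
        for target, probab in out_edges[node]:
            if target not in active and rng.random() < probab:
                active.add(target)
                frontier.append(target)
    return active


toy_edges = [(1, 2, 1.0), (2, 3, 1.0), (3, 4, 0.0), (5, 1, 1.0)]
influenced = ic_nodes_sketch(toy_edges, seed_nodes=[1])
print(sorted(influenced))
```

With probabilities of exactly 1.0 and 0.0 the toy run is deterministic: nodes 2 and 3 are always reached, node 4 never is, and node 5 stays inactive because influence only follows edge direction.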
-
helpers.tim(df, num_nodes, num_edges, num_inf, epsilon, temp_dir='temp_dir', out_pattern=re.compile('Selected k SeedSet: (.+?) \\n'))
Run TIM, an offline influence maximization (IM) algorithm
- Parameters
df (pandas.DataFrame) – The graph we run the TIM on, in the form of a DataFrame. A row represents one edge in the graph, with columns being named “source”, “target”, “probab”. “probab” column contains the activation probability.
num_nodes (int) – Number of nodes to pass into TIM.
num_edges (int) – Number of edges to pass into TIM.
num_inf (int) – Number of seed nodes to find.
epsilon (float) – A hyperparameter for TIM. Refer to the paper for more details. [1]
temp_dir (str, optional) – A temporary directory to run TIM in. Default: “temp_dir”
out_pattern (re.Pattern, optional) – Regex pattern that extracts the TIM results from its output. Default: re.compile('Selected k SeedSet: (.+?) \\n')
- Returns
seeds – The seed nodes selected by TIM to maximize influence
- Return type
list
References
- 1
Tang, Youze, Xiaokui Xiao, and Yanchen Shi. “Influence maximization: Near-optimal time complexity meets practical efficiency.” Proceedings of the 2014 ACM SIGMOD international conference on Management of data. 2014.
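To illustrate how the default out_pattern pulls the seed set out of text, the sketch below applies it to a made-up output string. The sample output and the parsing of the captured group into integers are assumptions for illustration; the real TIM output and the module's own parsing may differ.

```python
import re

# Same pattern as the documented default for out_pattern.
out_pattern = re.compile("Selected k SeedSet: (.+?) \\n")

# Hypothetical program output; only the "Selected k SeedSet" line matters here.
sample_output = "loading graph\nSelected k SeedSet: 12 7 3 \ndone\n"

match = out_pattern.search(sample_output)
# Assumes whitespace-separated integer node ids in the captured group.
seeds = [int(node) for node in match.group(1).split()]
print(seeds)
```

The pattern requires a space before the line break, so output without that trailing space would not match and a custom out_pattern would be needed.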
-
helpers.tim_parallel(df, num_nodes, num_edges, num_inf, epsilon, tim_file='tim', temp_dir='temp_dir', out_pattern=re.compile('Selected k SeedSet: (.+?) \\n'))
Run TIM, an offline IM algorithm, as part of a parallel pipeline
- Parameters
df (pandas.DataFrame) – The graph we run the TIM on, in the form of a DataFrame. A row represents one edge in the graph, with columns being named “source”, “target”, “probab”. “probab” column contains the activation probability.
num_nodes (int) – Number of nodes to pass into TIM.
num_edges (int) – Number of edges to pass into TIM.
num_inf (int) – Number of seed nodes to find.
epsilon (float) – A hyperparameter for TIM. Refer to the paper for more details. [1]
temp_dir (str, optional) – A temporary directory to run TIM in. Default: “temp_dir”
tim_file (str, optional) – Path to the TIM executable to use. This parameter exists because parallel processing needs a separate copy of the TIM executable per process, so that processes do not contend for a single file. Default: “tim”
out_pattern (re.Pattern, optional) – Regex pattern that extracts the TIM results from its output. Default: re.compile('Selected k SeedSet: (.+?) \\n')
- Returns
seeds – The seed nodes selected by TIM to maximize influence
- Return type
list
References
- 1
Tang, Youze, Xiaokui Xiao, and Yanchen Shi. “Influence maximization: Near-optimal time complexity meets practical efficiency.” Proceedings of the 2014 ACM SIGMOD international conference on Management of data. 2014.
-
helpers.tim_t(df_edges, nodes, times, num_seeds=5, num_repeats_reward=20, epsilon=0.4)
Run the offline IM algorithm, TIM, on every time step in a network
- Parameters
df_edges (pandas.DataFrame) – The graph we run the TIM on, in the form of a DataFrame. A row represents one edge in the graph, with columns being named “source”, “target”, “probab” and “day”. “probab” column contains the activation probability and “day” should correspond to the days specified in times.
nodes (pandas.Series, list) – A sorted list of all unique node ids in the graph.
times (pd.Series, list) – The time steps to run the algorithm on. Useful when TIM should not run on every single time step in the graph.
num_seeds (int, optional) – Number of seed nodes to find. Default: 5
num_repeats_reward (int, optional) – Number of times the obtained seed nodes are propagated with the IC model to estimate the reward. The reward is averaged over these runs. Default: 20
epsilon (float, optional) – A hyperparameter for TIM. Refer to the paper for more details. [1] Default: 0.4
- Returns
results – A dataframe with the following columns:
* time, the time step of the result
* reward, the average reward obtained over num_repeats_reward runs
* selected, a list of the selected seed nodes
- Return type
pd.DataFrame
References
- 1
Tang, Youze, Xiaokui Xiao, and Yanchen Shi. “Influence maximization: Near-optimal time complexity meets practical efficiency.” Proceedings of the 2014 ACM SIGMOD international conference on Management of data. 2014.
-
helpers.tim_t_parallel(df_edges, nodes, times, num_seeds=5, num_repeats_reward=20, epsilon=0.4, process_id=1)
Run the offline IM algorithm, TIM, on every time step in a network, as one job of a parallel pipeline
- Parameters
df_edges (pandas.DataFrame) – The graph we run the TIM on, in the form of a DataFrame. A row represents one edge in the graph, with columns being named “source”, “target”, “probab” and “day”. “probab” column contains the activation probability and “day” should correspond to the days specified in times.
nodes (pandas.Series, list) – A sorted list of all unique node ids in the graph.
times (pd.Series, list) – The time steps to run the algorithm on. Useful when TIM should not run on every single time step in the graph.
num_seeds (int, optional) – Number of seed nodes to find. Default: 5
num_repeats_reward (int, optional) – Number of times the obtained seed nodes are propagated with the IC model to estimate the reward. The reward is averaged over these runs. Default: 20
epsilon (float, optional) – A hyperparameter for TIM. Refer to the paper for more details. [1] Default: 0.4
process_id (int or str, optional) – An identifier used to distinguish this process's temporary TIM executable from the others. Default: 1
- Returns
results – A dataframe with the following columns:
* time, the time step of the result
* reward, the average reward obtained over num_repeats_reward runs
* selected, a list of the selected seed nodes
- Return type
pd.DataFrame
References
- 1
Tang, Youze, Xiaokui Xiao, and Yanchen Shi. “Influence maximization: Near-optimal time complexity meets practical efficiency.” Proceedings of the 2014 ACM SIGMOD international conference on Management of data. 2014.
-
helpers.tim_t_parallel_run(df_edges, nodes, times, num_seeds=5, num_repeats_reward=20, epsilon=0.4, process_id=1, max_jobs=-2, hide_tqdm=True)
Run the offline IM algorithm, TIM, on every time step in a network in parallel
Unlike tim_t_parallel, which is designed to run as one job inside a parallel pipeline, tim_t_parallel_run itself executes the TIM runs in parallel.
- Parameters
df_edges (pandas.DataFrame) – The graph we run the TIM on, in the form of a DataFrame. A row represents one edge in the graph, with columns being named “source”, “target”, “probab” and “day”. “probab” column contains the activation probability and “day” should correspond to the days specified in times.
nodes (pandas.Series, list) – A sorted list of all unique node ids in the graph.
times (pd.Series, list) – The time steps to run the algorithm on. Useful when TIM should not run on every single time step in the graph.
num_seeds (int, optional) – Number of seed nodes to find. Default: 5
num_repeats_reward (int, optional) – Number of times the obtained seed nodes are propagated with the IC model to estimate the reward. The reward is averaged over these runs. Default: 20
epsilon (float, optional) – A hyperparameter for TIM. Refer to the paper for more details. [1] Default: 0.4
process_id (int or str, optional) – An identifier used to distinguish this process's temporary TIM executable from the others. Default: 1
- Returns
results – A dataframe with the following columns:
* time, the time step of the result
* reward, the average reward obtained over num_repeats_reward runs
* selected, a list of the selected seed nodes
- Return type
pd.DataFrame
References
- 1
Tang, Youze, Xiaokui Xiao, and Yanchen Shi. “Influence maximization: Near-optimal time complexity meets practical efficiency.” Proceedings of the 2014 ACM SIGMOD international conference on Management of data. 2014.
-
helpers.tqdm_joblib(tqdm_object)
Context manager that patches joblib to report progress into the tqdm progress bar passed as an argument
- Parameters
tqdm_object (Object) – The tqdm progress bar to update as the parallel jobs complete