timlinucb module

timlinucb.generate_node2vec_fetures(df, node2vec_path='node2vec', tempdir_name='temp_dir', dataset_name='facebook', num_features=20, check_existing=True)

Generate node2vec features for the edges in a graph

Runs get_features_nodes to get the node2vec[1] features for the nodes and then multiplies them to get the edge features. Edge feature vector Fe = source_node_feats * target_node_feats.

Parameters
  • df (pandas.DataFrame) – The graph we run node2vec on, in the form of a DataFrame. A row represents one edge in the graph, with columns containing the source and target nodes for the edge.

  • node2vec_path (str, optional) – Path to the node2vec executable. Default: “os.getcwd()”, which means that it is in one folder with this

  • tempdir_name (str, optional) – The name of a temporary directory to use for running node2vec. Default: “temp_dir”

  • dataset_name (str, optional) – The name of the dataset, used when naming the files in the temporary directory. Default: “facebook”

  • dims (int, optional) – Specifies the number of dimensions to pass to node2vec[1]. Default: 20

  • check_existing (boolean, optional) – Determines if we are checking for an alredy existing algorithm output in the temporary directory.

Returns

df_feats – A dataframe with edge embeddings, where one row represents one edge.

Return type

pd.DataFrame

References

1

Grover, Aditya, and Jure Leskovec. “node2vec: Scalable feature learning for networks.” Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. 2016.

timlinucb.get_features_nodes(df_graph, dims=20, epochs=1, node2vec_path='node2vec', tempdir_name='temp_dir', dataset_name='facebook', check_existing=True)

Generate node2vec features for the nodes in a graph

Parameters
  • df_graph (pandas.DataFrame) – The graph we run node2vec on, in the form of a DataFrame. A row represents one edge in the graph, with columns containing the source and target nodes for the edge.

  • dims (int, optional) – Specifies the number of dimensions to pass to node2vec[1]. Default: 20

  • epochs (int, optional) – Specifies the number of epochs to run the node2vec for. Default: 1

  • node2vec_path (str, optional) – Path to the node2vec executable. Default: “node2vec”, which means that it is in one folder with this file

  • tempdir_name (str, optional) – The name of a temporary directory to use for running node2vec. Default: “temp_dir”

  • dataset_name (str, optional) – The name of the dataset, used when naming the files in the temporary directory. Default: “facebook”

  • check_existing (boolean, optional) – Determines if we are checking for an alredy existing algorithm output in the temporary directory.

Returns

df_emb – A dataframe with the contents of embeddings generated by node2vec. It is also available as <dataset_name>-d<dims>.emb in the temporary directory.

Return type

pd.DataFrame

References

1

Grover, Aditya, and Jure Leskovec. “node2vec: Scalable feature learning for networks.” Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. 2016.

timlinucb.oim_node2vec(df, df_feats, nodes, num_inf=10, sigma=4, c=0.1, epsilon=0.4, num_repeats=30, num_repeats_reward=20, oracle=<function tim>, persist=False, b=None, m_inv=None, hide_tqdm=False)

Online IM with Node2Vec features

Run the IMLinUCB[1] algorithm with pre-trained Node2Vec [3] features.

Parameters
  • df (pandas.DataFrame) – The graph we run the OIM on, in the form of a DataFrame. A row represents one edge in the graph, with columns being named “source”, “target”, and “probab”. “probab” column is the “true” activation probability.

  • df_feats (pandas.DataFrame) – A dataframe with node embeddings generated by node2vec [3]. Should have n rows, where n is the number of nodes in the graph. The number of columns is specified when running node2vec on the graph. The smaller the number of columns is the more uncertain the embedding, but it also speeds up processing.

  • nodes (pandas.DataFrame) – A dataframe of all unique nodes in df. Is used to calculate the number of nodes for TIM.

  • num_inf (int, optional) – Number of seed nodes to find. Default: 10.

  • sigma (float, optional) – A parameter used by the IMLinUCB algorithm. Refer to the IMLinUCB paper for more details. [1] Default: 4

  • c (float, optional) – A parameter used by the IMLinUCB algorithm. Refer to the IMLinUCB paper for more details. [1] Default: 0.1

  • epsilon (float, optional) – A parameter used by the TIM algorithm. Refer to the TIM paper for more details. [2] Default: 0.4

  • num_repeats (int, optional) – Number of iterations of the IMLinUCB algorithm. The more, the better the results. Default: 30

  • num_repeats_reward (int, optional) – Number of times to run the simulation at every step of the IMLinUCB algorithm. A larger number should better the results. Default: 20

  • num_nodes_tim (int, optional) – Number of nodes to use for the offline IM algorithm, TIM [2]. Default: -1

  • oracle (function, optional) – A function to use as the Offline IM algorithm. Default: tim [2]

  • persist (boolean, optional) – A parameter that determines if we are running this algorithm in a temporal IM toolchain and would want to persist parameters. If True, you need to provide b and m_inv as well. Default: False

  • b (pandas.DataFrame, optional) – A dataframe of b taken from the previous time step in the temporal IM. Used if we want to persist parameters through time steps. Default: None

  • m_inv (pandas.DataFrame, optional) – A dataframe of m_inv taken from the previous time step in the temporal IM. Used if we want to persist parameters through time steps. Default: None

  • hide_tqdm (boolean, optional) – A paremeters used if you want to hide all tqdm progress bars. It’s useful if you want to paralellize the algorithm. Default: False

Returns

return_dict – A dictionary consisting of following keys: - s_best, the list of the selected seed nodes - u_e_best, the approximated probabilities - reward_best, the reward obtained by running IC with s_best - m_inv, a parameter to use in the next it of temporal OIM

(only if persist is True)

  • b, a parameter to use in the next it of temporal OIM

    (only if persist is True)

Return type

dict

References

1

Wen, Zheng, Branislav Kveton, and Michal Valko. “Influence maximization with semi-bandit feedback.” CoRR, abs/1605.06593 (2016).

2

Tang, Youze, Xiaokui Xiao, and Yanchen Shi. “Influence maximization: Near-optimal time complexity meets practical efficiency.” Proceedings of the 2014 ACM SIGMOD international conference on Management of data. 2014.

3

Grover, Aditya, and Jure Leskovec. “node2vec: Scalable feature learning for networks.” Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. 2016.

timlinucb.oim_node2vec_simple(df, df_feats, num_inf=10, sigma=4, c=0.1, epsilon=0.4, num_repeats=15, num_nodes_tim=-1, oracle=<function tim>)

Online IM with Node2Vec features (Simple version)

Online Influence Maximization using IMLinUCB[1] with node2vec [3] features. Simple version doesn’t simulate the propagation multiple times at every time step, which makes the algorithm faster but less accurate.

Parameters
  • df (pandas.DataFrame) – The graph we run the OIM on, in the form of a DataFrame. A row represents one edge in the graph, with columns being named “source”, “target”, and “probab”. “probab” column is the “true” activation probability.

  • df_feats (pandas.DataFrame) – A dataframe with node embeddings generated by node2vec [3]. Should have n rows, where n is the number of nodes in the graph. The number of columns is specified when running node2vec on the graph. The smaller the number of columns is the more uncertain the embedding, but it also speeds up processing.

  • num_inf (int, optional) – Number of seed nodes to find. Default: 10.

  • sigma (float, optional) – A parameter used by the IMLinUCB algorithm. Refer to the IMLinUCB paper for more details. [1] Default: 4

  • c (float, optional) – A parameter used by the IMLinUCB algorithm. Refer to the IMLinUCB paper for more details. [1] Default: 0.1

  • epsilon (float, optional) – A parameter used by the TIM algorithm. Refer to the TIM paper for more details. [2] Default: 0.4

  • num_repeats (int, optional) – Number of iterations of the IMLinUCB algorithm. The more, the better the results. Default: 30

  • num_nodes_tim (int, optional) – Number of nodes to use for the offline IM algorithm, TIM [2]. Default: -1

  • oracle (function, optional) – A function to use as the Offline IM algorithm. Default: tim [2]

Returns

return_dict – A dictionary consisting of following keys: - rewards, a list of all rewards (number of inf nodes) obtained by the algorithm - rewards_edges, a list of edge rewards (number of inf edges) obtained - s_best, the list of the selected seed nodes - u_e_best, the approximated probabilities - reward_best, the reward obtained by running IC with s_best

Return type

dict

References

1

Wen, Zheng, Branislav Kveton, and Michal Valko. “Influence maximization with semi-bandit feedback.” CoRR, abs/1605.06593 (2016).

2

Tang, Youze, Xiaokui Xiao, and Yanchen Shi. “Influence maximization: Near-optimal time complexity meets practical efficiency.” Proceedings of the 2014 ACM SIGMOD international conference on Management of data. 2014.

3

Grover, Aditya, and Jure Leskovec. “node2vec: Scalable feature learning for networks.” Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. 2016.

timlinucb.timlinucb(df_edges, df_feats, times, nodes, num_seeds=5, sigma=4, c=0.1, epsilon=0.4, num_repeats_oim=10, num_repeats_oim_reward=10, style='additive', persist=False, hide_tqdm=False)

Temporal Online Influence Maximization

Run the IMLinUCB[1] algorithm with pre-trained Node2Vec[3] features at every time step in the graph df_edges.

Parameters
  • df_edges (pandas.DataFrame) – The graph we run the TOIM on, in the form of a DataFrame. A row represents one edge in the graph, with columns being named “source”, “target”, “probab”, and “day”. “probab” column is the “true” activation probability and “day” should correspond to the days specified in times.

  • df_feats (pandas.DataFrame) – A dataframe with node embeddings generated by node2vec [3]. Should have n rows, where n is the number of nodes in the graph. The number of columns is specified when running node2vec on the graph. The smaller the number of columns is the more uncertain the embedding, but it also speeds up processing.

  • times (pandas.Series, list) – A series or a list of the times that we are going to iterate through. Useful if you don’t want to iterate through every day in the network.

  • nodes (pandas.DataFrame) – A dataframe of all unique nodes in df. Is used to calculate the number of nodes for TIM.

  • num_inf (int, optional) – Number of seed nodes to find. Default: 5.

  • sigma (float, optional) – A parameter used by the IMLinUCB algorithm. Refer to the IMLinUCB paper for more details. [1] Default: 4

  • c (float, optional) – A parameter used by the IMLinUCB algorithm. Refer to the IMLinUCB paper for more details. [1] Default: 0.1

  • epsilon (float, optional) – A parameter used by the TIM algorithm. Refer to the TIM paper for more details. [2] Default: 0.4

  • num_repeats_oim (int, optional) – Number of iterations of the IMLinUCB algorithm. The more, the better the results. Default: 10

  • num_repeats_reward_oim (int, optional) – Number of times to run the simulation at every step of the IMLinUCB algorithm. A larger number should better the results. Default: 10

  • style (str, optional) – Determines whether we take into account all edges up to t (“additive”) or just the ones that were formed at t (“dynamic”). Default: “additive”

  • persist (boolean, optional) – Determines if we want to persist the OIM parameters. Default: False

  • hide_tqdm (boolean, optional) – A paremeters used if you want to hide all tqdm progress bars. It’s useful if you want to paralellize the algorithm. Default: False

Returns

results – A dataframe with the following columns - s_best, the list of the selected seed nodes - u_e_best, the approximated probabilities - reward_best, the reward obtained by running IC with s_best - time, the time step at which everything else was obtained

Return type

DataFrame

References

1

Wen, Zheng, Branislav Kveton, and Michal Valko. “Influence maximization with semi-bandit feedback.” CoRR, abs/1605.06593 (2016).

2

Tang, Youze, Xiaokui Xiao, and Yanchen Shi. “Influence maximization: Near-optimal time complexity meets practical efficiency.” Proceedings of the 2014 ACM SIGMOD international conference on Management of data. 2014.

3

Grover, Aditya, and Jure Leskovec. “node2vec: Scalable feature learning for networks.” Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. 2016.

timlinucb.timlinucb_parallel_oim(df_edges, df_feats, times, nodes, num_seeds=5, sigma=4, c=0.1, epsilon=0.4, num_repeats_oim=10, num_repeats_oim_reward=10, style='additive', process_id=1, max_jobs=- 2, hide_tqdm=False)

Temporal Online Influence Maximization - Parallel version (multiple OIMs)

Run the IMLinUCB[1] algorithm with pre-trained Node2Vec[3] features at every time step in the graph df_edges. The parallel version runs multiple IMLinUCB sessions in parallel, making it impossible to persist the parameters but speeding up the processing.

Parameters
  • df_edges (pandas.DataFrame) – The graph we run the TOIM on, in the form of a DataFrame. A row represents one edge in the graph, with columns being named “source”, “target”, “probab”, and “day”. “probab” column is the “true” activation probability and “day” should correspond to the days specified in times.

  • df_feats (pandas.DataFrame) – A dataframe with node embeddings generated by node2vec [3]. Should have n rows, where n is the number of nodes in the graph. The number of columns is specified when running node2vec on the graph. The smaller the number of columns is the more uncertain the embedding, but it also speeds up processing.

  • times (pandas.Series, list) – A series or a list of the times that we are going to iterate through. Useful if you don’t want to iterate through every day in the network.

  • nodes (pandas.DataFrame) – A dataframe of all unique nodes in df. Is used to calculate the number of nodes for TIM.

  • num_inf (int, optional) – Number of seed nodes to find. Default: 5.

  • sigma (float, optional) – A parameter used by the IMLinUCB algorithm. Refer to the IMLinUCB paper for more details. [1] Default: 4

  • c (float, optional) – A parameter used by the IMLinUCB algorithm. Refer to the IMLinUCB paper for more details. [1] Default: 0.1

  • epsilon (float, optional) – A parameter used by the TIM algorithm. Refer to the TIM paper for more details. [2] Default: 0.4

  • num_repeats_oim (int, optional) – Number of iterations of the IMLinUCB algorithm. The more, the better the results. Default: 10

  • num_repeats_oim_reward (int, optional) – Number of times to run the simulation at every step of the IMLinUCB algorithm. A larger number should better the results. Default: 10

  • style (str, optional) – Determines whether we take into account all edges up to t (“additive”) or just the ones that were formed at t (“dynamic”). Default: “additive”

  • process_id (str or int, optional) – An id unique for this iteration of TOIM (in the parallel cycle). Default: 1

  • max_jobs (int, optional) – Specifies the number of jobs/cores to use in the processing. Negative numbers (like -x) mean MAX_NUMBER_OF_JOBS - x. Default: -2

  • hide_tqdm (boolean, optional) – A paremeters used if you want to hide all tqdm progress bars. It’s useful if you want to paralellize the algorithm. Default: False

Returns

results – A dataframe with the following columns - s_best, the list of the selected seed nodes - u_e_best, the approximated probabilities - reward_best, the reward obtained by running IC with s_best - time, the time step at which everything else was obtained

Return type

DataFrame

References

1

Wen, Zheng, Branislav Kveton, and Michal Valko. “Influence maximization with semi-bandit feedback.” CoRR, abs/1605.06593 (2016).

2

Tang, Youze, Xiaokui Xiao, and Yanchen Shi. “Influence maximization: Near-optimal time complexity meets practical efficiency.” Proceedings of the 2014 ACM SIGMOD international conference on Management of data. 2014.

3

Grover, Aditya, and Jure Leskovec. “node2vec: Scalable feature learning for networks.” Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. 2016.

timlinucb.timlinucb_parallel_t(df_edges, df_feats, times, nodes, num_seeds=5, sigma=4, c=0.1, epsilon=0.4, num_repeats_oim=10, num_repeats_oim_reward=10, style='additive', process_id=1, persist=False)

Temporal Online Influence Maximization - Parallel version (multiple TOIMs)

Run the IMLinUCB[1] algorithm with pre-trained Node2Vec[3] features at every time step in the graph df_edges. The parallel version here means that you can run multiple TIMLinUCB algorithms without hogging the TIM[2] file. For a version that parallelizes the OIM iterations and thus makes persistent parameters impossible, refer to timlinucb_parallel_oim. The tqdm progress bars are hidden in this version.

Parameters
  • df_edges (pandas.DataFrame) – The graph we run the TOIM on, in the form of a DataFrame. A row represents one edge in the graph, with columns being named “source”, “target”, “probab”, and “day”. “probab” column is the “true” activation probability and “day” should correspond to the days specified in times.

  • df_feats (pandas.DataFrame) – A dataframe with node embeddings generated by node2vec [3]. Should have n rows, where n is the number of nodes in the graph. The number of columns is specified when running node2vec on the graph. The smaller the number of columns is the more uncertain the embedding, but it also speeds up processing.

  • times (pandas.Series, list) – A series or a list of the times that we are going to iterate through. Useful if you don’t want to iterate through every day in the network.

  • nodes (pandas.DataFrame) – A dataframe of all unique nodes in df. Is used to calculate the number of nodes for TIM.

  • num_inf (int, optional) – Number of seed nodes to find. Default: 5.

  • sigma (float, optional) – A parameter used by the IMLinUCB algorithm. Refer to the IMLinUCB paper for more details. [1] Default: 4

  • c (float, optional) – A parameter used by the IMLinUCB algorithm. Refer to the IMLinUCB paper for more details. [1] Default: 0.1

  • epsilon (float, optional) – A parameter used by the TIM algorithm. Refer to the TIM paper for more details. [2] Default: 0.4

  • num_repeats (int, optional) – Number of iterations of the IMLinUCB algorithm. The more, the better the results. Default: 10

  • num_repeats_reward (int, optional) – Number of times to run the simulation at every step of the IMLinUCB algorithm. A larger number should better the results. Default: 10

  • num_nodes_tim (int, optional) – Number of nodes to use for the offline IM algorithm, TIM [2]. Default: -1

  • style (str, optional) – Determines whether we take into account all edges up to t (“additive”) or just the ones that were formed at t (“dynamic”). Default: “additive”

  • process_id (str or int, optional) – An id unique for this iteration of TOIM (in the parallel cycle). Default: 1

  • persist (boolean, optional) – Determines if we want to persist the OIM parameters. Default: False

Returns

results – A dataframe with the following columns - s_best, the list of the selected seed nodes - u_e_best, the approximated probabilities - reward_best, the reward obtained by running IC with s_best - time, the time step at which everything else was obtained

Return type

DataFrame

References

1

Wen, Zheng, Branislav Kveton, and Michal Valko. “Influence maximization with semi-bandit feedback.” CoRR, abs/1605.06593 (2016).

2

Tang, Youze, Xiaokui Xiao, and Yanchen Shi. “Influence maximization: Near-optimal time complexity meets practical efficiency.” Proceedings of the 2014 ACM SIGMOD international conference on Management of data. 2014.

3

Grover, Aditya, and Jure Leskovec. “node2vec: Scalable feature learning for networks.” Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. 2016.