Phoenix Pipeline Package

scraper_connection Module

Downloads scraped stories from MongoDB.

scraper_connection.main(current_date, file_details, write_file=False, file_stem=None)

Function to create a connection to a MongoDB instance, query for a given day’s results, optionally write the results to a file, and return the results.

Parameters:

current_date: datetime object.

Date for which records are pulled. Normally this is date_running - 1. For example, if the script is running on the 25th, current_date will be the 24th.

file_details: Named tuple.

Tuple containing config information.

write_file: Boolean.

Option indicating whether to write the results from the web scraper to an intermediate file. Defaults to False.

file_stem: String. Optional.

Optional string defining the file stem for the intermediate file of scraper results.

Returns:

posts: Dictionary.

Dictionary of results from the MongoDB query.

filename: String.

If write_file is True, contains the filename to which the scraper results are written. Otherwise an empty string.

scraper_connection.query_all(collection, lt_date, gt_date, sources, write_file=False)

Function to query the MongoDB instance and obtain results for the desired date range. The constructed query returns records whose date falls strictly between the two bounds: gt_date < date < lt_date.

Parameters:

collection: pymongo.collection.Collection.

Collection within MongoDB that holds the scraped news stories.

lt_date: Datetime object.

Date that results must be earlier than. For example, if the date running is the 25th, and the desired date is the 24th, then the lt_date is the 25th.

gt_date: Datetime object.

Date that results must be later than. For example, if the date running is the 25th, and the desired date is the 24th, then the gt_date is the 23rd.

sources: List.

Sources to pull from the MongoDB instance.

write_file: Boolean.

Option indicating whether to write the results from the web scraper to an intermediate file. Defaults to False.

Returns:

posts: List.

List of dictionaries of results from the MongoDB query.

final_out: String.

If write_file is True, contains a string representation of the query results. Otherwise an empty string.
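As a rough illustration of the shape of such a query (the field names date_added and source, and the database and collection names, are assumptions for this sketch rather than the pipeline's actual schema):

    import datetime

    from pymongo import MongoClient

    # Hypothetical date-range query; field, database, and collection names
    # are placeholders.
    collection = MongoClient('localhost')['event_scrape']['stories']

    lt_date = datetime.datetime(2014, 3, 25)  # results must be earlier than this
    gt_date = datetime.datetime(2014, 3, 23)  # ...and later than this
    sources = ['bbc', 'reuters']

    posts = list(collection.find({
        'date_added': {'$gt': gt_date, '$lt': lt_date},  # 23rd < date < 25th
        'source': {'$in': sources},
    }))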

formatter Module

Parses scraped stories from MongoDB into PETRARCH-formatted source text input.

formatter.format_content(raw_content)

Function to process a given news story for further formatting. Calls a function that extracts the story text, minus the date and source line, and splits the text into sentences using the sentence_segmenter() function.

Parameters:

raw_content: String.

Content of a news story as pulled from the web scraping database.

Returns:

sent_list: List.

List of sentences.

formatter.get_date(result_entry, process_date)

Function to extract the date from a story. It first checks for a date from the RSS feed itself, then tries to pull a date from the first two sentences of the story, and finally falls back to the date the story was added to the database. For dates pulled from the story text, the function checks whether they differ by more than one day from the date the pipeline is processing.

Parameters:

result_entry: Dictionary.

Record of a single result from the web scraper.

process_date: datetime object.

Datetime object indicating which date the pipeline is processing. Standard is date_running - 1 day.

Returns:

date: String.

Date string in the form YYMMDD.
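A schematic of that fallback order, assuming illustrative field names (date, content, date_added) rather than the module's actual internals:

    def example_get_date(result_entry, process_date):
        """Illustrative fallback chain; not the formatter module's real code."""
        date = result_entry.get('date')          # 1) date from the RSS feed
        if date is None:
            date = date_from_text(result_entry)  # 2) date parsed from the story
        if date is None:
            date = result_entry['date_added']    # 3) database insertion date
        # Story-derived dates more than a day from the processing date are
        # treated as unreliable and replaced.
        if abs((date - process_date).days) > 1:
            date = process_date
        return date.strftime('%y%m%d')           # YYMMDD

    def date_from_text(result_entry):
        """Hypothetical helper: scan the first two sentences for a date."""
        return None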

formatter.main(results, file_details, process_date, thisday)

Main function to parse results from the web scraper to TABARI-formatted output.

Parameters:

results: pymongo.cursor.Cursor. Iterable.

Iterable containing the results from the scraper.

file_details: NamedTuple.

Container generated from the config file specifying file stems and other relevant options.

process_date: String.

Date for which the pipeline is running. Usually current_date - 1.

thisday: String.

The current date on which the pipeline is running.

Returns:

new_results: List.

List of dictionaries that contain the MongoDB records with new, formatted content.

oneaday_filter Module

Deduplication for the final output. Reads in a single day of coded event data, selects the first record of each source-target-event combination, and records references for any additional events with the same source-target-event combination.

oneaday_filter.filter_events(results)

Filters out duplicate events, leaving only one unique (DATE, SOURCE, TARGET, EVENT) tuple per day.

Parameters:

results: Dictionary.

PETRARCH-formatted results in the {StoryID: [(record), (record)]} format.

Returns:

filter_dict: Dictionary.

Contains filtered events. Keys are (DATE, SOURCE, TARGET, EVENT) tuples, values are lists of IDs, sources, and issues.
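A minimal sketch of the first-record-wins behavior, assuming each record tuple leads with its DATE, SOURCE, TARGET, and EVENT fields followed by an ID, a source, and issue values (the real record layout may differ):

    def example_filter_events(results):
        """One-a-day filter sketch: keep the first record per event key."""
        filter_dict = {}
        for story_id, records in results.items():
            for date, source, target, event, rec_id, rec_src, issues in records:
                key = (date, source, target, event)
                if key not in filter_dict:
                    # First occurrence of the combination wins.
                    filter_dict[key] = [[rec_id], [rec_src], issues]
                else:
                    # Later duplicates only contribute their ID as a reference.
                    filter_dict[key][0].append(rec_id)
        return filter_dict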

oneaday_filter.main(results)

Pulls in the coded results from the PETRARCH dictionary in the {StoryID: [(record), (record)]} format and allows only one unique (DATE, SOURCE, TARGET, EVENT) tuple per day. Returns this new, filtered event data.

Parameters:

results: Dictionary.

PETRARCH-formatted results in the {StoryID: [(record), (record)]} format.

result_formatter Module

Puts the PETRARCH-generated event data into a format consistent with other parts of the pipeline so that the events can be further processed by the postprocess module.

result_formatter.filter_events(results)

Converts each event record into a uniquely keyed (DATE, SOURCE, TARGET, EVENT, COUNTER) entry so that duplicate events are preserved.

Parameters:

results: Dictionary.

PETRARCH-formatted results in the {StoryID: [(record), (record)]} format.

Returns:

formatted_dict: Dictionary.

Contains filtered events. Keys are (DATE, SOURCE, TARGET, EVENT, COUNTER) tuples, values are lists of IDs, sources, and issues. The COUNTER in the tuple is a hackish workaround: each key in the dictionary has to be unique, and the goal is to have every coded event appear even if it is a duplicate. Other code will simply ignore this counter.

result_formatter.main(results)

Pulls in the coded results from the PETRARCH dictionary in the {StoryID: [(record), (record)]} format and converts them into the (DATE, SOURCE, TARGET, EVENT, COUNTER) tuple format. The COUNTER in the tuple is a hackish workaround: each key in the dictionary has to be unique, and the goal is to have every coded event appear even if it is a duplicate. Other code will simply ignore this counter. Returns this new, formatted event data.

Parameters:

results: Dictionary.

PETRARCH-formatted results in the {StoryID: [(record), (record)]} format.

Returns:

formatted_dict: Dictionary.

Contains filtered events. Keys are (DATE, SOURCE, TARGET, EVENT, COUNTER) tuples, values are lists of IDs, sources, and issues. The COUNTER in the tuple is a hackish workaround: each key in the dictionary has to be unique, and the goal is to have every coded event appear even if it is a duplicate. Other code will simply ignore this counter.
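The counter trick itself is small; a sketch under the assumption that each record is a plain (DATE, SOURCE, TARGET, EVENT) tuple:

    import itertools

    def example_format(results):
        """COUNTER workaround sketch: a running counter in each key keeps
        duplicate events from colliding in the dictionary."""
        formatted_dict = {}
        counter = itertools.count()
        for story_id, records in results.items():
            for date, source, target, event in records:
                key = (date, source, target, event, next(counter))
                formatted_dict[key] = [story_id]  # IDs, sources, issues in practice
        return formatted_dict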

postprocess Module

Performs final formatting of the event data and writes events out to a text file.

postprocess.create_strings(events)

Formats the event tuples into a string that can be written to a file.

Parameters:

events: Dictionary.

Contains filtered events. Keys are (DATE, SOURCE, TARGET, EVENT) tuples, values are lists of IDs, sources, and issues.

Returns:

event_strings: String.

Contains tab-separated event entries with a newline character as the line delimiter.
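A minimal sketch of the string assembly; only the tuple keys are shown here, while the real function presumably also appends the associated IDs and other values:

    def example_create_strings(events):
        """Sketch: one tab-separated line per event, joined with newlines."""
        event_strings = []
        for (date, source, target, event), info in events.items():
            event_strings.append('\t'.join([date, source, target, event]))
        return '\n'.join(event_strings)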

postprocess.main(event_dict, this_date, file_details)

Main function of the module. Pulls in the filtered event data, applies the final formatting, and writes the events out to a text file for the date being processed.

Parameters:

event_dict: Dictionary.

PETRARCH-formatted results in the {StoryID: [(record), (record)]} format.

this_date: String.

The current date on which the pipeline is running.

file_details: NamedTuple.

Container generated from the config file specifying file stems and other relevant options.

postprocess.process_actors(event)

Splits out the actor codes into separate fields to enable easier querying/formatting of the data.

Parameters:

event: Tuple.

(DATE, SOURCE, TARGET, EVENT) format.

Returns:

actors: Tuple.

Tuple containing actor information. Format is (source, source_root, source_agent, source_others, target, target_root, target_agent, target_others). Root is either a country code or one of IGO, NGO, IMG, or MNC. Agent is one of GOV, MIL, REB, OPP, PTY, COP, JUD, SPY, MED, EDU, BUS, CRM, or CVL. The others contains all other actor or agent codes.
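CAMEO actor codes are concatenations of 3-character segments, so splitting along those boundaries recovers the pieces. A simplified sketch for a single actor code (the real function applies this to both source and target):

    AGENTS = {'GOV', 'MIL', 'REB', 'OPP', 'PTY', 'COP', 'JUD', 'SPY', 'MED',
              'EDU', 'BUS', 'CRM', 'CVL'}

    def example_split_actor(code):
        """Split one actor code into (root, agent, others) 3-char chunks."""
        chunks = [code[i:i + 3] for i in range(0, len(code), 3)]
        root = chunks[0] if chunks else ''   # country code or IGO/NGO/IMG/MNC
        agent = next((c for c in chunks[1:] if c in AGENTS), '')
        others = [c for c in chunks[1:] if c != agent]
        return root, agent, others

    # example_split_actor('USAGOV') -> ('USA', 'GOV', [])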

postprocess.process_cameo(event)

Provides the “root” CAMEO event, a Goldstein value for the full CAMEO code, and a quad class value.

Parameters:

event: Tuple.

(DATE, SOURCE, TARGET, EVENT) format.

Returns:

root_code: String.

First two digits of a CAMEO code. Single-digit codes have leading zeros, hence the string format rather than an integer.

event_quad: Int.

Quad class value for a root CAMEO category.

goldstein: Float.

Goldstein value for the full CAMEO code.
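A sketch of the three derivations; the lookup tables below are tiny stand-ins rather than the full CAMEO-to-Goldstein and quad-class mappings:

    # Stand-in tables: the real mappings cover the full CAMEO codebook.
    GOLDSTEIN = {'01': 0.0, '042': 1.9, '190': -10.0}
    QUAD_CLASS = {'01': 0, '04': 1, '19': 4}

    def example_process_cameo(event):
        code = event[3]                    # EVENT slot of the tuple
        root_code = code[:2]               # keeps the leading zero, e.g. '04'
        event_quad = QUAD_CLASS.get(root_code, 0)
        goldstein = GOLDSTEIN.get(code, 0.0)
        return root_code, event_quad, goldstein

    # example_process_cameo(('140324', 'USA', 'RUS', '042')) -> ('04', 1, 1.9)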

postprocess.split_process(event)

Splits out the CAMEO code and actor information along with providing conversions between CAMEO codes and quad class and Goldstein values.

Parameters:

event: Tuple.

(DATE, SOURCE, TARGET, EVENT) format.

Returns:

formatted: Tuple.

Tuple of the form (year, month, day, formatted_date, root_code, event_quad).

actors: Tuple.

Tuple containing actor information. Format is (source, source_root, source_agent, source_others, target, target_root, target_agent, target_others). Root is either a country code or one of IGO, NGO, IMG, or MNC. Agent is one of GOV, MIL, REB, OPP, PTY, COP, JUD, SPY, MED, EDU, BUS, CRM, or CVL. The others contains all other actor or agent codes.

geolocation Module

Geolocates the coded event data.

geolocation.main(events, file_details)

Pulls out a database ID and runs the query_geotext function to hit the GeoVista Center’s GeoText API and find location information within the sentence.

Parameters:

events: Dictionary.

Contains filtered events from the one-a-day filter. Keys are (DATE, SOURCE, TARGET, EVENT) tuples, values are lists of IDs, sources, and issues.

Returns:

events: Dictionary.

Same as in the parameter but with the addition of a value that is a tuple of the form (LAT, LON).

geolocation.query_geotext(sentence)

Queries the GeoText API with the text from which an event was coded and extracts the latitude and longitude of the identified location.

Parameters:

sentence: String.

Text from which an event was coded.

Returns:

lat: String.

Latitude of a location.

lon: String.

Longitude of a location.
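The general shape of such a call, with a placeholder URL and assumed response fields (the GeoText service's actual endpoint and JSON layout are not documented here):

    import requests

    GEOTEXT_URL = 'http://example.com/geotext/api'  # placeholder endpoint

    def example_query_geotext(sentence):
        """Sketch of a GeoText-style request; response fields are assumed."""
        resp = requests.get(GEOTEXT_URL, params={'q': sentence})
        resp.raise_for_status()
        place = resp.json()['places'][0]  # assume one identified place
        return place['lat'], place['lon']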

uploader Module

Uploads PETRARCH-coded event data and duplicate-record references to the server designated in the config file.

uploader.get_zipped_file(filename, dirname, connection)

Downloads the file filename.zip from the subdirectory dirname, reads it into tempfile.zip, changes back to the parent directory, and unzips it. Exits on error by raising RuntimeError.

uploader.main(datestr, server_info, file_info)

When something goes amiss, the various routines will raise a RuntimeError(explanation) and pass it through rather than trying to recover, since a failure probably means that something is either wrong with the FTP connection or the file structure has been corrupted. The error is logged but needs to be caught in the calling program.

uploader.store_zipped_file(filename, dirname, connection)

Zips and uploads the file filename into the subdirectory dirname, then changes back to the parent directory. Exits on error by raising RuntimeError.
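A sketch of the zip-and-upload step, assuming connection is an authenticated ftplib.FTP instance (the module's actual error handling and helpers may differ):

    import zipfile
    from ftplib import error_perm

    def example_store_zipped_file(filename, dirname, connection):
        """Zip filename, upload it into dirname, then return to the parent."""
        zipname = filename + '.zip'
        with zipfile.ZipFile(zipname, 'w', zipfile.ZIP_DEFLATED) as zf:
            zf.write(filename)
        try:
            connection.cwd(dirname)
            with open(zipname, 'rb') as f:
                connection.storbinary('STOR ' + zipname, f)
            connection.cwd('..')
        except error_perm as why:
            raise RuntimeError('FTP upload failed: {}'.format(why))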

utilities Module

Miscellaneous functions to do things like establish database connections, parse config files, and initialize logging.

utilities.do_RuntimeError(st1, filename=u'', st2=u'')

This is a general routine for raising the RuntimeError: the reason to make this a separate procedure is to allow the error message information to be specified only once. As long as it isn’t caught explicitly, the error appears to propagate out to the calling program, which can deal with it.

utilities.init_logger(logger_filename)

Initialize a log file.

Parameters:

logger_filename: String.

Path to the log file.

utilities.make_conn(db_auth, db_user, db_pass)

Function to establish a connection to a local MongoDB instance.

Parameters:

db_auth: String.

MongoDB database that should be used for user authentication.

db_user: String.

Username for MongoDB authentication.

db_pass: String.

Password for MongoDB authentication.

Returns:

collection: pymongo.collection.Collection.

Collection within MongoDB that holds the scraped news stories.
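A minimal sketch of the connection logic; the database and collection names (event_scrape, stories) are assumptions for illustration:

    from pymongo import MongoClient

    def example_make_conn(db_auth, db_user, db_pass):
        """Return a collection handle, authenticating only if configured."""
        if db_auth:
            uri = 'mongodb://{}:{}@localhost/?authSource={}'.format(
                db_user, db_pass, db_auth)
            client = MongoClient(uri)
        else:
            client = MongoClient('localhost')
        return client['event_scrape']['stories']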

utilities.parse_config(config_filename)

Parse the config file and return relevant information.

Parameters:

config_filename: String.

Path to config file.

Returns:

server_list: Named tuple.

Config information specifically related to the remote server for FTP uploading.

file_list: Named tuple.

All the other config information not in server_list.
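A sketch of the two-tuple return using the standard library's ConfigParser; the section and option names below are examples, not the pipeline's actual config schema:

    from collections import namedtuple
    from configparser import ConfigParser

    ServerInfo = namedtuple('ServerInfo', ['host', 'username', 'password'])
    FileInfo = namedtuple('FileInfo', ['scraper_stem', 'event_stem'])

    def example_parse_config(config_filename):
        parser = ConfigParser()
        parser.read(config_filename)
        server_list = ServerInfo(parser.get('Server', 'host'),
                                 parser.get('Server', 'username'),
                                 parser.get('Server', 'password'))
        file_list = FileInfo(parser.get('Pipeline', 'scraper_stem'),
                             parser.get('Pipeline', 'event_stem'))
        return server_list, file_list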

utilities.sentence_segmenter(paragr)

Function to break a string 'paragraph' into a list of sentences based on the following rules (a simplified sketch of rule 1 follows this entry):

1. Look for terminal [.,?,!] followed by a space and [A-Z].

2. If ., check against the abbreviation list ABBREV_LIST: get the string between the . and the previous blank, lower-case it, and see if it is in the list. Also check for single-letter initials. If true, continue the search for terminal punctuation.

3. Extend the selection to balance (...) and "...", then reapply the termination rules.

4. Add to sentlist if the length of the string is between MIN_SENTLENGTH and MAX_SENTLENGTH.

5. Return sentlist.

Parameters:

paragr: String.

Content that will be split into constituent sentences.

Returns:

sentlist: List.

List of sentences.
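A much-simplified illustration of rule 1 only; the real function additionally handles abbreviations, initials, quote and parenthesis balancing, and the length bounds:

    import re

    def naive_segmenter(paragr):
        """Split after . ? or ! when followed by whitespace and a capital."""
        pieces = re.split(r'(?<=[.?!])\s+(?=[A-Z])', paragr.strip())
        return [p for p in pieces if p]

    print(naive_segmenter('Talks began Monday. They ended Tuesday! No deal.'))
    # -> ['Talks began Monday.', 'They ended Tuesday!', 'No deal.']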