This notebook contains Scala scripts that analyze the authorities issuing coins during the Roman Empire.
You can run the following blocks of Scala code in a Jupyter notebook.
In this notebook, you’ll download a data set derived from the openly licensed content of the Online Coins of the Roman Empire (OCRE). The original data set is available from http://nomisma.org/ in RDF XML format. We’ll work with a version formatted as a delimited-text file, using # as the column delimiter, with a header line labelling each column.
As with any data set, our first task is to figure out what kinds of data it contains, and what the range of values is for each category of data. We’ll examine the contents of several columns of data.
This file alternates between plain text and blocks of code. To ensure all lines run, please execute each block of code as you go by clicking the box of code and pressing control and enter. Alternatively, you can go to Cell -> Run All to execute the entire page now.
We’ll make the standard Scala Source object available by importing it, then use it to retrieve the content of a URL.
import scala.io.Source
val ocreCex = "https://raw.githubusercontent.com/michaeldahlquist/clas299/master/coins-of-the-roman-empire/ocre-cite-ids.cex"
We’ll extract a sequence of lines from the URL source, and convert them to our favorite type of Scala collection, a Vector.
(The following cell downloads the data: depending on your internet connection, this might take a moment.)
val lines = Source.fromURL(ocreCex).getLines().toVector
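The one-liner above never closes the Source it opens. That is usually harmless in a short notebook session, but if you would rather release the connection explicitly, a sketch along the following lines could work. It assumes Scala 2.13 or later (for scala.util.Using) and uses an in-memory source with made-up content as a stand-in for Source.fromURL(ocreCex), so it runs without a network connection.

```scala
import scala.io.Source
import scala.util.Using

// Stand-in for Source.fromURL(ocreCex): made-up content held in memory.
// Using.resource closes the Source once the block has finished.
val sampleLines: Vector[String] =
  Using.resource(Source.fromString("ID#Label\n1#first\n2#second")) { src =>
    src.getLines().toVector
  }

println(sampleLines.size)
```

Converting to a Vector inside the block matters here: getLines() is lazy, so the lines have to be read before the source is closed.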
To start with, let’s see what the first line looks like, and compare it with the first data line.
lines.head // same as lines(0)
lines(1)
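To keep track of which index belongs to which column, it can help to split the header line and pair each column name with its position. The sketch below uses a made-up header line (the column names are hypothetical, not copied from the OCRE file); in the notebook you would apply the same calls to lines.head.

```scala
// Hypothetical header line, for illustration only: the real column names
// are whatever lines.head contains in the notebook.
val sampleHeader = "ID#Label#Denomination#Metal#Authority"

// Split on the delimiter and pair each column name with its 0-based index.
val columnIndex = sampleHeader.split("#").toVector.zipWithIndex

columnIndex.foreach { case (name, i) => println(s"$i: $name") }
```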
Every line is a String. If we break it up using the split method, we get an Array of Strings, which we’ll convert to a Vector of Strings. The end result will be that from a Vector of Strings, we create a Vector of Vectors of Strings. Notice that Scala identifies the class of the new data expression as Vector[Vector[String]].
val data = lines.tail.map(ln => ln.split("#").toVector)
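One detail of split worth knowing: with a single argument it drops trailing empty strings, so a record whose last columns are empty yields a shorter Vector than the header. Whether the OCRE file contains such records is not something this notebook checks, so the example below uses a made-up line purely to show the behavior and the negative-limit variant that keeps empty columns.

```scala
// Made-up record whose last two columns are empty.
val sampleLine = "id1#label1##"

// The default split drops trailing empty strings...
val dropped = sampleLine.split("#").toVector
// ...while a negative limit keeps every column, empty or not.
val kept = sampleLine.split("#", -1).toVector

println(dropped.size) // 2
println(kept.size)    // 4
```

If indexing a column such as columns(4) ever fails with an out-of-bounds error, short rows like this are one possible cause.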
Mapping each Vector to the first item in the Vector is equivalent to extracting the first column from each Vector. The header line told us that the first column should contain ID values.
val ids = data.map(columns => columns(0))
We want to be sure that all ID values are unique. We can verify that by comparing the number of items in the ids Vector with the number of distinct values in the ids Vector. If they’re the same, then every value is unique.
//println("Records: " + ids.size)
//println("Distinct IDs: " + ids.distinct.size)
if (ids.size == ids.distinct.size) {
  println("All records uniquely identified.")
} else {
  println("Duplicate identifiers in data set.")
}
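If the check reports duplicates, the next question is which identifiers are repeated. One possible way to find out is to group the values and keep only the groups with more than one member. The identifiers below are made up for illustration; in the notebook the same calls would be applied to ids.

```scala
// Made-up identifiers; "b" appears twice.
val sampleIds = Vector("a", "b", "c", "b")

// Group identical values, then keep the keys of groups with more than one member.
val duplicatedIds = sampleIds
  .groupBy(id => id)
  .filter { case (_, occurrences) => occurrences.size > 1 }
  .keys
  .toVector

println(duplicatedIds)
```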
Let’s look at how coin denominations are described. You can see from the header line that denominations are in the third column, so we’ll map each Vector to the third column. Remember that we start indexing with 0, so the third column is indexed as (2).
val denominations = data.map(columns => columns(2))
We’ll use a very handy Scala idiom to count how many times each denomination appears. If we group the elements in our Vector by their value, the result is a Map from each unique value to a Vector of the matching values.
val denominationsGrouped = denominations.groupBy(denom => denom)
// Free puzzle: notice that the result of this groupBy should be the same size
// as the number of distinct values in our list:
if (denominationsGrouped.size == denominations.distinct.size) {
  println("Number of groups is same as number of distinct values.")
} else {
  print("Something is terribly wrong. The number of groups ")
  println("is not the same as the number of distinct values.")
}
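To make the shape of a groupBy result concrete, here is the same idiom applied to a handful of made-up values standing in for the denominations column.

```scala
// Made-up values, for illustration only.
val sampleDenoms = Vector("denarius", "as", "denarius", "aureus", "as", "denarius")

// Each distinct value becomes a key; its value is the Vector of all matching elements.
val sampleDenomsGrouped = sampleDenoms.groupBy(denom => denom)

println(sampleDenomsGrouped("denarius")) // Vector(denarius, denarius, denarius)
println(sampleDenomsGrouped.size)        // 3
```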
What we really want to know is how many times each denomination appears. We can find that out by transforming our mapping of String->Vector[String] to give us a mapping of each denomination to the size of the Vector of its occurrences.
val denominationsCounts = denominationsGrouped.map{ case (d, v) => (d, v.size) }
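The group-then-count pattern can also be written in a single step. The sketch below, which assumes Scala 2.13 or later (where groupMapReduce was added to the collections library), compares both versions on made-up values.

```scala
// Made-up values, for illustration only.
val sampleValues = Vector("denarius", "as", "denarius", "aureus", "as", "denarius")

// Two-step version, as in the notebook: group, then take the size of each group.
val twoStep = sampleValues.groupBy(d => d).map { case (d, v) => (d, v.size) }

// One-step version: key by the value itself, turn each element into 1, and sum.
val oneStep = sampleValues.groupMapReduce(d => d)(_ => 1)(_ + _)

println(twoStep == oneStep) // true
```

The two-step version is easier to read when you are learning the idiom; the one-step version avoids building the intermediate Vectors.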
Recall that Maps are not ordered in Scala. If we now convert the Map to a Vector, we will have a Vector pairing a String with an Int. We can sort the Vector by the second element of the pairing (which will sort from smallest to largest), then reverse the results to have a descending list of how often each denomination occurs.
val denominationsVector = denominationsCounts.toVector
val denominationsHisto =
  denominationsCounts.toVector.sortBy(frequency => frequency._2).reverse
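Sorting and then reversing is only one way to get a descending list. The sketch below shows two alternatives on a small made-up Map of counts: negating the sort key, and passing a reversed Ordering.

```scala
// Made-up counts, for illustration only.
val sampleCounts = Map("denarius" -> 3, "as" -> 2, "aureus" -> 1)

// Sort ascending, then reverse (the approach used in the notebook).
val viaReverse = sampleCounts.toVector.sortBy(pair => pair._2).reverse
// Sort by the negated count.
val viaNegation = sampleCounts.toVector.sortBy(pair => -pair._2)
// Sort with an explicitly reversed Ordering.
val viaOrdering = sampleCounts.toVector.sortBy(pair => pair._2)(Ordering[Int].reverse)

println(viaReverse) // Vector((denarius,3), (as,2), (aureus,1))
```

All three agree when the counts are distinct. When several entries share a count, their relative order can differ between the reverse version and the other two.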
Now we can easily see the extremes of the counts:
println("Most frequent denomination: " + denominationsHisto.head)
// Find denominations occurring fewer than some threshold number of times
val cutOff = 10
val leastDenominations =
  denominationsHisto.filter(frequency => frequency._2 < cutOff)
println("Least frequent denominations: \n" + leastDenominations.mkString("\n"))
Analyze how many issues are produced by each issuing authority to answer the following questions:
// First, extract the "Authority" column from the data set:
val authorities = data.map(columns => columns(4))
// Use the distinct method and size method to count
// how many distinct values you have in `authorities`
authorities.distinct.size
// Use the groupBy method to group each authority by the authority value.
// This will give you a Map of Strings to a Vector of Strings.
val authoritiesGrouped = authorities.groupBy(authority => authority)
// Now convert each pairing of String->Vector[String] to a String->Int counting
// how many elements are in the original Vector.
// The result is a Map[String, Int].
val authoritiesCounts = authoritiesGrouped.map{ case (auth,v) => (auth, v.size)}
// Next convert your Map[String, Int] to a Vector. The result is a
// Vector of pairings of (String, Int).
// We'll sort this by the second element of the pairing, namely the Int.
// Since we sort from smallest to largest
// by default, you can reverse the result so that the
// most frequent authority comes first.
val authoritiesHistogram = authoritiesCounts.toVector.sortBy(auth => auth._2).reverse
// With the authoritiesHistogram you created, you can use the `head` and
// `last` methods to see the first and last entries in the Vector.
authoritiesHistogram.head
authoritiesHistogram.last
authoritiesHistogram.filter { freq => freq._2 == 1 } // all authorities that appear exactly once
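The denomination and authority analyses follow the same four steps: group, count, convert to a Vector, and sort in descending order. Those steps could be bundled into one reusable function. The helper below is a hypothetical sketch, not part of the original notebook, and the sample values are made up.

```scala
// Hypothetical helper: builds a descending frequency list for any Vector of Strings.
def histogram(values: Vector[String]): Vector[(String, Int)] =
  values
    .groupBy(v => v)                                         // value -> all its occurrences
    .map { case (v, occurrences) => (v, occurrences.size) }  // value -> count
    .toVector
    .sortBy(pair => pair._2)                                 // ascending by count
    .reverse                                                 // most frequent first

// Made-up sample values, for illustration only.
val sampleAuthorities = Vector("Augustus", "Nero", "Augustus", "Trajan", "Augustus", "Nero")
val sampleHisto = histogram(sampleAuthorities)

println(sampleHisto.head) // (Augustus,3)
println(sampleHisto.last) // (Trajan,1)
```

With such a helper in place, histogram(denominations) and histogram(authorities) would each replace several cells, at the cost of hiding the intermediate values that the notebook inspects step by step.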