Leveraging Machine Learning and Open Data for Smart Cities

By DataScienceSalon

Based on a presentation by Priscilla Boyd – Senior Manager, Data Analytics at Siemens Mobility. Watch the full presentation here.

Priscilla Boyd at DSS Austin

Widely considered one of the most interesting talks at #DSSATX (overall event recap here), Priscilla Boyd’s real-world examples of data science solving transportation problems cannot be missed. Read on for the details.

Transportation impacts all of us. You have to get to work, you have to get to school; transportation is something all of us have to deal with. And congestion is not getting any better: as populations and urban areas grow, it gets worse.


Back in the 1800s, when motor vehicles became more popular and prominent in London, you would find scenes where you'd have vehicles, cyclists, pedestrians, all together. It was pretty dangerous! So at some point—if you've been to the UK you know they really love a line, they see a line, they join the line—someone thought to create a mechanism that makes lines official! So they created the traffic light. It was manual.

Then, a hundred years later came adaptive traffic control: using statistics and mathematics to actually control traffic so that you don't have to have a person switching lights from red to amber to green manually. This essentially looks very similar to machine learning: you have a model, you have the data, and you have to be able to tell where the vehicles are at certain points in time.

The data comes from the roadside. Chances are, you’ve seen a camera at an intersection and maybe you slow down because you think it is a speed camera. It's not, it's a traffic detector. They've been there for quite some time, collecting the data that drives the traffic control mechanism.

Traffic control is one very small drop in the ocean of city transportation. If you think of a smart city, it also has to handle incidents, congestion, rush-hour, events, construction... there's a ton of stuff to manage, and there are lots of systems and data to manage.


More and more, cities are seeing that there is value in data. Some of them are already doing some work in publishing some of their data and creating data feeds to tap into and use for different purposes. Some cities understand the value, but have no idea how to use it. How can they create new things out of that data that will allow them to really solve problems? Budgets very often are shrinking, so they can't necessarily invest in new systems; more infrastructure is always a challenge.

Siemens created a group called the ITS Digital Lab and our focus is really to try and break the ice between the world of AI and machine learning and transportation. We use design thinking and a human-centered design approach to get to those solutions. The truth is all of this data is very interesting and useful but if you don't have a use case for it—if you don't understand how it's going to be applicable to the city, there is very little value in doing it.


Design thinking is a co-creation process where we work with our customers to understand their problems. Worldwide, the vast majority of trips are less than five miles. Commuting less than five miles lends itself to micro-mobility: a complementary mode of city transport that can help with the first and last mile. Bikes, shared bikes, scooters, mopeds are very helpful for getting people out of the single occupancy vehicles.

It's also a market that is growing: micro-mobility is estimated to be a 300 to 500 billion dollar market by 2030.

The city of Lisbon, Portugal engaged Siemens to solve a problem with their very popular bike sharing program that has been in operation for more than eight years. They have more than 1,400 bikes, a mix of traditional and electric, and they have primarily dock-based bikes. Their problem is quite straightforward: every so often, people get to a station where there is no bike.

There is nothing the person can do, so they find an alternate mode of transportation and then they are nervous about future trips. So the city of Lisbon asked if we can predict what demand would look like, so there is always a bike in the station for the people.

Siemens actually did that, using four different data types:

  • Station data: when the actual bikes in the station are occupied and when they are free
  • Trip data: where people are cycling from and to
  • GPS data: real-time where people are at that point in time (using the app and on the bike)
  • Weather data: acquired externally, is it a rainy day or is it a sunny day in Lisbon?

Combining all of these different data points to create a regression-based algorithm allowed the city to predict the demand for each station in order to rebalance the network.
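As a rough illustration of what a regression-based demand forecast can look like, the sketch below fits an ordinary least-squares model for a single station. The two features (a rush-hour flag and a rain flag), the sample numbers, and the function names are all assumptions made for this example; the talk does not describe the actual features or model that Siemens used.

```python
# Hypothetical sketch: ordinary least squares for bike demand at one station.
# Features, data, and names are invented for illustration only.

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(rows, targets):
    """Fit weights that minimise squared error, using the normal equations."""
    xs = [[1.0] + list(r) for r in rows]  # prepend an intercept column
    k = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(xs, targets)) for i in range(k)]
    return solve(xtx, xty)


def predict(weights, row):
    """Predicted rentals for one feature row (intercept is weights[0])."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], row))


# Each row is (is_rush_hour, is_raining); each target is the number of
# bikes rented from the station in the following half hour.
history = [(0, 0), (1, 0), (0, 1), (1, 1), (1, 0), (0, 0)]
rentals = [4, 12, 1, 6, 13, 5]

weights = fit_ols(history, rentals)
dry_rush = predict(weights, (1, 0))  # rush hour, no rain
wet_rush = predict(weights, (1, 1))  # rush hour, raining
```

With this toy data the fitted model predicts lower demand for a rainy rush hour than for a dry one, which is the kind of signal the weather feed is meant to contribute. A production system would use many more features and observations per station.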

Lisbon now uses this system every day. Their tool allows them to view the occupancy of the stations and predictions of demand. But the second phase of the project is about routing information.

They have vans that essentially go station to station collecting the bikes or dropping off bikes, and the routing algorithm essentially directs those vans: in this location, pick up two bikes, at this other location, drop four bikes because we know that the third location already has the predicted amount for the next half hour, so bikes are needed at the second location.
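The pick-up and drop-off decisions described above can be sketched as a simple greedy matching between stations with more bikes than predicted demand and stations with fewer. This is only an assumed simplification: the station names, counts, and the `rebalancing_plan` function are hypothetical, and the sketch ignores van capacity, travel distance, and visiting order, all of which a real routing algorithm would need to handle.

```python
# Hypothetical sketch: decide how many bikes to move between stations,
# given current stock and predicted demand for the next half hour.

def rebalancing_plan(current, predicted):
    """Return (moves, unmet).

    moves is a list of (from_station, to_station, bikes);
    unmet maps stations to demand that could not be covered.
    """
    surplus = {s: current[s] - predicted[s]
               for s in current if current[s] > predicted[s]}
    deficit = {s: predicted[s] - current[s]
               for s in current if predicted[s] > current[s]}
    moves = []
    # Serve the neediest station first, taking from the largest surplus first.
    for needy in sorted(deficit, key=deficit.get, reverse=True):
        for donor in sorted(surplus, key=surplus.get, reverse=True):
            if deficit[needy] == 0:
                break
            bikes = min(deficit[needy], surplus[donor])
            if bikes > 0:
                moves.append((donor, needy, bikes))
                surplus[donor] -= bikes
                deficit[needy] -= bikes
    unmet = {s: d for s, d in deficit.items() if d > 0}
    return moves, unmet


# Invented example: station C already holds its predicted amount,
# so the van never needs to stop there.
current = {"A": 6, "B": 1, "C": 5, "D": 8}
predicted = {"A": 4, "B": 5, "C": 5, "D": 5}
moves, unmet = rebalancing_plan(current, predicted)
```

Stations whose stock already matches the prediction never appear in the plan, which is where the savings in unnecessary van trips come from.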

This is a pretty straightforward business case: you avoid having the van driving around unnecessarily replacing bikes that may or may not be used. You're increasing the availability of the bikes which means that you're increasing revenue, increasing the number of trips. It makes it very easy to justify having a system like this in place and using data that they already had plus some weather data acquired externally.

At the end of the day, Lisbon has seen an increase in bike sharing since the first deployment of the system in 2018. They are seeing more people using the bikes; they are seeing people coming back and trusting the system more because they know that when they get to a station, there will be a bike waiting for them.


The challenge for public transportation agencies is that they actually need to make money, because they are not fully funded by government. So if more people are ridesharing or taking scooters or bikes, they have to adjust. They may have to scrap or reduce service, but that can affect people who can't afford a car or rideshares, and suddenly there are political issues to consider because of the partial public funding.

There is a Transportation Research Board study done last year by the University of Kentucky that says that between 2015 and 2018, ridesharing increased from 60,000 to 600,000, almost the same amount by which daily transit boardings decreased. So ridesharing has a very clear impact on those transit operators.

In this case, it’s helpful to identify when public transport makes the most sense: perhaps after events, when surge pricing is in effect, there aren’t enough drivers, and it’s chaotic to be using ridesharing. A service that predicts unofficial city events before they happen allows cities to plan ahead to promote a bus service, have a shuttle, special parking, etc.

We combine different data sets to find unofficial events that the city doesn't know about, and to predict where those popular events are. If the city isn’t prepared for those spikes in traffic, they call in the police to help and suddenly it's like going back to the 1800s where we had to have people saying when to go and where to stop. We're trying to avoid that; to automate that.

In terms of steps, what happens is:

  • Start by automatically identifying all of the unofficial events: scrape the web for a lot of different data sets
  • Predict the popularity, which predicts the demand for public transport
  • Merge that with transport options available.
    • According to the Federal Highway Administration, most of us are willing to walk 500 meters to get to a bus stop, so what bus stops are available within 500 meters of a popular event?
  • Predict demand for future events
  • Siemens built a system that gives an API for the popular events for a given city, but also plots them in a simple visualization tool.
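The "merge with transport options" step above can be sketched as a distance filter: for a given event, keep only the bus stops within the 500 meter walking threshold. The sketch below uses the haversine great-circle formula; the coordinates, stop names, and function names are invented for illustration, and a real system would likely use walking distance along streets rather than straight-line distance.

```python
# Hypothetical sketch: find bus stops within walking distance of an event.
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def stops_within(event, stops, max_m=500):
    """Names of stops no farther than max_m meters from the event."""
    return [name for name, (lat, lon) in stops.items()
            if haversine_m(event[0], event[1], lat, lon) <= max_m]


# Invented coordinates, roughly in downtown Austin.
event = (30.2672, -97.7431)
stops = {
    "stop_near": (30.2690, -97.7431),  # about 200 m north of the event
    "stop_far": (30.2800, -97.7431),   # about 1.4 km north of the event
}
nearby = stops_within(event, stops)
```

Only `stop_near` passes the filter here. If no stop qualifies for an event predicted to be popular, that is a hint the city may want to plan a shuttle or a temporary service.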

Connecting the dots: in Austin, there’s a service called Pickup. If you live in an area not well served by buses, Capital Metro makes available a shuttle that you connect with via an app, similar to a ridesharing app but shared like a shuttle minibus, and you pay the bus fare. You're not paying $10 or $15 for your ride, you’re paying two or three dollars. A service like this can cater to everyone's needs without a fixed route. But one of the challenges that they have is making sure that the Pickup service is actually deployed and used for that area.


  • Acquisition: sources vary regionally or nationally; Eventbrite or Ticketmaster are good sources but cities will have their own very specific and local sources—there is no single source available.
    • Those data sources have to go through the ETL process to make sure that they can fit into our model later on, so you have the new data sources for events coming in.
    • That data is stored in a database in the cloud, and a web application plugs into a prediction service that exposes a REST API for querying the machine learning model, which can show you whether an event is popular or not.
    • The model can also be used for all the other transport agency systems so that they can make use of popular event querying.
  • Feature engineering: because we are handling different data sources, there is a multitude of features that we can use.
    • Maybe the event is ticketed, but you don't have the attendance or the venue capacity.
    • Maybe you know whether the event is free or not.
    • Maybe you have the end time, maybe you don't.
  • Definitions: how do you define a popular event?
    • What we found in the end was that we needed to treat this as a classification problem because we can't necessarily tell whether an event in the past was popular or not.
    • We ultimately classified events into low popularity, medium popularity, and high popularity.
  • Validation: because we don't have information on how many people attended an event, we need to validate in conjunction with other systems like traffic. If we see an unpredicted spike in traffic congestion, we can more or less infer that something special is happening. Validation in any unsupervised learning problem is tricky.
  • Human-based feedback: to improve the model and integrate some of those external systems. For instance, rain massively impacts the attendance of some events even if they are indoors. This kind of data set would improve the accuracy of the model.
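One way to picture the traffic-based validation mentioned in the list above is a simple spike check: compare the traffic observed near a venue with what is normal for the same time slot, and flag large deviations. The sketch below uses a z-score for this. The vehicle counts, the threshold of 3 standard deviations, and the function name are assumptions for illustration; the talk does not say how the comparison was actually done.

```python
# Hypothetical sketch: flag an unexpected traffic spike near an event venue.
from statistics import mean, stdev


def is_unexpected_spike(baseline, observed, threshold=3.0):
    """True if observed exceeds the baseline mean by more than
    `threshold` sample standard deviations."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # No variation in the baseline: any increase counts as a spike.
        return observed > mu
    return (observed - mu) / sigma > threshold


# Invented vehicle counts for the same half-hour slot on previous weeks.
baseline = [410, 395, 420, 405, 400, 415]

event_night = is_unexpected_spike(baseline, 620)   # far above normal
normal_night = is_unexpected_spike(baseline, 418)  # within normal range
```

A spike that coincides with an event the model labeled as popular supports the prediction, while a spike with no predicted event suggests something was missed. Neither is a substitute for real attendance figures, which is why validation stays tricky.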



For more of Priscilla’s and Siemens’ work on transportation, click here. Interested in Smart Cities? Stay tuned to our @FormulatedBy Twitter handle, as we will be announcing two virtual conferences soon!
